Advanced Scraping Guide
Learn how to improve your Firecrawl scraping with advanced options.
This guide will walk you through the different endpoints of Firecrawl and how to use them to their full potential with all of their parameters.
Basic scraping with Firecrawl (/scrape)
To scrape a single page and get clean markdown content, you can use the `/scrape` endpoint.
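As a quick illustration, here is a minimal sketch of such a request using Python's `requests` library; the v0 REST endpoint URL, the `YOUR_API_KEY` placeholder, and the response shape are assumptions for illustration, not a definitive implementation:

```python
import requests

# Minimal sketch: scrape a single page and print its markdown content.
# The endpoint URL, API key placeholder, and response shape are assumed for illustration.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={"url": "https://docs.firecrawl.dev"},
)
response.raise_for_status()
print(response.json()["data"]["markdown"])  # clean markdown content of the page
```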
Scraping PDFs
Firecrawl supports scraping PDFs by default. You can use the `/scrape` endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting `pageOptions.parsePDF` to `false`.
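A sketch of disabling PDF parsing for a single request, under the same assumptions as the example above (the PDF URL below is hypothetical):

```python
import requests

# Sketch: scrape a PDF link but disable PDF parsing via pageOptions.parsePDF.
# The PDF URL below is hypothetical.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/whitepaper.pdf",
        "pageOptions": {"parsePDF": False},
    },
)
print(response.json())
```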
Page Options
When using the `/scrape` endpoint, you can customize the scraping behavior with the `pageOptions` parameter. Here are the available options:
Getting cleaner content with `onlyMainContent`
- Type: `boolean`
- Description: Only return the main content of the page, excluding headers, navigation bars, footers, etc.
- Default: `false`
Getting the HTML with `includeHtml`
- Type: `boolean`
- Description: Include the HTML version of the page content. This will add an `html` key in the response.
- Default: `false`
Getting the raw HTML with `includeRawHtml`
- Type: `boolean`
- Description: Include the raw HTML content of the page. This will add a `rawHtml` key in the response.
- Default: `false`
Getting a screenshot of the page with `screenshot`
- Type: `boolean`
- Description: Include a screenshot of the top of the page that you are scraping.
- Default: `false`
Waiting for the page to load with `waitFor`
- Type: `integer`
- Description: To be used only as a last resort. Wait for a specified number of milliseconds for the page to load before fetching content.
- Default: `0`
Example Usage
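A sketch of the request described below, under the same assumed endpoint and key placeholder as the earlier examples:

```python
import requests

# Sketch: scrape with pageOptions -- main content only, HTML included,
# and a 5-second wait before the content is fetched.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://docs.firecrawl.dev",
        "pageOptions": {
            "onlyMainContent": True,
            "includeHtml": True,
            "waitFor": 5000,
        },
    },
)
data = response.json()["data"]
print(data["markdown"])  # main content as markdown
print(data["html"])      # HTML version of the page
```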
In this example, the scraper will:
- Return only the main content of the page.
- Include the HTML content of the page in the response, in the `html` key.
- Wait for 5000 milliseconds (5 seconds) for the page to load before fetching the content.
Here is the API Reference for it: Scrape Endpoint Documentation
Extractor Options
When using the `/scrape` endpoint, you can specify options for extracting structured information from the page content using the `extractorOptions` parameter. Here are the available options:
`mode`
- Type: `string`
- Enum: `["llm-extraction", "llm-extraction-from-raw-html"]`
- Description: The extraction mode to use.
  - `llm-extraction`: Extracts information from the cleaned and parsed content.
  - `llm-extraction-from-raw-html`: Extracts information directly from the raw HTML.
`extractionPrompt`
- Type: `string`
- Description: A prompt describing what information to extract from the page.
`extractionSchema`
- Type: `object`
- Description: The schema for the data to be extracted. This defines the structure of the extracted data.
Example Usage
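A sketch of an LLM extraction request; the prompt, the schema fields, and the `llm_extraction` response key are illustrative assumptions:

```python
import requests

# Sketch: extract structured data using llm-extraction mode.
# The prompt and schema below are purely illustrative.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://firecrawl.dev",
        "extractorOptions": {
            "mode": "llm-extraction",
            "extractionPrompt": "Extract the company mission and whether it is open source.",
            "extractionSchema": {
                "type": "object",
                "properties": {
                    "company_mission": {"type": "string"},
                    "is_open_source": {"type": "boolean"},
                },
                "required": ["company_mission"],
            },
        },
    },
)
# The response key holding the extracted object is assumed here.
print(response.json()["data"].get("llm_extraction"))
```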
Adjusting Timeout
You can adjust the timeout for the scraping process using the `timeout` parameter, specified in milliseconds.
Example Usage
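A sketch of setting a 30-second timeout, under the same assumptions as the earlier examples:

```python
import requests

# Sketch: abort the scrape if it takes longer than 30 seconds (30000 ms).
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://docs.firecrawl.dev", "timeout": 30000},
)
print(response.status_code)
print(response.json())
```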
Crawling Multiple Pages
To crawl multiple pages, you can use the `/crawl` endpoint. This endpoint allows you to specify a base URL you want to crawl; all accessible subpages will then be crawled.
This returns a `jobId`.
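A sketch of starting a crawl and capturing the returned `jobId` (same assumed endpoint base and key placeholder as above):

```python
import requests

# Sketch: start a crawl of a base URL; the response contains a jobId for polling.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://docs.firecrawl.dev"},
)
job_id = response.json()["jobId"]
print(job_id)
```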
Check Crawl Job
Used to check the status of a crawl job and get its result.
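A sketch of polling a crawl job until it finishes; the status URL path and the response fields (`status`, `data`) are assumptions for illustration:

```python
import time
import requests

# Sketch: poll the crawl job until it completes, then print part of each document.
# The status URL path and response fields are assumptions for illustration.
job_id = "YOUR_JOB_ID"
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v0/crawl/status/{job_id}",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    ).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)

for document in status.get("data", []):
    print(document.get("markdown", "")[:200])  # first 200 characters of each page
```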
Crawler Options
When using the `/crawl` endpoint, you can customize the crawling behavior with the `crawlerOptions` parameter. Here are the available options:
`includes`
- Type: `array`
- Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
- Example: `["/blog/*", "/products/*"]`
`excludes`
- Type: `array`
- Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
- Example: `["/admin/*", "/login/*"]`
`returnOnlyUrls`
- Type: `boolean`
- Description: If set to `true`, the response will only include a list of URLs instead of the full document data.
- Default: `false`
`maxDepth`
- Type: `integer`
- Description: Maximum depth to crawl relative to the entered URL. A `maxDepth` of 0 scrapes only the entered URL. A `maxDepth` of 1 scrapes the entered URL and all pages one level deep. A `maxDepth` of 2 scrapes up to two levels deep. Higher values follow the same pattern.
- Example: `2`
`mode`
- Type: `string`
- Enum: `["default", "fast"]`
- Description: The crawling mode to use. `fast` mode crawls websites without a sitemap 4x faster, but may be less accurate and is not recommended for heavily JavaScript-rendered websites.
- Default: `default`
`limit`
- Type: `integer`
- Description: Maximum number of pages to crawl.
- Default: `10000`
Example Usage
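A sketch of the crawl request described below (the base URL and key placeholder are illustrative):

```python
import requests

# Sketch: crawl with crawlerOptions matching the description that follows.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "crawlerOptions": {
            "includes": ["/blog/*", "/products/*"],
            "excludes": ["/admin/*", "/login/*"],
            "returnOnlyUrls": False,
            "maxDepth": 2,
            "mode": "fast",
            "limit": 1000,
        },
    },
)
print(response.json()["jobId"])
```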
In this example, the crawler will:
- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Skip URLs that match the patterns `/admin/*` and `/login/*`.
- Return the full document data for each page.
- Crawl up to a maximum depth of 2.
- Use the fast crawling mode.
- Crawl a maximum of 1000 pages.
Page Options + Crawler Options
You can combine the `pageOptions` and `crawlerOptions` parameters to customize both the crawling behavior and how each page is scraped.
Example Usage
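A sketch of a crawl request combining both parameter groups (base URL and key placeholder are illustrative):

```python
import requests

# Sketch: combine pageOptions and crawlerOptions in a single crawl request.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "pageOptions": {
            "onlyMainContent": True,
            "includeRawHtml": True,
            "waitFor": 5000,
        },
        "crawlerOptions": {
            "includes": ["/blog/*", "/products/*"],
            "maxDepth": 2,
            "mode": "fast",
        },
    },
)
print(response.json()["jobId"])
```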
In this example, the crawler will:
- Return only the main content for each page.
- Include the raw HTML content for each page.
- Wait for 5000 milliseconds for each page to load before fetching its content.
- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Crawl up to a maximum depth of 2.
- Use the fast crawling mode.
Extractor Options + Crawler Options
Coming soon…