Learn how to improve your Firecrawl scraping with advanced options.
To scrape a single page and get clean content in the format you want, you can use the `/scrape` endpoint.
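For example, a minimal request might look like the following (a sketch; replace `YOUR_API_KEY` with your key and the URL with your target page):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com"
  }'
```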
Firecrawl supports scraping PDFs by default. You can use the `/scrape` endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting `parsePDF` to `false`.
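To receive the raw file instead of parsed text, the same request can set `parsePDF` to `false` (a sketch; the PDF URL is illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "parsePDF": false
  }'
```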
When using the `/scrape` endpoint, you can customize the scraping behavior with many parameters. Here are the available options:
**`formats`**
- Type: `array`
- Enum: `["markdown", "links", "html", "rawHtml", "screenshot", "json"]`
- Description: Content formats to include in the response.
  - `markdown`: Returns the scraped content in Markdown format.
  - `links`: Includes all hyperlinks found on the page.
  - `html`: Provides the content in HTML format.
  - `rawHtml`: Delivers the raw HTML content, without any processing.
  - `screenshot`: Includes a screenshot of the page as it appears in the browser.
  - `json`: Extracts structured information from the page using an LLM.
- Default: `["markdown"]`
**`onlyMainContent`**
- Type: `boolean`
- Description: By default, the scraper returns only the main content of the page, excluding headers, navigation bars, and footers. Set this to `false` to return the full page content.
- Default: `true`
**`includeTags`**
- Type: `array`
- Description: Tags, classes, and IDs to include in the response.

**`excludeTags`**
- Type: `array`
- Description: Tags, classes, and IDs to exclude from the response.
**`waitFor`**
- Type: `integer`
- Description: Wait the specified number of milliseconds for the page to load before fetching content. Use only as a last resort.
- Default: `0`
**`timeout`**
- Type: `integer`
- Description: Timeout in milliseconds for the request.
- Default: `30000` (30 seconds)

**`parsePDF`**
- Type: `boolean`
- Description: Controls how PDF files are processed during scraping. When `true`, the PDF content is extracted and converted to Markdown format, with billing based on the number of pages (1 credit per page). When `false`, the PDF file is returned in base64 encoding with a flat rate of 1 credit total.
- Default: `true`
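Putting several of these options together, a request might look like the following (a sketch; the URL and values are illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "links", "html", "rawHtml", "screenshot"],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
    "onlyMainContent": false,
    "waitFor": 1000,
    "timeout": 15000
  }'
```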
In this example, the scraper will return the full page content as markdown, links, HTML, raw HTML, and a screenshot; include only the HTML tags `<h1>`, `<p>`, `<a>`, and elements with the class `.main-content`, while excluding any elements with the IDs `#ad` and `#footer`; wait 1000 milliseconds (1 second) for the page to load; and cap the request duration at 15000 milliseconds (15 seconds).

When using the `/scrape` endpoint, you can also specify options for extracting structured information from the page content using the `extract` parameter. Here are the available options:
**`schema`**
- Type: `object`
- Description: The schema for the data to be extracted.

**`systemPrompt`**
- Type: `string`
- Description: The system prompt to use for the extraction.

**`prompt`**
- Type: `string`
- Description: The extraction prompt, for extracting without a schema.
- Example: `"Extract the features of the product"`
When using the `/scrape` endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction. The available actions are:
**`wait`**
- Type: `object`
- Description: Wait for a specified amount of milliseconds.
- Properties:
  - `type`: `"wait"`
  - `milliseconds`: Number of milliseconds to wait.

**`screenshot`**
- Type: `object`
- Description: Take a screenshot of the page.
- Properties:
  - `type`: `"screenshot"`
  - `fullPage`: Should the screenshot be full-page or viewport sized? (default: `false`)

**`click`**
- Type: `object`
- Description: Click on an element.
- Properties:
  - `type`: `"click"`
  - `selector`: Query selector to find the element by.

**`write`**
- Type: `object`
- Description: Write text into an input field.
- Properties:
  - `type`: `"write"`
  - `text`: Text to type.
  - `selector`: Query selector for the input field.

**`press`**
- Type: `object`
- Description: Press a key on the page.
- Properties:
  - `type`: `"press"`
  - `key`: Key to press.

**`scroll`**
- Type: `object`
- Description: Scroll the page up or down.
- Properties:
  - `type`: `"scroll"`
  - `direction`: Direction to scroll (`"up"` or `"down"`).
  - `amount`: Amount to scroll in pixels.

**`scrape`**
- Type: `object`
- Description: Scrape the current page content. The scraped pages are returned in the `actions.scrapes` array of the response.
- Properties:
  - `type`: `"scrape"`

**`pdf`**
- Type: `object`
- Description: Generate a PDF of the current page. The generated PDFs are returned in the `actions.pdfs` array of the response.
- Properties:
  - `type`: `"pdf"`
  - `format`: The page size of the resulting PDF (default: `"Letter"`).
  - `landscape`: Whether to generate the PDF in landscape orientation (default: `false`).
  - `scale`: The scale multiplier of the resulting PDF (default: `1`).

**`executeJavascript`**
- Type: `object`
- Description: Execute JavaScript code on the page. The return values are returned in the `actions.javascriptReturns` array of the response.
- Properties:
  - `type`: `"executeJavascript"`
  - `script`: JavaScript code to execute.
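Chained together, actions let you drive the page before the scrape runs. A sketch (the URL, selector, and timings are illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      { "type": "wait", "milliseconds": 2000 },
      { "type": "click", "selector": "#search-input" },
      { "type": "write", "text": "firecrawl" },
      { "type": "press", "key": "ENTER" },
      { "type": "wait", "milliseconds": 3000 },
      { "type": "screenshot" }
    ]
  }'
```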
To crawl multiple pages, you can use the `/crawl` endpoint. This endpoint allows you to specify a base URL you want to crawl, and all accessible subpages will be crawled.
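A sketch of a basic crawl request (the URL and limit are illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
```

Crawling is asynchronous: the response includes a job `id`, and you poll the crawl status endpoint (`GET /v1/crawl/{id}`) until the job completes and the results are returned.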
If the content is larger than 10MB or the crawl job is still running, the status response may include a `next` parameter. This parameter is a URL to the next page of results, which you can request to retrieve the remaining pages.
When using the `/crawl` endpoint, you can customize the crawling behavior with request body parameters. Here are the available options:
**`includePaths`**
- Type: `array`
- Description: Regex patterns for URL paths to include in the crawl; only matching URLs will be crawled. For example, `^/blog/.*` will match any URL that starts with `/blog/`.
- Example: `["^/blog/.*$", "^/docs/.*$"]`

**`excludePaths`**
- Type: `array`
- Description: Regex patterns for URL paths to exclude from the crawl. For example, `^/admin/.*` will exclude any URL that starts with `/admin/`.
- Example: `["^/admin/.*$", "^/private/.*$"]`
**`maxDepth`**
- Type: `integer`
- Description: Maximum depth to crawl, relative to the entered URL. For example, if the entered URL's path is `/features` and a subpage lives at `/features/feature-1`, then no results would be returned unless `maxDepth` is at least 2.
- Default: `2`
**`limit`**
- Type: `integer`
- Description: Maximum number of pages to crawl.
- Default: `10000`
**`allowBackwardLinks`**
- Type: `boolean`
- Description: Allows the crawler to navigate to pages that are not direct children of the entered URL, such as parent or sibling pages.
- Default: `false`
**`allowExternalLinks`**
- Type: `boolean`
- Description: Allows the crawler to follow links that point to external domains. The crawl still adheres to the `limit` and `maxDepth` values.
- Default: `false`
**`allowSubdomains`**
- Type: `boolean`
- Description: Allows the crawler to follow links to subdomains of the main domain. For example, when crawling `example.com`, this would allow following links to `blog.example.com` or `api.example.com`.
- Default: `false`
**`delay`**
- Type: `number`
- Description: Delay in seconds between scrapes, to help respect a site's rate limits.
- Default: `undefined`
As part of the crawler options, you can also specify the `scrapeOptions` parameter, which customizes the scraping behavior for each crawled page.

**`scrapeOptions`**
- Type: `object`
- Example: `{"formats": ["markdown", "links", "html", "rawHtml", "screenshot"], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}`
- Default: `{ "formats": ["markdown"] }`
In this example, the crawler will only crawl URLs that match `^/blog/.*$` and `^/docs/.*$`, while excluding any URLs that match `^/admin/.*$` and `^/private/.*$`.

The `/map` endpoint is adept at identifying URLs that are contextually related to a given website. This feature is crucial for understanding a site's contextual link environment, which can greatly aid in strategic site analysis and navigation planning.
To use the `/map` endpoint, you send a request with the URL of the website you want to map.
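Here is an example using `curl` (a sketch, assuming the v1 API shape where `/map` accepts a POST with a JSON body; replace `YOUR_API_KEY` with your key):

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com"
  }'
```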
The `/map` endpoint accepts the following options:

**`search`**
- Type: `string`
- Description: Filter for links containing specific text.
- Example: `"blog"`

**`limit`**
- Type: `integer`
- Description: Maximum number of links to return.
- Default: `100`

**`ignoreSitemap`**
- Type: `boolean`
- Description: Ignore the website sitemap when mapping.
- Default: `true`

**`includeSubdomains`**
- Type: `boolean`
- Description: Include subdomains of the website in the results.
- Default: `true`
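For instance, to surface only blog-related links (a sketch; the search term and limit are illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v1/map \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "search": "blog",
    "limit": 50
  }'
```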