Firecrawl supports scraping PDFs by default. You can use the /scrape endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting parsePDF to false.

parsePDF
Type: boolean
Description: Controls how PDF files are processed during scraping. When true, the PDF content is extracted and converted to markdown format, with billing based on the number of pages (1 credit per page). When false, the PDF file is returned in base64 encoding with a flat rate of 1 credit total.

Getting the full page content as markdown with onlyMainContent
Type: boolean
Description: By default, the scraper will only return the main content of the page, excluding headers, navigation bars, footers, etc. Set this to false to return the full page content.
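For example, a minimal request sketch in Python that turns off main-content filtering, assuming the v1 REST endpoint at https://api.firecrawl.dev/v1/scrape and a response that nests results under data:

```python
import requests

API_KEY = "fc-YOUR_API_KEY"  # placeholder key
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Ask for the full page rather than just the main content.
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers=HEADERS,
    json={
        "url": "https://example.com",
        "formats": ["markdown"],
        "onlyMainContent": False,
    },
)
print(resp.json()["data"]["markdown"])
```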
Include the markdown, HTML, raw HTML, links, and screenshot in the response.
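A sketch of the corresponding request body, posted to /v1/scrape as in the earlier example; the camelCase format names (markdown, html, rawHtml, links, screenshot) are assumed from the v1 API:

```python
# Request body asking for every output format at once.
payload = {
    "url": "https://example.com",
    "formats": ["markdown", "html", "rawHtml", "links", "screenshot"],
}
```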
The response will include only the HTML tags <h1>, <p>, <a>, and elements with the class .main-content, while excluding any elements with the IDs #ad and #footer.
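A sketch of how this might be expressed, assuming the scrape options are named includeTags and excludeTags (those names are not stated above):

```python
# Assumed option names: includeTags / excludeTags; the values mirror the
# tags, class, and IDs described above.
payload = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
}
```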
Wait for 1000 milliseconds (1 second) for the page to load before fetching the content.
Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
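A combined sketch of these two settings, assuming the parameters are named waitFor and timeout and are given in milliseconds:

```python
# Assumed parameter names: waitFor and timeout, both in milliseconds.
payload = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "waitFor": 1000,   # wait 1 second for the page to load before capturing it
    "timeout": 15000,  # give up on the scrape after 15 seconds
}
```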
Return PDF files in base64 format instead of converting them to markdown.
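A sketch of the request body with parsePDF set to false and a placeholder PDF URL:

```python
# With parsePDF disabled, the PDF is returned base64-encoded instead of as markdown.
payload = {
    "url": "https://example.com/report.pdf",  # placeholder PDF URL
    "parsePDF": False,
}
```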
When using the /scrape endpoint, you can specify options for extracting structured information from the page content using the extract parameter.
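As a sketch, assuming extract accepts a JSON schema plus an optional prompt and that the "extract" format must be requested alongside it:

```python
# Assumed shape: request the "extract" format and pass a JSON schema plus an
# optional natural-language prompt under "extract".
payload = {
    "url": "https://example.com/pricing",
    "formats": ["extract"],
    "extract": {
        "schema": {
            "type": "object",
            "properties": {
                "plan_name": {"type": "string"},
                "monthly_price": {"type": "number"},
            },
            "required": ["plan_name"],
        },
        "prompt": "Extract the name and monthly price of the cheapest plan.",
    },
}
```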
When using the /scrape endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Description: Scrapes the current page content and returns the URL and the HTML. The scraped content will be returned in the actions.scrapes array of the response.
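A sketch of an actions sequence ending in a scrape action; the wait and click action types shown here are assumptions used for illustration:

```python
# Only the "scrape" action is described above; the "wait" and "click" action
# types and their fields are assumed for illustration.
payload = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
        {"type": "wait", "milliseconds": 2000},       # let dynamic content load
        {"type": "click", "selector": "#load-more"},  # interact with the page
        {"type": "scrape"},                           # snapshot the page at this point
    ],
}
# The snapshot taken by the "scrape" action would then appear in the response,
# e.g. resp.json()["data"]["actions"]["scrapes"] as a list of {url, html} objects.
```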
To crawl multiple pages, you can use the /crawl endpoint. This endpoint lets you specify a base URL to crawl; all accessible subpages will then be crawled.
If the content is larger than 10MB or if the crawl job is still running, the response will include a next parameter, which is a URL you can request to fetch the next page of results.
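A sketch of starting a crawl job and paginating through its results with next, assuming the v1 flow of POST /v1/crawl followed by polling GET /v1/crawl/{id}:

```python
import requests
import time

API_KEY = "fc-YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Start a crawl job; the job id and the status URL layout are assumptions.
job = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers=HEADERS,
    json={"url": "https://example.com", "limit": 100},
).json()
status_url = f"https://api.firecrawl.dev/v1/crawl/{job['id']}"

# Poll until the job finishes (assumed terminal status: "completed").
while True:
    status = requests.get(status_url, headers=HEADERS).json()
    if status.get("status") == "completed":
        break
    time.sleep(2)

# Follow the next URL until all pages of results have been collected.
pages = list(status.get("data", []))
next_url = status.get("next")
while next_url:
    chunk = requests.get(next_url, headers=HEADERS).json()
    pages.extend(chunk.get("data", []))
    next_url = chunk.get("next")

print(f"Collected {len(pages)} crawled pages")
```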
Description: Regex patterns to include in the crawl. Only URLs matching these patterns will be crawled. For example, ^/blog/.* will match any URL that starts with /blog/.
Description: Regex patterns to exclude from the crawl. URLs matching these patterns will be skipped. For example, ^/admin/.* will exclude any URL that starts with /admin/.
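A sketch combining both pattern lists, assuming the crawler options are named includePaths and excludePaths:

```python
# Assumed option names: includePaths / excludePaths; the regex patterns come
# from the examples above.
payload = {
    "url": "https://example.com",
    "includePaths": ["^/blog/.*"],
    "excludePaths": ["^/admin/.*"],
    "limit": 50,
}
```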
Description: Maximum absolute depth to crawl from the base of the entered URL. For example, if the entered URL’s path is /features/feature-1, then no results would be returned unless maxDepth is at least 2.
Description: This option allows the crawler to follow links that point to external domains. Be careful with this option, as the crawl will then stop only when the limit and maxDepth values are reached.
Description: Allows the crawler to follow links to subdomains of the main domain. For example, if crawling example.com, this would allow following links to blog.example.com or api.example.com.
Description: Delay in seconds between scrapes. This helps respect website rate limits and prevent overwhelming the target website. If not provided, the crawler may use the robots.txt crawl delay if available.
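A sketch pulling these crawler settings together; apart from maxDepth, the option names (allowExternalLinks, allowSubdomains, delay) are assumptions:

```python
# maxDepth is named above; the other option names are assumed.
payload = {
    "url": "https://example.com/features/feature-1",
    "maxDepth": 2,                 # the path is already two segments deep
    "allowExternalLinks": False,   # stay on the original domain
    "allowSubdomains": True,       # but follow blog.example.com, api.example.com, ...
    "delay": 1,                    # one second between scrapes
}
```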
As part of the crawler options, you can also specify the scrapeOptions parameter. This parameter allows you to customize the scraping behavior for each page.
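A sketch of a crawl request with nested scrapeOptions, assuming it accepts the same fields as a /scrape request body:

```python
# scrapeOptions is assumed to mirror the fields of a /v1/scrape request body.
payload = {
    "url": "https://example.com",
    "limit": 100,
    "scrapeOptions": {
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
    },
}
```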
The /map endpoint is adept at identifying URLs that are contextually related to a given website. This feature is crucial for understanding a site’s contextual link environment, which can greatly aid in strategic site analysis and navigation planning.
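A sketch of a /map request, assuming a POST to /v1/map with an optional search parameter and a response carrying a flat list of URLs under links:

```python
import requests

API_KEY = "fc-YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# The /v1/map path, the search parameter, and the "links" response field are assumptions.
resp = requests.post(
    "https://api.firecrawl.dev/v1/map",
    headers=HEADERS,
    json={"url": "https://example.com", "search": "docs"},
)
print(resp.json().get("links", []))
```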