Advanced Scraping Guide
Learn how to improve your Firecrawl scraping with advanced options.
This guide will walk you through the different endpoints of Firecrawl and how to use them to their full potential with all of their parameters.
Basic scraping with Firecrawl (/scrape)
To scrape a single page and get clean markdown content, you can use the `/scrape` endpoint.
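As a quick illustration, here is a minimal sketch of such a request using Python's `requests` library; the v0 REST endpoint URL, the `YOUR_API_KEY` placeholder, and the response shape are assumptions for illustration, not a definitive implementation:

```python
import requests

# Minimal sketch: scrape a single page and print its markdown content.
# The endpoint URL, API key placeholder, and response shape are assumed for illustration.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={"url": "https://docs.firecrawl.dev"},
)
response.raise_for_status()
print(response.json()["data"]["markdown"])  # clean markdown content of the page
```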
Scraping PDFs
Firecrawl supports scraping PDFs by default. You can use the `/scrape` endpoint to scrape a PDF link and get the text content of the PDF. You can disable this by setting `pageOptions.parsePDF` to `false`.
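A sketch of disabling PDF parsing for a single request, under the same assumptions as the example above (the PDF URL below is hypothetical):

```python
import requests

# Sketch: scrape a PDF link but disable PDF parsing via pageOptions.parsePDF.
# The PDF URL below is hypothetical.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com/whitepaper.pdf",
        "pageOptions": {"parsePDF": False},
    },
)
print(response.json())
```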
Page Options
When using the `/scrape` endpoint, you can customize the scraping behavior with the `pageOptions` parameter. Here are the available options:
Getting cleaner content with `onlyMainContent`
- Type: `boolean`
- Description: Only return the main content of the page, excluding headers, navigation bars, footers, etc.
- Default: `false`
Getting the HTML with `includeHtml`
- Type: `boolean`
- Description: Include the HTML version of the page content. This will add an `html` key in the response.
- Default: `false`
Getting the raw HTML with `includeRawHtml`
- Type: `boolean`
- Description: Include the raw HTML content of the page. This will add a `rawHtml` key in the response.
- Default: `false`
Getting a screenshot of the page with `screenshot`
- Type: `boolean`
- Description: Include a screenshot of the top of the page that you are scraping.
- Default: `false`
Waiting for the page to load with `waitFor`
- Type: `integer`
- Description: To be used only as a last resort. Wait for a specified number of milliseconds for the page to load before fetching content.
- Default: `0`
Example Usage
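A sketch of the request described below, under the same assumed endpoint and key placeholder as the earlier examples:

```python
import requests

# Sketch: scrape with pageOptions -- main content only, HTML included,
# and a 5-second wait before the content is fetched.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://docs.firecrawl.dev",
        "pageOptions": {
            "onlyMainContent": True,
            "includeHtml": True,
            "waitFor": 5000,
        },
    },
)
data = response.json()["data"]
print(data["markdown"])  # main content as markdown
print(data["html"])      # HTML version of the page
```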
In this example, the scraper will:
- Return only the main content of the page.
- Include the HTML content of the page in the response, in the `html` key.
- Wait for 5000 milliseconds (5 seconds) for the page to load before fetching the content.
Here is the API Reference for it: Scrape Endpoint Documentation
Extractor Options
When using the `/scrape` endpoint, you can specify options for extracting structured information from the page content using the `extractorOptions` parameter. Here are the available options:
`mode`
- Type: `string`
- Enum: `["llm-extraction", "llm-extraction-from-raw-html"]`
- Description: The extraction mode to use.
  - `llm-extraction`: Extracts information from the cleaned and parsed content.
  - `llm-extraction-from-raw-html`: Extracts information directly from the raw HTML.
`extractionPrompt`
- Type: `string`
- Description: A prompt describing what information to extract from the page.
`extractionSchema`
- Type: `object`
- Description: The schema for the data to be extracted. This defines the structure of the extracted data.
Example Usage
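A sketch of an LLM extraction request; the prompt, the schema fields, and the `llm_extraction` response key are illustrative assumptions:

```python
import requests

# Sketch: extract structured data using llm-extraction mode.
# The prompt and schema below are purely illustrative.
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://firecrawl.dev",
        "extractorOptions": {
            "mode": "llm-extraction",
            "extractionPrompt": "Extract the company mission and whether it is open source.",
            "extractionSchema": {
                "type": "object",
                "properties": {
                    "company_mission": {"type": "string"},
                    "is_open_source": {"type": "boolean"},
                },
                "required": ["company_mission"],
            },
        },
    },
)
# The response key holding the extracted object is assumed here.
print(response.json()["data"].get("llm_extraction"))
```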
Adjusting Timeout
You can adjust the timeout for the scraping process using the `timeout` parameter, specified in milliseconds.
Example Usage
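A sketch of setting a 30-second timeout, under the same assumptions as the earlier examples:

```python
import requests

# Sketch: abort the scrape if it takes longer than 30 seconds (30000 ms).
response = requests.post(
    "https://api.firecrawl.dev/v0/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://docs.firecrawl.dev", "timeout": 30000},
)
print(response.status_code)
print(response.json())
```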
Crawling Multiple Pages
To crawl multiple pages, you can use the `/crawl` endpoint. This endpoint allows you to specify a base URL you want to crawl; all accessible subpages will then be crawled.
This returns a `jobId`.
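A sketch of starting a crawl and capturing the returned `jobId` (same assumed endpoint base and key placeholder as above):

```python
import requests

# Sketch: start a crawl of a base URL; the response contains a jobId for polling.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://docs.firecrawl.dev"},
)
job_id = response.json()["jobId"]
print(job_id)
```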
Check Crawl Job
Used to check the status of a crawl job and get its result.
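A sketch of polling a crawl job until it finishes; the status URL path and the response fields (`status`, `data`) are assumptions for illustration:

```python
import time
import requests

# Sketch: poll the crawl job until it completes, then print part of each document.
# The status URL path and response fields are assumptions for illustration.
job_id = "YOUR_JOB_ID"
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v0/crawl/status/{job_id}",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    ).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)

for document in status.get("data", []):
    print(document.get("markdown", "")[:200])  # first 200 characters of each page
```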
Crawler Options
When using the `/crawl` endpoint, you can customize the crawling behavior with the `crawlerOptions` parameter. Here are the available options:
`includes`
- Type: `array`
- Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
- Example: `["/blog/*", "/products/*"]`
`excludes`
- Type: `array`
- Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
- Example: `["/admin/*", "/login/*"]`
`returnOnlyUrls`
- Type: `boolean`
- Description: If set to `true`, the response will only include a list of URLs instead of the full document data.
- Default: `false`
`maxDepth`
- Type: `integer`
- Description: Maximum depth to crawl relative to the entered URL. A `maxDepth` of 0 scrapes only the entered URL. A `maxDepth` of 1 scrapes the entered URL and all pages one level deep. A `maxDepth` of 2 scrapes up to two levels deep. Higher values follow the same pattern.
- Example: `2`
`mode`
- Type: `string`
- Enum: `["default", "fast"]`
- Description: The crawling mode to use. `fast` mode crawls websites without a sitemap 4x faster, but may be less accurate and is not recommended for heavily JavaScript-rendered websites.
- Default: `default`
`limit`
- Type: `integer`
- Description: Maximum number of pages to crawl.
- Default: `10000`
Example Usage
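A sketch of the crawl request described below (the base URL and key placeholder are illustrative):

```python
import requests

# Sketch: crawl with crawlerOptions matching the description that follows.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "crawlerOptions": {
            "includes": ["/blog/*", "/products/*"],
            "excludes": ["/admin/*", "/login/*"],
            "returnOnlyUrls": False,
            "maxDepth": 2,
            "mode": "fast",
            "limit": 1000,
        },
    },
)
print(response.json()["jobId"])
```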
In this example, the crawler will:
- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Skip URLs that match the patterns `/admin/*` and `/login/*`.
- Return the full document data for each page.
- Crawl up to a maximum depth of 2.
- Use the fast crawling mode.
- Crawl a maximum of 1000 pages.
Page Options + Crawler Options
You can combine the `pageOptions` and `crawlerOptions` parameters to customize both the crawling behavior and how each page is scraped.
Example Usage
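A sketch of a crawl request combining both parameter groups (base URL and key placeholder are illustrative):

```python
import requests

# Sketch: combine pageOptions and crawlerOptions in a single crawl request.
response = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url": "https://example.com",
        "pageOptions": {
            "onlyMainContent": True,
            "includeRawHtml": True,
            "waitFor": 5000,
        },
        "crawlerOptions": {
            "includes": ["/blog/*", "/products/*"],
            "maxDepth": 2,
            "mode": "fast",
        },
    },
)
print(response.json()["jobId"])
```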
In this example, the crawler will:
- Return only the main content for each page.
- Include the raw HTML content for each page.
- Wait for 5000 milliseconds for each page to load before fetching its content.
- Only crawl URLs that match the patterns `/blog/*` and `/products/*`.
- Crawl up to a maximum depth of 2.
- Use the fast crawling mode.
Extractor Options + Crawler Options
Coming soon…