This guide will walk you through the different endpoints of Firecrawl and how to use them fully with all its parameters.

Basic scraping with Firecrawl (/scrape)

To scrape a single page and get clean markdown content, you can use the /scrape endpoint.

# pip install firecrawl-py

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

content = app.scrape_url("https://mendable.ai")

Scraping pdfs

Firecrawl supports scraping pdfs by default. You can use the /scrape endpoint to scrape a pdf link and get the text content of the pdf. You can disable this by setting pageOptions.parsePDF to false.

Page Options

When using the /scrape endpoint, you can customize the scraping behavior with the pageOptions parameter. Here are the available options:

Getting cleaner content with onlyMainContent

  • Type: boolean
  • Description: Only return the main content of the page, excluding headers, navigation bars, footers, etc.
  • Default: false

Getting the HTML with includeHtml

  • Type: boolean
  • Description: Include the HTML version content of the page. This will add an html key in the response.
  • Default: false

Getting the raw HTML with includeRawHtml

  • Type: boolean
  • Description: Include the raw HTML content of the page. This will add an rawHtml key in the response.
  • Default: false

Getting a screenshot of the page with screenshot

  • Type: boolean
  • Decription: Include a screenshot of the top of the page that you are scraping.
  • Default: false

Waiting for the page to load with waitFor

  • Type: integer
  • Description: To be used only as a last resort. Wait for a specified amount of milliseconds for the page to load before fetching content.
  • Default: 0

Sending Headers for Authentication and more with headers

  • Type: object
  • Description: Headers to send with the request. Can be used to send cookies, user-agent, etc.
  • Default: {}

Example Usage

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H '
    Content-Type: application/json' \
    -H 'Authorization : Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml":true,
        "screenshot": true,
        "waitFor": 5000,
        "headers": {
          "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
          "Cookies": "cookie1=value1; cookie2=value2"
        }
      }
    }'

In this example, the scraper will:

  • Return only the main content of the page.
  • Include the raw HTML content in the response in the html key.
  • Wait for 5000 milliseconds (5 seconds) for the page to load before fetching the content.

Here is the API Reference for it: Scrape Endpoint Documentation

Extractor Options

When using the /scrape endpoint, you can specify options for extracting structured information from the page content using the extractorOptions parameter. Here are the available options:

mode

  • Type: string

  • Enum: ["llm-extraction", "llm-extraction-from-raw-html"]

  • Description: The extraction mode to use.

    • llm-extraction: Extracts information from the cleaned and parsed content.
    • llm-extraction-from-raw-html: Extracts information directly from the raw HTML.
  • Type: string

  • Description: A prompt describing what information to extract from the page.

extractionSchema

  • Type: object
  • Description: The schema for the data to be extracted. This defines the structure of the extracted data.

Example Usage

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://www.mendable.ai/",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Based on the information on the page, extract the information from the schema. ",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": {
                      "type": "string"
            },
            "supports_sso": {
                      "type": "boolean"
            },
            "is_open_source": {
                      "type": "boolean"
            },
            "is_in_yc": {
                      "type": "boolean"
            }
          },
          "required": [
            "company_mission",
            "supports_sso",
            "is_open_source",
            "is_in_yc"
          ]
        }
      }
    }'
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://mendable.ai/",
      "ogImage": "https://mendable.ai/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://mendable.ai/"
    },
    "llm_extraction": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}

Adjusting Timeout

You can adjust the timeout for the scraping process using the timeout parameter in milliseconds.

Example Usage

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H '
    Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai",
      "timeout": 50000
    }'

Crawling Multiple Pages

To crawl multiple pages, you can use the /crawl endpoint. This endpoint allows you to specify a base URL you want to crawl and all accessible subpages will be crawled.

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'

Returns a jobId

{ "jobId": "1234-5678-9101" }

Check Crawl Job

Used to check the status of a crawl job and get its result.

curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Crawler Options

When using the /crawl endpoint, you can customize the crawling behavior with the crawlerOptions parameter. Here are the available options:

includes

  • Type: array
  • Description: URL patterns to include in the crawl. Only URLs matching these patterns will be crawled.
  • Example: ["/blog/*", "/products/*"]

excludes

  • Type: array
  • Description: URL patterns to exclude from the crawl. URLs matching these patterns will be skipped.
  • Example: ["/admin/*", "/login/*"]

returnOnlyUrls

  • Type: boolean
  • Description: If set to true, the response will only include a list of URLs instead of the full document data.
  • Default: false

maxDepth

  • Type: integer
  • Description: Maximum depth to crawl relative to the entered URL. A maxDepth of 0 scrapes only the entered URL. A maxDepth of 1 scrapes the entered URL and all pages one level deep. A maxDepth of 2 scrapes the entered URL and all pages up to two levels deep. Higher values follow the same pattern.
  • Example: 2

mode

  • Type: string
  • Enum: ["default", "fast"]
  • Description: The crawling mode to use. fast mode crawls websites without a sitemap 4x faster but may be less accurate and is not recommended for heavily JavaScript-rendered websites.
  • Default: default

limit

  • Type: integer
  • Description: Maximum number of pages to crawl.
  • Default: 10000

Example Usage

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization : Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai",
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "excludes": ["/admin/*", "/login/*"],
        "returnOnlyUrls": false,
        "maxDepth": 2,
        "mode": "fast",
        "limit": 1000
      }
    }'

In this example, the crawler will:

  • Only crawl URLs that match the patterns /blog/* and /products/*.
  • Skip URLs that match the patterns /admin/* and /login/*.
  • Return the full document data for each page.
  • Crawl up to a maximum depth of 2.
  • Use the fast crawling mode.
  • Crawl a maximum of 1000 pages.

Page Options + Crawler Options

You can combine the pageOptions and crawlerOptions parameters to customize both the full crawling behavior.

Example Usage

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      },
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "maxDepth": 2,
        "mode": "fast",
      }
    }'

In this example, the crawler will:

  • Return only the main content for each page.
  • Include the raw HTML content for each page.
  • Wait for 5000 milliseconds for each page to load before fetching its content.
  • Only crawl URLs that match the patterns /blog/* and /products/*.
  • Crawl up to a maximum depth of 2.
  • Use the fast crawling mode.

Extractor Options + Crawler Options

Coming soon…