Extract

Introducing /extract (Beta)


Extract structured data from a single URL, multiple URLs, or entire websites using Large Language Models (LLMs). The /extract endpoint allows you to:

Extract structured data from full websites at once
Connect or build data enrichment applications that need structured data from websites
Develop AI applications that need clean data from multiple websites

Considerations

The /extract endpoint provides flexible data extraction with customizable schemas. Results can be improved through prompt tuning. It is currently in beta and we welcome your feedback.

Extracting Data

/extract endpoint

Used to extract structured data from entire websites. When specifying URLs, you can append /* to a URL to extract information from every page under that path rather than from a single page. For example, https://firecrawl.dev/* will attempt to extract data from all pages on the firecrawl.dev domain. The /* wildcard is still under testing, so please report any issues by emailing help@firecrawl.com.
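As a rough illustration of the wildcard convention described above, the sketch below normalizes a base URL into its /* form. The helper name with_wildcard is hypothetical and not part of the Firecrawl SDK; it only builds the string you would place in the URL list.

```python
def with_wildcard(url: str) -> str:
    """Return `url` with a trailing /* (hypothetical helper, not an SDK function)."""
    # Already a wildcard URL: leave it untouched.
    if url.endswith("/*"):
        return url
    # Drop any trailing slash so we never produce "//*".
    return url.rstrip("/") + "/*"


print(with_wildcard("https://firecrawl.dev"))  # https://firecrawl.dev/*
```

Passing the plain URL targets a single page, while the wildcard form asks the endpoint to consider the whole path.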

Usage

from firecrawl import FirecrawlApp
from pydantic import BaseModel

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.extract(
    [
        'https://docs.firecrawl.dev/*',
        'https://firecrawl.dev/',
        'https://www.ycombinator.com/companies/',
    ],
    prompt='Extract the company mission, whether it supports SSO, whether it is open source, and whether it is in Y Combinator from the page.',
    schema=ExtractSchema.model_json_schema(),
)
print(data)

For more details about the parameters, refer to the API Reference.

Response (SDKs)

JSON

{
  "success": true,
  "data": {
    "company_mission": "Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call.",
    "supports_sso": false,
    "is_open_source": true,
    "is_in_yc": true
  }
}

Response (async or not using SDKs)

JSON

{
  "success": true,
  "id": "850eb555-db9c-42b9-9d96-bac1fca8bb23",
  "urlTrace": []
}
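The two response shapes shown above can be told apart by their keys. The following is a minimal sketch, assuming only the fields that appear in the sample responses ("success", "id", "data"); the function name describe_extract_response is hypothetical.

```python
def describe_extract_response(body: dict) -> str:
    """Classify an /extract response by the keys seen in the sample payloads."""
    if not body.get("success"):
        return "error"
    # Async / raw API call: the job was queued and an id was returned.
    if "id" in body:
        return f"job started: {body['id']}"
    # SDK-style response: the extracted data is already included.
    if "data" in body:
        return "result ready"
    return "unknown"


async_body = {
    "success": True,
    "id": "850eb555-db9c-42b9-9d96-bac1fca8bb23",
    "urlTrace": [],
}
print(describe_extract_response(async_body))
# job started: 850eb555-db9c-42b9-9d96-bac1fca8bb23
```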

Checking extract status

You can use the /extract/ID endpoint to check the status of an extract job.

This endpoint only works for extract jobs that are in progress or extract jobs that have completed recently (within the last 24 hours).

curl -X GET https://api.firecrawl.dev/v1/extract/<extract_id> \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY'
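For reference, the same status request could be prepared in Python with the standard library. This is a sketch that mirrors the curl command above; build_status_request is a hypothetical helper name, and the request is only built here, not sent.

```python
import urllib.request

API_BASE = "https://api.firecrawl.dev/v1"


def build_status_request(extract_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for an extract job's status."""
    return urllib.request.Request(
        f"{API_BASE}/extract/{extract_id}",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


req = build_status_request("850eb555-db9c-42b9-9d96-bac1fca8bb23", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then parse the JSON body.
```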

Pending Response

Extract jobs can have one of the following states:

completed: The extract job has finished successfully.
processing: The extract job is still in progress.
failed: The extract job encountered an error and did not complete.
cancelled: The extract job was cancelled by the user.

JSON

{
  "success": true,
  "data": [],
  "status": "processing",
  "expiresAt": "2025-01-08T20:58:12.000Z"
}

Completed Response

JSON

{
  "success": true,
  "data": {
    "company_mission": "Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call.",
    "supports_sso": false,
    "is_open_source": true,
    "is_in_yc": true
  },
  "status": "completed",
  "expiresAt": "2025-01-08T20:58:12.000Z"
}
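Given the states listed above, a client typically polls until the job leaves the processing state. The sketch below shows one way to do that; wait_for_extract and its parameters are hypothetical, and the status fetcher is injected as a callable so the loop can be exercised with canned responses instead of real API calls.

```python
import time
from typing import Callable

# States after which the job will not change any further.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_extract(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state."""
    for _ in range(max_polls):
        body = fetch_status()
        if body.get("status") in TERMINAL_STATES:
            return body
        sleep(interval)  # still "processing": wait before asking again
    raise TimeoutError(f"extract job still processing after {max_polls} polls")


# Demo with canned responses shaped like the samples above (no network).
canned = iter([
    {"success": True, "data": [], "status": "processing"},
    {"success": True, "data": {"is_open_source": True}, "status": "completed"},
])
result = wait_for_extract(lambda: next(canned), sleep=lambda _s: None)
print(result["status"])  # completed
```

In real use, fetch_status would issue the GET request to /extract/<extract_id> and return the parsed JSON body.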

Extracting without schema

You can now extract without a schema by just passing a prompt to the endpoint. The LLM chooses the structure of the data.

curl -X POST https://api.firecrawl.dev/v1/extract \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "urls": [
        "https://docs.firecrawl.dev/",
        "https://firecrawl.dev/"
      ],
      "prompt": "Extract Firecrawl'\''s mission from the page."
    }'
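The request body from the curl command above can also be assembled in Python, which avoids the shell quoting needed for the apostrophe in the prompt. This is a minimal sketch; build_prompt_only_payload is a hypothetical helper, and only the "urls" and "prompt" fields shown in the curl example are used.

```python
import json
from typing import Iterable


def build_prompt_only_payload(urls: Iterable[str], prompt: str) -> str:
    """Serialize a schema-less /extract request body (no "schema" key)."""
    return json.dumps({"urls": list(urls), "prompt": prompt})


body = build_prompt_only_payload(
    ["https://docs.firecrawl.dev/", "https://firecrawl.dev/"],
    "Extract Firecrawl's mission from the page.",
)
print(body)
```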

Improving Results with Web Search & External Links

To improve extraction results, you can pass the enableWebSearch parameter to the endpoint. This allows it to attempt to find the data in external links, outside the scope of the provided URLs.
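As a sketch of how the flag might be added to a request body, the helper below copies an existing payload and sets enableWebSearch. The helper name with_web_search is hypothetical, and treating the parameter as a boolean is an assumption; check the API Reference for the accepted values.

```python
def with_web_search(payload: dict, enabled: bool = True) -> dict:
    """Return a copy of `payload` with enableWebSearch set (boolean value assumed)."""
    # Build a new dict so the caller's payload is not mutated.
    return {**payload, "enableWebSearch": enabled}


base = {
    "urls": ["https://firecrawl.dev/"],
    "prompt": "Extract Firecrawl's mission from the page.",
}
request_body = with_web_search(base)
print(request_body["enableWebSearch"])  # True
```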

Billing

While /extract is in beta, we charge 5 credits for each scraped URL that is used to form the final response. This is to prevent abuse and will change in the future.
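To make the beta pricing concrete, the arithmetic can be sketched as follows. The function name estimate_extract_credits is hypothetical; the only figure taken from the text is the rate of 5 credits per URL used in the final response.

```python
CREDITS_PER_URL = 5  # beta rate stated above; expected to change


def estimate_extract_credits(urls_used: int) -> int:
    """Credits charged when `urls_used` scraped URLs form the final response."""
    if urls_used < 0:
        raise ValueError("urls_used must be non-negative")
    return urls_used * CREDITS_PER_URL


print(estimate_extract_credits(12))  # 60
```

Note that with a /* wildcard the number of URLs used may be larger than the number of URLs you passed in.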
