Scrape and extract structured data with Firecrawl
Firecrawl uses AI to extract structured data from web pages in three steps:
- Set the Schema: Define a JSON schema (using OpenAI’s format) that describes the data you want, along with the webpage URL.
- Make the Request: Send your URL and schema to the scrape endpoint. See the Scrape Endpoint Documentation for details.
- Get Your Data: Receive clean, structured data that matches your schema and is ready to use.
This makes getting web data in the format you need quick and easy.
/scrape (with json) endpoint
Used to extract structured data from scraped pages.
from firecrawl import FirecrawlApp, JsonConfig
from pydantic import BaseModel, Field

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

# Describe the fields to extract as a Pydantic model
class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

# Convert the model to a JSON schema and pass it as the extraction config
json_config = JsonConfig(
    extractionSchema=ExtractSchema.model_json_schema(),
    mode="llm-extraction",
    pageOptions={"onlyMainContent": True}
)

# Scrape the page and request the "json" format
llm_extraction_result = app.scrape_url(
    'https://firecrawl.dev',
    formats=["json"],
    json_options=json_config
)

print(llm_extraction_result)
Output:
{
  "success": true,
  "data": {
    "json": {
      "company_mission": "AI-powered web scraping and data extraction",
      "supports_sso": true,
      "is_open_source": true,
      "is_in_yc": true
    },
    "metadata": {
      "title": "Firecrawl",
      "description": "AI-powered web scraping and data extraction",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "AI-powered web scraping and data extraction",
      "ogUrl": "https://firecrawl.dev/",
      "ogImage": "https://firecrawl.dev/og.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev/"
    }
  }
}
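As a rough illustration of consuming a response shaped like the output above, the sketch below parses the JSON and reads the extracted fields from `data.json`. It assumes you are working with the raw JSON body (the Python SDK may wrap the result in its own object), and `extract_fields` is a hypothetical helper, not part of Firecrawl.

```python
import json

# Sample response copied from the output above (metadata trimmed for brevity).
raw_response = """
{
  "success": true,
  "data": {
    "json": {
      "company_mission": "AI-powered web scraping and data extraction",
      "supports_sso": true,
      "is_open_source": true,
      "is_in_yc": true
    },
    "metadata": {"title": "Firecrawl", "sourceURL": "https://firecrawl.dev/"}
  }
}
"""


def extract_fields(response: dict) -> dict:
    """Hypothetical helper: return the extracted fields or raise on failure."""
    if not response.get("success"):
        raise ValueError("scrape request did not succeed")
    return response["data"]["json"]


fields = extract_fields(json.loads(raw_response))
print(fields["company_mission"])
print(fields["supports_sso"])
```

Checking `success` before touching `data` keeps a failed scrape from surfacing later as a confusing `KeyError`.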
You can also extract data without a schema by passing only a prompt to the endpoint. In that case the LLM chooses the structure of the returned data.
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "prompt": "Extract the company mission from the page."
      }
    }'
Output:
{
  "success": true,
  "data": {
    "json": {
      "company_mission": "AI-powered web scraping and data extraction"
    },
    "metadata": {
      "title": "Firecrawl",
      "description": "AI-powered web scraping and data extraction",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "AI-powered web scraping and data extraction",
      "ogUrl": "https://firecrawl.dev/",
      "ogImage": "https://firecrawl.dev/og.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev/"
    }
  }
}
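For readers who prefer Python over curl, here is a minimal sketch of the same prompt-only request built with the standard library. The `FIRECRAWL_API_KEY` environment variable and the `build_scrape_request` helper are assumptions made for this example; the URL, headers, and body mirror the curl command above.

```python
import json
import os
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(url: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Hypothetical helper: build a prompt-only scrape request (no schema)."""
    payload = {
        "url": url,
        "formats": ["json"],
        "jsonOptions": {"prompt": prompt},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scrape_request(
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
    os.environ.get("FIRECRAWL_API_KEY", "YOUR_API_KEY"),
)

# Only send the request when a real key is configured
# (the environment variable name is an assumption of this sketch).
if "FIRECRAWL_API_KEY" in os.environ:
    with urllib.request.urlopen(request) as response:
        print(json.load(response)["data"]["json"])
```

Because no schema is sent, the keys inside `data.json` are whatever the LLM decides, so downstream code should not assume a fixed set of fields.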
JSON options object
The jsonOptions object accepts the following parameters:
- schema: The schema to use for the extraction.
- systemPrompt: The system prompt to use for the extraction.
- prompt: The prompt to use for the extraction without a schema.
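The parameters above can be assembled into a request body as sketched below. This is an illustration only: `make_json_options` is a hypothetical helper, the system prompt text is made up, and pairing `schema` with `systemPrompt` in one request is an assumption rather than documented behavior.

```python
import json

# Hand-written JSON schema covering two of the fields used earlier on this page.
schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
    },
    "required": ["company_mission", "supports_sso"],
}


def make_json_options(schema=None, system_prompt=None, prompt=None) -> dict:
    """Hypothetical helper: build jsonOptions, omitting parameters not provided."""
    options = {"schema": schema, "systemPrompt": system_prompt, "prompt": prompt}
    return {key: value for key, value in options.items() if value is not None}


# Schema-based extraction with an illustrative system prompt.
body = {
    "url": "https://firecrawl.dev",
    "formats": ["json"],
    "jsonOptions": make_json_options(
        schema=schema,
        system_prompt="You extract facts about companies from their websites.",
    ),
}
print(json.dumps(body, indent=2))

# Prompt-only extraction: no schema, so the LLM picks the structure.
prompt_only = make_json_options(prompt="Extract the company mission from the page.")
print(json.dumps(prompt_only))
```

Dropping unset parameters keeps the request body small and avoids sending explicit nulls to the endpoint.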