Advanced Scraping Guide

This guide walks through Firecrawl's endpoints and shows how to combine their parameters to get the most out of them.

Basic scraping with Firecrawl

To scrape a single page and get clean Markdown content, use the /scrape endpoint.

# pip install firecrawl-py

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape("https://firecrawl.dev")

print(doc.markdown)

Scraping PDFs

Firecrawl supports PDFs. To make sure a PDF is parsed, use the parsers option (for example, parsers: ["pdf"]).

Scrape options

When using the /scrape endpoint, you can customize the scrape with the following options.

Formats (`formats`)

Type: array
Strings: ["markdown", "links", "html", "rawHtml", "summary", "images"]
Object formats:
- JSON: { type: "json", prompt, schema }
- Screenshot: { type: "screenshot", fullPage?, quality?, viewport? }
- Change tracking: { type: "changeTracking", modes?, prompt?, schema?, tag? } (requires markdown)
Default: ["markdown"]
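Because formats mixes plain strings with format objects, a small check before sending can catch typos early. The following is a minimal sketch, not part of the Firecrawl SDK: build_formats is a hypothetical helper that only assembles the list and enforces the one dependency stated above (changeTracking requires markdown).

```python
from typing import Any, Dict, List, Union

Format = Union[str, Dict[str, Any]]

# Names copied from the option list above.
STRING_FORMATS = {"markdown", "links", "html", "rawHtml", "summary", "images"}
OBJECT_FORMATS = {"json", "screenshot", "changeTracking"}


def build_formats(*formats: Format) -> List[Format]:
    """Validate and return a formats list (hypothetical helper, not an SDK call)."""
    result: List[Format] = []
    for fmt in formats:
        if isinstance(fmt, str):
            if fmt not in STRING_FORMATS:
                raise ValueError(f"unknown string format: {fmt!r}")
        elif fmt.get("type") not in OBJECT_FORMATS:
            raise ValueError(f"unknown object format: {fmt!r}")
        result.append(fmt)

    names = {f if isinstance(f, str) else f["type"] for f in result}
    # The option list notes that changeTracking requires markdown.
    if "changeTracking" in names and "markdown" not in names:
        raise ValueError("changeTracking requires the markdown format")
    return result or ["markdown"]  # fall back to the documented default


formats = build_formats("markdown", {"type": "screenshot", "fullPage": True})
```

The returned list can be passed as the formats value of a scrape request.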

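Full page content vs main content (`onlyMainContent`)

Type: boolean
Description: By default, only the main content is returned. Set this to false to return the full page content.
Default: true

Include tags (`includeTags`)

Type: array
Description: HTML tags, classes, or IDs to include in the scrape.

Exclude tags (`excludeTags`)

Type: array
Description: HTML tags, classes, or IDs to exclude from the scrape.

Wait for page readiness (`waitFor`)

Type: integer
Description: Extra milliseconds to wait before scraping (use sparingly). This wait is added on top of Firecrawl's smart wait feature.
Default: 0

Freshness and cache (`maxAge`)

Type: integer (milliseconds)
Description: If a cached version of the page is still within maxAge, Firecrawl returns it immediately; otherwise it scrapes fresh content and updates the cache. Set this to 0 to always fetch fresh content.
Default: 172800000 (2 days)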
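The maxAge rule can be pictured as a simple age comparison. The sketch below is only an illustration of the behavior described above, not Firecrawl's implementation; should_use_cache is a hypothetical name, and treating an age exactly equal to maxAge as still fresh is an assumption.

```python
# 2 days expressed in milliseconds, matching the documented default.
DEFAULT_MAX_AGE_MS = 2 * 24 * 60 * 60 * 1000  # 172800000


def should_use_cache(cached_at_ms, now_ms, max_age_ms=DEFAULT_MAX_AGE_MS):
    """Return True if a cached page is fresh enough to be returned as-is."""
    if cached_at_ms is None or max_age_ms <= 0:
        # No cached copy, or maxAge of 0: always fetch fresh content.
        return False
    # Assumption: an age exactly equal to max_age_ms still counts as fresh.
    return (now_ms - cached_at_ms) <= max_age_ms
```

For example, a page cached one minute ago is served from cache under the default, while the same page with maxAge set to 0 is always scraped again.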

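Request timeout (`timeout`)

Type: integer
Description: Maximum duration, in milliseconds, before the request is aborted.
Default: 30000 (30 seconds)

PDF parsing (`parsers`)

Type: array
Description: Controls parsing behavior. To parse PDFs, set parsers: ["pdf"].

Actions (`actions`)

When using the /scrape endpoint, Firecrawl lets you perform actions on a web page before its content is scraped. This is particularly useful for interacting with dynamic content, navigating between pages, or reaching content that only appears after user interaction.

Type: array
Description: A sequence of browser steps to run before scraping.
Supported actions:
- wait { milliseconds }
- click { selector }
- write { selector, text }
- press { key }
- scroll { direction: "up" | "down" }
- scrape { selector } (scrapes a sub-element)
- executeJavascript { script }
- pdf (triggers PDF rendering in some flows)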

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')

doc = firecrawl.scrape(
    'https://example.com',
    # Browser steps run in order before the page content is captured.
    actions=[
        {'type': 'wait', 'milliseconds': 1000},
        {'type': 'click', 'selector': '#accept'},
        {'type': 'scroll', 'direction': 'down'},
        {'type': 'write', 'selector': '#q', 'text': 'firecrawl'},
        {'type': 'press', 'key': 'Enter'},
    ],
    formats=['markdown'],
)

print(doc.markdown)
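Since an actions list is plain data, it can be checked locally before spending a request on it. This is a rough sketch: validate_actions is a hypothetical helper, and the required fields are read off the supported-actions list above, so the real API may accept more fields or variants than this check allows.

```python
# Required fields per action type, taken from the supported-actions list.
REQUIRED_FIELDS = {
    "wait": {"milliseconds"},
    "click": {"selector"},
    "write": {"selector", "text"},
    "press": {"key"},
    "scroll": {"direction"},
    "scrape": {"selector"},
    "executeJavascript": {"script"},
    "pdf": set(),
}


def validate_actions(actions):
    """Return a list of problems found; an empty list means the actions look fine."""
    problems = []
    for i, action in enumerate(actions):
        kind = action.get("type")
        if kind not in REQUIRED_FIELDS:
            problems.append(f"action {i}: unsupported type {kind!r}")
            continue
        missing = REQUIRED_FIELDS[kind] - set(action)
        if missing:
            problems.append(f"action {i} ({kind}): missing {sorted(missing)}")
        elif kind == "scroll" and action["direction"] not in ("up", "down"):
            problems.append(f"action {i} (scroll): direction must be 'up' or 'down'")
    return problems


problems = validate_actions([
    {"type": "wait", "milliseconds": 1000},
    {"type": "write", "selector": "#q"},  # text is missing on purpose
])
```

Here problems contains one entry that points at the incomplete write action.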

Example usage

cURL

curl -X POST https://api.firecrawl.dev/v2/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": [
        "markdown",
        "links",
        "html",
        "rawHtml",
        { "type": "screenshot", "fullPage": true, "quality": 80 }
      ],
      "includeTags": ["h1", "p", "a", ".main-content"],
      "excludeTags": ["#ad", "#footer"],
      "onlyMainContent": false,
      "waitFor": 1000,
      "timeout": 15000,
      "parsers": ["pdf"]
    }'

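In this example, the scraper will:

- Return the full page content as Markdown.
- Include Markdown, raw HTML, HTML, links, and a screenshot in the response.
- Include only the HTML tags <h1>, <p>, and <a> and elements with the class .main-content, while excluding any elements with the IDs #ad and #footer.
- Wait 1000 milliseconds (1 second) before scraping so the page can load.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
- Parse PDFs explicitly via parsers: ["pdf"].

API reference: Scrape Endpoint Documentation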
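For readers who prefer plain Python over cURL, the same request can be assembled with the standard library. This is a sketch rather than the official SDK path: build_scrape_request is a hypothetical helper, and it only builds the request object; the commented lines show how it could be sent.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(api_key, url, **options):
    """Build (but do not send) a POST request for the /v2/scrape endpoint."""
    body = {"url": url, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scrape_request(
    "fc-YOUR-API-KEY",
    "https://docs.firecrawl.dev",
    formats=["markdown", "links", {"type": "screenshot", "fullPage": True, "quality": 80}],
    includeTags=["h1", "p", "a", ".main-content"],
    excludeTags=["#ad", "#footer"],
    onlyMainContent=False,
    waitFor=1000,
    timeout=15000,
    parsers=["pdf"],
)

# To actually send it (needs a valid API key and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping the build step separate from the send step makes the payload easy to inspect or log before any credits are used.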

JSON extraction via formats

Use a JSON format object in formats to extract structured data in a single pass:

curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY' \
  -d '{
    "url": "https://firecrawl.dev",
    "formats": [{
      "type": "json",
      "prompt": "Extract the features of the product",
      "schema": {"type": "object", "properties": {"features": {"type": "object"}}, "required": ["features"]}
    }]
  }'

/extract endpoint

When you need asynchronous extraction with status polling, use the dedicated extract job API.

import Firecrawl from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Start the extract job
const started = await firecrawl.startExtract({
  urls: ['https://docs.firecrawl.dev'],
  prompt: 'Extract title',
  schema: { type: 'object', properties: { title: { type: 'string' } }, required: ['title'] }
});

// Poll for status
const status = await firecrawl.getExtractStatus(started.id);
console.log(status.status, status.data);
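A single status check is rarely enough for an asynchronous job, so a polling loop is usually wrapped around it. The sketch below shows one way to do that in Python. It is deliberately SDK-agnostic: get_status is any callable you supply, wait_for_extract is a hypothetical helper, and the "processing" status name is an assumption that is not stated in the text above.

```python
import time


def wait_for_extract(get_status, job_id, interval_s=2.0, timeout_s=120.0, sleep=time.sleep):
    """Poll get_status(job_id) until the job leaves the 'processing' state.

    get_status must return a dict with a "status" key. The status value
    "processing" is an assumption made for this sketch.
    """
    waited = 0.0
    while True:
        status = get_status(job_id)
        if status.get("status") != "processing":
            return status  # finished, failed, or any other terminal state
        if waited >= timeout_s:
            raise TimeoutError(f"extract job {job_id} still processing after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s


# Stand-in for a real status call, so the loop can be tried without network access.
fake_responses = iter([
    {"status": "processing"},
    {"status": "completed", "data": {"title": "Docs"}},
])
result = wait_for_extract(lambda job_id: next(fake_responses), "job-1", sleep=lambda s: None)
```

Passing sleep as a parameter keeps the loop testable; in real use the default time.sleep applies.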

Crawling multiple pages

To crawl multiple pages, use the /v2/crawl endpoint.

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

Returns an ID

{ "id": "1234-5678-9101" }

Checking a crawl job

Use this to check the status of a crawl job and retrieve its results.

cURL

curl -X GET https://api.firecrawl.dev/v2/crawl/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY'

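Pagination / next URL

If the content is larger than 10MB, or if the crawl job is still running, the response may include a next parameter: a URL pointing to the next page of results.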
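Following next until it disappears can be sketched as a short loop. The example assumes each response is a JSON object with a data array and an optional next URL; collect_all_pages and the example.test URLs are hypothetical, and fetch_json stands in for whatever authenticated GET call you use.

```python
def collect_all_pages(fetch_json, first_url):
    """Follow `next` URLs and return every item from the `data` arrays."""
    items = []
    seen = set()
    url = first_url
    while url and url not in seen:  # the seen-set guards against a repeated URL
        seen.add(url)
        page = fetch_json(url)
        items.extend(page.get("data", []))
        url = page.get("next")
    return items


# In-memory stand-in for HTTP responses (hypothetical URLs and payloads).
pages = {
    "https://api.example.test/crawl/1": {
        "data": [{"markdown": "page A"}],
        "next": "https://api.example.test/crawl/1?skip=1",
    },
    "https://api.example.test/crawl/1?skip=1": {
        "data": [{"markdown": "page B"}],
    },
}
documents = collect_all_pages(pages.__getitem__, "https://api.example.test/crawl/1")
```

If the job is still running, a real client would typically wait and check the status again rather than treat the collected items as final.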

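Crawl prompt and params preview

You can provide a natural-language prompt and let Firecrawl derive the crawl settings automatically. Preview the result first: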

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl/params-preview \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "prompt": "Extract docs and blog"
  }'

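Crawler options

When using the /v2/crawl endpoint, you can customize crawl behavior with the following options:

includePaths

Type: array
Description: Regex patterns for paths to include.
Example: ["^/blog/.*$", "^/docs/.*$"]

excludePaths

Type: array
Description: Regex patterns for paths to exclude.
Example: ["^/admin/.*$", "^/private/.*$"]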
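It can help to try the patterns locally before starting a crawl. The sketch below applies the example patterns with Python's re module. It assumes the patterns are matched against the URL path and that exclusions win over inclusions; both are assumptions about the semantics, and is_allowed is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

INCLUDE = ["^/blog/.*$", "^/docs/.*$"]
EXCLUDE = ["^/admin/.*$", "^/private/.*$"]


def is_allowed(url, include=INCLUDE, exclude=EXCLUDE):
    """Apply include/exclude regexes to the URL path (assumed semantics)."""
    path = urlparse(url).path or "/"
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    # With no include patterns, everything that is not excluded is allowed.
    return not include or any(re.search(pattern, path) for pattern in include)


checks = {
    url: is_allowed(url)
    for url in [
        "https://example.com/blog/post-1",
        "https://example.com/admin/users",
        "https://example.com/blog",
    ]
}
```

Note that "^/blog/.*$" requires the trailing slash, so /blog itself does not match under this reading; a pattern such as "^/blog(/.*)?$" would cover both.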

maxDiscoveryDepth

Type: integer
Description: Maximum discovery depth for finding new URLs.

limit

Type: integer
Description: Maximum number of pages to crawl.
Default: 10000
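The interaction between maxDiscoveryDepth and limit is easier to see on a toy link graph. The following breadth-first walk is only a mental model, not Firecrawl's crawler: discover is a hypothetical function, the site graph is made up, and counting the start page as depth 0 is an assumption.

```python
from collections import deque


def discover(start, links, max_discovery_depth, limit):
    """Breadth-first walk over an in-memory link graph.

    `links` maps a page to the pages it links to. The start page is depth 0
    (an assumption about how depth is counted).
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < limit:
        page, depth = queue.popleft()
        order.append(page)
        if depth >= max_discovery_depth:
            continue  # stop looking for new URLs beyond this depth
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api"],
    "/docs/api": ["/docs/api/scrape"],
}
shallow = discover("/", site, max_discovery_depth=1, limit=10)
capped = discover("/", site, max_discovery_depth=5, limit=2)
```

In this model, depth bounds how far from the start page new links are followed, while limit caps the total number of pages regardless of depth.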

crawlEntireDomain

Type: boolean
Description: Expands the crawl across sibling and parent pages to cover the entire domain.
Default: false

allowExternalLinks

Type: boolean
Description: Follow links to external domains.
Default: false

allowSubdomains

Type: boolean
Description: Allow following subdomains of the main domain.
Default: false

delay

Type: number
Description: Delay between scrapes, in seconds.
Default: undefined

scrapeOptions

Type: object
Description: Options passed to the scraper (see the scrape options above).
Example: { "formats": ["markdown", "links", {"type": "screenshot", "fullPage": true}], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}
Default: formats: ["markdown"], with caching enabled by default (maxAge of about 2 days)
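A common slip is to put per-page options such as formats at the top level of a crawl request instead of inside scrapeOptions. The sketch below builds a crawl body and rejects that mistake locally. build_crawl_body is a hypothetical helper, and the idea that these keys belong only under scrapeOptions is an assumption drawn from the option lists in this guide.

```python
# Scrape-level option names from the "Scrape options" section.
SCRAPE_ONLY_KEYS = {
    "formats", "includeTags", "excludeTags", "onlyMainContent",
    "waitFor", "timeout", "maxAge", "parsers", "actions",
}


def build_crawl_body(url, scrape_options=None, **crawl_options):
    """Build a /v2/crawl JSON body, keeping per-page options nested."""
    misplaced = SCRAPE_ONLY_KEYS & set(crawl_options)
    if misplaced:
        raise ValueError(f"put {sorted(misplaced)} inside scrape_options")
    body = {"url": url, **crawl_options}
    if scrape_options:
        body["scrapeOptions"] = dict(scrape_options)
    return body


body = build_crawl_body(
    "https://docs.firecrawl.dev",
    scrape_options={"formats": ["markdown", "links"], "onlyMainContent": False},
    includePaths=["^/docs/.*$"],
    maxDiscoveryDepth=2,
    limit=1000,
    delay=1,
)
```

The resulting dictionary can be serialized with json.dumps and sent as the request body.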

Example usage

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "includePaths": ["^/blog/.*$", "^/docs/.*$"],
      "excludePaths": ["^/admin/.*$", "^/private/.*$"],
      "maxDiscoveryDepth": 2,
      "limit": 1000
    }'

Mapping website links

The /v2/map endpoint identifies URLs related to a given website.

Usage

cURL

curl -X POST https://api.firecrawl.dev/v2/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

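Map options

search

Type: string
Description: Filters for links containing the specified text.

limit

Type: integer
Description: Maximum number of links to return.
Default: 100

sitemap

Type: "only" | "include" | "skip"
Description: Controls how the sitemap is used during mapping.
Default: "include"

includeSubdomains

Type: boolean
Description: Whether to include the website's subdomains.
Default: true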
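The map options above can be gathered into a request body with the documented defaults filled in. This is a small sketch only: build_map_body is a hypothetical helper and not an SDK function, and the rule that limit must be at least 1 is an assumption.

```python
SITEMAP_MODES = ("only", "include", "skip")


def build_map_body(url, search=None, limit=100, sitemap="include", include_subdomains=True):
    """Build a /v2/map JSON body using the defaults listed above."""
    if sitemap not in SITEMAP_MODES:
        raise ValueError(f"sitemap must be one of {SITEMAP_MODES}, got {sitemap!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")  # assumed lower bound
    body = {
        "url": url,
        "limit": limit,
        "sitemap": sitemap,
        "includeSubdomains": include_subdomains,
    }
    if search:
        body["search"] = search  # only sent when a filter text is given
    return body


map_body = build_map_body("https://docs.firecrawl.dev", search="api", sitemap="skip")
```

Validating sitemap locally turns a typo into an immediate error instead of a failed request.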

Related API reference: /map Endpoint Documentation

Thanks for reading!
