Crawl - Firecrawl Docs

Crawl multiple URLs based on options

curl --request POST \
  --url https://api.firecrawl.dev/v1/crawl \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url": "<string>",
  "excludePaths": [
    "<string>"
  ],
  "includePaths": [
    "<string>"
  ],
  "maxDepth": 10,
  "maxDiscoveryDepth": 123,
  "ignoreSitemap": false,
  "ignoreQueryParameters": false,
  "limit": 10000,
  "allowBackwardLinks": false,
  "crawlEntireDomain": false,
  "allowExternalLinks": false,
  "allowSubdomains": false,
  "delay": 123,
  "maxConcurrency": 123,
  "webhook": {
    "url": "<string>",
    "headers": {},
    "metadata": {},
    "events": [
      "completed"
    ]
  },
  "scrapeOptions": {
    "onlyMainContent": true,
    "includeTags": [
      "<string>"
    ],
    "excludeTags": [
      "<string>"
    ],
    "maxAge": 0,
    "headers": {},
    "waitFor": 0,
    "mobile": false,
    "skipTlsVerification": false,
    "timeout": 30000,
    "parsePDF": true,
    "jsonOptions": {
      "schema": {},
      "systemPrompt": "<string>",
      "prompt": "<string>"
    },
    "actions": [
      {
        "type": "wait",
        "milliseconds": 2,
        "selector": "#my-element"
      }
    ],
    "location": {
      "country": "US",
      "languages": [
        "en-US"
      ]
    },
    "removeBase64Images": true,
    "blockAds": true,
    "proxy": "basic",
    "storeInCache": true,
    "formats": [
      "markdown"
    ],
    "changeTrackingOptions": {
      "modes": [
        "git-diff"
      ],
      "schema": {},
      "prompt": "<string>",
      "tag": null
    }
  },
  "zeroDataRetention": false
}
'

{
  "success": true,
  "id": "<string>",
  "url": "<string>"
}

POST

crawl

Note: A new v2 version of this API is now available, with improved features and performance.

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
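As a rough illustration of the header format described above, the following Python sketch builds (but does not send) a request equivalent to the curl example. The helper name `build_crawl_request` and the placeholder token are hypothetical; only the endpoint URL, the two headers, and the `url` and `limit` fields come from this page.

```python
import json
import urllib.request

# Endpoint URL taken from the curl example on this page.
CRAWL_ENDPOINT = "https://api.firecrawl.dev/v1/crawl"


def build_crawl_request(token: str, start_url: str, **options) -> urllib.request.Request:
    """Build a POST request for the crawl endpoint without sending it.

    `options` holds any additional body fields (for example limit=100).
    This helper is illustrative and not part of any official SDK.
    """
    body = {"url": start_url, **options}
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "YOUR_TOKEN" is a placeholder, not a real credential.
request = build_crawl_request("YOUR_TOKEN", "https://example.com", limit=100)
```

Sending is left to the caller, for instance with `urllib.request.urlopen(request)`, so that the sketch can be inspected without network access.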

Body

application/json

url

string<uri>

required

The base URL to start crawling from

excludePaths

string[]

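Regular-expression patterns, matched against the URL pathname, that exclude matching URLs from the crawl. For example, if you set "excludePaths": ["blog/.*"] for the base URL firecrawl.dev, any results matching that pattern will be excluded, such as https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap.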

includePaths

string[]

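Regular-expression patterns, matched against the URL pathname, that select which URLs to include in the crawl. Only paths that match the specified patterns are included in the response. For example, if you set "includePaths": ["blog/.*"] for the base URL firecrawl.dev, only results matching that pattern will be included, such as https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap.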
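To make the pattern semantics concrete, here is a minimal Python sketch of a pathname filter. It assumes that patterns may match anywhere in the pathname (unanchored search) and that an exclude match takes priority over an include match; neither detail is stated on this page, and the function name `path_allowed` is hypothetical.

```python
import re
from urllib.parse import urlparse


def path_allowed(url: str, include_paths: list[str], exclude_paths: list[str]) -> bool:
    """Return True if the URL's pathname passes the include/exclude patterns.

    Assumptions (not confirmed by the API reference): patterns are applied
    with an unanchored search, and exclusion wins over inclusion.
    """
    pathname = urlparse(url).path
    if any(re.search(pattern, pathname) for pattern in exclude_paths):
        return False
    if include_paths:
        return any(re.search(pattern, pathname) for pattern in include_paths)
    # No include patterns means every non-excluded path is allowed.
    return True


blog_post = "https://www.firecrawl.dev/blog/firecrawl-launch-week-1-recap"
included = path_allowed(blog_post, include_paths=["blog/.*"], exclude_paths=[])
excluded = path_allowed(blog_post, include_paths=[], exclude_paths=["blog/.*"])
```

With these assumptions, `included` is true and `excluded` is false for the blog URL used in the descriptions above.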

maxDepth

integer

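default: 10

The maximum absolute depth to crawl, measured from the base of the entered URL. In practice, this is the maximum number of slashes that the pathname of a scraped URL may contain.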
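The slash-counting rule can be sketched in a few lines of Python. The function names are hypothetical, and ignoring a trailing slash is an assumption made here for illustration; the service may normalize paths differently.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count the slashes in the URL's pathname.

    Assumption: a trailing slash is ignored, so /a and /a/ have the same depth.
    """
    pathname = urlparse(url).path.rstrip("/")
    return pathname.count("/")


def within_max_depth(url: str, max_depth: int = 10) -> bool:
    """Return True if the URL is no deeper than max_depth (default 10, as above)."""
    return url_depth(url) <= max_depth


depth = url_depth("https://example.com/docs/api/crawl")
```

Here the pathname `/docs/api/crawl` contains three slashes, so the page would be skipped with `maxDepth` set to 2 and crawled with 3 or more.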

maxDiscoveryDepth

integer

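Maximum depth to crawl based on discovery order. The root site and pages listed in the sitemap have a discovery depth of 0. For example, if you set it to 1 and enable ignoreSitemap, you will only crawl the entered URL and all URLs that are linked on that page.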
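Discovery depth can be pictured as a breadth-first traversal. The sketch below runs over a small in-memory link graph rather than real pages; the function `discover` and the sample graph are hypothetical and only illustrate the rule that the root and sitemap pages start at depth 0.

```python
from collections import deque


def discover(root, links, max_discovery_depth, sitemap_urls=()):
    """Breadth-first discovery over an in-memory link graph.

    `links` maps each page to the pages it links to. The root and any
    sitemap URLs start at discovery depth 0. Returns {url: discovery_depth}.
    """
    depths = {root: 0}
    for url in sitemap_urls:
        depths.setdefault(url, 0)
    queue = deque(depths)
    while queue:
        current = queue.popleft()
        if depths[current] >= max_discovery_depth:
            continue  # links on this page would exceed the limit
        for target in links.get(current, []):
            if target not in depths:
                depths[target] = depths[current] + 1
                queue.append(target)
    return depths


# Hypothetical site: the root links to /a and /b, and /a links to /a/1.
site = {"/": ["/a", "/b"], "/a": ["/a/1"]}
without_sitemap = discover("/", site, max_discovery_depth=1)
with_sitemap = discover("/", site, max_discovery_depth=1, sitemap_urls=["/a"])
```

Without the sitemap, a limit of 1 reaches only the root and the two pages it links to. When `/a` is listed in the sitemap it starts at depth 0, so `/a/1` also falls within the limit.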

ignoreSitemap

boolean

default: false

Ignore the website's sitemap when crawling.

ignoreQueryParameters

boolean

default: false

Do not re-scrape the same path with different (or no) query parameters.
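One way to picture this option is as URL de-duplication on a key that omits the query string. The following Python sketch is an illustration only: the function names are hypothetical, and dropping the fragment from the key is an extra assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def dedupe_key(url: str, ignore_query_parameters: bool) -> str:
    """Return the key used to decide whether a URL was already seen.

    Assumption: the fragment is always dropped; the query string is dropped
    only when ignore_query_parameters is True.
    """
    parts = urlsplit(url)
    query = "" if ignore_query_parameters else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def unique_urls(urls, ignore_query_parameters=False):
    """Keep the first URL for each de-duplication key, preserving order."""
    seen = set()
    result = []
    for url in urls:
        key = dedupe_key(url, ignore_query_parameters)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


candidates = [
    "https://example.com/products?page=1",
    "https://example.com/products?page=2",
    "https://example.com/products",
]
kept_when_ignoring = unique_urls(candidates, ignore_query_parameters=True)
kept_by_default = unique_urls(candidates, ignore_query_parameters=False)
```

With the option enabled, only the first of the three candidate URLs is kept; with the default, all three are treated as distinct pages.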

limit

integer

default: 10000

Maximum number of pages to crawl. The default limit is 10000.

allowBackwardLinks

boolean

default: false

deprecated

⚠️ Deprecated: use "crawlEntireDomain" instead. This option allows the crawler to follow internal links to sibling or parent URLs, not just child paths.

crawlEntireDomain

boolean

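default: false

Allows the crawler to follow internal links to sibling or parent URLs, not just child paths.

false: only crawls deeper (child) URLs. → e.g. /features/feature-1 → /features/feature-1/tips ✅ → will not follow /pricing or / ❌

true: crawls any internal link, including siblings and parents. → e.g. /features/feature-1 → /pricing, /, etc. ✅

Set this to true when you need broader coverage of internal links beyond nested paths.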
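The two modes above can be sketched as a simple path-prefix check in Python. The function `should_follow` is hypothetical, and it treats "child URL" as "pathname begins with the start pathname plus a slash", which is an assumption about how nesting is determined.

```python
from urllib.parse import urlparse


def should_follow(start_url: str, link: str, crawl_entire_domain: bool = False) -> bool:
    """Decide whether an internal link is followed.

    Links on other hosts are rejected here because they are governed by
    separate options (allowExternalLinks, allowSubdomains).
    """
    start, target = urlparse(start_url), urlparse(link)
    if target.netloc != start.netloc:
        return False
    if crawl_entire_domain:
        return True  # siblings and parents are allowed
    # Assumption: a child URL extends the start pathname by at least one segment.
    base = start.path.rstrip("/")
    return target.path.startswith(base + "/")


start = "https://example.com/features/feature-1"
child = should_follow(start, "https://example.com/features/feature-1/tips")
sibling = should_follow(start, "https://example.com/pricing")
sibling_entire_domain = should_follow(start, "https://example.com/pricing", crawl_entire_domain=True)
```

This mirrors the examples above: the nested `/tips` page is followed in both modes, while `/pricing` is followed only when `crawlEntireDomain` is true.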

allowExternalLinks

boolean

default: false

Allows the crawler to follow links to external websites.

allowSubdomains

boolean

default: false

Allows the crawler to follow links to subdomains of the main domain.
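The host-level options can be illustrated with a small Python check. This is a simplified sketch under stated assumptions: the start URL's hostname is treated as the main domain (a real implementation would likely need a public-suffix list), and a subdomain link is followed if either flag is set. The function name `host_allowed` is hypothetical.

```python
from urllib.parse import urlparse


def host_allowed(
    start_url: str,
    link: str,
    allow_external_links: bool = False,
    allow_subdomains: bool = False,
) -> bool:
    """Return True if the link's host may be crawled under the given flags.

    Assumption: the hostname of start_url is the main domain.
    """
    start_host = urlparse(start_url).hostname or ""
    link_host = urlparse(link).hostname or ""
    if link_host == start_host:
        return True
    if link_host.endswith("." + start_host):
        # Subdomain of the main domain.
        return allow_subdomains or allow_external_links
    # Any other host counts as external.
    return allow_external_links


subdomain_default = host_allowed("https://example.com", "https://docs.example.com/guide")
subdomain_enabled = host_allowed(
    "https://example.com", "https://docs.example.com/guide", allow_subdomains=True
)
```

Note that the suffix check includes the leading dot, so a host such as `notexample.com` is treated as external rather than as a subdomain of `example.com`.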

delay

number

Delay in seconds between scrapes. This helps respect the website's rate limits.

maxConcurrency

integer

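Maximum number of concurrent scrapes. This parameter sets a concurrency limit for this crawl; if not specified, the crawl adheres to your team's concurrency limit.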

webhook

object

Webhook specification object.

scrapeOptions

object


zeroDataRetention

boolean

default: false

If true, no data will be retained for this crawl. To enable this feature, please contact [email protected]

Response

Successful response

success

boolean

id

string

url

string<uri>
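A client would typically validate these three fields before using them. The Python sketch below parses a response body into a small dataclass; the class and function names are hypothetical, and the sample values are made up for illustration rather than taken from a real response.

```python
import json
from dataclasses import dataclass


@dataclass
class CrawlStarted:
    """The three fields listed in the successful response."""
    success: bool
    id: str
    url: str


def parse_crawl_response(raw: str) -> CrawlStarted:
    """Parse a response body and check that all documented fields are present."""
    payload = json.loads(raw)
    missing = {"success", "id", "url"} - set(payload)
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return CrawlStarted(
        success=bool(payload["success"]),
        id=str(payload["id"]),
        url=str(payload["url"]),
    )


# Hypothetical response body; the id and url values are placeholders.
sample = '{"success": true, "id": "abc123", "url": "https://example.com/status/abc123"}'
started = parse_crawl_response(sample)
```

Raising on missing fields keeps later code from silently working with an incomplete response.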
