Extract

Extract structured data from web pages using an LLM

curl --request POST \
  --url https://api.firecrawl.dev/v1/extract \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    "<string>"
  ],
  "prompt": "<string>",
  "schema": {},
  "enableWebSearch": false,
  "ignoreSitemap": false,
  "includeSubdomains": true,
  "showSources": false,
  "scrapeOptions": {
    "onlyMainContent": true,
    "includeTags": [
      "<string>"
    ],
    "excludeTags": [
      "<string>"
    ],
    "maxAge": 0,
    "headers": {},
    "waitFor": 0,
    "mobile": false,
    "skipTlsVerification": false,
    "timeout": 30000,
    "parsePDF": true,
    "jsonOptions": {
      "schema": {},
      "systemPrompt": "<string>",
      "prompt": "<string>"
    },
    "actions": [
      {
        "type": "wait",
        "milliseconds": 2,
        "selector": "#my-element"
      }
    ],
    "location": {
      "country": "US",
      "languages": [
        "en-US"
      ]
    },
    "removeBase64Images": true,
    "blockAds": true,
    "proxy": "basic",
    "storeInCache": true,
    "formats": [
      "markdown"
    ],
    "changeTrackingOptions": {
      "modes": [
        "git-diff"
      ],
      "schema": {},
      "prompt": "<string>",
      "tag": null
    }
  },
  "ignoreInvalidURLs": false
}
'

{
  "success": true,
  "id": "<string>",
  "invalidURLs": [
    "<string>"
  ]
}
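The curl call above can be reproduced from Python with only the standard library. This is a minimal sketch rather than an official client: the helper name `build_extract_request`, the placeholder token, and the example URL are assumptions, while the endpoint URL, headers, and field names come from the example above. The request is built but not sent.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.firecrawl.dev/v1/extract"

def build_extract_request(token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the extract endpoint."""
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder token and URL, for illustration only.
request = build_extract_request(
    "<token>",
    {"urls": ["https://example.com/*"], "prompt": "Extract the page title."},
)
# To send it, pass `request` to urllib.request.urlopen() and decode the JSON body.
```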


Note: A new v2 version of this API is now available, with improved features and performance.

Authorization

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Request body

application/json

urls

string<uri>[]

required

The URLs to extract data from. URLs should be in glob format.

prompt

string

Prompt that guides the extraction process.

schema

object

Schema that defines the structure of the extracted data. Must conform to JSON Schema.
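As an illustration of how `urls`, `prompt`, and `schema` fit together, the sketch below assembles a request body with a small JSON Schema. The schema contents, the example URL, and the helper name `build_extract_payload` are hypothetical; only the three field names are taken from this page.

```python
import json

# Hypothetical schema for illustration: one required string and an optional list.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}

def build_extract_payload(urls, prompt=None, schema=None):
    """Assemble a request body; `urls` is the only required field."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    payload = {"urls": list(urls)}
    # Optional fields are left out entirely when not supplied.
    if prompt is not None:
        payload["prompt"] = prompt
    if schema is not None:
        payload["schema"] = schema
    return payload

payload = build_extract_payload(
    ["https://example.com/products/*"],
    prompt="Extract the product name and its listed features.",
    schema=product_schema,
)
print(json.dumps(payload, indent=2))
```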

enableWebSearch

boolean

Default: false

When true, the extraction uses web search to find additional data.

ignoreSitemap

boolean

Default: false

When true, sitemap.xml files are ignored during website scanning.

includeSubdomains

boolean

Default: true

When true, subdomains of the provided URLs are also scanned.

showSources

boolean

Default: false

When true, the sources used to extract the data are included in the response under the sources field.

scrapeOptions

object

ignoreInvalidURLs

boolean

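Default: false

If invalid URLs are specified in the urls array, they are ignored. Instead of failing the entire request, the extraction runs on the remaining valid URLs, and the invalid URLs are returned in the invalidURLs field of the response.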
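The boolean flags and their documented defaults can be kept in one place so that a request only carries the values that differ. This is a sketch under that assumption: the `ExtractFlags` class and `non_default_flags` helper are hypothetical names, and the field names and defaults are the ones listed above.

```python
from dataclasses import asdict, dataclass

@dataclass
class ExtractFlags:
    # Field names and defaults mirror the request body parameters on this page.
    enableWebSearch: bool = False
    ignoreSitemap: bool = False
    includeSubdomains: bool = True
    showSources: bool = False
    ignoreInvalidURLs: bool = False

def non_default_flags(flags: ExtractFlags) -> dict:
    """Return only the flags whose value differs from the documented default."""
    defaults = asdict(ExtractFlags())
    return {name: value for name, value in asdict(flags).items() if value != defaults[name]}

# Tolerate bad URLs and stay on the exact hosts given.
overrides = non_default_flags(ExtractFlags(ignoreInvalidURLs=True, includeSubdomains=False))
body = {"urls": ["https://example.com/*"], **overrides}
```

Sending every flag explicitly would work as well; omitting the defaults just keeps the request body short.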

Response

Successful extraction

success

boolean

id

string

invalidURLs

string[] | null

If ignoreInvalidURLs is true, this field is an array containing the invalid URLs specified in the request. If there were no invalid URLs, the array is empty. If ignoreInvalidURLs is false, this field is undefined.
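Because invalidURLs may be an array, null, or absent, client code has to handle all three cases. The sketch below shows one way to do that on an already decoded response body; the helper name `read_submit_response` and the sample values are made up for illustration.

```python
def read_submit_response(body: dict) -> tuple:
    """Return (id, invalid URLs) from a decoded response body."""
    if not body.get("success"):
        raise RuntimeError(f"extract request was not accepted: {body!r}")
    # invalidURLs can be a list, null, or missing; normalize all of these to a list.
    invalid = body.get("invalidURLs") or []
    return body["id"], list(invalid)

# Sample body with made-up values, shaped like the response example on this page.
job_id, skipped = read_submit_response(
    {"success": True, "id": "job-123", "invalidURLs": ["not-a-url"]}
)
```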
