> ## Documentation Index > Fetch the complete documentation index at: https://docs.firecrawl.dev/llms.txt > Use this file to discover all available pages before exploring further. # 抓取 > 将任意 URL 转换为干净的数据 Firecrawl 可将网页转换为 markdown，非常适合 LLM 应用。 * 它可处理各种复杂情况：代理、缓存、限流、被 JS 阻止的内容 * 处理动态内容：动态网站、JS 渲染的网站、PDF 和图片 * 输出干净的 markdown、结构化数据、截图或 html。详情请参阅 [Scrape Endpoint API 参考](https://docs.firecrawl.dev/api-reference/endpoint/scrape)。在交互式 Playground 中测试抓取——无需写代码。如果请求失败，请参见 [Errors](/zh/api-reference/errors) 了解完整的错误代码目录、原因、补救措施和重试指南。

## 使用 Firecrawl 抓取 URL

### /scrape 端点

用于抓取指定 URL 并获取其内容。

### 安装

```python Python theme={null} # 使用 pip 安装 firecrawl-py from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key="fc-YOUR-API-KEY", ) ``` ```js Node theme={null} // 使用 npm 安装 firecrawl import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： // apiKey: "fc-YOUR-API-KEY", }); ``` ```bash CLI theme={null} # 使用 npm 全局安装 npm install -g firecrawl # 身份验证(一次性设置) firecrawl login ```

### 使用方式

```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： # api_key="fc-YOUR-API-KEY", ) # Scrape a website: doc = firecrawl.scrape("https://firecrawl.dev", formats=["markdown", "html"]) print(doc) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高限流： // apiKey: "fc-YOUR-API-KEY", }); // 抓取网站： const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'html'] }); console.log(doc); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高的限流配额： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://firecrawl.dev", "formats": ["markdown", "html"] }' ``` ```bash CLI theme={null} # Scrape a URL and get markdown firecrawl https://firecrawl.dev # 使用多种格式(返回 JSON) firecrawl https://firecrawl.dev --format markdown,html,links --pretty ``` 有关参数的更多信息，请参阅 [API 参考](https://docs.firecrawl.dev/api-reference/endpoint/scrape)。 **PDF 和文档：**`/scrape` 会根据 URL 自动检测 PDF、DOCX 和其他文档类型。传入 PDF URL 的方式与传入任意网页相同——Firecrawl 会对其进行解析并返回干净的 markdown。对于无法通过 URL 访问的本地文件，请改用 [`/parse`](/zh/features/parse)。 ```python Python theme={null} doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"]) print(doc.markdown) ``` 每次抓取会消耗 1 个额度。某些选项会产生额外消耗：JSON 模式每页额外消耗 4 个额度，question 和 highlights formats 每种格式每页额外消耗 4 个额度，高级代理 (enhanced proxy) 每页额外消耗 4 个额度，PII 脱敏每页额外消耗 4 个额度，PDF 解析每个 PDF 页面额外消耗 1 个额度，音频或视频提取每页额外消耗 4 个额度。 ### 响应各 SDK 会直接返回数据对象。cURL 将按下方所示原样返回载荷。 ```json theme={null} { "success": true, "data" : { "markdown": "Launch Week I 开始了！[查看我们第 2 天的发布 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 获享 2 个月免费...", "html": " ## 抓取 formats 现在你可以选择输出所需的 formats。你可以指定多个输出 formats。支持的 formats 包括： * Markdown (`markdown`) * 摘要 (`summary`) * HTML (`html`) - 页面 HTML 的清洗版本 * 原始 HTML (`rawHtml`) - 按页面返回的不经修改的 HTML * 截图 (`screenshot`，可选项包括 `fullPage`、`quality`、`viewport`) — 截图 URL 会在 24 小时后过期 * 链接 (`links`) * JSON (`json`) - 结构化输出 * 图片 (`images`) - 提取页面中的所有图片 URL * 品牌 (`branding`) - 提取品牌识别与设计系统 * Product (`product`) - 从产品页面提取结构化产品 (标题、价格、可用性、变体) * 音频 (`audio`) - 从支持的视频 URL (例如 YouTube) 中提取 MP3 音频 (返回签名 GCS URL，1 小时后过期) * 视频 (`video`) - 从支持的视频 URL (例如 YouTube) 中提取最佳质量的视频 (返回签名 GCS URL，1 小时后过期) * Query (`query`，带有 `prompt` 和可选的 `mode`) - 针对页面提出一个自然语言问题；答案会在 `answer` 字段中返回输出键将与你选择的 format 对应。

## 提取结构化数据

### /scrape (使用 JSON) 端点

用于从抓取的页面中提取结构化数据。要提取产品信息？对于产品页面，[`product` 格式](#extract-product-data)会以确定性的方式返回结构化的产品字段 (标题、价格、库存状态、变体) ——无需调用 LLM，也无需定义 schema。当你需要自定义字段或处理非产品页面时，再使用 `json`。 ```python Python theme={null} from firecrawl import Firecrawl from pydantic import BaseModel app = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key="fc-YOUR-API-KEY", ) class CompanyInfo(BaseModel): company_mission: str supports_sso: bool is_open_source: bool is_in_yc: bool result = app.scrape( 'https://firecrawl.dev', formats=[{ "type": "json", "schema": CompanyInfo.model_json_schema() }], only_main_content=False, timeout=120000 ) print(result) ``` ```js Node theme={null} import { Firecrawl } from "firecrawl"; import { z } from "zod"; const app = new Firecrawl({ // 无需 API 密钥即可开始使用——添加一个以获得更高的限流额度： // apiKey: "fc-YOUR_API_KEY", }); // Define schema to extract contents into const schema = z.object({ company_mission: z.string(), supports_sso: z.boolean(), is_open_source: z.boolean(), is_in_yc: z.boolean() }); const result = await app.scrape("https://firecrawl.dev", { formats: [{ type: "json", schema: schema }], }); console.log(result); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer YOUR_API_KEY" 以获得更高的限流额度： curl -X POST https://api.firecrawl.dev/v2/scrape \ -H 'Content-Type: application/json' \ -d '{ "url": "https://firecrawl.dev", "formats": [ { "type": "json", "schema": { "type": "object", "properties": { "company_mission": { "type": "string" }, "supports_sso": { "type": "boolean" }, "is_open_source": { "type": "boolean" }, "is_in_yc": { "type": "boolean" } }, "required": [ "company_mission", "supports_sso", "is_open_source", "is_in_yc" ] } } ] }' ``` 输出： ```json JSON theme={null} { "success": true, "data": { "json": { "company_mission": "AI 赋能的网页抓取与数据抽取", "supports_sso": true, "is_open_source": true, "is_in_yc": true }, "metadata": { "title": "Firecrawl", "description": "AI 赋能的网页抓取与数据抽取", "robots": "follow, index", "ogTitle": "Firecrawl", "ogDescription": "AI 赋能的网页抓取与数据抽取", "ogUrl": "https://firecrawl.dev/", "ogImage": "https://firecrawl.dev/og.png", "ogLocaleAlternate": [], "ogSiteName": "Firecrawl", "sourceURL": "https://firecrawl.dev/" }, } } ```

### 无需 schema 的提取

现在，你只需向端点传入一个 `prompt`，即可在不提供 schema 的情况下进行提取。LLM 会自行决定数据结构。 ```python Python theme={null} from firecrawl import Firecrawl app = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key="fc-YOUR-API-KEY", ) result = app.scrape( 'https://firecrawl.dev', formats=[{ "type": "json", "prompt": "Extract the company mission from the page." }], only_main_content=False, timeout=120000 ) print(result) ``` ```js Node theme={null} import { Firecrawl } from "firecrawl"; const app = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： // apiKey: "fc-YOUR_API_KEY", }); const result = await app.scrape("https://firecrawl.dev", { formats: [{ type: "json", prompt: "Extract the company mission from the page." }] }); console.log(result); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用——添加 -H "Authorization: Bearer YOUR_API_KEY" 可获得更高限流： curl -X POST https://api.firecrawl.dev/v2/scrape \ -H 'Content-Type: application/json' \ -d '{ "url": "https://firecrawl.dev", "formats": [{ "type": "json", "prompt": "Extract the company mission from the page." }] }' ``` 输出： ```json JSON theme={null} { "success": true, "data": { "json": { "company_mission": "AI 驱动的网页抓取与数据抽取", }, "metadata": { "title": "Firecrawl", "description": "AI 驱动的网页抓取与数据抽取", "robots": "follow, index", "ogTitle": "Firecrawl", "ogDescription": "AI 驱动的网页抓取与数据抽取", "ogUrl": "https://firecrawl.dev/", "ogImage": "https://firecrawl.dev/og.png", "ogLocaleAlternate": [], "ogSiteName": "Firecrawl", "sourceURL": "https://firecrawl.dev/" }, } } ```

### JSON 格式选项

使用 `json` 格式时，在 `formats` 中传入一个对象，包含以下参数： * `schema`：用于结构化输出的 JSON Schema。 * `prompt`：可选提示；在提供 schema 时或仅需轻量指引时用于辅助抽取。

## 提取品牌识别度

### /scrape (品牌信息) 端点

branding 格式会从网页中提取完整的品牌识别信息，包括颜色、字体、排版、间距、UI 组件等。它适用于设计系统分析、品牌监测，或构建需要理解网站视觉风格的工具。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key='fc-YOUR_API_KEY', ) result = firecrawl.scrape( url='https://firecrawl.dev', formats=['branding'] ) print(result['branding']) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： // apiKey: "fc-YOUR-API-KEY", }); const result = await firecrawl.scrape('https://firecrawl.dev', { formats: ['branding'] }); console.log(result.branding); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高限流： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://firecrawl.dev", "formats": ["branding"] }' ``` ### 响应品牌配置会返回一个完整的 `BrandingProfile` 对象，其结构如下： ```json Output theme={null} { "success": true, "data": { "branding": { "colorScheme": "dark", "logo": "https://firecrawl.dev/logo.svg", "colors": { "primary": "#FF6B35", "secondary": "#004E89", "accent": "#F77F00", "background": "#1A1A1A", "textPrimary": "#FFFFFF", "textSecondary": "#B0B0B0" }, "fonts": [ { "family": "Inter" }, { "family": "Roboto Mono" } ], "typography": { "fontFamilies": { "primary": "Inter", "heading": "Inter", "code": "Roboto Mono" }, "fontSizes": { "h1": "48px", "h2": "36px", "h3": "24px", "body": "16px" }, "fontWeights": { "regular": 400, "medium": 500, "bold": 700 } }, "spacing": { "baseUnit": 8, "borderRadius": "8px" }, "components": { "buttonPrimary": { "background": "#FF6B35", "textColor": "#FFFFFF", "borderRadius": "8px" }, "buttonSecondary": { "background": "transparent", "textColor": "#FF6B35", "borderColor": "#FF6B35", "borderRadius": "8px" } }, "images": { "logo": "https://firecrawl.dev/logo.svg", "favicon": "https://firecrawl.dev/favicon.ico", "ogImage": "https://firecrawl.dev/og-image.png" } } } } ```

### 品牌档案结构

`branding` 对象包含以下属性： * `colorScheme`: 检测到的配色方案 (`"light"` 或 `"dark"`) * `logo`: 主徽标的 URL * `colors`: 包含品牌颜色的对象： * `primary`、`secondary`、`accent`: 主要品牌色 * `background`、`textPrimary`、`textSecondary`: 界面颜色 * `link`、`success`、`warning`、`error`: 语义颜色 * `fonts`: 页面中使用的字体系列数组 * `typography`: 详细的排版信息： * `fontFamilies`: 正文、标题与代码字体系列 * `fontSizes`: 标题与正文的尺寸定义 * `fontWeights`: 字重定义 (细体、常规、中等、粗体) * `lineHeights`: 不同文本类型的行高值 * `spacing`: 间距与布局信息： * `baseUnit`: 基础间距单位 (像素) * `borderRadius`: 默认圆角 * `padding`、`margins`: 间距值 * `components`: UI 组件样式： * `buttonPrimary`、`buttonSecondary`: 按钮样式 * `input`: 输入框样式 * `icons`: 图标样式信息 * `images`: 品牌图像 (logo、favicon、og:image) * `animations`: 动画与过渡设置 * `layout`: 布局配置 (栅格、页眉/页脚高度) * `personality`: 品牌个性特征 (语气、调性、目标受众)

### 与其他 formats 结合使用

你可以将 branding 格式与其他 formats 组合，以获取更全面的页面数据： ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key='fc-YOUR_API_KEY', ) result = firecrawl.scrape( url='https://firecrawl.dev', formats=['markdown', 'branding', 'screenshot'] ) print(result['markdown']) print(result['branding']) print(result['screenshot']) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： // apiKey: "fc-YOUR-API-KEY", }); const result = await firecrawl.scrape('https://firecrawl.dev', { formats: ['markdown', 'branding', 'screenshot'] }); console.log(result.markdown); console.log(result.branding); console.log(result.screenshot); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高的限流上限： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://firecrawl.dev", "formats": ["markdown", "branding", "screenshot"] }' ```

## 提取产品数据

`product` 格式会以**确定性**方式提取结构化的产品数据——它产出的结构化输出与 [`json` 格式](#extract-structured-data)相同，但无需调用 LLM，也不用自行定义 schema，专为产品页面而设计。如果你一直在用 `json` schema 提取产品字段，建议改用 `formats: ["product"]`——它更快、更便宜，但仅适用于产品。它会返回一个 `product` 对象，其中包含 title、brand、category、description 和 variants；每个 variant 都带有 price、original price、availability 和图片——适合用于价格监控、商品目录采集或比价工具。

### /scrape (带有 product 参数) 端点

```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始——添加后可提升限流额度： # api_key='fc-YOUR_API_KEY', ) result = firecrawl.scrape( url='https://example.com/products/wireless-headphones', formats=['product'] ) print(result['product']) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始 — 添加后可提升限流额度： // apiKey: "fc-YOUR-API-KEY", }); const result = await firecrawl.scrape('https://example.com/products/wireless-headphones', { formats: ['product'] }); console.log(result.product); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 可获得更高限流： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/products/wireless-headphones", "formats": ["product"] }' ``` ### 响应 `product` 格式会返回一个 `product` 对象，结构如下： ```json Output theme={null} { "success": true, "data": { "product": { "title": "Wireless Noise-Cancelling Headphones", "brand": "Acme", "category": "Electronics > Audio > Headphones", "url": "https://example.com/products/wireless-headphones", "description": "Over-ear wireless headphones with active noise cancellation, 30-hour battery life, and plush memory-foam ear cushions for all-day comfort.", "variants": [ { "id": "wireless-headphones-black", "sku": "ACME-WH-BLACK", "title": "Wireless Noise-Cancelling Headphones — Black", "values": { "color": "Black" }, "price": { "amount": 199.99, "currency": "USD", "formatted": "$199.99" }, "sale": { "originalPrice": { "amount": 249.99, "currency": "USD", "formatted": "$249.99" } }, "availability": { "inStock": true, "text": "In Stock" }, "images": [ { "url": "https://example.com/images/headphones-black.jpg", "alt": "Wireless Noise-Cancelling Headphones — Black" } ] } ] } } } ```

### 产品对象结构

`product` 对象包含以下属性： * `title`：产品名称 * `brand`：产品品牌 (可选) * `category`：产品类别 (可选) * `url`：产品的规范 URL * `description`：产品描述 (可选) * `variants`：产品变体数组。价格、可用性和图片信息都在各个变体上——即使是单 SKU 产品，也仍会准确返回一个包含这些信息的变体。每个变体包含： * `id`、`sku`、`title`：变体标识符和标签 (均为可选) * `values`：选项名称到选项值的映射，例如 `{ "color": "Charcoal" }` (可选) * `price`：当前价格对象 (可选) ： * `amount`：数值型价格 * `currency`：货币代码，仅当页面提供该信息时才会返回 (可选) * `formatted`：页面上显示的价格 (可选) * `sale`：仅在变体有折扣时才会出现 (可选) 。包含： * `originalPrice`：原价 (折扣前价格) ，结构与 `price` 相同 * `availability`：库存信息，在变体上始终存在： * `inStock`：该变体是否有库存 * `text`：页面上的原始库存文本 (可选) * `images`：变体图片数组，每项都包含 `url` 和可选的 `alt` 文本 (可选)

### 产品提取的工作方式

`product` 格式会以确定性方式从页面上的结构化数据中提取产品——不涉及 LLM。它会按优先级合并多个来源：**JSON-LD > schema.org microdata > RDFa > 嵌入状态 (`__NEXT_DATA__`/Nuxt/Apollo/Redux/Remix) > AliExpress `runParams` > GA4 `dataLayer` > OpenGraph/``**。合并过程会识别产品身份，因此绝不会将不同产品的字段混合在一起。只有当页面来源中提供了货币信息时，才会返回货币。产品提取采用 fail-closed 策略：对于有歧义的页面，不会产出任何产品；而像 OpenGraph 这样较弱的来源，也只有在存在价格时才会参与补充。在没有可提取产品的页面上，响应会省略 `product` 对象，并添加一个 `warning` (例如“未找到产品...”) 。 **自托管：** `product` 格式由专用的产品提取服务提供支持。在 Firecrawl Cloud 上可开箱即用。如果你选择自托管，请设置 `PRODUCT_EXTRACTION_SERVICE_URL` 并将其指向该服务——如果未设置，请求 `product` 格式时会返回警告，且不会返回产品 (音频/视频格式对其服务也采用相同模式) 。 ### 与其他 formats 组合使用你可以将 product 格式与其他 formats 组合使用，以获取更全面的页面数据： ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始——添加一个以提高限流上限： # api_key='fc-YOUR_API_KEY', ) result = firecrawl.scrape( url='https://example.com/products/wireless-headphones', formats=['markdown', 'product'] ) print(result['markdown']) print(result['product']) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始 — 添加后可提升限流上限： // apiKey: "fc-YOUR-API-KEY", }); const result = await firecrawl.scrape('https://example.com/products/wireless-headphones', { formats: ['markdown', 'product'] }); console.log(result.markdown); console.log(result.product); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 可提升限流上限： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/products/wireless-headphones", "formats": ["markdown", "product"] }' ```

## 音频提取

`audio` 格式可从受支持的网站 (如 YouTube) 提取音频并保存为 MP3 文件，同时返回一个已签名的 Google Cloud Storage URL。这适用于构建音频处理流程、转录服务或播客工具。音频提取每页消耗 5 个额度 (1 个基础额度 + 4 个额外额度) 。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape("https://www.youtube.com/watch?v=dQw4w9WgXcQ", formats=["audio"]) print(doc.audio) # MP3 文件的已签名 GCS URL ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://www.youtube.com/watch?v=dQw4w9WgXcQ', { formats: ['audio'] }); console.log(doc.audio); // MP3 文件的已签名 GCS URL ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高限流： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "formats": ["audio"] }' ```

## 视频提取

`video` 格式可从受支持的网站 (如 YouTube) 提取最高质量的视频，并返回一个已签名的 Google Cloud Storage URL。这适用于构建视频处理管道、内容审核工具或媒体归档工作流。视频提取每页消耗 5 个额度 (1 个基础额度 + 4 个额外额度) 。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用——添加一个以获得更高的限流： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape("https://www.youtube.com/watch?v=dQw4w9WgXcQ", formats=["video"]) print(doc.video) # 视频文件的签名 GCS URL ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://www.youtube.com/watch?v=dQw4w9WgXcQ', { formats: ['video'] }); console.log(doc.video); // 视频文件的签名 GCS URL ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高的限流配额： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "formats": ["video"] }' ```

## Question 格式

使用 `question` 格式对页面提出一个自然语言问题。Firecrawl 会在响应的 `answer` 字段中返回答案。 `question` 格式每页消耗 5 个额度 (1 个基础额度 + 4 个用于 LLM 调用的额外额度) 。格式对象中的选项： * `question` (`type: "question"` 时必填) ：要回答的问题。最多 10,000 个字符。你可以将 `question` 与其他 formats 组合使用——例如，同时请求 `markdown` 和 `question`，即可在一次调用中获得页面内容和答案。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape( "https://firecrawl.dev", formats=[{"type": "question", "question": "What is Firecrawl?"}], ) print(doc.answer) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: [{ type: 'question', question: 'What is Firecrawl?' }], }); console.log(doc.answer); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高的限流配额： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://firecrawl.dev", "formats": [ { "type": "question", "question": "What is Firecrawl?" } ] }' ``` `question` 格式也可通过 `/search` 中的 `scrapeOptions` 使用，这会对每个搜索结果执行相同的提取。

## Highlights 格式

使用 `highlights` 格式可从页面中找出相关的源文本。Firecrawl 会在响应的 `highlights` 字段中返回所选文本。 `highlights` 格式每页消耗 5 额度 (1 个基础额度 + 4 个 LLM 调用的额外额度) 。格式对象中的选项： * `query` (`type: "highlights"` 时必填) ：源文本选择请求。最大 10,000 个字符。你可以将 `highlights` 与其他 formats 组合使用——例如，同时请求 `markdown` 和 `highlights`，即可在一次调用中获取页面内容和源文本。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape( "https://firecrawl.dev", formats=[{"type": "highlights", "query": "What is Firecrawl?"}], ) print(doc.highlights) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://firecrawl.dev', { formats: [{ type: 'highlights', query: 'What is Firecrawl?' }], }); console.log(doc.highlights); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 以获得更高的限流配额： curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://firecrawl.dev", "formats": [ { "type": "highlights", "query": "What is Firecrawl?" } ] }' ``` `highlights` 格式也可通过 `/search` 中的 `scrapeOptions` 使用，对每个搜索结果执行相同的提取。

## PII 脱敏

设置 `redactPII: true`，对返回的 markdown 中的个人身份信息进行脱敏处理。`markdown` 字段包含脱敏后的结果。请参见 [PII 脱敏](/zh/features/pii-redaction) 了解 SDK、cURL、CLI 和 MCP 示例。

## 使用 actions 与页面交互

Firecrawl 允许你在抓取页面内容之前对网页执行各种 actions。这对于与动态内容交互、在页面之间导航，或访问需要用户操作的内容特别有用。 **我们推荐使用 [交互](/zh/features/interact) 而不是 actions：这是我们更新、更强大的已抓取页面交互方式。** 交互以有状态的浏览器会话形式运行，并可在多次调用之间保持存活，因此你可以通过以下任一方式逐步控制页面： * **自然语言**，适用于灵活、非确定性的流程。例如：*“搜索‘无线耳机’，筛选出 4 星以上且价格低于 200 美元的结果，并返回这些结果”*。 * **Playwright 或 agent-browser 代码**，适用于确定性的步骤。例如：`await page.click('#export')`。交互还支持配置文件、持久会话以及可嵌入的实时浏览器视图 (包含一种交互模式，最终用户可以自行控制浏览器) 。下面是一个示例，演示如何使用 actions 访问 google.com，搜索 Firecrawl，点击第一个结果，并截取页面截图。在执行其他 actions 之前或之后，几乎都应使用 `wait` action，为页面加载预留足够时间。

### 示例

```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加后可提升限流额度： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape( url="https://example.com/login", formats=["markdown"], actions=[ {"type": "write", "text": "john@example.com"}, {"type": "press", "key": "Tab"}, {"type": "write", "text": "secret"}, {"type": "click", "selector": 'button[type="submit"]'}, {"type": "wait", "milliseconds": 1500}, {"type": "screenshot", "full_page": True}, ], ) print(doc.markdown, doc.screenshot) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://example.com/login', { formats: ['markdown'], actions: [ { type: 'write', text: 'john@example.com' }, { type: 'press', key: 'Tab' }, { type: 'write', text: 'secret' }, { type: 'click', selector: 'button[type="submit"]' }, { type: 'wait', milliseconds: 1500 }, { type: 'screenshot', fullPage: true }, ], }); console.log(doc.markdown, doc.screenshot); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer YOUR_API_KEY" 以获取更高限流配额： curl -X POST https://api.firecrawl.dev/v2/scrape \ -H 'Content-Type: application/json' \ -d '{ "url": "https://example.com/login", "formats": ["markdown"], "actions": [ { "type": "write", "text": "john@example.com" }, { "type": "press", "key": "Tab" }, { "type": "write", "text": "secret" }, { "type": "click", "selector": "button[type=\"submit\"]" }, { "type": "wait", "milliseconds": 1500 }, { "type": "screenshot", "fullPage": true }, ], }' ```

### 输出

```json JSON theme={null} { "success": true, "data": { "markdown": "我们的首次 Launch Week 已圆满结束！[查看回顾 🚀](blog/firecrawl-launch-week-1-recap)...", "actions": { "screenshots": [ "https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png" ], "scrapes": [ { "url": "https://www.firecrawl.dev/", "html": "

Firecrawl

" } ] }, "metadata": { "title": "首页 - Firecrawl", "description": "Firecrawl 可抓取并将任意网站转换为干净的 Markdown。", "language": "en", "keywords": "Firecrawl,Markdown,数据,Mendable,LangChain", "robots": "follow, index", "ogTitle": "Firecrawl", "ogDescription": "将任意网站转换为适配 LLM 的数据。", "ogUrl": "https://www.firecrawl.dev/", "ogImage": "https://www.firecrawl.dev/og.png?123", "ogLocaleAlternate": [], "ogSiteName": "Firecrawl" "sourceURL": "http://google.com", "statusCode": 200 } } } ``` 对于抓取后还需要更强浏览器控制的工作流，例如已认证会话、多步骤导航或页面实时视图，我们建议使用 [交互](/zh/features/interact)，而不是继续扩展 actions 数组。

## 位置与语言

指定国家/地区和首选语言，以根据你的目标位置和语言偏好获取相关内容。

### 工作原理

当你指定位置设置时，Firecrawl 会在可用时使用合适的代理，并仿真相应的语言和时区设置。默认情况下，若未指定位置，位置将设为“US”。 ### 用法要使用位置和语言设置，请在请求体中包含 `location` 对象，并提供以下属性： * `country`：ISO 3166-1 alpha-2 国家/地区代码 (例如“US”“AU”“DE”“JP”) 。默认值为“US”。 * `languages`：按优先级排序的首选语言和区域设置数组。默认使用所设位置对应的语言。 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl( # 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流上限： # api_key="fc-YOUR-API-KEY", ) doc = firecrawl.scrape('https://example.com', formats=['markdown'], location={ 'country': 'US', 'languages': ['en'] } ) print(doc) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ // 无需 API 密钥即可开始使用 — 添加一个以获得更高的限流额度： // apiKey: "fc-YOUR-API-KEY", }); const doc = await firecrawl.scrape('https://example.com', { formats: ['markdown'], location: { country: 'US', languages: ['en'] }, }); console.log(doc.metadata); ``` ```bash cURL theme={null} # 无需 API 密钥即可开始使用 — 添加 -H "Authorization: Bearer $FIRECRAWL_API_KEY" 可获得更高限流： curl -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "formats": ["markdown"], "location": { "country": "US", "languages": ["en"] } }' ``` 有关受支持位置的更多详情，请参阅[代理文档](/zh/features/proxies)。

## 缓存与 maxAge

为加快请求速度，当有较新的副本可用时，Firecrawl 默认会直接从缓存返回结果。 * **默认新鲜度窗口**：`maxAge = 172800000` 毫秒 (2 天) 。如果缓存页面仍在该窗口内，将立即返回；否则会重新抓取页面并写入缓存。 * **性能**：在数据对时效性要求不高时，抓取速度可提升至最多 5 倍。 * **始终获取最新**：将 `maxAge` 设为 `0`。注意，这会完全绕过缓存，因此每次请求都会走完整的抓取流水线，这意味着请求完成所需时间更长，也更有可能失败。如果对每个请求的实时性不是绝对关键，建议使用非零的 `maxAge`。 * **避免存储**：如果不希望 Firecrawl 为本次请求缓存/存储结果，将 `storeInCache` 设为 `false`。 * **仅查缓存**：设置 `minAge` 可仅执行缓存查询，而不会触发新的抓取。该值以毫秒为单位，指定缓存数据必须满足的最小缓存时长。如果未找到缓存数据，将返回带有错误代码 `SCRAPE_NO_CACHED_DATA` 的 `404`。将 `minAge` 设为 `1` 可接受任意缓存数据，无论其缓存时长是多少。 * **变更跟踪**：包含 `changeTracking` 的请求会绕过缓存，因此会忽略 `maxAge`。 * **额度**：缓存结果每页仍消耗 1 个额度。缓存提升的是速度，而不是额度使用。示例 (强制获取最新内容) ： ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY') doc = firecrawl.scrape(url='https://example.com', max_age=0, formats=['markdown']) print(doc) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" }); const doc = await firecrawl.scrape('https://example.com', { maxAge: 0, formats: ['markdown'] }); console.log(doc); ``` ```bash cURL theme={null} curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Authorization: Bearer $FIRECRAWL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "maxAge": 0, "formats": ["markdown"] }' ``` 示例 (使用 10 分钟缓存窗口) ： ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY') doc = firecrawl.scrape(url='https://example.com', max_age=600000, formats=['markdown', 'html']) print(doc) ``` ```js Node theme={null} const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" }); const doc = await firecrawl.scrape('https://example.com', { maxAge: 600000, formats: ['markdown', 'html'] }); console.log(doc); ``` ```bash cURL theme={null} curl -s -X POST "https://api.firecrawl.dev/v2/scrape" \ -H "Authorization: Bearer $FIRECRAWL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "maxAge": 600000, "formats": ["markdown", "html"] }' ```

## 批量抓取多个 URL

你现在可以同时批量抓取多个 URL。它以起始 URL 和可选参数作为输入。params 参数允许你为批量抓取任务指定其他选项，例如输出 formats。

### 工作原理

它与 `/crawl` 端点的运行方式非常相似。它会提交一个批量抓取作业，并返回一个作业 ID，用于检查该批量抓取的状态。 SDK 提供两种方式：同步与异步。同步方式会直接返回批量抓取作业的结果，异步方式则会返回一个作业 ID，供您用于查询批量抓取的状态。 ### 使用方法 ```python Python theme={null} from firecrawl import Firecrawl firecrawl = Firecrawl(api_key="fc-你的 API 密钥") job = firecrawl.batch_scrape([ "https://firecrawl.dev", "https://docs.firecrawl.dev", ], formats=["markdown"], poll_interval=2, wait_timeout=120) print(job) ``` ```js Node theme={null} import { Firecrawl } from 'firecrawl'; const firecrawl = new Firecrawl({ apiKey: "fc-YOUR-API-KEY" }); const job = await firecrawl.batchScrape([ 'https://firecrawl.dev', 'https://docs.firecrawl.dev', ], { options: { formats: ['markdown'] }, pollInterval: 2, timeout: 120 }); console.log(job); ``` ```bash cURL theme={null} curl -s -X POST "https://api.firecrawl.dev/v2/batch/scrape" \ -H "Authorization: Bearer $FIRECRAWL_API_KEY" \ -H "Content-Type: application/json" \ -d '{ "urls": ["https://firecrawl.dev", "https://docs.firecrawl.dev"], "formats": ["markdown"] }' ```

### Response

如果你使用 SDK 的同步方法，将直接返回批量抓取任务的结果；否则会返回一个作业 ID，你可以用它来查询批量抓取的状态。

#### 同步执行

```json 已完成 theme={null} { "status": "completed", "total": 36, "completed": 36, "creditsUsed": 36, "expiresAt": "2024-00-00T00:00:00.000Z", "next": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789?skip=26", "data": [ { "markdown": "[Firecrawl 文档首页！[浅色标志](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...", "html": "...", "metadata": { "title": "使用 Groq Llama 3 构建“网站对话”功能 | Firecrawl", "language": "en", "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3", "description": "了解如何使用 Firecrawl、Groq Llama 3 和 LangChain 构建“与你的网站对话”的机器人。", "ogLocaleAlternate": [], "statusCode": 200 } }, ... ] } ```

#### 异步

你可以使用作业 ID 调用 `/batch/scrape/{id}` 端点来查看批量抓取的状态。该端点应在作业仍在运行期间或刚完成后使用，**因为批量抓取作业会在 24 小时后过期**。 ```json theme={null} { "success": true, "id": "123-456-789", "url": "https://api.firecrawl.dev/v2/batch/scrape/123-456-789" } ```

## 增强模式

对于复杂网站，Firecrawl 提供了增强模式，在提高成功率的同时保护隐私。了解更多关于[增强模式](/zh/features/enhanced-mode)的信息。

## 零数据保留 (ZDR)

Firecrawl 支持零数据保留 (ZDR) ，适用于有严格数据处理要求的团队。启用后，Firecrawl 不会在请求生命周期结束后保留任何页面内容或提取的数据。要启用 ZDR，请在请求中设置 `zeroDataRetention: true`： ```bash cURL theme={null} curl -X POST https://api.firecrawl.dev/v2/scrape \ -H "Content-Type: application/json" \ -H "Authorization: Bearer fc-YOUR_API_KEY" \ -d '{ "url": "https://example.com", "formats": ["markdown"], "zeroDataRetention": true }' ``` ZDR 仅适用于 Enterprise 计划，且需要为你的团队启用。访问 [firecrawl.dev/enterprise](https://www.firecrawl.dev/enterprise) 以开始使用。 ZDR 会在基础抓取成本之上，**为每个页面额外增加 1 个 credit**。 ZDR 模式下不支持截图。由于截图需要上传到持久化存储，因此不符合 ZDR 保证。同时包含 `zeroDataRetention: true` 和 `screenshot` 格式的请求将返回错误。 > 你是需要 Firecrawl API 密钥的 AI 代理吗？请参阅 [firecrawl.dev/agent-onboarding/SKILL.md](https://www.firecrawl.dev/agent-onboarding/SKILL.md) 获取自动化引导说明。