Advanced Scraping Guide

This guide walks through each Firecrawl endpoint and shows how to use all of its available parameters.

Basic scraping with Firecrawl

To scrape a single page and get clean Markdown content, use the /scrape endpoint.

# pip install firecrawl-py

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape("https://firecrawl.dev")

print(doc.markdown)

Scraping PDFs

Firecrawl supports PDFs. To make sure a PDF is parsed, use the parsers option (for example, parsers: ["pdf"]).

Scrape options

When using the /scrape endpoint, you can customize scraping with the following options.

Formats (`formats`)

Type: array
Strings: ["markdown", "links", "html", "rawHtml", "summary", "images"]
Object formats:
- JSON: { type: "json", prompt, schema }
- Screenshot: { type: "screenshot", fullPage?, quality?, viewport? }
- Change tracking: { type: "changeTracking", modes?, prompt?, schema?, tag? } (requires markdown)
Default: ["markdown"]
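
String and object entries can be mixed in a single formats array. The sketch below uses plain Python with no SDK involved: it collects entries and rejects anything that is not one of the formats listed above. The helper name build_formats and its validation rules are illustrative assumptions, not part of Firecrawl.

```python
from typing import Any, Dict, List, Union

# String formats listed in this guide.
STRING_FORMATS = {"markdown", "links", "html", "rawHtml", "summary", "images"}
# Object formats listed in this guide, keyed by their "type" field.
OBJECT_FORMATS = {"json", "screenshot", "changeTracking"}

Format = Union[str, Dict[str, Any]]


def build_formats(*entries: Format) -> List[Format]:
    """Hypothetical helper: validate and collect entries for a `formats` array."""
    formats: List[Format] = []
    for entry in entries:
        if isinstance(entry, str):
            if entry not in STRING_FORMATS:
                raise ValueError(f"unknown string format: {entry!r}")
        elif isinstance(entry, dict):
            if entry.get("type") not in OBJECT_FORMATS:
                raise ValueError(f"unknown object format: {entry!r}")
        else:
            raise TypeError(f"format must be str or dict, got {type(entry).__name__}")
        formats.append(entry)

    # The guide notes that changeTracking requires markdown alongside it.
    uses_change_tracking = any(
        isinstance(f, dict) and f["type"] == "changeTracking" for f in formats
    )
    if uses_change_tracking and "markdown" not in formats:
        raise ValueError("changeTracking requires the markdown format")
    return formats


formats = build_formats(
    "markdown",
    {"type": "screenshot", "fullPage": True, "quality": 80},
)
print(formats)
```

Validating locally like this catches a misspelled format name before a request is sent.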

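Full-page content vs. main content (`onlyMainContent`)

Type: boolean
Description: By default, the scraper returns only the main content. Set this to false to return the full page content.
Default: true

Include tags (`includeTags`)

Type: array
Description: HTML tags, classes, and IDs to include in the scrape.

Exclude tags (`excludeTags`)

Type: array
Description: HTML tags, classes, and IDs to exclude from the scrape.

Wait for page readiness (`waitFor`)

Type: integer
Description: Time to wait before scraping starts, in milliseconds. Use it sparingly and only when needed.
Default: 0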

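Freshness and cache (`maxAge`)

Type: integer (milliseconds)
Description: If the cached copy of the page was updated within maxAge, Firecrawl returns it immediately; otherwise it scrapes the page again and updates the cache. Set this to 0 to always fetch the latest version.
Default: 172800000 (2 days)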
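
The freshness rule can be pictured as a simple age comparison. The following is a minimal sketch of that decision, not Firecrawl's actual implementation: the function name, the timestamp arguments, and the inclusive boundary are all assumptions made for illustration.

```python
from typing import Optional

# Default from this guide: 2 days expressed in milliseconds.
DEFAULT_MAX_AGE_MS = 2 * 24 * 60 * 60 * 1000  # 172800000


def should_use_cache(
    cached_at_ms: Optional[int],
    now_ms: int,
    max_age_ms: int = DEFAULT_MAX_AGE_MS,
) -> bool:
    """Return True when a cached copy is recent enough to serve.

    Assumption: an entry exactly max_age_ms old still counts as fresh.
    """
    if cached_at_ms is None or max_age_ms <= 0:
        # No cached copy, or the caller asked for fresh content (maxAge = 0).
        return False
    return (now_ms - cached_at_ms) <= max_age_ms


one_hour_ms = 60 * 60 * 1000
print(should_use_cache(cached_at_ms=0, now_ms=one_hour_ms))                # cached an hour ago
print(should_use_cache(cached_at_ms=0, now_ms=one_hour_ms, max_age_ms=0))  # always fetch fresh
```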

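Request timeout (`timeout`)

Type: integer
Description: Maximum time before the request is aborted, in milliseconds.
Default: 30000 (30 seconds)

PDF parsing (`parsers`)

Type: array
Description: Controls parsing behavior. To parse PDFs, set parsers: ["pdf"].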

Actions (`actions`)

When using the /scrape endpoint, Firecrawl can perform various actions on a web page before scraping it. This is especially useful for interacting with dynamic content, navigating between pages, or reaching content that requires user interaction.

Type: array
Description: A sequence of browser actions to run before scraping.
Supported actions:
- wait { milliseconds }
- click { selector }
- write { selector, text }
- press { key }
- scroll { direction: "up" | "down" }
- scrape { selector } (scrapes a sub-element)
- executeJavascript { script }
- pdf (triggers PDF rendering in some flows)

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR-API-KEY')

# Actions run in order before the page content is captured.
doc = firecrawl.scrape(
    'https://example.com',
    actions=[
        {'type': 'wait', 'milliseconds': 1000},
        {'type': 'click', 'selector': '#accept'},
        {'type': 'scroll', 'direction': 'down'},
        {'type': 'write', 'selector': '#q', 'text': 'firecrawl'},
        {'type': 'press', 'key': 'Enter'},
    ],
    formats=['markdown'],
)

print(doc.markdown)

Usage example

cURL

curl -X POST https://api.firecrawl.dev/v2/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "formats": [
        "markdown",
        "links",
        "html",
        "rawHtml",
        { "type": "screenshot", "fullPage": true, "quality": 80 }
      ],
      "includeTags": ["h1", "p", "a", ".main-content"],
      "excludeTags": ["#ad", "#footer"],
      "onlyMainContent": false,
      "waitFor": 1000,
      "timeout": 15000,
      "parsers": ["pdf"]
    }'

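In this example, the scraper will:

- Return the full page content as Markdown.
- Include Markdown, raw HTML, HTML, links, and a screenshot in the response.
- Include only the HTML tags <h1>, <p>, and <a> plus elements with the class .main-content, and exclude elements with the IDs #ad and #footer.
- Wait 1000 milliseconds (1 second) before scraping so the page can load.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
- Explicitly parse PDFs via parsers: ["pdf"].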
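
For readers who prefer Python to cURL, the same request can be assembled with the standard library alone. This is a sketch that only builds the request object and does not send it; the helper name build_scrape_request is an assumption for illustration, while the URL, headers, and body mirror the cURL example above.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a POST request to /v2/scrape."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


payload = {
    "url": "https://docs.firecrawl.dev",
    "formats": [
        "markdown",
        "links",
        "html",
        "rawHtml",
        {"type": "screenshot", "fullPage": True, "quality": 80},
    ],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
    "onlyMainContent": False,
    "waitFor": 1000,
    "timeout": 15000,
    "parsers": ["pdf"],
}

request = build_scrape_request("fc-YOUR-API-KEY", payload)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen(request).
```

Keeping the payload as a plain dictionary makes it easy to log or unit-test the exact body before any network call is made.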

See the API reference: Scrape Endpoint Documentation

JSON extraction via formats

To extract structured data in a single pass, use the JSON format object in formats.

curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY' \
  -d '{
    "url": "https://firecrawl.dev",
    "formats": [{
      "type": "json",
      "prompt": "Extract the features of the product",
      "schema": {"type": "object", "properties": {"features": {"type": "object"}}, "required": ["features"]}
    }]
  }'
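
Once a JSON-format scrape returns, it can be worth checking the extracted object against the schema's required list before relying on it. The sketch below does only that top-level check in plain Python. The helper name and the sample results are made up for illustration, and it is not a full JSON Schema validator.

```python
from typing import Any, Dict, List


def missing_required_keys(schema: Dict[str, Any], data: Dict[str, Any]) -> List[str]:
    """Return the schema's required top-level keys that are absent from data."""
    return [key for key in schema.get("required", []) if key not in data]


# Same schema as in the cURL example above.
schema = {
    "type": "object",
    "properties": {"features": {"type": "object"}},
    "required": ["features"],
}

# Hypothetical extraction results, invented for this example.
good_result = {"features": {"scrape": "single-page scraping"}}
bad_result = {"summary": "no features key here"}

print(missing_required_keys(schema, good_result))  # []
print(missing_required_keys(schema, bad_result))   # ['features']
```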

Extract endpoint

When you need asynchronous extraction with status polling, use the dedicated extract job API.

import Firecrawl from '@mendable/firecrawl-js';

const firecrawl = new Firecrawl({ apiKey: 'fc-YOUR-API-KEY' });

// Start an extract job
const started = await firecrawl.startExtract({
  urls: ['https://docs.firecrawl.dev'],
  prompt: 'Extract title',
  schema: { type: 'object', properties: { title: { type: 'string' } }, required: ['title'] }
});

// Check the status (call this repeatedly to poll until the job finishes)
const status = await firecrawl.getExtractStatus(started.id);
console.log(status.status, status.data);
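
The snippet above checks the status once; a real client would repeat that check until the job finishes. Below is a minimal polling sketch in Python that takes the status lookup as a function argument, so it runs here against a fake in-memory source. The terminal status names ("completed", "failed", "cancelled"), the helper name, and the interval values are assumptions; confirm the actual status values in the API reference.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; verify against the API reference.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status(job_id) until a terminal status appears or attempts run out."""
    for attempt in range(max_attempts):
        status = get_status(job_id)
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait before the next check
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} checks")


# Demo with a fake status source so the sketch runs without network access.
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "data": {"title": "Example"}},
])
result = poll_until_done(lambda job_id: next(_responses), "job-123", sleep=lambda s: None)
print(result["status"], result["data"])
```

Passing sleep in as a parameter keeps the loop testable without real delays.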

Crawling multiple pages

To crawl multiple pages, use the /v2/crawl endpoint.

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

The response contains the job ID:

{ "id": "1234-5678-9101" }

Checking a crawl job

Check the status of a crawl job and retrieve its results.

cURL

curl -X GET https://api.firecrawl.dev/v2/crawl/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY'

Pagination / next URL

If the content exceeds 10MB, or if the crawl job is still running, the response may include a next parameter: a URL pointing to the next page of results.
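
Following next amounts to a loop that keeps requesting pages until the field is absent. Here is a minimal sketch with the HTTP call injected as a function, so it can run against fake pages. The data field name, the helper name, and the example URLs are assumptions for illustration; only next is described in this guide. It does not wait for a still-running job to produce more results.

```python
from typing import Any, Callable, Dict, List, Optional


def collect_documents(
    first_url: str,
    fetch_json: Callable[[str], Dict[str, Any]],
    max_pages: int = 100,
) -> List[Any]:
    """Follow `next` URLs and gather every page's documents into one list."""
    documents: List[Any] = []
    url: Optional[str] = first_url
    pages_fetched = 0
    while url is not None and pages_fetched < max_pages:
        page = fetch_json(url)
        documents.extend(page.get("data", []))  # "data" is an assumed field name
        url = page.get("next")  # absent on the last page
        pages_fetched += 1
    return documents


# Fake two-page response, invented for this example.
FAKE_PAGES = {
    "https://api.example.test/crawl/1": {
        "data": [{"id": 1}, {"id": 2}],
        "next": "https://api.example.test/crawl/1?skip=2",
    },
    "https://api.example.test/crawl/1?skip=2": {"data": [{"id": 3}]},
}

docs = collect_documents("https://api.example.test/crawl/1", FAKE_PAGES.__getitem__)
print(docs)
```

The max_pages cap is a safety guard against an endless chain of next links.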

Crawl prompt and params preview

If you provide a natural-language prompt, Firecrawl infers the crawl settings automatically. Preview them first:

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl/params-preview \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-YOUR-API-KEY' \
  -d '{
    "url": "https://docs.firecrawl.dev",
    "prompt": "Extract docs and blog"
  }'

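Crawler options

When using the /v2/crawl endpoint, you can customize crawl behavior with the following options:

includePaths

Type: array
Description: Regex patterns for paths to include.
Example: ["^/blog/.*$", "^/docs/.*$"]

excludePaths

Type: array
Description: Regex patterns for paths to exclude.
Example: ["^/admin/.*$", "^/private/.*$"]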
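
To get a feel for how such patterns select URLs, the sketch below applies them locally with Python's re module. It assumes the patterns are matched against the URL path only and that an exclude match takes priority over an include match. Both are illustrative assumptions, so the crawler's actual matching rules may differ.

```python
import re
from typing import Sequence
from urllib.parse import urlparse


def path_allowed(
    url: str,
    include_paths: Sequence[str] = (),
    exclude_paths: Sequence[str] = (),
) -> bool:
    """Return True if the URL's path passes the include/exclude patterns."""
    path = urlparse(url).path
    if any(re.search(pattern, path) for pattern in exclude_paths):
        return False  # assumed: excludes win over includes
    if include_paths:
        return any(re.search(pattern, path) for pattern in include_paths)
    return True  # no include list means everything not excluded is allowed


INCLUDE = ["^/blog/.*$", "^/docs/.*$"]
EXCLUDE = ["^/admin/.*$", "^/private/.*$"]

for candidate in [
    "https://docs.firecrawl.dev/blog/launch",
    "https://docs.firecrawl.dev/admin/users",
    "https://docs.firecrawl.dev/pricing",
]:
    print(candidate, path_allowed(candidate, INCLUDE, EXCLUDE))
```

Note that "^/blog/.*$" requires the trailing slash, so a bare /blog path would not match under this sketch.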

maxDiscoveryDepth

Type: integer
Description: Maximum depth to explore when discovering new URLs.

limit

Type: integer
Description: Maximum number of pages to crawl.
Default: 10000

crawlEntireDomain

Type: boolean
Description: Expands exploration to sibling and parent pages to cover the entire domain.
Default: false

allowExternalLinks

Type: boolean
Description: Follows links to external domains.
Default: false

allowSubdomains

Type: boolean
Description: Also crawls subdomains of the main domain.
Default: false

delay

Type: number
Description: Delay between scrapes, in seconds.
Default: undefined

scrapeOptions

Type: object
Description: Options for the scraper (see the scrape options above).
Example: { "formats": ["markdown", "links", {"type": "screenshot", "fullPage": true}], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000}
Defaults: formats: ["markdown"]; caching is enabled by default (maxAge of about 2 days)

Usage example

cURL

curl -X POST https://api.firecrawl.dev/v2/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "includePaths": ["^/blog/.*$", "^/docs/.*$"],
      "excludePaths": ["^/admin/.*$", "^/private/.*$"],
      "maxDiscoveryDepth": 2,
      "limit": 1000
    }'

Mapping website links

The /v2/map endpoint identifies the URLs related to a given website.

Usage

cURL

curl -X POST https://api.firecrawl.dev/v2/map \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR-API-KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev"
    }'

Map options

search

Type: string
Description: Filters for links that contain the given text.

limit

Type: integer
Description: Maximum number of links to return.
Default: 100

sitemap

Type: "only" | "include" | "skip"
Description: Controls how the sitemap is used when mapping.
Default: "include"

includeSubdomains

Type: boolean
Description: Whether to include subdomains of the website.
Default: true
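
The map options combine into a single JSON body. Below is a small sketch that assembles one with the defaults listed above and rejects an invalid sitemap mode before any request is made. The helper name and the validation rules are assumptions for illustration, not part of Firecrawl.

```python
from typing import Any, Dict, Optional

SITEMAP_MODES = {"only", "include", "skip"}


def build_map_payload(
    url: str,
    search: Optional[str] = None,
    limit: int = 100,
    sitemap: str = "include",
    include_subdomains: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a request body for /v2/map."""
    if sitemap not in SITEMAP_MODES:
        raise ValueError(f"sitemap must be one of {sorted(SITEMAP_MODES)}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    payload: Dict[str, Any] = {
        "url": url,
        "limit": limit,
        "sitemap": sitemap,
        "includeSubdomains": include_subdomains,
    }
    if search is not None:
        payload["search"] = search  # only sent when a filter is requested
    return payload


print(build_map_payload("https://docs.firecrawl.dev", search="crawl", limit=50))
```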

See the corresponding API reference: Map Endpoint Documentation. Thanks for reading!
