Skip to main content

Introducing /parse

The /parse endpoint converts local or non-public documents into clean, LLM-ready data. Upload file bytes via multipart/form-data and get back Markdown, JSON, HTML, links, images, or a summary — with reading order and tables preserved.
  • Turn PDF, DOCX, XLSX, HTML, and more into Markdown or structured JSON
  • Up to 5x faster parsing via a Rust-based engine
  • Files up to 50 MB per request
  • Zero Data Retention support

When to use /parse

Use /parse when the source document is a local file or not publicly accessible by URL. If you have a public URL that points to a document, prefer /scrape — it auto-detects the file type from the extension or content type and parses it the same way.
SourceEndpoint
Public URL to a document (e.g. https://example.com/report.pdf)POST /scrape
Local file or non-public bytes (PDF, DOCX, XLSX, HTML, …)POST /parse

Parsing

/parse endpoint

Used to upload a file and receive parsed content. The request is multipart/form-data with a required file part and an optional options JSON part. Supported extensions: .html, .htm, .pdf, .docx, .doc, .odt, .rtf, .xlsx, .xls.

Usage

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.parse("./report.pdf")

print(doc.markdown)

Response

SDKs return the document object directly. cURL returns the JSON payload.
{
  "success": true,
  "data": {
    "markdown": "# Annual Report\n\n...",
    "metadata": {
      "title": "Annual Report",
      "numPages": 42,
      "sourceFile": "report.pdf"
    }
  }
}

Options

/parse accepts a subset of scrape options under the options field. Common settings:
  • formats: Array of output formats. Defaults to ["markdown"]. Supported: markdown, html, rawHtml, links, images, summary, and json (with a schema or prompt).
  • onlyMainContent: Only return the main content of the document. Defaults to true.
  • includeTags / excludeTags: Tag-level inclusion or exclusion (HTML inputs).
  • timeout: Request timeout in milliseconds. Defaults to 30000, max 300000.
  • parsers: File-parser controls. For PDFs, set { "type": "pdf", "mode": "fast" | "auto" | "ocr", "maxPages": <int> }.
/parse does not support browser-only options like actions, waitFor, location, mobile, or change tracking.

PDF parser modes

cURL
curl -X POST https://api.firecrawl.dev/v2/parse \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@./scan.pdf' \
  -F 'options={"parsers":[{"type":"pdf","mode":"ocr","maxPages":50}]};type=application/json'
  • fast: text-only extraction, fastest path.
  • auto (default): text-first with OCR fallback for image-only pages.
  • ocr: OCR every page — use for scanned documents.

Structured JSON output

Pass a JSON schema or prompt to extract structured data directly from the document:
cURL
curl -X POST https://api.firecrawl.dev/v2/parse \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@./invoice.pdf' \
  -F 'options={"formats":[{"type":"json","schema":{"type":"object","properties":{"total":{"type":"number"},"vendor":{"type":"string"}}}}]};type=application/json'

Considerations

  • Maximum file size is 50 MB per request.
  • Parsing very large or scanned PDFs in ocr mode may take longer — increase timeout or use maxPages to bound the work.
  • For batches of files, call /parse per file in parallel; there is no batch upload variant.
Are you an AI agent that needs a Firecrawl API key? See firecrawl.dev/agent-onboarding/SKILL.md for automated onboarding instructions.