/scrape endpoint

The /scrape endpoint has been redesigned in V1 for better reliability and ease of use. The new request body is structured as follows:

{
  "url": "<string>",
  "formats": ["markdown", "html", "rawHtml", "links", "screenshot"],
  "includeTags": ["<string>"],
  "excludeTags": ["<string>"],
  "headers": { "<key>": "<value>" },
  "waitFor": 123,
  "timeout": 123
}
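
For reference, a call to the new endpoint could look like the TypeScript sketch below. The base URL (https://api.firecrawl.dev/v1) and the Bearer-token Authorization header are assumptions; adjust them to your own setup and API key.

const scrapeResponse = await fetch("https://api.firecrawl.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    // Assumed auth scheme; substitute your own API key handling.
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
  },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown", "links"],
    waitFor: 1000,   // assumed to be milliseconds
    timeout: 30000,  // assumed to be milliseconds
  }),
});
const scrapeResult = await scrapeResponse.json();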

Formats

You can now choose which formats you want your output in, and you can specify several at once. Supported formats are:

  • Markdown (markdown)
  • HTML (html)
  • Raw HTML (rawHtml), with no modifications
  • Screenshot (screenshot or screenshot@fullPage)
  • Links (links)

By default, the output will include only the markdown format.
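
As a small sketch of the formats option, the body below requests markdown plus a full-page screenshot (the URL is illustrative):

// Request body selecting two output formats, including a full-page screenshot.
const body = {
  url: "https://example.com",
  formats: ["markdown", "screenshot@fullPage"],
};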

Details on the new request body

The table below outlines the changes to the request body parameters for the /scrape endpoint in V1.

Parameter | Change | Description
onlyIncludeTags | Moved and Renamed | Moved to root level. Renamed to includeTags.
removeTags | Moved and Renamed | Moved to root level. Renamed to excludeTags.
onlyMainContent | Moved | Moved to root level. true by default.
waitFor | Moved | Moved to root level.
headers | Moved | Moved to root level.
parsePDF | Moved | Moved to root level.
timeout | No Change |
pageOptions | Removed | No need for the pageOptions parameter. The scrape options were moved to root level.
replaceAllPathsWithAbsolutePaths | Removed | No longer needed. Every path now defaults to an absolute path.
includeHtml | Removed | Add "html" to formats instead.
includeRawHtml | Removed | Add "rawHtml" to formats instead.
screenshot | Removed | Add "screenshot" to formats instead.
fullPageScreenshot | Removed | Add "screenshot@fullPage" to formats instead.
extractorOptions | Removed | Use the "extract" format with an extract object instead.

The new extract format is described in the llm-extract section.
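
To make the mapping concrete, here is a before/after sketch. The V0 body is reconstructed from the table above purely for illustration (your existing V0 requests may use a subset of these fields); the V1 body expresses the same intent with root-level options and formats.

// V0-style /scrape body (reconstructed for illustration):
const v0ScrapeBody = {
  url: "https://example.com",
  pageOptions: {
    onlyIncludeTags: ["article"],
    removeTags: ["nav"],
    onlyMainContent: true,
    waitFor: 1000,
    includeHtml: true,
    screenshot: true,
  },
};

// Equivalent V1 body: pageOptions is gone, options live at the root level,
// and the html/screenshot flags become entries in formats.
const v1ScrapeBody = {
  url: "https://example.com",
  formats: ["markdown", "html", "screenshot"],
  includeTags: ["article"],
  excludeTags: ["nav"],
  onlyMainContent: true,
  waitFor: 1000,
};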

/crawl endpoint

We’ve also updated the /crawl endpoint in V1. Check out the improved request body below:

{
  "url": "<string>",
  "excludePaths": ["<string>"],
  "includePaths": ["<string>"],
  "maxDepth": 2,
  "ignoreSitemap": true,
  "limit": 10,
  "allowBackwardLinks": true,
  "allowExternalLinks": true,
  "scrapeOptions": {
    // same options as in /scrape
    "formats": ["markdown", "html", "rawHtml", "screenshot", "links"],
    "headers": { "<key>": "<value>" },
    "includeTags": ["<string>"],
    "excludeTags": ["<string>"],
    "onlyMainContent": true,
    "waitFor": 123
  }
}
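
A /crawl request in TypeScript might then look like the sketch below; as before, the base URL, auth header, and path patterns are assumptions, not part of the spec above.

const crawlResponse = await fetch("https://api.firecrawl.dev/v1/crawl", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`, // assumed auth scheme
  },
  body: JSON.stringify({
    url: "https://example.com",
    includePaths: ["blog/.*"],   // illustrative path pattern
    excludePaths: ["admin/.*"],  // illustrative path pattern
    maxDepth: 2,
    limit: 10,
    scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
  }),
});
const crawlJob = await crawlResponse.json();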

Details on the new request body

The table below outlines the changes to the request body parameters for the /crawl endpoint in V1.

Parameter | Change | Description
pageOptions | Renamed | Renamed to scrapeOptions.
includes | Moved and Renamed | Moved to root level. Renamed to includePaths.
excludes | Moved and Renamed | Moved to root level. Renamed to excludePaths.
allowBackwardCrawling | Moved and Renamed | Moved to root level. Renamed to allowBackwardLinks.
allowExternalLinks | Moved | Moved to root level.
maxDepth | Moved | Moved to root level.
ignoreSitemap | Moved | Moved to root level.
limit | Moved | Moved to root level.
crawlerOptions | Removed | No need for the crawlerOptions parameter. The crawl options were moved to root level.
timeout | Removed | Use timeout in scrapeOptions instead.
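
As with /scrape, a before/after sketch may help. The V0 body is reconstructed from the table above for illustration only, and the field values are placeholders.

// V0-style /crawl body (reconstructed for illustration):
const v0CrawlBody = {
  url: "https://example.com",
  crawlerOptions: {
    includes: ["blog/.*"],
    excludes: ["admin/.*"],
    allowBackwardCrawling: true,
    maxDepth: 2,
    limit: 10,
  },
  pageOptions: {
    onlyMainContent: true,
  },
};

// Equivalent V1 body: crawlerOptions folds into the root level and
// pageOptions becomes scrapeOptions.
const v1CrawlBody = {
  url: "https://example.com",
  includePaths: ["blog/.*"],
  excludePaths: ["admin/.*"],
  allowBackwardLinks: true,
  maxDepth: 2,
  limit: 10,
  scrapeOptions: {
    formats: ["markdown"],
    onlyMainContent: true,
  },
};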