Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID you can use to check the status of the crawl.
By default, Crawl will ignore sublinks of a page if they aren't children of the URL you provide. So website.com/other-parent/blog-1 wouldn't be returned if you crawled website.com/blogs/. If you do want website.com/other-parent/blog-1, use the allowBackwardLinks parameter.
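If you hit the API directly, submitting the crawl is a single POST request and the response contains the job ID. Below is a minimal sketch, assuming the v1 /crawl endpoint and an allowBackwardLinks flag matching the parameter described above; treat the exact field names as assumptions and check the API reference.

```python
import requests

# Hedged sketch: start a crawl job against the v1 API directly.
# The allowBackwardLinks field and the response shape are assumptions based
# on the description above - consult the API reference for specifics.
resp = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://website.com/blogs/",
        "limit": 100,
        "allowBackwardLinks": True,  # also return pages outside /blogs/
    },
)
job = resp.json()
print(job["id"])  # job ID used to poll the crawl status
```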
If the crawl is not yet completed, or the response exceeds 10MB, a next URL parameter is provided. You must request this URL to retrieve the next 10MB of data. If the next parameter is absent, you have reached the end of the crawl data.
The skip parameter sets the maximum number of results returned in each chunk.
The skip and next parameters are only relevant when hitting the API directly. If you're using the SDK, we handle this for you and return all the results at once.
{ "status": "scraping", "total": 36, "completed": 10, "creditsUsed": 10, "expiresAt": "2024-00-00T00:00:00.000Z", "next": "https://api.firecrawl.dev/v1/crawl/123-456-789?skip=10", "data": [ { "markdown": "[Firecrawl Docs home page!...", "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...", "metadata": { "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl", "language": "en", "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3", "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.", "ogLocaleAlternate": [], "statusCode": 200 } }, ... ]}
Speed up your crawls by 500% when you don’t need the freshest data. Add maxAge to your scrapeOptions to use cached page data when available.
```python
from firecrawl import FirecrawlApp, ScrapeOptions

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Crawl with cached scraping - 500% faster for pages we've seen recently
crawl_result = app.crawl_url(
    'https://firecrawl.dev',
    limit=100,
    scrape_options=ScrapeOptions(
        formats=['markdown'],
        maxAge=3600000  # Use cached data if less than 1 hour old
    )
)

for page in crawl_result['data']:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Content: {page['markdown'][:200]}...")
```
How it works:
- Each page in your crawl checks if we have cached data newer than maxAge
- If yes, returns instantly from cache (500% faster)
- If no, scrapes the page fresh and caches the result
- Perfect for crawling documentation sites, product catalogs, or other relatively static content
For more details on maxAge usage, see the Faster Scraping documentation.
Firecrawl's WebSocket-based method, Crawl URL and Watch, enables real-time data extraction and monitoring. Start a crawl with a URL and customize it with options like page limits, allowed domains, and output formats, making it ideal for immediate data processing needs.
```python
import nest_asyncio
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# inside an async function...
nest_asyncio.apply()

# Define event handlers
def on_document(detail):
    print("DOC", detail)

def on_error(detail):
    print("ERR", detail['error'])

def on_done(detail):
    print("DONE", detail['status'])

# Function to start the crawl and watch process
async def start_crawl_and_watch():
    # Initiate the crawl job and get the watcher
    watcher = app.crawl_url_and_watch('firecrawl.dev', limit=5)

    # Add event listeners
    watcher.add_event_listener("document", on_document)
    watcher.add_event_listener("error", on_error)
    watcher.add_event_listener("done", on_done)

    # Start the watcher
    await watcher.connect()

# Run the event loop
await start_crawl_and_watch()
```
You can configure webhooks to receive real-time notifications as your crawl progresses. This allows you to process pages as they’re scraped instead of waiting for the entire crawl to complete.
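Below is a minimal sketch of attaching a webhook when the crawl is submitted over the API; the shape of the webhook field and the event names used here are assumptions, so check the webhooks documentation for the exact schema.

```python
import requests

# Hedged sketch: register a webhook so your endpoint is notified as pages
# are scraped. The `webhook` object and `events` values are illustrative
# assumptions - see the webhooks documentation for the supported options.
requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
    json={
        "url": "https://firecrawl.dev",
        "limit": 100,
        "webhook": {
            "url": "https://your-server.com/firecrawl-webhook",  # your endpoint
            "events": ["page", "completed"],
        },
    },
)
```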