Extract website data using LLMs
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code.
Note: this example is using v0 version of the Firecrawl API. You can install the 0.0.20 version for the Python SDK or the 0.0.36 for the Node SDK.
Setup
Install our python dependencies, including groq and firecrawl-py.
Getting your Groq and Firecrawl API Keys
To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from here and your Firecrawl API key from here.
Load website with Firecrawl
To be able to get all the data from a website page and make sure it is in the cleanest format, we will use Firecrawl. It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy.
Here is how we will scrape a website url using Firecrawl. We will also set a pageOptions
for only extracting the main content (onlyMainContent: True
) of the website page - excluding the navs, footers, etc.
Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
Extraction and Generation
Now that we have the website data, let’s use Groq to pull out the information we need. We’ll use Groq Llama 3 model in JSON mode and pick out certain fields from the page content.
We are using LLama 3 8b model for this example. Feel free to use bigger models for improved results.
And Voila!
You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website.
If you have any questions or need help, feel free to reach out to us at Firecrawl.