SourceSync.ai is a Retrieval-Augmented Generation (RAG) as a Service platform that helps you build AI applications with your own data. This guide explains how to configure Firecrawl as the web scraping provider for SourceSync.ai and how to ingest web content through it.

Setup

  1. Obtain your Firecrawl API key from your Firecrawl dashboard.

  2. Configure your SourceSync.ai namespace to use Firecrawl as the web scraping provider:

curl -X PATCH https://api.sourcesync.ai/v1/namespaces/YOUR_NAMESPACE_ID \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "webScraperConfig": {
      "provider": "FIRECRAWL",
      "apiKey": "YOUR_FIRECRAWL_API_KEY"
    }
  }'
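
To keep the Firecrawl key out of shell history and scripts, the payload from step 2 can be built from an environment variable. The following is a minimal sketch, not an official helper: the function name build_scraper_config and the variable name FIRECRAWL_API_KEY are assumptions made for this example, and it assumes the key contains no characters that need JSON escaping.

```shell
# Hypothetical helper: prints the webScraperConfig payload from step 2,
# reading the Firecrawl key from the FIRECRAWL_API_KEY environment variable
# (a name assumed for this example).
build_scraper_config() {
  if [ -z "${FIRECRAWL_API_KEY:-}" ]; then
    echo "FIRECRAWL_API_KEY is not set" >&2
    return 1
  fi
  # Assumes the key has no quotes or backslashes that would need JSON escaping.
  printf '{"webScraperConfig":{"provider":"FIRECRAWL","apiKey":"%s"}}\n' "$FIRECRAWL_API_KEY"
}
```

The output can then replace the inline JSON in the PATCH request, for example with -d "$(build_scraper_config)".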

Usage

Once the namespace is configured, SourceSync.ai’s web scraping endpoints use Firecrawl to fetch pages. The main ingestion methods are:

URL List Ingestion

Scrape specific URLs:

curl -X POST https://api.sourcesync.ai/v1/ingest/urls \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "URLS_LIST",
      "config": {
        "urls": [
          "https://example.com/page1",
          "https://example.com/page2"
        ],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
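
When the URLs live in a text file rather than being typed inline, the "urls" array can be generated instead of written by hand. This is a rough sketch under stated assumptions: urls_to_json_array is a hypothetical helper, input is one URL per line, and no JSON escaping is performed, so URLs containing double quotes or backslashes are not handled.

```shell
# Hypothetical helper: reads URLs from stdin (one per line) and prints a JSON
# array suitable for the "urls" field. Blank lines are skipped.
# Assumes URLs contain no double quotes or backslashes (no JSON escaping is done).
urls_to_json_array() {
  sep=""
  printf '['
  # The "|| [ -n "$url" ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue
    printf '%s"%s"' "$sep" "$url"
    sep=","
  done
  printf ']\n'
}
```

For example, urls_to_json_array < urls.txt (where urls.txt is a file you supply) prints an array that can be substituted for the "urls" value in the request above.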

Website Crawling

Crawl an entire website with custom rules:

curl -X POST https://api.sourcesync.ai/v1/ingest/website \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "WEBSITE",
      "config": {
        "url": "https://example.com",
        "maxDepth": 3,
        "maxLinks": 100,
        "includePaths": ["/docs", "/blog"],
        "excludePaths": ["/admin"],
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
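
To reason about which pages a crawl with these rules is likely to reach, it can help to try the path rules locally first. The sketch below hard-codes the includePaths and excludePaths values from the request above and assumes they are matched as path prefixes, with exclusions checked first. That matching rule is an assumption made for illustration; this guide does not specify how the service evaluates the paths, and path_would_be_crawled is a hypothetical name.

```shell
# Illustration only: mimics the includePaths/excludePaths values from the
# request above under an ASSUMED prefix-matching rule with excludes checked
# first. The service's real matching behavior may differ.
path_would_be_crawled() {
  case "$1" in
    /admin*) return 1 ;;         # matches excludePaths
  esac
  case "$1" in
    /docs*|/blog*) return 0 ;;   # matches includePaths
  esac
  return 1                       # not listed in includePaths
}
```

Under these assumptions, path_would_be_crawled /docs/intro succeeds, while /admin/users and /pricing are rejected.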

Sitemap Processing

Process all URLs from a sitemap:

curl -X POST https://api.sourcesync.ai/v1/ingest/sitemap \
  -H "Authorization: Bearer YOUR_SOURCE_SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespaceId": "YOUR_NAMESPACE_ID",
    "ingestConfig": {
      "source": "SITEMAP",
      "config": {
        "url": "https://example.com/sitemap.xml",
        "scrapeOptions": {
          "includeSelectors": ["article", "main"],
          "excludeSelectors": [".navigation", ".footer"]
        }
      }
    }
  }'
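
Before submitting a sitemap, it can be useful to see which URLs it lists. The sketch below pulls the <loc> entries defined by the sitemap protocol out of XML read from stdin. sitemap_locs is a hypothetical helper, and this is plain-text matching for quick inspection rather than real XML parsing: it does not decode XML entities, trim whitespace inside <loc>, or follow sitemap index files.

```shell
# Hypothetical helper: prints each <loc> URL found in sitemap XML on stdin.
# A plain-text approach for quick inspection, not a full XML parser.
sitemap_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}
```

Piping a downloaded sitemap through sitemap_locs and then wc -l gives a rough count of the pages the ingestion request would cover.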

Features

When using Firecrawl with SourceSync.ai, you get access to:

  • JavaScript rendering support
  • Automatic rate limiting
  • CSS selector-based content extraction
  • Recursive crawling with depth control
  • Sitemap processing

Resources

For additional support: