| author | Adam Malczewski <[email protected]> | 2026-04-27 23:05:19 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-04-27 23:05:19 +0900 |
| commit | c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12 (patch) | |
| tree | 695fcbd4c84b18c6eb14f950bc47b11be6828f30 | |
| download | firecrawl-dokploy-c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12.tar.gz firecrawl-dokploy-c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12.zip | |
init
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | .rules/default/goal.md | 168 |
| -rw-r--r-- | docker-compose.yml | 165 |
| -rw-r--r-- | settings.yml | 45 |
4 files changed, 379 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..feead5b
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+reference/
diff --git a/.rules/default/goal.md b/.rules/default/goal.md
new file mode 100644
index 0000000..5cefc02
--- /dev/null
+++ b/.rules/default/goal.md
@@ -0,0 +1,168 @@
+# Firecrawl + SearXNG on Dokploy
+
+A minimal repo for deploying [Firecrawl](https://github.com/firecrawl/firecrawl) with [SearXNG](https://github.com/searxng/searxng) on [Dokploy](https://dokploy.com) — a fully self-hosted web search and content extraction API for AI tooling.
+
+## What this does
+
+This gives you a single API endpoint that can:
+
+- **Search the web** (`/v1/search`) — find pages matching a query, powered by SearXNG aggregating results from Google, Bing, DuckDuckGo, and others
+- **Scrape a page** (`/v1/scrape`) — fetch any URL and get clean markdown or structured JSON, with full JavaScript rendering via Playwright
+- **Crawl a site** (`/v1/crawl`) — traverse an entire website and extract content from every page
+- **Map a site** (`/v1/map`) — discover all URLs on a domain without scraping them
+
+The search endpoint is what ties it all together for AI use: Firecrawl sends your query to SearXNG, gets back relevant URLs, then scrapes and cleans the top results — all in one API call.
+
+## Why this repo exists
+
+Firecrawl's official repo is a large monorepo that assumes you're building from source. SearXNG has its own separate docker-compose setup. To deploy both on Dokploy, you need compose files that:
+
+1. Use pre-built images instead of `build:` directives
+2. Join the `dokploy-network` for Traefik routing
+3. Include Traefik labels for automatic HTTPS
+4. Use Docker named volumes for persistence
+5. Avoid explicit `container_name` declarations (breaks Dokploy logging)
+6. Wire SearXNG into Firecrawl via internal Docker networking
+
+Rather than forking both repos, this repo contains only the compose file, a SearXNG settings file, and this README. When either project publishes new images, you bump a version tag — no merge conflicts, no carrying source code you don't touch.
+
+## Architecture
+
+```
+                          ┌─────────────────────┐
+  Internet ──► Traefik ──► Firecrawl API (:3002)
+                          │     │          │
+                          │     ▼          ▼
+                          │  Playwright   Redis
+                          │   (:3000)    (:6379)
+                          │     │
+                          │     ▼
+                          │  SearXNG (:8080)
+                          │     │
+                          │     ▼
+                          │  Google / Bing / DDG
+                          └─────────────────────┘
+```
+
+All services communicate over an internal Docker network. Only the Firecrawl API is exposed to the internet via Traefik. SearXNG is internal-only by default (you can optionally expose it via its own domain).
+
+## Services
+
+| Service | Image | Purpose | Resources |
+|---|---|---|---|
+| `api` | `ghcr.io/firecrawl/firecrawl` | Main API + workers | 4 CPU / 8 GB |
+| `playwright-service` | `ghcr.io/firecrawl/playwright-service` | Headless browser for JS pages | 2 CPU / 4 GB |
+| `searxng` | `searxng/searxng` | Metasearch engine | minimal |
+| `redis` | `redis:alpine` | Queues, rate limiting, cache | minimal |
+
+## Deploying on Dokploy
+
+### 1. Prerequisites
+
+- A Dokploy instance
+- A DNS A record pointing your subdomain (e.g. `firecrawl.yourdomain.com`) to your server
+
+### 2. Create the service
+
+1. In Dokploy, create a new **Compose** service (type: Docker Compose)
+2. Connect this GitHub repo as the source
+3. Set the **Compose Path** to `./docker-compose.yml`
+4. Set the **branch** to `main`
+
+### 3. Configure environment variables
+
+In Dokploy's environment variable editor, set:
+
+```env
+# Required — domain Traefik routes to the Firecrawl API
+FIRECRAWL_DOMAIN=firecrawl.yourdomain.com
+
+# Recommended — protect your API with a key
+TEST_API_KEY=fc-your-secret-key
+
+# Recommended — change the Bull queue dashboard admin key
+BULL_AUTH_KEY=something-secure
+```
+
+### 4. Deploy
+
+Hit deploy. Dokploy pulls the images, creates the containers, and Traefik generates SSL certificates. Give it ~30 seconds.
+
+Your API is now live at `https://firecrawl.yourdomain.com`.
+
+### 5. Test it
+
+```bash
+# Search the web (SearXNG → Firecrawl scrape → clean markdown)
+curl -X POST https://firecrawl.yourdomain.com/v1/search \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer fc-your-secret-key' \
+  -d '{"query": "what is firecrawl", "limit": 5}'
+
+# Scrape a single page
+curl -X POST https://firecrawl.yourdomain.com/v1/scrape \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer fc-your-secret-key' \
+  -d '{"url": "https://example.com"}'
+```
+
+## SearXNG configuration
+
+The `searxng/settings.yml` file in this repo is pre-configured for API use:
+
+- **JSON format enabled** — required for Firecrawl to consume results (disabled by default in SearXNG)
+- **Rate limiter disabled** — since SearXNG is only accessed internally by Firecrawl, not by the public internet
+- **Engines enabled** — Google, Bing, DuckDuckGo, Wikipedia, GitHub
+
+You can customize the engines, categories, and other settings by editing `searxng/settings.yml` and redeploying. See the [SearXNG documentation](https://docs.searxng.org/admin/settings/index.html) for all available options.
+
+**Important:** Change the `secret_key` value in `searxng/settings.yml` before deploying to production.
+
+## Optional: exposing SearXNG publicly
+
+By default SearXNG is only accessible internally. If you also want a public search UI, uncomment the Traefik labels on the `searxng` service in `docker-compose.yml` and add `SEARXNG_DOMAIN` to your env vars.
+
+## Optional: AI extraction features
+
+Firecrawl can use an LLM for structured data extraction via its `/extract` endpoint. Add one of these to your env vars:
+
+```env
+# OpenAI
+OPENAI_API_KEY=sk-your-key
+
+# Or a local Ollama instance accessible from the server
+OLLAMA_BASE_URL=http://host.docker.internal:11434/api
+MODEL_NAME=deepseek-r1:7b
+```
+
+## Resource requirements
+
+The default limits match Firecrawl's official recommendations. Total: roughly 6 CPU cores and 12 GB RAM. For light personal use you can lower these — edit the `cpus` and `mem_limit` values in `docker-compose.yml`. The API runs fine on 2 cores / 4 GB and Playwright on 1 core / 2 GB, but expect slower scraping on JS-heavy sites.
+
+SearXNG and Redis are lightweight and don't need explicit resource limits.
+
+## Updating
+
+Redeploy from Dokploy to pull the latest images. The compose file uses `latest` tags by default. To pin versions for stability, replace image tags with specific releases (e.g. `ghcr.io/firecrawl/firecrawl:v1.x.x`). Check Firecrawl's [releases page](https://github.com/firecrawl/firecrawl/releases) for available tags.
+
+Redis data and SearXNG cache persist in named volumes across redeployments.
+
+## Troubleshooting
+
+**`/search` returns empty or errors:** SearXNG might not be ready yet, or search engines might be rate-limiting your server's IP. Check SearXNG logs in Dokploy. Try changing the enabled engines in `searxng/settings.yml`.
+
+**Playwright timeouts on JS-heavy sites:** The default 4 GB memory limit for Playwright might not be enough. Increase `mem_limit` on the `playwright-service`.
+
+**403 errors from SearXNG:** Make sure `formats: - json` is present in `searxng/settings.yml`. Without it, SearXNG blocks non-HTML requests.
+
+**Redis connection refused:** Check that the Redis container is healthy in Dokploy's dashboard. The API won't start until Redis passes its healthcheck.
+
+## File structure
+
+```
+.
+├── docker-compose.yml     # All four services, Dokploy-ready
+├── searxng/
+│   └── settings.yml       # SearXNG config (JSON API enabled, limiter off)
+└── README.md
+```
diff --git a/docker-compose.yml b/docker-compose.yml
new file mode 100644
index 0000000..a59d779
--- /dev/null
+++ b/docker-compose.yml
@@ -0,0 +1,165 @@
+name: firecrawl
+
+services:
+  # ============================================================
+  # SearXNG — metasearch engine (powers Firecrawl's /search API)
+  # ============================================================
+  searxng:
+    image: docker.io/searxng/searxng:latest
+    networks:
+      - backend
+      - dokploy-network
+    volumes:
+      - ./searxng:/etc/searxng:rw
+      - searxng-cache:/var/cache/searxng:rw
+    environment:
+      - SEARXNG_BASE_URL=https://${SEARXNG_DOMAIN:-searxng.localhost}/
+    cap_drop:
+      - ALL
+    cap_add:
+      - CHOWN
+      - SETGID
+      - SETUID
+      - DAC_OVERRIDE
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "1m"
+        max-file: "1"
+    restart: unless-stopped
+    # Uncomment labels below if you want SearXNG accessible via its own domain.
+    # Otherwise it's only reachable internally by Firecrawl.
+    # labels:
+    #   - "traefik.enable=true"
+    #   - "traefik.http.routers.searxng.rule=Host(`${SEARXNG_DOMAIN}`)"
+    #   - "traefik.http.routers.searxng.entrypoints=websecure"
+    #   - "traefik.http.routers.searxng.tls.certResolver=letsencrypt"
+    #   - "traefik.http.services.searxng.loadbalancer.server.port=8080"
+
+  # ============================================================
+  # Playwright — headless browser for JS-rendered pages
+  # ============================================================
+  playwright-service:
+    image: ghcr.io/firecrawl/playwright-service:latest
+    networks:
+      - backend
+      - dokploy-network
+    environment:
+      PORT: 3000
+      PROXY_SERVER: ${PROXY_SERVER:-}
+      PROXY_USERNAME: ${PROXY_USERNAME:-}
+      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
+      BLOCK_MEDIA: ${BLOCK_MEDIA:-}
+      ALLOW_LOCAL_WEBHOOKS: ${ALLOW_LOCAL_WEBHOOKS:-}
+      MAX_CONCURRENT_PAGES: ${MAX_CONCURRENT_PAGES:-10}
+    cpus: 2.0
+    mem_limit: 4G
+    memswap_limit: 4G
+    tmpfs:
+      - /tmp/.cache:noexec,nosuid,size=1g
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "10m"
+        max-file: "3"
+        compress: "true"
+    restart: unless-stopped
+
+  # ============================================================
+  # Firecrawl API — scrape, crawl, search, map
+  # ============================================================
+  api:
+    image: ghcr.io/firecrawl/firecrawl
+    networks:
+      - backend
+      - dokploy-network
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+    environment:
+      # === Required ===
+      PORT: ${PORT:-3002}
+      INTERNAL_PORT: ${INTERNAL_PORT:-3002}
+      HOST: 0.0.0.0
+      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
+      REDIS_URL: redis://redis:6379
+      REDIS_RATE_LIMIT_URL: redis://redis:6379
+      PLAYWRIGHT_MICROSERVICE_URL: http://playwright-service:3000/scrape
+      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION:-false}
+      # === SearXNG (internal, same compose network) ===
+      SEARXNG_ENDPOINT: http://searxng:8080
+      SEARXNG_ENGINES: ${SEARXNG_ENGINES:-}
+      SEARXNG_CATEGORIES: ${SEARXNG_CATEGORIES:-}
+      # === Optional: Auth ===
+      TEST_API_KEY: ${TEST_API_KEY:-}
+      BULL_AUTH_KEY: ${BULL_AUTH_KEY:-CHANGEME}
+      # === Optional: AI Features ===
+      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
+      OLLAMA_BASE_URL: ${OLLAMA_BASE_URL:-}
+      MODEL_NAME: ${MODEL_NAME:-}
+      # === Optional: Proxy ===
+      PROXY_SERVER: ${PROXY_SERVER:-}
+      PROXY_USERNAME: ${PROXY_USERNAME:-}
+      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
+    ports:
+      - ${PORT:-3002}
+    ulimits:
+      nofile:
+        soft: 65535
+        hard: 65535
+    cpus: 4.0
+    mem_limit: 8G
+    memswap_limit: 8G
+    command: node dist/src/harness.js --start-docker
+    depends_on:
+      redis:
+        condition: service_healthy
+      playwright-service:
+        condition: service_started
+      searxng:
+        condition: service_started
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.firecrawl-api.rule=Host(`${FIRECRAWL_DOMAIN}`)"
+      - "traefik.http.routers.firecrawl-api.entrypoints=websecure"
+      - "traefik.http.routers.firecrawl-api.tls.certResolver=letsencrypt"
+      - "traefik.http.services.firecrawl-api.loadbalancer.server.port=${PORT:-3002}"
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "10m"
+        max-file: "3"
+        compress: "true"
+    restart: unless-stopped
+
+  # ============================================================
+  # Redis — queues, rate limiting, caching
+  # ============================================================
+  redis:
+    image: redis:alpine
+    networks:
+      - backend
+    command: redis-server --bind 0.0.0.0
+    volumes:
+      - redis-data:/data
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "5m"
+        max-file: "2"
+        compress: "true"
+    restart: unless-stopped
+
+networks:
+  backend:
+    driver: bridge
+  dokploy-network:
+    external: true
+
+volumes:
+  redis-data:
+  searxng-cache:
diff --git a/settings.yml b/settings.yml
new file mode 100644
index 0000000..ac380be
--- /dev/null
+++ b/settings.yml
@@ -0,0 +1,45 @@
+use_default_settings: true
+
+server:
+  secret_key: "change-this-to-a-random-string"
+  limiter: false
+  image_proxy: false
+  port: 8080
+  bind_address: "0.0.0.0"
+
+search:
+  safe_search: 0
+  autocomplete: ""
+  default_lang: ""
+  formats:
+    - html
+    - json
+
+ui:
+  static_use_hash: true
+
+engines:
+  - name: google
+    engine: google
+    shortcut: g
+    disabled: false
+
+  - name: duckduckgo
+    engine: duckduckgo
+    shortcut: ddg
+    disabled: false
+
+  - name: bing
+    engine: bing
+    shortcut: bi
+    disabled: false
+
+  - name: wikipedia
+    engine: wikipedia
+    shortcut: wp
+    disabled: false
+
+  - name: github
+    engine: github
+    shortcut: gh
+    disabled: false
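
The `curl` search call in the README's "Test it" step could also be issued from code. The following is a minimal Python sketch, not part of this commit: the helper names `build_search_request` and `search` are hypothetical, the domain and API key are the README's placeholders, and the response is returned as parsed JSON without assuming anything about its fields.

```python
import json
import urllib.request


def build_search_request(base_url: str, api_key: str, query: str,
                         limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST to the /v1/search endpoint.

    Mirrors the README's curl example: a JSON body with "query" and
    "limit", plus a Bearer token carrying the TEST_API_KEY value.
    """
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/search",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def search(base_url: str, api_key: str, query: str,
           limit: int = 5, timeout: float = 60.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_search_request(base_url, api_key, query, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values from the README; replace with your own deployment.
    result = search("https://firecrawl.yourdomain.com",
                    "fc-your-secret-key", "what is firecrawl")
    print(json.dumps(result, indent=2))
```

Building the request separately from sending it keeps the URL, headers, and body inspectable without a live deployment.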
