author: Adam Malczewski <[email protected]> 2026-04-27 23:05:19 +0900
committer: Adam Malczewski <[email protected]> 2026-04-27 23:05:19 +0900
commit: c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12
tree: 695fcbd4c84b18c6eb14f950bc47b11be6828f30
download: firecrawl-dokploy-c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12.tar.gz
          firecrawl-dokploy-c7d5395ddc4f818d1faf0c59bd7c87d4ffd67a12.zip
init
-rw-r--r--  .gitignore               1
-rw-r--r--  .rules/default/goal.md   168
-rw-r--r--  docker-compose.yml       165
-rw-r--r--  settings.yml             45
4 files changed, 379 insertions, 0 deletions
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..feead5b
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+reference/
diff --git a/.rules/default/goal.md b/.rules/default/goal.md
new file mode 100644
index 0000000..5cefc02
--- /dev/null
+++ b/.rules/default/goal.md
@@ -0,0 +1,168 @@
+# Firecrawl + SearXNG on Dokploy
+
+A minimal repo for deploying [Firecrawl](https://github.com/firecrawl/firecrawl) with [SearXNG](https://github.com/searxng/searxng) on [Dokploy](https://dokploy.com) — a fully self-hosted web search and content extraction API for AI tooling.
+
+## What this does
+
+This gives you a single API endpoint that can:
+
+- **Search the web** (`/v1/search`) — find pages matching a query, powered by SearXNG aggregating results from Google, Bing, DuckDuckGo, and others
+- **Scrape a page** (`/v1/scrape`) — fetch any URL and get clean markdown or structured JSON, with full JavaScript rendering via Playwright
+- **Crawl a site** (`/v1/crawl`) — traverse an entire website and extract content from every page
+- **Map a site** (`/v1/map`) — discover all URLs on a domain without scraping them
+
+The search endpoint is what ties it all together for AI use: Firecrawl sends your query to SearXNG, gets back relevant URLs, then scrapes and cleans the top results — all in one API call.
+
+## Why this repo exists
+
+Firecrawl's official repo is a large monorepo that assumes you're building from source. SearXNG has its own separate docker-compose setup. To deploy both on Dokploy, you need compose files that:
+
+1. Use pre-built images instead of `build:` directives
+2. Join the `dokploy-network` for Traefik routing
+3. Include Traefik labels for automatic HTTPS
+4. Use Docker named volumes for persistence
+5. Avoid explicit `container_name` declarations (breaks Dokploy logging)
+6. Wire SearXNG into Firecrawl via internal Docker networking
+
+Rather than forking both repos, this repo contains only the compose file, a SearXNG settings file, and this README. When either project publishes new images, you bump a version tag — no merge conflicts, no carrying source code you don't touch.
+
+## Architecture
+
+```
+                 ┌─────────────────────┐
+  Internet ──► Traefik ──► Firecrawl API (:3002)
+                 │          │          │
+                 │          ▼          ▼
+                 │     Playwright    Redis
+                 │      (:3000)     (:6379)
+                 │          │
+                 │          ▼
+                 │    SearXNG (:8080)
+                 │          │
+                 │          ▼
+                 │  Google / Bing / DDG
+                 └─────────────────────┘
+```
+
+All services communicate over an internal Docker network. Only the Firecrawl API is exposed to the internet via Traefik. SearXNG is internal-only by default (you can optionally expose it via its own domain).
+
+## Services
+
+| Service | Image | Purpose | Resources |
+|---|---|---|---|
+| `api` | `ghcr.io/firecrawl/firecrawl` | Main API + workers | 4 CPU / 8 GB |
+| `playwright-service` | `ghcr.io/firecrawl/playwright-service` | Headless browser for JS pages | 2 CPU / 4 GB |
+| `searxng` | `searxng/searxng` | Metasearch engine | minimal |
+| `redis` | `redis:alpine` | Queues, rate limiting, cache | minimal |
+
+## Deploying on Dokploy
+
+### 1. Prerequisites
+
+- A Dokploy instance
+- A DNS A record pointing your subdomain (e.g. `firecrawl.yourdomain.com`) to your server
+
+### 2. Create the service
+
+1. In Dokploy, create a new **Compose** service (type: Docker Compose)
+2. Connect this GitHub repo as the source
+3. Set the **Compose Path** to `./docker-compose.yml`
+4. Set the **branch** to `main`
+
+### 3. Configure environment variables
+
+In Dokploy's environment variable editor, set:
+
+```env
+# Required — domain Traefik routes to the Firecrawl API
+FIRECRAWL_DOMAIN=firecrawl.yourdomain.com
+
+# Recommended — protect your API with a key
+TEST_API_KEY=fc-your-secret-key
+
+# Recommended — change the Bull queue dashboard admin key
+BULL_AUTH_KEY=something-secure
+```
+
+### 4. Deploy
+
+Hit deploy. Dokploy pulls the images, creates the containers, and Traefik generates SSL certificates. Give it ~30 seconds.
+
+Your API is now live at `https://firecrawl.yourdomain.com`.
+
+### 5. Test it
+
+```bash
+# Search the web (SearXNG → Firecrawl scrape → clean markdown)
+curl -X POST https://firecrawl.yourdomain.com/v1/search \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer fc-your-secret-key' \
+  -d '{"query": "what is firecrawl", "limit": 5}'
+
+# Scrape a single page
+curl -X POST https://firecrawl.yourdomain.com/v1/scrape \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer fc-your-secret-key' \
+  -d '{"url": "https://example.com"}'
+```
+
+## SearXNG configuration
+
+The `searxng/settings.yml` file in this repo is pre-configured for API use:
+
+- **JSON format enabled** — required for Firecrawl to consume results (disabled by default in SearXNG)
+- **Rate limiter disabled** — since SearXNG is only accessed internally by Firecrawl, not by the public internet
+- **Engines enabled** — Google, Bing, DuckDuckGo, Wikipedia, GitHub
+
+You can customize the engines, categories, and other settings by editing `searxng/settings.yml` and redeploying. See the [SearXNG documentation](https://docs.searxng.org/admin/settings/index.html) for all available options.
+
+**Important:** Change the `secret_key` value in `searxng/settings.yml` before deploying to production.
+
+## Optional: exposing SearXNG publicly
+
+By default SearXNG is only accessible internally. If you also want a public search UI, uncomment the Traefik labels on the `searxng` service in `docker-compose.yml` and add `SEARXNG_DOMAIN` to your env vars.
+
+## Optional: AI extraction features
+
+Firecrawl can use an LLM for structured data extraction via its `/v1/extract` endpoint. Add one of these to your env vars:
+
+```env
+# OpenAI
+OPENAI_API_KEY=sk-your-key
+
+# Or a local Ollama instance accessible from the server
+OLLAMA_BASE_URL=http://host.docker.internal:11434/api
+MODEL_NAME=deepseek-r1:7b
+```
+
+## Resource requirements
+
+The default limits match Firecrawl's official recommendations. Total: roughly 6 CPU cores and 12 GB RAM. For light personal use you can lower these — edit the `cpus` and `mem_limit` values in `docker-compose.yml`. The API runs fine on 2 cores / 4 GB and Playwright on 1 core / 2 GB, but expect slower scraping on JS-heavy sites.
+
+SearXNG and Redis are lightweight and don't need explicit resource limits.
+
+## Updating
+
+Redeploy from Dokploy to pull the latest images. The compose file uses `latest` tags by default. To pin versions for stability, replace image tags with specific releases (e.g. `ghcr.io/firecrawl/firecrawl:v1.x.x`). Check Firecrawl's [releases page](https://github.com/firecrawl/firecrawl/releases) for available tags.
+
+Redis data and SearXNG cache persist in named volumes across redeployments.
+
+## Troubleshooting
+
+**`/v1/search` returns empty results or errors:** SearXNG might not be ready yet, or the upstream search engines might be rate-limiting your server's IP. Check the SearXNG logs in Dokploy, and try changing the enabled engines in `searxng/settings.yml`.
+
+**Playwright timeouts on JS-heavy sites:** The default 4 GB memory limit for Playwright might not be enough. Increase `mem_limit` on the `playwright-service`.
+
+**403 errors from SearXNG:** Make sure `json` is listed under `search.formats` in `searxng/settings.yml`. Without it, SearXNG rejects requests that ask for JSON output.
+
+**Redis connection refused:** Check that the Redis container is healthy in Dokploy's dashboard. The API won't start until Redis passes its healthcheck.
+
+## File structure
+
+```
+.
+├── docker-compose.yml    # All four services, Dokploy-ready
+├── searxng/
+│   └── settings.yml      # SearXNG config (JSON API enabled, limiter off)
+└── README.md
+```
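
The two curl checks from the "Test it" step above can be wrapped in a few shell functions so they are easy to rerun after each redeploy. This is only a sketch: the function names and the `FIRECRAWL_URL` / `FIRECRAWL_KEY` variables are hypothetical and not part of the commit, and the payload builder does no JSON escaping.

```bash
# Hypothetical helpers; function and variable names are illustrative only.

# Build the JSON body for /v1/search. No JSON escaping is done, so the
# query must not contain double quotes or backslashes.
search_payload() {
  printf '{"query": "%s", "limit": %d}' "$1" "${2:-5}"
}

# POST a JSON body to a path under $FIRECRAWL_URL and print the HTTP status code.
post_status() {
  curl -sS -o /dev/null -w '%{http_code}' -X POST "${FIRECRAWL_URL}$1" \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer ${FIRECRAWL_KEY}" \
    -d "$2"
}

# Run both README checks and report the status code of each.
smoke_test() {
  echo "search: $(post_status /v1/search "$(search_payload 'what is firecrawl' 5)")"
  echo "scrape: $(post_status /v1/scrape '{"url": "https://example.com"}')"
}
```

After sourcing the file, export `FIRECRAWL_URL=https://firecrawl.yourdomain.com` and `FIRECRAWL_KEY` (the value you set for `TEST_API_KEY`), then run `smoke_test`. Printing only the status code keeps the output short; drop `-o /dev/null` to see the response bodies instead.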
diff --git a/docker-compose.yml b/docker-compose.yml
new file mode 100644
index 0000000..a59d779
--- /dev/null
+++ b/docker-compose.yml
@@ -0,0 +1,165 @@
+name: firecrawl
+
+services:
+ # ============================================================
+ # SearXNG — metasearch engine (powers Firecrawl's /search API)
+ # ============================================================
+  searxng:
+    image: docker.io/searxng/searxng:latest
+    networks:
+      - backend
+      - dokploy-network
+    volumes:
+      - ./searxng:/etc/searxng:rw
+      - searxng-cache:/var/cache/searxng:rw
+    environment:
+      - SEARXNG_BASE_URL=https://${SEARXNG_DOMAIN:-searxng.localhost}/
+    cap_drop:
+      - ALL
+    cap_add:
+      - CHOWN
+      - SETGID
+      - SETUID
+      - DAC_OVERRIDE
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "1m"
+        max-file: "1"
+    restart: unless-stopped
+    # Uncomment labels below if you want SearXNG accessible via its own domain.
+    # Otherwise it's only reachable internally by Firecrawl.
+    # labels:
+    #   - "traefik.enable=true"
+    #   - "traefik.http.routers.searxng.rule=Host(`${SEARXNG_DOMAIN}`)"
+    #   - "traefik.http.routers.searxng.entrypoints=websecure"
+    #   - "traefik.http.routers.searxng.tls.certResolver=letsencrypt"
+    #   - "traefik.http.services.searxng.loadbalancer.server.port=8080"
+
+ # ============================================================
+ # Playwright — headless browser for JS-rendered pages
+ # ============================================================
+  playwright-service:
+    image: ghcr.io/firecrawl/playwright-service:latest
+    networks:
+      - backend
+      - dokploy-network
+    environment:
+      PORT: 3000
+      PROXY_SERVER: ${PROXY_SERVER:-}
+      PROXY_USERNAME: ${PROXY_USERNAME:-}
+      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
+      BLOCK_MEDIA: ${BLOCK_MEDIA:-}
+      ALLOW_LOCAL_WEBHOOKS: ${ALLOW_LOCAL_WEBHOOKS:-}
+      MAX_CONCURRENT_PAGES: ${MAX_CONCURRENT_PAGES:-10}
+    cpus: 2.0
+    mem_limit: 4G
+    memswap_limit: 4G
+    tmpfs:
+      - /tmp/.cache:noexec,nosuid,size=1g
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "10m"
+        max-file: "3"
+        compress: "true"
+    restart: unless-stopped
+
+ # ============================================================
+ # Firecrawl API — scrape, crawl, search, map
+ # ============================================================
+  api:
+    image: ghcr.io/firecrawl/firecrawl
+    networks:
+      - backend
+      - dokploy-network
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+    environment:
+      # === Required ===
+      PORT: ${PORT:-3002}
+      INTERNAL_PORT: ${INTERNAL_PORT:-3002}
+      HOST: 0.0.0.0
+      NUM_WORKERS_PER_QUEUE: ${NUM_WORKERS_PER_QUEUE:-8}
+      REDIS_URL: redis://redis:6379
+      REDIS_RATE_LIMIT_URL: redis://redis:6379
+      PLAYWRIGHT_MICROSERVICE_URL: http://playwright-service:3000/scrape
+      USE_DB_AUTHENTICATION: ${USE_DB_AUTHENTICATION:-false}
+      # === SearXNG (internal, same compose network) ===
+      SEARXNG_ENDPOINT: http://searxng:8080
+      SEARXNG_ENGINES: ${SEARXNG_ENGINES:-}
+      SEARXNG_CATEGORIES: ${SEARXNG_CATEGORIES:-}
+      # === Optional: Auth ===
+      TEST_API_KEY: ${TEST_API_KEY:-}
+      BULL_AUTH_KEY: ${BULL_AUTH_KEY:-CHANGEME}
+      # === Optional: AI Features ===
+      OPENAI_API_KEY: ${OPENAI_API_KEY:-}
+      OLLAMA_BASE_URL: ${OLLAMA_BASE_URL:-}
+      MODEL_NAME: ${MODEL_NAME:-}
+      # === Optional: Proxy ===
+      PROXY_SERVER: ${PROXY_SERVER:-}
+      PROXY_USERNAME: ${PROXY_USERNAME:-}
+      PROXY_PASSWORD: ${PROXY_PASSWORD:-}
+    ports:
+      - ${PORT:-3002}
+    ulimits:
+      nofile:
+        soft: 65535
+        hard: 65535
+    cpus: 4.0
+    mem_limit: 8G
+    memswap_limit: 8G
+    command: node dist/src/harness.js --start-docker
+    depends_on:
+      redis:
+        condition: service_healthy
+      playwright-service:
+        condition: service_started
+      searxng:
+        condition: service_started
+    labels:
+      - "traefik.enable=true"
+      - "traefik.http.routers.firecrawl-api.rule=Host(`${FIRECRAWL_DOMAIN}`)"
+      - "traefik.http.routers.firecrawl-api.entrypoints=websecure"
+      - "traefik.http.routers.firecrawl-api.tls.certResolver=letsencrypt"
+      - "traefik.http.services.firecrawl-api.loadbalancer.server.port=${PORT:-3002}"
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "10m"
+        max-file: "3"
+        compress: "true"
+    restart: unless-stopped
+
+ # ============================================================
+ # Redis — queues, rate limiting, caching
+ # ============================================================
+  redis:
+    image: redis:alpine
+    networks:
+      - backend
+    command: redis-server --bind 0.0.0.0
+    volumes:
+      - redis-data:/data
+    healthcheck:
+      test: ["CMD", "redis-cli", "ping"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+    logging:
+      driver: "json-file"
+      options:
+        max-size: "5m"
+        max-file: "2"
+        compress: "true"
+    restart: unless-stopped
+
+networks:
+  backend:
+    driver: bridge
+  dokploy-network:
+    external: true
+
+volumes:
+  redis-data:
+  searxng-cache:
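
The compose file above interpolates `${FIRECRAWL_DOMAIN}` into the Traefik `Host()` rule without a default, leaves `TEST_API_KEY` empty unless set, and falls back to `CHANGEME` for `BULL_AUTH_KEY`. A minimal pre-deploy check, assuming those variables are available in the shell where it runs, might look like the sketch below. `check_env` is a hypothetical name and not part of the commit.

```bash
# Hypothetical pre-deploy check; check_env is an illustrative name.
# Returns non-zero only when FIRECRAWL_DOMAIN is missing; the other
# two conditions produce warnings on stderr.
check_env() {
  rc=0
  if [ -z "${FIRECRAWL_DOMAIN:-}" ]; then
    echo "error: FIRECRAWL_DOMAIN is unset; the Traefik Host() rule would be empty" >&2
    rc=1
  fi
  if [ -z "${TEST_API_KEY:-}" ]; then
    echo "warning: TEST_API_KEY is empty; the README recommends setting a key" >&2
  fi
  if [ "${BULL_AUTH_KEY:-CHANGEME}" = "CHANGEME" ]; then
    echo "warning: BULL_AUTH_KEY still has the CHANGEME default" >&2
  fi
  return "$rc"
}
```

Running `check_env && echo ok` before a deploy reports a missing domain early. On Dokploy the variables live in its environment editor, so this is mainly useful when testing the compose file locally.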
diff --git a/settings.yml b/settings.yml
new file mode 100644
index 0000000..ac380be
--- /dev/null
+++ b/settings.yml
@@ -0,0 +1,45 @@
+use_default_settings: true
+
+server:
+  secret_key: "change-this-to-a-random-string"
+  limiter: false
+  image_proxy: false
+  port: 8080
+  bind_address: "0.0.0.0"
+
+search:
+  safe_search: 0
+  autocomplete: ""
+  default_lang: ""
+  formats:
+    - html
+    - json
+
+ui:
+  static_use_hash: true
+
+engines:
+  - name: google
+    engine: google
+    shortcut: g
+    disabled: false
+
+  - name: duckduckgo
+    engine: duckduckgo
+    shortcut: ddg
+    disabled: false
+
+  - name: bing
+    engine: bing
+    shortcut: bi
+    disabled: false
+
+  - name: wikipedia
+    engine: wikipedia
+    shortcut: wp
+    disabled: false
+
+  - name: github
+    engine: github
+    shortcut: gh
+    disabled: false
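
The README asks for the placeholder `secret_key` to be changed before a production deploy. One way to do that, sketched here under the assumption that the placeholder string is still exactly `change-this-to-a-random-string`, is to substitute 32 random bytes in hex. `set_secret_key` is a hypothetical helper name and not part of the commit.

```bash
# Hypothetical helper; set_secret_key is an illustrative name.
# Replaces the placeholder secret_key in a SearXNG settings file with
# 32 random bytes, hex-encoded (64 characters).
set_secret_key() {
  file="$1"
  key=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
  # Write to a temp file first so a failed sed never truncates the original.
  sed "s|change-this-to-a-random-string|${key}|" "$file" > "${file}.tmp" &&
    mv "${file}.tmp" "$file"
}
```

Pass whichever path holds the file: the README refers to `searxng/settings.yml`, while this commit adds it as `settings.yml` at the repository root. Commit the result only if the repository is private, since the key is then stored in plain text.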