# Firecrawl + SearXNG on Dokploy

A minimal repo for deploying [Firecrawl](https://github.com/firecrawl/firecrawl) with [SearXNG](https://github.com/searxng/searxng) on [Dokploy](https://dokploy.com) — a fully self-hosted web search and content extraction API for AI tooling.

## What this does

This gives you a single API endpoint that can:

- **Search the web** (`/v1/search`) — find pages matching a query, powered by SearXNG aggregating results from Google, Bing, DuckDuckGo, and others
- **Scrape a page** (`/v1/scrape`) — fetch any URL and get clean markdown or structured JSON, with full JavaScript rendering via Playwright
- **Crawl a site** (`/v1/crawl`) — traverse an entire website and extract content from every page
- **Map a site** (`/v1/map`) — discover all URLs on a domain without scraping them

The search endpoint is what ties it all together for AI use: Firecrawl sends your query to SearXNG, gets back relevant URLs, then scrapes and cleans the top results — all in one API call.

## Why this repo exists

Firecrawl's official repo is a large monorepo that assumes you're building from source. SearXNG has its own separate docker-compose setup. To deploy both on Dokploy, you need compose files that:

1. Use pre-built images instead of `build:` directives
2. Join the `dokploy-network` for Traefik routing
3. Include Traefik labels for automatic HTTPS
4. Use Docker named volumes for persistence
5. Avoid explicit `container_name` declarations (breaks Dokploy logging)
6. Wire SearXNG into Firecrawl via internal Docker networking
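
Requirements 1 and 5 lend themselves to a mechanical check. The following is a minimal sketch, not a script shipped in this repo: `lint_compose` is a hypothetical helper name, and it only greps for `build:` and `container_name:` keys rather than parsing the YAML.

```bash
# Hypothetical helper: flag compose keys that conflict with the list above.
# Usage: lint_compose docker-compose.yml
lint_compose() {
  local file="$1" status=0

  # Requirement 1: services should reference pre-built images.
  if grep -nE '^[[:space:]]*build:' "$file"; then
    echo "error: build: directive found, use a pre-built image instead" >&2
    status=1
  fi

  # Requirement 5: no explicit container names.
  if grep -nE '^[[:space:]]*container_name:' "$file"; then
    echo "error: container_name found, remove it" >&2
    status=1
  fi

  return "$status"
}
```

The function prints each offending line with its line number and returns a non-zero status, so it could be dropped into a pre-deploy check.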

Rather than forking both repos, this repo contains only the compose file, a SearXNG settings file, and this README. When either project publishes new images, you redeploy — no merge conflicts, no carrying source code you don't touch.

## Architecture

```
                         ┌──────────────────────────┐
    Internet ──► Traefik ──► Firecrawl API (:31329)
                         │        │         │
                         │        ▼         ▼
                         │   Playwright   Redis
                         │    (:3000)    (:6379)
                         │        │
                         │        ▼         ▼
                         │    SearXNG   PostgreSQL
                         │    (:8080)    (:5432)
                         │        │
                         │        ▼         ▼
                         │  Google / …   RabbitMQ
                         │               (:5672)
                         └──────────────────────────┘
```

All services communicate over an internal Docker network. Only the Firecrawl API is exposed to the internet via Traefik. SearXNG is internal-only by default (you can optionally expose it via its own domain).

## Services

| Service | Image | Purpose | Resources |
|---|---|---|---|
| `api` | `ghcr.io/firecrawl/firecrawl` | Main API + workers | 4 CPU / 8 GB |
| `playwright-service` | `ghcr.io/firecrawl/playwright-service` | Headless browser for JS pages | 2 CPU / 4 GB |
| `searxng` | `searxng/searxng` | Metasearch engine | minimal |
| `postgres` | `ghcr.io/firecrawl/nuq-postgres` | NUQ job queue store (pg_cron + nuq schema) | minimal |
| `rabbitmq` | `rabbitmq:3-management` | NUQ message broker | minimal |
| `redis` | `redis:alpine` | Rate limiting, cache | minimal |

## Deploying on Dokploy

### 1. Prerequisites

- A Dokploy instance
- A DNS A record pointing your subdomain (e.g. `firecrawl.yourdomain.com`) to your server

### 2. Create the service

1. In Dokploy, create a new **Compose** service (type: Docker Compose)
2. Connect this GitHub repo as the source
3. Set the **Compose Path** to `./docker-compose.yml`
4. Set the **branch** to `main`

### 3. Configure environment variables

Run `bin/prod_secrets` on your local machine to generate secrets via gopass, then paste the output into Dokploy's environment variable editor.

Or set them manually:

```env
# Required — domain Traefik routes to the Firecrawl API
FIRECRAWL_DOMAIN=firecrawl.yourdomain.com

# Recommended — protect your API with a key
TEST_API_KEY=fc-your-secret-key

# Recommended — change the Bull queue dashboard admin key
BULL_AUTH_KEY=something-secure

# Recommended — set a strong PostgreSQL password
POSTGRES_PASSWORD=something-secure

# SearXNG CSRF secret — auto-injected at container start
SEARXNG_SECRET=something-random
```
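
If gopass is not set up, random values can also be generated directly in the shell. This is a minimal sketch, assuming a Unix-like system with `/dev/urandom`, `head`, `od`, and `tr`; `gen_secret` and `print_env` are hypothetical names, and the `fc-` prefix simply mirrors the placeholder above. `FIRECRAWL_DOMAIN` still has to be set by hand.

```bash
# Hypothetical helper: print N random bytes (default 32) as lowercase hex.
gen_secret() {
  head -c "${1:-32}" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Print env lines ready to paste into Dokploy's environment editor.
print_env() {
  printf 'TEST_API_KEY=fc-%s\n' "$(gen_secret 16)"
  printf 'BULL_AUTH_KEY=%s\n' "$(gen_secret 32)"
  printf 'POSTGRES_PASSWORD=%s\n' "$(gen_secret 32)"
  printf 'SEARXNG_SECRET=%s\n' "$(gen_secret 32)"
}

print_env
```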

### 4. Deploy

Hit deploy. Dokploy pulls the images, creates the containers, and Traefik generates SSL certificates. Give it ~60 seconds for all health checks to pass (PostgreSQL and RabbitMQ start first, then the API).

Your API is now live at `https://firecrawl.yourdomain.com`.
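
Rather than waiting a fixed 60 seconds, a script can poll until the API answers. The sketch below is one way to do that; `wait_until` is a hypothetical helper that retries any command until it succeeds or a deadline passes, and the probe in the usage comment reuses the scrape request from this README with placeholder domain and key.

```bash
# Hypothetical helper: retry a command until it succeeds or the deadline passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))

  until "$@"; do
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example (placeholder domain and key):
#   wait_until 120 5 curl -fsS -o /dev/null -X POST \
#     https://firecrawl.yourdomain.com/v1/scrape \
#     -H 'Content-Type: application/json' \
#     -H 'Authorization: Bearer fc-your-secret-key' \
#     -d '{"url": "https://example.com"}'
```

`curl -f` exits non-zero on HTTP errors, so the loop keeps retrying while Traefik or the API is still starting.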

### 5. Test it

```bash
# Search the web (SearXNG → Firecrawl scrape → clean markdown)
curl -X POST https://firecrawl.yourdomain.com/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-your-secret-key' \
  -d '{"query": "what is firecrawl", "limit": 5}'

# Scrape a single page
curl -X POST https://firecrawl.yourdomain.com/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-your-secret-key' \
  -d '{"url": "https://example.com"}'
```

Or use `bin/test https://firecrawl.yourdomain.com` to run the full test suite.

## Local development

```bash
# First-time setup: generate dev secrets in gopass
bin/dev_secrets

# Start all services (creates dokploy-network if missing)
bin/up

# Run tests against local stack
bin/test

# Stop
bin/down

# Full cleanup (volumes, orphans)
bin/clean

# Also remove cached images
bin/clean --images
```

## SearXNG configuration

The `searxng/settings.yml` file in this repo is pre-configured for API use:

- **JSON format enabled** — required for Firecrawl to consume results (disabled by default in SearXNG)
- **Rate limiter disabled** — since SearXNG is only accessed internally by Firecrawl, not by the public internet
- **Engines enabled** — Google, Bing, DuckDuckGo, Wikipedia, GitHub

You can customize the engines, categories, and other settings by editing `searxng/settings.yml` and redeploying. See the [SearXNG documentation](https://docs.searxng.org/admin/settings/index.html) for all available options.

The `secret_key` is handled automatically via the `SEARXNG_SECRET` environment variable — the Docker entrypoint injects it at container start. No need to edit this file for secrets.

## Optional: exposing SearXNG publicly

By default SearXNG is only accessible internally. If you also want a public search UI, uncomment the Traefik labels on the `searxng` service in `docker-compose.yml` and add `SEARXNG_DOMAIN` to your env vars.

## Optional: AI extraction features

Firecrawl can use an LLM for structured data extraction via its `/v1/extract` endpoint. Add one of these to your env vars:

```env
# OpenAI
OPENAI_API_KEY=sk-your-key

# Or a local Ollama instance accessible from the server
OLLAMA_BASE_URL=http://host.docker.internal:11434/api
MODEL_NAME=deepseek-r1:7b
```

## Resource requirements

The default limits match Firecrawl's official recommendations. Total: roughly 6 CPU cores and 12 GB RAM. For light personal use you can lower these — edit the `cpus` and `mem_limit` values in `docker-compose.yml`. The API runs fine on 2 cores / 4 GB and Playwright on 1 core / 2 GB, but expect slower scraping on JS-heavy sites.

PostgreSQL, RabbitMQ, SearXNG, and Redis are lightweight and don't need explicit resource limits.

## Updating

Redeploy from Dokploy to pull the latest images. The compose file uses `latest` tags by default. To pin versions for stability, replace image tags with specific releases (e.g. `ghcr.io/firecrawl/firecrawl:v1.x.x`). Check Firecrawl's [releases page](https://github.com/firecrawl/firecrawl/releases) for available tags.
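
To see which images still float on `latest`, the compose file can be scanned for untagged or `:latest` references. This is a rough sketch, assuming each image is declared on a single `image:` line with a literal value; `unpinned_images` is a hypothetical helper name.

```bash
# Hypothetical helper: list image references with no tag or an explicit :latest tag.
# Usage: unpinned_images docker-compose.yml
unpinned_images() {
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$1" | tr -d "\"'" |
    while read -r ref; do
      # Look only at the last path component so a registry port is not read as a tag.
      name="${ref##*/}"
      case "$name" in
        *@*)      ;;              # pinned by digest
        *:latest) echo "$ref" ;;  # explicit latest tag
        *:*)      ;;              # explicit version tag
        *)        echo "$ref" ;;  # no tag, which Docker resolves to latest
      esac
    done
}
```

Any reference it prints is a candidate for pinning to a specific release.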

Data persists in named volumes (`postgres-data`, `rabbitmq-data`, `redis-data`, `searxng-cache`) across redeployments.

## Troubleshooting

**`/v1/search` returns empty results or errors:** SearXNG might not be ready yet, or the upstream search engines might be rate-limiting your server's IP. Check the SearXNG logs in Dokploy, and try changing the enabled engines in `searxng/settings.yml`.

**Playwright timeouts on JS-heavy sites:** The default 4 GB memory limit for Playwright might not be enough. Increase `mem_limit` on the `playwright-service`.

**403 errors from SearXNG:** Make sure `json` is listed under `search.formats` in `searxng/settings.yml`. Without it, SearXNG rejects JSON requests with a 403.
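
The JSON format setting can be verified without starting the stack. The following is a minimal sketch that scans the settings file with `awk` instead of a real YAML parser; `has_json_format` is a hypothetical helper name, and it assumes the file has a single `formats:` key.

```bash
# Hypothetical helper: succeed if json appears under the formats: key.
# Handles both the block list form and the inline form (formats: [html, json]).
# Usage: has_json_format searxng/settings.yml
has_json_format() {
  awk '
    /^[ \t]*formats:/ {
      in_formats = 1
      if ($0 ~ /json/) found = 1   # inline list on the same line
      next
    }
    in_formats && /^[ \t]*-[ \t]*json[ \t]*$/ { found = 1 }
    # Any line that is not a list item, comment, or blank ends the formats block.
    in_formats && !/^[ \t]*(-|#|$)/ { in_formats = 0 }
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

The function returns 0 when `json` is enabled and 1 otherwise, so it can gate a deploy script.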

**Redis connection refused:** Check that the Redis container is healthy in Dokploy's dashboard. The API won't start until Redis passes its healthcheck.

**PostgreSQL or RabbitMQ not ready:** The API depends on both passing their health checks before starting. Check the Deployments tab in Dokploy for health check status. PostgreSQL needs ~30 seconds on first boot to initialize the NUQ schema.

## File structure

```
.
├── docker-compose.yml        # All six services, Dokploy-ready
├── searxng/
│   └── settings.yml          # SearXNG config (JSON API enabled, limiter off)
├── bin/
│   ├── prod_secrets          # Generate production env vars via gopass
│   ├── dev_secrets           # Generate dev secrets via gopass
│   ├── up                    # Start local dev stack
│   ├── down                  # Stop local dev stack
│   ├── clean                 # Remove containers, volumes, images
│   └── test                  # Test a running deployment
└── README.md
```