# Firecrawl + SearXNG on Dokploy

A minimal repo for deploying [Firecrawl](https://github.com/firecrawl/firecrawl) with [SearXNG](https://github.com/searxng/searxng) on [Dokploy](https://dokploy.com) — a fully self-hosted web search and content extraction API for AI tooling.

## What this does

This gives you a single API endpoint that can:

- **Search the web** (`/v1/search`) — find pages matching a query, powered by SearXNG aggregating results from Google, Bing, DuckDuckGo, and others
- **Scrape a page** (`/v1/scrape`) — fetch any URL and get clean markdown or structured JSON, with full JavaScript rendering via Playwright
- **Crawl a site** (`/v1/crawl`) — traverse an entire website and extract content from every page
- **Map a site** (`/v1/map`) — discover all URLs on a domain without scraping them

The search endpoint is what ties it all together for AI use: Firecrawl sends your query to SearXNG, gets back relevant URLs, then scrapes and cleans the top results — all in one API call.
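
As a rough mental model of that flow, the search endpoint behaves something like the function below. This is a sketch for illustration, not Firecrawl's actual implementation: `search_and_scrape`, `fetch_results`, and `scrape_url` are hypothetical names, and the two callables stand in for the SearXNG query and the scrape step.

```python
from typing import Callable


def search_and_scrape(
    query: str,
    fetch_results: Callable[[str], list[dict]],
    scrape_url: Callable[[str], str],
    limit: int = 5,
) -> list[dict]:
    """Illustrative only: ask a search backend for URLs, then scrape the top hits."""
    hits = fetch_results(query)[:limit]
    pages = []
    for hit in hits:
        url = hit["url"]
        pages.append(
            {"url": url, "title": hit.get("title", ""), "markdown": scrape_url(url)}
        )
    return pages


# Stub backends so the sketch runs without any network access.
def fake_search(query: str) -> list[dict]:
    return [{"url": f"https://example.com/{i}", "title": f"Result {i}"} for i in range(10)]


def fake_scrape(url: str) -> str:
    return f"# Content of {url}"


results = search_and_scrape("what is firecrawl", fake_search, fake_scrape, limit=3)
```

The point of the single endpoint is that a caller never has to run this loop itself: one request returns the cleaned content for every top result.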

## Why this repo exists

Firecrawl's official repo is a large monorepo that assumes you're building from source. SearXNG has its own separate docker-compose setup. To deploy both on Dokploy, you need compose files that:

1. Use pre-built images instead of `build:` directives
2. Join the `dokploy-network` for Traefik routing
3. Include Traefik labels for automatic HTTPS
4. Use Docker named volumes for persistence
5. Avoid explicit `container_name` declarations (they break Dokploy logging)
6. Wire SearXNG into Firecrawl via internal Docker networking

Rather than forking both repos, this repo contains only the compose file, a SearXNG settings file, and this README. When either project publishes new images, you bump a version tag — no merge conflicts, no carrying source code you don't touch.
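
Some of the compose requirements listed above can be checked mechanically. The sketch below covers only items 1, 2, and 5 and takes one service definition as a plain dict, as a YAML parser would produce it. `lint_service` is a hypothetical helper and not part of this repo, and the rule that only services with Traefik labels need `dokploy-network` is an assumption.

```python
def lint_service(name: str, service: dict) -> list[str]:
    """Return human-readable problems for one compose service definition."""
    problems = []
    if "build" in service:
        problems.append(f"{name}: has a build: directive, expected a pre-built image")
    if "image" not in service:
        problems.append(f"{name}: no image: set")
    if "container_name" in service:
        problems.append(f"{name}: sets container_name, which breaks Dokploy logging")
    # labels may be a list of "key=value" strings or a mapping;
    # iterating either one yields strings we can prefix-match.
    labels = service.get("labels") or []
    routed = any(str(label).startswith("traefik.") for label in labels)
    if routed and "dokploy-network" not in (service.get("networks") or []):
        problems.append(f"{name}: has Traefik labels but is not on dokploy-network")
    return problems


good = {
    "image": "ghcr.io/firecrawl/firecrawl",
    "networks": ["dokploy-network"],
    "labels": ["traefik.enable=true"],
}
bad = {
    "build": "./apps/api",              # hypothetical build context
    "container_name": "firecrawl-api",  # hypothetical name
    "labels": ["traefik.enable=true"],
}

good_report = lint_service("api", good)
bad_report = lint_service("api", bad)
```

Here `good_report` is empty, while `bad_report` flags the `build:` directive, the missing image, the `container_name`, and the missing network.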

## Architecture

```
Internet ──► Traefik ──► Firecrawl API (:3002)
                                   │
                 ┌─────────────────┼─────────────────┐
                 ▼                 ▼                 ▼
            Playwright          SearXNG            Redis
              (:3000)           (:8080)           (:6379)
                                   │
                                   ▼
                          Google / Bing / DDG
```

All services communicate over an internal Docker network. Only the Firecrawl API is exposed to the internet via Traefik. SearXNG is internal-only by default (you can optionally expose it via its own domain).

## Services

| Service | Image | Purpose | Resources |
|---|---|---|---|
| `api` | `ghcr.io/firecrawl/firecrawl` | Main API + workers | 4 CPU / 8 GB |
| `playwright-service` | `ghcr.io/firecrawl/playwright-service` | Headless browser for JS pages | 2 CPU / 4 GB |
| `searxng` | `searxng/searxng` | Metasearch engine | minimal |
| `redis` | `redis:alpine` | Queues, rate limiting, cache | minimal |

## Deploying on Dokploy

### 1. Prerequisites

- A Dokploy instance
- A DNS A record pointing your subdomain (e.g. `firecrawl.yourdomain.com`) to your server

### 2. Create the service

1. In Dokploy, create a new **Compose** service (type: Docker Compose)
2. Connect this GitHub repo as the source
3. Set the **Compose Path** to `./docker-compose.yml`
4. Set the **branch** to `main`

### 3. Configure environment variables

In Dokploy's environment variable editor, set:

```env
# Required — domain Traefik routes to the Firecrawl API
FIRECRAWL_DOMAIN=firecrawl.yourdomain.com

# Recommended — protect your API with a key
TEST_API_KEY=fc-your-secret-key

# Recommended — change the Bull queue dashboard admin key
BULL_AUTH_KEY=something-secure
```
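
Before deploying, it can help to sanity-check these values. The following is a minimal sketch: `check_env` is a hypothetical helper, the placeholder strings are the example values shown above, and the hostname check assumes the compose file substitutes `FIRECRAWL_DOMAIN` into a Traefik host rule.

```python
# Example values from this README that should not reach production.
PLACEHOLDERS = {
    "TEST_API_KEY": "fc-your-secret-key",
    "BULL_AUTH_KEY": "something-secure",
}


def check_env(env: dict[str, str]) -> list[str]:
    """Return warnings for missing or placeholder environment values."""
    warnings = []
    domain = env.get("FIRECRAWL_DOMAIN", "")
    if not domain:
        warnings.append("FIRECRAWL_DOMAIN is required")
    elif "/" in domain:
        # Catches both "https://..." prefixes and trailing paths.
        warnings.append("FIRECRAWL_DOMAIN should be a bare hostname")
    for key, placeholder in PLACEHOLDERS.items():
        value = env.get(key, "")
        if not value:
            warnings.append(f"{key} is not set")
        elif value == placeholder:
            warnings.append(f"{key} still has the example value")
    return warnings


example_warnings = check_env(
    {
        "FIRECRAWL_DOMAIN": "https://firecrawl.yourdomain.com",
        "TEST_API_KEY": "fc-your-secret-key",
    }
)
```

For the example input, the function reports the scheme in the domain, the unchanged `TEST_API_KEY`, and the missing `BULL_AUTH_KEY`.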

### 4. Deploy

Hit deploy. Dokploy pulls the images, creates the containers, and Traefik generates SSL certificates. Give it ~30 seconds.

Your API is now live at `https://firecrawl.yourdomain.com`.

### 5. Test it

```bash
# Search the web (SearXNG → Firecrawl scrape → clean markdown)
curl -X POST https://firecrawl.yourdomain.com/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-your-secret-key' \
  -d '{"query": "what is firecrawl", "limit": 5}'

# Scrape a single page
curl -X POST https://firecrawl.yourdomain.com/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer fc-your-secret-key' \
  -d '{"url": "https://example.com"}'
```
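
The same two calls can be made from Python using only the standard library. This is a minimal sketch: `build_request` and `call` are hypothetical helper names, and the domain and key are the placeholders used throughout this README.

```python
import json
import urllib.request

BASE_URL = "https://firecrawl.yourdomain.com"  # placeholder domain
API_KEY = "fc-your-secret-key"                 # placeholder key


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for a /v1/<endpoint> route."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def call(endpoint: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the requests needs no network; call(...) would hit the deployment.
search_req = build_request("search", {"query": "what is firecrawl", "limit": 5})
scrape_req = build_request("scrape", {"url": "https://example.com"})
```

Keeping request construction separate from sending makes the payloads easy to inspect or log before anything touches the live API.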

## SearXNG configuration

The `searxng/settings.yml` file in this repo is pre-configured for API use:

- **JSON format enabled** — required for Firecrawl to consume results (disabled by default in SearXNG)
- **Rate limiter disabled** — since SearXNG is only accessed internally by Firecrawl, not by the public internet
- **Engines enabled** — Google, Bing, DuckDuckGo, Wikipedia, GitHub

You can customize the engines, categories, and other settings by editing `searxng/settings.yml` and redeploying. See the [SearXNG documentation](https://docs.searxng.org/admin/settings/index.html) for all available options.

**Important:** Change the `secret_key` value in `searxng/settings.yml` before deploying to production.
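
One way to produce a replacement value is Python's `secrets` module. This is a sketch; `new_secret_key` is a hypothetical helper name, and any sufficiently long random string would do.

```python
import secrets


def new_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)


# Paste the printed value into the secret_key field of searxng/settings.yml.
print(new_secret_key())
```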

## Optional: exposing SearXNG publicly

By default SearXNG is only accessible internally. If you also want a public search UI, uncomment the Traefik labels on the `searxng` service in `docker-compose.yml` and add `SEARXNG_DOMAIN` to your env vars.

## Optional: AI extraction features

Firecrawl can use an LLM for structured data extraction via its `/v1/extract` endpoint. Add one of these to your env vars:

```env
# OpenAI
OPENAI_API_KEY=sk-your-key

# Or a local Ollama instance accessible from the server
OLLAMA_BASE_URL=http://host.docker.internal:11434/api
MODEL_NAME=deepseek-r1:7b
```

## Resource requirements

The default limits match Firecrawl's official recommendations. Total: roughly 6 CPU cores and 12 GB RAM. For light personal use you can lower these — edit the `cpus` and `mem_limit` values in `docker-compose.yml`. The API runs fine on 2 cores / 4 GB and Playwright on 1 core / 2 GB, but expect slower scraping on JS-heavy sites.

SearXNG and Redis are lightweight and don't need explicit resource limits.
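
The totals follow directly from the per-service limits. The short sketch below does the arithmetic for the default limits and for the reduced values mentioned above. The profile names and the `totals` helper are illustrative only, and SearXNG and Redis are left out because they carry no explicit limits.

```python
# (CPU cores, memory in GB) per service
DEFAULT = {"api": (4, 8), "playwright-service": (2, 4)}
LIGHT = {"api": (2, 4), "playwright-service": (1, 2)}


def totals(profile: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum CPU cores and memory across all services in a profile."""
    cpus = sum(cpu for cpu, _ in profile.values())
    mem_gb = sum(mem for _, mem in profile.values())
    return cpus, mem_gb


default_total = totals(DEFAULT)  # (6, 12)
light_total = totals(LIGHT)      # (3, 6)
```

The reduced profile halves the limits for the two heavy services, to about 3 cores and 6 GB.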

## Updating

Redeploy from Dokploy to pull the latest images. The compose file uses `latest` tags by default. To pin versions for stability, replace image tags with specific releases (e.g. `ghcr.io/firecrawl/firecrawl:v1.x.x`). Check Firecrawl's [releases page](https://github.com/firecrawl/firecrawl/releases) for available tags.
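
To spot images that still float on `latest`, a small check over the image references is enough. This is a sketch with hypothetical helper names; it relies on Docker's rule that a reference without a tag means `latest`, and it treats digest-pinned references as fixed.

```python
def image_tag(image: str) -> str:
    """Return the tag of an image reference, defaulting to 'latest'."""
    name = image.split("@", 1)[0]        # drop any @sha256:... digest
    last_part = name.rsplit("/", 1)[-1]  # ignore registry host:port
    return last_part.split(":", 1)[1] if ":" in last_part else "latest"


def is_floating(image: str) -> bool:
    """True when the reference resolves to the moving 'latest' tag."""
    if "@" in image:
        return False  # pinned by digest
    return image_tag(image) == "latest"


images = [
    "ghcr.io/firecrawl/firecrawl",
    "ghcr.io/firecrawl/playwright-service:latest",
    "searxng/searxng",
    "redis:alpine",
]
floating = [image for image in images if is_floating(image)]
```

Note that `redis:alpine` is not reported, although `alpine` is itself a moving tag; the check only catches `latest`.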

Redis data and SearXNG cache persist in named volumes across redeployments.

## Troubleshooting

**`/v1/search` returns empty results or errors:** SearXNG might not be ready yet, or the upstream search engines might be rate-limiting your server's IP. Check the SearXNG logs in Dokploy, then try changing the enabled engines in `searxng/settings.yml`.

**Playwright timeouts on JS-heavy sites:** The default 4 GB memory limit for Playwright might not be enough. Increase `mem_limit` on the `playwright-service`.

**403 errors from SearXNG:** Make sure `json` is listed under `search.formats` in `searxng/settings.yml`. Without it, SearXNG rejects requests for `format=json`.

**Redis connection refused:** Check that the Redis container is healthy in Dokploy's dashboard. The API won't start until Redis passes its healthcheck.
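
For the two SearXNG cases above, it helps to query SearXNG's JSON API directly from inside the Docker network. The sketch below only builds the probe URL and interprets the outcome. `searxng_probe_url` and `diagnose` are hypothetical helpers, and `http://searxng:8080` is an assumed internal address derived from the service name and port shown in this README.

```python
from urllib.parse import urlencode


def searxng_probe_url(base_url: str, query: str = "test") -> str:
    """Build a SearXNG search URL that asks for JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def diagnose(status: int, result_count: int = 0) -> str:
    """Map a probe outcome to the most likely cause from this section."""
    if status == 403:
        return "JSON format is probably not enabled in searxng/settings.yml"
    if status == 200 and result_count == 0:
        return "SearXNG responds, but the engines return nothing; try other engines"
    if status == 200:
        return "SearXNG looks healthy"
    return "Unexpected status; check the SearXNG logs in Dokploy"


probe = searxng_probe_url("http://searxng:8080/")  # assumed internal address
```

Requesting the probe URL from a container on the same network, then passing the status code and the length of the `results` array to `diagnose`, separates a configuration problem from an upstream rate limit.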

## File structure

```
.
├── docker-compose.yml        # All four services, Dokploy-ready
├── searxng/
│   └── settings.yml          # SearXNG config (JSON API enabled, limiter off)
└── README.md
```