| author | Adam Malczewski <[email protected]> | 2026-06-28 09:04:53 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-06-28 09:04:53 +0900 |
| commit | e3a85ab476cdfaa49982078aba3b6792331185cb (patch) | |
| tree | 85e9d1ab01483bbbd608a9db80bf6f8263c5e421 /notes | |
| parent | 73ff84c606f5307e5f40e649cae9f93484c0d99d (diff) | |
docs(memory-leak): comprehensive handoff for OpenCode investigation
Diffstat (limited to 'notes')
| -rw-r--r-- | notes/memory-leak-investigation-handoff.md | 331 |
1 files changed, 331 insertions, 0 deletions
diff --git a/notes/memory-leak-investigation-handoff.md b/notes/memory-leak-investigation-handoff.md
new file mode 100644
index 0000000..3dc4ad0
--- /dev/null
+++ b/notes/memory-leak-investigation-handoff.md
@@ -0,0 +1,331 @@
+# Memory-Leak Investigation — Handoff for OpenCode Agent
+
+> **Purpose:** This document gives an OpenCode harness agent — running outside
+> the Dispatch system, with no prior context — everything needed to
+> independently investigate the memory leak that is crashing the production
+> Dispatch server. Read it top to bottom before touching anything.
+
+**Last updated:** 2026-06-28
+**Author of this handoff:** umans/umans-glm-5.2 (Dispatch agent)
+**Originating investigation:** `notes/server-crash-investigation.md` (commit `b83aa8d`)
+
+---
+
+## 0. TL;DR — what you are here to do
+
+The production Dispatch server leaks memory at **~2.5 GB/hour** and eventually
+hits a **Bun runtime segfault** (native crash, not a JS exception). A prior
+investigation already ruled out LSP as the cause. Your job is to **find where
+memory is being retained** and propose a fix. There are four undeployed commits
+on the `dev` branch that add memory telemetry and a cgroup circuit breaker —
+**those need to be built, installed, and observed first** to get the data that
+will localize the leak. Do not start code-diving blindly; deploy the telemetry,
+collect a crash cycle, then read the logs.
+
+---
+
+## 1. System Overview — what Dispatch is
+
+Dispatch is an **AI agent orchestration platform**: a backend that runs one or
+more AI "agents" through turns (each turn = a step loop of model calls +
+tool dispatch), plus a separate frontend that consumes the backend's typed
+contracts over HTTP + WebSocket.
+
+**Repo layout** — all under `/home/tradam/projects/dispatch/`:
+
+| Path | What | Git remote | Branch |
+|---|---|---|---|
+| `backend/` | The server (this codebase) | `[email protected]:realtradam/dispatch.git` | `dev` |
+| `frontend/` | The web UI (separate repo, same methodology) | `[email protected]:realtradam/frontend.git` | `dev` |
+| `worktrees/` | Per-task git worktrees for isolated feature branches | (varies) | (varies) |
+| `bin/` *(inside backend)* | Operational shell scripts (build, install, up, secrets, certs) | — | — |
+
+Both repos use **`dev` as the active development branch**. The backend is a
+**Bun + TypeScript** monorepo (project references via `tsc -b`, Biome for
+lint/format, Vitest for tests, `bun:sqlite` for storage, the `ai` /
+`@ai-sdk/*` packages for model providers). Architecture rules are in
+`backend/AGENTS.md` — **read it** before editing (especially the "kernel
+touches no I/O" and "no ambient state" rules).
+
+---
+
+## 2. Where things are installed (production)
+
+| Artifact | Path | Notes |
+|---|---|---|
+| Production server binary | `/usr/bin/dispatch-server` | Bun **standalone compiled** binary. Currently the OLD one (built Jun 27 21:40). Not yours to install — see §6. |
+| CLI binary | `/usr/bin/dispatch` | `dispatch send/list/read/...` |
+| Frontend static files | `/usr/share/dispatch/web/` | Built by `bin/build` (Vite, `VITE_HTTP_PORT=24991 VITE_WS_PORT=24990`) |
+| Config / env file | `/etc/dispatch/env` | `EnvironmentFile` for systemd. **Root-readable only** (`Permission denied` for non-root). Source template in repo: `systemd/dispatch.env` (installed by `bin/setup-env`). |
+| Data directory | `/var/lib/dispatch/` | WorkingDirectory of the service. SQLite DBs live here: `dispatch.db`, `dispatch.db-shm`, `dispatch.db-wal`. |
+| Logs | the journal | `journalctl -u dispatch` — there are no log files on disk; everything goes to journald. |
+| systemd unit (live) | `/etc/systemd/system/dispatch.service` | Installed from the repo template `systemd/dispatch.service` by `bin/install`. |
+| systemd unit (repo template) | `backend/systemd/dispatch.service` | **Edit this one**, not the live file — `bin/install` copies it over. |
+| Install script | `backend/bin/install` | Builds + installs the binary, unit, env, frontend. Uses `sudo` on privileged lines only (the agent has no sudo — hand scripts to the user). |
+| Build script | `backend/bin/build` | `tsc --build` then `bun build --compile` → `dist/dispatch-server`. |
+| Memory-limits script | `backend/bin/apply-memory-limits.sh` | Applies `MemoryHigh`/`MemoryMax` to the live service. **User has NOT run it yet.** |
+
+---
+
+## 3. Production ports
+
+| Service | Port | Confirmed |
+|---|---|---|
+| HTTP API | **24991** | `BACKEND_PORT=24991` in `systemd/dispatch.env` (set by `bin/setup-env`). Live: `ss -tlnp` shows `dispatch-server` listening on `*:24991`. |
+| Surface WebSocket | **24990** | `SURFACE_WS_PORT`. Live: `ss -tlnp` shows `dispatch-server` listening on `*:24990`. |
+
+The **dev** stack (`bin/up`) uses different ports: **24203** (HTTP) + **24205** (WS) + **24204** (frontend Vite dev server). Don't confuse dev ports with prod ports.
+
+**Health check:** `curl http://localhost:24991/health` → `{"ok":true}`
+
+---
+
+## 4. How to access the running server
+
+```bash
+# Is it running?
+systemctl status dispatch
+
+# Live logs (follow)
+journalctl -u dispatch -f
+
+# Recent crashes / errors (last 2h)
+journalctl -u dispatch --since '2 hours ago' -n 200
+
+# Memory telemetry — ONLY visible once the new binary is deployed (see §6).
+# The EXACT log tags to grep (NOT "[mem-telemetry]" — that is the module name):
+journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'
+
+# Restart (needs root — hand to the user)
+sudo systemctl restart dispatch
+```
+
+> **The service is named `dispatch`, NOT `dispatch-server`.** The binary is
+> `dispatch-server`; the systemd unit is `dispatch.service`. This mismatch
+> tripped up the first investigation — don't repeat it.
+
+The backend checkout at `/home/tradam/projects/dispatch/backend/` is on `dev`
+and contains the source. The running binary is compiled, so **editing source
+does nothing until you rebuild + install + restart** (§6).
+
+---
+
+## 5. The memory leak — what we know
+
+### 5.1 Symptom & crash signature
+
+- The server grows **~2.5 GB/hour** under load, eventually triggering a **Bun
+  runtime segfault** (native crash).
+- **Last crash:** Jun 28 00:12 JST.
+  ```
+  panic(main thread): Segmentation fault at address 0x0
+  oh no: Bun has crashed. This indicates a bug in Bun, not your code.
+  Main process exited, code=dumped, status=4/ILL
+  ```
+  RSS was **6.2 GB** at crash (over 2h31m uptime). A prior session hit
+  **25.7 GB over 18h**.
+- The crash is a **native segfault inside Bun's runtime** (NULL deref in the
+  allocator/GC), NOT a JS exception. The crash-time memory readout was itself
+  corrupted (`RSS: 0.02ZB` — impossible), and systemd saw SIGILL while Bun
+  reported SIGSEGV — both signs of **heap corruption under memory pressure**.
+- This is the **only** segfault in 5 days of logs. Earlier crashes (Jun 27
+  ~02:44–02:52) were `exit-code` failures = application-level LSP bugs, **now
+  fixed** (see §7).
+
+### 5.2 What it is NOT
+
+- **NOT LSP.** The Language Server Protocol extension was the original prime
+  suspect; it is exonerated. Its caches are bounded (`MAX_OPEN_DOCUMENTS = 50`,
+  `MAX_PUSH_DIAGNOSTICS = 100` — see `packages/lsp/src/client.ts:153` and
+  `diagnostics.ts:47`). 50 docs + 100 diag entries cannot explain gigabytes.
+- **NOT the conversation store.** `packages/conversation-store` is
+  **SQLite-backed** (all reads/writes go through a `StorageNamespace` to
+  `bun:sqlite`) — it does not hold conversations in an in-memory map.
+- **NOT `activeConversations`.** The session-orchestrator's
+  `activeConversations` is just a `Set<string>` of conversation IDs (tiny).
+
+### 5.3 Prime suspects (not yet confirmed — that's your job)
+
+1. **The AI-SDK streaming path** — the `async` generator returned by
+   `provider.stream(...)` (`@ai-sdk/*`, including the `openai-stream` package).
+   Streaming response buffers may be retained if the generator is not fully
+   drained or if the SDK holds internal buffers per request.
+2. **Per-turn message arrays** in `session-orchestrator` — the history +
+   `userMsg` + `providerMessages` arrays assembled before each
+   `provider.stream(...)` call. For long multi-step agent turns these can be
+   large; if references outlive the turn (closures, unresolved promises, event
+   listeners), they leak.
+3. **A Bun bug fixed in 1.3.14.** The crash was on Bun **v1.3.13**. Bun 1.3.14
+   has been built locally but not installed (§6) — deploying it may itself
+   resolve the segfault even if a leak remains.
+
+### 5.4 Key files to examine (the leak-suspect code path)
+
+| File | What's there | Why it matters |
+|---|---|---|
+| `packages/session-orchestrator/src/orchestrator.ts` | `runTurnDetached()` at **line 541** — kicks off a turn; the `activeConversations.add/delete` around it (556, 979) brackets the turn lifecycle. | Turn lifecycle: where memory should be acquired and released. If a turn's buffers aren't released here, it leaks per-turn. |
+| `packages/session-orchestrator/src/orchestrator.ts` | `for await (const event of provider.stream(messages, assembled.tools, providerOpts))` at **line 1276** (and a second one at **1430** for compaction/summary). | **The streaming loop** — the #1 suspect. This is where the AI-SDK async generator is consumed. If the generator is abandoned mid-stream (error, `break`, abort) or its internal buffers are retained, memory grows per turn. |
+| `packages/session-orchestrator/src/pure.ts` | The pure turn/step decision logic + the `MemorySample`/`memoryDelta`/`memorySampleAttributes` helpers used by telemetry. | The per-turn before/after sampling wraps the stream boundary (referenced by `mem-telemetry.ts`). |
+| `packages/kernel/src/runtime/run-turn.ts` | The **step loop** (`runTurn`) — one agent turn = N steps of (model call → tool dispatch → feed results back). | MAX_STEPS is disabled (`0 = unlimited`, commit `e8b4bf1`), so a single turn can run many steps. Many steps ⇒ many accumulated message arrays ⇒ more retention surface. |
+| `packages/kernel/src/runtime/dispatch.ts` | **Tool dispatch** (the `maxConcurrent`/`eager` tool-execution loop). | Tool results accumulate into the turn's message history. A rogue tool returning huge payloads could balloon arrays. |
+| `packages/host-bin/src/mem-telemetry.ts` | The periodic telemetry edge effect. | Read this to understand exactly what `memory:periodic` / `memory:gc` log entries contain (rss, heapUsed, heapTotal, external, arrayBuffers, activeConversations, reclaimedRssMB). |
+
+> **Architecture note (AGENTS.md):** the kernel (`packages/kernel`) touches NO
+> I/O and names no concrete feature. Decision logic is pure; effects are
+> injected at the edges (host-bin). When investigating, keep this layering in
+> mind — a leak "in the kernel" would actually be in a closure captured by an
+> edge that feeds the kernel, not in the kernel's own (stateless) turn loop.
+
+---
+
+## 6. What's been done (committed to `dev`, NOT yet deployed)
+
+The running binary is still the **old one** (built Jun 27 21:40, pre-telemetry,
+pre-Bun-1.3.14, pre-LSP-disable). Four commits on `dev` need building +
+installing before any of this takes effect. **Verify before you trust:** the
+live systemd still reports `MemoryMax=infinity` (confirmed via
+`systemctl show dispatch -p MemoryMax`).
+
+| Commit | What | Status |
+|---|---|---|
+| `d1de9ed` | `systemd/dispatch.service` gets `MemoryHigh=20G` + `MemoryMax=24G` circuit breaker. | Committed. **Live = infinity (not applied).** |
+| `b3b6eb4` | `bin/apply-memory-limits.sh` — applies the limits to the live service. | Committed. **User has NOT run it.** |
+| `1cd66da` | Memory telemetry: periodic `process.memoryUsage()` every 15s, per-turn before/after sampling, `activeConversations` tagging, `Bun.gc(true)` every 5 min. Files: `packages/host-bin/src/mem-telemetry.ts` + `packages/session-orchestrator/src/pure.ts`. | Committed. **NOT deployed.** |
+| `73ff84c` | **LSP disabled** as a precaution (`main.ts` import commented out + removed from `CORE_EXTENSIONS`; `transport-http` tolerates its absence). Also set telemetry interval 60s→15s. | Committed. **NOT deployed.** LSP stays disabled until the leak is understood. |
+| (bun upgrade) | Bun **1.3.13 → 1.3.14** via `bun upgrade`. New binary built at `dist/dispatch-server`. | Rebuilt. **NOT installed** to `/usr/bin/dispatch-server`. |
+
+### 6.1 The telemetry data you will collect (once deployed)
+
+The exact log tags to grep in journald are **`memory:periodic`** and
+**`memory:gc`** (the module is `mem-telemetry.ts`, but the log *messages* are
+those two strings — do not grep for `[mem-telemetry]`).
+
+- `memory:periodic` — every 15s: `rss`, `heapUsed`, `heapTotal`, `external`,
+  `arrayBuffers` (bytes), plus `activeConversations` (count). Correlate RSS
+  growth with active turns: if RSS climbs while `activeConversations` is 0,
+  the leak is background/idle (timers, watchers, retained closures); if it
+  climbs only during/after turns, it's the streaming/turn path.
+- `memory:gc` — every 5 min: runs `Bun.gc(true)` and logs RSS before/after +
+  `reclaimedRssMB` (`before - after`). **Interpretation:** a large positive
+  `reclaimedRssMB` = GC freed it (fragmentation, reclaimable — not a true
+  leak). Near-zero/negative `reclaimedRssMB` = memory is **LIVE** (retained
+  objects — a real leak). This single field distinguishes the two failure
+  modes; check it first when you get data.
+
+### 6.2 Deploy sequence (hand to the user — you have no sudo)
+
+The agent cannot run `sudo` or install to `/usr/bin`. Write a script with
+`sudo` on the privileged lines and hand it to the user (per the repo's sudo
+policy). The sequence, roughly:
+
+1. From `backend/`: `bun run typecheck && bun run test && bun run check` —
+   make sure `dev` is green (it should be; these commits are committed).
+2. `bin/build` — produces `dist/dispatch-server` (Bun 1.3.14, with telemetry +
+   LSP-disabled).
+3. `bin/install` (or a targeted script) — copies `dist/dispatch-server` →
+   `/usr/bin/dispatch-server`, the unit template → `/etc/systemd/system/`,
+   `systemd/dispatch.env` → `/etc/dispatch/env`. (Or run `bin/apply-memory-limits.sh`
+   separately if you only want the cgroup limits without a full reinstall.)
+4. `sudo systemctl daemon-reload && sudo systemctl restart dispatch`.
+5. Confirm: `systemctl show dispatch -p MemoryMax` should show `24G`, and
+   `journalctl -u dispatch -f | rg 'memory:periodic'` should show telemetry
+   within 15s.
+
+> ⚠️ **Warning:** deploying restarts the production server. The live server is
+> actively serving agent conversations. Coordinate with the user (ceb2 /
+> tradam) before restarting — don't just yank it. Also note the running server
+> at time of writing is PID 28956 (started Jun 28 08:51).
+
+---
+
+## 7. Background — the original crash investigation (already done)
+
+You do **not** need to redo this; it's recorded for context.
+
+- **Original report:** `notes/server-crash-investigation.md` (commit `b83aa8d`).
+- The first investigation's task description was **stale**: it claimed an
+  "uncommitted hot-fix that disables LSP" existed. It did not — the working
+  tree was clean and LSP was *enabled*. The developer then *chose* to disable
+  LSP (commit `73ff84c`) as a precaution, but the root cause was already
+  identified as the Bun segfault + leak, not LSP.
+- The **three original LSP suspects** (JSON-parse TypeError, `fs.watch` ENOENT,
+  unbounded cache leak) were all fixed in commits `05ff256` + `f9d1ca5` and
+  are correct/tested. LSP being disabled now is a precaution, not a fix for the
+  crash. Once the leak is understood, LSP can be re-enabled safely.
+- **Do not spend time re-investigating LSP.** It is exonerated.
+
+---
+
+## 8. What needs investigation (your tasks, in order)
+
+1. **Deploy the telemetry + cgroup limits** (§6) — you cannot localize the leak
+   without the `memory:periodic` / `memory:gc` data. Get a full crash cycle's
+   worth of logs (let it run until it either hits `MemoryMax=24G` and restarts,
+   or until the growth pattern is clear).
+2. **Read the telemetry first.** The `memory:gc` `reclaimedRssMB` field tells
+   you live-retained vs fragmentation. If near-zero after GC → real retained
+   leak; chase which subsystem holds it.
+3. **Correlate growth with `activeConversations`.** Idle-growth ⇒ timers /
+   watchers / retained closures / the LSP file-watcher (if any servers still
+   spawn — but LSP is disabled). Turn-bound growth ⇒ the streaming/turn path
+   (suspects in §5.4).
+4. **Examine the streaming loop** (`orchestrator.ts:1276`): is the
+   `provider.stream(...)` async generator fully drained on every path
+   (success, error, abort, `break`)? Does the AI SDK retain response buffers?
+   Consider adding a before/after `memoryUsage()` around a single turn to
+   measure per-turn delta directly.
+5. **Examine the message arrays** assembled before `provider.stream` — are
+   they scoped to the turn and released when the turn completes, or do
+   closures/promises/event-listeners keep them alive?
+6. **Test Bun 1.3.14 in isolation.** Since the upgrade is built but not
+   installed, see if a memory-stress repro under 1.3.13 vs 1.3.14 behaves
+   differently — the segfault may be a Bun allocator bug that 1.3.14 fixes.
+7. **Propose a fix** (do not implement without coordinating — report up to the
+   orchestrator). Likely candidates: ensure the streaming generator is always
+   fully consumed / explicitly closed on error; bound per-turn message
+   history; release references at `activeConversations.delete` time; or
+   confirm it's purely a Bun bug requiring the 1.3.14 deploy.
+
+---
+
+## 9. Quick-reference commands
+
+```bash
+# --- status & logs ---
+systemctl status dispatch
+systemctl show dispatch -p MemoryMax -p MemoryHigh  # expect 24G/20G after deploy
+journalctl -u dispatch -f  # live logs
+journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'  # telemetry (after deploy)
+journalctl -u dispatch --since '2 hours ago' -n 200  # recent history / crashes
+curl http://localhost:24991/health  # → {"ok":true}
+ss -tlnp | rg dispatch  # ports 24991 + 24990
+
+# --- source (the leak path) ---
+cd /home/tradam/projects/dispatch/backend
+git log --oneline -n 8  # see the undeployed commits
+rg -n 'runTurnDetached|provider\.stream|for await' packages/session-orchestrator/src/orchestrator.ts
+
+# --- build & verify (NO sudo needed) ---
+bun run typecheck && bun run test && bun run check
+bin/build  # → dist/dispatch-server
+
+# --- deploy (NEEDS sudo — hand a script to the user) ---
+# bin/install (or: sudo cp dist/dispatch-server /usr/bin/dispatch-server)
+# bin/apply-memory-limits.sh (applies MemoryMax to live service)
+# sudo systemctl daemon-reload && sudo systemctl restart dispatch
+```
+
+---
+
+## 10. Guardrails
+
+- **Do NOT edit implementation code without reporting to the orchestrator.**
+  This is an investigation handoff; propose fixes, don't apply them
+  unilaterally.
+- **You have no sudo.** Any step touching `/usr/bin`, `/etc/`, or
+  `systemctl restart` must be a script handed to the user with `sudo` on the
+  privileged lines.
+- **Do not re-enable LSP** until the leak is understood (it's disabled as a
+  precaution; re-enabling is a separate decision).
+- **Do not merge or push.** Work stays on `dev` locally.
+- **The service is `dispatch`, not `dispatch-server`.**
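
Editor's sketch (not part of the commit): the interpretation rules in §6.1 of the handoff can be written down as two small pure functions, which may help when reading a crash cycle's worth of telemetry. Everything here is an assumption made for illustration: the `PeriodicSample` and `GcSample` shapes, the function names, and the thresholds (64 MB of growth, an 80% share, 100 MB reclaimed) are hypothetical and do not come from `mem-telemetry.ts` or `pure.ts`. Parsing the journald lines into these shapes is left out because the exact log format is not shown in the handoff.

```typescript
// Hypothetical shapes: field names follow the handoff's description of the
// `memory:periodic` and `memory:gc` entries; the real log format may differ.
interface PeriodicSample {
  rssBytes: number;
  activeConversations: number;
}

interface GcSample {
  rssBeforeBytes: number;
  rssAfterBytes: number;
}

type GrowthKind = "idle" | "turn-bound" | "mixed" | "none";
type GcVerdict = "reclaimable" | "live-retained";

const MB = 1024 * 1024;

// Attribute every RSS increase between consecutive samples to "idle" (no
// conversations active at either endpoint) or "active" (anything else), then
// report which bucket dominates. Thresholds are arbitrary assumptions.
function classifyGrowth(
  samples: PeriodicSample[],
  minGrowthBytes: number = 64 * MB,
): GrowthKind {
  let idle = 0;
  let active = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const cur = samples[i];
    const delta = cur.rssBytes - prev.rssBytes;
    if (delta <= 0) continue; // only growth is of interest here
    if (prev.activeConversations === 0 && cur.activeConversations === 0) {
      idle += delta;
    } else {
      active += delta;
    }
  }
  const total = idle + active;
  if (total < minGrowthBytes) return "none";
  if (idle / total >= 0.8) return "idle";
  if (active / total >= 0.8) return "turn-bound";
  return "mixed";
}

// Same definition as the handoff: before minus after, in MB.
function reclaimedRssMB(gc: GcSample): number {
  return (gc.rssBeforeBytes - gc.rssAfterBytes) / MB;
}

// Large positive reclaim => GC could free it; near-zero or negative => the
// memory is still referenced. The 100 MB cut-off is an assumption.
function classifyGc(gc: GcSample, thresholdMB: number = 100): GcVerdict {
  return reclaimedRssMB(gc) >= thresholdMB ? "reclaimable" : "live-retained";
}
```

Used together, an `"idle"` growth kind with a `"live-retained"` GC verdict would point toward timers, watchers, or retained closures, while `"turn-bound"` plus `"live-retained"` would point toward the streaming/turn path named in §5.4. The thresholds would need tuning against real data before drawing conclusions.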

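
Editor's sketch (not part of the commit): task 4 in §8 asks whether the `provider.stream(...)` async generator is closed on every exit path. The snippet below illustrates the general language behavior only. `fakeStream` is a hypothetical stand-in, and whether the real provider stream is a plain async generator with cleanup in `finally` is an assumption that has to be checked against the orchestrator source.

```typescript
// Stand-in for a provider stream. The `finally` block represents whatever
// cleanup (buffers, sockets) the real generator would perform when closed.
async function* fakeStream(log: string[]): AsyncGenerator<number, void, undefined> {
  try {
    for (let i = 0; i < 1000; i++) {
      yield i;
    }
  } finally {
    // Runs when the generator completes or when a consumer calls return().
    log.push("closed");
  }
}

// Case 1: `for await` with an early `break`. The loop itself calls return()
// on the iterator when it exits early, so the cleanup runs.
async function breakEarly(log: string[]): Promise<number> {
  let seen = 0;
  for await (const _event of fakeStream(log)) {
    seen++;
    if (seen === 3) break;
  }
  return seen;
}

// Case 2: manual iteration that simply stops pulling. The generator stays
// suspended at its last `yield` and the cleanup never runs.
async function abandon(log: string[]): Promise<void> {
  const it = fakeStream(log);
  await it.next();
}

// Case 3: manual iteration with an explicit close, which is the safe pattern
// whenever the iterator is driven by hand instead of by `for await`.
async function manualWithClose(log: string[]): Promise<void> {
  const it = fakeStream(log);
  try {
    await it.next();
  } finally {
    await it.return(undefined);
  }
}
```

If the streaming loops at `orchestrator.ts:1276` and `1430` are plain `for await` loops, case 1 suggests early exits already close the generator, so attention would shift to paths where the iterator is held or driven manually (case 2), or to buffers retained inside the SDK itself.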