author Adam Malczewski <[email protected]> 2026-06-28 09:04:53 +0900
committer Adam Malczewski <[email protected]> 2026-06-28 09:04:53 +0900
commit e3a85ab476cdfaa49982078aba3b6792331185cb
tree 85e9d1ab01483bbbd608a9db80bf6f8263c5e421 /notes
parent 73ff84c606f5307e5f40e649cae9f93484c0d99d
docs(memory-leak): comprehensive handoff for OpenCode investigation
Diffstat (limited to 'notes')
-rw-r--r-- notes/memory-leak-investigation-handoff.md | 331
1 files changed, 331 insertions, 0 deletions
diff --git a/notes/memory-leak-investigation-handoff.md b/notes/memory-leak-investigation-handoff.md
new file mode 100644
index 0000000..3dc4ad0
--- /dev/null
+++ b/notes/memory-leak-investigation-handoff.md
@@ -0,0 +1,331 @@
+# Memory-Leak Investigation — Handoff for OpenCode Agent
+
+> **Purpose:** This document gives an OpenCode harness agent — running outside
+> the Dispatch system, with no prior context — everything needed to
+> independently investigate the memory leak that is crashing the production
+> Dispatch server. Read it top to bottom before touching anything.
+
+**Last updated:** 2026-06-28
+**Author of this handoff:** umans/umans-glm-5.2 (Dispatch agent)
+**Originating investigation:** `notes/server-crash-investigation.md` (commit `b83aa8d`)
+
+---
+
+## 0. TL;DR — what you are here to do
+
+The production Dispatch server leaks memory at **~2.5 GB/hour** and eventually
+hits a **Bun runtime segfault** (native crash, not a JS exception). A prior
+investigation already ruled out LSP as the cause. Your job is to **find where
+memory is being retained** and propose a fix. There are four undeployed commits
+on the `dev` branch that add memory telemetry and a cgroup circuit breaker —
+**those need to be built, installed, and observed first** to get the data that
+will localize the leak. Do not start code-diving blindly; deploy the telemetry,
+collect a crash cycle, then read the logs.
+
+---
+
+## 1. System Overview — what Dispatch is
+
+Dispatch is an **AI agent orchestration platform**: a backend that runs one or
+more AI "agents" through turns (each turn = a step loop of model calls +
+tool dispatch), plus a separate frontend that consumes the backend's typed
+contracts over HTTP + WebSocket.
+
+**Repo layout** — all under `/home/tradam/projects/dispatch/`:
+
+| Path | What | Git remote | Branch |
+|---|---|---|---|
+| `backend/` | The server (this codebase) | `[email protected]:realtradam/dispatch.git` | `dev` |
+| `frontend/` | The web UI (separate repo, same methodology) | `[email protected]:realtradam/frontend.git` | `dev` |
+| `worktrees/` | Per-task git worktrees for isolated feature branches | (varies) | (varies) |
+| `bin/` *(inside backend)* | Operational shell scripts (build, install, up, secrets, certs) | — | — |
+
+Both repos use **`dev` as the active development branch**. The backend is a
+**Bun + TypeScript** monorepo (project references via `tsc -b`, Biome for
+lint/format, Vitest for tests, `bun:sqlite` for storage, the `ai` /
+`@ai-sdk/*` packages for model providers). Architecture rules are in
+`backend/AGENTS.md` — **read it** before editing (especially the "kernel
+touches no I/O" and "no ambient state" rules).
+
+---
+
+## 2. Where things are installed (production)
+
+| Artifact | Path | Notes |
+|---|---|---|
+| Production server binary | `/usr/bin/dispatch-server` | Bun **standalone compiled** binary. Currently the OLD one (built Jun 27 21:40). You cannot install a new one yourself (no sudo); see §6. |
+| CLI binary | `/usr/bin/dispatch` | `dispatch send/list/read/...` |
+| Frontend static files | `/usr/share/dispatch/web/` | Built by `bin/build` (Vite, `VITE_HTTP_PORT=24991 VITE_WS_PORT=24990`) |
+| Config / env file | `/etc/dispatch/env` | `EnvironmentFile` for systemd. **Root-readable only** (`Permission denied` for non-root). Source template in repo: `systemd/dispatch.env` (installed by `bin/setup-env`). |
+| Data directory | `/var/lib/dispatch/` | WorkingDirectory of the service. SQLite DBs live here: `dispatch.db`, `dispatch.db-shm`, `dispatch.db-wal`. |
+| Logs | the journal | `journalctl -u dispatch` — there are no log files on disk; everything goes to journald. |
+| systemd unit (live) | `/etc/systemd/system/dispatch.service` | Installed from the repo template `systemd/dispatch.service` by `bin/install`. |
+| systemd unit (repo template) | `backend/systemd/dispatch.service` | **Edit this one**, not the live file — `bin/install` copies it over. |
+| Install script | `backend/bin/install` | Builds + installs the binary, unit, env, frontend. Uses `sudo` on privileged lines only (the agent has no sudo — hand scripts to the user). |
+| Build script | `backend/bin/build` | `tsc --build` then `bun build --compile` → `dist/dispatch-server`. |
+| Memory-limits script | `backend/bin/apply-memory-limits.sh` | Applies `MemoryHigh`/`MemoryMax` to the live service. **User has NOT run it yet.** |
+
+---
+
+## 3. Production ports
+
+| Service | Port | Confirmed |
+|---|---|---|
+| HTTP API | **24991** | `BACKEND_PORT=24991` in `systemd/dispatch.env` (set by `bin/setup-env`). Live: `ss -tlnp` shows `dispatch-server` listening on `*:24991`. |
+| Surface WebSocket | **24990** | `SURFACE_WS_PORT`. Live: `ss -tlnp` shows `dispatch-server` listening on `*:24990`. |
+
+The **dev** stack (`bin/up`) uses different ports: **24203** (HTTP) + **24205** (WS) + **24204** (frontend Vite dev server). Don't confuse dev ports with prod ports.
+
+**Health check:** `curl http://localhost:24991/health` → `{"ok":true}`
+
+---
+
+## 4. How to access the running server
+
+```bash
+# Is it running?
+systemctl status dispatch
+
+# Live logs (follow)
+journalctl -u dispatch -f
+
+# Recent crashes / errors (last 2h)
+journalctl -u dispatch --since '2 hours ago' -n 200
+
+# Memory telemetry — ONLY visible once the new binary is deployed (see §6).
+# The EXACT log tags to grep (NOT "[mem-telemetry]" — that is the module name):
+journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'
+
+# Restart (needs root — hand to the user)
+sudo systemctl restart dispatch
+```
+
+> **The service is named `dispatch`, NOT `dispatch-server`.** The binary is
+> `dispatch-server`; the systemd unit is `dispatch.service`. This mismatch
+> tripped up the first investigation — don't repeat it.
+
+The backend checkout at `/home/tradam/projects/dispatch/backend/` is on `dev`
+and contains the source. The running binary is compiled, so **editing source
+does nothing until you rebuild + install + restart** (§6).
+
+---
+
+## 5. The memory leak — what we know
+
+### 5.1 Symptom & crash signature
+
+- The server grows **~2.5 GB/hour** under load, eventually triggering a **Bun
+ runtime segfault** (native crash).
+- **Last crash:** Jun 28 00:12 JST.
+ ```
+ panic(main thread): Segmentation fault at address 0x0
+ oh no: Bun has crashed. This indicates a bug in Bun, not your code.
+ Main process exited, code=dumped, status=4/ILL
+ ```
+ RSS was **6.2 GB** at crash (over 2h31m uptime). A prior session hit
+ **25.7 GB over 18h**.
+- The crash is a **native segfault inside Bun's runtime** (a NULL dereference
+  at address `0x0`, most likely in the allocator/GC), NOT a JS exception. The
+  crash-time memory readout was itself corrupted (`RSS: 0.02ZB`, which is
+  impossible), and systemd saw SIGILL (`status=4/ILL`) while Bun reported
+  SIGSEGV. Both are consistent with **heap corruption under memory pressure**.
+- This is the **only** segfault in 5 days of logs. Earlier crashes (Jun 27
+ ~02:44–02:52) were `exit-code` failures = application-level LSP bugs, **now
+ fixed** (see §7).
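The growth figures quoted above can be cross-checked with simple arithmetic. The sketch below is only an illustration, not project code: it assumes RSS at process start was negligible next to the final value, and the names `Session` and `growthRateGBPerHour` are hypothetical.

```typescript
// Hypothetical helper: average RSS growth over one crash session.
// Assumption: RSS at process start is negligible next to the final value.
interface Session {
  label: string;
  finalRssGB: number;
  uptimeHours: number;
}

function growthRateGBPerHour(s: Session): number {
  return s.finalRssGB / s.uptimeHours;
}

// The two sessions reported in this section.
const sessions: Session[] = [
  { label: "Jun 28 segfault", finalRssGB: 6.2, uptimeHours: 2 + 31 / 60 },
  { label: "prior session", finalRssGB: 25.7, uptimeHours: 18 },
];

const rates = sessions.map((s) => ({
  label: s.label,
  rate: growthRateGBPerHour(s),
}));
```

Under that assumption the segfault session averages about 2.46 GB/hour and the 18-hour session about 1.43 GB/hour, so the growth rate appears to vary with load rather than being constant.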
+
+### 5.2 What it is NOT
+
+- **NOT LSP.** The Language Server Protocol extension was the original prime
+ suspect; it is exonerated. Its caches are bounded (`MAX_OPEN_DOCUMENTS = 50`,
+ `MAX_PUSH_DIAGNOSTICS = 100` — see `packages/lsp/src/client.ts:153` and
+ `diagnostics.ts:47`). 50 docs + 100 diag entries cannot explain gigabytes.
+- **NOT the conversation store.** `packages/conversation-store` is
+ **SQLite-backed** (all reads/writes go through a `StorageNamespace` to
+ `bun:sqlite`) — it does not hold conversations in an in-memory map.
+- **NOT `activeConversations`.** The session-orchestrator's
+ `activeConversations` is just a `Set<string>` of conversation IDs (tiny).
+
+### 5.3 Prime suspects (not yet confirmed — that's your job)
+
+1. **The AI-SDK streaming path** — the `async` generator returned by
+ `provider.stream(...)` (`@ai-sdk/*`, including the `openai-stream` package).
+ Streaming response buffers may be retained if the generator is not fully
+ drained or if the SDK holds internal buffers per request.
+2. **Per-turn message arrays** in `session-orchestrator` — the history +
+ `userMsg` + `providerMessages` arrays assembled before each
+ `provider.stream(...)` call. For long multi-step agent turns these can be
+ large; if references outlive the turn (closures, unresolved promises, event
+ listeners), they leak.
+3. **A Bun runtime bug that 1.3.14 may fix.** The crash was on Bun **v1.3.13**.
+   The server has been rebuilt locally with Bun 1.3.14 but not installed (§6);
+   deploying it may itself resolve the segfault even if a leak remains.
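For suspect 1, it helps to recall when JavaScript actually runs an async generator's cleanup. The sketch below shows language semantics only. `fakeStream` is a hypothetical stand-in for `provider.stream(...)`, and whether the real provider releases its buffers in a `finally` block is not known from this document.

```typescript
// Hypothetical stand-in for provider.stream(...); NOT the AI SDK's real API.
// The `finally` block models releasing a per-request response buffer.
async function* fakeStream(log: string[]): AsyncGenerator<string> {
  try {
    for (let i = 0; i < 3; i++) {
      yield `chunk-${i}`;
    }
  } finally {
    log.push("released");
  }
}

// `for await` calls the generator's return() when the loop exits early
// (break or throw), so the `finally` block runs.
async function consumeWithBreak(log: string[]): Promise<void> {
  for await (const _chunk of fakeStream(log)) {
    break;
  }
}

// Driving the iterator by hand and walking away leaves the generator
// suspended at its first `yield`; `finally` does not run until some
// caller invokes return() on it.
async function consumeManually(
  log: string[],
): Promise<AsyncGenerator<string>> {
  const it = fakeStream(log);
  await it.next();
  return it;
}
```

If the orchestrator only ever consumes the stream with `for await`, early exits should close the generator. The paths worth auditing are those that hold the iterator and call `next()` directly, or that wrap the stream in another object that never forwards `return()`.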
+
+### 5.4 Key files to examine (the leak-suspect code path)
+
+| File | What's there | Why it matters |
+|---|---|---|
+| `packages/session-orchestrator/src/orchestrator.ts` | `runTurnDetached()` at **line 541** — kicks off a turn; the `activeConversations.add/delete` around it (556, 979) brackets the turn lifecycle. | Turn lifecycle: where memory should be acquired and released. If a turn's buffers aren't released here, it leaks per-turn. |
+| `packages/session-orchestrator/src/orchestrator.ts` | `for await (const event of provider.stream(messages, assembled.tools, providerOpts))` at **line 1276** (and a second one at **1430** for compaction/summary). | **The streaming loop** — the #1 suspect. This is where the AI-SDK async generator is consumed. If the generator is abandoned mid-stream (error, `break`, abort) or its internal buffers are retained, memory grows per turn. |
+| `packages/session-orchestrator/src/pure.ts` | The pure turn/step decision logic + the `MemorySample`/`memoryDelta`/`memorySampleAttributes` helpers used by telemetry. | The per-turn before/after sampling wraps the stream boundary (referenced by `mem-telemetry.ts`). |
+| `packages/kernel/src/runtime/run-turn.ts` | The **step loop** (`runTurn`) — one agent turn = N steps of (model call → tool dispatch → feed results back). | MAX_STEPS is disabled (`0 = unlimited`, commit `e8b4bf1`), so a single turn can run many steps. Many steps ⇒ many accumulated message arrays ⇒ more retention surface. |
+| `packages/kernel/src/runtime/dispatch.ts` | **Tool dispatch** (the `maxConcurrent`/`eager` tool-execution loop). | Tool results accumulate into the turn's message history. A rogue tool returning huge payloads could balloon arrays. |
+| `packages/host-bin/src/mem-telemetry.ts` | The periodic telemetry edge effect. | Read this to understand exactly what `memory:periodic` / `memory:gc` log entries contain (rss, heapUsed, heapTotal, external, arrayBuffers, activeConversations, reclaimedRssMB). |
+
+> **Architecture note (AGENTS.md):** the kernel (`packages/kernel`) touches NO
+> I/O and names no concrete feature. Decision logic is pure; effects are
+> injected at the edges (host-bin). When investigating, keep this layering in
+> mind — a leak "in the kernel" would actually be in a closure captured by an
+> edge that feeds the kernel, not in the kernel's own (stateless) turn loop.
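A minimal sketch of that layering, applied to memory sampling. All names here (`Sample`, `delta`, `measureTurn`) are hypothetical, and the real `MemorySample` / `memoryDelta` helpers in `pure.ts` may have different shapes. The point is only that the arithmetic is pure and the effect (reading process memory) is passed in by the caller.

```typescript
// Hypothetical shapes; the real helpers in pure.ts may differ.
interface Sample {
  rss: number; // bytes
  heapUsed: number; // bytes
}

// Pure: no I/O, no clock, no globals.
function delta(before: Sample, after: Sample): Sample {
  return {
    rss: after.rss - before.rss,
    heapUsed: after.heapUsed - before.heapUsed,
  };
}

// The sampler is injected, so this function names no concrete runtime API.
// An edge module would pass a function that reads process memory; a test
// can pass a stub.
async function measureTurn<T>(
  sample: () => Sample,
  turn: () => Promise<T>,
): Promise<{ result: T; delta: Sample }> {
  const before = sample();
  const result = await turn();
  return { result, delta: delta(before, sample()) };
}
```

Note that the closure passed as `turn` is exactly the kind of edge-owned reference the note above warns about: if it captures a large message array and something keeps the closure alive, the array stays alive with it.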
+
+---
+
+## 6. What's been done (committed to `dev`, NOT yet deployed)
+
+The running binary is still the **old one** (built Jun 27 21:40, pre-telemetry,
+pre-Bun-1.3.14, pre-LSP-disable). Four commits on `dev` need building +
+installing before any of this takes effect. **Verify before you trust:** the
+live systemd still reports `MemoryMax=infinity` (confirmed via
+`systemctl show dispatch -p MemoryMax`).
+
+| Commit | What | Status |
+|---|---|---|
+| `d1de9ed` | `systemd/dispatch.service` gets `MemoryHigh=20G` + `MemoryMax=24G` circuit breaker. | Committed. **Live = infinity (not applied).** |
+| `b3b6eb4` | `bin/apply-memory-limits.sh` — applies the limits to the live service. | Committed. **User has NOT run it.** |
+| `1cd66da` | Memory telemetry: periodic `process.memoryUsage()` sampling (every 15s as of `73ff84c`; originally 60s), per-turn before/after sampling, `activeConversations` tagging, `Bun.gc(true)` every 5 min. Files: `packages/host-bin/src/mem-telemetry.ts` + `packages/session-orchestrator/src/pure.ts`. | Committed. **NOT deployed.** |
+| `73ff84c` | **LSP disabled** as a precaution (`main.ts` import commented out + removed from `CORE_EXTENSIONS`; `transport-http` tolerates its absence). Also set telemetry interval 60s→15s. | Committed. **NOT deployed.** LSP stays disabled until the leak is understood. |
+| (bun upgrade) | Bun **1.3.13 → 1.3.14** via `bun upgrade`. New binary built at `dist/dispatch-server`. | Rebuilt. **NOT installed** to `/usr/bin/dispatch-server`. |
+
+### 6.1 The telemetry data you will collect (once deployed)
+
+The exact log tags to grep in journald are **`memory:periodic`** and
+**`memory:gc`** (the module is `mem-telemetry.ts`, but the log *messages* are
+those two strings — do not grep for `[mem-telemetry]`).
+
+- `memory:periodic` — every 15s: `rss`, `heapUsed`, `heapTotal`, `external`,
+ `arrayBuffers` (bytes), plus `activeConversations` (count). Correlate RSS
+ growth with active turns: if RSS climbs while `activeConversations` is 0,
+ the leak is background/idle (timers, watchers, retained closures); if it
+ climbs only during/after turns, it's the streaming/turn path.
+- `memory:gc` (every 5 min): runs `Bun.gc(true)` and logs RSS before/after plus
+  `reclaimedRssMB` (`before - after`). **Interpretation:** a large positive
+  `reclaimedRssMB` means the GC freed it: the memory was collectible garbage,
+  not a true leak. Near-zero or negative `reclaimedRssMB` means the memory is
+  **LIVE** (retained objects, a real leak) or sits outside the JS heap, so
+  compare `heapUsed` with `rss` to tell which. Check this field first when
+  you get data.
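The two interpretation rules above can be sketched as small pure functions over already-parsed samples. This is an illustration under stated assumptions: the field `rssBeforeMB`, the 5% threshold, and the function names are hypothetical, and nothing here assumes how the entries are serialized in the journal.

```typescript
// Hypothetical parsed shapes. rss, activeConversations and reclaimedRssMB
// are named in the telemetry description; rssBeforeMB is an assumed name.
interface PeriodicSample {
  rss: number; // bytes
  activeConversations: number; // count
}

interface GcSample {
  rssBeforeMB: number;
  reclaimedRssMB: number; // before - after
}

// Assumption: reclaiming 5% of RSS or less counts as "near zero".
const RETAINED_THRESHOLD = 0.05;

function classifyGc(s: GcSample): "reclaimable" | "retained" {
  const fraction = s.reclaimedRssMB / s.rssBeforeMB;
  return fraction > RETAINED_THRESHOLD ? "reclaimable" : "retained";
}

// Sums RSS growth between consecutive samples, split by whether any
// conversation was active at the start of each interval.
function splitGrowth(samples: PeriodicSample[]): {
  idle: number;
  active: number;
} {
  const growth = { idle: 0, active: 0 };
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1]!;
    const cur = samples[i]!;
    const d = cur.rss - prev.rss;
    if (prev.activeConversations === 0) growth.idle += d;
    else growth.active += d;
  }
  return growth;
}
```

A run where `splitGrowth` attributes most growth to `idle` points at timers or watchers, while growth concentrated in `active` points at the turn path. Combined with a `retained` verdict from `classifyGc`, that narrows the search considerably.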
+
+### 6.2 Deploy sequence (hand to the user — you have no sudo)
+
+The agent cannot run `sudo` or install to `/usr/bin`. Write a script with
+`sudo` on the privileged lines and hand it to the user (per the repo's sudo
+policy). The sequence, roughly:
+
+1. From `backend/`: `bun run typecheck && bun run test && bun run check` to
+   confirm `dev` is green (it should be, since everything in §6 is already
+   committed).
+2. `bin/build` — produces `dist/dispatch-server` (Bun 1.3.14, with telemetry +
+ LSP-disabled).
+3. `bin/install` (or a targeted script) — copies `dist/dispatch-server` →
+ `/usr/bin/dispatch-server`, the unit template → `/etc/systemd/system/`,
+ `systemd/dispatch.env` → `/etc/dispatch/env`. (Or run `bin/apply-memory-limits.sh`
+ separately if you only want the cgroup limits without a full reinstall.)
+4. `sudo systemctl daemon-reload && sudo systemctl restart dispatch`.
+5. Confirm: `systemctl show dispatch -p MemoryMax` should show
+   `MemoryMax=25769803776` (24G, reported in bytes), and
+   `journalctl -u dispatch -f | rg 'memory:periodic'` should show telemetry
+   within 15s.
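One detail of the confirmation step: `systemctl show` reports memory properties as raw byte counts, and systemd treats the `K`, `M`, `G` and `T` suffixes as powers of 1024. The helper below is a hypothetical convenience for comparing the shown value with the configured one; it is not part of the repo.

```typescript
// Converts a systemd size string such as "24G" to bytes.
// systemd treats K, M, G and T as powers of 1024.
function systemdSizeToBytes(value: string): number {
  const text = value.trim();
  if (text === "infinity") return Infinity;
  const match = /^(\d+)([KMGT]?)$/.exec(text);
  if (!match) throw new Error(`unsupported size: ${value}`);
  const exponent = ["", "K", "M", "G", "T"].indexOf(match[2] ?? "");
  return Number(match[1]) * 1024 ** exponent;
}

// Compares one line of `systemctl show` output (for example
// "MemoryMax=25769803776") with the value written in the unit file.
function matchesLimit(showLine: string, configured: string): boolean {
  const shown = showLine.slice(showLine.indexOf("=") + 1);
  return systemdSizeToBytes(shown) === systemdSizeToBytes(configured);
}
```

`matchesLimit("MemoryMax=25769803776", "24G")` evaluates to `true`, while the pre-deploy value `MemoryMax=infinity` compared against `24G` evaluates to `false`.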
+
+> ⚠️ **Warning:** deploying restarts the production server. The live server is
+> actively serving agent conversations. Coordinate with the user (ceb2 /
+> tradam) before restarting — don't just yank it. Also note the running server
+> at time of writing is PID 28956 (started Jun 28 08:51).
+
+---
+
+## 7. Background — the original crash investigation (already done)
+
+You do **not** need to redo this; it's recorded for context.
+
+- **Original report:** `notes/server-crash-investigation.md` (commit `b83aa8d`).
+- The first investigation's task description was **stale**: it claimed an
+ "uncommitted hot-fix that disables LSP" existed. It did not — the working
+ tree was clean and LSP was *enabled*. The developer then *chose* to disable
+ LSP (commit `73ff84c`) as a precaution, but the root cause was already
+ identified as the Bun segfault + leak, not LSP.
+- The **three original LSP suspects** (JSON-parse TypeError, `fs.watch` ENOENT,
+ unbounded cache leak) were all fixed in commits `05ff256` + `f9d1ca5` and
+ are correct/tested. LSP being disabled now is a precaution, not a fix for the
+ crash. Once the leak is understood, LSP can be re-enabled safely.
+- **Do not spend time re-investigating LSP.** It is exonerated.
+
+---
+
+## 8. What needs investigation (your tasks, in order)
+
+1. **Deploy the telemetry + cgroup limits** (§6) — you cannot localize the leak
+ without the `memory:periodic` / `memory:gc` data. Get a full crash cycle's
+ worth of logs (let it run until it either hits `MemoryMax=24G` and restarts,
+ or until the growth pattern is clear).
+2. **Read the telemetry first.** The `memory:gc` `reclaimedRssMB` field tells
+   you whether the growth is live-retained or reclaimable garbage. If it is
+   near zero after GC, treat it as a real retained leak and chase which
+   subsystem holds it.
+3. **Correlate growth with `activeConversations`.** Idle growth ⇒ timers,
+   watchers, or retained closures (the LSP file-watcher is ruled out once the
+   LSP-disabled build is running). Turn-bound growth ⇒ the streaming/turn path
+   (suspects in §5.3, files in §5.4).
+4. **Examine the streaming loop** (`orchestrator.ts:1276`): is the
+   `provider.stream(...)` async generator fully drained on every path
+   (success, error, abort, `break`)? Does the AI SDK retain response buffers?
+   Use the per-turn before/after samples added in `1cd66da` (§6) to measure
+   the per-turn delta directly.
+5. **Examine the message arrays** assembled before `provider.stream` — are
+ they scoped to the turn and released when the turn completes, or do
+ closures/promises/event-listeners keep them alive?
+6. **Test Bun 1.3.14 in isolation.** Since the upgrade is built but not
+ installed, see if a memory-stress repro under 1.3.13 vs 1.3.14 behaves
+ differently — the segfault may be a Bun allocator bug that 1.3.14 fixes.
+7. **Propose a fix** (do not implement without coordinating — report up to the
+ orchestrator). Likely candidates: ensure the streaming generator is always
+ fully consumed / explicitly closed on error; bound per-turn message
+ history; release references at `activeConversations.delete` time; or
+ confirm it's purely a Bun bug requiring the 1.3.14 deploy.
+
+---
+
+## 9. Quick-reference commands
+
+```bash
+# --- status & logs ---
+systemctl status dispatch
+systemctl show dispatch -p MemoryMax -p MemoryHigh   # expect 24G/20G (printed in bytes) after deploy
+journalctl -u dispatch -f # live logs
+journalctl -u dispatch -f | rg 'memory:periodic|memory:gc' # telemetry (after deploy)
+journalctl -u dispatch --since '2 hours ago' -n 200 # recent history / crashes
+curl http://localhost:24991/health # → {"ok":true}
+ss -tlnp | rg dispatch # ports 24991 + 24990
+
+# --- source (the leak path) ---
+cd /home/tradam/projects/dispatch/backend
+git log --oneline -n 8 # see the undeployed commits
+rg -n 'runTurnDetached|provider\.stream|for await' packages/session-orchestrator/src/orchestrator.ts
+
+# --- build & verify (NO sudo needed) ---
+bun run typecheck && bun run test && bun run check
+bin/build # → dist/dispatch-server
+
+# --- deploy (NEEDS sudo — hand a script to the user) ---
+# bin/install (or: sudo cp dist/dispatch-server /usr/bin/dispatch-server)
+# bin/apply-memory-limits.sh (applies MemoryMax to live service)
+# sudo systemctl daemon-reload && sudo systemctl restart dispatch
+```
+
+---
+
+## 10. Guardrails
+
+- **Do NOT edit implementation code without reporting to the orchestrator.**
+ This is an investigation handoff; propose fixes, don't apply them
+ unilaterally.
+- **You have no sudo.** Any step touching `/usr/bin`, `/etc/`, or
+ `systemctl restart` must be a script handed to the user with `sudo` on the
+ privileged lines.
+- **Do not re-enable LSP** until the leak is understood (it's disabled as a
+ precaution; re-enabling is a separate decision).
+- **Do not merge or push.** Work stays on `dev` locally.
+- **The service is `dispatch`, not `dispatch-server`.**