# Memory-Leak Investigation: Handoff for OpenCode Agent

> **Purpose:** This document gives an OpenCode harness agent (running outside
> the Dispatch system, with no prior context) everything needed to
> independently investigate the memory leak that is crashing the production
> Dispatch server. Read it top to bottom before touching anything.

**Last updated:** 2026-06-28

**Author of this handoff:** umans/umans-glm-5.2 (Dispatch agent)

**Originating investigation:** `notes/server-crash-investigation.md` (commit `b83aa8d`)

---

## 0. TL;DR: what you are here to do

The production Dispatch server leaks memory at **~2.5 GB/hour** and eventually
hits a **Bun runtime segfault** (native crash, not a JS exception). A prior
investigation already ruled out LSP as the cause. Your job is to **find where
memory is being retained** and propose a fix.

There are four undeployed commits on the `dev` branch that add memory telemetry
and a cgroup circuit breaker. **Those need to be built, installed, and observed
first** to get the data that will localize the leak. Do not start code-diving
blindly; deploy the telemetry, collect a crash cycle, then read the logs.

---

## 1. System Overview: what Dispatch is

Dispatch is an **AI agent orchestration platform**: a backend that runs one or
more AI "agents" through turns (each turn = a step loop of model calls + tool
dispatch), plus a separate frontend that consumes the backend's typed contracts
over HTTP + WebSocket.
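To make "a step loop of model calls + tool dispatch" concrete, here is a minimal TypeScript sketch of one turn. Every name in it (`Message`, `StreamEvent`, `ModelStream`, `ToolRunner`, `runTurnSketch`) is hypothetical and chosen for illustration; it is not the real Dispatch kernel API and it omits aborts, errors, and concurrency. It only shows the shape that matters for this investigation: each step consumes a model stream and appends to a history array that lives for the whole turn.

```typescript
// Hypothetical types for illustration only; not Dispatch's real contracts.
type Message = { role: "user" | "assistant" | "tool"; content: string };
type StreamEvent =
  | { kind: "text"; text: string }
  | { kind: "tool-call"; name: string; input: string };
type ModelStream = (history: readonly Message[]) => AsyncIterable<StreamEvent>;
type ToolRunner = (name: string, input: string) => Promise<string>;

// One turn: repeat (model call -> tool dispatch) until the model stops
// requesting tools. The history array grows with every step.
async function runTurnSketch(
  userText: string,
  streamModel: ModelStream,
  runTool: ToolRunner,
): Promise<Message[]> {
  const history: Message[] = [{ role: "user", content: userText }];
  for (;;) {
    let text = "";
    const calls: { name: string; input: string }[] = [];
    // Model call: consume the stream to completion.
    for await (const event of streamModel(history)) {
      if (event.kind === "text") text += event.text;
      else calls.push({ name: event.name, input: event.input });
    }
    history.push({ role: "assistant", content: text });
    if (calls.length === 0) return history; // no tools requested: turn ends
    // Tool dispatch: feed each result back into the history for the next step.
    for (const call of calls) {
      history.push({ role: "tool", content: await runTool(call.name, call.input) });
    }
  }
}
```

In this simplified picture, anything that keeps `history` or a stream's internal buffers reachable after the function returns would show up as per-turn growth.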
**Repo layout** (all under `/home/tradam/projects/dispatch/`):

| Path | What | Git remote | Branch |
|---|---|---|---|
| `backend/` | The server (this codebase) | `git@github.com:realtradam/dispatch.git` | `dev` |
| `frontend/` | The web UI (separate repo, same methodology) | `git@github.com:realtradam/frontend.git` | `dev` |
| `worktrees/` | Per-task git worktrees for isolated feature branches | (varies) | (varies) |
| `bin/` *(inside backend)* | Operational shell scripts (build, install, up, secrets, certs) | n/a | n/a |

Both repos use **`dev` as the active development branch**. The backend is a
**Bun + TypeScript** monorepo (project references via `tsc -b`, Biome for
lint/format, Vitest for tests, `bun:sqlite` for storage, the `ai` / `@ai-sdk/*`
packages for model providers). Architecture rules are in `backend/AGENTS.md`;
**read it** before editing (especially the "kernel touches no I/O" and "no
ambient state" rules).

---

## 2. Where things are installed (production)

| Artifact | Path | Notes |
|---|---|---|
| Production server binary | `/usr/bin/dispatch-server` | Bun **standalone compiled** binary. Currently the OLD one (built Jun 27 21:40). Not yours to install; see §6.2. |
| CLI binary | `/usr/bin/dispatch` | `dispatch send/list/read/...` |
| Frontend static files | `/usr/share/dispatch/web/` | Built by `bin/build` (Vite, `VITE_HTTP_PORT=24991 VITE_WS_PORT=24990`) |
| Config / env file | `/etc/dispatch/env` | `EnvironmentFile` for systemd. **Root-readable only** (`Permission denied` for non-root). Source template in repo: `systemd/dispatch.env` (installed by `bin/setup-env`). |
| Data directory | `/var/lib/dispatch/` | WorkingDirectory of the service. SQLite DBs live here: `dispatch.db`, `dispatch.db-shm`, `dispatch.db-wal`. |
| Logs | the journal | `journalctl -u dispatch`. There are no log files on disk; everything goes to journald. |
| systemd unit (live) | `/etc/systemd/system/dispatch.service` | Installed from the repo template `systemd/dispatch.service` by `bin/install`. |
| systemd unit (repo template) | `backend/systemd/dispatch.service` | **Edit this one**, not the live file; `bin/install` copies it over. |
| Install script | `backend/bin/install` | Builds + installs the binary, unit, env, frontend. Uses `sudo` on privileged lines only (the agent has no sudo; hand scripts to the user). |
| Build script | `backend/bin/build` | `tsc --build` then `bun build --compile` → `dist/dispatch-server`. |
| Memory-limits script | `backend/bin/apply-memory-limits.sh` | Applies `MemoryHigh`/`MemoryMax` to the live service. **User has NOT run it yet.** |

---

## 3. Production ports

| Service | Port | Confirmed |
|---|---|---|
| HTTP API | **24991** | `BACKEND_PORT=24991` in `systemd/dispatch.env` (set by `bin/setup-env`). Live: `ss -tlnp` shows `dispatch-server` listening on `*:24991`. |
| Surface WebSocket | **24990** | `SURFACE_WS_PORT`. Live: `ss -tlnp` shows `dispatch-server` listening on `*:24990`. |

The **dev** stack (`bin/up`) uses different ports: **24203** (HTTP) + **24205**
(WS) + **24204** (frontend Vite dev server). Don't confuse dev ports with prod
ports.

**Health check:** `curl http://localhost:24991/health` → `{"ok":true}`

---

## 4. How to access the running server

```bash
# Is it running?
systemctl status dispatch

# Live logs (follow)
journalctl -u dispatch -f

# Recent crashes / errors (last 2h)
journalctl -u dispatch --since '2 hours ago' -n 200

# Memory telemetry: ONLY visible once the new binary is deployed (see §6).
# The EXACT log tags to grep (NOT "[mem-telemetry]"; that is the module name):
journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'

# Restart (needs root; hand to the user)
sudo systemctl restart dispatch
```

> **The service is named `dispatch`, NOT `dispatch-server`.** The binary is
> `dispatch-server`; the systemd unit is `dispatch.service`.
> This mismatch tripped up the first investigation; don't repeat it.

The backend checkout at `/home/tradam/projects/dispatch/backend/` is on `dev`
and contains the source. The running binary is compiled, so **editing source
does nothing until you rebuild + install + restart** (§6).

---

## 5. The memory leak: what we know

### 5.1 Symptom & crash signature

- The server grows **~2.5 GB/hour** under load, eventually triggering a **Bun
  runtime segfault** (native crash).
- **Last crash:** Jun 28 00:12 JST.

  ```
  panic(main thread): Segmentation fault at address 0x0
  oh no: Bun has crashed. This indicates a bug in Bun, not your code.
  Main process exited, code=dumped, status=4/ILL
  ```

  RSS was **6.2 GB** at crash (over 2h31m uptime). A prior session hit
  **25.7 GB over 18h**.
- The crash is a **native segfault inside Bun's runtime** (NULL deref in the
  allocator/GC), NOT a JS exception. The crash-time memory readout was itself
  corrupted (`RSS: 0.02ZB`, which is impossible), and systemd saw SIGILL while
  Bun reported SIGSEGV. Both are signs of **heap corruption under memory
  pressure**.
- This is the **only** segfault in 5 days of logs. Earlier crashes (Jun 27
  ~02:44–02:52) were `exit-code` failures = application-level LSP bugs, **now
  fixed** (see §7).

### 5.2 What it is NOT

- **NOT LSP.** The Language Server Protocol extension was the original prime
  suspect; it is exonerated. Its caches are bounded (`MAX_OPEN_DOCUMENTS = 50`,
  `MAX_PUSH_DIAGNOSTICS = 100`; see `packages/lsp/src/client.ts:153` and
  `diagnostics.ts:47`). 50 docs + 100 diag entries cannot explain gigabytes.
- **NOT the conversation store.** `packages/conversation-store` is
  **SQLite-backed** (all reads/writes go through a `StorageNamespace` to
  `bun:sqlite`); it does not hold conversations in an in-memory map.
- **NOT `activeConversations`.** The session-orchestrator's
  `activeConversations` is just a `Set` of conversation IDs (tiny).

### 5.3 Prime suspects (not yet confirmed; that's your job)

1. **The AI-SDK streaming path**: the `async` generator returned by
   `provider.stream(...)` (`@ai-sdk/*`, including the `openai-stream` package).
   Streaming response buffers may be retained if the generator is not fully
   drained or if the SDK holds internal buffers per request.
2. **Per-turn message arrays** in `session-orchestrator`: the history +
   `userMsg` + `providerMessages` arrays assembled before each
   `provider.stream(...)` call. For long multi-step agent turns these can be
   large; if references outlive the turn (closures, unresolved promises, event
   listeners), they leak.
3. **A Bun bug that 1.3.14 may fix.** The crash was on Bun **v1.3.13**. Bun
   1.3.14 has been built locally but not installed (§6); deploying it may
   itself resolve the segfault even if a leak remains.

### 5.4 Key files to examine (the leak-suspect code path)

| File | What's there | Why it matters |
|---|---|---|
| `packages/session-orchestrator/src/orchestrator.ts` | `runTurnDetached()` at **line 541**: kicks off a turn; the `activeConversations.add/delete` around it (556, 979) brackets the turn lifecycle. | Turn lifecycle: where memory should be acquired and released. If a turn's buffers aren't released here, it leaks per-turn. |
| `packages/session-orchestrator/src/orchestrator.ts` | `for await (const event of provider.stream(messages, assembled.tools, providerOpts))` at **line 1276** (and a second one at **1430** for compaction/summary). | **The streaming loop**, the #1 suspect. This is where the AI-SDK async generator is consumed. If the generator is abandoned mid-stream (error, `break`, abort) or its internal buffers are retained, memory grows per turn. |
| `packages/session-orchestrator/src/pure.ts` | The pure turn/step decision logic + the `MemorySample`/`memoryDelta`/`memorySampleAttributes` helpers used by telemetry. | The per-turn before/after sampling wraps the stream boundary (referenced by `mem-telemetry.ts`). |
| `packages/kernel/src/runtime/run-turn.ts` | The **step loop** (`runTurn`): one agent turn = N steps of (model call → tool dispatch → feed results back). | MAX_STEPS is disabled (`0 = unlimited`, commit `e8b4bf1`), so a single turn can run many steps. Many steps ⇒ many accumulated message arrays ⇒ more retention surface. |
| `packages/kernel/src/runtime/dispatch.ts` | **Tool dispatch** (the `maxConcurrent`/`eager` tool-execution loop). | Tool results accumulate into the turn's message history. A rogue tool returning huge payloads could balloon arrays. |
| `packages/host-bin/src/mem-telemetry.ts` | The periodic telemetry edge effect. | Read this to understand exactly what `memory:periodic` / `memory:gc` log entries contain (rss, heapUsed, heapTotal, external, arrayBuffers, activeConversations, reclaimedRssMB). |

> **Architecture note (AGENTS.md):** the kernel (`packages/kernel`) touches NO
> I/O and names no concrete feature. Decision logic is pure; effects are
> injected at the edges (host-bin). When investigating, keep this layering in
> mind: a leak "in the kernel" would actually be in a closure captured by an
> edge that feeds the kernel, not in the kernel's own (stateless) turn loop.

---

## 6. What's been done (committed to `dev`, NOT yet deployed)

The running binary is still the **old one** (built Jun 27 21:40, pre-telemetry,
pre-Bun-1.3.14, pre-LSP-disable). Four commits on `dev` need building +
installing before any of this takes effect. **Verify before you trust:** the
live systemd still reports `MemoryMax=infinity` (confirmed via
`systemctl show dispatch -p MemoryMax`).

| Commit | What | Status |
|---|---|---|
| `d1de9ed` | `systemd/dispatch.service` gets `MemoryHigh=20G` + `MemoryMax=24G` circuit breaker. | Committed. **Live = infinity (not applied).** |
| `b3b6eb4` | `bin/apply-memory-limits.sh`: applies the limits to the live service. | Committed. **User has NOT run it.** |
| `1cd66da` | Memory telemetry: periodic `process.memoryUsage()` every 15s, per-turn before/after sampling, `activeConversations` tagging, `Bun.gc(true)` every 5 min. Files: `packages/host-bin/src/mem-telemetry.ts` + `packages/session-orchestrator/src/pure.ts`. | Committed. **NOT deployed.** |
| `73ff84c` | **LSP disabled** as a precaution (`main.ts` import commented out + removed from `CORE_EXTENSIONS`; `transport-http` tolerates its absence). Also set telemetry interval 60s→15s. | Committed. **NOT deployed.** LSP stays disabled until the leak is understood. |
| (bun upgrade) | Bun **1.3.13 → 1.3.14** via `bun upgrade`. New binary built at `dist/dispatch-server`. | Rebuilt. **NOT installed** to `/usr/bin/dispatch-server`. |

### 6.1 The telemetry data you will collect (once deployed)

The exact log tags to grep in journald are **`memory:periodic`** and
**`memory:gc`** (the module is `mem-telemetry.ts`, but the log *messages* are
those two strings; do not grep for `[mem-telemetry]`).

- `memory:periodic` (every 15s): `rss`, `heapUsed`, `heapTotal`, `external`,
  `arrayBuffers` (bytes), plus `activeConversations` (count). Correlate RSS
  growth with active turns: if RSS climbs while `activeConversations` is 0, the
  leak is background/idle (timers, watchers, retained closures); if it climbs
  only during/after turns, it's the streaming/turn path.
- `memory:gc` (every 5 min): runs `Bun.gc(true)` and logs RSS before/after +
  `reclaimedRssMB` (`before - after`). **Interpretation:** a large positive
  `reclaimedRssMB` = GC freed it (fragmentation, reclaimable; not a true leak).
  Near-zero/negative `reclaimedRssMB` = memory is **LIVE** (retained objects; a
  real leak). This single field distinguishes the two failure modes; check it
  first when you get data.

### 6.2 Deploy sequence (hand to the user; you have no sudo)

The agent cannot run `sudo` or install to `/usr/bin`.
Write a script with `sudo` on the privileged lines and hand it to the user (per
the repo's sudo policy). The sequence, roughly:

1. From `backend/`: `bun run typecheck && bun run test && bun run check` to
   make sure `dev` is green (it should be; these commits are committed).
2. `bin/build`: produces `dist/dispatch-server` (Bun 1.3.14, with telemetry +
   LSP-disabled).
3. `bin/install` (or a targeted script): copies `dist/dispatch-server` →
   `/usr/bin/dispatch-server`, the unit template → `/etc/systemd/system/`,
   `systemd/dispatch.env` → `/etc/dispatch/env`. (Or run
   `bin/apply-memory-limits.sh` separately if you only want the cgroup limits
   without a full reinstall.)
4. `sudo systemctl daemon-reload && sudo systemctl restart dispatch`.
5. Confirm: `systemctl show dispatch -p MemoryMax` should report the 24G limit
   (systemd prints it in bytes, `MemoryMax=25769803776`), and
   `journalctl -u dispatch -f | rg 'memory:periodic'` should show telemetry
   within 15s.

> ⚠️ **Warning:** deploying restarts the production server. The live server is
> actively serving agent conversations. Coordinate with the user (ceb2 /
> tradam) before restarting; don't just yank it. Also note the running server
> at time of writing is PID 28956 (started Jun 28 08:51).

---

## 7. Background: the original crash investigation (already done)

You do **not** need to redo this; it's recorded for context.

- **Original report:** `notes/server-crash-investigation.md` (commit `b83aa8d`).
- The first investigation's task description was **stale**: it claimed an
  "uncommitted hot-fix that disables LSP" existed. It did not: the working tree
  was clean and LSP was *enabled*. The developer then *chose* to disable LSP
  (commit `73ff84c`) as a precaution, but the root cause was already identified
  as the Bun segfault + leak, not LSP.
- The **three original LSP suspects** (JSON-parse TypeError, `fs.watch` ENOENT,
  unbounded cache leak) were all fixed in commits `05ff256` + `f9d1ca5` and are
  correct/tested. LSP being disabled now is a precaution, not a fix for the
  crash.
  Once the leak is understood, LSP can be re-enabled safely.
- **Do not spend time re-investigating LSP.** It is exonerated.

---

## 8. What needs investigation (your tasks, in order)

1. **Deploy the telemetry + cgroup limits** (§6). You cannot localize the leak
   without the `memory:periodic` / `memory:gc` data. Get a full crash cycle's
   worth of logs (let it run until it either hits `MemoryMax=24G` and restarts,
   or until the growth pattern is clear).
2. **Read the telemetry first.** The `memory:gc` `reclaimedRssMB` field tells
   you live-retained vs fragmentation. If near-zero after GC → real retained
   leak; chase which subsystem holds it.
3. **Correlate growth with `activeConversations`.** Idle-growth ⇒ timers /
   watchers / retained closures / the LSP file-watcher (if any servers still
   spawn, though LSP is disabled). Turn-bound growth ⇒ the streaming/turn path
   (suspects in §5.3, files in §5.4).
4. **Examine the streaming loop** (`orchestrator.ts:1276`): is the
   `provider.stream(...)` async generator fully drained on every path (success,
   error, abort, `break`)? Does the AI SDK retain response buffers? Consider
   adding a before/after `memoryUsage()` around a single turn to measure
   per-turn delta directly.
5. **Examine the message arrays** assembled before `provider.stream`: are they
   scoped to the turn and released when the turn completes, or do
   closures/promises/event-listeners keep them alive?
6. **Test Bun 1.3.14 in isolation.** Since the upgrade is built but not
   installed, see if a memory-stress repro under 1.3.13 vs 1.3.14 behaves
   differently; the segfault may be a Bun allocator bug that 1.3.14 fixes.
7. **Propose a fix** (do not implement without coordinating; report up to the
   orchestrator). Likely candidates: ensure the streaming generator is always
   fully consumed / explicitly closed on error; bound per-turn message history;
   release references at `activeConversations.delete` time; or confirm it's
   purely a Bun bug requiring the 1.3.14 deploy.

---

## 9. Quick-reference commands

```bash
# --- status & logs ---
systemctl status dispatch
systemctl show dispatch -p MemoryMax -p MemoryHigh          # expect 24G/20G after deploy (printed in bytes)
journalctl -u dispatch -f                                    # live logs
journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'   # telemetry (after deploy)
journalctl -u dispatch --since '2 hours ago' -n 200          # recent history / crashes
curl http://localhost:24991/health                           # → {"ok":true}
ss -tlnp | rg dispatch                                       # ports 24991 + 24990

# --- source (the leak path) ---
cd /home/tradam/projects/dispatch/backend
git log --oneline -n 8                                       # see the undeployed commits
rg -n 'runTurnDetached|provider\.stream|for await' packages/session-orchestrator/src/orchestrator.ts

# --- build & verify (NO sudo needed) ---
bun run typecheck && bun run test && bun run check
bin/build                                                    # → dist/dispatch-server

# --- deploy (NEEDS sudo; hand a script to the user) ---
# bin/install                  (or: sudo cp dist/dispatch-server /usr/bin/dispatch-server)
# bin/apply-memory-limits.sh   (applies MemoryMax to live service)
# sudo systemctl daemon-reload && sudo systemctl restart dispatch
```

---

## 10. Guardrails

- **Do NOT edit implementation code without reporting to the orchestrator.**
  This is an investigation handoff; propose fixes, don't apply them
  unilaterally.
- **You have no sudo.** Any step touching `/usr/bin`, `/etc/`, or
  `systemctl restart` must be a script handed to the user with `sudo` on the
  privileged lines.
- **Do not re-enable LSP** until the leak is understood (it's disabled as a
  precaution; re-enabling is a separate decision).
- **Do not merge or push.** Work stays on `dev` locally.
- **The service is `dispatch`, not `dispatch-server`.**
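
Once telemetry is flowing, the two interpretation rules from §6.1 (reclaimed versus retained memory, idle versus turn-bound growth) can be applied mechanically to parsed log entries. The TypeScript below is a minimal sketch under stated assumptions: the `PeriodicSample` and `GcSample` shapes, the function names, and the 50 MB threshold are all hypothetical and are not taken from `mem-telemetry.ts`. Check the real log payload (field names and units) before adapting it.

```typescript
// Hypothetical shapes; the real `memory:periodic` / `memory:gc` payloads may
// use different field names or units.
interface PeriodicSample {
  rssMB: number;
  activeConversations: number;
}
interface GcSample {
  reclaimedRssMB: number; // before - after, as described in section 6.1
}

type GcVerdict = "reclaimable" | "retained";
type GrowthVerdict = "none" | "idle" | "turn-bound" | "mixed";

// Assumed cut-off (not from the source): below this, a forced GC is treated
// as having reclaimed essentially nothing.
const RECLAIM_THRESHOLD_MB = 50;

// Large positive reclaim => GC could free it; near-zero or negative => live.
function classifyGc(sample: GcSample): GcVerdict {
  return sample.reclaimedRssMB >= RECLAIM_THRESHOLD_MB ? "reclaimable" : "retained";
}

// Attribute each interval's RSS growth to "idle" or "busy", where an interval
// counts as busy if a conversation was active at either end of it.
function classifyGrowth(samples: readonly PeriodicSample[]): GrowthVerdict {
  let idleGrowth = 0;
  let busyGrowth = 0;
  let prev: PeriodicSample | undefined;
  for (const cur of samples) {
    if (prev !== undefined) {
      const delta = cur.rssMB - prev.rssMB;
      if (delta > 0) {
        const busy = prev.activeConversations > 0 || cur.activeConversations > 0;
        if (busy) busyGrowth += delta;
        else idleGrowth += delta;
      }
    }
    prev = cur;
  }
  if (idleGrowth === 0 && busyGrowth === 0) return "none";
  if (busyGrowth === 0) return "idle";
  if (idleGrowth === 0) return "turn-bound";
  return "mixed";
}
```

A "retained" GC verdict combined with "turn-bound" growth would point at the streaming/turn path from §5.3, while "idle" growth would point at timers, watchers, or retained closures. With 15s sampling, a turn that starts and ends inside one interval is invisible to this heuristic, so treat a "mixed" result as a prompt to look closer rather than as a finding.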