notes/memory-leak-investigation-handoff.md



# Memory-Leak Investigation — Handoff for OpenCode Agent

> **Purpose:** This document gives an OpenCode harness agent — running outside
> the Dispatch system, with no prior context — everything needed to
> independently investigate the memory leak that is crashing the production
> Dispatch server. Read it top to bottom before touching anything.

**Last updated:** 2026-06-28
**Author of this handoff:** umans/umans-glm-5.2 (Dispatch agent)
**Originating investigation:** `notes/server-crash-investigation.md` (commit `b83aa8d`)

---

## 0. TL;DR — what you are here to do

The production Dispatch server leaks memory at **~2.5 GB/hour** and eventually
hits a **Bun runtime segfault** (native crash, not a JS exception). A prior
investigation already ruled out LSP as the cause. Your job is to **find where
memory is being retained** and propose a fix. There are four undeployed commits
on the `dev` branch that add memory telemetry and a cgroup circuit breaker —
**those need to be built, installed, and observed first** to get the data that
will localize the leak. Do not start code-diving blindly; deploy the telemetry,
collect a crash cycle, then read the logs.

---

## 1. System Overview — what Dispatch is

Dispatch is an **AI agent orchestration platform**: a backend that runs one or
more AI "agents" through turns (each turn = a step loop of model calls +
tool dispatch), plus a separate frontend that consumes the backend's typed
contracts over HTTP + WebSocket.

**Repo layout** — all under `/home/tradam/projects/dispatch/`:

| Path | What | Git remote | Branch |
|---|---|---|---|
| `backend/` | The server (this codebase) | `[email protected]:realtradam/dispatch.git` | `dev` |
| `frontend/` | The web UI (separate repo, same methodology) | `[email protected]:realtradam/frontend.git` | `dev` |
| `worktrees/` | Per-task git worktrees for isolated feature branches | (varies) | (varies) |
| `bin/` *(inside backend)* | Operational shell scripts (build, install, up, secrets, certs) | — | — |

Both repos use **`dev` as the active development branch**. The backend is a
**Bun + TypeScript** monorepo (project references via `tsc -b`, Biome for
lint/format, Vitest for tests, `bun:sqlite` for storage, the `ai` /
`@ai-sdk/*` packages for model providers). Architecture rules are in
`backend/AGENTS.md` — **read it** before editing (especially the "kernel
touches no I/O" and "no ambient state" rules).

---

## 2. Where things are installed (production)

| Artifact | Path | Notes |
|---|---|---|
| Production server binary | `/usr/bin/dispatch-server` | Bun **standalone compiled** binary. Currently the OLD one (built Jun 27 21:40). Not yours to install (see §6). |
| CLI binary | `/usr/bin/dispatch` | `dispatch send/list/read/...` |
| Frontend static files | `/usr/share/dispatch/web/` | Built by `bin/build` (Vite, `VITE_HTTP_PORT=24991 VITE_WS_PORT=24990`) |
| Config / env file | `/etc/dispatch/env` | `EnvironmentFile` for systemd. **Root-readable only** (`Permission denied` for non-root). Source template in repo: `systemd/dispatch.env` (installed by `bin/setup-env`). |
| Data directory | `/var/lib/dispatch/` | WorkingDirectory of the service. SQLite DBs live here: `dispatch.db`, `dispatch.db-shm`, `dispatch.db-wal`. |
| Logs | the journal | `journalctl -u dispatch` — there are no log files on disk; everything goes to journald. |
| systemd unit (live) | `/etc/systemd/system/dispatch.service` | Installed from the repo template `systemd/dispatch.service` by `bin/install`. |
| systemd unit (repo template) | `backend/systemd/dispatch.service` | **Edit this one**, not the live file — `bin/install` copies it over. |
| Install script | `backend/bin/install` | Builds + installs the binary, unit, env, frontend. Uses `sudo` on privileged lines only (the agent has no sudo — hand scripts to the user). |
| Build script | `backend/bin/build` | `tsc --build` then `bun build --compile` → `dist/dispatch-server`. |
| Memory-limits script | `backend/bin/apply-memory-limits.sh` | Applies `MemoryHigh`/`MemoryMax` to the live service. **User has NOT run it yet.** |

---

## 3. Production ports

| Service | Port | Confirmed |
|---|---|---|
| HTTP API | **24991** | `BACKEND_PORT=24991` in `systemd/dispatch.env` (set by `bin/setup-env`). Live: `ss -tlnp` shows `dispatch-server` listening on `*:24991`. |
| Surface WebSocket | **24990** | `SURFACE_WS_PORT`. Live: `ss -tlnp` shows `dispatch-server` listening on `*:24990`. |

The **dev** stack (`bin/up`) uses different ports: **24203** (HTTP) + **24205** (WS) + **24204** (frontend Vite dev server). Don't confuse dev ports with prod ports.

**Health check:** `curl http://localhost:24991/health` → `{"ok":true}`

---

## 4. How to access the running server

```bash
# Is it running?
systemctl status dispatch

# Live logs (follow)
journalctl -u dispatch -f

# Recent crashes / errors (last 2h)
journalctl -u dispatch --since '2 hours ago' -n 200

# Memory telemetry — ONLY visible once the new binary is deployed (see §6).
# The EXACT log tags to grep (NOT "[mem-telemetry]" — that is the module name):
journalctl -u dispatch -f | rg 'memory:periodic|memory:gc'

# Restart (needs root — hand to the user)
sudo systemctl restart dispatch
```

> **The service is named `dispatch`, NOT `dispatch-server`.** The binary is
> `dispatch-server`; the systemd unit is `dispatch.service`. This mismatch
> tripped up the first investigation — don't repeat it.

The backend checkout at `/home/tradam/projects/dispatch/backend/` is on `dev`
and contains the source. The running binary is compiled, so **editing source
does nothing until you rebuild + install + restart** (§6).

---

## 5. The memory leak — what we know

### 5.1 Symptom & crash signature

- The server grows **~2.5 GB/hour** under load, eventually triggering a **Bun
  runtime segfault** (native crash).
- **Last crash:** Jun 28 00:12 JST.
  ```
  panic(main thread): Segmentation fault at address 0x0
  oh no: Bun has crashed. This indicates a bug in Bun, not your code.
  Main process exited, code=dumped, status=4/ILL
  ```
  RSS was **6.2 GB** at crash (over 2h31m uptime). A prior session hit
  **25.7 GB over 18h**.
- The crash is a **native segfault inside Bun's runtime** (NULL deref in the
  allocator/GC), NOT a JS exception. The crash-time memory readout was itself
  corrupted (`RSS: 0.02ZB` — impossible), and systemd saw SIGILL while Bun
  reported SIGSEGV — both signs of **heap corruption under memory pressure**.
- This is the **only** segfault in 5 days of logs. Earlier crashes (Jun 27
  ~02:44–02:52) were `exit-code` failures, i.e. application-level LSP bugs,
  **now fixed** (see §7).
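
The growth rates implied by the two crash figures above can be checked with simple arithmetic. The sketch below uses illustrative helper names and assumes roughly linear growth from a near-zero baseline, which the telemetry has not yet confirmed. It also treats "G" and "GB" loosely as the same unit.

```typescript
// Back-of-envelope growth rates from the figures quoted in this section.
// Helper names are illustrative; inputs are the numbers from this note.

/** Average growth in GB per hour, given RSS at crash and uptime in hours. */
function growthRateGbPerHour(rssGb: number, uptimeHours: number): number {
  return rssGb / uptimeHours;
}

/** Hours until a limit is reached at a constant rate, starting near zero. */
function hoursToLimit(limitGb: number, rateGbPerHour: number): number {
  return limitGb / rateGbPerHour;
}

// Last crash: 6.2 GB after 2h31m of uptime.
const lastCrashRate = growthRateGbPerHour(6.2, 2 + 31 / 60); // about 2.46 GB/h

// Prior session: 25.7 GB after 18h.
const priorSessionRate = growthRateGbPerHour(25.7, 18); // about 1.43 GB/h

// Time to reach a 24G limit at the faster of the two rates.
const hoursTo24G = hoursToLimit(24, lastCrashRate); // about 9.7 h
```

The last crash matches the ~2.5 GB/hour headline figure, while the prior session averaged closer to 1.4 GB/hour. That difference would be consistent with growth that tracks load rather than wall-clock time, but only the telemetry can confirm it.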

### 5.2 What it is NOT

- **NOT LSP.** The Language Server Protocol extension was the original prime
  suspect; it is exonerated. Its caches are bounded (`MAX_OPEN_DOCUMENTS = 50`,
  `MAX_PUSH_DIAGNOSTICS = 100` — see `packages/lsp/src/client.ts:153` and
  `diagnostics.ts:47`). 50 docs + 100 diag entries cannot explain gigabytes.
- **NOT the conversation store.** `packages/conversation-store` is
  **SQLite-backed** (all reads/writes go through a `StorageNamespace` to
  `bun:sqlite`) — it does not hold conversations in an in-memory map.
- **NOT `activeConversations`.** The session-orchestrator's
  `activeConversations` is just a `Set<string>` of conversation IDs (tiny).

### 5.3 Prime suspects (not yet confirmed — that's your job)

1. **The AI-SDK streaming path** — the `async` generator returned by
   `provider.stream(...)` (`@ai-sdk/*`, including the `openai-stream` package).
   Streaming response buffers may be retained if the generator is not fully
   drained or if the SDK holds internal buffers per request.
2. **Per-turn message arrays** in `session-orchestrator` — the history +
   `userMsg` + `providerMessages` arrays assembled before each
   `provider.stream(...)` call. For long multi-step agent turns these can be
   large; if references outlive the turn (closures, unresolved promises, event
   listeners), they leak.
3. **A Bun bug that 1.3.14 may fix.** The crash occurred on Bun **v1.3.13**.
   A binary built with Bun 1.3.14 exists locally but is not installed (§6);
   deploying it may resolve the segfault even if a leak remains.

### 5.4 Key files to examine (the leak-suspect code path)

| File | What's there | Why it matters |
|---|---|---|
| `packages/session-orchestrator/src/orchestrator.ts` | `runTurnDetached()` at **line 541** — kicks off a turn; the `activeConversations.add/delete` around it (556, 979) brackets the turn lifecycle. | Turn lifecycle: where memory should be acquired and released. If a turn's buffers aren't released here, it leaks per-turn. |
| `packages/session-orchestrator/src/orchestrator.ts` | `for await (const event of provider.stream(messages, assembled.tools, providerOpts))` at **line 1276** (and a second one at **1430** for compaction/summary). | **The streaming loop** — the #1 suspect. This is where the AI-SDK async generator is consumed. If the generator is abandoned mid-stream (error, `break`, abort) or its internal buffers are retained, memory grows per turn. |
| `packages/session-orchestrator/src/pure.ts` | The pure turn/step decision logic + the `MemorySample`/`memoryDelta`/`memorySampleAttributes` helpers used by telemetry. | The per-turn before/after sampling wraps the stream boundary (referenced by `mem-telemetry.ts`). |
| `packages/kernel/src/runtime/run-turn.ts` | The **step loop** (`runTurn`) — one agent turn = N steps of (model call → tool dispatch → feed results back). | MAX_STEPS is disabled (`0 = unlimited`, commit `e8b4bf1`), so a single turn can run many steps. Many steps ⇒ many accumulated message arrays ⇒ more retention surface. |
| `packages/kernel/src/runtime/dispatch.ts` | **Tool dispatch** (the `maxConcurrent`/`eager` tool-execution loop). | Tool results accumulate into the turn's message history. A rogue tool returning huge payloads could balloon arrays. |
| `packages/host-bin/src/mem-telemetry.ts` | The periodic telemetry edge effect. | Read this to understand exactly what `memory:periodic` / `memory:gc` log entries contain (rss, heapUsed, heapTotal, external, arrayBuffers, activeConversations, reclaimedRssMB). |

> **Architecture note (AGENTS.md):** the kernel (`packages/kernel`) touches NO
> I/O and names no concrete feature. Decision logic is pure; effects are
> injected at the edges (host-bin). When investigating, keep this layering in
> mind — a leak "in the kernel" would actually be in a closure captured by an
> edge that feeds the kernel, not in the kernel's own (stateless) turn loop.

---

## 6. What's been done (committed to `dev`, NOT yet deployed)

The running binary is still the **old one** (built Jun 27 21:40, pre-telemetry,
pre-Bun-1.3.14, pre-LSP-disable). Four commits on `dev` need building +
installing before any of this takes effect. **Verify before you trust:** the
live unit still reports `MemoryMax=infinity` (confirmed via
`systemctl show dispatch -p MemoryMax`).

| Commit | What | Status |
|---|---|---|
| `d1de9ed` | `systemd/dispatch.service` gets `MemoryHigh=20G` + `MemoryMax=24G` circuit breaker. | Committed. **Live = infinity (not applied).** |
| `b3b6eb4` | `bin/apply-memory-limits.sh` — applies the limits to the live service. | Committed. **User has NOT run it.** |
| `1cd66da` | Memory telemetry: periodic `process.memoryUsage()` every 15s, per-turn before/after sampling, `activeConversations` tagging, `Bun.gc(true)` every 5 min. Files: `packages/host-bin/src/mem-telemetry.ts` + `packages/session-orchestrator/src/pure.ts`. | Committed. **NOT deployed.** |
| `73ff84c` | **LSP disabled** as a precaution (`main.ts` import commented out + removed from `CORE_EXTENSIONS`; `transport-http` tolerates its absence). Also set telemetry interval 60s→15s. | Committed. **NOT deployed.** LSP stays disabled until the leak is understood. |
| (bun upgrade) | Bun **1.3.13 → 1.3.14** via `bun upgrade`. New binary built at `dist/dispatch-server`. | Rebuilt. **NOT installed** to `/usr/bin/dispatch-server`. |

### 6.1 The telemetry data you will collect (once deployed)

The exact log tags to grep in journald are **`memory:periodic`** and
**`memory:gc`** (the module is `mem-telemetry.ts`, but the log *messages* are
those two strings — do not grep for `[mem-telemetry]`).

- `memory:periodic` — every 15s: `rss`, `heapUsed`, `heapTotal`, `external`,
  `arrayBuffers` (bytes), plus `activeConversations` (count). Correlate RSS
  growth with active turns: if RSS climbs while `activeConversations` is 0,
  the leak is background/idle (timers, watchers, retained closures); if it
  climbs only during/after turns, it's the streaming/turn path.
- `memory:gc` — every 5 min: runs `Bun.gc(true)` and logs RSS before/after +
  `reclaimedRssMB` (`before - after`). **Interpretation:** a large positive
  `reclaimedRssMB` = GC freed it (fragmentation, reclaimable — not a true
  leak). Near-zero/negative `reclaimedRssMB` = memory is **LIVE** (retained
  objects — a real leak). This single field distinguishes the two failure
  modes; check it first when you get data.
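
The two readings described above can be sketched as small pure functions over parsed samples. The field names follow the list above; the sample shapes, the `thresholdMB` cut-off, and the function names are assumptions for illustration, and parsing the journald lines into these objects is left out because the exact log format is not documented here.

```typescript
// Sample shapes are assumptions; only the field names come from the telemetry description.
interface PeriodicSample {
  rss: number; // bytes
  activeConversations: number;
}

interface GcSample {
  reclaimedRssMB: number;
}

type GcVerdict = "reclaimable" | "retained";

/** Large positive reclaim suggests fragmentation; near-zero or negative suggests live objects. */
function classifyGc(sample: GcSample, thresholdMB = 100): GcVerdict {
  // thresholdMB is a hypothetical cut-off; tune it against real data.
  return sample.reclaimedRssMB >= thresholdMB ? "reclaimable" : "retained";
}

interface GrowthSplit {
  idleBytes: number;
  activeBytes: number;
}

/** Attribute RSS growth between consecutive samples to idle or active intervals. */
function splitGrowth(samples: PeriodicSample[]): GrowthSplit {
  const split: GrowthSplit = { idleBytes: 0, activeBytes: 0 };
  let prev: PeriodicSample | undefined;
  for (const curr of samples) {
    if (prev !== undefined) {
      const delta = curr.rss - prev.rss;
      // An interval counts as idle only if no conversation was active at either end.
      if (prev.activeConversations === 0 && curr.activeConversations === 0) {
        split.idleBytes += delta;
      } else {
        split.activeBytes += delta;
      }
    }
    prev = curr;
  }
  return split;
}
```

If `idleBytes` dominates over a long window, the background suspects (timers, watchers, retained closures) deserve attention first; if `activeBytes` dominates, the turn and streaming path does.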

### 6.2 Deploy sequence (hand to the user — you have no sudo)

The agent cannot run `sudo` or install to `/usr/bin`. Write a script with
`sudo` on the privileged lines and hand it to the user (per the repo's sudo
policy). The sequence, roughly:

1. From `backend/`: `bun run typecheck && bun run test && bun run check` to
   make sure `dev` is green (it should be, since all of these commits are
   already on `dev`).
2. `bin/build` — produces `dist/dispatch-server` (Bun 1.3.14, with telemetry +
   LSP-disabled).
3. `bin/install` (or a targeted script) — copies `dist/dispatch-server` →
   `/usr/bin/dispatch-server`, the unit template → `/etc/systemd/system/`,
   `systemd/dispatch.env` → `/etc/dispatch/env`. (Or run `bin/apply-memory-limits.sh`
   separately if you only want the cgroup limits without a full reinstall.)
4. `sudo systemctl daemon-reload && sudo systemctl restart dispatch`.
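5. Confirm: `systemctl show dispatch -p MemoryMax` should report the 24G
   limit (`systemctl show` typically prints the value in bytes, so expect
   `MemoryMax=25769803776` rather than `24G`), and
   `journalctl -u dispatch -f | rg 'memory:periodic'` should show telemetry
   within 15s.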
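
When checking the limits in the confirm step, note that `systemctl show` usually reports memory limits in bytes and that systemd's `G` suffix is base 1024. A minimal sketch for turning that output back into GiB follows; the function names are illustrative, and the sample string is an assumed example of what the deployed unit would print, not captured output.

```typescript
/** Parse `Key=Value` lines as printed by `systemctl show -p ...`. */
function parseShowOutput(output: string): Map<string, string> {
  const props = new Map<string, string>();
  for (const line of output.split("\n")) {
    const eq = line.indexOf("=");
    if (eq > 0) {
      props.set(line.slice(0, eq), line.slice(eq + 1).trim());
    }
  }
  return props;
}

/** Convert a byte count (or the literal "infinity") to GiB. */
function limitInGiB(value: string): number {
  return value === "infinity" ? Infinity : Number(value) / 1024 ** 3;
}

// Assumed example output after a successful deploy (not captured from the live host).
const sampleOutput = "MemoryMax=25769803776\nMemoryHigh=21474836480\n";
const shown = parseShowOutput(sampleOutput);
const maxGiB = limitInGiB(shown.get("MemoryMax") ?? "infinity");
const highGiB = limitInGiB(shown.get("MemoryHigh") ?? "infinity");
```

With the assumed sample, `maxGiB` is 24 and `highGiB` is 20; a value of `Infinity` would mean the limits were not applied.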

> ⚠️ **Warning:** deploying restarts the production server. The live server is
> actively serving agent conversations. Coordinate with the user (ceb2 /
> tradam) before restarting — don't just yank it. Also note the running server
> at time of writing is PID 28956 (started Jun 28 08:51).

---

## 7. Background — the original crash investigation (already done)

You do **not** need to redo this; it's recorded for context.

- **Original report:** `notes/server-crash-investigation.md` (commit `b83aa8d`).
- The first investigation's task description was **stale**: it claimed an
  "uncommitted hot-fix that disables LSP" existed. It did not — the working
  tree was clean and LSP was *enabled*. The developer then *chose* to disable
  LSP (commit `73ff84c`) as a precaution, but the root cause was already
  identified as the Bun segfault + leak, not LSP.
- The **three original LSP suspects** (JSON-parse TypeError, `fs.watch` ENOENT,
  unbounded cache leak) were all fixed in commits `05ff256` + `f9d1ca5`, and
  those fixes are tested. LSP being disabled now is a precaution, not a fix
  for the crash. Once the leak is understood, LSP can be re-enabled safely.
- **Do not spend time re-investigating LSP.** It is exonerated.

---

## 8. What needs investigation (your tasks, in order)

1. **Deploy the telemetry + cgroup limits** (§6) — you cannot localize the leak
   without the `memory:periodic` / `memory:gc` data. Get a full crash cycle's
   worth of logs (let it run until it either hits `MemoryMax=24G` and restarts,
   or until the growth pattern is clear).
2. **Read the telemetry first.** The `memory:gc` `reclaimedRssMB` field tells
   you live-retained vs fragmentation. If near-zero after GC → real retained
   leak; chase which subsystem holds it.
3. **Correlate growth with `activeConversations`.** Idle growth points to
   timers, watchers, or retained closures (the LSP file-watcher would be a
   candidate too, but LSP is disabled in the new build). Turn-bound growth
   points to the streaming/turn path (suspects in §5.3, files in §5.4).
4. **Examine the streaming loop** (`orchestrator.ts:1276`): is the
   `provider.stream(...)` async generator fully drained on every path
   (success, error, abort, `break`)? Does the AI SDK retain response buffers?
   Consider adding a before/after `memoryUsage()` around a single turn to
   measure per-turn delta directly.
5. **Examine the message arrays** assembled before `provider.stream` — are
   they scoped to the turn and released when the turn completes, or do
   closures/promises/event-listeners keep them alive?
6. **Test Bun 1.3.14 in isolation.** Since the upgrade is built but not
   installed, see if a memory-stress repro under 1.3.13 vs 1.3.14 behaves
   differently — the segfault may be a Bun allocator bug that 1.3.14 fixes.
7. **Propose a fix** (do not implement without coordinating — report up to the
   orchestrator). Likely candidates: ensure the streaming generator is always
   fully consumed / explicitly closed on error; bound per-turn message
   history; release references at `activeConversations.delete` time; or
   confirm it's purely a Bun bug requiring the 1.3.14 deploy.
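
Task 4 suggests measuring the memory delta around a single turn. A minimal sketch of such a wrapper is below. The names `measureTurn`, `MemorySnapshot`, and `ReadMemory` are hypothetical and are not the `MemorySample`/`memoryDelta` helpers in `pure.ts`; the memory reader is passed in as a parameter so the function itself touches no ambient state.

```typescript
// Hypothetical helper for a one-off measurement; not the repo's telemetry code.
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

/** Injected reader, e.g. `() => process.memoryUsage()` in Bun or Node. */
type ReadMemory = () => MemorySnapshot;

interface TurnMemoryReport<T> {
  result: T;
  deltaBytes: MemorySnapshot;
}

function subtract(after: MemorySnapshot, before: MemorySnapshot): MemorySnapshot {
  return {
    rss: after.rss - before.rss,
    heapUsed: after.heapUsed - before.heapUsed,
    external: after.external - before.external,
    arrayBuffers: after.arrayBuffers - before.arrayBuffers,
  };
}

/** Run one turn and report how each memory counter changed across it. */
async function measureTurn<T>(
  readMemory: ReadMemory,
  turn: () => Promise<T>,
): Promise<TurnMemoryReport<T>> {
  const before = readMemory();
  const result = await turn(); // a rejected turn propagates; no report is produced
  const after = readMemory();
  return { result, deltaBytes: subtract(after, before) };
}
```

A single positive delta proves little, since it includes garbage not yet collected. The signal to look for is a delta that stays positive across many turns even after a forced GC, and whether it shows up in `heapUsed` (JS objects) or in `external`/`arrayBuffers` (native buffers such as response bodies).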

---

## 9. Quick-reference commands

```bash
# --- status & logs ---
systemctl status dispatch
systemctl show dispatch -p MemoryMax -p MemoryHigh       # expect 24G/20G after deploy (printed in bytes)
journalctl -u dispatch -f                                  # live logs
journalctl -u dispatch -f | rg 'memory:periodic|memory:gc' # telemetry (after deploy)
journalctl -u dispatch --since '2 hours ago' -n 200       # recent history / crashes
curl http://localhost:24991/health                        # → {"ok":true}
ss -tlnp | rg dispatch                                    # ports 24991 + 24990

# --- source (the leak path) ---
cd /home/tradam/projects/dispatch/backend
git log --oneline -n 8                                    # see the undeployed commits
rg -n 'runTurnDetached|provider\.stream|for await' packages/session-orchestrator/src/orchestrator.ts

# --- build & verify (NO sudo needed) ---
bun run typecheck && bun run test && bun run check
bin/build                                                 # → dist/dispatch-server

# --- deploy (NEEDS sudo — hand a script to the user) ---
# bin/install  (or: sudo cp dist/dispatch-server /usr/bin/dispatch-server)
# bin/apply-memory-limits.sh   (applies MemoryMax to live service)
# sudo systemctl daemon-reload && sudo systemctl restart dispatch
```

---

## 10. Guardrails

- **Do NOT edit implementation code without reporting to the orchestrator.**
  This is an investigation handoff; propose fixes, don't apply them
  unilaterally.
- **You have no sudo.** Any step touching `/usr/bin`, `/etc/`, or
  `systemctl restart` must be a script handed to the user with `sudo` on the
  privileged lines.
- **Do not re-enable LSP** until the leak is understood (it's disabled as a
  precaution; re-enabling is a separate decision).
- **Do not merge or push.** Work stays on `dev` locally.
- **The service is `dispatch`, not `dispatch-server`.**