# Server Crash Investigation: LSP Suspected

**Date:** 2026-06-27
**Investigator:** umans/umans-glm-5.2
**Checkout:** `dispatch/backend` on `dev` (commit `f9d1ca5`, working tree clean)

## TL;DR

The crash under investigation (Jun 28 00:12 JST) is **a Bun runtime segfault**, not an LSP application crash. The three LSP issues the task suspected (JSON-parse TypeError, fs.watch ENOENT, unbounded cache memory leak) were **all already fixed** in commits `05ff256` (21:14) and `f9d1ca5` (21:32), compiled into the binary at 21:40, which is the binary that crashed 2.5 h later.

A separate, still-live **memory leak** (6.2 GB in 2.5 h post-fix; 25.7 GB in 18 h pre-fix) fed memory pressure into Bun's allocator and triggered the native crash. The leak is **not** in the LSP caches (they are now bounded to ~50 docs + ~100 diag entries); it is elsewhere in the runtime (most likely AI-SDK streaming buffers / retained conversation message arrays during long agent turns).

The task's premise ("there is an uncommitted hot-fix that disables LSP in `host-bin`/`transport-http`") does **not** match the checkout. The working tree is clean; LSP is fully enabled (not disabled); no `test_race.js` exists. The developer chose to **fix** LSP rather than disable it.

---

## 1. What the logs actually show

### The crash (the event under investigation)

`systemctl status dispatch.service` (the service is named `dispatch.service`, **not** `dispatch-server.service`; the latter does not exist):

```
Jun 28 00:12:02 dispatch-server[931590]: Bun v1.3.13 (bf2e2cec) Linux x64
Jun 28 00:12:02 dispatch-server[931590]: WSL Kernel v6.6.87 | glibc v2.43
Jun 28 00:12:02 dispatch-server[931590]: Args: "/usr/bin/dispatch-server"
Jun 28 00:12:02 dispatch-server[931590]: Elapsed: 9115307ms | User: 582013ms | Sys: 738525ms
Jun 28 00:12:02 dispatch-server[931590]: RSS: 0.02ZB | Peak: 0.29GB | Commit: 0.02ZB | Faults: 0 | Machine: 33.24GB
Jun 28 00:12:02 dispatch-server[931590]: panic(main thread): Segmentation fault at address 0x0
Jun 28 00:12:02 dispatch-server[931590]: oh no: Bun has crashed. This indicates a bug in Bun, not your code.
Jun 28 00:12:05 systemd[1]: dispatch.service: Main process exited, code=dumped, status=4/ILL
Jun 28 00:12:05 systemd[1]: dispatch.service: Failed with result 'core-dump'.
Jun 28 00:12:05 systemd[1]: dispatch.service: Consumed 1h 29min CPU over 2h 31min wall, 6.2G memory peak.
```

Key signals:

- **`panic(main thread): Segmentation fault at address 0x0`**: a NULL-pointer dereference **inside Bun's runtime**, not in dispatch code. Bun's own crash handler states "This indicates a bug in Bun, not your code."
- **Corrupted crash-time memory report.** `RSS: 0.02ZB` (zettabytes) and `Faults: 0` are impossible values; `Peak: 0.29GB` inside the dump contradicts systemd's cgroup accounting of `6.2G memory peak`. The crash corrupted the runtime state *before* the handler read its own stats, which is consistent with a heap corruption / use-after-free, not a clean NULL deref of a known address.
- **Signal mismatch.** Bun reports SIGSEGV ("Segmentation fault") but systemd recorded `status=4/ILL` (signal 4 = SIGILL, illegal instruction). Two different signals from one crash ⇒ the runtime was already corrupt when the trap fired.
  This pattern is consistent with allocator/GC corruption under memory pressure, although Bun's crash handler may itself end the process with a trap instruction after printing its report, so the mismatch alone is weak evidence.
- **No coredump retained** (`coredumpctl list` → "No coredumps found"; `/var/lib/systemd/coredump` empty), so the stack cannot be symbolicated locally. The redacted report URL is in the journal: `https://bun.report/1.3.13/l_1bf2e2ce…`.
- **Memory pressure was real.** systemd's cgroup (trustworthy) says the process peaked at **6.2 GB** over 2 h 31 m before dying, a ~2.5 GB/h growth rate from a fresh boot.

### Crash history (last 48 h): two distinct failure modes

```
Jun 26 15:09  Failed exit-code     16h12m wall  3.4G peak   (app-level)
Jun 27 02:44  Failed exit-code     1h22m wall   3.1G peak   (app-level; crash loop begins)
Jun 27 02:51  Failed exit-code     7m wall      1.1G peak   (crash loop)
Jun 27 02:52  Failed exit-code     58s wall     700M peak   (crash loop)
Jun 27 20:58  Stopped (manual)     17h58m wall  25.7G peak  (huge leak, no crash)
Jun 28 00:12  SIGSEGV/SIGILL dump  2h31m wall   6.2G peak   (Bun segfault; THIS event)
```

The Jun 27 02:44–02:52 run is a **tight crash loop** (3 exits in 8 min with collapsing uptime: 1h22m → 7m → 58s). These were `exit-code` failures (**not** segfaults), i.e. the application-level LSP crashes (JSON TypeError, ENOENT). The only segfault in 5 days of logs is the 00:12 event, which occurred **after** the LSP fixes were deployed.

---

## 2. The task premise vs. the actual checkout

The task described an **uncommitted hot-fix that disables LSP** in `packages/host-bin/src/main.ts` and `packages/transport-http/src/extension.ts`, plus an untracked `packages/lsp/src/test_race.js`. **None of this exists:**

- `git status` → "nothing to commit, working tree clean". No untracked files.
- `fd test_race` → no results. `test_race.js` does not exist anywhere.
- `git log -- packages/host-bin/src/main.ts packages/transport-http/src/extension.ts` → no "disable LSP" commit. LSP has only ever been **enabled** here.
- `host-bin/src/main.ts:24` imports `@dispatch/lsp`; line 104 includes `lspExt` in `CORE_EXTENSIONS`. LSP is **enabled**.
- `transport-http/src/extension.ts:98` calls `host.getService(lspServiceHandle)` with no try/catch, so LSP is wired **straight-through**, not made optional.

What the developer actually did (commits `05ff256` @ 21:14 and `f9d1ca5` @ 21:32) was **fix** the LSP crashes in place, then rebuild (`/usr/bin/dispatch-server` mtime `2026-06-27 21:40:04`) and restart (`Jun 27 21:40:06 Started`). The prior AI review that identified the three bugs was committed alongside the fixes as `ai-review-report.md`.

---

## 3. The three suspected LSP issues: all already fixed

A committed prior review (`ai-review-report.md`, part of `05ff256`) names the exact three issues the task raised. Each is fixed in the current tree:

### 3a. Unhandled JSON parse / TypeError (`client.ts`): FIXED

`rpc.ts:89-99` wraps `JSON.parse` in try/catch (malformed/split-UTF-8 messages are logged and skipped). The **fatal** bug, though, was the defence-in-depth boundary in `client.ts`: when the language-server process died, `markBroken()` set `this.rpc = null`; a final stdout flush then evaluated `this.rpc?.handleMessage(msg).catch(() => {})` → short-circuits to `undefined.catch()` → a **synchronous TypeError** that crashed the process.

One caveat on the mechanism: in standard JavaScript an optional chain short-circuits the entire remaining chain when its base is nullish, so the expression exactly as quoted would evaluate to `undefined` without throwing. A TypeError therefore implies that `.catch` was reached on an `undefined` result, for example because the call was parenthesised or `handleMessage` itself returned `undefined`; the committed fix guards that case either way.

**Fix** (`client.ts:310`): a second optional-chaining `?.`:

```ts
void this.rpc?.handleMessage(msg)?.catch(() => {});
```

`undefined?.catch()` → `undefined`, no throw. Covered by regression test `handleBytes does not crash when the server dies and rpc is null (Bug 1)`.

### 3b. ENOENT from fs.watch on transient `.old_modules` dirs: FIXED

`extension.ts`'s recursive `node:fs.watch` emitted `'error'` events when `bun install` deleted transient `.old_modules-*` directories. With no `.on("error")` listener, Node/Bun escalates the event to an **uncaught exception** that kills the process.
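
The escalation described above can be reproduced in isolation. What follows is a minimal sketch, assuming a plain `EventEmitter` as a stand-in for the real watcher (`fs.watch` returns an `FSWatcher`, which is an `EventEmitter`). The helper `emitTransientError` and the variable names are hypothetical and not taken from `extension.ts`.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for the FSWatcher returned by fs.watch().
// Emits an ENOENT-style 'error' event and reports whether emit() threw.
function emitTransientError(watcher: EventEmitter): "survived" | "threw" {
  const err = Object.assign(new Error("ENOENT: no such file or directory"), {
    code: "ENOENT",
  });
  try {
    // With no 'error' listener registered, emit() throws the error itself.
    watcher.emit("error", err);
    return "survived";
  } catch {
    return "threw";
  }
}

const unguarded = new EventEmitter();

const guarded = new EventEmitter();
guarded.on("error", () => {
  /* ignore transient FS errors */
});

const unguardedResult = emitTransientError(unguarded);
const guardedResult = emitTransientError(guarded);
console.log({ unguardedResult, guardedResult });
```

In the real watcher the emit happens inside a runtime callback rather than under a `try`/`catch`, so the thrown error has no handler and surfaces as an uncaught exception.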
**Fix** (`extension.ts:100`): a no-op error listener swallows transient FS errors:

```ts
watcher.on("error", () => { /* ignore transient FS errors */ });
```

Covered by `extension.test.ts` (the `__test__realFileWatcher` re-export lets the behaviour be exercised with an injected fake watcher).

### 3c. Unbounded cache memory leak: FIXED (in LSP)

`LanguageServerClient` previously retained every touched file's text + diagnostics forever in `openDocuments`, `lastDiagSnapshot`, and `pushDiagnostics` (the documented "9.5 GB over 12 h" leak).

**Fixes:**

- `client.ts`: `MAX_OPEN_DOCUMENTS = 50` LRU; overflow evicts the least-recently-used doc via `textDocument/didClose` + purge (`evictIfOverCap`, `closeDocument`).
- `diagnostics.ts`: `MAX_PUSH_DIAGNOSTICS = 100`; background-scanned files (never opened by the agent) are evicted oldest-first (`evictPushIfOverCap`, delete-then-set for LRU recency).
- `markBroken()` now `clear()`s `openDocuments` + `lastDiagSnapshot` so a repeatedly-crashed/re-spawned client doesn't accumulate across cycles.
- Initialize timeout leak fixed: the timeout is now passed into `rpc.sendRequest` (which deletes the pending entry on expiry) instead of a `Promise.race` that left the original promise lodged in `pending` forever.

---

## 4. Root cause of the 00:12 crash

**The crash is a Bun runtime segfault, not an LSP bug.** Reasoning:

1. The binary that crashed (`/usr/bin/dispatch-server`, built 21:40:04) **contains all three LSP fixes** (commits 21:14 + 21:32 precede the build). So the known LSP crash paths were already closed.
2. The crash signature is a native `panic(main thread): Segmentation fault at address 0x0` emitted by **Bun's own crash handler**, which explicitly says "This indicates a bug in Bun, not your code." Application-level crashes (the pre-fix TypeError/ENOENT) exit with a JS stack trace and `exit-code`, not a native `code=dumped, status=4/ILL`.
3. The corrupted memory readout (`RSS: 0.02ZB`, `Faults: 0`, SIGSEGV-vs-SIGILL mismatch) points to **heap corruption in Bun's allocator/GC**, not a clean dereference of application state.

**The trigger is memory pressure from a still-live leak.** The post-fix binary grew to **6.2 GB in 2.5 h** (the pre-fix session hit **25.7 GB in 18 h**). The LSP caches are now bounded to ≤50 open documents + ≤100 diagnostic entries, orders of magnitude too small to account for gigabytes. The leak is **elsewhere**:

- `conversation-store` is **SQLite-backed** (`storage.get`/`set` via `StorageNamespace`), not an in-memory map; not the leak.
- `session-orchestrator`'s `activeConversations` is a `Set` of IDs (tiny); not the leak.
- Most likely culprits (not yet confirmed, out of investigation scope): the **AI SDK streaming buffers** and the **conversation message arrays assembled in memory** per turn (`orchestrator.ts` `for await (const event of provider.stream(...))`), which for long multi-step agent turns can hold large transcripts; plus Bun's standalone-executable memory reclamation under WSL.

So there are **two problems**, only one of which is fixed:

| Problem | Status | Failure mode |
|---|---|---|
| LSP app crashes (JSON TypeError, ENOENT) | ✅ Fixed (05ff256, f9d1ca5) | `exit-code`, crash loop |
| Memory leak → Bun segfault | ❌ Not fixed | native `code=dumped`, SIGSEGV/SIGILL |

---

## 5. Proposed fix (do not implement; investigation only)

The LSP work is done and correct; do **not** disable LSP. The remaining issue is the memory leak that triggers the native crash. Recommended actions, in priority order:

1. **Treat the leak as the primary defect, not LSP.** Open a separate investigation to localize the 2.5 GB/h growth. First step: add periodic `process.memoryUsage().rss` + `Bun.gc()` logging on a timer and correlate growth with active conversations/turns. Suspect the AI-SDK streaming path and per-turn message assembly in `session-orchestrator/orchestrator.ts`.
2. **Add a memory-pressure circuit breaker** to `dispatch.service` so a leak degrades gracefully instead of segfaulting: a watchdog that restarts the process on RSS exceeding a threshold (e.g. 3 GB), or systemd `MemoryHigh=`/`MemoryMax=` cgroup limits with a managed restart. This turns the uncontrolled segfault into a controlled recycle.
3. **File the Bun crash** at `https://bun.report/1.3.13/l_1bf2e2ce…` (the URL is in the journal). The corrupted RSS/signal readout is a Bun bug worth reporting regardless of our leak: a memory leak should not crash the runtime with a native segfault; it should OOM-kill cleanly.
4. **Consider a Bun upgrade.** The crash is on `Bun v1.3.13`; allocator/GC fixes land frequently. Pin and test a newer Bun in the standalone build.
5. **Keep the LSP fixes** (`?.catch()`, `watcher.on("error")`, bounded caches, `markBroken` clear, sendRequest timeout). They are correct and tested; disabling LSP would regress the diagnostics tool with no crash benefit, since the crash is not in LSP.

### Evidence summary

| Claim | Evidence |
|---|---|
| LSP fixes deployed before crash | binary mtime 21:40:04 > commits 21:14/21:32; service (re)started 21:40:06 |
| Crash is native, not app-level | `panic(main thread): Segmentation fault`, `code=dumped, status=4/ILL`, no JS stack |
| LSP caches bounded | `MAX_OPEN_DOCUMENTS=50`, `MAX_PUSH_DIAGNOSTICS=100` |
| Leak persists post-fix | 6.2 GB / 2.5 h (post-fix) vs 25.7 GB / 18 h (pre-fix) |
| Conversation store not the leak | SQLite-backed (`StorageNamespace`), not in-memory maps |
| No "disable LSP" hot-fix exists | clean tree; LSP enabled in `main.ts:104` + `transport-http:98`; no `test_race.js` |

### Contract gaps / change-requests for other units

- **session-orchestrator:** the per-turn streaming loop (`orchestrator.ts` `provider.stream(...)`) and message-assembly path are the prime leak suspect. Request a memory profile of a long multi-step turn to confirm whether buffers/arrays are retained after the turn completes. No contract change needed; this is an implementation/leak-localization request.
- **host-bin / systemd unit:** request a `MemoryMax=`/watchdog addition to `/etc/systemd/system/dispatch.service` (infra, not code).
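
The periodic RSS logging from section 5 (item 1) and the restart threshold from item 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the repository: every name here (`RssSample`, `growthGibPerHour`, `shouldRecycle`, `startRssLogger`) is hypothetical, the 3 GB example threshold is treated as 3 GiB, and only `process.memoryUsage()` is an existing runtime API.

```typescript
// Hypothetical helper names throughout; only process.memoryUsage() is a real API.
interface RssSample {
  atMs: number; // sample timestamp in milliseconds
  rssBytes: number; // resident set size in bytes
}

const GIB = 1024 ** 3;

/** Average RSS growth between the first and last sample, in GiB per hour. */
function growthGibPerHour(samples: RssSample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const hours = (last.atMs - first.atMs) / 3_600_000;
  if (hours <= 0) return 0;
  return (last.rssBytes - first.rssBytes) / GIB / hours;
}

/** True once RSS reaches the recycle threshold (assumed default: 3 GiB). */
function shouldRecycle(rssBytes: number, thresholdBytes: number = 3 * GIB): boolean {
  return rssBytes >= thresholdBytes;
}

/** Starts a timer that records RSS and logs the running growth rate. Returns a stop function. */
function startRssLogger(
  intervalMs: number,
  readRss: () => number = () => process.memoryUsage().rss,
  log: (line: string) => void = console.log,
): () => void {
  const samples: RssSample[] = [];
  const timer = setInterval(() => {
    samples.push({ atMs: Date.now(), rssBytes: readRss() });
    const latest = samples[samples.length - 1];
    log(
      `rss=${(latest.rssBytes / GIB).toFixed(2)}GiB ` +
        `growth=${growthGibPerHour(samples).toFixed(2)}GiB/h ` +
        `recycle=${shouldRecycle(latest.rssBytes)}`,
    );
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A caller might run `const stop = startRssLogger(60_000)` at startup and log the active conversation count alongside each line to correlate growth with turns. What to do when `shouldRecycle` returns true (graceful drain, exit for systemd to restart) is deliberately left open here.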