| author | Adam Malczewski <[email protected]> | 2026-06-28 00:23:22 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-06-28 00:23:22 +0900 |
| commit | b83aa8ddbb7023ae9dd0332b4989d4baa11522af (patch) | |
| tree | d0d8f77c631981bec6401df48666bbe0765aeeb4 /notes | |
| parent | f9d1ca533ad2c5d71a3bc349934d54c09de305bf (diff) | |
docs(server-crash): investigate LSP-related server crash
Diffstat (limited to 'notes')
| -rw-r--r-- | notes/server-crash-investigation.md | 257 |
1 files changed, 257 insertions, 0 deletions
diff --git a/notes/server-crash-investigation.md b/notes/server-crash-investigation.md
new file mode 100644
index 0000000..6e11c4f
--- /dev/null
+++ b/notes/server-crash-investigation.md
@@ -0,0 +1,257 @@
+# Server Crash Investigation — LSP Suspected
+
+**Date:** 2026-06-27
+**Investigator:** umans/umans-glm-5.2
+**Checkout:** `dispatch/backend` on `dev` (commit `f9d1ca5`, working tree clean)
+
+## TL;DR
+
+The crash under investigation (Jun 28 00:12 JST) is **a Bun runtime
+segfault**, not an LSP application crash. The three LSP issues the task
+suspected (JSON-parse TypeError, fs.watch ENOENT, unbounded cache memory
+leak) were **all already fixed** in commits `05ff256` (21:14) and `f9d1ca5`
+(21:32), compiled into the binary at 21:40 — which is the binary that
+crashed 2.5 h later. A separate, still-live **memory leak** (6.2 GB in 2.5 h
+post-fix; 25.7 GB in 18 h pre-fix) fed memory pressure into Bun's
+allocator and triggered the native crash. The leak is **not** in the LSP
+caches (they are now bounded to ~50 docs + ~100 diag entries) — it is
+elsewhere in the runtime (most likely AI-SDK streaming buffers / retained
+conversation message arrays during long agent turns).
+
+The task's premise ("there is an uncommitted hot-fix that disables LSP in
+`host-bin`/`transport-http`") does **not** match the checkout. The working
+tree is clean; LSP is fully enabled (not disabled); no `test_race.js`
+exists. The developer chose to **fix** LSP rather than disable it.
+
+---
+
+## 1. What the logs actually show
+
+### The crash (the event under investigation)
+
+`systemctl status dispatch.service` (the service is named `dispatch.service`,
+**not** `dispatch-server.service` — the latter does not exist):
+
+```
+Jun 28 00:12:02 dispatch-server[931590]: Bun v1.3.13 (bf2e2cec) Linux x64
+Jun 28 00:12:02 dispatch-server[931590]: WSL Kernel v6.6.87 | glibc v2.43
+Jun 28 00:12:02 dispatch-server[931590]: Args: "/usr/bin/dispatch-server"
+Jun 28 00:12:02 dispatch-server[931590]: Elapsed: 9115307ms | User: 582013ms | Sys: 738525ms
+Jun 28 00:12:02 dispatch-server[931590]: RSS: 0.02ZB | Peak: 0.29GB | Commit: 0.02ZB | Faults: 0 | Machine: 33.24GB
+Jun 28 00:12:02 dispatch-server[931590]: panic(main thread): Segmentation fault at address 0x0
+Jun 28 00:12:02 dispatch-server[931590]: oh no: Bun has crashed. This indicates a bug in Bun, not your code.
+Jun 28 00:12:05 systemd[1]: dispatch.service: Main process exited, code=dumped, status=4/ILL
+Jun 28 00:12:05 systemd[1]: dispatch.service: Failed with result 'core-dump'.
+Jun 28 00:12:05 systemd[1]: dispatch.service: Consumed 1h 29min CPU over 2h 31min wall, 6.2G memory peak.
+```
+
+Key signals:
+
+- **`panic(main thread): Segmentation fault at address 0x0`** — a NULL-pointer
+  dereference **inside Bun's runtime**, not in dispatch code. Bun's own crash
+  handler states "This indicates a bug in Bun, not your code."
+- **Corrupted crash-time memory report.** `RSS: 0.02ZB` (zettabytes) and
+  `Faults: 0` are impossible values; `Peak: 0.29GB` inside the dump contradicts
+  systemd's cgroup accounting of `6.2G memory peak`. The crash corrupted the
+  runtime state *before* the handler read its own stats — consistent with a
+  heap corruption / use-after-free, not a clean NULL deref of a known address.
+- **Signal mismatch.** Bun reports SIGSEGV ("Segmentation fault") but systemd
+  recorded `status=4/ILL` (signal 4 = SIGILL, illegal instruction). Two
+  different signals from one crash ⇒ the runtime was already corrupt when the
+  trap fired. Classic signature of an allocator/GC corruption under memory
+  pressure.
+- **No coredump retained** (`coredumpctl list` → "No coredumps found";
+  `/var/lib/systemd/coredump` empty), so the stack cannot be symbolicated
+  locally. The redacted report URL is in the journal:
+  `https://bun.report/1.3.13/l_1bf2e2ce…`.
+- **Memory pressure was real.** systemd's cgroup (trustworthy) says the
+  process peaked at **6.2 GB** over 2 h 31 m before dying — a ~2.5 GB/h growth
+  rate from a fresh boot.
+
+### Crash history (last 48 h) — two distinct failure modes
+
+```
+Jun 26 15:09  Failed exit-code     16h12m wall   3.4G peak   (app-level)
+Jun 27 02:44  Failed exit-code      1h22m wall   3.1G peak   (app-level — crash loop begins)
+Jun 27 02:51  Failed exit-code         7m wall   1.1G peak   (crash loop)
+Jun 27 02:52  Failed exit-code        58s wall   700M peak   (crash loop)
+Jun 27 20:58  Stopped (manual)     17h58m wall  25.7G peak   (huge leak, no crash)
+Jun 28 00:12  SIGSEGV/SIGILL dump   2h31m wall   6.2G peak   (Bun segfault — THIS event)
+```
+
+The Jun 27 02:44–02:52 run is a **tight crash loop** (3 exits in 8 min with
+collapsing uptime: 1h22m → 7m → 58s). These were `exit-code` failures
+(**not** segfaults) — i.e. the application-level LSP crashes (JSON TypeError,
+ENOENT). The only segfault in 5 days of logs is the 00:12 event, which
+occurred **after** the LSP fixes were deployed.
+
+---
+
+## 2. The task premise vs. the actual checkout
+
+The task described an **uncommitted hot-fix that disables LSP** in
+`packages/host-bin/src/main.ts` and `packages/transport-http/src/extension.ts`,
+plus an untracked `packages/lsp/src/test_race.js`. **None of this exists:**
+
+- `git status` → "nothing to commit, working tree clean". No untracked files.
+- `fd test_race` → no results. `test_race.js` does not exist anywhere.
+- `git log -- packages/host-bin/src/main.ts packages/transport-http/src/extension.ts`
+  → no "disable LSP" commit. LSP has only ever been **enabled** here.
+- `host-bin/src/main.ts:24` imports `@dispatch/lsp`; line 104 includes `lspExt`
+  in `CORE_EXTENSIONS`. LSP is **enabled**.
+- `transport-http/src/extension.ts:98` calls `host.getService(lspServiceHandle)`
+  with no try/catch — LSP is wired **straight-through**, not made optional.
+
+What the developer actually did (commits `05ff256` @ 21:14 and `f9d1ca5` @
+21:32) was **fix** the LSP crashes in place, then rebuild
+(`/usr/bin/dispatch-server` mtime `2026-06-27 21:40:04`) and restart
+(`Jun 27 21:40:06 Started`). The prior AI review that identified the three
+bugs was committed alongside the fixes as `ai-review-report.md`.
+
+---
+
+## 3. The three suspected LSP issues — all already fixed
+
+A committed prior review (`ai-review-report.md`, part of `05ff256`) names the
+exact three issues the task raised. Each is fixed in the current tree:
+
+### 3a. Unhandled JSON parse / TypeError (`client.ts`) — FIXED
+
+`rpc.ts:89-99` wraps `JSON.parse` in try/catch (malformed/split-UTF-8 messages
+are logged and skipped). The **fatal** bug, though, was the defence-in-depth
+boundary in `client.ts`: when the language-server process died, `markBroken()`
+set `this.rpc = null`; a final stdout flush then evaluated
+`this.rpc?.handleMessage(msg).catch(() => {})` → short-circuits to
+`undefined.catch()` → a **synchronous TypeError** that crashed the process.
+
+**Fix** (`client.ts:310`): a second optional-chaining `?.`:
+```ts
+void this.rpc?.handleMessage(msg)?.catch(() => {});
+```
+`undefined?.catch()` → `undefined`, no throw. Covered by regression test
+`handleBytes does not crash when the server dies and rpc is null (Bug 1)`.
+
+### 3b. ENOENT from fs.watch on transient `.old_modules` dirs — FIXED
+
+`extension.ts`'s recursive `node:fs.watch` emitted `'error'` events when
+`bun install` deleted transient `.old_modules-*` directories. With no
+`.on("error")` listener, Node/Bun escalates the event to an **uncaught
+exception** that kills the process.
+
+**Fix** (`extension.ts:100`): a no-op error listener swallows transient FS
+errors:
+```ts
+watcher.on("error", () => { /* ignore transient FS errors */ });
+```
+Covered by `extension.test.ts` (the `__test__realFileWatcher` re-export lets
+the behaviour be exercised with an injected fake watcher).
+
+### 3c. Unbounded cache memory leak — FIXED (in LSP)
+
+`LanguageServerClient` previously retained every touched file's text +
+diagnostics forever in `openDocuments`, `lastDiagSnapshot`, and
+`pushDiagnostics` (the documented "9.5 GB over 12 h" leak).
+
+**Fixes:**
+- `client.ts`: `MAX_OPEN_DOCUMENTS = 50` LRU — overflow evicts the
+  least-recently-used doc via `textDocument/didClose` + purge
+  (`evictIfOverCap`, `closeDocument`).
+- `diagnostics.ts`: `MAX_PUSH_DIAGNOSTICS = 100` — background-scanned files
+  (never opened by the agent) are evicted oldest-first
+  (`evictPushIfOverCap`, delete-then-set for LRU recency).
+- `markBroken()` now `clear()`s `openDocuments` + `lastDiagSnapshot` so a
+  repeatedly-crashed/re-spawned client doesn't accumulate across cycles.
+- Initialize timeout leak fixed: the timeout is now passed into
+  `rpc.sendRequest` (which deletes the pending entry on expiry) instead of a
+  `Promise.race` that left the original promise lodged in `pending` forever.
+
+---
+
+## 4. Root cause of the 00:12 crash
+
+**The crash is a Bun runtime segfault, not an LSP bug.** Reasoning:
+
+1. The binary that crashed (`/usr/bin/dispatch-server`, built 21:40:04)
+   **contains all three LSP fixes** (commits 21:14 + 21:32 precede the build).
+   So the known LSP crash paths were already closed.
+2. The crash signature is a native `panic(main thread): Segmentation fault at
+   address 0x0` emitted by **Bun's own crash handler**, which explicitly says
+   "This indicates a bug in Bun, not your code." Application-level crashes
+   (the pre-fix TypeError/ENOENT) exit with a JS stack trace and `exit-code`,
+   not a native `code=dumped, status=4/ILL`.
+3. The corrupted memory readout (`RSS: 0.02ZB`, `Faults: 0`, SIGSEGV-vs-SIGILL
+   mismatch) points to **heap corruption in Bun's allocator/GC**, not a clean
+   dereference of application state.
+
+**The trigger is memory pressure from a still-live leak.** The post-fix binary
+grew to **6.2 GB in 2.5 h** (the pre-fix session hit **25.7 GB in 18 h**). The
+LSP caches are now bounded to ≤50 open documents + ≤100 diagnostic entries —
+orders of magnitude too small to account for gigabytes. The leak is
+**elsewhere**:
+
+- `conversation-store` is **SQLite-backed** (`storage.get`/`set` via
+  `StorageNamespace`), not an in-memory map — not the leak.
+- `session-orchestrator`'s `activeConversations` is a `Set<string>` of IDs
+  (tiny) — not the leak.
+- Most likely culprits (not yet confirmed, out of investigation scope): the
+  **AI SDK streaming buffers** and the **conversation message arrays assembled
+  in memory** per turn (`orchestrator.ts` `for await (const event of
+  provider.stream(...))`), which for long multi-step agent turns can hold large
+  transcripts; plus Bun's standalone-executable memory reclamation under WSL.
+
+So there are **two problems**, only one of which is fixed:
+
+| Problem | Status | Failure mode |
+|---|---|---|
+| LSP app crashes (JSON TypeError, ENOENT) | ✅ Fixed (05ff256, f9d1ca5) | `exit-code`, crash loop |
+| Memory leak → Bun segfault | ❌ Not fixed | native `code=dumped`, SIGSEGV/SIGILL |
+
+---
+
+## 5. Proposed fix (do not implement — investigation only)
+
+The LSP work is done and correct; do **not** disable LSP. The remaining issue
+is the memory leak that triggers the native crash. Recommended actions, in
+priority order:
+
+1. **Treat the leak as the primary defect, not LSP.** Open a separate
+   investigation to localize the 2.5 GB/h growth. First step: add periodic
+   `process.memoryUsage().rss` + `Bun.gc()` logging on a timer and correlate
+   growth with active conversations/turns. Suspect the AI-SDK streaming path
+   and per-turn message assembly in `session-orchestrator/orchestrator.ts`.
+2. **Add a memory-pressure circuit breaker** to `dispatch.service` so a leak
+   degrades gracefully instead of segfaulting: a watchdog that restarts the
+   process on RSS exceeding a threshold (e.g. 3 GB), or systemd
+   `MemoryHigh=`/`MemoryMax=` cgroup limits with a managed restart. This turns
+   the uncontrolled segfault into a controlled recycle.
+3. **File the Bun crash** at `https://bun.report/1.3.13/l_1bf2e2ce…` (the URL
+   is in the journal). The corrupted RSS/signal readout is a Bun bug worth
+   reporting regardless of our leak — a memory leak should not crash the
+   runtime with a native segfault; it should OOM-kill cleanly.
+4. **Consider a Bun upgrade.** The crash is on `Bun v1.3.13`; allocator/GC
+   fixes land frequently. Pin and test a newer Bun in the standalone build.
+5. **Keep the LSP fixes** (`?.catch()`, `watcher.on("error")`, bounded caches,
+   `markBroken` clear, sendRequest timeout). They are correct and tested;
+   disabling LSP would regress the diagnostics tool with no crash benefit,
+   since the crash is not in LSP.
+
+### Evidence summary
+
+| Claim | Evidence |
+|---|---|
+| LSP fixes deployed before crash | binary mtime 21:40:04 > commits 21:14/21:32; service (re)started 21:40:06 |
+| Crash is native, not app-level | `panic(main thread): Segmentation fault`, `code=dumped, status=4/ILL`, no JS stack |
+| LSP caches bounded | `MAX_OPEN_DOCUMENTS=50`, `MAX_PUSH_DIAGNOSTICS=100` |
+| Leak persists post-fix | 6.2 GB / 2.5 h (post-fix) vs 25.7 GB / 18 h (pre-fix) |
+| Conversation store not the leak | SQLite-backed (`StorageNamespace`), not in-memory maps |
+| No "disable LSP" hot-fix exists | clean tree; LSP enabled in `main.ts:104` + `transport-http:98`; no `test_race.js` |
+
+### Contract gaps / change-requests for other units
+
+- **session-orchestrator:** the per-turn streaming loop (`orchestrator.ts`
+  `provider.stream(...)`) and message-assembly path are the prime leak suspect.
+  Request a memory profile of a long multi-step turn to confirm whether
+  buffers/arrays are retained after the turn completes. No contract change
+  needed — this is an implementation/leak-localization request.
+- **host-bin / systemd unit:** request a `MemoryMax=`/watchdog addition to
+  `/etc/systemd/system/dispatch.service` (infra, not code).
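
Reviewer sketch (not part of the patch): recommended action 1 in section 5 asks for periodic RSS logging correlated with active turns. The TypeScript below is a minimal illustration of that idea, not code from the dispatch repository. The names `MemorySample`, `growthGbPerHour`, `startMemoryLogger` and the `activeTurns` callback are hypothetical. The optional `Bun.gc(true)` call is guarded so the same file also runs under Node.

```typescript
// Hypothetical helper names throughout; nothing here is taken from dispatch.
interface MemorySample {
  atMs: number; // wall-clock timestamp in milliseconds
  rssBytes: number; // resident set size in bytes
}

const BYTES_PER_GB = 1024 ** 3;
const MS_PER_HOUR = 3_600_000;

/** Average RSS growth between the first and last sample, in GB per hour. */
function growthGbPerHour(samples: readonly MemorySample[]): number {
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (!first || !last || last.atMs <= first.atMs) return 0;
  const deltaGb = (last.rssBytes - first.rssBytes) / BYTES_PER_GB;
  const deltaHours = (last.atMs - first.atMs) / MS_PER_HOUR;
  return deltaGb / deltaHours;
}

/**
 * Logs RSS on a timer and returns a function that stops the logger.
 * `activeTurns` is an assumed callback supplied by the caller, for example a
 * counter maintained around the orchestrator's streaming loop.
 */
function startMemoryLogger(
  intervalMs: number = 60_000,
  activeTurns: () => number = () => 0,
  maxSamples: number = 120,
): () => void {
  const samples: MemorySample[] = [];
  const timer = setInterval(() => {
    // Force a collection first when running under Bun, so the reading
    // reflects retained memory rather than garbage awaiting collection.
    const bun = (globalThis as { Bun?: { gc(force: boolean): void } }).Bun;
    bun?.gc(true);

    samples.push({ atMs: Date.now(), rssBytes: process.memoryUsage().rss });
    if (samples.length > maxSamples) samples.shift(); // keep the window bounded

    const latest = samples[samples.length - 1];
    const rssGb = latest ? latest.rssBytes / BYTES_PER_GB : 0;
    console.log(
      `[mem] rss=${rssGb.toFixed(2)}GB ` +
        `growth=${growthGbPerHour(samples).toFixed(2)}GB/h ` +
        `activeTurns=${activeTurns()}`,
    );
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Given the journal's two endpoints (a fresh start and 6.2 GB after 9115307 ms), `growthGbPerHour` returns roughly 2.4 GB/h, consistent with the note's ~2.5 GB/h estimate. The sample window is capped so the logger cannot itself become a source of growth.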
