author Adam Malczewski <[email protected]> 2026-05-30 23:15:18 +0900
committer Adam Malczewski <[email protected]> 2026-05-30 23:15:18 +0900
commit 4e636511ae748d606d8871f5068a2bd18b386bd0
tree a5e0726d71d9d88d09d938ea2318a61e36ade68f /notes
parent 624b808da0f2f8bbad8a4fbbcca3f82f24ecfc47
chore(notes): collect loose root docs into notes/; add reconcile edge-cases note
Move all loose root-level .md files (plans, reports, gemini reviews, incident notes) into a single notes/ directory, and update the doc-reference breadcrumbs in code comments/test labels to the notes/ path. Add notes/queue-interrupt-reconcile-edge-cases.md: documents why the queue/interrupt/turn-sealed reconcile path keeps surfacing edge cases (a catalog of the four review-pass bugs, the no-loss/no-duplicate invariants, the recommended membership-based reconcile refactor, and interleaving-test guidance).
Diffstat (limited to 'notes')
-rw-r--r-- notes/cache-miss-report.md 180
-rw-r--r-- notes/changes-report.md 54
-rw-r--r-- notes/changes.md 133
-rw-r--r-- notes/claude-auth-report.md 282
-rw-r--r-- notes/claude-report.md 114
-rw-r--r-- notes/context.md 169
-rw-r--r-- notes/eviction-limitation.md 105
-rw-r--r-- notes/gemini-chunk-eviction-review-2.md 95
-rw-r--r-- notes/gemini-chunk-eviction-review-3.md 118
-rw-r--r-- notes/gemini-chunk-eviction-review.md 69
-rw-r--r-- notes/gemini-chunk-log-review.md 90
-rw-r--r-- notes/harness-comparison.md 373
-rw-r--r-- notes/plan-bg-restore.md 1294
-rw-r--r-- notes/plan-chunk-eviction.md 251
-rw-r--r-- notes/plan-chunk-log.md 271
-rw-r--r-- notes/plan-chunk-refactor.md 245
-rw-r--r-- notes/plan-v6-upgrade.md 450
-rw-r--r-- notes/plan.md 451
-rw-r--r-- notes/problem.md 79
-rw-r--r-- notes/queue-interrupt-reconcile-edge-cases.md 183
-rw-r--r-- notes/report.md 38
-rw-r--r-- notes/requirements.md 353
-rw-r--r-- notes/tool-runner-duplication-incident.md 147
-rw-r--r-- notes/wishlist.md 26
24 files changed, 5570 insertions, 0 deletions
diff --git a/notes/cache-miss-report.md b/notes/cache-miss-report.md
new file mode 100644
index 0000000..03342af
--- /dev/null
+++ b/notes/cache-miss-report.md
@@ -0,0 +1,180 @@
+# Cache Miss Investigation — Dispatch / Claude prompt caching
+
+> Read-only investigation. No code was modified. File:line references are to the
+> state of the tree at the time of writing.
+
+## TL;DR
+
+The beta header and cache-breakpoint placement are **correct**. The cache misses
+come from a **message-serialization instability inside multi-step turns**:
+dispatch stores an entire multi-step assistant turn as a *single growing
+assistant message*, and `toModelMessages` + the Anthropic Pass-3 normalization
+**re-bucket every tool-call and every tool-result on every step**. This moves
+earlier steps' `tool_use` / `tool_result` blocks to new positions each step, so
+Anthropic can only match the prefix up to the *first* step's text/thinking.
+Everything after is re-written as a new cache entry.
+
+Result: `cache_write` grows every step while `cache_read` stays flat — exactly
+the panel readout (write 100,470 ≫ read 40,448; last request 19% < session 29%).
+
+## Evidence (the Cache Rate panel)
+
+```
+read    Cache hits        40,448
+write   Cache writes     100,470
+fresh   Uncached input        18
+Total input 140,936 (= read + write + fresh, confirms inputTokens is the TOTAL prompt)
+Output 5,130
+9 req
+Session (this tab) 29%
+Last request 19%
+```
+
+Two tells:
+
+1. **writes ≫ reads.** In a healthy rolling cache over a growing turn, reads
+ accumulate and dominate writes. Here writes are ~2.5× reads → the cacheable
+ prefix is being invalidated and re-written almost every request.
+2. **Last request (19%) < cumulative (29%).** The *largest* request has the
+ *lowest* hit rate. In a working rolling cache the last request should have the
+ *highest* hit rate (biggest cached prefix, smallest delta). The inversion means the
+ newest, biggest request re-wrote the most.
+
+## What is fine (ruled out)
+
+- **Beta header present** — `prompt-caching-scope-2026-01-05` is sent
+ (`packages/core/src/credentials/anthropic-betas.ts:13-20`, wired in
+ `packages/core/src/llm/provider.ts:123`). The earlier `claude-report.md`
+ "Root Cause 1" (missing beta) is fixed.
+- **Breakpoint placement matches OpenCode exactly** — dispatch's
+ `applyAnthropicCaching` (`packages/core/src/agent/agent.ts:448-466`) marks
+ first-2 system + last-2 non-system messages at the *message* level, identical
+ to OpenCode's `applyCaching`
+ (`references/opencode/packages/opencode/src/provider/transform.ts:345-394`)
+ for `providerID === "anthropic"`.
+- **System prompt is deterministic** — `buildSystemPrompt`
+ (`packages/api/src/agent-manager.ts:127-147`) has no date/cwd/env/timestamp.
+ The billing-header `cch` derives from the first user message
+ (`packages/core/src/credentials/claude.ts:354-372`), stable within a
+ conversation. So the `tools + system` prefix does **not** churn.
+- **OAuth body transform is a faithful port** of the reference plugin
+ (`packages/core/src/llm/anthropic-oauth-transform.ts` ≈
+ `references/opencode-claude-auth/src/transforms.ts`).
+
+## Root cause (primary) — multi-step turns reshuffle their own prefix every step
+
+### The architecture
+
+A whole turn (all tool steps) accumulates into ONE assistant message whose
+`chunks` array is shared across steps (`agent.ts:786`, pushed once at
+`agent.ts:923-926`). Each step's first `tool-call` opens a *new* `tool-batch`
+chunk because the preceding chunk is text/thinking
+(`packages/core/src/chunks/append.ts:111-124`). So after 2 steps the single
+message is:
+
+```
+assistant.chunks = [ text0, think0, batch0{A:+resultA}, text1, think1, batch1{B:+resultB} ]
+```
+
+### The serialization
+
+`toModelMessages` (`agent.ts:162-256`) walks all chunks of that one message,
+pushing **all** tool-calls into `parts` and **all** results into a single
+trailing `tool` message. Then `applyAnthropicStructuralNormalisations` Pass 3
+(`agent.ts:399-413`) splits the assistant message because tool-calls are
+followed by later-step text.
+
+Concrete trace of the request prefix:
+
+```
+# Step-1 request (after step 0)     # Step-2 request (after step 1)
+assistant:[text0, think0, callA]    assistant:[text0, think0, text1, think1]  <- Pass-3 non-tool
+tool:[resultA]                      assistant:[callA, callB]                  <- Pass-3 tool bucket
+                                    tool:[resultA, resultB]
+```
+
+After the `@ai-sdk/anthropic` converter merges the two consecutive assistant
+messages, the single Anthropic assistant turn is:
+
+- step 1 cached: `[text0, think0, tool_use_A]`
+- step 2 sends: `[text0, think0, text1, think1, tool_use_A, tool_use_B]`
+
+The two diverge **right after `think0`** (cached had `tool_use_A` next; step 2
+has `text1`). Anthropic's longest-prefix match ends at roughly
+`static + user1 + text0 + think0`. Everything from there on — including
+`resultA`, which was the cached tail one step earlier but is now re-grouped with
+`resultB` — becomes a **cache write**.
+
+Every additional step pushes all prior `tool_use` blocks further back and
+re-groups all results, so:
+
+- the matchable read prefix stays ≈ constant (static prefix + first step),
+- the re-written suffix grows with the entire turn,
+- cumulative `cache_write` balloons, `cache_read` stays low, and the **largest
+ (last) request has the lowest hit rate**.
+
+This reproduces every symptom in the panel.
+
+### Why it slipped through
+
+The caching tests (`packages/core/tests/agent/agent.test.ts:1034`, `:1078`) only
+exercise a *single* step with parallel calls (3 reads → one tool message), which
+is correct. The **multi-step accumulation path has no test**, and that's the
+path that churns.
+
+### Contrast with OpenCode
+
+OpenCode's native loop appends each step as its own stable `assistant` + `tool`
+message pair (`[system, user, assistant0, tool0, assistant1, tool1, …]`). Those
+never move, so its rolling cache accumulates. Dispatch's "one growing assistant
+message, re-derive the split each step" is the divergence.
+
+## Secondary findings (smaller, worth noting)
+
+1. **Per-turn session-id rotation.** `createProvider` runs inside `run()` and
+ mints a new `X-Claude-Code-Session-Id` per turn (`provider.ts:92`, called at
+ `agent.ts:745`). Within a turn it's stable, but if the `prompt-caching-scope`
+ beta keys cache by session, cross-turn reads are lost on top of the
+ within-turn churn. *Confidence: medium — verify against Anthropic's scope
+ semantics.*
+2. **Reasoning empty-text filter diverges from OpenCode.** Dispatch drops any
+ `reasoning` part with `text === ""` (`agent.ts:356-360`); OpenCode keeps it
+ when a signature / `redactedData` is present (`transform.ts:144-150`).
+ Dropping a signed-but-empty thinking block changes assistant bytes and can
+ break thinking-signature round-trips. Minor vs. the primary issue.
+3. **5-minute ephemeral TTL** — generic: idle gaps > 5 min between turns force a
+ cold re-write regardless of the above.
+
+## Recommended fix direction (not yet implemented)
+
+Make the wire history **stable across steps** by emitting one `assistant` + one
+`tool` message **per step (per `tool-batch` chunk)** instead of collapsing the
+whole turn into one assistant message and one trailing tool message. Concretely,
+in `toModelMessages`, segment the assistant chunks at each `tool-batch` boundary
+and emit `[assistant(text, think, tool-calls), tool(results)]` per segment. This:
+
+- keeps the intended within-step grouping (parallel calls in one step → one tool
+ message), satisfying the existing "Root Cause 2" tests,
+- preserves earlier steps' block positions so the rolling cache accumulates,
+- removes the Pass-3 reshuffle trigger (no later-step text after a tool-call
+ within a single message),
+- and matches OpenCode's stable message layout.
+
+Add a regression test for a **3-step sequential** turn asserting the step-0 and
+step-1 message blocks are byte-identical between the step-2 and step-3 requests.
+
+## Key files
+
+| Area | Location |
+| --- | --- |
+| Turn accumulator (single assistant msg) | `packages/core/src/agent/agent.ts:786`, `:923-926` |
+| New tool-batch chunk per step | `packages/core/src/chunks/append.ts:111-124` |
+| Rebuild + group all results | `packages/core/src/agent/agent.ts:162-256` |
+| Pass-3 reshuffle (split) | `packages/core/src/agent/agent.ts:399-413` |
+| Breakpoints (faithful port) | `packages/core/src/agent/agent.ts:448-466` |
+| OpenCode reference `applyCaching` | `references/opencode/packages/opencode/src/provider/transform.ts:345-394` |
+| Beta headers | `packages/core/src/credentials/anthropic-betas.ts:13-20` |
+| Per-run session id | `packages/core/src/llm/provider.ts:92` |
+| OAuth body transform | `packages/core/src/llm/anthropic-oauth-transform.ts` |
+| Cache Rate panel + aggregation | `packages/frontend/src/lib/components/CacheRatePanel.svelte`, `packages/frontend/src/lib/tabs.svelte.ts:856-877` |
+| Caching tests (single-step only) | `packages/core/tests/agent/agent.test.ts:1034`, `:1078` |
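Reviewer sketch for `notes/cache-miss-report.md`: the "Recommended fix direction" section proposes emitting one `assistant` + one `tool` message per `tool-batch` chunk. The TypeScript below is a minimal illustration of that idea under simplified assumptions. The `Chunk`, `ToolCall`, and `WireMessage` shapes and the `segmentByToolBatch` function are hypothetical stand-ins, not dispatch's real types or its `toModelMessages`. It only shows that per-step segmentation leaves earlier steps' messages unchanged when a later step is appended.

```typescript
// Hypothetical, simplified shapes; dispatch's real Chunk and message types differ.
type ToolCall = { id: string; name: string; result?: string };
type Chunk =
  | { type: "text"; text: string }
  | { type: "thinking"; text: string }
  | { type: "tool-batch"; calls: ToolCall[] };

type WireMessage =
  | { role: "assistant"; parts: string[] }
  | { role: "tool"; results: string[] };

// Emit one assistant message (plus one tool message when results exist)
// per tool-batch chunk, so each step keeps its own position in the history.
function segmentByToolBatch(chunks: Chunk[]): WireMessage[] {
  const out: WireMessage[] = [];
  let parts: string[] = [];
  for (const chunk of chunks) {
    if (chunk.type !== "tool-batch") {
      parts.push(`${chunk.type}:${chunk.text}`);
      continue;
    }
    // A tool-batch closes the current step: flush text/thinking plus its calls.
    for (const call of chunk.calls) parts.push(`call:${call.id}`);
    out.push({ role: "assistant", parts });
    const results = chunk.calls
      .filter((call) => call.result !== undefined)
      .map((call) => `result:${call.id}`);
    if (results.length > 0) out.push({ role: "tool", results });
    parts = [];
  }
  // Trailing text/thinking with no tool call yet becomes a final assistant message.
  if (parts.length > 0) out.push({ role: "assistant", parts });
  return out;
}

// Two-step turn, mirroring the report's [text0, think0, batch0, text1, think1, batch1].
const step0Chunks: Chunk[] = [
  { type: "text", text: "t0" },
  { type: "thinking", text: "k0" },
  { type: "tool-batch", calls: [{ id: "A", name: "read", result: "ok" }] },
];
const step1Chunks: Chunk[] = [
  ...step0Chunks,
  { type: "text", text: "t1" },
  { type: "thinking", text: "k1" },
  { type: "tool-batch", calls: [{ id: "B", name: "read", result: "ok" }] },
];

const afterStep0 = segmentByToolBatch(step0Chunks);
const afterStep1 = segmentByToolBatch(step1Chunks);

// The messages produced for step 0 are an unchanged prefix of the later request.
const prefixStable =
  JSON.stringify(afterStep1.slice(0, afterStep0.length)) === JSON.stringify(afterStep0);
console.log(afterStep1.length, prefixStable);
```

A regression test along the lines the report suggests could use the same comparison on the real serialized request instead of these placeholder strings.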
diff --git a/notes/changes-report.md b/notes/changes-report.md
new file mode 100644
index 0000000..f30b611
--- /dev/null
+++ b/notes/changes-report.md
@@ -0,0 +1,54 @@
+# Changes Report
+
+## Overview
+This report reviews the uncommitted changes in the `dispatch` repository. The changes can be broadly categorized into two parts:
+1. **Codebase Updates**: The removal of the `todo` (task list) tool and a fix to a `summon` tool test.
+2. **Documentation Additions**: Several new untracked markdown documents regarding planning and incident reports.
+
+## Per-File Analysis
+
+### Untracked Files
+- **`claude-auth-report.md`** & **`tool-runner-duplication-incident.md`**: New post-mortem and incident investigation reports.
+- **`cyberdeck/credentials.md`**: Documentation regarding credentials.
+- **`skill-plan/` directory**: Multiple markdown files outlining a multi-step plan for building a new "skill" architecture (tool definitions, registration, routes, etc.).
+
+**Assessment**: Adding detailed markdown documents for incidents and project planning is a very good practice. They are cleanly separated from the application source code.
+
+### Modified Files (Staged & Unstaged)
+
+#### 1. `packages/core/src/tools/task-list.ts`
+- **Change**: Removed the `createTaskListTool` function and the Zod/ToolDefinition imports. The `TaskList` class remains.
+- **Correctness**: The removal of the factory is clean.
+
+#### 2. `packages/api/src/agent-manager.ts`
+- **Change**: Removed `createTaskListTool` imports and invocations. Removed `todo` from the `TOOL_DESCRIPTIONS` and the lengthy `TODO_GUIDANCE` string from the agent's system prompt.
+- **Correctness**: Consistently applies the removal of the `todo` tool. However, it still retains `tabAgent.taskList = new TaskList()`, which may now be dead state (see Issues section).
+
+#### 3. `packages/core/src/agents/loader.ts` & `packages/core/src/tools/summon.ts`
+- **Change**: Removed `"todo"` from the default tool arrays and documentation comments for child agents.
+- **Correctness**: Follows through with the removal of the `todo` tool across tool lists, ensuring child agents don't request a missing tool.
+
+#### 4. `packages/core/src/index.ts`
+- **Change**: Updated exports to remove `createTaskListTool`, while keeping `TaskList`.
+- **Correctness**: Correctly reflects the module changes.
+
+#### 5. Tests (`packages/api/tests/*.test.ts` & `packages/core/tests/agents/loader.test.ts`)
+- **Change**: Removed mocked `createTaskListTool` injections and updated assertions that previously verified the automatic injection of the `todo` tool.
+- **Correctness**: Properly aligns tests with the new implementation. Build/tests should pass cleanly.
+
+#### 6. `packages/core/tests/tools/summon.test.ts`
+- **Change**: Updated the "surfaces child errors when blocking" test to correctly anticipate the `agent_id: <id>` prefix before the child output. Added an excellent explanatory comment about why this prefix exists (for frontend `ToolCallDisplay` regex parsing).
+- **Correctness**: The change is highly robust. Explicitly asserting `.toContain()` before the final `.toBe()` provides great debugging signals if the test fails in the future. The comment is an excellent practice.
+
+## Correctness Assessment
+The changes successfully and cleanly remove the `todo` tool logic from the backend tools and system prompt. The test fix for `summon.test.ts` is technically sound and well-documented. The changes are internally consistent from a backend tooling perspective.
+
+## Issues & Concerns Found
+**Incomplete Refactoring (Dead Code)**:
+While the `todo` tool was completely removed, the `TaskList` class itself and its instantiation in `AgentManager` (`tabAgent.taskList = new TaskList()`) were kept. Because the agent no longer has a tool to interact with this list, it will always remain empty.
+Consequently, the UI component that relies on this (`TaskListPanel.svelte` in the `frontend` package) will now always render an empty state. This represents an incomplete feature removal/refactor.
+
+## Recommendations
+1. **Clean Up `TaskList`**: If the todo feature is permanently removed, you should also delete the `TaskList` class from `packages/core`, remove the `taskList` property from `tabAgent` in `packages/api`, and delete `TaskListPanel.svelte` (along with its imports in `SidebarPanel.svelte`) from the frontend.
+2. **Staging**: If the untracked markdown documents are ready, ensure they are intentionally added (`git add`) and committed.
+3. **Commit the Changes**: The code removal of the `todo` tool is safe to commit. I recommend summarizing it as "refactor: remove todo tool from agent capabilities".
\ No newline at end of file
diff --git a/notes/changes.md b/notes/changes.md
new file mode 100644
index 0000000..66389d9
--- /dev/null
+++ b/notes/changes.md
@@ -0,0 +1,133 @@
+# Changes
+
+## May 27, 2026
+
+### Chunk-Based Message Refactor (`ca6ee91`)
+
+Replaced the flat `content: string` + `thinking: string` message model with an ordered
+`chunks: Chunk[]` union that preserves actual temporal ordering of events from the model.
+
+**New chunk types:**
+
+| Type | Body | Emitted on |
+|------|------|-----------|
+| `text` | `text: string` | `text-delta` events, coalesced |
+| `thinking` | `text: string` | `reasoning-delta` events, coalesced |
+| `tool-batch` | `calls: Array<{id, name, arguments, result?, isError?, shellOutput?}>` | `tool-call` events, batched |
+| `error` | `message: string, statusCode?: number` | Error events |
+| `system` | `text: string, kind` | System notices (model-changed, config-reload, cancelled, rate-limit) |
+
+**Key design decisions:**
+- System events during active turn append inline to the assistant message's chunks
+- System events outside turns create/append `role: "system"` messages
+- `toCoreMessages` strips `error`/`system` chunks and `role: "system"` messages
+- `MessageRole` changed from `user | assistant | tool` → `user | assistant | system`
+- Tool calls/results embedded in `tool-batch` chunks, no separate `role: "tool"` messages
+
+**Files changed:**
+- `packages/core/src/types/index.ts` — `Chunk` union, `MessageRole` update
+- `packages/frontend/src/lib/types.ts` — mirrored types
+- `packages/core/src/chunks/append.ts` — `appendEventToChunks()` state machine + `applySystemEvent()` router
+- `packages/core/tests/chunks/append.test.ts` — 35 unit tests
+- `packages/core/src/db/index.ts` — removed `thinking` column from messages schema
+- `packages/core/src/db/messages.ts` — updated `appendMessage`/`getMessagesForTab`
+- `packages/core/src/agent/agent.ts` — single `chunks[]` accumulator, updated `toCoreMessages`
+- `packages/api/src/agent-manager.ts` — progressive persistence, system event routing
+- `packages/frontend/src/lib/tabs.svelte.ts` — unified `applyChunkEvent`, `openAgentTab` reads chunks
+- `packages/frontend/src/lib/components/ChatMessage.svelte` — per-type chunk renderers
+- `packages/frontend/src/lib/components/ToolCallDisplay.svelte` — prop updates
+
+**Database:** Messages and tabs tables dropped; settings, keys, credentials preserved.
+Backup at `~/.local/share/dispatch/dispatch.db.bak-20260527-181334`.
+
+---
+
+### Frontend Fixes
+
+#### Wire-format drift (`5261879`)
+
+`openAgentTab` expected `contentJson: string` on the wire but the API now returns
+`chunks: Chunk[]`. Fixed to read `m.chunks` directly with `Array.isArray` fallback.
+
+Also added diagnostic debug info to `copyConversation`: store state block
+(connection status, agentStatus, message counts) and per-message chunk summaries.
+
+#### structuredClone → $state.snapshot (`faeb8fe`)
+
+Svelte 5 `$state` proxies throw `DataCloneError` on native `structuredClone()`.
+Fixed by switching to `$state.snapshot()` in `applyChunkEvent` and `routeSystemEvent`.
+This was the root cause of `chunks=0` in production — every content event after
+placeholder creation silently failed.
+
+#### WS error swallowing (`faeb8fe`)
+
+`ws.svelte.ts` wrapped all callbacks in a single `try{} catch{}` that swallowed
+errors. Split into per-callback try/catch with `console.error`. Future bugs of
+this class are now diagnosable from the browser console.
+
+#### statuses reconnect handler (`faeb8fe`)
+
+Added `statuses` variant to `AgentEvent` union. On WS reconnect, handler syncs
+`agentStatus` for all tabs, detects desync (frontend thinks running, backend says
+idle/error), calls `reloadTabMessagesFromApi` to pull persisted chunks, and clears
+`currentAssistantId` and streaming flags.
+
+---
+
+### Model Routing Fix (`9ac04b9`)
+
+When a user selected a model (e.g., Gemini via configured key) but the corresponding
+API key environment variable was not set, `getOrCreateAgentForTab` in `packages/api/src/agent-manager.ts:610`
+set `useOverride = true` without updating `model` or `baseURL` from their defaults.
+The request silently went to `https://opencode.ai/zen/go/v1` with `model: "deepseek-v4-flash"`
+and no API key — OpenCode Go routed this to Claude, causing every model selection
+to respond with "I'm Claude, made by Anthropic."
+
+Fixed by setting `baseURL = key.base_url` and `model = effectiveModelId` in the
+missing-key branch so requests target the correct endpoint and produce a diagnosable
+auth error instead of a silent model-swap.
+
+---
+
+### Test Infrastructure Rewrite (`1e3f67e`)
+
+Replaced the POJO (plain-old-JavaScript-object) test harness in `packages/frontend/tests/chat-store.test.ts`
+with real `$state`-backed store instances via an exported `createTabStore()` factory
+and `handleEvent()` method.
+
+**What this catches that the old harness couldn't:**
+- Logic bugs in the actual `handleEvent` / `applyChunkEvent` / `routeSystemEvent` code
+- Drift between harness and production (now the same code)
+- Reactivity contract issues with real `$state` proxies
+
+**Known limitation:** The `structuredClone(svelteProxy)` bug cannot be reproduced in
+these tests because Bun's `structuredClone` (used by vitest) is more permissive than
+browser `structuredClone`. Catching that class of bug requires a browser-runtime test
+layer (Playwright, vitest browser mode).
+
+**Mocks:** `wsClient`, `config`, and `fetch` are mocked so module-load side effects
+(WebSocket connection, localStorage access, HTTP calls) don't interfere.
+
+**Files:**
+- `packages/frontend/src/lib/tabs.svelte.ts` — exported `createTabStore`, added `handleEvent`
+- `packages/frontend/tests/chat-store.test.ts` — 32 tests through the real reactive store
+
+---
+
+### Earlier: Read-System Fixes
+
+#### Path resolution (`da57842`)
+
+`read-file.ts`, `read-file-slice.ts`, `write-file.ts`, and `list-files.ts` used
+`resolve(join(workingDirectory, path))` which mangled absolute paths (e.g.,
+spill paths like `/tmp/dispatch/tool-results/...`). `join()` concatenates rather
+than short-circuiting on absolute segments.
+
+Fixed to use a shared `canonicalize()` helper in `packages/core/src/tools/path-utils.ts`
+that resolves via `realpath` and walks up to the nearest existing ancestor when the
+leaf doesn't exist (handles `write_file` creating new files through symlinked parent dirs).
+
+#### DEFAULT_LIMIT alignment
+
+Changed `DEFAULT_LIMIT` from 2000 to `MAX_LINES` (500) in `read-file.ts` so default
+reads don't always trigger truncator spills.
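Reviewer sketch for the "Path resolution" entry in `notes/changes.md`: the snippet below reproduces the `resolve(join(workingDirectory, path))` behaviour the note describes, using Node's `path.posix` so the result does not depend on the host platform. The working directory and the `out.txt` file name are made up for illustration. This shows only the `join` versus `resolve` difference. It does not attempt to reproduce the `canonicalize()` helper, which per the note also resolves through `realpath` and walks up to the nearest existing ancestor.

```typescript
import { posix } from "node:path";

// Illustrative values only; neither comes from the dispatch source.
const workingDirectory = "/home/user/project";
const spillPath = "/tmp/dispatch/tool-results/out.txt";

// join() concatenates segments, so the absolute spill path ends up nested
// under the working directory. The outer resolve() cannot undo that.
const mangled = posix.resolve(posix.join(workingDirectory, spillPath));

// resolve() processes segments right to left and stops at the first
// absolute one, so the spill path survives intact.
const kept = posix.resolve(workingDirectory, spillPath);

// Relative paths still resolve against the working directory.
const relative = posix.resolve(workingDirectory, "src/index.ts");

console.log(mangled);  // /home/user/project/tmp/dispatch/tool-results/out.txt
console.log(kept);     // /tmp/dispatch/tool-results/out.txt
console.log(relative); // /home/user/project/src/index.ts
```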
diff --git a/notes/claude-auth-report.md b/notes/claude-auth-report.md
new file mode 100644
index 0000000..db2c6e6
--- /dev/null
+++ b/notes/claude-auth-report.md
@@ -0,0 +1,282 @@
+# Claude Auth Report — "old tab 401s, new tab works"
+
+## Symptom
+
+A tab that has been open for a while starts failing **every** Claude request with:
+
+```
+Invalid authentication credentials
+status=401 authentication_error
+url=https://api.anthropic.com/v1/messages
+model=claude-opus-4-8 baseURL=https://api.anthropic.com/v1
+```
+
+Opening a **new tab** and selecting the same Claude key works fine. The old tab
+keeps 401-ing no matter what you send. Both tabs are configured against the same
+`anthropic` key (`claude-pro` / `claude-max`), so "they should be using the same
+auth" — but they are not.
+
+This is an **authentication-layer** problem, not an agent/tool-wiring/permission
+bug. The 401 comes back from Anthropic on the chat request itself.
+
+---
+
+## Decisive clue: new tab works, old tab doesn't
+
+This single fact rules out most hypotheses and points at exactly one mechanism:
+**the per-tab `Agent` instance is cached, and it freezes the access token it was
+constructed with.** A new tab builds a fresh `Agent`, re-resolves credentials
+from the DB, and gets a currently-valid token. The old tab never re-resolves.
+
+---
+
+## Root cause #1 (proximate, confirmed): the cached per-tab Agent freezes a stale access token
+
+### Where the token is captured
+
+Credentials are resolved **only when a new `Agent` is constructed**, inside
+`getOrCreateAgentForTab` in `packages/api/src/agent-manager.ts`:
+
+- The `anthropic` branch (lines ~579–636) resolves `claudeCredentials =
+ { accessToken: <token> }` and passes it into the `Agent` constructor
+ (lines ~686–700, `...(claudeCredentials ? { claudeCredentials } : {})`).
+- That token is stored on the Agent's `this.config.claudeCredentials` and never
+ updated for the life of the Agent.
+
+In `packages/core/src/agent/agent.ts`, every `run()` call rebuilds the provider
+from that frozen config:
+
+```ts
+const providerFactory = createProvider({
+ apiKey: this.config.apiKey,
+ baseURL: this.config.baseURL,
+ provider: this.config.provider,
+ claudeCredentials: this.config.claudeCredentials, // <-- frozen at construction
+});
+```
+
+And in `packages/core/src/llm/provider.ts`, `createClaudeOAuthProvider` sends it
+as the bearer token:
+
+```ts
+authToken: config.claudeCredentials?.accessToken ?? config.apiKey,
+```
+
+So the access token used on the wire is whatever was captured when the tab's
+Agent was first built. Calling `run()` again does **not** refresh it.
+
+### Why the cache is never invalidated on expiry
+
+`getOrCreateAgentForTab` (agent-manager.ts lines ~361–368) only discards the
+cached Agent when the key, model, or permissions change:
+
+```ts
+if (
+ tabAgent.agent &&
+ (effectiveKeyId !== tabAgent.keyId ||
+ effectiveModelId !== tabAgent.modelId ||
+ permKey !== tabAgent._lastPermKey)
+) {
+ tabAgent.agent = null;
+}
+```
+
+**Token expiry is not one of the conditions.** A tab that keeps the same key +
+model + permissions will reuse the same `Agent` — and therefore the same frozen
+token — indefinitely. The credential-refresh code below this gate only runs when
+`tabAgent.agent` is null, which for a stable tab never happens again.
+
+### Why this produces the exact symptom
+
+- **Old tab:** built its `Agent` earlier with token **A**. Time passes. Token A
+ is no longer accepted by Anthropic (it either reached its ~8h TTL, or a refresh
+ elsewhere — a new tab, the Model Status panel, model listing — rotated the
+ credential in the DB and **revoked A** server-side). The cached Agent still
+ holds A. Cache gate sees no key/model/perm change → Agent reused → every
+ request sends the dead token A → `401 authentication_error`.
+- **New tab:** no cached Agent → runs the resolution path → reads the **current**
+ token from the DB (and/or refreshes) → token **B** → works.
+
+Both tabs "use the same key," but the old tab is pinned to a now-invalid
+*instance* of that key's token. That is the whole discrepancy.
+
+> Note on OAuth token rotation: Anthropic's refresh flow rotates the refresh
+> token and can invalidate the previously issued access token. So the moment any
+> code path successfully refreshes the credential (updating the DB), every
+> already-constructed Agent still holding the old access token is now holding a
+> *revoked* token — not merely an expired one. This is why the old tab can fail
+> even when its captured token's local `expiresAt` hasn't elapsed yet.
+
+---
+
+## Root cause #2 (contributing, confirmed by reference): the OAuth refresh call is malformed
+
+Even when the resolution path *does* run, the refresh half of it is broken, so
+the system can't reliably self-heal an expired token.
+
+`packages/core/src/credentials/claude.ts`:
+
+```ts
+const OAUTH_TOKEN_URL = "https://claude.ai/v1/oauth/token"; // line 27 — wrong host
+
+async function refreshViaOAuth(refreshToken: string) {
+ const body = new URLSearchParams({ grant_type: "refresh_token", client_id: OAUTH_CLIENT_ID, refresh_token: refreshToken });
+ const response = await fetch(OAUTH_TOKEN_URL, {
+ method: "POST",
+ headers: { "Content-Type": "application/x-www-form-urlencoded" }, // wrong content-type
+ body: body.toString(), // form-encoded
+ });
+ if (!response.ok) return null; // failure is swallowed silently
+ ...
+}
+```
+
+Two concrete defects, both contradicted by the working reference implementation
+in `references/oh-my-pi/packages/ai/src/utils/oauth/anthropic.ts`:
+
+1. **Wrong endpoint host.** The token exchange/refresh endpoint lives on the API
+ host, not the web-app host:
+ ```ts
+ const TOKEN_URL = "https://api.anthropic.com/v1/oauth/token"; // reference, line 11
+ ```
+ The reference's test suite asserts exactly this URL
+ (`references/oh-my-pi/packages/ai/test/anthropic-oauth.test.ts`). `claude.ai`
+ is correct only for the *authorize* step (`https://claude.ai/oauth/authorize`).
+
+2. **Wrong body format.** Anthropic's `/v1/oauth/token` expects **JSON**, not
+ form-encoding. The reference posts:
+ ```ts
+ headers: { "Content-Type": "application/json", Accept: "application/json" },
+ body: JSON.stringify(body),
+ ```
+
+Because both are wrong, `refreshViaOAuth` effectively always returns `null`. When
+a token has genuinely expired, the manager's `anthropic` branch hits its final
+`else` and does the worst possible thing — proceeds with the **stale** token:
+
+```ts
+console.warn(`dispatch: unable to refresh Claude credentials for "${account.label}" — using stale token`);
+claudeCredentials = { accessToken: account.credentials.accessToken }; // expired
+...
+useOverride = true;
+```
+
+So the contributing failure mode is:
+- New tabs work **only as long as the DB token is still valid** (e.g. because
+ Claude Code refreshed it on disk and it was re-imported, or it simply hasn't
+ expired yet). Dispatch's *own* refresh cannot renew it.
+- Once the DB token expires with no external renewal, even new tabs will 401,
+ and the warning above appears in the server log.
+
+**Diagnostic check:** look in the server logs for
+`dispatch: unable to refresh Claude credentials for "..." — using stale token`.
+Its presence confirms the refresh path is failing and the stale-token fallback
+fired.
+
+> History: both the wrong URL and the form-encoded body were introduced in the
+> original commit `8151447` ("feat: claude max oauth support…"). They have never
+> been correct in this repo.
+
+---
+
+## Secondary suspect (unverified): the chat provider omits `anthropic-beta`
+
+`createClaudeOAuthProvider` (provider.ts lines ~76–84) sets only:
+
+```ts
+headers: {
+ "anthropic-dangerous-direct-browser-access": "true",
+ "x-app": "cli",
+ "user-agent": "claude-cli/2.1.112 (external, sdk-cli)",
+}
+```
+
+It does **not** send `anthropic-beta: …,oauth-2025-04-20,…` or
+`anthropic-version`. Every other path in this codebase and in the reference does:
+
+- `getAnthropicHeaders()` (claude.ts:387) builds the full set, including
+ `oauth-2025-04-20`, and is used by the models/profile/usage calls.
+- The reference's chat path always sends `anthropic-beta: ANTHROPIC_OAUTH_BETA`
+ with the OAuth bearer (`references/oh-my-pi/.../provider-models/openai-compat.ts`,
+ `.../providers/anthropic.ts:1542`). Anthropic generally requires the
+ `oauth-2025-04-20` beta for Bearer/OAuth requests.
+
+This is flagged **secondary** because chat works at all on a fresh token, which
+suggests `@ai-sdk/anthropic` may inject the oauth beta itself for `authToken`
+requests. **I could not verify this** — the workspace deps are not installed in
+this environment (`node_modules/@ai-sdk/anthropic` is absent) and outbound
+network is blocked, so the SDK could not be inspected and the request could not
+be reproduced live. If fixing #1 and #2 doesn't fully resolve things, make this
+provider reuse `getAnthropicHeaders()` so the chat path carries the same
+`anthropic-beta` / `anthropic-version` as every other Anthropic call.
+
+---
+
+## Minor, unrelated observation
+
+`agent.ts:836` gates adaptive thinking on a hardcoded model string:
+
+```ts
+const isOpus47 = this.config.model === "claude-opus-4-7";
+```
+
+The failing model is `claude-opus-4-8`, so it silently takes the non-adaptive
+`thinking: { type: "enabled", budgetTokens }` branch. Not related to the 401, but
+that literal will need updating for opus-4-8 to get adaptive thinking.
+
+---
+
+## Recommended fixes
+
+In priority order:
+
+### 1. Stop freezing the token in the cached Agent (fixes the new-vs-old-tab bug)
+
+Pick one of:
+
+- **Re-resolve credentials per `run()` for `anthropic` keys**, instead of caching
+ them in the Agent's config. e.g. have the Agent pull the access token through a
+ callback/getter at request time (`() => refreshAccountCredentials(account)`),
+ so each `run()` uses the live DB token. This is the cleanest fix.
+- **Or** invalidate the cached Agent when its captured token is near/after
+ expiry. Add an expiry check to the cache-invalidation gate in
+ `getOrCreateAgentForTab` (store the captured `expiresAt` on `tabAgent` and null
+ the agent when `Date.now() > expiresAt - 60_000`). Coarser, but localized.
+- **Or** when a refresh rotates the DB token, proactively null every cached
+ `tabAgent.agent` whose `keyId` matches, forcing re-resolution on next send.
+
+The getter approach is preferred because it also survives token rotation
+(root-cause note above) without any cache bookkeeping.
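+
+A minimal sketch of the getter approach, under stated assumptions: `TokenGetter`,
+`CachedAgent`, and the in-memory `db` object are hypothetical stand-ins, not names
+from this repo. It only illustrates that a cached agent holding a getter sees a
+rotated token, while one holding a captured string would not.
+
+```typescript
+// Hypothetical types for illustration; the real Agent config differs.
+type TokenGetter = () => Promise<string>;
+
+class CachedAgent {
+  private readonly getToken: TokenGetter;
+
+  constructor(getToken: TokenGetter) {
+    this.getToken = getToken;
+  }
+
+  // Resolve the token at request time, not at construction time.
+  async run(): Promise<string> {
+    const token = await this.getToken();
+    return `Bearer ${token}`;
+  }
+}
+
+// Stand-in for the credentials row that a refresh would rotate.
+const db = { accessToken: "token-v1" };
+const agent = new CachedAgent(async () => db.accessToken);
+
+async function demo(): Promise<string[]> {
+  const before = await agent.run();
+  db.accessToken = "token-v2"; // simulate a refresh rotating the DB token
+  const after = await agent.run();
+  return [before, after];
+}
+```
+
+In the real code the getter would presumably wrap `refreshAccountCredentials(account)`;
+the cached Agent itself never needs to be invalidated.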
+
+### 2. Fix the OAuth refresh call in `credentials/claude.ts`
+
+- `OAUTH_TOKEN_URL` → `https://api.anthropic.com/v1/oauth/token`.
+- In `refreshViaOAuth`, send JSON: `Content-Type: application/json`,
+ `Accept: application/json`, `body: JSON.stringify({ grant_type, client_id, refresh_token })`.
+- Don't swallow failures silently — log `response.status` and the body so the
+ next stale-token event is diagnosable. (Mirror the reference's `postJson`,
+ which throws with `status` + `body` included.)
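+
+The request shape above might look like the following sketch. `buildRefreshRequest`
+and `describeFailure` are hypothetical helpers, and the `grant_type` value
+`"refresh_token"` is assumed from standard OAuth 2.0 rather than read from this
+repo or the reference.
+
+```typescript
+// Hypothetical request shape; the field names follow the bullets above.
+interface RefreshRequest {
+  url: string;
+  init: { method: "POST"; headers: Record<string, string>; body: string };
+}
+
+const OAUTH_TOKEN_URL = "https://api.anthropic.com/v1/oauth/token";
+
+function buildRefreshRequest(clientId: string, refreshToken: string): RefreshRequest {
+  return {
+    url: OAUTH_TOKEN_URL,
+    init: {
+      method: "POST",
+      headers: { "Content-Type": "application/json", Accept: "application/json" },
+      // JSON body, not form-encoded.
+      body: JSON.stringify({
+        grant_type: "refresh_token", // assumption: standard OAuth 2.0 value
+        client_id: clientId,
+        refresh_token: refreshToken,
+      }),
+    },
+  };
+}
+
+// Surface status and body on failure instead of silently falling back.
+function describeFailure(status: number, body: string): Error {
+  return new Error(`OAuth refresh failed: status=${status} body=${body}`);
+}
+```
+
+The caller would pass `url` and `init` to `fetch` and throw `describeFailure(...)`
+when `response.ok` is false.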
+
+### 3. (If still needed) add `anthropic-beta` to the chat provider
+
+Have `createClaudeOAuthProvider` reuse `getAnthropicHeaders(token)` so the chat
+path carries `anthropic-beta` (incl. `oauth-2025-04-20`) and `anthropic-version`
+like every reference path and every other Anthropic call in this repo.
+
+---
+
+## What was verified vs. assumed
+
+- **Verified by reading the code:** the per-tab Agent cache, the cache-invalidation
+ gate that ignores token expiry, the frozen `claudeCredentials` flowing into the
+ provider, the wrong refresh URL + form-encoded body, the silent failure +
+ stale-token fallback, the missing `anthropic-beta` on the chat provider, the
+ `opus-4-7` hardcode.
+- **Verified against the reference** (`references/oh-my-pi`): correct token URL
+ (`api.anthropic.com/v1/oauth/token`), JSON body, and that the OAuth chat path
+ sends `anthropic-beta`.
+- **Could not run/reproduce:** deps aren't installed here and network is
+ sandboxed, so the live request, the SDK's default-header behavior, and the
+ exact DB token `expiresAt` values were not inspected. Root cause #1 fully
+ explains the reported new-vs-old-tab behavior on its own; #2 explains why the
+ system can't self-heal once the DB token lapses.
diff --git a/notes/claude-report.md b/notes/claude-report.md
new file mode 100644
index 0000000..6635fa4
--- /dev/null
+++ b/notes/claude-report.md
@@ -0,0 +1,114 @@
+# Cache Miss Analysis for Dispatch's Claude Code Integration
+
+## Executive Summary
+
+The massive token burn and 0% cache hit rate observed after upgrading to AI SDK v6 are the result of two interacting flaws in how the Dispatch harness constructs requests for Anthropic:
+
+1. **Missing Beta Header:** The Claude OAuth provider completely omits the required `anthropic-beta: prompt-caching-scope-2026-01-05` header. Without this specific beta header, the Anthropic API silently ignores all `cache_control` markers.
+2. **Suboptimal Breakpoint Placement:** The logic that assigns `cache_control` breakpoints misinterprets the structure of AI SDK v6 messages. Because each tool result is serialized as an independent `role: "tool"` message, the caching logic places markers on the last two *individual tool results* rather than the actual conversational turns, wasting breakpoints and failing to cache the preceding `assistant` message.
+
+The tool duplication incident (where tools were echoed 150+ times) is highly likely a symptom of the context window blowing up due to these cache misses, causing the model's generation loop to degenerate into repetitive tool hallucinations.
+
+---
+
+## Root Cause 1: The Missing Beta Header
+
+### Analysis
+Anthropic's prompt caching relies on explicit `cache_control` markers embedded in the request payload. However, for the Claude CLI ecosystem and OAuth flow, caching features are gated behind specific beta headers.
+
+In `packages/core/src/llm/provider.ts`, the `createClaudeOAuthProvider` factory initializes the AI SDK Anthropic provider. While it correctly mimics the `x-app` and `user-agent` headers of the Claude CLI, it **fails to include the `anthropic-beta` header**:
+
+```typescript
+function createClaudeOAuthProvider(config: ProviderConfig): ModelFactory {
+ const anthropic = createAnthropic({
+ // ...
+ headers: {
+ "anthropic-dangerous-direct-browser-access": "true",
+ "x-app": "cli",
+ "user-agent": "claude-cli/2.1.112 (external, sdk-cli)",
+ // MISSING: "anthropic-beta": getAnthropicBetas().join(",")
+ },
+ });
+}
+```
+
+The AI SDK's `createAnthropic` constructor (in `@ai-sdk/anthropic`) adds `anthropic-version: 2023-06-01` but does *not* automatically inject `prompt-caching-scope-2026-01-05`.
+
+Because this header is missing from the underlying fetch request, the Anthropic API treats the request as a standard (non-cached) `/messages` call and ignores the ephemeral cache breakpoints completely.
+
+---
+
+## Root Cause 2: Inefficient Caching Breakpoints
+
+### Analysis
+Even if the beta header were present, the current breakpoint strategy in `agent.ts` defeats the purpose of caching due to how the AI SDK v6 structures tool results.
+
+In `toModelMessages()`, the agent unpacks an internal `tool-batch` chunk by creating a separate AI SDK message for every single tool result:
+```typescript
+for (const tr of trailingToolResults) {
+ result.push({ role: "tool", content: [ { type: "tool-result", ... } ] });
+}
+```
+
+Immediately after, `applyAnthropicCaching()` attempts to apply rolling cache markers:
+```typescript
+const nonSystem = msgs.filter((m) => m.role !== "system").slice(-2);
+for (const m of nonSystem) targets.add(m);
+// applies cacheControl: { type: "ephemeral" }
+```
+
+If the assistant emitted a batch of 10 tool calls, `msgs` ends with 10 individual `role: "tool"` messages. The `.slice(-2)` logic grabs only the 9th and 10th tool result messages and applies `cacheControl` to them.
+
+When the AI SDK translates this to the Anthropic wire format (in `convert-to-anthropic-messages-prompt.ts`), it groups consecutive `tool` messages into a single `role: "user"` Anthropic block. The cache markers are applied strictly to the 9th and 10th tool result parts at the very end of this user block.
+
+**Why this fails:**
+1. **Wasted Breakpoints:** Anthropic only caches a prefix once it reaches a minimum length (1,024 tokens on most models), and a request may carry at most four breakpoints. Placing two of them on consecutive tool results at the end of the chain wastes one, because the token delta between them is microscopic.
+2. **Missing Assistant Context:** The `assistant` message (which contains the expensive reasoning tokens and the tool calls themselves) receives NO cache marker.
+
+### Comparison with `oh-my-pi`
+The `oh-my-pi` reference implementation avoids this by applying cache controls logically across the wire format:
+- `applyCacheControlToLastBlock(params.system)`
+- `applyCacheControlToLastBlock(params.tools)`
+- `penultimate user message`
+- `last user message`
+
+By targeting the end of distinct conversational phases (the system block, the tools definition, and the entire user response block), `oh-my-pi` maximizes the cached prefix length.
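+
+The user-message half of that placement can be sketched as follows. This is an illustration only, not `oh-my-pi`'s code: `WireMessage` and `markUserBreakpoints` are hypothetical names, the message shape is reduced to a role plus an optional marker, and the system and tools breakpoints are left out.
+
+```typescript
+// Hypothetical, reduced wire-format message: tool results are already
+// grouped into a single "user" block at this stage.
+type WireRole = "user" | "assistant";
+interface WireMessage {
+  role: WireRole;
+  cacheControl?: { type: "ephemeral" };
+}
+
+// Mark the penultimate and last user messages; return their indexes.
+function markUserBreakpoints(messages: WireMessage[]): number[] {
+  const userIndexes: number[] = [];
+  messages.forEach((m, i) => {
+    if (m.role === "user") userIndexes.push(i);
+  });
+  const targets = userIndexes.slice(-2);
+  for (const i of targets) {
+    const m = messages[i];
+    if (m) m.cacheControl = { type: "ephemeral" };
+  }
+  return targets;
+}
+```
+
+Because the markers land on whole user blocks, a batch of ten tool results consumes one breakpoint instead of two.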
+
+---
+
+## Tool Duplication Incident Correlation
+
+The documented incident describes 150+ duplicated `read_file` results for `package.json` with wildly uneven counts.
+
+This behavior is not indicative of the execution harness blindly retrying tools. In `agent.ts`, tool calls are harvested directly from the AI SDK stream (`event.type === "tool-call"`) and executed synchronously. If 150 identical `read_file` results were yielded, it is because **the LLM actively generated 150 identical `<tool_use>` blocks** in a single turn.
+
+This generation loop is a known failure mode for Claude when the context window becomes excessively bloated or loses coherence (often compounded by a lack of cached prefixes breaking the model's structural attention). Fixing the cache miss will likely resolve the generation instability.
+
+---
+
+## Recommendations
+
+Since this is a read-only analysis, no code has been edited. The following changes should be applied to fix the system:
+
+1. **Inject the Beta Headers:**
+ Modify `packages/core/src/llm/provider.ts` to import `getAnthropicBetas` from `../credentials/claude.js` and inject it into the Claude OAuth headers:
+ ```typescript
+ headers: {
+ "anthropic-beta": getAnthropicBetas().join(","),
+ "anthropic-dangerous-direct-browser-access": "true",
+ // ...
+ }
+ ```
+
+2. **Group Tool Results in `toModelMessages`:**
+ Instead of pushing an individual `role: "tool"` message for every single tool result, group all `trailingToolResults` into a single `content` array inside one `role: "tool"` message. This accurately represents the tool-batch as a single turn.
+
+3. **Revise the Breakpoint Strategy in `applyAnthropicCaching`:**
+ Instead of naively grabbing the last two elements of the array, specifically target:
+ - The first `system` message.
+ - The **last** `assistant` message in the array.
+ - The **last** `user` (or `tool`) message in the array.
+ This ensures that the entire prefix leading up to the most recent reasoning step is successfully cached.
+
+4. **Add a Tool Deduplication Safeguard (Optional but recommended):**
+ In `agent.ts`'s execution loop, hash the tool name and arguments. If an identical tool call appears in the exact same `stepToolCalls` batch, automatically copy the result from the first execution rather than re-running it or blowing up the LLM's next prompt with redundant shell outputs.
\ No newline at end of file
diff --git a/notes/context.md b/notes/context.md
new file mode 100644
index 0000000..896eed4
--- /dev/null
+++ b/notes/context.md
@@ -0,0 +1,169 @@
+# Dispatch — Phase 1 Implementation Context
+
+This file captures all decisions, open questions, and constraints established during planning. It serves as the source of truth for any agent working on Phase 1 implementation.
+
+---
+
+## Stack Decisions
+
+| Layer | Choice | Notes |
+|---|---|---|
+| Runtime | **Bun** | Runtime + package manager. Native SQLite, fast installs/execution |
+| Backend framework | **Hono.js** | Lightweight, WebSocket support |
+| Frontend framework | **Vite + Svelte + DaisyUI** | Strict TypeScript throughout |
+| Language | **TypeScript (strict mode)** | Both frontend and backend |
+| LLM SDK | **Vercel AI SDK (`ai`)** | Provider-agnostic, streaming, tool calling |
+| Default LLM | **DeepSeek V4 Flash Free** | Via OpenCode Go (Zen). Hardcoded for Phase 1 |
+| Database | **SQLite (no ORM)** | `bun:sqlite` (native), no Drizzle or other ORM |
+| Testing | **Vitest** | Set up across all packages from day one |
+| Linting/Formatting | **Biome** | Single tool for both linting and formatting. Confirmed compatible with OpenCode |
+| Package Manager | **Bun workspaces** | Monorepo with `@dispatch/*` packages |
+| Dev Server | **Separate ports + CORS** | Frontend `:5173`, backend `:3000`, explicit CORS |
+
+---
+
+## Resolved Decisions
+
+### Streaming Architecture: WebSocket Only
+All real-time communication flows over a single WebSocket connection:
+- Chat messages (user -> server)
+- Streaming LLM tokens (server -> client)
+- Tool call notifications and results (server -> client)
+- Agent status updates (server -> client)
+
+`POST /chat` is not a streaming endpoint — it just queues/sends a message. The WebSocket handles all real-time data. `GET /status` remains as a simple REST endpoint for polling agent state.
+
+### Database: SQLite Without ORM
+No Drizzle ORM. Use `bun:sqlite` directly with raw SQL (the Runtime decision below supersedes the earlier `better-sqlite3` suggestion). Keep it simple. The question of whether to set up persistence in Phase 1 or defer to Phase 5 is still open — the user indicated willingness to use SQLite but the timing wasn't finalized. **Default assumption: defer heavy persistence to Phase 5, but the library choice is locked in.**
+
+### Tool Scoping: Working Directory
+File tools (`read_file`, `write_file`, `list_files`) will be scoped to a configurable working directory from Phase 1. This prevents accidental filesystem damage and provides a clean foundation for the permission system in Phase 2.
+
+### Testing: Vitest From Day One
+Vitest set up across the monorepo. Unit tests for core logic (agent loop, tool registry, individual tools).
+
+### DaisyUI Theme: Configurable With Persistence
+Support multiple DaisyUI themes with a user-selectable theme switcher in settings. The selected theme is remembered across sessions (localStorage or equivalent).
+
+### Default LLM Provider: DeepSeek V4 Flash via OpenRouter
+The Phase 1 hardcoded model is DeepSeek V4 Flash, accessed through OpenRouter. This means:
+- The Vercel AI SDK OpenRouter provider will be used
+- A single `OPENROUTER_API_KEY` env var (or equivalent) is needed
+- Model ID will be the OpenRouter model string for DeepSeek V4 Flash
+
+---
+
+## Resolved Decisions (Previously Open Questions)
+
+### 1. Package Manager and Monorepo Tooling: Bun Workspaces
+**Decision:** Bun as both runtime and package manager, using Bun workspaces.
+
+Bun workspaces work similarly to pnpm workspaces — `packages/core`, `packages/api`, and `packages/frontend` reference each other as `@dispatch/*` dependencies, and Bun symlinks them locally.
+
+### 2. Runtime: Bun
+**Decision:** Bun is the runtime.
+
+Implications:
+- Native SQLite via `bun:sqlite` — no need for `better-sqlite3`
+- Bun is faster for installs, script execution, and testing
+- Vitest is the test runner (works with Bun)
+- No Node.js version to manage
+
+### 3. Dev Server Setup: Separate Ports + CORS
+**Decision:** Frontend on `:5173` (Vite dev server), backend on `:3000` (Hono). Explicit CORS configuration on the backend.
+
+This means:
+- Hono backend needs CORS middleware allowing the frontend origin
+- WebSocket connections go directly to `:3000` from the frontend
+- No Vite proxy configuration needed
+- Production will also be cross-origin (matches the split deployment model)
+
+### 4. Biome Compatibility: Confirmed
+**Decision:** Biome is fully compatible with OpenCode.
+
+OpenCode has Biome as a **built-in formatter**. It auto-detects `biome.json(c)` config files and handles `.js`, `.jsx`, `.ts`, `.tsx`, `.html`, `.css`, `.md`, `.json`, `.yaml`, and more. Just needs a `biome.json` in the project root and formatters enabled in OpenCode config (`"formatter": true` or `"formatter": {}`).
+
+---
+
+## Phase 1 Scope (from plan.md)
+
+**Goal:** Chat with one agent in a browser, watch it read and write files.
+
+### Backend Tasks
+- Project scaffolding (monorepo with `packages/core`, `packages/api`, `packages/frontend`)
+- Agent runtime: message -> LLM -> tool call -> result -> repeat loop
+- Vercel AI SDK integration with streaming responses
+- Single provider config (DeepSeek V4 Flash via OpenRouter, env var for API key)
+- Basic tools:
+ - `read_file` — read file contents (scoped to working directory)
+ - `write_file` — write/overwrite a file (scoped to working directory)
+ - `list_files` — glob/list directory contents (scoped to working directory)
+- HTTP API:
+ - `POST /chat` — send a message (non-streaming, queues to agent)
+ - `GET /status` — agent status (idle, running, etc.)
+- WebSocket: stream agent output tokens and tool calls in real-time
+
+### Frontend Tasks
+- Single chat panel — text input field, send button
+- Streamed response rendering (tokens appear as they arrive via WebSocket)
+- Tool call display (collapsible: show tool name, arguments, result)
+- Model/provider indicator in header
+- Basic layout: chat takes full screen, clean and minimal
+- Theme switcher in settings (DaisyUI themes, persisted to localStorage)
+
+### Done When
+Open a browser, type "read the contents of package.json and summarize it," see the agent call `read_file`, stream back a summary. Ask it to create a new file — it calls `write_file` and confirms.
+
+---
+
+## Project Structure (Planned)
+
+```
+dispatch/
+ packages/
+ core/ # Agent runtime, LLM integration, tools
+ src/
+ agent/ # Agent loop, lifecycle
+ llm/ # Vercel AI SDK wrapper, provider config
+ tools/ # Tool registry, built-in tools (read_file, write_file, list_files)
+ types/ # Shared TypeScript types
+ tests/
+ api/ # Hono HTTP + WebSocket server
+ src/
+ routes/ # HTTP route handlers
+ ws/ # WebSocket handlers
+ tests/
+ frontend/ # Vite + Svelte + DaisyUI client
+ src/
+ lib/ # Svelte components, stores, utilities
+ routes/ # Page routes (if using SvelteKit) or views
+ tests/
+ biome.json # Biome config
+ tsconfig.base.json # Shared TypeScript config
+ package.json # Root workspace config
+ .env.example # Environment variables template (OPENROUTER_API_KEY, etc.)
+```
+
+---
+
+## Concurrency Map (What Can Be Built in Parallel)
+
+Once scaffolding is complete, the following sections can be developed concurrently:
+
+```
+[A] Scaffolding (sequential — must be first)
+ |
+ +---> [B] Core Agent Runtime (agent loop, LLM, tool system)
+ |
+ +---> [C] API Server Shell (Hono setup, route stubs, WebSocket setup)
+ |
+ +---> [D] Frontend Shell (Svelte app, chat UI, WebSocket client, theme system)
+
+Then integration (sequential — depends on B, C, D):
+
+[E] Wire core into API (connect agent runtime to routes/WebSocket)
+[F] Wire frontend to API (connect UI to live WebSocket)
+[G] End-to-end testing and polish
+```
+
+Sections B, C, and D are independent and can be coded by concurrent agents. Section E requires B and C. Section F requires C and D. Section G requires everything.
diff --git a/notes/eviction-limitation.md b/notes/eviction-limitation.md
new file mode 100644
index 0000000..91d1a88
--- /dev/null
+++ b/notes/eviction-limitation.md
@@ -0,0 +1,105 @@
+# Known Limitation: Frontend eviction is whole-message, not per-chunk
+
+Status: **RESOLVED.** Fixed by the chunk-native frontend store (see
+`plan-chunk-eviction.md`). The frontend's source of truth for history is now a
+flat `ChunkRow[]` (`tab.chunks`, real per-tab `seq`); the live turn is a
+transient tail (`tab.live`) reconciled into the sealed log on a `turn-sealed`
+event. Eviction (`evictChunks`) is a rolling per-chunk trim of the oldest rows —
+so a single oversized turn is trimmed chunk-by-chunk instead of pinned whole.
+Pagination loads raw chunks (`GET /tabs/:id/chunks`), deduped by `seq`. The
+historical analysis below is retained for context.
+
+---
+
+Documented from the append-only chunk-log work (see `plan-chunk-log.md`). The
+backend was not the problem — this was purely a frontend in-memory concern.
+
+## TL;DR
+
+The append-only chunk log made **loading** per-chunk (you fetch the last N
+*chunks*, not N whole turns), but the frontend's in-memory **eviction** still
+drops whole *messages* (turns). A single pathological turn (e.g. the 150-tool-call
+incident — one assistant message holding ~150 chunks) therefore stays resident in
+browser memory in full until it scrolls out of the protected window. On a
+memory-constrained device, one giant turn can still blow the budget.
+
+So: the chunk log helps long *histories* (many normal turns), but does **not**
+yet help a single oversized *turn*.
+
+## What "eviction" is (for the record)
+
+`evictMessages` (`packages/frontend/src/lib/tabs.svelte.ts:345`) trims a tab's
+in-memory `messages` array when its total chunk count exceeds
+`tab.chunkLimit`, to bound browser RAM. It never deletes anything from the DB —
+the `chunks` table is the durable source of truth and evicted content is
+re-fetched on scroll-up via `loadMoreMessages`
+(`tabs.svelte.ts:402`). The active/streaming turn and the last user+assistant
+pair are pinned; eviction is suppressed while the user is scrolled up.
+
+## Root cause
+
+Eviction operates at message granularity and explicitly refuses to trim within a
+message. From `tabs.svelte.ts` (the `evictMessages` comment, ~`:360-372`):
+
+> We never trim chunks from WITHIN a message: messages are the
+> persistence/pagination unit ... Whole-message eviction from the front is the
+> correct granularity.
+
+That comment is now stale in spirit: persistence/pagination is per-**chunk**
+(the flat `chunks` table + `GET /messages` windowing by chunk `seq`), but the
+in-memory store still holds **grouped `ChatMessage[]`** as its source of truth,
+so the smallest evictable unit is a whole turn.
+
+```
+DB / wire: per-chunk ✅ (chunks table, chunk-seq pagination)
+Frontend load: per-chunk ✅ (GET /messages?limit=N windows the chunk log)
+Frontend evict: per-MESSAGE ❌ (tab.messages is grouped; a turn is atomic)
+```
+
+## Why it wasn't fixed in this pass
+
+True per-chunk eviction requires the frontend store's **source of truth to be the
+flat chunk list**, with `messages` derived for rendering (this was P5 in
+`plan-chunk-log.md`). That means:
+
+- store `tab.chunks: ChunkRow[]` (+ the in-flight live turn) instead of
+ `tab.messages: ChatMessage[]`;
+- rewrite every streaming handler (`applyChunkEvent`, `routeSystemEvent`, the
+ `done` / `status` / `statuses` / error paths) to mutate the flat list;
+- derive `messages` via `groupRowsToMessages` for the render layer.
+
+That touches ~10 handler sites in `tabs.svelte.ts` and the ~60 frontend tests in
+`chat-store.test.ts`, all of which currently assert against the grouped
+`tab.messages` model. It was deferred to keep the test suite green and the
+diff bounded. The hard part (flat storage + DB-free `explode`/`group` transforms
+in `packages/core/src/chunks/transform.ts`) is already done and reusable.
+
+## Fix options (later)
+
+1. **Flat-chunk store + derived messages (recommended, the "proper" fix).**
+ Make `tab.chunks: ChunkRow[]` the source of truth; derive `messages` with
+ `groupRowsToMessages` (already shared from
+ `@dispatch/core/src/chunks/transform.js`). Eviction then trims the flat array
+ at chunk granularity; pin the in-flight turn + the tail. Re-fetch on
+ scroll-up by chunk `seq` (already supported). Highest effort (handler +
+ test rewrite), but fully solves it and matches `plan-chunk-log.md` P5.
+
+2. **Partial trim of an oversized message (incremental, lower effort).**
+ Keep the grouped model, but when the *oldest* in-memory message alone exceeds
+ the limit, drop its leading chunks (whole `text`/`thinking`/`tool-batch`
+ units) and record a per-message `oldestChunkSeq` so `loadMoreMessages` can
+ re-hydrate it. Caveat: must keep the message renderable and merge correctly on
+ scroll-up (the `turnId` merge in `loadMoreMessages` already handles the
+ boundary case). A pragmatic stopgap.
+
+3. **Render virtualization (separate concern).**
+ Windowing the *DOM* (only mount visible bubbles) reduces render cost but not
+ the JS-heap cost of holding the chunks. Complementary to 1/2, not a
+ substitute.
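+
+Option 1's chunk-granular trim can be sketched as below. This is a sketch under
+stated assumptions, not the shipped `evictChunks`: `Row` is a reduced, hypothetical
+stand-in for `ChunkRow`, and `trimOldestChunks` and `pinnedTurnId` are illustrative
+names.
+
+```typescript
+// Hypothetical reduced row; the real ChunkRow carries more fields.
+interface Row {
+  seq: number;
+  turnId: string;
+}
+
+// Drop the oldest rows one at a time until the list fits `limit`,
+// stopping early if the next row belongs to the pinned (in-flight) turn.
+function trimOldestChunks(rows: Row[], limit: number, pinnedTurnId: string | null): Row[] {
+  let start = 0;
+  while (rows.length - start > limit && rows[start]?.turnId !== pinnedTurnId) {
+    start++;
+  }
+  return rows.slice(start);
+}
+```
+
+A single oversized turn is then trimmed row by row, unless it is the pinned turn,
+in which case it stays resident until it seals.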
+
+## Acceptance check for whichever fix
+
+Load a tab whose history contains one turn with ≫ `chunkLimit` chunks; confirm
+in-memory chunk count stays ≤ `chunkLimit` (± the pinned tail) while scrolled to
+the bottom, and that scrolling up re-hydrates the trimmed chunks of that same
+turn without duplication.
diff --git a/notes/gemini-chunk-eviction-review-2.md b/notes/gemini-chunk-eviction-review-2.md
new file mode 100644
index 0000000..a2a1ac3
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review-2.md
@@ -0,0 +1,95 @@
+# Code Review (Pass 2): Chunk-Native Frontend Eviction
+
+## Executive Summary
+
+The fixes applied since Pass 1 successfully address the two major Blockers related to reconciling active turns and optimistic UI state. The decoupling of reconcile logic from `agentStatus` in favor of `liveTurnId` elegantly solves the race conditions around deferred reconciliations and interrupt boundaries.
+
+However, a **NEW Blocker** was identified in how `turn_id`s are backfilled onto user messages when a turn starts. The loop indiscriminately tags pending `queued-` messages that belong to *future* turns, causing them to be wiped when the *current* turn finishes.
+
+**Verdict: DO NOT SHIP.** The new Blocker must be fixed. The fix is a trivial one-line condition.
+
+---
+
+## Pass 1 Fixes & Invariants
+
+* **Fix #1: Deferred reconcile wipes concurrent active turns.**
+ **VERDICT: FIXED.** `reloadChunksFromApi`'s new `preserveTurnId` logic perfectly handles overlapping turns. It correctly preserves Turn B's in-flight chunks while dropping Turn A's, and cleanly nullifies state when appropriate. The Map `pendingReconcileTabs` ensures any intermediate turns are fetched comprehensively from the DB.
+* **Fix #2: Optimistic queued user messages are dropped on reconcile.**
+ **VERDICT: FIXED (Partially).** The condition `(m.turnId === undefined && m.role === "user")` inside `keptLive` effectively retains optimistic user messages. However, see the New Blocker below regarding `turnId` assignment.
+* **Cache Safety Invariant.**
+ **VERDICT: SAFE.** `toModelMessages` and Anthropic normalization logic were completely untouched. The backend's prompt caching remains strictly bound to DB-persisted rows.
+* **`messages` -> `renderGroups` Rename Completeness.**
+ **VERDICT: COMPLETE.** No stray references to `tab.messages` were found. Reactivity safely bounds derived state within `updateTab`.
+
+---
+
+## NEW Findings
+
+### 1. `turn-start` wrongly tags future queued messages (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:873-882` (inside `handleEvent` case `turn-start`)
+
+**Description:**
+When a turn starts, the frontend loops backwards through `tab.live` to tag untagged user messages with the new `turnId`. It breaks only when it hits a non-user message or an already-tagged one.
+
+If multiple messages are queued (e.g. `[queued-B, queued-C]`), and the backend dequeues `B`, `message-consumed` removes the prefix and puts `B` at the end of the array: `[queued-C, B]`. Immediately after, `turn-start` for Turn B fires, looping backward. It tags `B` with `"Turn B"`, but continues and ALSO tags `queued-C` with `"Turn B"`. When Turn B subsequently seals, `queued-C` is wiped from the UI because its `turnId` matches the sealing turn, and it loses the protection of `turnId === undefined`. It vanishes until Turn C actually finishes processing.
+
+This same bug happens if the user queues a message in the tiny race-window between sending their first prompt and the WS `turn-start` event arriving.
+
+**Direction:** The backfill loop must explicitly ignore queued messages. Add a check to prevent tagging them: `if (m && m.role === "user" && m.turnId === undefined && !m.id.startsWith("queued-"))`.
+
+---
+
+## Evaluation of New Regression Tests
+
+The two new regression tests are highly valuable, but miss critical permutations:
+
+1. **"preserves an optimistic queued user message..."**
+ * **Genuinely covers:** Fix #2. It proves `keptLive` preserves an untagged queued message during a reconcile.
+ * **Untested edge case:** It queues the message *after* `turn-start` fires. Therefore, it completely bypasses the `turn-start` backfill loop. A test should queue a message *before* `turn-start` fires (or queue *two* messages and consume one) to expose the new Blocker above.
+2. **"preserves a concurrent newer turn..."**
+ * **Genuinely covers:** Fix #1. It correctly validates `preserveTurnId` logic, the `currentAssistantId` mapping, and `liveTurnId` integrity when a deferred reconcile flushes across a newer streaming turn.
+ * **Untested edge case (Map last-write-wins):** It only seals *one* turn while scrolled up. A test should seal *two* turns while scrolled up, then scroll down, ensuring `pendingReconcileTabs` handles the latest `turnId` and the entire window is correctly loaded.
+
+---
+
+## Resolution (OpenCode)
+
+### Block #1 (`turn-start` wrongly tags future queued messages) — FIXED
+`packages/frontend/src/lib/tabs.svelte.ts` `handleEvent` case `turn-start`.
+
+Backend reality check (not in Gemini's model): a queued message **never** gets its
+own `turn-start`. `turn-start` is emitted exactly once per user-initiated
+`processMessage` (`agent-manager.ts:1128`), which persists exactly one user row via
+`explodeUserText`. Queued messages are drained *into* the running turn through
+`dequeueMessages`/`message-consumed` (`agent.ts:1241,1308`). So the turn's initiator
+is always the single most-recent **non-queued** untagged user row, and any `queued-`
+row in the live tail belongs to a future turn.
+
+Gemini's literal one-liner (`&& !m.id.startsWith("queued-")` with the existing
+`else break`) is **insufficient**: when the queued row is the trailing element the
+loop would `break` on it and never tag the real initiator, leaving the initiator
+untagged → it duplicates against its own sealed chunk row on reconcile.
+
+Implemented fix: the backfill now (a) stops at the first non-user row, (b) **skips
+past** (`continue`) pending `queued-` rows, and (c) tags exactly the one most-recent
+non-queued untagged user row, then breaks. `keptLive` is unchanged — untagged
+`queued-` rows remain untagged and are preserved on reconcile.
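+
+A minimal sketch of the (a)/(b)/(c) loop described above. It is not the code in
+`tabs.svelte.ts`: `LiveRow` and `backfillTurnId` are hypothetical names, and
+stopping on an already-tagged user row is an assumption carried over from the
+original loop's description.
+
+```typescript
+// Hypothetical reduced shape of a row in the live tail.
+interface LiveRow {
+  id: string;
+  role: "user" | "assistant";
+  turnId?: string;
+}
+
+function backfillTurnId(live: LiveRow[], turnId: string): void {
+  for (let i = live.length - 1; i >= 0; i--) {
+    const m = live[i];
+    if (!m || m.role !== "user") break; // (a) stop at the first non-user row
+    if (m.id.startsWith("queued-")) continue; // (b) skip pending queued rows
+    if (m.turnId === undefined) m.turnId = turnId; // (c) tag the one initiator
+    break; // stop whether the row was just tagged or already tagged (assumed)
+  }
+}
+```
+
+With live = `[<plain initiator>, queued-q2]`, only the initiator is tagged, which
+is the case the new regression test exercises.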
+
+### Test gap — ADDRESSED
+Added `"turn-start backfill skips a pending queued row trailing the turn initiator"`
+in `chat-store.test.ts`: constructs live = `[<plain initiator>, queued-q2]`, fires
+`turn-start`, asserts the initiator is tagged and `queued-q2` is **not**, then seals
+and asserts `queued-q2` survives with no duplicate initiator bubble
+(`renderGroups` roles = `[user, assistant, user]`). This exercises the `continue`
+(skip-queued) branch and covers the queue-present-before-`turn-start` race Gemini
+flagged as untested.
+
+### Nits (deferred, non-blocking)
+- `pendingReconcileTabs` last-write-wins doc/test (seal two turns while scrolled up):
+ acceptable as-is — the deferred flush refetches the full window from the DB, so the
+ latest sealed `turnId` correctly supersedes; left as a follow-up test.
+
+### Verification
+Full suite: **326 tests pass** (core 223 incl. 33-test agent/cache-stability suite,
+api 35, frontend 68). Biome clean; `tsc` core+api and `svelte-check` report 0 errors.
+Cache invariant remains untouched (no `agent.ts` wire/folding/caching changes).
\ No newline at end of file
diff --git a/notes/gemini-chunk-eviction-review-3.md b/notes/gemini-chunk-eviction-review-3.md
new file mode 100644
index 0000000..3637946
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review-3.md
@@ -0,0 +1,118 @@
+# Code Review (Pass 3): Verify the `turn-start` backfill Block fix
+
+## Executive Summary
+
+The fix to the `turn-start` backfill loop successfully prevents pending `queued-` messages from being wiped by correctly skipping them. However, it exposes a critical flaw in how **consumed** messages are handled.
+
+When a queued message is consumed (`message-consumed`), its `queued-` prefix is stripped, leaving it as an untagged, plain user row in `live`. Because it lacks a `turnId`, it survives `reconcileSealedTurn` and lingers in the UI forever, floating to the bottom and duplicating the `[USER INTERRUPT]` text already present in the sealed chunks. Additionally, in multi-client scenarios with no local initiator, the backfill loop will incorrectly tag this lingering consumed message with a future turn's ID.
+
+**Verdict: DO NOT SHIP.** A new Blocker must be fixed. The fix is a one-line change in the `message-consumed` handler to bind consumed messages to the active turn.
+
+---
+
+## Detailed Findings
+
+### Q1. Backend Claims Verification
+**Confirmed.** The backend code in `packages/api/src/agent-manager.ts` and `packages/core/src/agent/agent.ts` confirms that:
+1. `turn-start` is emitted exactly once per `processMessage`.
+2. Exactly one user row draft is persisted via `explodeUserText`.
+3. Queued messages are pulled into a running turn via `dequeueMessages` and NEVER trigger a `turn-start`.
+
+### Q2. Block Fix Correctness
+The new backfill logic correctly skips `queued-` rows and tags exactly one initiator when a local initiator exists.
+* **(a) `[initiator, queued-X]`**: Correctly skips `queued-X`, tags `initiator`, and breaks.
+* **(b) `[queued-X, initiator]`**: Logically impossible; the initiator is appended before a queue can form.
+* **(c) `[queued-X, queued-Y]` consumed**: When `queued-X` is consumed, its prefix is stripped. It becomes an untagged user message. When the NEXT turn starts, if there is a local initiator, it tags the new initiator and ignores the consumed message.
+* **(d) Multi-client (no local initiator)**: If Client B starts a turn, Client A receives `turn-start` but has no local initiator in `live`. Client A's backfill loop will find the untagged consumed message from the *previous* turn and incorrectly tag it with the *new* turn's ID.
+
+### Q3. No Duplication / No Loss
+**Failed.** Because the backfill loop `break`s after tagging the new initiator, the consumed message from the previous turn remains permanently untagged (`turnId === undefined`). It survives the `turn-sealed` reconcile process and floats to the bottom of the live tail. Since the backend injects the consumed message's text into the `[USER INTERRUPT]` chunk row, the user sees BOTH the tool result chunk text AND a lingering user bubble.
+
+### Q4. `keptLive` Unchanged — Still Correct?
+**Failed.** `keptLive` preserves optimistic rows by checking `m.turnId === undefined && m.role === "user"`. This was intended for pending initiators and queues, but it inadvertently preserves consumed messages because they never received a `turnId`. Consumed messages MUST be bound to the turn that consumed them so they are cleanly dropped when that turn seals.
+
+### Q5. Test Adequacy
+**The new test simulates an impossible backend sequence.**
+The test creates `q1`, consumes it, and then fires `turn-start`, assuming `turn-start` applies to the consumed `q1`. The backend never emits a `turn-start` for a consumed message. By forcing the backfill loop to tag a consumed message, the test validates behavior that, in a real session, would be cross-turn ID theft.
+
+### Q6. Prompt-Caching Invariant
+**SAFE.** The backend logic in `packages/core/src/agent/agent.ts` was untouched. Cache stability is preserved.
+
+---
+
+## NEW Blockers
+
+### Block #1: Consumed messages linger and duplicate due to missing `turnId`
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:1226` (inside `handleEvent` case `"message-consumed"`)
+
+**Description:**
+When `message-consumed` extracts a queued message, it strips the prefix but leaves `turnId` undefined. The message survives `turn-sealed` and lingers in `live` forever, duplicating the chunk data. In multi-client scenarios, it also acts as a trap for the next `turn-start` backfill.
+
+**Direction:**
+In `message-consumed`, bind the consumed message to the actively streaming turn so `reconcileSealedTurn` drops it cleanly:
+```typescript
+if (mcEvent.messageIds.includes(queuedId)) {
+ consumed.push({
+ ...m,
+ id: queuedId,
+ turnId: mcTab.liveTurnId ?? undefined // Bind to active turn
+ });
+ continue;
+}
+```
+With this fix, the `turn-start` loop's `break` behavior is perfectly safe, as consumed messages will no longer be "untagged".
+
+---
+
+## Resolution (OpenCode)
+
+### Block #1 (consumed interrupt messages linger + duplicate) — FIXED
+`packages/frontend/src/lib/tabs.svelte.ts`, `handleEvent` case `"message-consumed"`.
+
+Verified the finding against the code: `keptLive` (in `reloadChunksFromApi`) preserves
+every `m.turnId === undefined && m.role === "user"` row, and `message-consumed` was
+pushing `{ ...m, id: queuedId }` with **no** `turnId`. So a consumed interrupt bubble
+was kept on every reconcile — lingering at the tail and duplicating the
+`[USER INTERRUPT]` text the backend folds into the sealed tool-result chunk
+(`agent.ts:1248-1255`). This contradicted the intended "collapse to persisted shape"
+behavior.
+
+Applied Gemini's direction (with the codebase's conditional-spread style): bind the
+consumed row to the in-flight turn so reconcile drops it on seal —
+```ts
+consumed.push({
+ ...m,
+ id: queuedId,
+ ...(mcTab.liveTurnId !== null ? { turnId: mcTab.liveTurnId } : {}),
+});
+```
+`liveTurnId` is always set while a turn runs (the only time a consume happens). The
+ChatPanel keyer (`${turnId}:${role}:${n}`, per-(turn,role) counter) gives the consumed
+row a distinct key from the initiator, so no key collision during the live interrupt
+split. This also makes the Pass-2 `turn-start` backfill `break` fully safe (no
+untagged consumed rows remain to be mis-tagged in the multi-client path Q2(d)).
+
+### Test fixes (addressing Q5)
+The Pass-2 test was rewritten — Gemini correctly flagged that consuming a message and
+then firing `turn-start` at it is not a real backend sequence. Replaced with two
+realistic tests in `chat-store.test.ts`:
+1. `"turn-start backfill skips a pending queued row (race), tags only the initiator"` —
+ drives the **real** race via `store.sendMessage`: an idle send (plain optimistic
+ row) followed by a running send (queued row), then a late `turn-start`. Asserts the
+ initiator is tagged, the queued row is NOT, and the queued row survives the seal
+ with no duplicate initiator.
+2. `"a consumed interrupt message collapses into the sealed turn (no lingering bubble)"`
+ — realistic interrupt flow (turn-start → stream → queue → consume → stream → seal);
+ asserts the consumed row is bound to the turn and dropped on reconcile. Fails on the
+ pre-fix code (row lingers untagged), passes after.
+
+### Verdict response
+- Q1 backend claims: confirmed by Gemini and re-confirmed here.
+- Q2(d) multi-client mis-tag: eliminated — consumed rows are no longer untagged.
+- Q3/Q4 lingering+duplication: FIXED.
+- Q6 cache invariant: SAFE — change is frontend store/test only; `agent.ts` untouched.
+- Q7: no new Blockers introduced.
+
+### Verification
+Frontend **69 tests pass** (chat-store 54 incl. the two new tests, sidebar 15);
+`svelte-check` 0 errors; Biome clean. Core (223) + api (35) unaffected → 327 total.
\ No newline at end of file
diff --git a/notes/gemini-chunk-eviction-review.md b/notes/gemini-chunk-eviction-review.md
new file mode 100644
index 0000000..373917a
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review.md
@@ -0,0 +1,69 @@
+# Code Review: Chunk-Native Frontend Eviction (Dispatch)
+
+## Executive Summary
+
+The rewrite to a chunk-native frontend store successfully resolves the unbounded memory pinning caused by oversized turns. The migration from grouped messages to a flat chunk log (`tab.chunks`) as the source of truth, with a decoupled `renderGroups` cache, correctly achieves rolling chunk-level eviction and seamless pagination.
+
+However, while the backend invariants and cache safety are perfectly maintained, there are two **Blocker** regressions in the frontend's reconcile logic relating to the handling of concurrent/queued messages and deferred UI updates. `reloadChunksFromApi` assumes it is strictly processing a single sequential turn and acts destructively on in-flight UI state when edge cases overlap.
+
+**Verdict: DO NOT SHIP.** The cache invariant holds, but the Blocks must be fixed first.
+
+---
+
+## Cache Safety Invariant
+
+**VERDICT: SAFE.**
+The diff contains **zero** modifications to `packages/core/src/agent/agent.ts`. The prompt-cache cohesion logic, `toModelMessages`, `applyAnthropicCaching`, and normalisation remain 100% server-side and entirely isolated from the frontend's transient render representations. Model-bound bytes are unchanged.
+
+---
+
+## Findings
+
+### 1. Deferred reconcile wipes concurrent active turns (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:658` (`reconcileSealedTurn`) and `:636` (`reloadChunksFromApi`)
+
+**Description:**
+If the user scrolls up, automatic eviction and reconciliation are suppressed. If Turn A finishes while scrolled up, its reconciliation is deferred (`pendingReconcileTabs.add`). If the agent then starts Turn B (e.g. via queue processor), it establishes a new live streaming bubble (`liveTurnId = 'Turn B'`).
+When the user subsequently scrolls down, the deferred flush fires for Turn A, calling `reloadChunksFromApi`. This function blindly sets `live: []`, `liveTurnId: null`, and `currentAssistantId: null` — immediately deleting all in-flight chunks for the *currently active* Turn B. Turn B will then spawn a fresh disconnected live bubble on its next stream delta, permanently losing its prior chunks and its `turnId` tag (causing it to remount/flash on seal).
+
+**Direction:** `reloadChunksFromApi` must not unconditionally nuke `live` and `currentAssistantId` if a *new* turn is actively in-flight (`agentStatus === "running" && liveTurnId !== turnIdToReconcile`). The reconcile must be turn-aware.
+
+### 2. Optimistic queued user messages are dropped on reconcile (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:636` (`reloadChunksFromApi`)
+
+**Description:**
+When a user sends a message while the agent is busy, it is added to `tab.live` as an optimistic unsealed bubble. When the current active turn completes and emits `turn-sealed`, `reloadChunksFromApi` wipes the entire `live` array (`live: []`). The queued user message vanishes from the UI entirely. While it remains safely in `queuedMessages` and will eventually be processed by the backend, the UI will not reflect it again until its own eventual `turn-sealed` event completes the backend loop.
+
+**Direction:** `reloadChunksFromApi` must preserve unsealed optimistic user messages (e.g. those matching `queuedMessages` IDs) when clearing `live`.
+
+### 3. Edge-case fetch race on WS reconnect desync (Ship-with-followup)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:860` (`handleEvent` case `statuses`)
+
+**Description:**
+The backend emits `status: idle` immediately before making the synchronous SQLite `flushAssistant` write, and emits `turn-sealed` immediately after. If a WS reconnect lands exactly in the window between those two emits, the frontend's `hydrateFromBackend` sees `backendStatus !== "running"` and triggers `reloadChunksFromApi`. Because the DB write hasn't landed, it loads a chunk window missing the just-finished turn.
+
+**Direction:** This is non-fatal because the backend will still emit `turn-sealed` milliseconds later, which triggers a second `reloadChunksFromApi` that corrects the UI state. It will manifest as a sub-second flicker. Safe to ship, but ideally, `statuses` should infer completion from the presence of unpersisted chunks rather than the raw agent status.
+
+### 4. Tool-batches undercount the live eviction budget (Nit)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:76` (`countLiveChunks`)
+
+**Description:**
+In `countLiveChunks`, a single `tool-batch` render bubble counts as `1` against the live budget (`m.chunks.length`). However, upon sealing, `explodeTurn` expands each batch into `N * 2` rows (a `tool_call` and `tool_result` for each parallel call). This means an in-flight turn with heavily parallelized tool execution will temporarily consume more memory than `chunkLimit` targets.
+
+**Direction:** Minor inconsistency. Once the turn seals, the correct DB row count will be respected. Acceptable trade-off for simplicity in the live rendering path.
+
+### 5. `trimLiveChunks` drops the prompt under extreme pressure (Nit)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:431` (`trimLiveChunks`)
+
+**Description:**
+If a single streaming turn heavily exceeds `chunkLimit`, `trimLiveChunks` enforces a strict rolling window on `live`. Since it trims oldest-first indiscriminately, it will `shift()` away the user's prompt chunk from the UI entirely before it touches the assistant's streaming chunks.
+
+**Direction:** This technically aligns with the design ("never the chunk currently being streamed"), but means the user's initiating context disappears mid-stream under memory pressure. Expected behavior for a raw chunk limit.
+
+---
+
+## Notable Correctness Risks Validated
+
+- **Seq Cursor Logic:** Perfect. `oldestLoadedSeq` rigidly derives from `ChunkRow.seq` via `minSeqOf(sealed)`. Live transient state is securely walled off from pagination logic.
+- **Transaction Wrapper:** Sound. Wrapping the `appendChunks` batch in `db.transaction()` makes the write atomic and correctly yields monotonic `seq` allocations per turn.
+- **Scroll-up pagination:** Deduplication works flawlessly. `mergeChunksBySeq` efficiently handles overlap at the pagination boundary without duplicating message bubbles.
diff --git a/notes/gemini-chunk-log-review.md b/notes/gemini-chunk-log-review.md
new file mode 100644
index 0000000..e04f6c3
--- /dev/null
+++ b/notes/gemini-chunk-log-review.md
@@ -0,0 +1,90 @@
+# Review: Append-Only Chunk-Log Refactor
+
+## Executive Summary
+
+The refactor successfully transitions the codebase from a "message-as-container" model to a flat, append-only **chunk log**. This is a major structural improvement that enables granular pagination and addresses the primary root cause of Anthropic prompt-cache churn by segmenting multi-step turns into stable message pairs.
+
+**Status: Partially Correct.**
+- **Goal 1 (Cache Fix):** **Achieved for happy-path turns**, but **broken for turns involving user interrupts**. The planned "New model" for interrupts (append-only user chunks) was not implemented; the legacy mutation-based stripping remains in `agent.ts`, which continues to bust the cache prefix across steps when an interrupt occurs.
+- **Goal 2 (Flat Storage/Pagination):** **Fully Achieved.** The explode/group transforms are robust, lossless, and handle window boundaries correctly via `turnId` merging in the frontend.
+
+---
+
+## Findings & Questions
+
+### 1. Cache Stability
+**Severity: Should-Fix**
+The implementation of `toModelMessages` (`agent.ts:162`) correctly segments assistant turns into stable `[assistant, tool]` pairs per step. However, the **interrupt logic** (`agent.ts:241`) reintroduces instability.
+- In Step 1 of a turn, an interrupt is marked "freshest" and included in the Step 1 `tool` message.
+- In Step 2, that same Step 1 `tool` message is now "stale" and has the interrupt stripped.
+- **Result:** The serialized content of Step 1 changes between Request 1 and Request 2, shattering the Anthropic cache prefix for everything following the Step 1 text.
+- *Reference:* `packages/core/src/agent/agent.ts:241-247` and `packages/core/src/agent/agent.ts:80-86`.
+
+### 2. Explode/Group Fidelity
+**Severity: Verified Correct**
+The round-trip between `Chunk[]` and `ChunkRow[]` is lossless.
+- `explodeTurn` correctly splits `tool-batch` into paired `tool_call` and `tool_result` rows.
+- `groupRowsToMessages` correctly reconstructs turns using `turnId` and `step`, and gracefully handles orphan `tool_result` rows by creating synthetic entries in the batch. This is vital for pagination.
+- *Reference:* `packages/core/src/chunks/transform.ts`.
+
+### 3. Step Derivation
+**Severity: Verified Correct**
+The assumption that a `tool-batch` marks the end of an LLM step is consistent with the `Agent.run()` loop. The `step` increment in `explodeTurn` (`transform.ts:89`) and the segmentation in `toModelMessages` (`agent.ts:251`) are in sync.
+
+### 4. Migration Safety
+**Severity: Verified Correct**
+The migration in `db/index.ts` correctly detects the legacy `messages` table and performs a one-shot nuke of messages/tabs.
+- It is safe for repeat runs (idempotent check on `sqlite_master`).
+- Tests are safe as they mock `getDatabase()` and use in-memory fakes.
+- *Reference:* `packages/core/src/db/index.ts:106-118`.
+
+### 5. Persistence Correctness
+**Severity: Verified Correct**
+The "Write-on-seal" strategy is correctly implemented.
+- `processMessage` accumulates chunks in memory and flushes exactly once via `flushAssistant()` when the turn settles.
+- The fallback-retry path correctly avoids calling `flushAssistant()` for failed attempts, preventing partial/duplicate turns in the log.
+- *Reference:* `packages/api/src/agent-manager.ts:1081-1085` and `:1168`.
+
+### 6. Rebuild Correctness
+**Severity: Nit**
+`getMessagesForTab` fetches the *entire* chunk log for a tab to rebuild the Agent's in-memory history. For very long conversations (thousands of chunks), this will cause increasing latency and memory pressure whenever an Agent is reconstructed (e.g., on model switch).
+- *Reference:* `packages/core/src/db/chunks.ts:121`.
+
+### 7. Pagination Correctness
+**Severity: Verified Correct**
+Frontend pagination correctly handles turns split across the 50-chunk window.
+- `loadMoreMessages` in `tabs.svelte.ts` detects `turnId` + `role` matches at the boundary and merges the chunks.
+- This ensures that scrolling up prepends the "head" of a turn to its already-loaded "tail" seamlessly.
+- *Reference:* `packages/frontend/src/lib/tabs.svelte.ts:445-467`.
+
+### 8. Interrupt Handling
+**Severity: Blocker (for Caching Goal)**
+The refactor failed to implement the "New model" described in `plan-chunk-log.md` (Section 4).
+- The plan called for interrupts to be appended as their own `user/text` chunks, making history immutable.
+- The implementation instead kept the legacy `[USER INTERRUPT]` string injection and the unstable `stripUserInterruptBlock` logic.
+- This preserves the cache-churn bug for any session involving interrupts.
+- *Reference:* `packages/core/src/agent/agent.ts:162-256`.
+
+### 9. Anthropic Wire Validity
+**Severity: Verified Correct**
+- `applyAnthropicStructuralNormalisations` correctly handles Anthropic's strict requirements, including splitting assistant messages if tool-calls are followed by text (Pass 3) and scrubbing tool IDs (Pass 2).
+- Empty reasoning blocks are stripped to avoid API errors while maintaining the signature in the DB for future turns.
+
+---
+
+## Verified Correct
+The following components were reviewed and found to be implementation-perfect:
+- **`packages/core/src/chunks/transform.ts`**: Pure logic for flattening and re-grouping.
+- **`packages/core/src/db/chunks.ts`**: Monotonic `seq` allocation and pagination queries.
+- **`packages/api/src/routes/tabs.ts`**: Cursor-based history endpoint.
+- **`packages/core/src/agent/agent.ts`**: `applyAnthropicCaching` breakpoint placement.
+
+---
+
+## Recommendations
+
+1. **Fix Interrupt Immutability (High Priority):** Follow the original plan: remove `USER_INTERRUPT_MARKER` injection into tool results. Instead, when an interrupt is dequeued in `Agent.run()`, append it as a new `user` role chunk to the log *after* the current step's tool results. Remove the stripping logic from `toModelMessages`. This makes the prefix 100% stable.
+2. **Explicit Transactions in `appendChunks` (Nit):** Wrap the loop in `appendChunks` (`db/chunks.ts:40`) in an explicit `db.transaction()` to ensure atomicity and improve performance for large turns.
+3. **Optimize History Rebuild (Nit):** Consider limiting the history rebuild in `getOrCreateAgentForTab` to the last N turns or the last M chunks, rather than the entire history, to bound startup time for long-lived tabs.
+4. **Tab-level Lock in `processMessage` (Nit):** Add a primitive lock or a "running" check at the start of `processMessage` to prevent concurrent execution on the same tab, which could otherwise corrupt the `tabAgent` state or the chunk log if two turns are interleaved.
+5. **Update `eviction-limitation.md`**: The document accurately reflects that eviction is still whole-message; this remains a valid technical debt item.
diff --git a/notes/harness-comparison.md b/notes/harness-comparison.md
new file mode 100644
index 0000000..1fd7aec
--- /dev/null
+++ b/notes/harness-comparison.md
@@ -0,0 +1,373 @@
+# Open-Source AI Agent Harness Comparison
+
+This report evaluates 20+ open-source AI agent frameworks against the Dispatch requirements. Frameworks are rated on a 15-point checklist covering architecture, configuration, tooling, integration, and session management. Each requirement scores 1 (fully supported), 0.5 (partial), or 0 (not supported), for a maximum of 15.
+
+Frameworks that are archived, in maintenance mode, or clearly unsuitable (GPT-Engineer, Mentat, Sweep, Bolt.diy, SuperAGI, Semantic Kernel) are excluded from the main comparison but noted in the appendix.
+
+---
+
+## The Critical Gap: No Framework Has a Three-Layer Hierarchy
+
+The single most distinctive Dispatch requirement -- a three-layer **dispatch -> orchestrator -> subagent** architecture -- does not exist out of the box in any open-source framework evaluated. Every framework is either:
+
+- **Single-agent** (Aider, Crush, Plandex, Pi)
+- **Two-layer** (Claude Code, Goose, Cline, Agency Swarm, CrewAI)
+- **Flat pool** (AutoGen, MetaGPT, CAMEL)
+- **DAG-based** (LangGraph, ChatDev 2.0) -- the closest to true hierarchy via subgraph nesting
+
+This means **any choice involves building the hierarchical orchestration layer yourself**. The question becomes: which framework gives you the best foundation to build on?
+
+---
+
+## Top Contenders
+
+### Tier 1: Highest Feature Alignment (87% match)
+
+#### Goose (Block / AAIF) -- 13/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | Partial | Main agent -> subagents (2 layers). Cannot nest further. |
+| 2. Config-driven orchestrators | **Full** | YAML recipes define subagent behavior, extensions, parameters |
+| 3. Parallel subagent execution | **Full** | Native parallel subagent support via trigger keywords |
+| 4. Strict hierarchy communication | **Full** | Subagents cannot spawn further subagents or manage extensions |
+| 5. User-to-agent messaging | **Full** | Continuous sessions, real-time subagent visibility |
+| 6. Conflict prevention | Partial | Process isolation; no explicit file-scope assignment |
+| 7. Role-scoped tooling | **Full** | Per-subagent extension sets via recipes |
+| 8. Skills system | **Full** | `~/.agents/skills/` and `.agents/skills/` with SKILL.md |
+| 9. LSP integration | **None** | No LSP support |
+| 10. Shell + directory perms | **Full** | Permission modes (auto/approve/smart_approve), allowlists |
+| 11. Session management | **Full** | Start, resume, search, model switching |
+| 12. HITL checkpoints | **Full** | Permission modes, per-action approval |
+| 13. State persistence | **Full** | Session persistence across restarts |
+| 14. Provider-agnostic LLM | **Full** | 15+ providers |
+| 15. Multiple interfaces | **Full** | Desktop app, CLI, API (ACP server) |
+
+**Language**: Rust + TypeScript. **Stars**: 45.5k. **Status**: Very active, moved to Linux Foundation AAIF (Apr 2026). Apache 2.0 license.
+
+**Strengths**: Closest to Dispatch's architecture out of the box. Config-driven recipes, parallel subagents, strict hierarchy enforcement, skills system, permission model, and multi-interface support all map directly to requirements. The MCP extension ecosystem (70+ extensions) provides broad tool coverage.
+
+**Weaknesses**: Only 2 layers of hierarchy: subagents cannot spawn sub-subagents, so there is no recursive nesting. No LSP. Written in Rust, which makes deep architectural modification harder than Python/TypeScript.
+
+**Extensibility verdict**: Adding a third layer would mean building an orchestrator abstraction that manages multiple Goose agent instances, each of which manages its own subagents. The recipe system could potentially be extended to define orchestrator types.
+
+---
+
+#### Cline -- 13/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | Partial | Coordinator -> specialist agents (2 layers via SDK) |
+| 2. Config-driven orchestrators | Partial | Code-driven SDK; CLI flags provide some config |
+| 3. Parallel subagent execution | **Full** | Kanban enables parallel agents with separate worktrees |
+| 4. Strict hierarchy communication | Partial | Coordinator pattern implies parent-mediated, not enforced |
+| 5. User-to-agent messaging | **Full** | Interactive CLI, `ask_question` tool |
+| 6. Conflict prevention | Partial | Kanban uses Git worktrees for isolation |
+| 7. Role-scoped tooling | **Full** | Per-agent tool sets via plugin system |
+| 8. Skills system | **Full** | `.agents/skills/` with SKILL.md files |
+| 9. LSP integration | **Full** | VS Code extension integrates with editor LSP |
+| 10. Shell + directory perms | **Full** | Command permission allow/deny lists |
+| 11. Session management | **Full** | History, persistence, model override per run |
+| 12. HITL checkpoints | **Full** | Plan/Act modes, per-action approval, auto-approve toggle |
+| 13. State persistence | **Full** | Sessions persist across restarts, snapshot/restore |
+| 14. Provider-agnostic LLM | **Full** | 200+ models via OpenRouter, plus direct providers |
+| 15. Multiple interfaces | **Full** | CLI, VS Code, JetBrains, Kanban web, SDK |
+
+**Language**: TypeScript. **Stars**: 62k. **Status**: Very active (v3.0.7, May 2026). Apache 2.0 license.
+
+**Strengths**: The only fully open-source Tier 1 framework with real LSP integration (through VS Code). Broadest interface coverage (IDE extensions, CLI, web Kanban, SDK). Git worktree isolation in Kanban is a creative approach to conflict prevention. The SDK (`@cline/core`, `@cline/agents`, `@cline/llms`) is well-layered and embeddable.
+
+**Weaknesses**: Orchestration is code-driven, not config-driven -- no YAML orchestrator definitions. LSP integration is tied to the VS Code extension host; unclear if it works outside IDE context. Hierarchy enforcement is not strict. TypeScript-only.
+
+**Extensibility verdict**: The layered SDK architecture is a strong foundation. You could build the dispatch and orchestrator layers on top of `@cline/core` and `@cline/agents`. The plugin system (`AgentPlugin`) provides lifecycle hooks. However, adding config-driven orchestrator definitions would require custom work.
+
+---
+
+#### Claude Code (Anthropic) -- 13/15 (NOT fully open source)
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | Partial | Main agent -> subagents (2 layers). Agent teams add peer coordination |
+| 2. Config-driven orchestrators | Partial | Subagents defined via YAML frontmatter in .md files |
+| 3. Parallel subagent execution | **Full** | Multiple concurrent subagents + agent teams |
+| 4. Strict hierarchy communication | **Full** | Subagents report to parent only |
+| 5. User-to-agent messaging | **Full** | Shift+Down for in-process subagent messaging |
+| 6. Conflict prevention | Partial | Git worktrees, file locking in agent teams |
+| 7. Role-scoped tooling | **Full** | `tools` and `disallowedTools` in subagent YAML frontmatter |
+| 8. Skills system | **Full** | Full Agent Skills standard, YAML frontmatter, multi-scope |
+| 9. LSP integration | **Full** | Through IDE integrations (VS Code, JetBrains) |
+| 10. Shell + directory perms | **Full** | allow/ask/deny rules, wildcard matching, sandboxing |
+| 11. Session management | **Full** | Resume, fork (`/fork`), model switch (`/model`), persistence |
+| 12. HITL checkpoints | **Full** | Permission modes, plan mode, hooks for custom approval |
+| 13. State persistence | **Full** | Full session persistence in `~/.claude/projects/` |
+| 14. Provider-agnostic LLM | Partial | Primarily Claude; Bedrock/Vertex/Azure as backends |
+| 15. Multiple interfaces | **Full** | CLI, VS Code, JetBrains, Desktop, Web, Slack, SDKs |
+
+**Language**: Proprietary core (distributed as npm binary); Shell/Python/TypeScript for installer, plugins, examples. **Stars**: 125k. **Status**: Very active. **License**: Partially open source -- core engine is proprietary.
+
+**Strengths**: Highest overall feature coverage. Best permission system. Best skills system. Chat forking built-in. Agent SDK available in Python and TypeScript. Hooks system (PreToolUse, PostToolUse, SubagentStart/Stop) provides excellent lifecycle control.
+
+**Weaknesses**: **Core engine is proprietary.** You cannot fork or modify the agent loop. Primarily Claude-only for models. Building on top means depending on Anthropic's closed binary. This is a dealbreaker if full ownership of the codebase is required.
+
+**Extensibility verdict**: If you're comfortable depending on a proprietary core, Claude Code's Agent SDK is probably the fastest path to a Dispatch-like system. But you'd be building on a dependency you cannot modify or audit internally.
+
+---
+
+### Tier 2: Strong Architectural Foundation (53-60% match)
+
+#### LangGraph -- 9/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | **Full** | Arbitrary subgraph nesting with private state schemas |
+| 2. Config-driven orchestrators | **None** | Purely code-defined (Python) |
+| 3. Parallel subagent execution | **Full** | Parallel edges, `Send()` for map-reduce fan-out |
+| 4. Strict hierarchy communication | **Full** | Subgraph state isolation via separate schemas |
+| 5. User-to-agent messaging | **Full** | `interrupt()` / `Command(resume=...)` anywhere |
+| 6. Conflict prevention | **None** | No file-scope mechanisms |
+| 7. Role-scoped tooling | **Full** | Per-node tool sets |
+| 8. Skills system | **None** | No skills/instruction injection system |
+| 9. LSP integration | **None** | No LSP |
+| 10. Shell + directory perms | **None** | No built-in shell or permissions |
+| 11. Session management | Partial | Checkpointer history, time travel, `update_state()` |
+| 12. HITL checkpoints | **Full** | `interrupt()`, static breakpoints, approval patterns |
+| 13. State persistence | **Full** | Multiple checkpointers (SQLite, Postgres, CosmosDB) |
+| 14. Provider-agnostic LLM | **Full** | All LangChain providers + standalone |
+| 15. Multiple interfaces | Partial | Python API, LangSmith Studio, LangGraph API |
+
+**Language**: Python. **Stars**: 32.4k. **Status**: Very active (v1.2.0, May 2026).
+
+**Key insight**: LangGraph is the **only framework that natively supports arbitrary hierarchy depth** via subgraph nesting. This is the single most important Dispatch requirement. The trade-off: it has zero application-layer features (no skills, no shell, no LSP, no session management in the user-facing sense). It's a low-level orchestration runtime, not an end-user tool.
+
+**Extensibility verdict**: LangGraph provides the best orchestration primitives but you'd need to build everything else on top -- the CLI, the skills system, the shell with permissions, the LSP integration, session management UI. It's essentially a graph execution engine, not an agent harness.
+
+---
+
+#### CrewAI -- 9/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | Partial | Manager -> agents (2 layers); Flows chain crews |
+| 2. Config-driven orchestrators | **Full** | YAML `agents.yaml`, `tasks.yaml` |
+| 3. Parallel subagent execution | **Full** | `async_execution=True` on tasks |
+| 4. Strict hierarchy communication | Partial | Manager delegates but `allow_delegation` enables P2P |
+| 5. User-to-agent messaging | Partial | `@human_feedback` at configured points only |
+| 6. Conflict prevention | **None** | No file-scope mechanisms |
+| 7. Role-scoped tooling | **Full** | Per-agent tools, task tool overrides |
+| 8. Skills system | Partial | Agent templates, prompt customization |
+| 9. LSP integration | **None** | No LSP |
+| 10. Shell + directory perms | **None** | Code execution deprecated |
+| 11. Session management | Partial | Flow persist/fork via `@persist` |
+| 12. HITL checkpoints | **Full** | `@human_feedback`, `human_input=True` |
+| 13. State persistence | **Full** | `@persist` with SQLite |
+| 14. Provider-agnostic LLM | **Full** | Many providers via LiteLLM |
+| 15. Multiple interfaces | Partial | CLI + Python API |
+
+**Language**: Python. **Stars**: 51.7k. **Status**: Very active, backed by CrewAI Inc.
+
+**Extensibility verdict**: Best config-driven agent/task definitions. The YAML approach maps well to Dispatch's config-driven orchestrators. However, no shell access, no LSP, no skills directory system. Better suited for general automation than coding-specific workflows.
+
+---
+
+#### ChatDev 2.0 -- 9/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | Partial | DAG + subgraph nesting, not strict tree |
+| 2. Config-driven orchestrators | **Full** | Full YAML workflow definitions |
+| 3. Parallel subagent execution | **Full** | Map/Tree modes with `max_parallel` |
+| 4. Strict hierarchy communication | Partial | Edge-routed but no parent-only enforcement |
+| 5. User-to-agent messaging | Partial | Human nodes in DAG at predefined points |
+| 6. Conflict prevention | **None** | No file-scope mechanisms |
+| 7. Role-scoped tooling | **Full** | Per-node tooling in YAML |
+| 8. Skills system | Partial | `.agents/skills` directory exists |
+| 9. LSP integration | **None** | No LSP |
+| 10. Shell + directory perms | **None** | No permission system |
+| 11. Session management | Partial | Context snapshots, no forking/resume |
+| 12. HITL checkpoints | **Full** | Human nodes + edge conditions |
+| 13. State persistence | Partial | Artifacts persist, no execution recovery |
+| 14. Provider-agnostic LLM | **Full** | Per-node provider config |
+| 15. Multiple interfaces | **Full** | Web UI, CLI, HTTP API, Python SDK |
+
+**Language**: Python + Vue.js. **Stars**: 33.1k. **Status**: Active (v2.2.0, Mar 2026). Very new (released Jan 2026).
+
+**Extensibility verdict**: Best zero-code workflow definition system. YAML DAGs with subgraphs, parallel execution, and multiple interfaces. The main risk is maturity -- ChatDev 2.0 is only months old and documentation is still evolving.
+
+---
+
+#### OpenHands -- 8.5/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | **None** | Single Conversation -> Agent -> Tools pipeline |
+| 2. Config-driven orchestrators | Partial | SDK is code-driven, config template exists |
+| 3. Parallel subagent execution | **None** | One agent step at a time per conversation |
+| 4. Strict hierarchy communication | **None** | No hierarchy enforced |
+| 5. User-to-agent messaging | **Full** | `send_message()` at any time via WebSocket |
+| 6. Conflict prevention | **None** | No scope assignment |
+| 7. Role-scoped tooling | **Full** | Per-agent tool sets via typed Action/Observation |
+| 8. Skills system | **Full** | Three skill types, YAML frontmatter, MCP integration |
+| 9. LSP integration | **None** | No LSP |
+| 10. Shell + directory perms | Partial | Risk-based security (LOW/MEDIUM/HIGH), not directory-based |
+| 11. Session management | Partial | Persistence + resume, no forking or model switching |
+| 12. HITL checkpoints | **Full** | ConfirmationPolicy with configurable thresholds |
+| 13. State persistence | **Full** | Auto-save, resume, incremental events |
+| 14. Provider-agnostic LLM | **Full** | 100+ providers via LiteLLM |
+| 15. Multiple interfaces | **Full** | CLI, React GUI, Cloud, Enterprise, SDK, REST API |
+
+**Language**: Python + TypeScript. **Stars**: 74.1k. **Status**: Very active (v1.7.0, May 2026). MIT license.
+
+**Extensibility verdict**: The four-package SDK architecture (`openhands.sdk`, `openhands.tools`, `openhands.workspace`, `openhands.agent_server`) is clean and composable. The event-driven design provides good extension points. However, there is no native multi-agent hierarchy -- you'd build the orchestration layer from scratch using the SDK primitives. Best choice if you want a battle-tested single-agent SDK with excellent security and skills, and are willing to build hierarchy on top.
+
+---
+
+#### Crush (OpenCode successor) -- 8/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | **None** | Single-agent with `agent` tool for sub-tasks |
+| 2. Config-driven orchestrators | **None** | No orchestrator concept |
+| 3. Parallel subagent execution | **None** | Sub-agent tool is sequential |
+| 4. Strict hierarchy communication | Partial | Agent tool returns results; no P2P |
+| 5. User-to-agent messaging | **None** | User types at main session only |
+| 6. Conflict prevention | **None** | No scope assignment |
+| 7. Role-scoped tooling | **Full** | Different agents can have different tool sets via config |
+| 8. Skills system | **Full** | Agent Skills standard, reads `.crush/`, `.claude/`, `.agents/` |
+| 9. LSP integration | **Full** | Built-in LSP with configurable language servers |
+| 10. Shell + directory perms | Partial | `allowed_tools` allowlist, no directory scoping |
+| 11. Session management | **Full** | Save/load/switch, model switching, SQLite persistence |
+| 12. HITL checkpoints | **None** | No checkpoint system |
+| 13. State persistence | **Full** | SQLite-based session persistence |
+| 14. Provider-agnostic LLM | **Full** | 15+ providers |
+| 15. Multiple interfaces | **Full** | Interactive TUI, CLI, scripting |
+
+**Language**: Go. **Stars**: 24.4k. **Status**: Very active (v0.70.0, May 2026). MIT license.
+
+**Key insight**: Crush is the only CLI-native framework with built-in LSP integration for compiler diagnostics. If LSP is a hard requirement, Crush is one of only two options (the other being Cline's VS Code extension). However, it has no multi-agent architecture at all.
+
+---
+
+#### Pi.dev -- 6.5/15
+
+| Requirement | Rating | Notes |
+|---|---|---|
+| 1. Three-layer hierarchy | **None** | Single-agent. Subagent extension is a bash demo. |
+| 2. Config-driven orchestrators | **None** | No orchestrator concept |
+| 3. Parallel subagent execution | **None** | Subagent extension demo: 8 tasks, 4 concurrent, via bash |
+| 4. Strict hierarchy communication | **None** | No agent communication framework |
+| 5. User-to-agent messaging | **Full** | Built-in steer/follow-up message queuing |
+| 6. Conflict prevention | **None** | "YOLO mode" by design |
+| 7. Role-scoped tooling | Partial | `--tools` flag restricts globally, no role system |
+| 8. Skills system | Partial | Agent Skills standard, but no `default/agents/project/` dirs |
+| 9. LSP integration | **None** | No LSP |
+| 10. Shell + directory perms | **None** | Full unrestricted access by design |
+| 11. Session management | **Full** | Tree-structured, fork, clone, resume, model switch |
+| 12. HITL checkpoints | Partial | `tool_call` event can block; no built-in checkpoint system |
+| 13. State persistence | **Full** | JSONL auto-save, sessions survive restarts |
+| 14. Provider-agnostic LLM | **Full** | 15+ providers, cross-provider context handoff |
+| 15. Multiple interfaces | **Full** | TUI, CLI, JSON, RPC, SDK, web UI package |
+
+**Language**: TypeScript. **Stars**: 51.4k. **Status**: Very active (v0.75.3, May 2026). MIT license.
+
+**Key insight**: Pi has the best session management (tree-structured branching with fork/clone/resume) and the most powerful extension system (30+ lifecycle events, full tool registration, UI components). The `@earendil-works/pi-ai` and `@earendil-works/pi-agent-core` packages are clean, well-documented TypeScript libraries. However, the maintainer explicitly rejects multi-agent patterns ("sub-agents are an anti-pattern"). Building Dispatch on Pi means fighting its philosophy.
+
+**Best use**: Harvest `pi-ai` (LLM abstraction) and `pi-agent-core` (agent runtime) as libraries in a custom system, rather than extending the Pi CLI.
+
+---
+
+### Eliminated Frameworks
+
+| Framework | Score | Reason for Elimination |
+|---|---|---|
+| AutoGen (AG2) | 8.5/15 | **Maintenance mode.** Microsoft recommends migrating to Agent Framework. No new features. |
+| Agency Swarm | 8/15 | 2-layer only, code-defined, no LSP/shell/sessions. Active but smaller community (4.4k stars). |
+| CAMEL | 6.5/15 | Research-oriented. Sequential workforce. No production features (sessions, perms). |
+| Plandex | 6/15 | Single-agent. Strong for plan-then-execute but no hierarchy or multi-agent. |
+| MetaGPT | 5/15 | Flat role pool, sequential, broadcast communication. Team pivoting to commercial MGX. |
+| Aider | 5/15 | Single-agent pair programmer. No hierarchy, no subagents, no persistence. |
+| Semantic Kernel | 4.5/15 | SDK-only, no interfaces, experimental orchestration. Being superseded by MS Agent Framework. |
+| SuperAGI | 5.5/15 | Stale (v0.0.11). Single-agent. No hierarchy. |
+| SWE-agent | 5/15 | Maintenance mode, superseded by mini-SWE-agent. Academic benchmarking tool. |
+| GPT-Engineer | N/A | Archived April 2026. |
+| Mentat | N/A | Archived January 2025. |
+| Sweep | N/A | Pivoted to JetBrains plugin. |
+| Bolt.diy | N/A | Web app builder, not an agent harness. |
+| Continue | N/A | Pivoted to CI/CD checks product. |
+
+---
+
+## Summary: Ratings at a Glance
+
+| Rank | Framework | Score | % | Language | Stars | Key Strength | Key Gap |
+|---|---|---|---|---|---|---|---|
+| 1 | **Goose** | 13/15 | 87% | Rust/TS | 45.5k | Closest architecture match (recipes, subagents, permissions) | No LSP, no 3rd layer |
+| 1 | **Cline** | 13/15 | 87% | TypeScript | 62k | Only framework with LSP + parallel agents + SDK | Config-driven orchestration is weak |
+| 1 | **Claude Code** | 13/15 | 87% | Proprietary | 125k | Most complete feature set, best permissions/skills | **Not fully open source** |
+| 4 | **LangGraph** | 9/15 | 60% | Python | 32.4k | Only native arbitrary-depth hierarchy | Zero application features (no shell, skills, UI) |
+| 4 | **CrewAI** | 9/15 | 60% | Python | 51.7k | Best config-driven agents (YAML) | No shell, no LSP, no skills dirs |
+| 4 | **ChatDev 2.0** | 9/15 | 60% | Python/Vue | 33.1k | Best zero-code YAML workflows | Very new (Jan 2026), immature |
+| 7 | **OpenHands** | 8.5/15 | 57% | Python/TS | 74.1k | Best SDK architecture, 100+ LLM providers | No hierarchy, no parallel agents |
+| 8 | **Crush** | 8/15 | 53% | Go | 24.4k | Only CLI with built-in LSP | No multi-agent anything |
+| 9 | **Pi.dev** | 6.5/15 | 43% | TypeScript | 51.4k | Best session management + extension system | Anti-multi-agent philosophy |
+
+---
+
+## Recommendation: Build vs. Extend
+
+No existing framework is a drop-in match. The choice depends on which gaps you're most willing to fill:
+
+### Option A: Extend Goose
+**Best if**: You want the most features out of the box and are comfortable with Rust/TypeScript.
+- **Already have**: Subagents, parallel execution, config recipes, skills, permissions, session management, multi-interface
+- **Must build**: Third orchestrator layer, LSP integration, custom directory permissions, skills directory restructuring
+- **Risk**: Rust codebase makes deep architectural changes harder. Goose's subagent model may resist being generalized into a full orchestrator pattern.
+
+### Option B: Build on Cline SDK
+**Best if**: You want LSP and a well-layered TypeScript SDK.
+- **Already have**: LSP, parallel agents (Kanban), skills, permissions, sessions, plugins, IDE integration
+- **Must build**: Config-driven orchestrator definitions, third dispatch layer, strict hierarchy enforcement
+- **Risk**: Cline is IDE-first; extracting the SDK for standalone use may have rough edges.
+
+### Option C: Build on LangGraph
+**Best if**: You prioritize getting the hierarchy right and are comfortable building everything else.
+- **Already have**: Arbitrary hierarchy depth, parallel execution, interrupts, state persistence, provider support
+- **Must build**: Skills system, shell + permissions, LSP, session management UI, CLI/TUI, config-driven orchestrator definitions
+- **Risk**: Enormous amount of application-layer work. LangGraph is an execution engine, not an end-user tool.
+
+### Option D: Build from Scratch, Harvest Libraries
+**Best if**: You want full architectural control and no framework fights.
+- **Harvest from Pi.dev**: `@earendil-works/pi-ai` (LLM abstraction), `@earendil-works/pi-agent-core` (agent runtime)
+- **Harvest from Cline**: `@cline/llms` (provider gateway), `@cline/agents` (stateless agent loop)
+- **Harvest from Crush**: LSP integration patterns (Go)
+- **Must build**: Everything else -- dispatch layer, orchestrator management, hierarchy enforcement, skills system, permissions, session management
+- **Risk**: Highest initial effort, but cleanest architecture alignment.
+
+---
+
+## Sources
+
+- [Goose GitHub](https://github.com/aaif-goose/goose) -- Architecture, subagents, skills, security docs
+- [Goose Documentation](https://goose-docs.ai/) -- Subagents, recipes, config, sessions, security guides
+- [Cline GitHub](https://github.com/cline/cline) -- SDK, multi-agent, Kanban, plugins
+- [Cline Documentation](https://docs.cline.bot/) -- SDK architecture, tools, CLI, building agents
+- [Claude Code GitHub](https://github.com/anthropics/claude-code) -- Installer, plugins, examples
+- [Claude Code Documentation](https://code.claude.com/docs/en/overview) -- Subagents, skills, permissions, hooks, agent teams, SDK
+- [LangGraph GitHub](https://github.com/langchain-ai/langgraph) -- Graph API, subgraphs, interrupts, persistence
+- [LangGraph Documentation](https://docs.langchain.com/oss/python/langgraph/) -- Overview, subgraphs, interrupts, persistence
+- [CrewAI GitHub](https://github.com/crewAIInc/crewAI) -- YAML config, agents, tasks, flows
+- [CrewAI Documentation](https://docs.crewai.com/) -- Agents, tasks, flows, processes, memory
+- [ChatDev GitHub](https://github.com/OpenBMB/ChatDev) -- Workflow authoring guide, YAML definitions
+- [OpenHands GitHub](https://github.com/OpenHands/OpenHands) -- SDK overview
+- [OpenHands SDK Docs](https://docs.openhands.dev/sdk) -- Architecture, agent, conversation, LLM, skills, security
+- [Crush GitHub](https://github.com/charmbracelet/crush) -- LSP, sessions, skills, providers
+- [Pi.dev GitHub](https://github.com/earendil-works/pi) -- Extensions, skills, SDK, sessions
+- [Pi.dev Extensions Docs](https://github.com/earendil-works/pi/blob/main/packages/coding-agent/docs/extensions.md) -- Event lifecycle, tool registration
+- [Pi.dev Blog](https://mariozechner.at/posts/2025-11-30-pi-coding-agent/) -- Design philosophy
+- [Agency Swarm GitHub](https://github.com/VRSEN/agency-swarm) -- Communication flows, agents
+- [Agency Swarm Docs](https://agency-swarm.ai/) -- Agencies, agents, running
+- [AutoGen GitHub](https://github.com/microsoft/autogen) -- Maintenance mode notice, teams, HITL
+- [CAMEL GitHub](https://github.com/camel-ai/camel) -- Workforce, toolkits, model factory
+- [MetaGPT GitHub](https://github.com/geekan/MetaGPT) -- Roles, teams, serialization
+- [Plandex GitHub](https://github.com/plandex-ai/plandex) -- Plan-execute workflow, diff sandbox
+- [Aider GitHub](https://github.com/Aider-AI/aider) -- Conventions, modes, LLM support
+- [SWE-agent GitHub](https://github.com/SWE-agent/SWE-agent) -- Architecture, tools, batch mode
diff --git a/notes/plan-bg-restore.md b/notes/plan-bg-restore.md
new file mode 100644
index 0000000..e622669
--- /dev/null
+++ b/notes/plan-bg-restore.md
@@ -0,0 +1,1294 @@
+# Plan: Background-running agents + layout restore on browser reopen
+
+> **Audience**: this plan is consumed by two flash subagents running in parallel,
+> plus a Gemini review pass. Flash agents are weak and cheap — every code shape,
+> import path, function signature, expected behavior, and test assertion is
+> spelled out below. Do NOT improvise. Do NOT rename anything. Do NOT touch
+> files outside your segment's "Files owned" list.
+
+---
+
+## 1. Spec (the goal)
+
+Three user-visible behaviors:
+
+1. **Browser-close keeps agents alive.** If the user closes the browser
+ window / reloads the page / loses the network, any running agent
+ continues processing on the backend. No cancellation is triggered.
+
+2. **Layout restore on browser reopen.** When the page next loads, every
+ tab that existed at the time the window was closed is restored, in
+ the same order, with full message history. Tabs whose agents finished
+ while disconnected appear with the completed message. Tabs whose
+ agents are still running appear streaming live (the in-flight
+ assistant message is reconstructed from the backend's in-memory
+ `currentChunks` plus any new deltas).
+
+3. **Explicit tab-close cancels + forgets.** Clicking the X on a tab in
+ the sidebar still cancels the running agent (existing behavior) and
+ also prevents that tab from being restored next time (existing
+ behavior — `DELETE /tabs/:id` already archives the row by setting
+ `is_open = 0`).
+
+The only NEW work is in Behavior 2. Behaviors 1 and 3 already work on
+the current `dev` branch; the implementation must preserve them.
+
+---
+
+## 2. Current state (verified findings)
+
+The investigation report from the explore agent (`task_id ses_19461dcf5ffe4r7wLyAji7Bn5b`) is the canonical
+source. Key facts the implementation MUST respect:
+
+### 2.1 Backend
+- The `tabs` table has columns `id, title, key_id, model_id, parent_tab_id, status, is_open, position, created_at, updated_at` (`packages/core/src/db/index.ts:77-88`).
+- `archiveTab(id)` sets `is_open = 0` — never hard-deletes. Used by `DELETE /tabs/:id`. (`packages/core/src/db/tabs.ts:118-124`).
+- `listOpenTabs()` returns rows where `is_open = 1`, ordered by `position` (`packages/core/src/db/tabs.ts:80-86`).
+- `agentManager.tabAgents` is the in-memory `Map<string, TabAgent>` where each TabAgent tracks `agent`, `status`, `currentChunks: Chunk[] | null`, `currentAssistantId: string | null`, `messageQueue: QueuedMessage[]` (`packages/api/src/agent-manager.ts:137-177`).
+- `getAllStatuses()` currently returns `Record<string, AgentStatus>` — just the strings (`packages/api/src/agent-manager.ts:714-720`).
+- `processMessage` is fire-and-forget: `app.ts:77-79` calls `.processMessage(...).catch(console.error)` and returns immediately.
+- WS `onClose` does NOT stop agents — it only unsubscribes the per-client event listener (`packages/api/src/index.ts:58-66`).
+- `flushAssistant` (the per-turn DB write) is only called at turn-end (`done`) or on system events — NOT on every delta. Mid-stream chunks live in memory only.
+
+### 2.2 Frontend
+- `App.svelte:onMount` currently does: `wsClient.connect()` → `fetchModels()` → `if (tabStore.tabs.length === 0) tabStore.createNewTab()`. **It never reads existing backend tabs.** Every page load is a clean slate.
+- Tab state lives only in Svelte `$state` — zero `localStorage` / `IndexedDB` persistence of tab ids.
+- The existing `statuses` WS event handler (`tabs.svelte.ts:419-446`) reconciles status drift but only for tabs the frontend already has in `$state`. Tabs the frontend doesn't know about are silently ignored.
+
+### 2.3 The wire shapes (the flash agents MUST match these exactly)
+- `GET /tabs` → `{ tabs: TabRow[] }` where each `TabRow = { id, title, keyId, modelId, parentTabId, status, isOpen, position, createdAt, updatedAt }`. **Note camelCase** — the DB function `listOpenTabs` translates snake_case to camelCase.
+- `GET /tabs/:id/messages` → `{ messages: Array<{ id, role, chunks, ... }> }`.
+- `GET /status` → `{ status, messageCount, statuses }`. The `statuses` field's shape is being changed in this plan.
+- `DELETE /tabs/:id` → `{ success: true }`. Already cancels + archives. Unchanged.
+- `POST /tabs` body `{ id?: string; title?: string }` → returns the new tab. Unchanged.
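+
+As a rough illustration of the camelCase note above, the sketch below lists the `TabRow` field names and reports which ones a candidate row is missing. The field names come from this section; `TAB_ROW_KEYS`, `TabRowKey`, and `missingTabRowKeys` are hypothetical names used only for illustration and are not part of the codebase.
+
+```ts
+// Field names as listed for `GET /tabs`; the helper below is hypothetical.
+const TAB_ROW_KEYS = [
+  "id", "title", "keyId", "modelId", "parentTabId",
+  "status", "isOpen", "position", "createdAt", "updatedAt",
+] as const;
+
+type TabRowKey = (typeof TAB_ROW_KEYS)[number];
+
+// Returns the expected camelCase keys that are absent from `row`.
+function missingTabRowKeys(row: Record<string, unknown>): TabRowKey[] {
+  return TAB_ROW_KEYS.filter((key) => !(key in row));
+}
+
+// A row that still uses the DB column names (snake_case) fails the check
+// for every renamed column. The values here are made up.
+const rawDbRow = {
+  id: "t1", title: "Tab", key_id: null, model_id: null, parent_tab_id: null,
+  status: "idle", is_open: 1, position: 0, created_at: 0, updated_at: 0,
+};
+console.log(missingTabRowKeys(rawDbRow));
+```
+
+For the snake_case row above, the helper reports `keyId`, `modelId`, `parentTabId`, `isOpen`, `createdAt`, and `updatedAt` as missing, which is the mismatch a frontend would hit if it read DB column names off the wire.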
+
+---
+
+## 3. Design
+
+### 3.1 Strategy
+
+The backend already runs agents independently of WS subscribers (Behavior 1
+works for free). The persistent record (DB) already survives across browser
+sessions (Behavior 2's data is already on disk). The X-button already
+cancels and archives (Behavior 3 works for free).
+
+So the entire feature reduces to: **on browser reopen, the frontend must
+fetch the persisted tabs and rebuild the UI state, and for any tab that's
+still streaming it must pick up the live event flow without losing the
+chunks that were emitted before the WS handshake completed.**
+
+### 3.2 The mid-stream catch-up problem
+
+If a tab is `running` at the moment the new browser session connects, the
+DB has chunks from the most recent `flushAssistant` call (turn-end or
+system event), NOT the live in-memory `currentChunks` array. The frontend
+needs that in-memory array to render the streaming assistant message
+correctly.
+
+**Solution**: enrich the existing `statuses` snapshot (sent over both
+`GET /status` HTTP and the WS `onOpen`) to include `currentChunks` and
+`currentAssistantId` for every running tab. The frontend uses this
+snapshot to seed the in-flight assistant message before live deltas
+arrive. The race is safe because:
+
+1. JS is single-threaded: as long as the WS `onOpen` handler reads and
+   serializes `currentChunks` synchronously (no `await` in between), the
+   read is atomic with respect to other event-loop ticks, so no delta
+   can land mid-snapshot.
+2. The frontend treats the snapshot as authoritative initial state for
+ the in-flight assistant message; subsequent live deltas append on top
+ via the existing `applyChunkEvent` path.
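+
+The seed-then-append flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the frontend implementation: `Chunk` and `InFlightMessage` are simplified stand-ins, `seedInFlight` and `appendDelta` are hypothetical names, and `appendDelta` only approximates whatever the real `applyChunkEvent` path does. Creating an empty in-flight message for a running tab with no chunks yet is also an assumption.
+
+```ts
+// Simplified stand-ins for the real types; the shapes are assumptions.
+type Chunk = { type: string; text?: string };
+type AgentStatus = "idle" | "running" | "error";
+interface TabStatusSnapshot {
+  status: AgentStatus;
+  currentChunks?: Chunk[];
+  currentAssistantId?: string;
+}
+interface InFlightMessage {
+  id: string;
+  role: "assistant";
+  chunks: Chunk[];
+}
+
+// Hypothetical: build the initial in-flight assistant message from a
+// snapshot. Returns null when the tab is not running.
+function seedInFlight(snap: TabStatusSnapshot, fallbackId: string): InFlightMessage | null {
+  if (snap.status !== "running") return null;
+  return {
+    id: snap.currentAssistantId ?? fallbackId,
+    role: "assistant",
+    // Copy each chunk so later appends never write into the snapshot.
+    chunks: (snap.currentChunks ?? []).map((chunk) => ({ ...chunk })),
+  };
+}
+
+// Hypothetical stand-in for the live-delta path: extend the last chunk
+// when the type matches, otherwise start a new chunk.
+function appendDelta(msg: InFlightMessage, delta: Chunk): void {
+  const last = msg.chunks[msg.chunks.length - 1];
+  if (last && last.type === delta.type && last.text !== undefined && delta.text !== undefined) {
+    last.text += delta.text;
+  } else {
+    msg.chunks.push({ ...delta });
+  }
+}
+
+const seeded = seedInFlight(
+  { status: "running", currentChunks: [{ type: "text", text: "partial" }], currentAssistantId: "a1" },
+  "local-id",
+);
+if (seeded) {
+  appendDelta(seeded, { type: "text", text: " answer" });
+  console.log(seeded.chunks);
+}
+```
+
+In this sketch the seeded message keeps the snapshot's `currentAssistantId`, and the first live delta extends the seeded text chunk rather than starting a second one.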
+
+### 3.3 What's NOT in scope
+
+- Per-delta DB flushing (not needed — snapshot covers in-flight state).
+- Persisting queued messages across server restart (server restart kills
+ the in-memory agent anyway; queued messages were always best-effort).
+- Subagent (`parentTabId != null`) ordering changes — they restore the
+ same way as user tabs; the existing `TabBar.svelte` already separates
+ parent vs child rows.
+- LocalStorage cache of tabs (the backend DB is the source of truth).
+- Migrating any DB schema (the existing schema already supports
+ everything we need via `is_open`).
+
+---
+
+## 4. Phase plan
+
+```
+Phase 0 (sequential, I do it)
+  ├─ Add shared `TabStatusSnapshot` type in packages/core/src/types/index.ts
+  ├─ Re-export from packages/core/src/index.ts
+  └─ Verify core typecheck + biome
+
+Phase 1 (parallel, two flash agents)
+  ├─ Segment A: Backend (flash agent A)
+  │    ├─ packages/api/src/agent-manager.ts
+  │    └─ packages/api/tests/agent-manager.test.ts
+  │
+  └─ Segment B: Frontend (flash agent B)
+       ├─ packages/frontend/src/lib/types.ts  (mirror type)
+       ├─ packages/frontend/src/lib/tabs.svelte.ts  (hydrateFromBackend + statuses handler update)
+       ├─ packages/frontend/src/App.svelte  (onMount sequencing)
+       └─ packages/frontend/tests/chat-store.test.ts
+
+Phase 2 (sequential, I do it)
+  ├─ Run typecheck on all three packages
+  ├─ Run tests on all three packages
+  ├─ Run biome
+  └─ Sanity-check the integration manually
+
+Phase 3 (I do it)
+  ├─ Launch gemini subagent for read-only review (writes report.md only)
+  ├─ Do my own review in parallel
+  └─ WAIT for gemini to finish before applying any fixes
+
+Phase 4 (I do it, possibly spawning more flash agents)
+  └─ Apply fixes from gemini + self-review
+```
+
+---
+
+## 5. Phase 0 — Shared type (main agent only)
+
+### 5.1 File: `packages/core/src/types/index.ts`
+
+Add the following ABOVE the existing `AgentEvent` discriminated union (i.e.
+in the "Agent Status & Events" section, immediately after the `AgentStatus`
+type definition):
+
+```ts
+/**
+ * Snapshot of a single tab's live state, sent on WS connect and via
+ * `GET /status`. Carries enough information for a freshly-loaded
+ * frontend to reconstruct any in-flight assistant message.
+ *
+ * - `status`: always present; mirrors the in-memory `TabAgent.status`.
+ * - `currentChunks`: the live in-flight `Chunk[]` for the running
+ *   assistant turn. Present iff `status === "running"` AND
+ *   `TabAgent.currentChunks` is non-null. Defensively copied at
+ *   snapshot time; the consumer owns the array.
+ * - `currentAssistantId`: DB id of the in-flight assistant message
+ *   (the row that the eventual `flushAssistant` call will write/update).
+ *   Present iff `status === "running"` AND `TabAgent.currentAssistantId`
+ *   is set. The frontend uses this to align its local assistant message
+ *   id with the persisted id so subsequent `done` / reload paths line up.
+ */
+export interface TabStatusSnapshot {
+  status: AgentStatus;
+  currentChunks?: Chunk[];
+  currentAssistantId?: string;
+}
+```
+
+Then UPDATE the existing `statuses` variant of `AgentEvent` (currently around line 134, but verify by grepping for the `"statuses"` literal) to use the new shape:
+
+```ts
+// before:
+//   | { type: "statuses"; statuses: Record<string, AgentStatus> }
+// after:
+  | { type: "statuses"; statuses: Record<string, TabStatusSnapshot> }
+```
+
+### 5.2 File: `packages/core/src/index.ts`
+
+If `TabStatusSnapshot` is already re-exported via an `export * from "./types"` (it almost certainly is; verify by reading the file), no change is needed. Otherwise add it to the explicit type re-export list.
+
+### 5.3 Verification
+
+```sh
+bun run --cwd packages/core typecheck
+bun run --cwd packages/core test
+bunx biome check packages/core
+```
+
+All three must be clean. If they're not, fix the type issue before
+proceeding to Phase 1.
+
+---
+
+## 6. Phase 1 — Parallel implementation
+
+> **CRITICAL FOR FLASH AGENTS**: read your segment in full BEFORE touching
+> any file. Every file path, function signature, and code block is exact.
+> Do not improvise. If something is ambiguous, leave a `TODO(plan-bg-restore):`
+> comment instead of guessing.
+
+### 6.A SEGMENT A — Backend
+
+**Files owned (exclusive write access):**
+- `packages/api/src/agent-manager.ts`
+- `packages/api/tests/agent-manager.test.ts`
+
+**Files allowed to READ (do not modify):**
+- `packages/core/src/types/index.ts` (already has `TabStatusSnapshot` from Phase 0)
+- `packages/api/src/app.ts` (already calls `agentManager.getAllStatuses()` — its consumption pattern shows the contract)
+- `packages/api/src/index.ts` (already calls `agentManager.getAllStatuses()` in the WS `onOpen` — same contract)
+
+#### Task A.1 — Add the snapshot import
+
+At the top of `packages/api/src/agent-manager.ts`, locate the existing import block from `@dispatch/core`. Add `TabStatusSnapshot` to that type-import list. Example:
+
+```ts
+// Before (illustrative; merge with the existing import in place):
+import type { AgentEvent, AgentStatus, Chunk, ChatMessage } from "@dispatch/core";
+
+// After:
+import type { AgentEvent, AgentStatus, Chunk, ChatMessage, TabStatusSnapshot } from "@dispatch/core";
+```
+
+#### Task A.2 — Rewrite `getAllStatuses`
+
+Find the existing method at `packages/api/src/agent-manager.ts:714-720`. It currently reads:
+
+```ts
+getAllStatuses(): Record<string, AgentStatus> {
+  const result: Record<string, AgentStatus> = {};
+  for (const [tabId, tabAgent] of this.tabAgents.entries()) {
+    result[tabId] = tabAgent.status;
+  }
+  return result;
+}
+```
+
+Replace with:
+
+```ts
+/**
+ * Snapshot of every tab the manager is currently tracking. Sent on WS
+ * connect and via GET /status so a freshly-loaded frontend can
+ * reconstruct any in-flight assistant turn without missing the chunks
+ * that arrived before its WS handshake completed.
+ *
+ * For each running tab, the snapshot includes:
+ *   - status: "running"
+ *   - currentChunks: a defensive shallow copy of `tabAgent.currentChunks`
+ *     (the live chunk array the streaming loop appends to). The
+ *     consumer owns this copy and may mutate it freely.
+ *   - currentAssistantId: the DB id of the in-flight assistant message
+ *     row. The frontend aligns its local assistant message id with
+ *     this so the next `done` event lands on the right message.
+ *
+ * For idle/error tabs, only `status` is present. Tabs not in
+ * `this.tabAgents` (e.g. tabs in the DB that have never been touched
+ * since server start) are absent from the returned record; the
+ * caller infers their status from the DB row (always "idle" at rest).
+ */
+getAllStatuses(): Record<string, TabStatusSnapshot> {
+  const result: Record<string, TabStatusSnapshot> = {};
+  for (const [tabId, tabAgent] of this.tabAgents.entries()) {
+    const snap: TabStatusSnapshot = { status: tabAgent.status };
+    if (tabAgent.status === "running") {
+      if (tabAgent.currentChunks) {
+        // Defensive shallow copy: callers may serialize/mutate.
+        snap.currentChunks = [...tabAgent.currentChunks];
+      }
+      if (tabAgent.currentAssistantId) {
+        snap.currentAssistantId = tabAgent.currentAssistantId;
+      }
+    }
+    result[tabId] = snap;
+  }
+  return result;
+}
+```
+
+**DO NOT** add a separate `getTabSnapshot(tabId)` method. The single
+`getAllStatuses` covers every consumer.
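+
+The "defensive shallow copy" in `getAllStatuses` protects the live array's length and order, not the chunk objects inside it. A minimal sketch of that boundary, using a simplified `Chunk` stand-in (an assumption; the real type lives in `@dispatch/core`) and a hypothetical `shallowCopyDemo` function:
+
+```ts
+// Simplified stand-in for the real Chunk type (assumption).
+type Chunk = { type: string; text?: string };
+
+function shallowCopyDemo(): { liveLength: number; liveFirstText: string | undefined } {
+  const live: Chunk[] = [{ type: "text", text: "live" }];
+  const copy = [...live]; // same kind of copy the snapshot takes
+
+  // Array-level mutation stays local to the copy.
+  copy.push({ type: "text", text: "extra" });
+
+  // Element-level mutation is visible through both arrays, because
+  // both hold references to the same chunk object.
+  const first = copy[0];
+  if (first) first.text = "edited";
+
+  return { liveLength: live.length, liveFirstText: live[0]?.text };
+}
+
+console.log(shallowCopyDemo());
+```
+
+Here the live array still has one element, but its text now reads "edited". That is harmless if the consumers of the snapshot only serialize it, which is an assumption about `app.ts` and `index.ts` and not something verified here.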
+
+#### Task A.3 — Tests
+
+In `packages/api/tests/agent-manager.test.ts`, add a new group of `it` tests at the end of the existing top-level `describe("AgentManager", ...)` (directly inside it, not wrapped in a nested `describe`). Place it as the LAST sub-section, AFTER the existing "done event includes a thinking chunk with metadata in its message" test and AFTER the "History pre-population on Agent (re)construction" tests already in place. Use this exact code:
+
+```ts
+  // ─── getAllStatuses snapshot shape (for browser-reopen restore) ────
+  //
+  // The snapshot enriches the legacy `Record<string, AgentStatus>` shape
+  // with per-tab in-flight context so a fresh frontend can render the
+  // streaming assistant message correctly after a reload.
+
+  it("getAllStatuses returns an empty record when no tabs are tracked", () => {
+    const manager = new AgentManager();
+    expect(manager.getAllStatuses()).toEqual({});
+  });
+
+  it("getAllStatuses returns { status } for an idle tab (no currentChunks/currentAssistantId)", async () => {
+    const manager = new AgentManager();
+    // Drive a full turn so the tab gets registered; default mock run
+    // settles back to idle by the time `await` resolves.
+    await manager.processMessage("tab-idle", "hi");
+    const snap = manager.getAllStatuses();
+    expect(snap["tab-idle"]).toBeDefined();
+    expect(snap["tab-idle"]?.status).toBe("idle");
+    expect(snap["tab-idle"]).not.toHaveProperty("currentChunks");
+    expect(snap["tab-idle"]).not.toHaveProperty("currentAssistantId");
+  });
+
+  it("getAllStatuses includes currentChunks and currentAssistantId for a running tab", () => {
+    const manager = new AgentManager();
+    // Reach into the private map to set up a synthetic running state.
+    // Justification: there is no public API to enter a sustained
+    // "running" state without actually streaming, and we want to
+    // assert the snapshot shape, not the streaming pipeline.
+    const inner = manager as unknown as {
+      tabAgents: Map<string, {
+        agent: null;
+        status: "running" | "idle" | "error";
+        keyId: null;
+        modelId: null;
+        taskList: { onChange: (cb: unknown) => void };
+        messageQueue: unknown[];
+        queueListeners: unknown[];
+        shellStore: unknown;
+        transcriptStore: unknown;
+        currentChunks: Array<{ type: string; text?: string }> | null;
+        currentAssistantId: string | null;
+      }>;
+    };
+    inner.tabAgents.set("tab-running", {
+      agent: null,
+      status: "running",
+      keyId: null,
+      modelId: null,
+      taskList: { onChange: () => {} },
+      messageQueue: [],
+      queueListeners: [],
+      shellStore: {},
+      transcriptStore: {},
+      currentChunks: [
+        { type: "thinking", text: "let me think" },
+        { type: "text", text: "partial answer" },
+      ],
+      currentAssistantId: "assistant-msg-id-7",
+    });
+
+    const snap = manager.getAllStatuses();
+    expect(snap["tab-running"]).toBeDefined();
+    expect(snap["tab-running"]?.status).toBe("running");
+    expect(snap["tab-running"]?.currentAssistantId).toBe("assistant-msg-id-7");
+    expect(snap["tab-running"]?.currentChunks).toEqual([
+      { type: "thinking", text: "let me think" },
+      { type: "text", text: "partial answer" },
+    ]);
+  });
+
+  it("getAllStatuses defensively copies currentChunks (mutating the snapshot doesn't affect the live array)", () => {
+    const manager = new AgentManager();
+    const inner = manager as unknown as {
+      tabAgents: Map<string, {
+        agent: null;
+        status: "running";
+        keyId: null;
+        modelId: null;
+        taskList: { onChange: (cb: unknown) => void };
+        messageQueue: unknown[];
+        queueListeners: unknown[];
+        shellStore: unknown;
+        transcriptStore: unknown;
+        currentChunks: Array<{ type: string; text?: string }>;
+        currentAssistantId: string;
+      }>;
+    };
+    const liveChunks = [{ type: "text", text: "live" }];
+    inner.tabAgents.set("tab-copy", {
+      agent: null,
+      status: "running",
+      keyId: null,
+      modelId: null,
+      taskList: { onChange: () => {} },
+      messageQueue: [],
+      queueListeners: [],
+      shellStore: {},
+      transcriptStore: {},
+      currentChunks: liveChunks,
+      currentAssistantId: "msg-x",
+    });
+
+    const snap = manager.getAllStatuses();
+    // Mutate the snapshot's array
+    snap["tab-copy"]?.currentChunks?.push({ type: "text", text: "polluted" });
+    // Live array must be untouched
+    expect(liveChunks).toEqual([{ type: "text", text: "live" }]);
+  });
+
+  it("getAllStatuses omits currentChunks when a running tab has none yet", () => {
+    const manager = new AgentManager();
+    const inner = manager as unknown as {
+      tabAgents: Map<string, {
+        agent: null;
+        status: "running";
+        keyId: null;
+        modelId: null;
+        taskList: { onChange: (cb: unknown) => void };
+        messageQueue: unknown[];
+        queueListeners: unknown[];
+        shellStore: unknown;
+        transcriptStore: unknown;
+        currentChunks: null;
+        currentAssistantId: null;
+      }>;
+    };
+    inner.tabAgents.set("tab-early", {
+      agent: null,
+      status: "running",
+      keyId: null,
+      modelId: null,
+      taskList: { onChange: () => {} },
+      messageQueue: [],
+      queueListeners: [],
+      shellStore: {},
+      transcriptStore: {},
+      currentChunks: null,
+      currentAssistantId: null,
+    });
+
+    const snap = manager.getAllStatuses();
+    expect(snap["tab-early"]?.status).toBe("running");
+    expect(snap["tab-early"]).not.toHaveProperty("currentChunks");
+    expect(snap["tab-early"]).not.toHaveProperty("currentAssistantId");
+  });
+```
+
+#### Task A.4 — Verify
+
+```sh
+bun run --cwd packages/api typecheck # must pass
+bun run --cwd packages/api test # all tests must pass (existing + new)
+bunx biome check packages/api # must be clean (no errors)
+```
+
+If biome reports formatting issues, run `bunx biome check --write packages/api` and re-verify.
+
+#### Task A.5 — What NOT to do
+
+- **DO NOT** modify `packages/api/src/index.ts`. The WS `onOpen` already calls `agentManager.getAllStatuses()` — the wire payload changes automatically when the return type changes.
+- **DO NOT** modify `packages/api/src/app.ts`. The `GET /status` route already calls `agentManager.getAllStatuses()` — same automatic propagation.
+- **DO NOT** modify the frontend. Segment B owns it.
+- **DO NOT** change any existing test cases. Only add the new ones described above.
+- **DO NOT** add or change the `getStatus()` (singular, deprecated) method. Leave it as-is.
+- **DO NOT** add or change `getTabStatus(tabId)`. Leave it returning `AgentStatus`.
+
+---
+
+### 6.B SEGMENT B — Frontend
+
+**Files owned (exclusive write access):**
+- `packages/frontend/src/lib/types.ts`
+- `packages/frontend/src/lib/tabs.svelte.ts`
+- `packages/frontend/src/App.svelte`
+- `packages/frontend/tests/chat-store.test.ts`
+
+**Files allowed to READ (do not modify):**
+- `packages/core/src/types/index.ts` (already has `TabStatusSnapshot` from Phase 0; you'll mirror the type)
+- `packages/frontend/src/lib/config.ts` (gives you `config.apiBase`)
+- `packages/frontend/src/lib/ws.svelte.ts` (gives you `wsClient.connect()`)
+
+#### Task B.1 — Mirror the `TabStatusSnapshot` type in the frontend
+
+The frontend deliberately mirrors core's type shapes locally (see the existing `Chunk` and `AgentEvent` types in `packages/frontend/src/lib/types.ts:21-127`). Add the snapshot type next to those.
+
+In `packages/frontend/src/lib/types.ts`, find the existing `AgentEvent` discriminated union (currently at line 79). The `statuses` variant currently reads:
+
+```ts
+| { type: "statuses"; statuses: Record<string, "idle" | "running" | "error"> }
+```
+
+Replace this single variant with:
+
+```ts
+| { type: "statuses"; statuses: Record<string, TabStatusSnapshot> }
+```
+
+Then add the `TabStatusSnapshot` interface immediately before the `AgentEvent` union (i.e. between the existing `ConnectionStatus` type and the `AgentEvent` union):
+
+```ts
+/**
+ * Mirror of core's `TabStatusSnapshot` (see packages/core/src/types/index.ts).
+ *
+ * Sent on every WS (re)connect and via `GET /status`. The frontend uses
+ * this to:
+ * - reconcile its in-memory `agentStatus` with the backend's truth
+ * after a disconnect window;
+ * - reconstruct the in-flight assistant message for any tab the
+ * backend is currently streaming, so the user sees the partial
+ * thinking / text without waiting for the next delta.
+ *
+ * Wire-format symmetry MUST be kept with core. If you change one,
+ * change the other.
+ */
+export interface TabStatusSnapshot {
+ status: "idle" | "running" | "error";
+ currentChunks?: Chunk[];
+ currentAssistantId?: string;
+}
+```
+
+#### Task B.2 — Add `hydrateFromBackend()` to the tab store
+
+In `packages/frontend/src/lib/tabs.svelte.ts`, add a new function `hydrateFromBackend()` and export it from the store. Place the function definition adjacent to the existing `openAgentTab` and `reloadTabMessagesFromApi` (around lines 172-400 of the file).
+
+The function should be defined inside the `createTabStore` factory (or wherever the existing exported functions like `createNewTab`, `closeTab` live). It should be exported by adding it to the returned object literal at the bottom of `createTabStore`.
+
+Exact behavior (see also Task B.3 and B.4 for the integration):
+
+```ts
+/**
+ * Hydrate the tab store from the backend on app mount. Restores the
+ * full list of open tabs (every row with `is_open = 1` in the DB),
+ * loads each tab's persisted message history, and seeds the in-flight
+ * assistant message for any tab the backend is currently streaming.
+ *
+ * Wire calls:
+ * - GET /tabs → list of open tabs in `position` order
+ * - GET /tabs/:id/messages → persisted ChatMessage[] for each
+ * - GET /status → in-flight TabStatusSnapshot map
+ *
+ * Failure modes (all log + continue with whatever was successfully
+ * hydrated; callers fall back to creating a fresh tab if the final
+ * `tabs` array is empty):
+ * - /tabs request fails → no tabs restored
+ * - /tabs/:id/messages fails → that tab restored with empty messages
+ * - /status fails → tabs restored as "idle" with no in-flight
+ * message (`statusMap` stays empty) until the next event
+ * arrives; harmless because the WS will broadcast `statuses`
+ * on reconnect anyway.
+ *
+ * Returns the number of tabs hydrated (0 on total failure, ≥1 on
+ * partial or full success). Caller uses this to decide whether to
+ * create a fresh tab.
+ *
+ * Idempotency: if `tabs.length > 0` when called, returns 0 without
+ * touching state — the caller already has tabs from elsewhere (e.g.
+ * a hot-reload that preserved Svelte state).
+ */
+async function hydrateFromBackend(): Promise<number> {
+ if (tabs.length > 0) return 0;
+
+ // 1. Fetch the list of open tabs from the DB.
+ let tabRows: Array<{
+ id: string;
+ title: string;
+ keyId?: string | null;
+ modelId?: string | null;
+ parentTabId?: string | null;
+ }> = [];
+ try {
+ const res = await fetch(`${config.apiBase}/tabs`);
+ if (!res.ok) return 0;
+ const data = (await res.json()) as { tabs?: typeof tabRows };
+ tabRows = Array.isArray(data.tabs) ? data.tabs : [];
+ } catch {
+ return 0;
+ }
+
+ if (tabRows.length === 0) return 0;
+
+ // 2. Fetch the in-flight snapshot. Failure is non-fatal.
+ let statusMap: Record<string, TabStatusSnapshot> = {};
+ try {
+ const res = await fetch(`${config.apiBase}/status`);
+ if (res.ok) {
+ const data = (await res.json()) as { statuses?: Record<string, TabStatusSnapshot> };
+ if (data.statuses && typeof data.statuses === "object") {
+ statusMap = data.statuses;
+ }
+ }
+ } catch {
+ // Non-fatal: tabs still restore with idle status.
+ }
+
+ // 3. For each tab, fetch its persisted messages in parallel.
+ const messageFetches = tabRows.map(async (row) => {
+ try {
+ const res = await fetch(`${config.apiBase}/tabs/${row.id}/messages`);
+ if (!res.ok) return { id: row.id, messages: [] as ChatMessage[] };
+ const data = (await res.json()) as {
+ messages?: Array<{ id?: string; role: string; chunks?: Chunk[] }>;
+ };
+ const messages: ChatMessage[] = (data.messages ?? []).map((m) => ({
+ id: m.id ?? generateId(),
+ role: m.role as ChatMessage["role"],
+ chunks: Array.isArray(m.chunks) ? m.chunks : [],
+ isStreaming: false,
+ }));
+ return { id: row.id, messages };
+ } catch {
+ return { id: row.id, messages: [] as ChatMessage[] };
+ }
+ });
+
+ const messagesByTab = new Map<string, ChatMessage[]>();
+ for (const result of await Promise.all(messageFetches)) {
+ messagesByTab.set(result.id, result.messages);
+ }
+
+ // 4. Build the Tab objects, splicing in the in-flight snapshot for
+ // running tabs.
+ const restored: Tab[] = tabRows.map((row) => {
+ const snap = statusMap[row.id];
+ const messages = messagesByTab.get(row.id) ?? [];
+ const agentStatus: Tab["agentStatus"] = snap?.status ?? "idle";
+
+ let currentAssistantId: string | null = null;
+ let finalMessages = messages;
+
+ if (agentStatus === "running" && snap?.currentAssistantId) {
+ currentAssistantId = snap.currentAssistantId;
+ // Find or create the in-flight assistant message. If the DB
+ // already has a row with this id (the backend appended on
+ // first flush and we picked it up via /tabs/:id/messages),
+ // merge the snapshot chunks on top — the snapshot is the
+ // live source of truth and may have chunks the DB doesn't.
+ // If there's no matching row, append a new in-flight
+ // assistant message holding only the snapshot chunks.
+ const existingIdx = finalMessages.findIndex((m) => m.id === snap.currentAssistantId);
+ if (existingIdx >= 0) {
+ finalMessages = finalMessages.map((m, i) =>
+ i === existingIdx
+ ? {
+ ...m,
+ chunks: snap.currentChunks ? [...snap.currentChunks] : m.chunks,
+ isStreaming: true,
+ }
+ : m,
+ );
+ } else {
+ finalMessages = [
+ ...finalMessages,
+ {
+ id: snap.currentAssistantId,
+ role: "assistant",
+ chunks: snap.currentChunks ? [...snap.currentChunks] : [],
+ isStreaming: true,
+ },
+ ];
+ }
+ }
+
+ return {
+ id: row.id,
+ title: row.title,
+ messages: finalMessages,
+ agentStatus,
+ keyId: row.keyId ?? null,
+ modelId: row.modelId ?? null,
+ reasoningEffort: "max",
+ currentAssistantId,
+ tasks: [],
+ injectedSkills: [],
+ parentTabId: row.parentTabId ?? null,
+ persistent: true,
+ agentSlug: null,
+ agentScope: null,
+ agentModels: null,
+ workingDirectory: null,
+ queuedMessages: [],
+ };
+ });
+
+ tabs = restored;
+ // Activate the first restored tab (the list is already ordered by
+ // `position` from the backend).
+ activeTabId = restored[0]?.id ?? null;
+ return restored.length;
+}
+```
+
+Then add `hydrateFromBackend` to the returned object at the bottom of `createTabStore` so it's accessible via `tabStore.hydrateFromBackend()`. Find the existing `return { ... }` block (it lists `createNewTab`, `switchTab`, `closeTab`, etc.) and add `hydrateFromBackend,` to it. **DO NOT** reorder existing keys.
+
+#### Task B.3 — Update the WS `statuses` handler
+
+In `packages/frontend/src/lib/tabs.svelte.ts`, find the existing `case "statuses":` block inside `handleEvent` (currently around line 419-446). It currently reads:
+
+```ts
+case "statuses": {
+ const backend = event.statuses;
+ for (const t of tabs) {
+ const backendStatus = backend[t.id] ?? "idle";
+ if (t.agentStatus === "running" && backendStatus !== "running") {
+ void reloadTabMessagesFromApi(t.id);
+ }
+ if (t.agentStatus !== backendStatus) {
+ updateTab(t.id, { agentStatus: backendStatus });
+ }
+ if (backendStatus !== "running" && t.currentAssistantId) {
+ updateMessages(t.id, (msgs) =>
+ msgs.map((m) => (m.id === t.currentAssistantId ? { ...m, isStreaming: false } : m)),
+ );
+ updateTab(t.id, { currentAssistantId: null });
+ }
+ }
+ break;
+}
+```
+
+Replace this entire `case "statuses":` block with the following:
+
+```ts
+case "statuses": {
+ // WS (re)connect snapshot. The shape was widened to
+ // TabStatusSnapshot (status + optional currentChunks +
+ // optional currentAssistantId) so the frontend can seed
+ // in-flight assistant messages on browser reopen.
+ const backend = event.statuses;
+ for (const t of tabs) {
+ const snap = backend[t.id];
+ const backendStatus = snap?.status ?? "idle";
+
+ // Desync case: frontend thought it was streaming, backend
+ // has already moved on. Pull the persisted chunks so the
+ // final answer shows up.
+ if (t.agentStatus === "running" && backendStatus !== "running") {
+ void reloadTabMessagesFromApi(t.id);
+ }
+
+ // Status alignment.
+ if (t.agentStatus !== backendStatus) {
+ updateTab(t.id, { agentStatus: backendStatus });
+ }
+
+ if (backendStatus === "running") {
+ // Seed the in-flight assistant message from the snapshot.
+ // This handles the "browser just reopened mid-stream"
+ // path: the DB only has chunks up to the last
+ // flushAssistant call, but the snapshot has the live
+ // in-memory currentChunks.
+ if (snap?.currentAssistantId) {
+ const targetId = snap.currentAssistantId;
+ updateTab(t.id, { currentAssistantId: targetId });
+ updateMessages(t.id, (msgs) => {
+ const idx = msgs.findIndex((m) => m.id === targetId);
+ if (idx >= 0) {
+ return msgs.map((m, i) =>
+ i === idx
+ ? {
+ ...m,
+ chunks: snap.currentChunks ? [...snap.currentChunks] : m.chunks,
+ isStreaming: true,
+ }
+ : m,
+ );
+ }
+ return [
+ ...msgs,
+ {
+ id: targetId,
+ role: "assistant",
+ chunks: snap.currentChunks ? [...snap.currentChunks] : [],
+ isStreaming: true,
+ },
+ ];
+ });
+ }
+ } else if (t.currentAssistantId) {
+ // Not running: clear streaming flags.
+ updateMessages(t.id, (msgs) =>
+ msgs.map((m) =>
+ m.id === t.currentAssistantId ? { ...m, isStreaming: false } : m,
+ ),
+ );
+ updateTab(t.id, { currentAssistantId: null });
+ }
+ }
+ break;
+}
+```
+
+**DO NOT** modify any other case in the `handleEvent` switch.
+
+#### Task B.4 — Update `App.svelte` onMount to hydrate
+
+In `packages/frontend/src/App.svelte`, find the existing `onMount` (currently lines 78-99). It reads:
+
+```ts
+onMount(() => {
+ // Apply saved theme
+ const saved = localStorage.getItem(STORAGE_KEY);
+ if (saved) {
+ document.documentElement.setAttribute("data-theme", saved);
+ }
+
+ // Connect WebSocket
+ wsClient.connect();
+
+ // Initial models fetch
+ fetchModels();
+
+ // Create initial tab
+ if (tabStore.tabs.length === 0) {
+ tabStore.createNewTab();
+ }
+
+ return () => {
+ wsClient.disconnect();
+ };
+});
+```
+
+Replace its body so it (1) hydrates tabs from the backend BEFORE deciding to create a fresh tab, and (2) connects the WS in parallel with hydration (since hydration uses HTTP, the WS connect can race ahead — the WS `statuses` handler is now idempotent for already-restored tabs):
+
+```ts
+onMount(() => {
+ // Apply saved theme
+ const saved = localStorage.getItem(STORAGE_KEY);
+ if (saved) {
+ document.documentElement.setAttribute("data-theme", saved);
+ }
+
+ // Connect WebSocket in parallel with hydration. The `statuses`
+ // snapshot delivered on WS open is idempotent against
+ // already-hydrated tabs (the handler reconciles per-tab).
+ wsClient.connect();
+
+ // Initial models fetch (fire-and-forget; UI tolerates models
+ // arriving later than tabs).
+ fetchModels();
+
+ // Restore tabs from the backend. The user's previous session is
+ // the source of truth; only fall back to a fresh tab if nothing
+ // was restored (first-ever load, or DB was wiped, or HTTP failed).
+ void (async () => {
+ const restored = await tabStore.hydrateFromBackend();
+ if (restored === 0 && tabStore.tabs.length === 0) {
+ await tabStore.createNewTab();
+ }
+ })();
+
+ return () => {
+ wsClient.disconnect();
+ };
+});
+```
+
+**DO NOT** change the `<script>` imports. If the `void (async () => { ... })()` pattern triggers a lint warning, wrap the body in a named async function and call that instead.
+
+#### Task B.5 — Tests
+
+In `packages/frontend/tests/chat-store.test.ts`, append a new `describe` block at the very end of the file (after the existing tests' closing brace).
+
+The test fixture pattern in this file already mocks `fetch`. You will override the mock per-test using `vi.stubGlobal("fetch", ...)`.
+
+Add the following tests (with their imports merged into the existing top-of-file import block as needed):
+
+```ts
+// ─── hydrateFromBackend ─────────────────────────────────────────
+//
+// Verifies the browser-reopen restore path: GET /tabs + GET /status +
+// GET /tabs/:id/messages combined into the in-memory tab store with
+// in-flight chunks seeded for any running tab.
+
+describe("hydrateFromBackend", () => {
+ it("restores tabs from /tabs with their persisted messages", async () => {
+ vi.stubGlobal(
+ "fetch",
+ vi.fn((url: string) => {
+ if (url.endsWith("/tabs")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ tabs: [
+ { id: "t1", title: "First", keyId: null, modelId: null, parentTabId: null },
+ { id: "t2", title: "Second", keyId: "k", modelId: "m", parentTabId: null },
+ ],
+ }),
+ });
+ }
+ if (url.endsWith("/status")) {
+ return Promise.resolve({
+ ok: true,
+ json: () => Promise.resolve({ statuses: {} }),
+ });
+ }
+ if (url.endsWith("/tabs/t1/messages")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ messages: [
+ { id: "m1", role: "user", chunks: [{ type: "text", text: "hello" }] },
+ {
+ id: "m2",
+ role: "assistant",
+ chunks: [{ type: "text", text: "hi back" }],
+ },
+ ],
+ }),
+ });
+ }
+ if (url.endsWith("/tabs/t2/messages")) {
+ return Promise.resolve({
+ ok: true,
+ json: () => Promise.resolve({ messages: [] }),
+ });
+ }
+ return Promise.reject(new Error(`unexpected fetch ${url}`));
+ }),
+ );
+
+ const store = createTabStore();
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(2);
+ expect(store.tabs.length).toBe(2);
+ expect(store.tabs[0]?.id).toBe("t1");
+ expect(store.tabs[0]?.messages.length).toBe(2);
+ expect(store.tabs[1]?.id).toBe("t2");
+ expect(store.tabs[1]?.messages.length).toBe(0);
+ expect(store.activeTabId).toBe("t1");
+ });
+
+ it("seeds the in-flight assistant message from /status for a running tab", async () => {
+ vi.stubGlobal(
+ "fetch",
+ vi.fn((url: string) => {
+ if (url.endsWith("/tabs")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ tabs: [
+ { id: "tr", title: "Running tab", keyId: null, modelId: null, parentTabId: null },
+ ],
+ }),
+ });
+ }
+ if (url.endsWith("/status")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ statuses: {
+ tr: {
+ status: "running",
+ currentAssistantId: "live-msg-id",
+ currentChunks: [
+ { type: "thinking", text: "still thinking" },
+ { type: "text", text: "partial " },
+ ],
+ },
+ },
+ }),
+ });
+ }
+ if (url.endsWith("/tabs/tr/messages")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ messages: [{ id: "u1", role: "user", chunks: [{ type: "text", text: "go" }] }],
+ }),
+ });
+ }
+ return Promise.reject(new Error(`unexpected fetch ${url}`));
+ }),
+ );
+
+ const store = createTabStore();
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(1);
+ const tab = store.tabs[0];
+ expect(tab?.agentStatus).toBe("running");
+ expect(tab?.currentAssistantId).toBe("live-msg-id");
+ // Two messages: the user message + the seeded in-flight assistant.
+ expect(tab?.messages.length).toBe(2);
+ const inflight = tab?.messages.find((m) => m.id === "live-msg-id");
+ expect(inflight).toBeDefined();
+ expect(inflight?.isStreaming).toBe(true);
+ expect(inflight?.chunks).toEqual([
+ { type: "thinking", text: "still thinking" },
+ { type: "text", text: "partial " },
+ ]);
+ });
+
+ it("returns 0 and leaves tabs empty when /tabs fails", async () => {
+ vi.stubGlobal(
+ "fetch",
+ vi.fn(() => Promise.resolve({ ok: false, json: () => Promise.resolve({}) })),
+ );
+ const store = createTabStore();
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(0);
+ expect(store.tabs.length).toBe(0);
+ });
+
+ it("returns 0 and leaves tabs empty when /tabs returns an empty array", async () => {
+ vi.stubGlobal(
+ "fetch",
+ vi.fn((url: string) => {
+ if (url.endsWith("/tabs")) {
+ return Promise.resolve({ ok: true, json: () => Promise.resolve({ tabs: [] }) });
+ }
+ return Promise.reject(new Error(`unexpected fetch ${url}`));
+ }),
+ );
+ const store = createTabStore();
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(0);
+ expect(store.tabs.length).toBe(0);
+ });
+
+ it("is a no-op when the store already has tabs (idempotency)", async () => {
+ const store = createTabStore();
+ // Seed the store with one tab through the public API, as if it
+ // had survived a hot-reload. The existing `createNewTab` already
+ // tolerates a failing fetch and adds the tab locally.
+ vi.stubGlobal(
+ "fetch",
+ vi.fn(() => Promise.reject(new Error("test: createNewTab fetch reject"))),
+ );
+ await store.createNewTab();
+ expect(store.tabs.length).toBe(1);
+
+ // Now swap to a fetch that would lie about there being 3 tabs;
+ // hydrateFromBackend must NOT call it.
+ const sentinelFetch = vi.fn(() =>
+ Promise.resolve({
+ ok: true,
+ json: () => Promise.resolve({ tabs: [{ id: "x" }, { id: "y" }, { id: "z" }] }),
+ }),
+ );
+ vi.stubGlobal("fetch", sentinelFetch);
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(0);
+ expect(store.tabs.length).toBe(1);
+ expect(sentinelFetch).not.toHaveBeenCalled();
+ });
+
+ it("restores a tab with an idle status when /status omits it", async () => {
+ vi.stubGlobal(
+ "fetch",
+ vi.fn((url: string) => {
+ if (url.endsWith("/tabs")) {
+ return Promise.resolve({
+ ok: true,
+ json: () =>
+ Promise.resolve({
+ tabs: [{ id: "ti", title: "Idle", keyId: null, modelId: null, parentTabId: null }],
+ }),
+ });
+ }
+ if (url.endsWith("/status")) {
+ return Promise.resolve({ ok: true, json: () => Promise.resolve({ statuses: {} }) });
+ }
+ if (url.endsWith("/tabs/ti/messages")) {
+ return Promise.resolve({ ok: true, json: () => Promise.resolve({ messages: [] }) });
+ }
+ return Promise.reject(new Error(`unexpected fetch ${url}`));
+ }),
+ );
+ const store = createTabStore();
+ const n = await store.hydrateFromBackend();
+ expect(n).toBe(1);
+ expect(store.tabs[0]?.agentStatus).toBe("idle");
+ expect(store.tabs[0]?.currentAssistantId).toBeNull();
+ });
+});
+
+// ─── statuses WS event with the wider TabStatusSnapshot shape ───
+//
+// The handler must reconcile snapshot.status against the local tab,
+// and (when running) seed currentChunks into the in-flight assistant
+// message.
+
+describe("handleEvent statuses with TabStatusSnapshot", () => {
+ it("seeds the in-flight assistant message when a running snapshot arrives", async () => {
+ const store = createTabStore();
+ // Add a tab to the store via the existing createNewTab path
+ // (fetch was mocked to reject in beforeEach; createNewTab tolerates).
+ // Await it so the tab exists before we drive a statuses event.
+ await store.createNewTab();
+ const tabId = store.tabs[0]?.id;
+ if (!tabId) throw new Error("test fixture: tab id missing");
+
+ store.handleEvent({
+ type: "statuses",
+ statuses: {
+ [tabId]: {
+ status: "running",
+ currentAssistantId: "live-x",
+ currentChunks: [{ type: "text", text: "live data" }],
+ },
+ },
+ });
+
+ const tab = store.tabs.find((t) => t.id === tabId);
+ expect(tab?.agentStatus).toBe("running");
+ expect(tab?.currentAssistantId).toBe("live-x");
+ const inflight = tab?.messages.find((m) => m.id === "live-x");
+ expect(inflight).toBeDefined();
+ expect(inflight?.chunks).toEqual([{ type: "text", text: "live data" }]);
+ expect(inflight?.isStreaming).toBe(true);
+ });
+
+ it("clears in-flight pointers when snapshot says the tab is idle", async () => {
+ const store = createTabStore();
+ await store.createNewTab();
+ const tabId = store.tabs[0]?.id;
+ if (!tabId) throw new Error("test fixture: tab id missing");
+
+ // First put the tab into a running state with an in-flight message.
+ store.handleEvent({
+ type: "statuses",
+ statuses: {
+ [tabId]: {
+ status: "running",
+ currentAssistantId: "msg-a",
+ currentChunks: [{ type: "text", text: "x" }],
+ },
+ },
+ });
+ expect(store.tabs.find((t) => t.id === tabId)?.currentAssistantId).toBe("msg-a");
+
+ // Now snapshot says idle.
+ store.handleEvent({
+ type: "statuses",
+ statuses: { [tabId]: { status: "idle" } },
+ });
+ const tab = store.tabs.find((t) => t.id === tabId);
+ expect(tab?.agentStatus).toBe("idle");
+ expect(tab?.currentAssistantId).toBeNull();
+ const msgA = tab?.messages.find((m) => m.id === "msg-a");
+ expect(msgA?.isStreaming).toBe(false);
+ });
+});
+```
+
+You may need to import `createTabStore` and the `TabStatusSnapshot` type — check the existing imports at the top of the test file. The store factory and types should already be importable from the locations the existing tests use.
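+
+Task B.2 also lists a failing `/tabs/:id/messages` request as a tolerated failure mode, which none of the tests above exercises directly. As a minimal standalone sketch of that tolerance pattern (the names `fetchMessagesOrEmpty`, `FetchLike`, and `Msg` are hypothetical and simplified; they are not part of the store), each per-tab fetch resolves to an empty list instead of rejecting, so one bad tab cannot fail the whole `Promise.all`:
+
+```ts
+// Hypothetical, simplified types for illustration only.
+type Msg = { id: string; role: string };
+type FetchLike = (
+  url: string,
+) => Promise<{ ok: boolean; json: () => Promise<unknown> }>;
+
+// Fetch messages for every tab in parallel. A non-OK response or a
+// thrown error degrades to an empty list for that tab only.
+async function fetchMessagesOrEmpty(
+  fetchFn: FetchLike,
+  apiBase: string,
+  tabIds: string[],
+): Promise<Map<string, Msg[]>> {
+  const results = await Promise.all(
+    tabIds.map(async (id) => {
+      try {
+        const res = await fetchFn(`${apiBase}/tabs/${id}/messages`);
+        if (!res.ok) return { id, messages: [] as Msg[] };
+        const data = (await res.json()) as { messages?: Msg[] };
+        const messages = Array.isArray(data.messages) ? data.messages : [];
+        return { id, messages };
+      } catch {
+        return { id, messages: [] as Msg[] };
+      }
+    }),
+  );
+  const byTab = new Map<string, Msg[]>();
+  for (const r of results) byTab.set(r.id, r.messages);
+  return byTab;
+}
+```
+
+A test along these lines could stub `fetch` to fail for a single tab id and assert that the other tabs still restore with their messages.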
+
+#### Task B.6 — Verify
+
+```sh
+bun run --cwd packages/frontend typecheck # svelte-check, must pass
+bun run --cwd packages/frontend test # all tests must pass
+bunx biome check packages/frontend # must be clean
+```
+
+If biome reports formatting issues, run `bunx biome check --write packages/frontend` and re-verify.
+
+#### Task B.7 — What NOT to do
+
+- **DO NOT** modify any backend file. Segment A owns those.
+- **DO NOT** rename, delete, or reorder any existing exported functions on the tab store.
+- **DO NOT** modify `createNewTab`, `closeTab`, `openAgentTab`, `reloadTabMessagesFromApi`, or any other existing function. Only ADD `hydrateFromBackend` and CHANGE the body of the `case "statuses":` block in `handleEvent`.
+- **DO NOT** touch the WebSocket lifecycle (`wsClient.connect` / `disconnect`). The `App.svelte` change is the only WS-related edit.
+- **DO NOT** add a `beforeunload` handler or any "I'm leaving" signal. The whole point of Behavior 1 is that the agent keeps running silently.
+- **DO NOT** change `tabs.svelte.ts:openAgentTab` even if it looks similar to `hydrateFromBackend`. They serve different purposes (single-tab opening vs full restore).
+- **DO NOT** add localStorage persistence of tab ids. The backend DB is the source of truth.
+
+---
+
+## 7. Phase 2 — Integration check (main agent)
+
+After both flash agents report success, the main agent runs:
+
+```sh
+# from /home/tradam/projects/dispatch
+bun run --cwd packages/core typecheck
+bun run --cwd packages/core test
+bun run --cwd packages/api typecheck
+bun run --cwd packages/api test
+bun run --cwd packages/frontend typecheck
+bun run --cwd packages/frontend test
+bun run check # root-level biome
+```
+
+All seven commands must be green. If any fail, classify the failure:
+- **Type drift across the segment boundary** (e.g. the shape of `TabStatusSnapshot` differs between the core type and its frontend mirror) → fix in Phase 4 with a targeted edit.
+- **Test failure inside a single segment** → flash agent missed a case; re-launch that segment's flash agent with the failure log.
+- **Test failure crossing segments** → main agent fixes directly.
+
+Also do a manual smoke read:
+- Open `packages/api/src/agent-manager.ts:714` and confirm `getAllStatuses` returns `Record<string, TabStatusSnapshot>`.
+- Open `packages/frontend/src/lib/tabs.svelte.ts` and confirm `hydrateFromBackend` exists, is exported, and is called from `App.svelte:onMount`.
+- Open `packages/frontend/src/App.svelte` and confirm the onMount await-and-fallback pattern.
+
+---
+
+## 8. Phase 3 — Review
+
+### 8.1 Launch Gemini (read-only)
+
+Spawn the `gemini` subagent with this prompt:
+
+> Review the implementation of the "background agents + layout restore" feature against the spec in `plan-bg-restore.md`. Read every file referenced in Phase 1 (Segment A files and Segment B files). Verify:
+>
+> 1. `getAllStatuses` in `packages/api/src/agent-manager.ts` matches Task A.2 exactly — including the defensive shallow copy of `currentChunks` and the conditional omission for non-running tabs.
+> 2. The frontend `TabStatusSnapshot` mirror in `packages/frontend/src/lib/types.ts` is structurally identical to the core type.
+> 3. `hydrateFromBackend` in `packages/frontend/src/lib/tabs.svelte.ts` correctly handles every failure mode listed in Task B.2 (no fatal throws, partial-failure tolerance, idempotency guard).
+> 4. The WS `statuses` handler in `tabs.svelte.ts` correctly seeds in-flight chunks for running tabs and clears streaming pointers for idle/error tabs, without regressing the existing desync recovery path (`reloadTabMessagesFromApi` should still fire when frontend thinks running but backend says idle).
+> 5. `App.svelte:onMount` calls `hydrateFromBackend` before falling back to `createNewTab`, AND only creates a fresh tab when `restored === 0 && tabStore.tabs.length === 0`.
+> 6. Tests cover: empty DB; partial failure of `/tabs/:id/messages`; in-flight snapshot seeding; idempotency when tabs already exist; clearing in-flight pointers on idle snapshot.
+> 7. Behavior 1 (browser close ≠ agent stop) is preserved — verify no new `beforeunload`, `unload`, or WS-close-triggered `stopTab` calls were introduced.
+> 8. Behavior 3 (X-close cancels + archives) is preserved — verify `closeTab` in `tabs.svelte.ts` still calls `DELETE /tabs/:id` and that the backend `DELETE /tabs/:id` route still calls both `agentManager.deleteTab` and `archiveTab`.
+>
+> Write your findings to `report.md` in the repo root. DO NOT modify any source file. Categorize each finding as Block / Ship-with-followup / Nit. Quote `file_path:line_number` for every finding.
+
+### 8.2 Self-review (in parallel with gemini)
+
+While gemini runs, the main agent independently reviews the same files
+against the spec. Note any issues to compare against gemini's output.
+
+### 8.3 Wait for gemini
+
+**Hard requirement**: do NOT apply any fixes until gemini's report.md
+exists and has been read. The user explicitly asked for this gate.
+
+---
+
+## 9. Phase 4 — Apply fixes
+
+### 9.1 Combine findings
+
+Merge self-review + gemini findings into a single deduplicated list,
+classified Block / Ship-with-followup / Nit. Resolve disagreements by
+re-reading the relevant code; default to following the spec.
+
+### 9.2 Apply
+
+- **Block** issues: fix immediately.
+- **Ship-with-followup**: fix if trivial; otherwise file in `wishlist.md` and proceed.
+- **Nit**: fix in batch with main agent edits; do not spawn flash agents for these.
+
+If a Block fix touches multiple files in disjoint segments, the main
+agent may spawn flash agents again (with a fresh, narrow prompt).
+Otherwise the main agent does the edits itself.
+
+### 9.3 Re-verify
+
+Run the Phase 2 verification commands again. All must be green.
+
+### 9.4 Await user direction
+
+Do NOT commit. Do NOT push. Surface the diff stat and a summary of what
+changed. Wait for the user to say "commit and push" or equivalent.
+
+---
+
+## 10. Risks & mitigations
+
+| Risk | Mitigation |
+|---|---|
+| `currentChunks` race between read and serialize | Single-threaded JS; the snapshot is built synchronously in `getAllStatuses`. Defensive copy further isolates the consumer. |
+| WS `statuses` event arrives during `hydrateFromBackend` and double-applies state | `hydrateFromBackend` bails at entry when `tabs.length > 0`, and a `statuses` event that lands mid-hydration iterates a still-empty `tabs` array, so it changes nothing. Once tabs exist, the WS `statuses` handler is per-tab idempotent (it uses the snapshot as authoritative state and doesn't append duplicates). |
+| Hydration restores an archived tab if `is_open` was flipped after a crash | `listOpenTabs` filters on `is_open = 1`. No path in this plan touches that filter. |
+| User has 50 tabs and hydration is slow | `messageFetches` runs in `Promise.all`; bottleneck is total DB row count, not request count. Acceptable for v1. If it becomes a problem, add `?limit=N` to `/tabs/:id/messages`. |
+| Subagent (`parentTabId != null`) tabs restore with `persistent: true` and never auto-clean | The existing `persistent: true` semantics in `createNewTab` already apply to user tabs; subagent tabs created via WS `tab-created` keep their non-persistent flag. Hydration restores subagent tabs as persistent, on the rationale that a tab that survived a reload is one the user clearly cares about. Acceptable. |
+| The `currentAssistantId` from the snapshot doesn't match any DB message id | `hydrateFromBackend` and the `statuses` handler both append a new in-flight message in this case. Subsequent `done` will close it correctly. |
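+
+The idempotency and "no matching DB row" mitigations can be illustrated with a pure function. This is a sketch under simplified, assumed types (`seedInFlight`, `Message`, and `Snapshot` are illustrative names, not the store's actual API): applying the same running snapshot twice yields the same message list, with no duplicate in-flight entry.
+
+```ts
+// Simplified, hypothetical shapes; the real types live in types.ts.
+type Chunk = { type: string; text?: string };
+type Message = {
+  id: string;
+  role: "user" | "assistant";
+  chunks: Chunk[];
+  isStreaming: boolean;
+};
+type Snapshot = {
+  status: "idle" | "running" | "error";
+  currentChunks?: Chunk[];
+  currentAssistantId?: string;
+};
+
+// Merge a running snapshot into a message list. If a message with the
+// snapshot's id exists, its chunks are replaced; otherwise a new
+// in-flight assistant message is appended. Non-running snapshots and
+// snapshots without an id leave the list untouched.
+function seedInFlight(msgs: Message[], snap: Snapshot): Message[] {
+  if (snap.status !== "running" || !snap.currentAssistantId) return msgs;
+  const targetId = snap.currentAssistantId;
+  const chunks = snap.currentChunks ? [...snap.currentChunks] : undefined;
+  const idx = msgs.findIndex((m) => m.id === targetId);
+  if (idx >= 0) {
+    return msgs.map((m, i) =>
+      i === idx ? { ...m, chunks: chunks ?? m.chunks, isStreaming: true } : m,
+    );
+  }
+  return [
+    ...msgs,
+    { id: targetId, role: "assistant", chunks: chunks ?? [], isStreaming: true },
+  ];
+}
+```
+
+Because the second application finds the message by id and overwrites its chunks, a `statuses` event that repeats what hydration already applied leaves the list unchanged.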
+
+---
+
+## 11. Acceptance checklist
+
+Before declaring done, the main agent verifies:
+
+- [ ] Open a tab, send a message that triggers a long response, close the browser DURING the response, reopen the browser: the message completes in the restored tab (visible in chat history).
+- [ ] Open a tab, send a message that triggers a long response, close the browser DURING the response, immediately reopen: the response is streaming live (the user sees deltas appearing).
+- [ ] Open three tabs, close the browser, reopen: all three tabs appear in the same order, with their messages.
+- [ ] Open three tabs, click the X on the middle one, close the browser, reopen: two tabs remain (not three) — the closed one was archived.
+- [ ] Open a tab, send a message, click the X on the tab DURING the response, reopen the browser: the tab is gone AND the agent didn't keep running (verifiable by checking that the tab's last assistant message has a `cancelled` system chunk in the DB, and that no subsequent assistant content was appended).
+- [ ] First-ever browser load with no tabs in the DB: a fresh empty tab is created (the existing default UX).
diff --git a/notes/plan-chunk-eviction.md b/notes/plan-chunk-eviction.md
new file mode 100644
index 0000000..74f850d
--- /dev/null
+++ b/notes/plan-chunk-eviction.md
@@ -0,0 +1,251 @@
+# Plan: Chunk-Native Frontend (per-chunk eviction + pagination)
+
+Closes `eviction-limitation.md`. Kills the frontend `ChatMessage[]` "message"
+structure entirely. The frontend's state becomes a **flat chunk log**; memory is
+bounded by a **rolling per-chunk eviction**; history is **paginated as raw
+chunks** from the backend.
+
+## Locked decisions (from the design discussion)
+
+1. **No stored "message" structure on the frontend.** The store holds a flat
+ chunk list. Any grouping into bubbles is a *render-time* projection only
+ (see "Rendering" — the one item still to confirm).
+2. **Rolling per-chunk eviction.** When a chunk completes while streaming and the
+ in-memory chunk count is at `chunkLimit`, drop the oldest chunk. Pure
+ count-based window. The only chunk never evicted is the single one currently
+ receiving deltas (the open chunk). No turn pinning, no "keep last pair."
+3. **Per-turn persistence stays** (backend unchanged in timing). It is the more
+ performant choice for a low-RAM/low-IO backend: peak RAM is dominated by the
+ agent's in-memory conversation context (required for the cache prefix) which
+ both strategies share, while per-seal would multiply fsyncs across the
+ streaming loop. One cheap win: wrap the end-of-turn write in a single SQLite
+ transaction.
+4. **Mid-stream reload is deferred, not engineered away.** A still-streaming turn
+ isn't in the DB until it seals. If the user scrolls up into chunks that were
+ roll-evicted from the *current* turn, the refetch waits until the turn's write
+ lands (the running→idle signal). Accepted.
+
+## The dominating constraint (unchanged)
+
+**Cache cohesion is built 100% server-side and must not be touched.** The
+Anthropic wire prefix comes from `toModelMessages` + `applyAnthropicCaching` in
+`packages/core/src/agent/agent.ts`, built from the agent's in-memory history
+(rebuilt from the chunk log) + the live turn — never from anything the frontend
+holds or the WS sends. So **everything here is frontend + one additive raw API
+endpoint**; it cannot regress the cache. Out of scope, do not edit: `agent.ts`
+wire/folding/caching/interrupt code, `explodeTurn` row ordering, and the
+backend's use of `groupRowsToMessages` for *agent history rebuild*.
+
+## Already done (reuse, don't rebuild)
+
+- `getChunksForTab(tabId,{limit,before})` (`db/chunks.ts`) already paginates the
+ flat log by `seq`.
+- `explodeTurn` and `groupRowsToMessages` (`chunks/transform.ts`) are DB-free,
+ browser-importable, and orphan-tolerant (a turn split across a page boundary
+ groups correctly). `explodeTurn` emits a **role per draft** — we use that.
+- `appendEventToChunks` (`chunks/append.ts`) folds a delta stream into render
+ `Chunk[]` correctly (text/thinking/metadata/tool-result/shell-output ordering).
+- Chunk table schema already has `seq, turnId, step, role, type, data`.
+ **No DB migration, no chat-history reset needed.**
+
+---
+
+## Frontend architecture (chunk-native)
+
+### Store state (per tab) — `tabs.svelte.ts`
+
+```
+chunks : ChunkRow[] // SEALED history, real per-tab seq, oldest→newest.
+ // The growing / evictable / paginated structure.
+liveRender : Chunk[] // transient fold buffer for the CURRENT turn only,
+ // built by the EXISTING appendEventToChunks.
+liveTurnId : string | null // turn_id of the in-flight turn (for stable keys).
+liveAssistantId: string | null // == currentAssistantId today.
+oldestLoadedSeq: number | null // min seq in `chunks` (pagination cursor).
+totalChunks : number // backend total (drives "more to load?").
+// DELETED: messages, currentAssistantId(renamed→liveAssistantId), the turnId-merge state.
+// Untouched: queuedMessages, cacheStats, agentStatus, model/agent fields, etc.
+```
+
+There is **no `messages` field.** The live turn is not a "message" — it's a
+one-turn streaming buffer that is reconciled into `chunks` (as real-seq rows) the
+moment the turn seals.
+
+### Unified flat view (derived, for render + eviction count)
+
+```
+liveFlat = liveTurnId ? explodeTurn(liveTurnId, liveRender)
+ .map((d,i) => ({...d, id:`${liveTurnId}:${i}`, seq:null, live:true}))
+ : []
+log = [...chunks, ...liveFlat] // one flat, chunk-native list
+```
+
+`explodeTurn` is the *same* transform the backend persists with, so the live flat
+chunks are byte-for-byte what will land in the DB → reconciliation at seal is
+seamless and symmetric. `log` is what we render and count for eviction.
+
+### Streaming (reuse, no new reducer)
+
+- Content events (`text-delta`/`reasoning-*`/`tool-call`/`tool-result`/
+ `shell-output`/`error`/`notice`/`model-changed`/`config-reload`) fold into
+ `liveRender` via the existing `appendEventToChunks` + `$state.snapshot` clone
+ (the documented Svelte-5 proxy hazard — keep it).
+- `turn-start` (new tiny WS event, see backend) sets `liveTurnId` /
+ `liveAssistantId` so live-flat ids match the sealed rows' turn → no remount at
+ seal. Without it, a one-frame remount at seal with identical content; harmless.
+- Optimistic user message: on send, append a provisional user row to `chunks`
+ with `seq:null, live:true` (or carry it in a tiny `pendingUser` slot). It is
+ replaced by the real user row at reconcile.
+
+### Rendering — `ChatPanel.svelte` / `ChatMessage.svelte`
+
+```
+renderGroups = groupRowsToMessages(log) // ephemeral, render-time ONLY
+{#each renderGroups as g (g.id)} <ChatMessage .../> {/each}
+```
+
+Grouping pairs `tool_call`+`tool_result` by `callId` and wraps a turn's
+assistant chunks into a bubble — **purely a view projection of the flat log**,
+never stored, never used for eviction/pagination. `ChatMessage.svelte` is largely
+unchanged (it already renders a grouped turn's `Chunk[]`).
+
+> **DECISION TO CONFIRM (rendering):** default = keep the current bubble look via
+> this render-time grouping (chunk-native state, same UX). Alternative = render
+> each chunk as a standalone element (fully flat, no assistant-bubble wrapper).
+> I'm proceeding with the default; say the word to go fully flat (smaller view
+> change, different look).
+
+### Eviction (`evictChunks`, replaces `evictMessages`)
+
+```
+limit = appSettings.chunkLimit; if !finite or ≤0 → return
+if scrolledUp and !force → return // unchanged suppression
+total = chunks.length + liveFlat.length
+while total > limit:
+ if chunks.length > 0 and chunks[0] is not the open chunk:
+ drop chunks[0]; total-- // evict sealed front first
+ else if liveRender has a non-open leading chunk:
+ drop oldest liveRender chunk; recompute liveFlat; total = chunks.length+liveFlat.length
+ else break // only the open chunk remains
+oldestLoadedSeq = min seq remaining in chunks (or unchanged if chunks emptied)
+```
+
+This trims a huge **old** turn chunk-by-chunk (the fix) and even trims within a
+huge **live** turn (bounding memory mid-stream), never evicting the open chunk.
+Triggered on each content event (chunk completion) and after load/reconcile.
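Annotation: a minimal TypeScript sketch of the rolling window described above, under simplifying assumptions. The live turn is treated as an already-flat list whose open chunk is flagged `open`, the scrolled-up suppression is left out, and `SealedRow`, `LiveChunk`, `EvictResult`, and `evictChunksSketch` are hypothetical stand-ins rather than the store's real types or function.

```typescript
// Hypothetical minimal shapes; the real ChunkRow / Chunk types carry more fields.
interface SealedRow { seq: number; text: string }
interface LiveChunk { text: string; open: boolean }

interface EvictResult {
  chunks: SealedRow[];
  live: LiveChunk[];
  oldestLoadedSeq: number | null;
}

function evictChunksSketch(
  chunks: SealedRow[],
  live: LiveChunk[],
  limit: number,
  prevOldestSeq: number | null,
): EvictResult {
  // A non-finite or non-positive limit disables eviction.
  if (!Number.isFinite(limit) || limit <= 0) {
    return { chunks, live, oldestLoadedSeq: prevOldestSeq };
  }
  const sealed = [...chunks]; // copies, so the inputs are not mutated
  const liveOut = [...live];
  while (sealed.length + liveOut.length > limit) {
    const liveHead = liveOut[0];
    if (sealed.length > 0) {
      sealed.shift(); // evict the sealed front first
    } else if (liveHead !== undefined && !liveHead.open) {
      liveOut.shift(); // then trim completed chunks of the live turn
    } else {
      break; // only the open chunk remains, and it is never evicted
    }
  }
  const first = sealed[0];
  return {
    chunks: sealed,
    live: liveOut,
    // Cursor stays unchanged when the sealed list was emptied.
    oldestLoadedSeq: first !== undefined ? first.seq : prevOldestSeq,
  };
}
```

The function returns new arrays instead of mutating store state, which is a choice made here only to keep the sketch easy to test; the plan itself describes in-place trimming.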
+
+### Pagination (`loadOlderChunks`, replaces `loadMoreMessages`)
+
+```
+if oldestLoadedSeq == null or ≤ 0 → nothing older
+GET /tabs/:id/chunks?limit=50&before=oldestLoadedSeq
+merge rows into `chunks`, DEDUPE BY seq, keep seq-sorted
+oldestLoadedSeq = min seq held; totalChunks = resp.total
+```
+
+The old `turnId`-merge hack (`tabs.svelte.ts:440-462`) is **deleted** — raw rows
+in one seq-sorted array make `groupRowsToMessages` rejoin a boundary-split turn
+automatically. Overlap is fine; dedupe-by-seq is trivial.
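Annotation: the dedupe-by-seq merge can be sketched in a few lines of TypeScript. `Row`, `mergeBySeq`, and `oldestSeq` are illustrative names that do not appear in the plan, and the row shape is reduced to the two fields the merge needs.

```typescript
// Hypothetical reduced row; the real ChunkRow also has step, role, type, data.
interface Row { seq: number; turnId: string }

// Merge a fetched page into the held window, dropping duplicate seqs and
// returning the result sorted oldest to newest.
function mergeBySeq(held: Row[], fetched: Row[]): Row[] {
  const bySeq = new Map<number, Row>();
  for (const row of held) bySeq.set(row.seq, row);
  for (const row of fetched) bySeq.set(row.seq, row); // fetched copy wins on overlap
  return [...bySeq.values()].sort((a, b) => a.seq - b.seq);
}

// Pagination cursor: the smallest seq currently held, or null when empty.
function oldestSeq(rows: Row[]): number | null {
  const first = rows[0];
  return first !== undefined ? first.seq : null;
}
```

Because the merged array is plain seq-sorted rows, a turn split across two pages ends up contiguous again without any turn-level bookkeeping.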
+
+### Turn-completion reconcile (the one new flow) — on running→idle for the tab
+
+```
+if scrolledUp: defer the whole reconcile until the user returns to bottom
+refetch GET /tabs/:id/chunks?limit=chunkLimit   // tail, real seqs (write already landed)
+drop all live entries (liveRender=[], liveTurnId=null) and provisional user row
+merge refetched rows into `chunks`, dedupe by seq, seq-sort
+oldestLoadedSeq = min seq; evictChunks()
+```
+
+This single mechanism (a) gives the just-completed turn **real seqs** (so the
+log never accumulates seq-less middles that break pagination), and (b) **recovers
+any live chunks roll-evicted mid-stream** — they come back from the DB now that
+the write landed. Exactly the "delay the reload until the write lands" model.
+Keying render groups by `turnId` makes the swap from live→sealed flicker-free.
+
+---
+
+## Backend changes (additive, cache-neutral, low-cost)
+
+1. **`routes/tabs.ts`** — add `GET /tabs/:id/chunks?limit&before` →
+ `{ chunks: ChunkRow[], total, oldestSeq }` (raw rows; no grouping). Leave
+ `/messages` as-is (unused by the new frontend; harmless).
+2. **`db/chunks.ts`** — wrap the `appendChunks` insert loop in a single
+ `db.transaction(...)` (one fsync per turn instead of N). Pure perf; no behavior
+ change.
+3. **`types/index.ts` + `agent-manager.ts`** — add a tiny
+ `{ type:"turn-start"; turnId; assistantId }` event, emitted at turn start
+ (`agent-manager.ts:1120-1122`). Optional but recommended (stable keys → no
+ seal remount). `appendEventToChunks` lists it in its no-op branch for
+ exhaustiveness.
+4. **WS:** nothing — `index.ts:25-26` forwards all emitted events to the browser.
+
+No change to persistence *timing*, `explodeTurn`, the agent wire, or the caching
+path. (Optional future optimization, not in this plan: a `chunk-rows` push at
+seal to avoid the per-turn reconcile GET. The reconcile-refetch is simpler and
+matches the agreed "reload" model, so we ship that.)
+
+---
+
+## Test plan
+
+Rewrite frontend tests from `tab.messages` assertions to **chunk-log** assertions
+(the store no longer has `messages`). Cover:
+
+- **Rolling eviction (headline):** stream a turn of `chunkLimit + N` chunks;
+ assert in-memory `log` length stays `≤ chunkLimit` (open chunk exempt) and the
+ oldest entries are dropped — including dropping the *leading chunks of a single
+ giant turn* (the exact case the limitation was about).
+- **Pagination + dedupe:** `loadOlderChunks` prepends older rows, dedupes by seq,
+ updates `oldestLoadedSeq`; a turn split at the window boundary renders as one
+ group (no merge hack, no duplication).
+- **Turn-completion reconcile:** stream a turn (live, seq=null) → running→idle →
+ assert live entries are replaced by real-seq rows from the refetch, `log`
+ content identical, render-group identity stable (no remount), and a mid-stream-
+ evicted chunk is recovered.
+- **Deferred reload while scrolled up:** reconcile is deferred until return-to-
+ bottom; nothing yanks the viewport.
+- **Streaming fidelity:** the existing delta→chunk behaviors (coalescing,
+ thinking metadata seal, tool-result fill, shell-output) still hold on
+ `liveRender` (these reuse `appendEventToChunks`, so port the assertions to read
+ the live flat view).
+- **Cache untouched:** the 3-step byte-identical cache-stability test in
+ `packages/core/tests/agent/agent.test.ts` passes unchanged; `routes.test.ts`
+ gains `/chunks` coverage.
+
+## Phasing (each phase: `bun run check` + `bun run test` green)
+
+- **P1 — Backend additive:** `GET /chunks` raw endpoint; wrap `appendChunks` in a
+ transaction; add `turn-start`. Tests for endpoint + event. (Cache test must
+ still pass byte-for-byte.)
+- **P2 — Frontend store core:** introduce `chunks`/`liveRender`/`liveTurnId`;
+ retarget every streaming handler from `messages` to `liveRender`; build the
+ derived `log` + render groups; delete the `messages` field. Port streaming
+ tests.
+- **P3 — Eviction + pagination + reconcile:** `evictChunks`, `loadOlderChunks`
+ (delete turnId-merge), running→idle reconcile-refetch, load paths
+ (`hydrateFromBackend`/`openAgentTab`/`reloadTabMessagesFromApi`) switched to
+ `GET /chunks`. New tests.
+- **P4 — Cleanup:** delete dead code (`oldestSeqOf` if unused, old eviction
+ comment, `loadMoreMessages` name), update `eviction-limitation.md` →
+ resolved, full `bun run test` + `bun run check`.
+
+## Risks / open items
+
+- **Rendering decision** (above) — the one thing to confirm; proceeding with
+ render-time grouping (preserves UX).
+- **Interrupt turns at reconcile.** The live UI shows an interrupt as a split
+ (`message-consumed` handler); the persisted turn keeps the interrupt inside a
+ tool result (cache-safe, prior decision). So at reconcile the bubble re-renders
+ to the persisted shape. Cosmetic; documented; not fixed (fixing touches the
+ wire/cache).
+- **shell-output live placement.** `appendEventToChunks` attaches live shell
+ output to the running tool call; `explodeTurn` places it on the `tool_result`
+ row. The live flat view and the post-reconcile rows therefore differ only in
+ where shellOutput hangs until seal — render reads it from the paired call/result
+ either way. Minor; assert in a test.
+- **Scrolled-up during reconcile** — deferred (handled above); verify no viewport
+ jump using ChatPanel's existing scroll-anchor logic.
+- **Renames ripple:** `totalMessages→totalChunks`, `currentAssistantId→
+ liveAssistantId`, `evictMessages→evictChunks`, `loadMoreMessages→
+ loadOlderChunks` touch a few call sites + tests.
diff --git a/notes/plan-chunk-log.md b/notes/plan-chunk-log.md
new file mode 100644
index 0000000..3b39303
--- /dev/null
+++ b/notes/plan-chunk-log.md
@@ -0,0 +1,271 @@
+# Plan: Append-Only Chunk Log (supersedes plan-chunk-refactor.md)
+
+## Goal
+
+Replace the `messages(role, content_json=Chunk[])` "turn-as-container" model with a
+**flat, append-only chunk log**. Each chunk is its own row, ordered by a per-tab
+monotonic `seq`. "Message" and "turn" become *derived groupings*, not stored
+containers.
+
+This single change fixes two problems at once:
+
+1. **Prompt-cache churn** (see `cache-miss-report.md`). A multi-step turn is
+ currently one growing assistant message that `toModelMessages` re-buckets every
+ step, reshuffling earlier `tool_use`/`tool_result` positions and busting the
+ Anthropic cache prefix. An immutable, append-only log folded **per step** makes
+ the prefix byte-stable across requests → rolling cache accumulates.
+2. **Frontend memory** on constrained devices. Today the smallest
+ loadable/evictable unit is a whole turn (`tabs.svelte.ts:360-372` admits it).
+ With chunks as the unit, the frontend pages and evicts at **chunk granularity**.
+
+Beta software — **no backward compatibility**. On first boot of the new build:
+drop `messages`, clear `tabs`. Preserve `settings`, `credentials`, `api_keys`,
+`usage_cache`, `wake_schedule`.
+
+## Decisions (locked)
+
+- **Option A**: turn is a `turn_id` column on chunks; there is **no `turns` table**
+ (A→C upgrade later is additive if a turn-level UI is ever needed).
+- **Separate rows** for `tool_call` (role `assistant`) and `tool_result` (role
+ `tool`), linked by `call_id`.
+- **Write-on-seal**: a chunk is persisted when it *seals* (a block completes), not
+ per delta. The single currently-open block lives only in memory until it seals;
+ a mid-turn crash loses at most that one open block. `sealed` is therefore an
+ in-memory concept only — **not** a DB column.
+- Clear chunks **and** tabs on migration.
+
+---
+
+## 1. Schema
+
+```sql
+CREATE TABLE chunks (
+ id TEXT PRIMARY KEY, -- uuid; streaming deltas target this id
+ tab_id TEXT NOT NULL,
+ seq INTEGER NOT NULL, -- per-tab monotonic; ordering + pagination cursor
+ turn_id TEXT NOT NULL, -- one run(): the user msg + its full assistant response
+ step INTEGER NOT NULL DEFAULT 0, -- LLM round-trip index within the turn (user/system = 0)
+ role TEXT NOT NULL, -- 'user' | 'assistant' | 'tool' | 'system'
+ type TEXT NOT NULL, -- 'text'|'thinking'|'tool_call'|'tool_result'|'error'|'system'
+ data_json TEXT NOT NULL, -- type-specific payload (below)
+ created_at INTEGER NOT NULL
+);
+CREATE INDEX idx_chunks_tab_seq ON chunks(tab_id, seq);
+```
+
+`seq` is allocated as `MAX(seq)+1 WHERE tab_id=?` (mirrors current `appendMessage`),
+or from an in-memory per-tab counter for the live turn.
+
+### `data_json` payloads by `type`
+
+| type | role | payload |
+|---------------|-------------|---------|
+| `text` | user/assistant | `{ text }` |
+| `thinking` | assistant | `{ text, metadata? }` (Anthropic signature blob) |
+| `tool_call` | assistant | `{ callId, name, arguments }` |
+| `tool_result` | tool | `{ callId, name, result, isError, shellOutput? }` |
+| `error` | assistant/system | `{ message, statusCode? }` |
+| `system` | system | `{ kind, text }` |
+
+### Migration (one-shot, in `getDatabase()` bootstrap)
+
+- `DROP TABLE IF EXISTS messages;`
+- `DELETE FROM tabs;` (clears tab shells too, per decision)
+- `CREATE TABLE chunks …` if not exists.
+- Everything else untouched.
+
+---
+
+## 2. Core types (`packages/core/src/types/index.ts`)
+
+Introduce the persisted row + payload unions; retire `ChatMessage`/`Chunk`-as-container.
+
+```ts
+export type ChunkRole = "user" | "assistant" | "tool" | "system";
+export type ChunkType =
+ | "text" | "thinking" | "tool_call" | "tool_result" | "error" | "system";
+
+export interface LogChunk {
+ id: string;
+ tabId: string;
+ seq: number;
+ turnId: string;
+ step: number;
+ role: ChunkRole;
+ type: ChunkType;
+ data: ChunkData; // discriminated by `type`
+ createdAt: number;
+}
+// ChunkData = TextData | ThinkingData | ToolCallData | ToolResultData | ErrorData | SystemData
+```
+
+`ToolCall`/`ToolResult`/`TaskItem`/`AgentConfig` stay. `ChatMessage`,
+`ToolBatchChunk`, `ToolBatchEntry`, `TabStatusSnapshot.currentChunks` change or go.
+
+---
+
+## 3. The wire builder — `chunksToModelMessages` (the caching fix)
+
+New pure function (replaces `toModelMessages` in `agent.ts:104`). Input: the tab's
+`LogChunk[]` in `seq` order. Output: `ModelMessage[]`.
+
+**Folding rules** (group contiguous chunks by `turn_id` + `role-bucket` + `step`):
+
+- `user/text` → `{ role:"user", content }` (concatenate consecutive user text).
+- For each **step** of a turn's assistant output:
+ - `assistant` `text`/`thinking`/`tool_call` chunks of that step → **one**
+ `{ role:"assistant", content:[parts in seq order] }`. Natural order is
+ thinking → text → tool_call, so tool-calls are last.
+ - `tool` `tool_result` chunks of that step → **one** `{ role:"tool", content:[…] }`
+ immediately after.
+- `system`/`error` chunks → skipped (display-only), exactly as today.
+
+**Why this is cache-stable:** every prior step's `[assistant, tool]` pair is built
+from immutable chunks at fixed `seq`s, so it serializes byte-identically on every
+later request. Step N+1 only appends new messages at the tail. The Pass-3 split
+never triggers (no later-step text after a tool-call inside one message). Result:
+Anthropic matches the full prefix up to the current step; `applyAnthropicCaching`
+(unchanged, `agent.ts:448`) marks the current `[assistant, tool]` and the cached
+prefix grows monotonically.
+
+**Regression test (must-have):** a 3-step sequential turn → assert the step-0 and
+step-1 ModelMessages are byte-identical between the step-2 and step-3 builds.
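Annotation: a rough TypeScript sketch of the per-step folding and of why it keeps earlier steps stable. It is not the real `chunksToModelMessages`: the output is a simplified `FoldedMessage` instead of the AI SDK message type, and `FoldChunk`, `FoldedMessage`, and `foldChunksSketch` are assumed names.

```typescript
// Simplified stand-ins for LogChunk and the model message type.
type Role = "user" | "assistant" | "tool" | "system";
type Kind = "text" | "thinking" | "tool_call" | "tool_result" | "error" | "system";
interface FoldChunk {
  seq: number; turnId: string; step: number; role: Role; type: Kind; data: unknown;
}
interface FoldedMessage {
  role: "user" | "assistant" | "tool";
  parts: Array<{ type: Kind; data: unknown }>;
}

// Fold a seq-ordered log into one message per (turn, step, role) run.
function foldChunksSketch(log: FoldChunk[]): FoldedMessage[] {
  const out: FoldedMessage[] = [];
  let lastKey = "";
  for (const c of log) {
    const role = c.role;
    // system/error chunks are display-only and never reach the wire.
    if (role === "system" || c.type === "system" || c.type === "error") continue;
    const key = `${c.turnId}:${c.step}:${role}`;
    const tail = out[out.length - 1];
    if (tail !== undefined && key === lastKey) {
      tail.parts.push({ type: c.type, data: c.data }); // same run: extend the message
    } else {
      out.push({ role, parts: [{ type: c.type, data: c.data }] });
      lastKey = key;
    }
  }
  return out;
}
```

Since each message depends only on the chunks of its own run, folding a longer log leaves every earlier message untouched, which is the property the regression test checks by comparing serialized prefixes.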
+
+### Structural normalizations (carry over + one fix)
+
+Keep empty-text drop, `toolCallId` scrub, Pass-3 (now a no-op safety net). **Fix
+the report's divergence:** retain a `reasoning` part when it has a signature even if
+its text is empty (mirror opencode `transform.ts:144-150`); today `agent.ts:356-360`
+drops it.
+
+---
+
+## 4. Agent loop (`packages/core/src/agent/agent.ts`)
+
+- Drop the single accumulating `assistantTurnMessage` + shared `chunks` array
+ (`agent.ts:786, 923-926`). Maintain a flat `log: LogChunk[]` (or emit via
+ callback).
+- Allocate `turnId` per `run()`. Write the user `text` chunk (`step:0`) at start.
+- The step loop tags every appended chunk with `turnId` + current `step` index.
+- Stream → log reduction (the new shared reducer, §5) emits `chunk-open` /
+ `chunk-delta` / `chunk-seal`. On seal, persist (§6) and yield the WS event.
+- Build requests with `chunksToModelMessages(log)` instead of `toModelMessages`.
+
+### Interrupt handling change (bonus cache fix)
+
+Today queued user messages are spliced **into** a tool result and later stripped
+(`stripUserInterruptBlock`, `agent.ts:73`) — a history mutation that busts the cache.
+New model: append the interrupt as its own `user/text` chunk **after** the step's
+`tool_result` chunks. Append-only, immutable, cache-stable. `stripUserInterruptBlock`
+and the "freshest tool batch" logic are deleted.
+
+---
+
+## 5. Stream→chunk reducer (`packages/core/src/chunks/append.ts` rewrite)
+
+`appendEventToChunks` (mutates a `Chunk[]`) → a reducer that, given the current open
+chunk + an `AgentEvent`, returns chunk-lifecycle ops:
+
+- `text-delta` → open `text` if none open, else append; (thinking/tool start seals it).
+- `reasoning-delta` → open/append `thinking`; `reasoning-end` → attach `metadata` + seal.
+- `tool-call` → emit a sealed `tool_call` chunk immediately (no streaming body).
+- `tool-result` → emit a sealed `tool_result` chunk.
+- `error`/system notices → sealed single chunks.
+
+Shared by backend (persist + WS) and frontend (apply WS), keeping wire format in
+lockstep (the current symmetry contract).
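Annotation: one possible shape for this reducer, sketched in TypeScript. The event field names (`delta`, `metadata`, `callId`, and so on), the `Op` union, and the injected `newId` generator are assumptions made for illustration; the plan only fixes the op names `chunk-open`, `chunk-delta`, and `chunk-seal`.

```typescript
interface OpenChunk { id: string; type: "text" | "thinking"; text: string }
type AgentEventSketch =
  | { type: "text-delta"; delta: string }
  | { type: "reasoning-delta"; delta: string }
  | { type: "reasoning-end"; metadata?: unknown }
  | { type: "tool-call"; callId: string; name: string; arguments: unknown }
  | { type: "tool-result"; callId: string; name: string; result: unknown; isError: boolean };
type Op =
  | { op: "chunk-open"; id: string; type: string }
  | { op: "chunk-delta"; id: string; text: string }
  | { op: "chunk-seal"; id: string; data: unknown };
interface Step { open: OpenChunk | null; ops: Op[] }

// Seal whatever block is currently open (if any).
function sealOps(open: OpenChunk | null): Op[] {
  return open ? [{ op: "chunk-seal", id: open.id, data: { text: open.text } }] : [];
}

// Append to a matching open block, or seal the old one and open a new one.
function appendDelta(
  open: OpenChunk | null, kind: "text" | "thinking", delta: string, newId: () => string,
): Step {
  if (open && open.type === kind) {
    return {
      open: { ...open, text: open.text + delta },
      ops: [{ op: "chunk-delta", id: open.id, text: delta }],
    };
  }
  const id = newId();
  return {
    open: { id, type: kind, text: delta },
    ops: [...sealOps(open), { op: "chunk-open", id, type: kind }, { op: "chunk-delta", id, text: delta }],
  };
}

// Emit a chunk that is opened and sealed in one go (no streaming body).
function sealedChunk(open: OpenChunk | null, type: string, data: unknown, newId: () => string): Step {
  const id = newId();
  return { open: null, ops: [...sealOps(open), { op: "chunk-open", id, type }, { op: "chunk-seal", id, data }] };
}

function reduceEvent(open: OpenChunk | null, ev: AgentEventSketch, newId: () => string): Step {
  switch (ev.type) {
    case "text-delta":
      return appendDelta(open, "text", ev.delta, newId);
    case "reasoning-delta":
      return appendDelta(open, "thinking", ev.delta, newId);
    case "reasoning-end":
      if (open && open.type === "thinking") {
        const data = { text: open.text, metadata: ev.metadata };
        return { open: null, ops: [{ op: "chunk-seal", id: open.id, data }] };
      }
      return { open, ops: [] };
    case "tool-call":
      return sealedChunk(open, "tool_call", { callId: ev.callId, name: ev.name, arguments: ev.arguments }, newId);
    case "tool-result":
      return sealedChunk(open, "tool_result", { callId: ev.callId, name: ev.name, result: ev.result, isError: ev.isError }, newId);
  }
}
```

Keeping the reducer pure (state in, state plus ops out) is what would let the backend and the frontend run the same code, as the symmetry contract above requires.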
+
+---
+
+## 6. Persistence (`db/messages.ts` → `db/chunks.ts`)
+
+```ts
+appendChunk(c: Omit<LogChunk,"seq"|"createdAt">): LogChunk // allocates seq
+getChunksForTab(tabId, { limit?, before? }): LogChunk[] // seq paginate (DESC→reverse)
+getTotalChunkCount(tabId): number
+clearChunksForTab(tabId): void
+```
+
+Pagination mirrors current `getMessagesForTab` semantics but at chunk grain. Agent
+persists each chunk **on seal**.
+
+---
+
+## 7. WS / event protocol
+
+Replace `currentChunks` snapshot + `done.message` with:
+
+- `chunk-open` `{ id, seq, turnId, step, role, type }`
+- `chunk-delta` `{ id, text }`
+- `chunk-seal` `{ id, data }`
+- `turn-done` `{ turnId }` (status → idle; no payload reconstruction needed)
+
+On WS reconnect, the frontend just refetches the tail via REST (§8) — no bespoke
+snapshot needed. `TabStatusSnapshot` loses `currentChunks`/`currentAssistantId`.
+
+---
+
+## 8. API routes
+
+- `GET /tabs/:id/messages` → `GET /tabs/:id/chunks?limit&before` → `{ chunks, total }`.
+- Update WS broadcast to emit the §7 events.
+- Audit other consumers of message endpoints.
+
+---
+
+## 9. Frontend store (`tabs.svelte.ts`)
+
+- `tab.messages: ChatMessage[]` → `tab.chunks: LogChunk[]` (flat window) +
+ `oldestLoadedSeq`.
+- `applyChunkEvent` → apply `chunk-open/delta/seal` by `id`.
+- `evictMessages` → `evictChunks`: drop oldest chunks beyond `chunkLimit`; pin the
+ in-flight chunks + the last turn. **Now trims within a turn** — removes the
+ `:360-372` caveat.
+- `loadMoreMessages` → `loadMoreChunks` (page by `seq`).
+- A `$derived` selector groups the flat chunk window → render groups
+ `{ turnId, role, step, chunks[] }` for the view layer (no nested storage).
+
+## 10. Frontend components
+
+Chat view renders from the derived groups: user bubble, assistant bubble
+(thinking/text/tool-calls), tool results paired by `callId`, system notices,
+errors. Partial (half-paged) turns render as partial bubbles.
+
+---
+
+## Phasing (each phase compiles + tests green independently)
+
+- **P0 — Schema/DB**: migration (drop messages, clear tabs), `db/chunks.ts`, tests.
+- **P1 — Core wire builder**: types, stream reducer, `chunksToModelMessages` +
+ normalizations + the 3-step stability regression test + caching-breakpoint tests.
+ *(Delivers the cache fix in isolation.)*
+- **P2 — Agent loop**: flat log, `turnId`/`step` tagging, interrupt-as-user-chunk,
+ remove `stripUserInterruptBlock` + accumulating message.
+- **P3 — Persistence wiring**: write-on-seal in agent-manager; pre-populate agent
+ history from `getChunksForTab`.
+- **P4 — Protocol/API**: §7 events, `/chunks` route, status snapshot trim.
+- **P5 — Frontend store**: flat window, chunk-grain eviction, pagination, grouping.
+- **P6 — Frontend components**: render from groups.
+- **P7 — Cleanup**: delete `toModelMessages`, `messages` table refs, dead types;
+ docs; full test pass (`bun run test`, `bun run check`).
+
+## Test plan (highlights)
+
+- **Cache stability (P1)**: 3-step turn; prior steps' ModelMessages byte-identical
+ across builds. Breakpoints on `[system]`, `[assistant(step N), tool(step N)]`.
+- **Reasoning retention (P1)**: empty-text + signature reasoning part is kept.
+- **Pagination (P0/P5)**: `before`/`limit` windows; eviction trims within a turn and
+ pins the live turn; scroll-up refetch.
+- **Interrupt (P2)**: queued message becomes a trailing `user` chunk; no history
+ mutation; prefix of earlier steps unchanged.
+
+## Risks / watch-items
+
+- **Anthropic ordering**: thinking must precede non-thinking in a turn — guaranteed
+ by per-step grouping + seq order; assert in tests.
+- **Many tiny rows**: one row per block (not per delta) keeps counts sane; index on
+ `(tab_id, seq)` covers paginate/evict.
+- **Signature round-trip**: thinking `metadata` must persist in `data_json` and
+ replay verbatim.
+- **Subagents**: confirm child-agent tabs use the same log path.
+```
diff --git a/notes/plan-chunk-refactor.md b/notes/plan-chunk-refactor.md
new file mode 100644
index 0000000..0a60237
--- /dev/null
+++ b/notes/plan-chunk-refactor.md
@@ -0,0 +1,245 @@
+# Plan: Chunk-Based Message Refactor
+
+## Goal
+
+Replace the current `content: ContentSegment[]` + separate `thinking: string` representation with a single ordered `chunks: Chunk[]` list per message. Preserve the actual temporal ordering of text, reasoning, tool calls, system notices, and errors as they arrive from the model — losing none of that information to flat-string accumulation.
+
+This is foundational work for downstream features (editable history, resumable mid-generation chats, structured truncation sentinels) that all assume an honest, lossless representation of a turn.
+
+Beta software — **no backward compatibility required**. The existing `messages` and `tabs` rows will be destroyed. Settings, keys, credentials, and other non-chat-history state will be preserved.
+
+## Final Design
+
+### Chunk union
+
+| Type | Body | Opens on | Closes on | Sent to LLM? |
+|---|---|---|---|---|
+| `text` | `text: string` | First `text-delta` after non-text. Consecutive `text-delta` events append. | `reasoning-delta`, `tool-call`, `error`, system event, `done`. | Yes (as `text` part). |
+| `thinking` | `text: string` | First `reasoning-delta` after non-thinking. Consecutive `reasoning-delta` events append. | `text-delta`, `tool-call`, `error`, system event, `done`. | Yes (as `reasoning` part — handles Claude's `interleaved-thinking-2025-05-14`). |
+| `tool-batch` | `calls: Array<{ id, name, arguments, result?, isError?, shellOutput? }>` | First `tool-call` after non-tool. Consecutive `tool-call` events append a new entry to `calls`. | Any non-tool event. | Yes (each call/result as `tool-call` / `tool-result` parts). |
+| `error` | `message: string`, `statusCode?: number` | An `error` event. Always its own chunk. No coalescing. | Single-event. | No (the turn ended anyway). |
+| `system` | `text: string`, `kind: "notice" \| "model-changed" \| "config-reload" \| "cancelled"` | A system event. Always its own chunk. No coalescing. | Single-event. | **No — stripped in `toCoreMessages`.** |
+
+### Message roles
+
+`user | assistant | system`
+
+- `user` and `assistant` messages can contain any chunk types.
+- `system` role messages exist for system events that fire **outside an active assistant turn**. They contain only `system` chunks. They are skipped entirely by `toCoreMessages`.
+
+### System event routing
+
+When a system event arrives:
+
+1. Active assistant turn in flight → append a `system` chunk to that message's chunks at its current position.
+2. No turn in flight; most recent message is `role: "system"` → append a `system` chunk to that message.
+3. No turn in flight; most recent message is anything else → create a new `role: "system"` message containing one `system` chunk.
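Annotation: the three routing rules above can be sketched as a small TypeScript function. This variant takes the in-flight assistant message as an explicit argument, which is an assumption; the plan's `applySystemEvent(messages, event)` may locate it differently. `Msg` and `routeSystemEvent` are hypothetical names.

```typescript
type SystemKind = "notice" | "model-changed" | "config-reload" | "cancelled";
interface SystemChunk { type: "system"; kind: SystemKind; text: string }
interface TextChunk { type: "text"; text: string }
// Hypothetical reduced message; the real ChatMessage has more fields and chunk types.
interface Msg { role: "user" | "assistant" | "system"; chunks: Array<SystemChunk | TextChunk> }

function routeSystemEvent(messages: Msg[], activeAssistant: Msg | null, chunk: SystemChunk): void {
  // Rule 1: a turn is in flight, so the notice lands inside it, in arrival order.
  if (activeAssistant) {
    activeAssistant.chunks.push(chunk);
    return;
  }
  // Rule 2: no turn in flight and the latest message is already a system message.
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "system") {
    last.chunks.push(chunk);
    return;
  }
  // Rule 3: otherwise start a new standalone system message.
  messages.push({ role: "system", chunks: [chunk] });
}
```

The function mutates in place and returns nothing, mirroring the mutating style the plan specifies for `appendEventToChunks`.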
+
+### `toCoreMessages` rebuild rules
+
+- Iterate messages in seq order.
+- Skip `role: "system"` messages entirely.
+- For each message, iterate `chunks` and emit AI SDK parts:
+ - `text` → `{ type: "text", text }`
+ - `thinking` → `{ type: "reasoning", text }`
+ - `tool-batch` → one `{ type: "tool-call" }` per entry (and tool-result parts in the following `tool` message, same as current logic)
+ - `error` → skip
+ - `system` → skip
+
+### Example turn
+
+User clicks Send, model thinks, says something, hits a rate limit (auto-switches model), thinks more, calls two tools, says something, errors out:
+
+```
+Message #N (role=user):
+ chunks: [{ type: "text", text: "explain X" }]
+
+Message #N+1 (role=assistant):
+ chunks: [
+ { type: "thinking", text: "I should..." },
+ { type: "text", text: "Sure, here's the gist..." },
+ { type: "system", kind: "model-changed", text: "Switched to Sonnet 4 (rate limit)" },
+ { type: "thinking", text: "Now I need to look at..." },
+ { type: "tool-batch", calls: [
+ { id: "1", name: "read_file", ..., result: "..." },
+ { id: "2", name: "list_files", ..., result: "..." }
+ ]},
+ { type: "text", text: "Looking at the file..." },
+ { type: "error", message: "Network error", statusCode: 503 }
+ ]
+```
+
+## Database cleanup
+
+**Live DB:** `~/.local/share/dispatch/dispatch.db` (XDG: `$XDG_DATA_HOME/dispatch/dispatch.db`)
+
+Current size: ~122 MB. Tables and row counts as of this plan:
+
+| Table | Rows | Disposition |
+|---|---|---|
+| `api_keys` | 7 | **Preserve** — user-imported API keys |
+| `credentials` | 2 | **Preserve** — OAuth credentials |
+| `settings` | 11 | **Preserve** — app settings |
+| `usage_cache` | 2 | **Preserve** — usage report cache |
+| `wake_schedule` | 5 | **Preserve** — wake schedule |
+| `messages` | 524 | **Delete all** — chat history |
+| `tabs` | 361 | **Delete all** — chat sessions |
+
+### Proposed cleanup statements
+
+```sql
+-- Order matters: messages references tabs.id via a foreign key (enforced when PRAGMA foreign_keys = ON)
+DELETE FROM messages;
+DELETE FROM tabs;
+VACUUM;
+```
+
+Then, in the live schema, drop the `thinking` column from `messages` (or recreate the table without it). SQLite has historically been fussy about `DROP COLUMN`; modern SQLite (3.35+) supports it directly:
+
+```sql
+ALTER TABLE messages DROP COLUMN thinking;
+```
+
+If the installed SQLite is older, fall back to the table-rebuild pattern (create new table without the column, copy rows, swap names, drop old). Because `DELETE FROM messages` has already emptied the table, the copy step is a no-op, so we can simply `DROP TABLE messages` and let the application's `CREATE TABLE IF NOT EXISTS` recreate it on next startup (after the schema change in `db/index.ts`).
+
+**Preferred sequence:**
+
+1. Stop any running dispatch processes.
+2. Run `DELETE FROM messages; DELETE FROM tabs; VACUUM;` to clear chat history.
+3. Update `db/index.ts` to remove the `thinking` column from the `CREATE TABLE messages` statement.
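+4. Drop the existing messages table so the new schema takes effect on next startup: `DROP TABLE messages; DROP INDEX IF EXISTS idx_messages_tab;` (the app will recreate the table on launch; `DROP TABLE` already removes the table's indexes, so the `DROP INDEX IF EXISTS` is only a safeguard).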
+
+Effective end state: same DB file, settings/keys preserved, chat history gone, new schema.
+
+## Implementation phases
+
+### Phase 0 — Confirm starting point
+
+- [ ] `git status` clean (only `wishlist.md` untracked is acceptable — unrelated).
+- [ ] No running dispatch processes that hold the DB.
+
+### Phase 1 — Types
+
+File: `packages/core/src/types/index.ts`
+
+- [ ] Define `Chunk` union (5 variants per the table above).
+- [ ] Replace `ChatMessage.content: string` + `toolCalls?` + `toolResults?` with `chunks: Chunk[]`.
+- [ ] Update `MessageRole` to include `"system"` (currently `"user" | "assistant" | "tool"` — drop `"tool"` since tool messages are now embedded as `tool-batch` chunks).
+
+File: `packages/frontend/src/lib/types.ts`
+
+- [ ] Mirror `Chunk` union.
+- [ ] Replace `content: ContentSegment[]` + `thinking?: string` with `chunks: Chunk[]`.
+- [ ] Drop `ContentSegment` and `ToolCallDisplay` (replaced by chunk variants).
+
+### Phase 2 — Core helper + unit tests
+
+New file: `packages/core/src/chunks/append.ts` (or co-located in `agent/`)
+
+- [ ] Implement `appendEventToChunks(chunks: Chunk[], event: AgentEvent): void` (mutating, returns void).
+- [ ] Implement `applySystemEvent(messages: ChatMessage[], event: SystemEvent): void` for the standalone-system-message routing logic.
+
+Tests: `packages/core/tests/chunks/append.test.ts`
+
+- [ ] Empty chunks + text-delta → one text chunk with the delta.
+- [ ] Two consecutive text-deltas → one text chunk with concatenated text.
+- [ ] text-delta then reasoning-delta → two chunks (text, thinking).
+- [ ] text-delta then tool-call → two chunks (text, tool-batch with one entry).
+- [ ] Two consecutive tool-calls → one tool-batch with two entries.
+- [ ] tool-call then tool-call then text → two chunks (tool-batch with 2 entries, text).
+- [ ] tool-result arrives → updates matching tool-call entry in the latest tool-batch chunk by id.
+- [ ] shell-output arrives → appends to the most recent tool-call's `shellOutput`.
+- [ ] error event → opens an error chunk; subsequent events go to new chunks.
+- [ ] system event during text run → closes text, opens system; the next text-delta opens a new text chunk.
+- [ ] Two consecutive system events → two separate system chunks (no coalescing).
+- [ ] Interleaved think → text → think → tool → think → text yields 6 chunks in that order.
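The coalescing rules in these test cases could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the `SketchChunk` and `SketchEvent` shapes and the `appendSketch` name are hypothetical simplifications of the real `Chunk` and `AgentEvent` types, and error, system, and shell-output handling are left out.

```typescript
// Hypothetical, simplified shapes for illustration only.
type SketchChunk =
  | { type: "text"; text: string }
  | { type: "thinking"; text: string }
  | { type: "tool-batch"; calls: { id: string; name: string; result?: string }[] };

type SketchEvent =
  | { type: "text-delta"; delta: string }
  | { type: "reasoning-delta"; delta: string }
  | { type: "tool-call"; id: string; name: string }
  | { type: "tool-result"; id: string; result: string };

// Mutates `chunks` in place: an event of the same kind as the last chunk
// extends it, anything else opens a new chunk.
function appendSketch(chunks: SketchChunk[], event: SketchEvent): void {
  const last = chunks[chunks.length - 1];
  switch (event.type) {
    case "text-delta":
      if (last && last.type === "text") last.text += event.delta;
      else chunks.push({ type: "text", text: event.delta });
      return;
    case "reasoning-delta":
      if (last && last.type === "thinking") last.text += event.delta;
      else chunks.push({ type: "thinking", text: event.delta });
      return;
    case "tool-call":
      if (last && last.type === "tool-batch") {
        last.calls.push({ id: event.id, name: event.name });
      } else {
        chunks.push({ type: "tool-batch", calls: [{ id: event.id, name: event.name }] });
      }
      return;
    case "tool-result":
      // Search newest-first for the batch that holds the matching call id.
      for (let i = chunks.length - 1; i >= 0; i--) {
        const c = chunks[i];
        if (!c || c.type !== "tool-batch") continue;
        const entry = c.calls.find((call) => call.id === event.id);
        if (entry) {
          entry.result = event.result;
          return;
        }
      }
      return; // no matching call found: drop the result
  }
}
```

Because the result lookup is by id rather than by position, a late `tool-result` still lands in the batch that holds its call even after a text chunk has been opened behind it.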
+
+### Phase 3 — Database schema + cleanup
+
+File: `packages/core/src/db/index.ts`
+
+- [ ] Update `CREATE TABLE messages` to drop the `thinking` column.
+
+Live DB:
+
+- [ ] Run `DELETE FROM messages; DELETE FROM tabs; VACUUM;` (show user before executing).
+- [ ] Drop the `messages` table so the new schema takes effect on next startup.
+
+File: `packages/core/src/db/messages.ts`
+
+- [ ] Update `appendMessage()` signature: drop the `thinking` parameter; `content_json` now holds chunks.
+- [ ] Update read functions to parse `content_json` as `Chunk[]` directly (no thinking column read).
+
+### Phase 4 — Agent aggregation
+
+File: `packages/core/src/agent/agent.ts`
+
+- [ ] Replace `finalText: string`, `allToolCalls: ToolCall[]`, `allToolResults: ToolResult[]`, and `assistantThinking` (lines 331-333 and the equivalent in agent-manager) with a single `chunks: Chunk[]`.
+- [ ] On each event from `result.fullStream`, call `appendEventToChunks(chunks, event)`.
+- [ ] On `done`, ship `chunks` in the message payload.
+- [ ] Update `toCoreMessages` (lines 20-46): iterate chunks per message, emit AI SDK parts, skip system / error chunks, skip `role: "system"` messages entirely.
+
+### Phase 5 — Persistence + system event routing
+
+File: `packages/api/src/agent-manager.ts`
+
+- [ ] Replace the three-accumulator pattern (lines 893-978) with a single `chunks: Chunk[]` that delegates to `appendEventToChunks`.
+- [ ] On `notice` / `model-changed` / `config-reload` / cancel: route through `applySystemEvent` — appends to the in-flight assistant message if one exists, else appends to a `role: "system"` message (creating one if needed).
+- [ ] `appendMessage` call: drop the thinking arg; `content_json` is `JSON.stringify(chunks)`.
+
+### Phase 6 — Frontend store state machine
+
+File: `packages/frontend/src/lib/tabs.svelte.ts`
+
+- [ ] Replace per-event-type mutation logic (lines 288-626) with a call to the shared `appendEventToChunks` helper. Import from core if feasible; otherwise duplicate carefully — but prefer import to keep wire-format symmetry guaranteed.
+- [ ] Update `openAgentTab` DB-load path (lines 185-199): parse `chunks` directly from `content_json`, no thinking field merging.
+- [ ] Update `currentAssistantId` semantics: still tracks the in-progress assistant message. System events that arrive when `currentAssistantId` is null create/append a `role: "system"` message via `applySystemEvent`.
+
+### Phase 7 — Frontend display
+
+File: `packages/frontend/src/lib/components/ChatMessage.svelte`
+
+- [ ] Remove the top-hoisted `{#if message.thinking}` block.
+- [ ] Replace the segments loop with `{#each message.chunks as chunk}` and switch on `chunk.type`:
+ - `text` → `<MarkdownRenderer>`
+ - `thinking` → collapsible (DaisyUI `collapse`), default-collapsed per `appSettings.autoExpandThinking`, at its actual position in the turn.
+ - `tool-batch` → render each entry via `<ToolCallDisplay>` (existing component, possibly minor prop tweaks).
+ - `error` → red-bordered error card with the message + status code.
+ - `system` → thin separator-style block with the `kind` as a small label and the text.
+- [ ] Handle `role: "system"` messages: render as a standalone thin bubble with just the system chunks, no avatar / no actions.
+
+### Phase 8 — Integration tests
+
+- [ ] End-to-end: a real streaming run produces the expected chunk shape, including interleaved thinking.
+- [ ] Persistence round-trip: write chunks → read back → identical structure.
+- [ ] System event during turn ends up in the assistant message at the right position.
+- [ ] System event with no turn in flight creates a `role: "system"` message.
+- [ ] `toCoreMessages` correctly strips system / error chunks and `role: "system"` messages.
+
+### Phase 9 — Cleanup
+
+- [ ] Remove dead code: `ContentSegment` type, `thinking` field references, old per-event accumulator vars.
+- [ ] Run `vitest run` across all packages and `tsc --noEmit`.
+- [ ] Commit with a focused message; push.
+
+## Test strategy
+
+The load-bearing piece is `appendEventToChunks`. If this is correct in isolation, every downstream layer inherits correctness. **Write Phase 2 first and lock its tests in green before touching agent/store/UI.**
+
+State-machine tests should cover every transition pair in the matrix (text→text, text→thinking, text→tool, text→error, text→system, thinking→text, ...) so that no transition gets accidentally broken later.
+
+Integration tests can be lighter — they just need to confirm the real wire format flows through the helper end-to-end.
+
+## Risks and notes
+
+1. **Frontend store imports from core.** Currently the frontend duplicates some types. Importing `appendEventToChunks` from core is the safest way to guarantee the wire format stays in sync — but requires the build to handle the import. If it doesn't, duplicate the helper carefully and have a test that runs both side by side on the same fixture events to detect drift.
+
+2. **`ai` SDK `reasoning` events.** Confirmed that `result.fullStream` emits `reasoning` events with `event.textDelta` for Anthropic with the `interleaved-thinking-2025-05-14` beta header. Other providers (OpenAI-compat) emit reasoning via different mechanisms; the middleware in `provider.ts:6-38` already handles those.
+
+3. **Tool result ordering across multiple steps.** Currently, `tool-result` events arrive at varying times relative to subsequent `text-delta` events depending on whether the AI streams text before vs after acknowledging the result. The new chunk model places results inside the `tool-batch` chunk that holds the matching call — this should be correct regardless of when the result event arrives, as long as the lookup-by-id finds the right batch.
+
+4. **Coalescing edge case: simultaneous tool calls.** If the AI emits multiple tool-call events with no intervening events (parallel tool use), they all batch together. If a `tool-result` for the first call arrives before the second `tool-call` event, the tool-batch chunk is already open and just absorbs both. No special handling needed.
+
+5. **Settings preservation.** The `settings` table (`key`, `value` columns) is preserved across the wipe. Verify the dispatch UI doesn't rely on any session-scoped row that lives inside `tabs` (e.g., "last opened tab id") — if so, the user will need to re-pick after the wipe. This is acceptable for beta.
+
+6. **`MessageRole` change.** Dropping `"tool"` from the role union means any code path that pattern-matches on `role === "tool"` must change. Grep for it before merging.
diff --git a/notes/plan-v6-upgrade.md b/notes/plan-v6-upgrade.md
new file mode 100644
index 0000000..ad3d223
--- /dev/null
+++ b/notes/plan-v6-upgrade.md
@@ -0,0 +1,450 @@
+# AI SDK v4 → v6 Upgrade Plan
+
+## Why
+
+The `reasoning-signature without reasoning` bug we've been fighting is a v4
+SDK artefact:
+
+ - `@ai-sdk/anthropic@1.2.12` emits Anthropic's `signature_delta` as a
+ separate stream chunk (`reasoning-signature`). The SDK accumulator
+ requires every signature to follow a `reasoning` chunk; Anthropic's
+ adaptive thinking happily emits empty thinking blocks (signature only,
+ no `thinking_delta`), which trips an `InvalidStreamPartError`.
+ - Sending a thinking block back to Anthropic without its signature is
+ rejected. Our chunk store had no signature field, so the round-trip
+ failed.
+
+Both classes of bug are **gone in v6** because:
+
+ - `@ai-sdk/anthropic@3.0.79` packages Anthropic's signature inside
+ `providerMetadata` on the `reasoning-end` stream event. There is no
+ separate `reasoning-signature` chunk to orphan.
+ - `providerOptions` on a v6 `ReasoningPart` carries the metadata back
+ verbatim, so the round-trip Just Works.
+ - v6 `@ai-sdk/anthropic@3.0.79` natively supports
+ `providerOptions.anthropic.thinking = { type: "adaptive" }` — the
+ `rewriteBodyForOpus47` custom-fetch hack disappears.
+ - v6 `createAnthropic({ authToken })` natively supports OAuth Bearer
+ auth — the `customFetch` that swaps `x-api-key` for `Authorization:
+ Bearer` also disappears.
+
+The opencode reference at `../opencode-patch/opencode/` runs the same
+extended-thinking flow on v6 with none of the workarounds we've been
+piling onto v4.
+
+## Versions
+
+| Package | v4 (current) | v6 (target) |
+|---|---|---|
+| `ai` | `4.3.19` | `6.0.191` |
+| `@ai-sdk/anthropic` | `1.2.12` | `3.0.79` |
+| `@ai-sdk/openai-compatible` | `0.2.x` | `2.0.48` |
+| `@ai-sdk/provider` | `1.1.3` | `3.0.10` |
+| `@ai-sdk/provider-utils` | `2.2.8` | `4.0.27` |
+| Middleware spec | `v1` | `v3` |
+
+`packages/core/package.json` already bumped; `bun install` already done.
+
+## What we keep from the current codebase
+
+- **The whole `Chunk` model.** `ChatMessage { chunks: Chunk[] }` stays as
+ the canonical persisted shape. Renderers, DB layer, and the frontend
+ store are unchanged in their architecture.
+- **The outer manual tool loop** in `Agent.run()`. We run tools ourselves
+ for permission prompts, shell-output streaming, and queued-message
+ injection. The SDK is given tools without `execute` so it never
+ auto-runs them.
+- **`appendEventToChunks` as the single source of event → chunk truth.**
+ Frontend store + backend agent both call into it.
+- **`createToolRegistry` indirection.** Our internal `ToolDefinition`
+ shape stays; only the registry's conversion to AI SDK tools changes.
+- **The agent-manager broadcaster.** `AgentEvent` shape changes (see
+ below); the broadcast plumbing does not.
+
+## What changes (high level)
+
+| Surface | v4 | v6 |
+|---|---|---|
+| Message type | `CoreMessage` | `ModelMessage` |
+| Streaming events | `text-delta` + `reasoning` + `reasoning-signature` + `tool-call` | `text-start/-delta/-end`, `reasoning-start/-delta/-end`, `tool-input-start/-delta/-end`, `tool-call`, `tool-result`, `tool-error`, `start-step`, `finish-step`, `start`, `finish`, `abort`, `error` |
+| Reasoning signature | separate `reasoning-signature` event | `providerMetadata` on `reasoning-end` |
+| Tool definition | `tool({ parameters: ZodSchema })` | `tool({ inputSchema: jsonSchema(...) })` |
+| Tool call payload | `args` | `input` |
+| Tool result payload | raw value | `ToolResultOutput` (`{ type: "text"; value: string }` etc.) |
+| Token budget | `maxTokens` | `maxOutputTokens` |
+| OAuth | custom fetch swapping `x-api-key` → `Authorization: Bearer` | `createAnthropic({ authToken })` |
+| Opus 4.7 thinking | body-rewrite injection of `thinking: { type: "adaptive" }` | `providerOptions.anthropic.thinking = { type: "adaptive" }` (native) |
+| Middleware spec | `"v1"`, `transformParams({ type, params })` returning params | `"v3"`, `transformParams({ type, params, model })` returning `LanguageModelV3CallOptions` |
+
+## Canonical types (locked)
+
+### Chunk (no breaking changes vs. current HEAD, signature added back)
+
+`packages/core/src/types/index.ts` and mirrored in
+`packages/frontend/src/lib/types.ts`:
+
+```ts
+export type Chunk =
+ | TextChunk
+ | ThinkingChunk
+ | ToolBatchChunk
+ | ErrorChunk
+ | SystemChunk;
+
+export interface TextChunk {
+ type: "text";
+ text: string;
+}
+
+export interface ThinkingChunk {
+ type: "thinking";
+ text: string;
+ /**
+ * Full Anthropic `providerMetadata` blob captured from the
+ * `reasoning-end` stream event. Round-tripped verbatim as
+ * `providerOptions` on the ReasoningPart in the next request so
+ * Anthropic can validate the thinking block's signature. Optional
+ * because non-Anthropic models don't produce one.
+ */
+ metadata?: Record<string, unknown>;
+}
+
+export interface ToolBatchChunk {
+ type: "tool-batch";
+ calls: ToolBatchEntry[];
+}
+
+export interface ToolBatchEntry {
+ id: string;
+ name: string;
+ arguments: Record<string, unknown>;
+ result?: string;
+ isError?: boolean;
+ shellOutput?: { stdout: string; stderr: string };
+}
+
+export interface ErrorChunk {
+ type: "error";
+ message: string;
+ statusCode?: number;
+}
+
+export type SystemChunkKind = "notice" | "model-changed" | "config-reload" | "cancelled";
+
+export interface SystemChunk {
+ type: "system";
+ kind: SystemChunkKind;
+ text: string;
+}
+```
+
+### AgentEvent (one new variant, otherwise unchanged)
+
+```ts
+export type AgentEvent =
+ | { type: "status"; status: AgentStatus }
+ | { type: "text-delta"; delta: string }
+ | { type: "reasoning-delta"; delta: string }
+ // NEW: emitted on the v6 `reasoning-end` stream event when it carries
+ // providerMetadata. `appendEventToChunks` attaches the blob to the
+ // most recent unsealed thinking chunk; `toModelMessages` reads it back
+ // out as `providerOptions` on the ReasoningPart in the next request.
+ | { type: "reasoning-end"; metadata?: Record<string, unknown> }
+ | { type: "tool-call"; toolCall: ToolCall }
+ | { type: "tool-result"; toolResult: ToolResult }
+ | { type: "shell-output"; data: string; stream: "stdout" | "stderr" }
+ | { type: "error"; error: string; statusCode?: number }
+ | { type: "notice"; message: string }
+ | { type: "model-changed"; keyId: string; modelId: string }
+ | { type: "done"; message: ChatMessage }
+ | { type: "task-list-update"; tasks: TaskItem[] }
+ | { type: "config-reload" }
+ | { type: "tab-created"; id: string; title: string; keyId: string | null;
+ modelId: string | null; parentTabId: string | null;
+ workingDirectory: string | null }
+ | { type: "message-queued"; tabId: string; messageId: string; message: string }
+ | { type: "message-consumed"; tabId: string; messageIds: string[] }
+ | { type: "message-cancelled"; tabId: string; messageId: string };
+```
+
+### appendEventToChunks sealing semantics
+
+A `thinking` chunk is *sealed* once it has a `metadata` field. A
+`reasoning-delta` after a sealed chunk opens a new chunk. A
+`reasoning-end` walks back to the most recent `thinking` chunk and
+attaches metadata if that chunk is still unsealed; if it is already
+sealed, or no thinking chunk exists, the event is dropped. (Same shape
+as the current code; just renamed from the v4-era "signature".)
+
+```ts
+case "reasoning-delta": {
+ const last = chunks[chunks.length - 1];
+ if (last && last.type === "thinking" && last.metadata === undefined) {
+ last.text += event.delta;
+ } else {
+ chunks.push({ type: "thinking", text: event.delta });
+ }
+ return;
+}
+
+case "reasoning-end": {
+ if (event.metadata === undefined) return; // nothing to attach
+ for (let i = chunks.length - 1; i >= 0; i--) {
+ const c = chunks[i];
+ if (!c || c.type !== "thinking") continue;
+ if (c.metadata !== undefined) return; // already sealed
+ c.metadata = event.metadata;
+ return;
+ }
+ return; // orphan — drop
+}
+```
+
+## Stream event → AgentEvent mapping (agent.ts)
+
+| v6 fullStream chunk | AgentEvent we emit |
+|---|---|
+| `start` | none (ignored) |
+| `text-start { id }` | none |
+| `text-delta { id, text }` | `{ type: "text-delta", delta: text }` |
+| `text-end { id }` | none |
+| `reasoning-start { id }` | none |
+| `reasoning-delta { id, text }` | `{ type: "reasoning-delta", delta: text }` |
+| `reasoning-end { id, providerMetadata }` | `{ type: "reasoning-end", metadata: providerMetadata }` (only when metadata is present, otherwise none) |
+| `tool-input-start/-delta/-end` | none (we don't surface argument streaming to the UI) |
+| `tool-call { toolCallId, toolName, input }` | `{ type: "tool-call", toolCall: { id: toolCallId, name: toolName, arguments: input } }` |
+| `tool-result` | none — the SDK only emits this if the tool's `execute` ran. Our SDK tools have no `execute` (see below), so this never fires from the stream. The agent's manual executor emits `tool-result` events itself. |
+| `tool-error` | log + treat as an SDK-level error (defensive) |
+| `start-step` / `finish-step` | none (we manage steps ourselves) |
+| `abort { reason }` | `{ type: "error", error: "aborted: " + reason }` (defensive; we don't currently abort streams from agent.ts) |
+| `error { error }` | `{ type: "error", error: formatError(error, config), statusCode? }` |
+| `finish { finishReason, totalUsage }` | none (we break out of the step loop based on whether the step emitted any tool-call) |
+| `raw` | none |
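The mapping table above could be expressed as a single pure function along these lines. This is a sketch under stated assumptions: `StreamPart` is a hypothetical hand-written subset of the v6 stream parts named in the table (not the SDK's own type), `mapStreamPart` and `MappedEvent` are illustrative names, and the `error`, `abort`, and `tool-error` rows are omitted.

```typescript
// Hypothetical subset of the fullStream parts listed in the table.
type StreamPart =
  | { type: "text-delta"; id: string; text: string }
  | { type: "reasoning-delta"; id: string; text: string }
  | { type: "reasoning-end"; id: string; providerMetadata?: Record<string, unknown> }
  | { type: "tool-call"; toolCallId: string; toolName: string; input: Record<string, unknown> }
  | { type: "start" | "text-start" | "text-end" | "reasoning-start" | "start-step" | "finish-step" | "finish" | "raw" };

// Subset of AgentEvent that the stream can produce in this sketch.
type MappedEvent =
  | { type: "text-delta"; delta: string }
  | { type: "reasoning-delta"; delta: string }
  | { type: "reasoning-end"; metadata: Record<string, unknown> }
  | { type: "tool-call"; toolCall: { id: string; name: string; arguments: Record<string, unknown> } };

// Returns null for every part the table maps to "none".
function mapStreamPart(part: StreamPart): MappedEvent | null {
  switch (part.type) {
    case "text-delta":
      return { type: "text-delta", delta: part.text };
    case "reasoning-delta":
      return { type: "reasoning-delta", delta: part.text };
    case "reasoning-end":
      // Only surfaced when the provider attached metadata (e.g. a signature).
      return part.providerMetadata === undefined
        ? null
        : { type: "reasoning-end", metadata: part.providerMetadata };
    case "tool-call":
      return {
        type: "tool-call",
        toolCall: { id: part.toolCallId, name: part.toolName, arguments: part.input },
      };
    default:
      return null;
  }
}
```

Keeping the mapping free of side effects would let the agent loop stay a thin `for await` over the stream, and makes each table row testable in isolation.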
+
+## toModelMessages (replaces v4 toCoreMessages)
+
+Same shape as today, with these changes:
+
+- Renamed: `toCoreMessages` → `toModelMessages`, return `ModelMessage[]`.
+- `ReasoningPart`:
+ ```ts
+ case "thinking":
+ parts.push({
+ type: "reasoning",
+ text: chunk.text,
+ ...(chunk.metadata !== undefined
+ ? { providerOptions: chunk.metadata as ProviderOptions }
+ : {}),
+ });
+ break;
+ ```
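+ No "skip signatureless chunks" branch: bare reasoning is a legal
+ `ReasoningPart` in v6 (it is just sent without `providerOptions`). How
+ the Anthropic provider treats such a part still needs verifying, since
+ Anthropic rejects a thinking block sent back without its signature (see Why).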
+
+- `ToolCallPart`:
+ ```ts
+ parts.push({
+ type: "tool-call",
+ toolCallId: entry.id,
+ toolName,
+ input: entry.arguments, // was: args
+ });
+ ```
+
+- `ToolResultPart`:
+ ```ts
+ result.push({
+ role: "tool",
+ content: [{
+ type: "tool-result",
+ toolCallId: tr.toolCallId,
+ toolName: tr.toolName,
+ output: { type: "text", value: tr.result }, // was: result
+ }],
+ });
+ ```
+
+- **Anthropic structural normalisations** (cribbed from opencode
+ `provider/transform.ts`):
+ 1. Filter out `text` / `reasoning` parts whose `text === ""`. Drop
+ messages whose content array becomes empty.
+ 2. For assistant messages whose content contains at least one
+ `tool-call` followed by a non-`tool-call` part, split into two
+ consecutive assistant messages: `[non-tool parts] + [tool-call parts]`.
+ Anthropic rejects `[tool_use, text]` ordering with
+ "`tool_use` ids were found without `tool_result` blocks immediately
+ after". opencode does this at `transform.ts:124-148`.
+
+ Both apply only when the active provider is Anthropic (Claude OAuth or
+ `opencode-anthropic`). Skip for OpenCode Zen / openai-compatible.
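The two normalisations could look roughly like this. It is a sketch only, assuming simplified part shapes and handling only assistant messages; `Part`, `AssistantMsg`, and `normalizeForAnthropic` are hypothetical names, not the opencode or AI SDK types.

```typescript
// Hypothetical simplified shapes; the real ModelMessage parts carry more fields.
type Part =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; input: unknown };

interface AssistantMsg {
  role: "assistant";
  content: Part[];
}

function normalizeForAnthropic(messages: AssistantMsg[]): AssistantMsg[] {
  const out: AssistantMsg[] = [];
  for (const msg of messages) {
    // 1. Drop empty text / reasoning parts; skip messages left with no content.
    const parts = msg.content.filter((p) => p.type === "tool-call" || p.text !== "");
    if (parts.length === 0) continue;

    // 2. If any non-tool-call part follows a tool-call, split the message so
    //    that all tool-call parts end up last, in their own assistant message.
    const firstTool = parts.findIndex((p) => p.type === "tool-call");
    const needsSplit =
      firstTool !== -1 && parts.slice(firstTool + 1).some((p) => p.type !== "tool-call");
    if (!needsSplit) {
      out.push({ role: "assistant", content: parts });
      continue;
    }
    out.push({ role: "assistant", content: parts.filter((p) => p.type !== "tool-call") });
    out.push({ role: "assistant", content: parts.filter((p) => p.type === "tool-call") });
  }
  return out;
}
```

In the real code this pass would be gated on the active provider, since the plan applies it only to the Anthropic-backed providers.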
+
+## Provider (provider.ts)
+
+### Claude OAuth (anthropic provider)
+
+```ts
+import { createAnthropic } from "@ai-sdk/anthropic";
+
+function createClaudeOAuthProvider(config: ProviderConfig): ModelFactory {
+ const anthropic = createAnthropic({
+ baseURL: config.baseURL || "https://api.anthropic.com/v1",
+ authToken: config.claudeCredentials?.accessToken ?? config.apiKey,
+ headers: {
+ "anthropic-dangerous-direct-browser-access": "true",
+ "x-app": "cli",
+ "user-agent": "claude-cli/2.1.112 (external, sdk-cli)",
+ },
+ });
+ return (modelId: string) => anthropic(modelId);
+}
+```
+
+- `authToken` replaces the v4 customFetch + apiKey-swap dance.
+- `rewriteBodyForOpus47` deleted. Opus 4.7 thinking is configured via
+ `providerOptions.anthropic.thinking = { type: "adaptive" }` in agent.ts.
+
+### Plain-API-key Anthropic (opencode-go MiniMax/Qwen)
+
+```ts
+function createApiKeyAnthropicProvider(config: ProviderConfig): ModelFactory {
+ const anthropic = createAnthropic({
+ apiKey: config.apiKey,
+ baseURL: config.baseURL || "https://opencode.ai/zen/go/v1",
+ });
+ return (modelId: string) => anthropic(modelId);
+}
+```
+
+### OpenAI-compatible (default)
+
+`createOpenAICompatible` API is unchanged (`{ name, apiKey, baseURL }`).
+
+### Middleware (v3 spec) for `normalizeMessages` tweak
+
+OpenCode Go MiniMax/Qwen still need the reasoning-text-stripping
+middleware we have today. Under the v3 spec it changes from the v1 form as follows:
+
+```ts
+const middleware: LanguageModelV3Middleware = {
+ specificationVersion: "v3" as const, // v3 spec (the v4 SDK used "v1"); the field is named `specificationVersion`
+ async transformParams({ type, params, model }) {
+ if (type === "stream" && params.prompt) {
+ return {
+ ...params,
+ prompt: normalizeMessages(params.prompt as unknown[]) as LanguageModelV3Prompt,
+ };
+ }
+ return params;
+ },
+};
+return wrapLanguageModel({ model: provider(modelId), middleware: [middleware] });
+```
+
+(The `LanguageModelV3Middleware` type, with its `specificationVersion` field, is exported from `@ai-sdk/provider@3.0.10` and re-exported as `LanguageModelMiddleware` from `ai`.)
+
+## Tools (tools/registry.ts)
+
+v6 tool factory:
+
+```ts
+import { tool, jsonSchema } from "ai";
+import { z, type ZodTypeAny } from "zod";
+
+function zodToJsonSchema(schema: ZodTypeAny): unknown {
+ // tiny helper — most opencode/AI-SDK examples use a third-party
+ // zod-to-json-schema package, but for our simple parameter objects
+ // the AI SDK's `jsonSchema()` wrapper accepts raw JSONSchema7 directly.
+ // We can import zod-to-json-schema OR keep using our existing
+ // (zod-style) ToolDefinition.parameters as-is and pass through
+ // `tool({ inputSchema: <jsonSchemaOrZod> })`.
+ ...
+}
+
+function toAISDKTool(def: ToolDefinition) {
+ return tool({
+ description: def.description,
+ inputSchema: jsonSchema(zodToJsonSchema(def.parameters)),
+ // NO execute: we don't want the SDK to auto-run tools. We collect
+ // tool-call events from fullStream, run tools ourselves with
+ // permission/shell-output streaming, then push the results as a
+ // tool ModelMessage in the next streamText call.
+ });
+}
+```
+
+**Decision: use `zod` v4's built-in `z.toJSONSchema()` if available, else
+add `zod-to-json-schema` as a dep.** Our zod version is `^3.23.0`, so we
+need to verify whether `z.toJSONSchema()` is available. If it is not, the
+Agent C task will include adding `zod-to-json-schema` as a dependency.
+
+## File ownership for parallel implementation
+
+### Phase 0 — me, sequential (load-bearing, no parallel work safe)
+
+| File | Change |
+|---|---|
+| `packages/core/src/types/index.ts` | Add `metadata?` to `ThinkingChunk`; add `reasoning-end` variant to `AgentEvent`; everything else unchanged |
+| `packages/core/src/chunks/append.ts` | Replace `signature` handling with `metadata`; add `reasoning-end` case; update jsdoc table |
+| `packages/core/tests/chunks/append.test.ts` | Replace `rs` helper with `re` (reasoning-end), exercise new sealing semantics + walk-back + orphan-drop |
+| `packages/frontend/src/lib/types.ts` | Mirror the `metadata` + `reasoning-end` additions |
+| `packages/frontend/src/lib/tabs.svelte.ts` | Add `reasoning-end` to the `applyChunkEvent` switch case |
+
+After Phase 0: `bun run --cwd packages/core test tests/chunks` MUST pass.
+Everything else will still be broken until Phase 1 lands.
+
+### Phase 1 — parallel sonnet agents, disjoint file ownership
+
+| Agent | Owns | Cannot touch |
+|---|---|---|
+| **A: agent** | `packages/core/src/agent/agent.ts`, `packages/core/tests/agent/agent.test.ts` | provider.ts, registry.ts, types.ts |
+| **B: provider** | `packages/core/src/llm/provider.ts`, `packages/core/tests/llm/provider.test.ts` | agent.ts, registry.ts |
+| **C: tools** | `packages/core/src/tools/registry.ts`, `packages/core/tests/tools/registry.test.ts` (only registry; do NOT touch individual tool files like `read-file.ts`), `packages/core/package.json` (may add `zod-to-json-schema` dep) | agent.ts, provider.ts |
+| **D: api** | `packages/api/src/agent-manager.ts`, `packages/api/tests/agent-manager.test.ts` | anything in core/src |
+| **E: frontend** | `packages/frontend/src/lib/components/ChatMessage.svelte`, `packages/frontend/tests/chat-store.test.ts` | tabs.svelte.ts (I already touched it in Phase 0); types.ts (also Phase 0) |
+
+`packages/core/src/credentials/claude.ts` — left alone. Its
+`buildBillingHeaderValue` and `SYSTEM_IDENTITY` are pure helpers; agent.ts
+still calls them.
+
+`packages/core/src/db/*` — left alone. The DB stores `chunks` as JSON;
+the only schema-relevant change (signature → metadata) is internal to
+the chunk shape and persists transparently.
+
+Individual tool implementations under `packages/core/src/tools/*.ts`
+(except `registry.ts`) — left alone. They follow the
+`ToolDefinition { name, description, parameters: ZodSchema, execute }`
+interface, which is unchanged.
+
+### Phase 2 — me, sequential
+
+- Workspace-wide typecheck: `bun run --cwd packages/core typecheck` and
+ `bun run --cwd packages/frontend typecheck` (which is `svelte-check`).
+- Workspace-wide tests: `bun run --cwd packages/core test`,
+ `bun run --cwd packages/api test`, `bun run --cwd packages/frontend test`.
+- `bun run check` (biome).
+- Fix any integration gaps that Phase 1 agents missed (e.g. type
+ mismatches between agent.ts and provider.ts).
+
+### Phase 3 — Gemini code review
+
+Headless: `gemini --yolo --skip-trust -m gemini-3-pro-preview -p "$(cat
+prompt)"`. Prompt restricts Gemini to read-only inspection and to writing
+ONLY `report.md` at the project root.
+
+### Phase 4 — me, sequential
+
+Apply each fix Gemini flags. Re-run typecheck + tests + biome.
+
+## Risks & mitigations
+
+- **Sequential dependency between phases.** Phase 1 agents must wait for
+ Phase 0 to land before they start, because they read the new
+ `AgentEvent`/`Chunk` types. Mitigated by doing Phase 0 in-conversation.
+- **Phase 1 agents writing diverging adapter code.** Mitigated by this
+ spec locking down exact AgentEvent variants and exact tool-event
+ mappings.
+- **Frontend rendering regressions.** Mitigated by keeping the chunk
+ shape stable — the only change for the frontend is `signature` →
+ `metadata` (an internal detail the UI doesn't render anyway).
+- **Subtle v6 streaming semantics we miss.** Mitigated by Gemini
+ read-only review against opencode reference patterns.
diff --git a/notes/plan.md b/notes/plan.md
new file mode 100644
index 0000000..1dca443
--- /dev/null
+++ b/notes/plan.md
@@ -0,0 +1,451 @@
+# Dispatch — Build Plan
+
+## Stack
+
+| Layer | Choice | Why |
+|---|---|---|
+| Runtime | TypeScript / Node.js | Rich LLM ecosystem, strong async, same language front+back |
+| LLM | Vercel AI SDK (`ai`) | Provider-agnostic, streaming, tool calling, 15+ providers |
+| API | Hono or Fastify | Lightweight, WebSocket support |
+| Persistence | better-sqlite3 + drizzle-orm | Embedded, no external DB dependency |
+| Config/Skills | gray-matter + yaml + chokidar | YAML frontmatter parsing, hot-reload on file changes |
+| Frontend | HTML/CSS/JS | Lightweight for MVP, no heavy framework |
+| Process mgmt | child_process + tree-kill | Subagent lifecycle management |
+| LSP (Phase 6) | vscode-languageserver-protocol | Standard LSP client library |
+
+## Project Structure
+
+```
+dispatch/
+ packages/
+ core/ # Agent runtime, LLM, tools, permissions, config
+ api/ # HTTP + WebSocket server
+ frontend/ # HTML/CSS/JS client
+ .skills/ # Project-level skills (dogfooding)
+ dispatch.yaml # Project config (dogfooding)
+```
+
+---
+
+## Phase 1: Single Agent + Basic UI
+
+**Goal:** Chat with one agent in a browser, watch it read and write files.
+
+**Effort:** 2-3 weeks
+
+### Backend
+
+- [ ] Project scaffolding (monorepo with packages/core, packages/api, packages/frontend)
+- [ ] Agent runtime: message -> LLM -> tool call -> result -> repeat loop
+- [ ] Vercel AI SDK integration with streaming responses
+- [ ] Single provider config (one API key, one model — hardcoded or env vars for now)
+- [ ] Basic tools:
+ - `read_file` — read file contents
+ - `write_file` — write/overwrite a file
+ - `list_files` — glob/list directory contents
+- [ ] HTTP API:
+ - `POST /chat` — send a message, get streaming response
+ - `GET /status` — agent status (idle, running, etc.)
+- [ ] WebSocket: stream agent output tokens and tool calls in real-time
+
+### Frontend
+
+- [ ] Single chat panel — text input field, send button
+- [ ] Streamed response rendering (tokens appear as they arrive)
+- [ ] Tool call display (collapsible: show tool name, arguments, result)
+- [ ] Model/provider indicator in header
+- [ ] Basic layout: chat takes full screen, clean and minimal
+
+### Done When
+
+Open a browser, type "read the contents of package.json and summarize it," see the agent call `read_file`, stream back a summary. Ask it to create a new file — it calls `write_file` and confirms.
+
+---
+
+## Phase 2: Shell Permissions + UI
+
+**Goal:** Agent can run shell commands with directory-scoped permission controls. Usable on real projects.
+
+**Effort:** 2-3 weeks
+
+### Backend — Permission Engine
+
+- [ ] Rule-based permission engine:
+ - Rules: `{ permission, pattern, action }` where action is `allow | deny | ask`
+ - Wildcard glob matching on both permission name and target pattern
+ - Ordered ruleset, last-match-wins (user config overrides defaults)
+ - Default action when no rule matches: `"ask"`
+- [ ] Permission grant types:
+ - Per-request — allow this one operation (no rule stored)
+ - Per-session — add to in-memory approved set for process lifetime
+ - Permanent — write to `dispatch.yaml` config
+- [ ] Reject cascade: rejecting one pending request auto-rejects all pending requests in that session
+- [ ] Path resolution:
+ - Working directory + all subdirectories: always in-scope (no prompt)
+ - All other paths: `external_directory` permission check
+ - `~` / `$HOME` expansion in config patterns
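The ordered, last-match-wins evaluation described above might be sketched like this. The `Rule` shape follows the `{ permission, pattern, action }` description; everything else is an assumption made for illustration: `globToRegExp` and `evaluate` are hypothetical names, and only `*` is treated as a wildcard (matching any run of characters, including `/`).

```typescript
type Action = "allow" | "deny" | "ask";

interface Rule {
  permission: string;
  pattern: string;
  action: Action;
}

// Assumption: `*` is the only wildcard; every other character is literal.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Walks the whole ordered ruleset; the last matching rule decides.
function evaluate(rules: Rule[], permission: string, target: string): Action {
  let result: Action = "ask"; // default when no rule matches
  for (const rule of rules) {
    const permissionMatches = globToRegExp(rule.permission).test(permission);
    const targetMatches = globToRegExp(rule.pattern).test(target);
    if (permissionMatches && targetMatches) {
      result = rule.action; // later rules override earlier ones
    }
  }
  return result;
}
```

With last-match-wins, user rules only need to be appended after the defaults to override them; no priority field is required.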
+
+### Backend — Config Format
+
+- [ ] `dispatch.yaml` permission block supporting per-permission patterns:
+ ```yaml
+ permissions:
+ read: allow # shorthand: "read:*" = allow
+ edit:
+ "*": ask # prompt for edits by default
+ "src/**": allow # auto-allow edits inside src/
+ "/tmp/*": allow
+ external_directory:
+ "~/projects/*": allow
+ "/tmp/*": allow
+ bash:
+ "npm test": allow
+ "git commit *": allow
+ "git push *": ask
+ "*": ask # prompt for unknown commands
+ ```
+- [ ] Config hot-reload via chokidar
+
+### Backend — Shell Tool with Tree-Sitter Analysis
+
+- [ ] Dependencies: `web-tree-sitter` (WASM runtime), `tree-sitter-bash` (grammar WASM)
+ - ~2.5 MB total, loaded lazily on first shell command call
+- [ ] `run_shell` tool:
+ - Executes arbitrary commands via `child_process`, captures stdout/stderr/exit code
+ - Streaming output for long-running commands
+ - Configurable timeout (default 2 minutes)
+ - Working directory parameter (defaults to project root)
+- [ ] Tree-sitter static analysis pipeline:
+ 1. Parse command string into AST using `tree-sitter-bash`
+ 2. Walk all `command` nodes (recurses into pipelines, subshells, `&&`, `if` bodies)
+ 3. For file-touching commands (`rm`, `cp`, `mv`, `mkdir`, `touch`, `chmod`, `chown`, `cat`):
+ - Extract path arguments (skip flags like `-rf`)
+ - Resolve to absolute paths, check workspace boundary
+ - If outside workspace → add dir to `external_directory` permission ask
+ 4. Normalize command to pattern via `BashArity`:
+ - `git checkout main` → `"git checkout *"` (arity 2)
+ - `npm run dev --watch` → `"npm run dev *"` (arity 3)
+ - Unknown commands → fallback to command name only
+ 5. Fire two permission requests:
+ - `external_directory` — for any out-of-workspace paths detected
+ - `bash` — for the command pattern itself
+- [ ] Known gaps (documented, no fix needed for MVP):
+ - `find -exec rm`, `xargs`, `sudo` — can't see through these wrappers
+ - `$(...)` substitution paths — dynamic, skipped (still prompts for the command)
+ - Variable-stored paths — skipped
+ - Parse failure → hard error (no fallback), tool call aborts
+- [ ] Apply permission checks to existing file tools (`read_file`, `write_file`, `list_files`)
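
Step 4 of the pipeline (normalizing a command to a pattern) might look something like the sketch below. The arity table contents and the name `normalizeCommand` are assumptions made for illustration; the plan only names `BashArity` and gives the three example mappings. The whitespace split is also a stand-in, since the real pipeline would take tokens from the tree-sitter AST.

```typescript
// Hypothetical arity table: key = command prefix, value = number of leading
// tokens that form the stable part of the pattern.
const ARITY = new Map<string, number>([
  ["git", 2],
  ["npm", 2],
  ["npm run", 3],
]);

function normalizeCommand(command: string): string {
  // Naive tokenization for illustration only; quoting is not handled.
  const tokens = command.trim().split(/\s+/);

  // Prefer the longest known prefix ("npm run" before "npm").
  let arity = 1; // unknown commands fall back to the command name only
  for (let n = Math.min(2, tokens.length); n >= 1; n--) {
    const known = ARITY.get(tokens.slice(0, n).join(" "));
    if (known !== undefined) {
      arity = known;
      break;
    }
  }

  const head = tokens.slice(0, arity);
  // Append a wildcard only when arguments were actually cut off.
  return tokens.length > head.length ? `${head.join(" ")} *` : head.join(" ");
}

console.log(normalizeCommand("git checkout main"));   // git checkout *
console.log(normalizeCommand("npm run dev --watch")); // npm run dev *
console.log(normalizeCommand("rm -rf dist"));         // rm *
```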
+
+### Frontend
+
+- [ ] Permission prompt modal:
+ - Agent name and AI's description of the operation
+ - For file ops: target path, operation type (read / write / execute)
+ - For shell commands: `$ command text` display with normalized "always allow" pattern preview
+ - For external directories: directory path and glob pattern
+ - Buttons: Approve / Deny / Always Allow
+ - "Always Allow" shows secondary confirmation listing the patterns that will be permanently approved
+- [ ] Permission log panel: scrollable history of grants and denials
+- [ ] Shell output display in chat: stdout/stderr with monospace formatting, exit code indicator
+- [ ] Visual distinction between tool calls (file ops vs shell commands)
+
+### Done When
+
+Ask the agent to "run the test suite." It executes `npm test` in the project dir (allowed). Then ask it to "check what's in /etc/hosts." Permission prompt appears showing the target path. You approve. It reads the file and reports back. Next time it tries `/etc/`, it remembers your per-session grant.
+
+Ask the agent to "clean up build artifacts with rm -rf dist." Permission prompt shows `$ rm -rf dist` with "always allow" pattern `"rm *"`. You click "Always Allow." Future `rm` commands auto-approve.
+
+---
+
+## Phase 3: Config + Skills + Model Groups
+
+**Goal:** YAML-driven agent templates, skills auto-loading from directory structure, multi-provider model groups with key budgets, fallback chains, and wait-on-exhaustion.
+
+**Effort:** 2-3 weeks
+
+### Backend — Config System
+
+- [ ] Full `dispatch.yaml` config loader:
+ - Agent templates: name, description, system prompt, tools, permissions, model group
+ - Model definitions with tags
+ - Key definitions with budget limits
+ - Fallback order
+ - Permission auto-allow list (already from Phase 2, now in full config)
+- [ ] Config validation on load (clear errors for missing fields, bad references)
+- [ ] Hot-reload: watch `dispatch.yaml` for changes, apply without restart
+
+### Backend — Model Groups + Key Management
+
+- [ ] Model tag system: each model has a list of tags (`heavy`, `medium`, `light`, `coding`, `review`, etc.)
+- [ ] Tag resolution: agent requests a tag -> system finds the best available model matching that tag, respecting fallback order
+- [ ] Key budget tracking:
+ - Track token usage and/or cost per key
+ - Configurable budget limits (per-month, per-day, or total)
+- [ ] Key fallback chain:
+ - Use highest-priority key first
+ - On exhaustion, switch to next key in chain
+ - Log the switch
+- [ ] Key exhaustion wait:
+ - When ALL keys for an agent are exhausted, agent enters wait state
+ - Poll for key availability on configurable interval
+ - Resume with whichever key refreshes first
+ - Per-agent: other agents with available keys continue running
+ - Preserve full agent context across the wait
+- [ ] API endpoints:
+ - `GET /config` — current config state
+ - `GET /models` — available models, tags, key status, budget remaining
+ - `GET /models/resolve?tag=heavy` — which model would be selected for a tag right now
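
As a rough sketch of the fallback chain and the exhaustion case, assuming keys are held in priority order and that a budget is a single number in whatever unit is being tracked: the `ApiKey` shape and `selectKey` are hypothetical names, not part of the plan.

```typescript
interface ApiKey {
  id: string;
  budget: number; // configured limit (tokens or cost; unit is an assumption)
  used: number;   // usage tracked so far in the same unit
}

// Walk the chain in priority order and return the first key with budget left.
// A null result means every key is exhausted, which is the point where the
// agent would enter its wait state and start polling.
function selectKey(chain: ApiKey[]): ApiKey | null {
  for (const key of chain) {
    if (key.used < key.budget) {
      return key;
    }
  }
  return null;
}
```

Calling `selectKey` again on each poll interval would naturally resume with whichever key refreshes first, as long as the refresh resets `used`.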
+
+### Backend — Skills System
+
+- [ ] Skills directory loader (both levels):
+ ```
+ ~/.skills/
+ default/ # Auto-loaded for all agents globally
+ agents/ # Agent-type mappings
+ project/ # Available to any project, manually activated
+
+ <project>/.skills/
+ default/ # Auto-loaded for agents in this project
+ agents/ # Agent-type mappings for this project
+ project/ # Available in this project, manually activated
+ ```
+- [ ] Markdown skill files with YAML frontmatter (name, description, tags)
+- [ ] Agent mapping files in `agents/`:
+ - `<name>.txt` — maps skills to a subagent type
+ - `<name>.o.txt` — maps skills to an orchestrator type
+ - File contents: list of skill filenames to activate
+- [ ] Loading order:
+ 1. Global `default/` skills
+ 2. Project `default/` skills
+ 3. Agent-specific skills from `agents/` mappings (global then project)
+ 4. Manually activated `project/` skills on demand
+- [ ] Scope disambiguation: `global:skill-name` vs `project:skill-name` when both exist
+- [ ] Hot-reload: watch skills directories for changes via chokidar
+- [ ] API endpoints:
+ - `GET /skills` — all loaded skills, organized by scope and directory
+ - `GET /skills/:name` — skill content
+
+### Backend — Task List Tool
+
+- [ ] `task_list` tool available to all agents:
+ - `add(title, description)` — returns task ID
+ - `update(task_id, status)` — status: pending, in_progress, done, blocked
+ - `list()` — returns current task state
+ - `get(task_id)` — returns task details
+- [ ] Task list persists with agent state (survives context compaction and key exhaustion waits)
+- [ ] Parent agents can read child agent task lists
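
A minimal in-memory sketch of the four operations listed above. The id format, the initial `pending` status, and the choice to throw on an unknown id are assumptions; persistence with agent state is left out.

```typescript
type TaskStatus = "pending" | "in_progress" | "done" | "blocked";

interface Task {
  id: string;
  title: string;
  description: string;
  status: TaskStatus;
}

class TaskList {
  private tasks = new Map<string, Task>();
  private nextId = 1;

  // Returns the new task's id. New tasks start as "pending" (assumption).
  add(title: string, description: string): string {
    const id = `task-${this.nextId++}`;
    this.tasks.set(id, { id, title, description, status: "pending" });
    return id;
  }

  update(taskId: string, status: TaskStatus): void {
    const task = this.tasks.get(taskId);
    if (!task) {
      throw new Error(`unknown task: ${taskId}`);
    }
    task.status = status;
  }

  // Tasks in insertion order.
  list(): Task[] {
    return Array.from(this.tasks.values());
  }

  get(taskId: string): Task | undefined {
    return this.tasks.get(taskId);
  }
}
```

Since `list()` returns plain objects, a parent agent could read a child's progress from the serialized list without loading the child's conversation.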
+
+### Frontend
+
+- [ ] Config viewer panel:
+ - Agent templates: name, description, model group, permissions
+ - Model groups: which models have which tags
+ - Key status: active / exhausted / waiting, budget used / remaining
+- [ ] Key/model status visualization:
+ - Per-key budget bar (used / remaining)
+ - Current fallback position indicator
+ - "Waiting for refresh" state with estimated time if known
+- [ ] Skills browser:
+ - Tree view organized by scope (global / project) and directory (default / agents / project)
+ - Click a skill to see its content
+ - Show which skills are mapped to which agent types
+- [ ] Hot-reload indicator: visual flash when config or skills change on disk
+- [ ] Task list view: show current agent's task list with status indicators
+
+### Done When
+
+You have a `dispatch.yaml` with two API keys (Anthropic + OpenAI), model groups tagged `heavy` and `light`, and a $5 budget on each key. Skills are loading from `.skills/default/`. You chat with the agent — it uses the Anthropic key until the budget drains, switches to OpenAI, drains that, then shows "waiting for key refresh" in the UI. You leave it overnight. In the morning, the key has refreshed and the task completed.
+
+---
+
+## Phase 4: Agent Spawning + Tree UI
+
+**Goal:** Agents can spawn child agents with defined context, model, and permissions. Full hierarchy visible in real-time. User can message any agent.
+
+**Effort:** 2-3 weeks
+
+### Backend
+
+- [ ] `summon_agent` tool:
+ - Parameters: task description, context (text and/or skill names), model tag or specific model, permission set, `detached` flag
+ - Returns an agent handle (ID, status)
+ - Parent can summon multiple children concurrently
+ - Child inherits project working directory but gets its own conversation context
+ - **Detached mode** (`detached: true`):
+ - Child agent gets a direct user-facing conversation channel
+ - It can ask the user questions, request clarification, and wait for input
+ - Child may spawn its own subagents (leaf workers, not further detached)
+ - Child reports results back to parent when its task is complete
+ - Parent continues running while detached child is active
+ - User sees detached child as a separate conversation thread in the UI
+- [ ] Permission enforcement:
+ - Agent can only use `summon_agent` if it has `summon_subagents` permission
+ - Child agent's permissions cannot exceed parent's permissions
+ - Parent defines child's permissions explicitly at spawn time
+- [ ] Agent tree data structure:
+ - Parent-child relationships
+ - Per-agent: status (running / waiting / done / error / waiting_for_key), model, permissions, task list
+ - Tree updates broadcast via WebSocket
+- [ ] Parent-child communication:
+ - Child results flow back to parent as tool call results
+ - Parent can read child's task list for progress without consuming full conversation
+ - Parent waits for child completion (or can check status asynchronously)
+- [ ] User-to-agent messaging:
+ - `POST /agents/:id/message` — queue a message for a specific agent
+ - Message delivered at the agent's next tool boundary
+ - Agent acknowledges and incorporates the message
+- [ ] Agent lifecycle management:
+ - Running: actively processing
+ - Waiting: blocked on child agents or user input
+ - Waiting for key: all keys exhausted, polling for refresh
+ - Done: completed, results available to parent
+ - Error: failed, error details available
+ - Cleanup: terminate child processes on completion or error
+- [ ] Conflict prevention:
+ - Not enforced by the system — this is the orchestrator agent's responsibility
+ - Orchestrator skills should instruct the agent to assign non-overlapping file scopes to children
+ - The system provides the tools; the skills provide the discipline
+
+### Frontend
+
+- [ ] Agent tree panel (sidebar or split view):
+ - Collapsible tree showing full hierarchy
+ - Per-agent: name/task summary, status icon, model badge
+ - Real-time updates (new agent appears, status changes, agent completes)
+ - Click any agent to view its chat stream
+- [ ] Agent detail view:
+ - Chat/output stream for the selected agent
+ - Metadata: model, permissions, parent agent, loaded skills, detached status
+ - Task list for this agent
+ - "Send message" input for user-to-agent injection (always available for detached agents, available for any agent via message routing)
+- [ ] Detached orchestrator support:
+ - Detached agents appear as separate conversation threads alongside the main dispatch thread
+ - User can switch between the dispatch conversation and any active detached orchestrator
+ - Notifications when a detached orchestrator is waiting for user input
+ - When a detached orchestrator completes, results flow back to the parent and the thread becomes read-only
+- [ ] Permission prompts now show which specific agent is requesting access
+- [ ] Tree-level status summary: total agents, running, waiting, done, errors
+- [ ] Visual indicators for key exhaustion: which agents are waiting for keys vs actively running
+
+### Done When
+
+You tell the dispatch agent: "Plan the authentication system for this project." The dispatch agent spawns a planning orchestrator in **detached** mode. The orchestrator opens its own conversation thread in the UI. It asks you: "Should this support OAuth, JWT, or both?" You answer. It asks about session duration. You clarify. Once it has enough input, it writes the plan, reports back to the dispatch agent, and its thread becomes read-only. Meanwhile you were still chatting with the dispatch agent about other things.
+
+You tell the dispatch agent: "Research how authentication works in this codebase and write a summary." The agent (given orchestration skills) spawns a research subagent to search the code and a writing subagent to draft the summary. You see both appear in the tree panel. Click into the research agent — watch it grep files. Click into the writer — it's waiting for the researcher to finish. Researcher completes, results flow to the orchestrator, orchestrator hands context to the writer, writer produces the summary, orchestrator delivers it back to you. Send a message to the writer mid-task: "focus on OAuth specifically." It acknowledges and adjusts.
+
+---
+
+## Phase 5: Session Management
+
+**Goal:** Full session persistence. Close the browser, come back tomorrow, pick up where you left off. Fork conversations to try different approaches.
+
+**Effort:** 1-2 weeks
+
+### Backend
+
+- [ ] SQLite schema:
+ - Sessions: id, project path, created_at, updated_at, metadata
+ - Messages: session_id, agent_id, role, content, tool_calls, timestamp
+ - Agent snapshots: session_id, agent_id, parent_id, config, status, task_list
+- [ ] Auto-save: persist every message and tool result as it happens
+- [ ] Resume: load a session, restore conversation context for the dispatch agent
+ - Note: child agents are NOT resumed (they completed or were terminated)
+ - Conversation history is restored so the agent has full context
+- [ ] Fork: create a new session branching from any message in an existing session
+ - Copies conversation up to the fork point
+ - New session diverges from there
+- [ ] Model switching: change the model for any agent mid-session
+ - Context preserved, next LLM call uses the new model
+- [ ] Session search: query by date range, project, content keywords
+- [ ] API endpoints:
+ - `GET /sessions` — list sessions with metadata
+ - `GET /sessions/:id` — full session data
+ - `POST /sessions/:id/resume` — resume a session
+ - `POST /sessions/:id/fork?at=message_id` — fork from a point
+ - `PATCH /agents/:id/model` — switch model for an agent
+ - `GET /sessions/search?q=...` — search sessions
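
The fork operation reduces to copying a prefix of the message list. The sketch below assumes the fork point is inclusive (the message named by `at` is copied into the new session) and uses a simplified `Message` shape; both are assumptions, as is the name `forkMessages`.

```typescript
interface Message {
  id: string;
  role: string;
  content: string;
}

// Copy the conversation up to and including the fork point. The copies are
// shallow clones so the new session can diverge without touching the original.
function forkMessages(messages: Message[], atMessageId: string): Message[] {
  const index = messages.findIndex((m) => m.id === atMessageId);
  if (index === -1) {
    throw new Error(`message not found: ${atMessageId}`);
  }
  return messages.slice(0, index + 1).map((m) => ({ ...m }));
}
```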
+
+### Frontend
+
+- [ ] Session sidebar:
+ - List of past sessions with metadata (date, project, message count, cost)
+ - Search/filter bar
+ - "New session" button
+- [ ] Resume: click a past session to load and continue
+- [ ] Fork: right-click or button on any message -> "Fork from here"
+ - Opens a new session tab branching from that point
+- [ ] Model switcher: dropdown per agent to change models
+- [ ] Session cost summary: total tokens, estimated cost, breakdown by key/provider
+- [ ] Active session indicator: which session you're currently in
+
+### Done When
+
+You've been working on a task for an hour. Close the browser tab. Open it again. Click the session in the sidebar — full conversation loads, you continue from where you left off. Go back to message #5, click "Fork," try a completely different approach without losing the original.
+
+---
+
+## Phase 6: LSP Integration
+
+**Goal:** Agents can access real compiler/linter diagnostics via Language Server Protocol.
+
+**Effort:** 1-2 weeks
+
+### Backend
+
+- [ ] LSP client manager:
+ - Spawn language server processes (e.g., `typescript-language-server --stdio`)
+ - Manage lifecycle: start, initialize, monitor, restart on crash
+ - One server per language per project, shared across all agents
+- [ ] Auto-detection: inspect project files to determine language(s)
+ - `tsconfig.json` / `package.json` -> TypeScript
+ - `pyproject.toml` / `setup.py` -> Python
+ - `go.mod` -> Go
+ - etc.
+- [ ] Manual config overrides in `dispatch.yaml`:
+ ```yaml
+ lsp:
+ servers:
+ typescript:
+ command: typescript-language-server
+ args: [--stdio]
+ python:
+ command: pylsp
+ ```
+- [ ] `diagnostics` tool for agents:
+ - `get_diagnostics(file?)` — returns current errors/warnings, optionally filtered to a file
+ - `get_diagnostics_summary()` — count of errors/warnings across workspace
+- [ ] File sync: notify LSP when agents modify files (via `textDocument/didChange` or `textDocument/didOpen`)
+- [ ] API endpoints:
+ - `GET /lsp/status` — which servers are running
+ - `GET /lsp/diagnostics` — current diagnostics
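
Auto-detection can be sketched as a lookup from marker files to languages. The marker list below only covers the files named above; `detectLanguages` is a hypothetical name, and the input is assumed to be the file names found in the project root.

```typescript
// Marker file -> language, taken from the auto-detection list above.
const MARKERS: Array<[string, string]> = [
  ["tsconfig.json", "typescript"],
  ["package.json", "typescript"],
  ["pyproject.toml", "python"],
  ["setup.py", "python"],
  ["go.mod", "go"],
];

// Return each detected language once, in marker-table order.
function detectLanguages(rootFiles: string[]): string[] {
  const present = new Set(rootFiles);
  const found = new Set<string>();
  for (const [marker, language] of MARKERS) {
    if (present.has(marker)) {
      found.add(language);
    }
  }
  return Array.from(found);
}
```

Manual overrides from `dispatch.yaml` would presumably be applied after this step, replacing or adding to the detected set.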
+
+### Frontend
+
+- [ ] Diagnostics panel:
+ - List of current errors/warnings grouped by file
+ - Severity indicators (error / warning / info)
+ - Click to see full diagnostic message
+- [ ] Per-agent diagnostic context: show which errors an agent was given to work on
+- [ ] LSP server status: indicator showing which language servers are running/healthy
+
+### Done When
+
+An agent edits a TypeScript file and introduces a type error. You see the error appear in the diagnostics panel. Another agent (or the same one) calls `get_diagnostics()` and gets the error. It fixes the issue. The diagnostic disappears.
+
+---
+
+## Summary
+
+| Phase | Scope | Effort | Cumulative |
+|---|---|---|---|
+| 1. Single Agent + UI | One agent, chat in browser | 2-3w | 2-3w |
+| 2. Shell Permissions | Rule engine, tree-sitter shell analysis, permission prompts | 2-3w | 4-6w |
+| 3. Config + Skills + Models | YAML config, skills dirs, model groups, key fallback | 2-3w | 6-9w |
+| 4. Spawning + Tree | Multi-agent hierarchy, tree UI, user messaging | 2-3w | 8-12w |
+| 5. Sessions | Persistence, fork, resume, model switch | 1-2w | 9-14w |
+| 6. LSP | Compiler diagnostics for agents | 1-2w | 10-16w |
+
+After Phase 2: usable on real projects.
+After Phase 4: full vision working.
+After Phase 6: feature-complete MVP.
diff --git a/notes/problem.md b/notes/problem.md
new file mode 100644
index 0000000..a29f801
--- /dev/null
+++ b/notes/problem.md
@@ -0,0 +1,79 @@
+# Problem: DeepSeek `reasoning_content` Dropped on Multi-Step Tool Calls
+
+## Symptom
+
+The first LLM call works (model makes a tool call). The second call fails with:
+
+```
+Error from provider (DeepSeek): The `reasoning_content` in the thinking mode must be passed back to the API.
+```
+
+This only happens when `maxSteps > 1` in `streamText` — i.e., when the agent loop calls the LLM a second time after executing a tool.
+
+## Root Cause
+
+A bug in `@ai-sdk/openai-compatible`. The package correctly **receives** `reasoning_content` from DeepSeek's response but silently **drops** it when building the next request.
+
+### The chain of events:
+
+1. **DeepSeek responds** with both `reasoning_content` (chain-of-thought) and `content` (answer) plus tool calls.
+
+2. **`@ai-sdk/openai-compatible` parses the response** and correctly captures `reasoning_content` into the SDK's internal `reasoning` field.
+
+3. **The AI SDK stores it** as a `{ type: "reasoning", text: "..." }` content part on the assistant message — this is correct.
+
+4. **On the next step**, the SDK passes the message history back through `@ai-sdk/openai-compatible`'s `convertToOpenAICompatibleChatMessages()` to serialize it for the API. This function handles assistant message content parts with a switch statement:
+
+ ```js
+ // @ai-sdk/openai-compatible/dist/index.js, lines 90-120
+ for (const part of content) {
+ switch (part.type) {
+ case "text": { /* handled */ break; }
+ case "tool-call": { /* handled */ break; }
+ // NO case "reasoning" — silently dropped!
+ }
+ }
+ ```
+
+5. **The outgoing request** has no `reasoning_content` field. DeepSeek requires it to be echoed back and rejects the request.
+
+### Important: The agent code cannot fix this
+
+The `streamText` function with `maxSteps` manages its own internal multi-step loop. The agent's `toCoreMessages()` is only called once for the initial prompt. The second call to DeepSeek is built entirely inside the SDK — the serialization bug is in `@ai-sdk/openai-compatible`, not in our code.
+
+## Fix Options
+
+### Option A: Patch `@ai-sdk/openai-compatible` (recommended)
+
+Add a `case "reasoning"` branch to `convertToOpenAICompatibleChatMessages()` that writes `reasoning_content` back into the outgoing assistant message:
+
+```js
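+// Inside the switch over `part.type`. Assumes `let reasoningContent;` is
+// declared before the loop over `content`.
+case "reasoning": {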
+ reasoningContent = (reasoningContent ?? "") + part.text;
+ break;
+}
+// Then in the push:
+messages.push({
+ role: "assistant",
+ content: text,
+ reasoning_content: reasoningContent ?? undefined,
+ tool_calls: toolCalls.length > 0 ? toolCalls : void 0,
+ ...metadata
+});
+```
+
+Apply via `bun patch` or `patch-package`. File a bug/PR upstream against `@ai-sdk/openai-compatible`.
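
To make the intended behavior concrete, here is a self-contained sketch of a serializer that keeps reasoning parts. It is a simplified stand-in, not the code in `@ai-sdk/openai-compatible`: the `AssistantPart` and `OutgoingAssistantMessage` shapes and the name `serializeAssistant` are assumptions chosen to mirror the description above.

```typescript
// Simplified stand-ins for the SDK's content parts (assumed shapes).
type AssistantPart =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; args: unknown };

interface OutgoingToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface OutgoingAssistantMessage {
  role: "assistant";
  content: string;
  reasoning_content?: string;
  tool_calls?: OutgoingToolCall[];
}

function serializeAssistant(parts: AssistantPart[]): OutgoingAssistantMessage {
  let text = "";
  let reasoning: string | undefined;
  const toolCalls: OutgoingToolCall[] = [];

  for (const part of parts) {
    switch (part.type) {
      case "text":
        text += part.text;
        break;
      case "reasoning":
        // The branch the patch adds: accumulate instead of dropping.
        reasoning = (reasoning ?? "") + part.text;
        break;
      case "tool-call":
        toolCalls.push({
          id: part.toolCallId,
          type: "function",
          function: { name: part.toolName, arguments: JSON.stringify(part.args) },
        });
        break;
    }
  }

  const message: OutgoingAssistantMessage = { role: "assistant", content: text };
  if (reasoning !== undefined) message.reasoning_content = reasoning;
  if (toolCalls.length > 0) message.tool_calls = toolCalls;
  return message;
}
```

Omitting `reasoning_content` when no reasoning part was present keeps requests to non-thinking models unchanged.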
+
+### Option B: AI middleware to strip reasoning
+
+Use the AI SDK's `wrapLanguageModel` to intercept responses and remove `reasoning` parts before they enter the multi-step history. This avoids the error but loses the chain-of-thought content. Acceptable for Phase 1 since we don't display reasoning in the UI.
+
+### Option C: Switch to a model without thinking mode
+
+Use a DeepSeek model or configuration that doesn't enable thinking mode, if one is available through the OpenCode Zen endpoint. This avoids the problem entirely but limits model capability.
+
+## Affected Files
+
+- Bug location: `node_modules/@ai-sdk/openai-compatible/dist/index.js` (lines 90-120 in `convertToOpenAICompatibleChatMessages`)
+- Our agent code: `packages/core/src/agent/agent.ts` — not the cause, cannot fix it from here
+- Upstream repo: https://github.com/vercel/ai (the `@ai-sdk/openai-compatible` package)
diff --git a/notes/queue-interrupt-reconcile-edge-cases.md b/notes/queue-interrupt-reconcile-edge-cases.md
new file mode 100644
index 0000000..e564895
--- /dev/null
+++ b/notes/queue-interrupt-reconcile-edge-cases.md
@@ -0,0 +1,183 @@
+# The Queue / Interrupt / Reconcile Path Is an Edge-Case Magnet
+
+> Status: living document. Started after the chunk-native frontend rewrite, when
+> three consecutive independent code-review passes each surfaced a *new* Blocker
+> in the same ~40 lines of code. This note explains **why** that area is fragile,
+> catalogs the bugs found so far, states the invariants that must hold, and
+> records the recommended longer-term fix so the next person doesn't relearn it
+> the hard way.
+
+## TL;DR
+
+The frontend turn-completion reconcile (`reconcileSealedTurn` →
+`reloadChunksFromApi`) decides which transient `live` rows to **keep** vs **drop**
+when a turn's durable chunks arrive. It makes that decision from a tangle of
+loosely-coupled signals — `turnId` present/absent, the `queued-` id prefix,
+`queuedMessages` membership, `liveTurnId`, scroll state — that are mutated by
+**six** asynchronous events arriving in non-deterministic order, sometimes from
+**other clients**. Every signal is a *proxy* for the real question ("is this row
+already durable in `chunks`?"), and every bug so far has been a case where the
+proxy disagreed with reality. The result: either a row is **lost** (dropped but
+not yet sealed) or **duplicated/lingers** (kept but already sealed).
+
+**If you touch this code, re-read "Invariants" and "Why it keeps breaking" below,
+and add a test for the exact interleaving you're changing.**
+
+---
+
+## The moving parts
+
+### State (per tab, frontend store `tabs.svelte.ts`)
+- `tab.chunks: ChunkRow[]` — SEALED, durable history (real DB `seq`). Source of truth.
+- `tab.live: ChatMessage[]` — TRANSIENT buffer for the in-flight turn + optimistic
+ / queued user rows not yet folded into `chunks`.
+- `tab.renderGroups` — DERIVED: `groupRowsToMessages(chunks) ++ live`. What the UI shows.
+- `tab.queuedMessages: QueuedMessage[]` — messages the user sent while the agent was
+ busy, awaiting consumption.
+- `tab.liveTurnId` / `tab.currentAssistantId` — the in-flight turn + its streaming row.
+- per-row `turnId` — set once the row is bound to a concrete turn; `undefined` while
+ still optimistic/pending.
+- per-row id convention — a still-pending queued row has id `queued-<queueId>`.
+
+### Events (arrive over WS, order NOT guaranteed relative to each other)
+- `turn-start { turnId }` — fired **once per user-initiated `processMessage`**
+ (`agent-manager.ts`), which persists **exactly one** user chunk row
+ (`explodeUserText`). **A queued message NEVER gets its own `turn-start`.**
+- `text-delta` / tool events — stream content into the live assistant row.
+- `message-queued { messageId }` — a send was queued (this client or another).
+- `message-consumed { messageIds }` — the agent drained queued message(s) **into the
+ running turn**: either injected as a `[USER INTERRUPT]` block inside a tool result,
+ or appended as trailing history (`agent.ts` — the two `dequeueMessages()` sites).
+- `status { idle | running | error }` — note: `idle` fires **before** the DB write.
+- `turn-sealed { turnId }` — fired **after** `flushAssistant` (the durable write).
+ This is the only safe trigger for reconcile.
+
+### The reconcile decision (`reloadChunksFromApi`)
+On `turn-sealed` (deferred while the user is scrolled up, via `pendingReconcileTabs`),
+refetch the chunk window and recompute `live` as `keptLive`:
+
+```
+keptLive = live.filter(m =>
+ (preserveTurnId !== null && m.turnId === preserveTurnId) // a NEWER in-flight turn
+ || (m.turnId === undefined && m.role === "user") // optimistic/queued, not yet sealed
+)
+```
+
+Everything else is assumed already-durable in `chunks` and dropped.
+
+---
+
+## Why it keeps breaking
+
+The reconcile filter answers **"is this live row already in the sealed chunks?"**
+but it has no direct way to know. It infers the answer from `turnId` and the
+`queued-` prefix. That inference is correct **only if tagging is perfectly
+exhaustive and perfectly conservative**:
+
+- **Exhaustive:** every row that *is* (or will be) sealed into a turn's chunks must
+ carry that `turnId` *before the turn seals*. Miss one → it stays "untagged" → kept
+ → **duplicate** (it's also in `chunks`), and it lingers forever because nothing
+ ever tags it later.
+- **Conservative:** every row that is *not* part of any sealed turn (a pending queued
+ message, an optimistic initiator before `turn-start`) must stay untagged. Over-tag
+ one → it looks sealed → dropped → **the user's message vanishes**.
+
+Both failure modes hinge on the *same* `turnId`-presence bit, pulled in opposite
+directions, mutated by events whose ordering we don't control. That is the whole
+problem in one sentence. Add multi-client (events for turns this client never
+initiated) and deferred reconcile (state mutates *while* a reconcile is pending),
+and the interleaving space explodes.
+
+---
+
+## Catalog of bugs found (four bugs across three review passes, same ~40 lines)
+
+| # | Pass | Failure mode | Root cause | Fix |
+|---|------|--------------|-----------|-----|
+| A | 1 | A newer, still-streaming turn got **wiped** when an earlier turn's deferred reconcile flushed. | reconcile blindly cleared the whole live tail. | `preserveTurnId`: keep rows whose `turnId === liveTurnId` when `liveTurnId !== sealedTurnId`. |
+| B | 1 | Optimistic **queued** user bubble **dropped** on reconcile. | filter didn't keep untagged user rows. | keep `turnId === undefined && role === "user"`. |
+| C | 2 | `turn-start` backfill **over-tagged** trailing `queued-` rows with the new turnId → **wiped** on seal. | backfill tagged *every* trailing untagged user row. | skip `queued-` rows; tag only the single most-recent non-queued initiator; then `break`. |
+| D | 3 | **Consumed** interrupt bubble **lingered forever** + **duplicated** the `[USER INTERRUPT]` text in the sealed tool result. | `message-consumed` stripped the `queued-` prefix to a *plain untagged* row, which rule B then preserved on every reconcile. | bind the consumed row to `liveTurnId` so reconcile drops it (collapse to persisted shape). |
+
+Notice the chain reaction: the **rule B fix** ("keep untagged user rows") is the
+hinge that **both C and D** then bent. C was about *creating* wrongly-tagged rows;
+D was about *failing to tag* rows that should have been. Each fix narrowed the
+definition of "untagged ⇒ keep" without ever making it airtight.
+
+---
+
+## Invariants (the contract reconcile depends on)
+
+1. **No loss.** A live row that is not yet represented in `chunks` must survive
+ reconcile. (Pending queued messages, optimistic initiators pre-`turn-start`.)
+2. **No duplicate / no linger.** A live row whose content has been sealed into
+ `chunks` (as a user row OR folded into a tool result) must be dropped on the
+ sealing turn's reconcile.
+3. **Newer turn preserved.** A turn that started streaming *after* the one being
+ reconciled must not be touched; only the sealed turn folds into `chunks`.
+4. **Idempotent + deferral-safe.** Reconcile may run late (after the user scrolls
+ back to the bottom) and may be preceded by arbitrary further events. Re-running it
+ must not violate 1–3. Key reconcile off `turn-sealed` (post-write), never `status:idle`.
+
+Corollaries that are easy to forget:
+- A consumed/interrupt message is sealed **inside a tool result**, not as its own
+ user chunk row — yet its optimistic bubble must still be dropped (invariant 2).
+- A freshly-created tab defaults to `agentStatus: "running"`, so the *first* user
+ send may be treated as queued unless the tab is known-idle.
+
+---
+
+## Recommended longer-term fix (not yet done)
+
+Stop inferring "is this sealed?" from `turnId` presence. Make preservation key off
+**explicit, positive membership** instead of the absence of a tag:
+
+> Keep a live row iff it belongs to a turn we are explicitly preserving
+> (`turnId === preserveTurnId`) **or** it is still a *pending* queued message
+> (its `queueId` is still in `tab.queuedMessages`). Drop everything else.
+
+This flips the dangerous default. Today "untagged ⇒ keep" means *any* tagging gap
+causes a linger/dup (bug D), while any over-tag causes a loss (bug C). With membership-based keep, an untagged row
+that is *not* a pending queued message is dropped — which is correct, because the
+only rows that should outlive a reconcile are (a) the explicitly-preserved newer
+turn and (b) genuinely-pending queue entries, both of which have positive signals.
+It also makes the `turn-start` backfill purely cosmetic (stable render keys), so a
+tagging miss there can no longer lose or duplicate data.
+
+Until that refactor lands, treat this path as **high-touch**: any change needs a
+targeted interleaving test.
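
A sketch of what the membership-based filter could look like, under stated assumptions: the row and queue shapes are reduced to the fields the rule needs, `QueuedMessage.id` is assumed to hold the `queueId` that appears after the `queued-` prefix, and `keepLive` is a hypothetical name rather than the store's actual function.

```typescript
interface LiveRow {
  id: string;
  role: "user" | "assistant";
  turnId?: string;
}

interface QueuedMessage {
  id: string; // assumed to be the queueId used in `queued-<queueId>`
}

const QUEUED_PREFIX = "queued-";

// Keep a row only on a positive signal: it belongs to the explicitly
// preserved newer turn, or it is still a pending queued message.
function keepLive(
  live: LiveRow[],
  preserveTurnId: string | null,
  queuedMessages: QueuedMessage[],
): LiveRow[] {
  const pending = new Set(queuedMessages.map((q) => q.id));
  return live.filter((row) => {
    if (preserveTurnId !== null && row.turnId === preserveTurnId) {
      return true;
    }
    return (
      row.id.startsWith(QUEUED_PREFIX) &&
      pending.has(row.id.slice(QUEUED_PREFIX.length))
    );
  });
}
```

One thing this sketch makes visible: an optimistic initiator that has not yet seen its `turn-start` is neither tagged nor queued, so it would be dropped here. A real implementation would presumably need a positive signal for that row as well to satisfy invariant 1.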
+
+---
+
+## Testing guidance
+
+Property/interleaving coverage beats one-off scenario tests here. Suggested:
+- A randomized sequence generator over `{turn-start, text-delta, message-queued,
+ message-consumed, status, turn-sealed, scrollUp/Down}` with multiple concurrent
+ turns and a simulated second client, asserting invariants 1–3 after every
+ `turn-sealed` + final state.
+- Scenario tests that already exist in `packages/frontend/tests/chat-store.test.ts`
+ and should stay green (named for the bugs above):
+ - deferred reconcile preserves a concurrent newer turn (A)
+ - optimistic queued user message survives an earlier turn's reconcile (B)
+ - `turn-start` backfill skips a pending queued row, tags only the initiator (C)
+ - a consumed interrupt message collapses into the sealed turn — no lingering bubble (D)
+- Always assert **both** directions: the kept row is still present (no loss) AND the
+ sealed rows are not duplicated (no linger).
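
The randomized generator suggested above could start from something as small as the sketch below. It is only the sequence source: it draws event kinds uniformly from the listed alphabet using a seeded PRNG (mulberry32) so that a failing interleaving can be replayed from its seed. It does not enforce protocol validity (for example, `turn-sealed` may appear before `turn-start`), and wiring it to the store and the invariant assertions is left out. All names here are hypothetical.

```typescript
type EventKind =
  | "turn-start"
  | "text-delta"
  | "message-queued"
  | "message-consumed"
  | "status"
  | "turn-sealed"
  | "scrollUp"
  | "scrollDown";

const EVENT_KINDS: EventKind[] = [
  "turn-start",
  "text-delta",
  "message-queued",
  "message-consumed",
  "status",
  "turn-sealed",
  "scrollUp",
  "scrollDown",
];

// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same seed, same sequence: a failing run can be reproduced exactly.
function generateSequence(seed: number, length: number): EventKind[] {
  const rand = mulberry32(seed);
  return Array.from(
    { length },
    () => EVENT_KINDS[Math.floor(rand() * EVENT_KINDS.length)],
  );
}
```

Logging the seed alongside any invariant failure is what turns a flaky interleaving into a named regression test.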
+
+## Pointers
+
+- Frontend store: `packages/frontend/src/lib/tabs.svelte.ts`
+ (`reloadChunksFromApi`, `reconcileSealedTurn`, `handleEvent` cases
+ `turn-start` / `turn-sealed` / `message-queued` / `message-consumed` / `status`,
+ `setScrolledUp`, `pendingReconcileTabs`).
+- Render keying: `packages/frontend/src/lib/components/ChatPanel.svelte`
+ (`${turnId}:${role}:${n}`).
+- Backend event emission: `packages/api/src/agent-manager.ts` (`processMessage`,
+ `dequeueMessages`, `turn-start` / `turn-sealed`); `packages/core/src/agent/agent.ts`
+ (the two `dequeueMessages()` consumption sites).
+- Review records: `notes/gemini-chunk-eviction-review.md` (Pass 1, bugs A+B),
+ `notes/gemini-chunk-eviction-review-2.md` (Pass 2, bug C),
+ `notes/gemini-chunk-eviction-review-3.md` (Pass 3, bug D).
diff --git a/notes/report.md b/notes/report.md
new file mode 100644
index 0000000..3c6d6f7
--- /dev/null
+++ b/notes/report.md
@@ -0,0 +1,38 @@
+# Sidebar Layout Persistence Review
+
+## Verdict
+SHIP
+
+## Block-level findings
+None. The implementation is robust and follows the established patterns for localStorage usage in this codebase.
+
+## Ship-with-followup findings
+None.
+
+## Nits
+None.
+
+## What was checked
+- **A. sidebar-storage.ts correctness**:
+  - `loadSidebarPanels` correctly handles all failure modes (missing key, malformed JSON, non-array types, and mixed-type arrays) without throwing. It returns a defensive shallow copy of the default layout.
+  - `saveSidebarPanels` is best-effort and safely swallows any storage errors (e.g., QuotaExceededError or SecurityError).
+  - The localStorage key `dispatch-sidebar-panels` is correctly namespaced.
+- **B. SidebarPanel.svelte integration**:
+  - Initialization correctly seeds the `panels` state using `loadSidebarPanels()`.
+  - The `$effect` correctly captures all changes to the `panels` array (addition, removal, and selection changes) due to the reactive read in the map function.
+  - Session-ephemeral `id` fields are correctly regenerated and not persisted, avoiding potential collisions across sessions.
+  - The 'minimum 1 panel' invariant is preserved by the existing UI logic (`{#if idx > 0}`) and reinforced by the storage fallback.
+- **C. Test coverage**:
+  - The unit tests in `sidebar-storage.test.ts` are comprehensive, covering initial load, valid round-trips, corruption, malformed data, filtering, and storage exceptions.
+  - Verified mutation isolation (fresh array on each load).
+- **D. Regression risks**:
+  - Add/remove and dropdown handlers are preserved and correctly trigger persistence via state reassignment.
+  - No infinite loops or race conditions identified; the `$effect` is one-way (state -> localStorage).
+- **E. Stylistic / consistency**:
+  - Matches the `dispatch-` prefix pattern seen in `config.ts`.
+  - Documentation and comments are thorough and clear.
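The failure handling described under item A can be sketched as follows. This is a minimal illustration, not the reviewed `sidebar-storage.ts`: it assumes the persisted layout is a JSON array of view-name strings, the `KeyValueStore` interface and the `DEFAULT_PANELS` value are hypothetical, and the store is passed in as a parameter so the sketch runs outside a browser.

```typescript
// Minimal storage surface so the sketch runs outside a browser.
// In the app this role would be played by window.localStorage.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "dispatch-sidebar-panels"; // key named in the review
const DEFAULT_PANELS: readonly string[] = ["agents"]; // hypothetical default layout

// Never throws: every failure mode falls back to a fresh copy of the default.
function loadSidebarPanels(store: KeyValueStore): string[] {
  try {
    const raw = store.getItem(STORAGE_KEY);
    if (raw === null) return [...DEFAULT_PANELS]; // missing key
    const parsed: unknown = JSON.parse(raw); // throws on malformed JSON
    if (!Array.isArray(parsed)) return [...DEFAULT_PANELS]; // non-array value
    // Mixed-type arrays: keep only the string entries.
    const panels = parsed.filter((p): p is string => typeof p === "string");
    return panels.length > 0 ? panels : [...DEFAULT_PANELS];
  } catch {
    return [...DEFAULT_PANELS];
  }
}

// Best-effort: quota or security errors raised by the store are swallowed.
function saveSidebarPanels(store: KeyValueStore, panels: readonly string[]): void {
  try {
    store.setItem(STORAGE_KEY, JSON.stringify(panels));
  } catch {
    // Persistence is optional; the in-memory layout stays authoritative.
  }
}
```

Returning a new array on every load is what gives the mutation isolation mentioned under item C: a caller that pushes onto the result cannot alter the default seen by the next load.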
+
+## What was NOT checked
+- Persistence of the `sidebarOpen` toggle (explicitly out of scope).
+- Drag-to-reorder support (not implemented in the UI).
+- Backend synchronization (feature designed for per-device localStorage).
diff --git a/notes/requirements.md b/notes/requirements.md
new file mode 100644
index 0000000..96b2fb8
--- /dev/null
+++ b/notes/requirements.md
@@ -0,0 +1,353 @@
+# Dispatch - AI Agent Harness Requirements
+
+## Overview
+
+Dispatch is a multi-layered AI agent orchestration harness. The user interacts with a top-level **dispatch** layer, which spawns background **orchestrators** for high-level tasks. Orchestrators in turn spawn parallel **subagents** to execute atomic units of work.
+
+The goal is to enable complex, multi-step software engineering workflows (planning, research, implementation, review) through composable, config-driven agent hierarchies.
+
+## Architecture
+
+### Agent Hierarchy (Emergent, Not Rigid)
+
+The hierarchy is not a fixed three-layer structure enforced in code. Instead, it **emerges naturally from agent permissions and context**. Every agent is the same primitive -- what distinguishes a "dispatch agent" from an "orchestrator" from a "leaf worker" is the permissions and skills it was given when spawned.
+
+```
+User <-> Dispatch Agent (top-level, has all permissions)
+           |
+           +---> Agent A (given orchestration skills + summon permission)
+           |       |-- Agent A1 (given task skills, no summon permission)
+           |       |-- Agent A2 (given task skills, no summon permission)
+           |
+           +---> Agent B (given orchestration skills + summon permission)
+                   |-- Agent B1 (given task skills + summon permission)
+                         |-- Agent B1a (given narrow task skills, no summon)
+```
+
+**When an agent spawns a subagent, the parent defines:**
+- **Context**: What information and skills the child receives
+- **Model/pool**: Which model or model group tag the child should use (e.g., `heavy`, `coding`)
+- **Permissions**: What the child is allowed to do -- run shell commands, summon its own subagents, access specific directories, etc.
+
+This means:
+- An "orchestrator" is just an agent with the `summon_subagents` permission and a skill file that teaches it how to decompose and delegate work
+- A "leaf worker" is just an agent without `summon_subagents` permission
+- Hierarchy depth is unlimited and determined by the agents themselves, not hardcoded
+- The dispatch layer is simply the first agent in the tree, with full permissions
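A rough sketch of the single-primitive idea, under stated assumptions: only `summon_subagents` comes from the requirements text, while the other permission names, the `SpawnOptions` shape, and the `roleOf` helper are hypothetical and exist only to show that a role can be derived from permissions rather than stored as a type.

```typescript
// Only `summon_subagents` is named in the requirements; the rest are illustrative.
type Permission = "summon_subagents" | "run_shell" | "write_files";

interface SpawnOptions {
  name: string;
  context: string[]; // information and skills handed to the child
  modelTag: string; // e.g. "heavy" or "coding"
  permissions: Permission[];
}

interface Agent {
  name: string;
  context: string[];
  modelTag: string;
  permissions: Set<Permission>;
  parent: Agent | null;
  children: Agent[];
}

function createAgent(opts: SpawnOptions, parent: Agent | null): Agent {
  return {
    name: opts.name,
    context: [...opts.context],
    modelTag: opts.modelTag,
    permissions: new Set(opts.permissions),
    parent,
    children: [],
  };
}

// Spawning is gated purely by the parent's permissions, not by an agent "type".
function spawn(parent: Agent, opts: SpawnOptions): Agent {
  if (!parent.permissions.has("summon_subagents")) {
    throw new Error(`${parent.name} lacks the summon_subagents permission`);
  }
  const child = createAgent(opts, parent);
  parent.children.push(child);
  return child;
}

// The role is derived, never stored: it falls out of position and permissions.
function roleOf(agent: Agent): "dispatch" | "orchestrator" | "leaf worker" {
  if (agent.parent === null) return "dispatch";
  return agent.permissions.has("summon_subagents") ? "orchestrator" : "leaf worker";
}
```

Depth is unbounded in this sketch because any child that receives `summon_subagents` can call `spawn` in turn, which matches the Agent B / B1 / B1a branch of the diagram.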
+
+### Communication Model
+
+- **Strict hierarchy**: Subagents report only to the agent that spawned them. No peer-to-peer communication between sibling agents.
+- Each agent communicates with its parent (upward) and its children (downward).
+- The specific transport mechanism (filesystem, IPC, message queue) is an implementation detail left open.
+
+### Detached Orchestrators
+
+The dispatch agent can spawn orchestrators in **detached mode**. A detached orchestrator:
+
+- Runs independently from the dispatch conversation
+- Has its own **direct communication channel to the user** — it can ask questions, request clarification, and wait for user input
+- The user interacts with the detached orchestrator as if it were its own conversation thread
+- The orchestrator may spawn its own subagents (which follow strict hierarchy — reporting only to the orchestrator)
+- When the orchestrator completes its task, it reports results back to the dispatch agent
+
+**Use case:** The dispatch agent spawns a planning orchestrator. The orchestrator opens a conversation with the user, asks clarifying questions about requirements, iterates on the plan with user feedback, and when the plan is finalized, hands it back to the dispatch agent for execution.
+
+A detached orchestrator is simply an agent spawned with `detached: true` — it receives a user-facing channel in addition to its parent channel.
+
+### User-to-Agent Messaging
+
+The user can send messages to any running agent at any time, regardless of where that agent is in its execution. Messages are delivered through tool interfaces -- any tool invocation point doubles as a message reception point.
+
+**Message types:**
+- **Instructions**: Direct the agent to change approach or focus
+- **Corrections**: Fix a misunderstanding or wrong assumption
+- **Context**: Provide additional information the agent lacks
+- **Data**: Supply concrete values, file contents, references, etc.
+
+**Requirements:**
+- Messages can target any agent in the hierarchy (dispatch, orchestrator, or subagent)
+- Delivery must not require the agent to finish its current task first -- the message becomes available at the next tool boundary
+- The agent must acknowledge and incorporate the message into its ongoing work
+- The dispatch layer provides a mechanism for the user to see which agents are active and route messages to them
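One way to picture "any tool invocation point doubles as a message reception point" is a per-agent mailbox that is drained whenever a tool call returns. The sketch below is an assumption about shape, not the harness's actual transport: `Mailbox`, `runTool`, and the synchronous tool signature are all hypothetical.

```typescript
// The four message types listed in the requirements.
type MessageKind = "instruction" | "correction" | "context" | "data";

interface UserMessage {
  kind: MessageKind;
  text: string;
}

// Hypothetical per-agent mailbox: the user can post at any time.
class Mailbox {
  private pending: UserMessage[] = [];

  post(message: UserMessage): void {
    this.pending.push(message);
  }

  // Hands over everything queued so far and leaves the mailbox empty.
  drain(): UserMessage[] {
    const delivered = this.pending;
    this.pending = [];
    return delivered;
  }
}

// Tool boundary: the tool result is returned together with any messages that
// arrived before or during the call, so the agent sees them on its next step.
function runTool<T>(mailbox: Mailbox, tool: () => T): { result: T; messages: UserMessage[] } {
  const result = tool();
  return { result, messages: mailbox.drain() };
}
```

Draining in the same step that returns the tool result means a message is delivered at most once and never waits for the agent's whole task to end.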
+
+### Conflict Prevention
+
+When multiple subagents operate on code, the orchestrator must assign **non-overlapping scopes** (e.g., distinct files or file regions) to each subagent before dispatching them. Orchestrators are responsible for partitioning work to avoid merge conflicts.
+
+## Configuration
+
+### Config-Driven Orchestrators
+
+Orchestrators are defined through configuration files, not hardcoded. A configuration defines:
+
+- **Name and description** of the orchestrator type
+- **System prompt / instructions** for the orchestrator's LLM context
+- **Allowed tool set** for the orchestrator itself
+- **Subagent templates** -- what types of subagents this orchestrator can spawn, with their own tool scopes and prompts
+- **Concurrency limits** -- max parallel subagents
+- **Checkpoint rules** -- which stages require human approval (if any)
+
+### Role-Scoped Tooling
+
+Tools available to each agent are scoped by role:
+- Research subagents: web search, file read, documentation fetch
+- Coding subagents: file read/write, shell execution, code analysis
+- Review subagents: file read, test execution, linting
+- Custom roles define their own tool sets via config
+
+## Skills System
+
+Skills are markdown files (`.md`) containing specialized instructions, context, or workflows that are injected into an agent's context. Skills are organized in a standardized directory structure at two levels: **global** (home directory) and **project-level**.
+
+### Directory Structure
+
+```
+~/.skills/
+  default/    # .md skills auto-loaded for ALL agents globally
+  agents/     # Agent-type mappings (which skills activate for which agent)
+  project/    # .md skills available to any project (manually activated)
+
+<project>/.skills/
+  default/    # .md skills auto-loaded for agents working in this project
+  agents/     # Agent-type mappings specific to this project
+  project/    # .md skills available in this project (manually activated)
+```
+
+### Directories Explained
+
+| Directory | Scope | Loading |
+|-----------|-------|---------|
+| `default/` | All agents at that level | **Auto-loaded** -- always injected into agent context |
+| `project/` | Agents at that level | **Available** -- must be explicitly activated or referenced |
+| `agents/` | Specific agent types | **Mapped** -- defines which skills load for which agent type |
+
+### Agent Mapping Files (`agents/`)
+
+Files in the `agents/` directory map skills to specific agent types. The filename encodes the agent name and tier:
+
+- `<name>.txt` -- maps to a **subagent** of that name
+- `<name>.o.txt` -- maps to an **orchestrator** of that name
+
+File contents list skill filenames (from `default/` or `project/`) to activate for that agent type.
+
+**Examples:**
+```
+# agents/coding.txt (subagent)
+git-conventions.md
+code-style.md
+
+# agents/research.o.txt (orchestrator)
+search-strategy.md
+source-evaluation.md
+```
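The filename convention above could be parsed along these lines. This is a sketch only: the `AgentMapping` shape and `parseAgentMapping` name are hypothetical, and treating lines that start with `#` as comments is an assumption drawn from the labels in the example, not a stated rule.

```typescript
interface AgentMapping {
  agent: string;
  tier: "subagent" | "orchestrator";
  skills: string[];
}

// `<name>.txt` maps to a subagent, `<name>.o.txt` to an orchestrator.
// Returns null for filenames that follow neither pattern.
function parseAgentMapping(filename: string, contents: string): AgentMapping | null {
  const match = /^(.+?)(\.o)?\.txt$/.exec(filename);
  if (match === null) return null;
  const agent = match[1];
  if (agent === undefined) return null;

  const skills = contents
    .split(/\r?\n/)
    .map((line) => line.trim())
    // Assumption: blank lines and `#` lines are not skill names.
    .filter((line) => line !== "" && line.charAt(0) !== "#");

  return {
    agent,
    tier: match[2] !== undefined ? "orchestrator" : "subagent",
    skills,
  };
}
```

For instance, `parseAgentMapping("research.o.txt", "search-strategy.md\nsource-evaluation.md")` yields an orchestrator mapping for `research` with two skills.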
+
+### Scope and Precedence
+
+- Global skills (`~/.skills/`) are available to all projects.
+- Project skills (`<project>/.skills/`) are available only within that project.
+- When a skill with the same name exists at both levels, **both are retained and distinguishable by scope**. References can disambiguate using a scope prefix (e.g., `global:code-style` vs `project:code-style`).
+- `default/` skills at both levels stack: global defaults and project defaults are both auto-loaded.
+
+### Loading Order
+
+1. Global `default/` skills are loaded first
+2. Project `default/` skills are loaded next
+3. Agent-specific skills from `agents/` mappings are loaded (global then project)
+4. Manually activated `project/` skills are loaded on demand
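The four-step order can be written down as a small pure function. The `SkillSources` shape is hypothetical, the `scope:name` form follows the `global:code-style` example above, and dropping a repeated reference on its second appearance is an assumption the requirements do not spell out.

```typescript
type Scope = "global" | "project";

// Hypothetical input: skill names already discovered on disk, per scope.
interface SkillSources {
  defaults: Record<Scope, string[]>; // contents of default/
  agentMapped: Record<Scope, string[]>; // skills listed in agents/ for this agent type
  activated: { scope: Scope; name: string }[]; // manually activated project/ skills
}

// Returns scope-qualified references in load order.
function skillLoadOrder(sources: SkillSources): string[] {
  const ordered: string[] = [];
  const seen = new Set<string>();
  const add = (scope: Scope, name: string): void => {
    const ref = `${scope}:${name}`;
    if (!seen.has(ref)) {
      seen.add(ref);
      ordered.push(ref);
    }
  };

  for (const name of sources.defaults.global) add("global", name); // step 1
  for (const name of sources.defaults.project) add("project", name); // step 2
  for (const name of sources.agentMapped.global) add("global", name); // step 3, global first
  for (const name of sources.agentMapped.project) add("project", name); // step 3, then project
  for (const skill of sources.activated) add(skill.scope, skill.name); // step 4
  return ordered;
}
```

Because references carry their scope, a `code-style` skill present at both levels appears twice, once as `global:code-style` and once as `project:code-style`, which is the "both are retained" behavior described under Scope and Precedence.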
+
+## LSP Integration
+
+Agents have access to Language Server Protocol diagnostics for the project they are operating in.
+
+### Capabilities
+
+- **Primary use case**: Real-time compiler/linter diagnostics and errors. Agents receive ground-truth error information from the language toolchain rather than inferring errors from output or guessing.
+- Agents can query the LSP for diagnostics on specific files or the entire workspace.
+- Diagnostics are available as a tool that any agent with appropriate scope can invoke.
+
+### Configuration
+
+- **Auto-detection**: The system detects the project language(s) and starts appropriate LSP servers automatically (similar to how an IDE discovers and launches language servers).
+- **Manual overrides**: A project-level config file can specify custom LSP server commands, initialization options, and settings. Manual config takes precedence over auto-detected defaults.
+- LSP servers are managed as long-lived background processes, shared across agents operating on the same project.
+
+## Filesystem and Shell Access
+
+### General Shell Access
+
+Agents granted shell permission have access to a general-purpose shell for running commands. This is not restricted to a predefined set of tools -- such agents can execute arbitrary shell commands.
+
+### Directory Permissions
+
+- Agents may freely read and write within the **current working directory** (the project root) and its subdirectories.
+- **Any access to directories outside the current working directory requires explicit user permission.** When an agent attempts to read, write, or execute in an external directory, the system prompts the user for approval.
+- **Auto-allow list**: A configurable list of directories that are pre-approved for access without prompting. Defined in the project or global config.
+
+```
+# Example config
+permissions:
+  auto_allow:
+    - /tmp
+    - ~/.config/dispatch
+    - /usr/local/share/data
+```
+
+- Permission prompts include: the agent requesting access, the target path, and the operation (read/write/execute).
+- Permissions can be granted per-request, per-session, or permanently (added to auto-allow).
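The directory rule can be sketched as a pure path check. Several simplifications are assumed here and would need revisiting in a real implementation: paths are absolute POSIX paths, `~` has already been expanded, symlinks are not resolved, and the function names are hypothetical.

```typescript
// Collapse "." and ".." segments so "/proj/../etc" cannot pass as inside "/proj".
function normalizePath(path: string): string {
  const parts: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return "/" + parts.join("/");
}

// True when target is dir itself or lies beneath it. Comparing against
// "dir/" (with the separator) keeps "/tmpfoo" from matching "/tmp".
function isWithin(dir: string, target: string): boolean {
  const base = normalizePath(dir);
  const candidate = normalizePath(target);
  if (candidate === base) return true;
  const prefix = base === "/" ? "/" : base + "/";
  return candidate.startsWith(prefix);
}

type AccessDecision = "allow" | "prompt";

// Inside the working directory or an auto-allowed directory: no prompt.
// Anything else must go to the user for approval.
function checkAccess(cwd: string, autoAllow: string[], target: string): AccessDecision {
  if (isWithin(cwd, target)) return "allow";
  for (const dir of autoAllow) {
    if (isWithin(dir, target)) return "allow";
  }
  return "prompt";
}
```

With `cwd = "/home/me/proj"` and `autoAllow = ["/tmp"]`, a read of `/tmp/build.log` is allowed while `/home/me/proj/../.ssh/id_rsa` normalizes to a path outside the project and produces a prompt.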
+
+## Session Management
+
+### Chat Forking
+
+The user can fork the current dispatch conversation at any point, creating a new branch from that moment in the chat history. Forking applies to the **dispatch-level conversation only** -- active orchestrators and subagents are not duplicated into the fork. The original session continues unaffected.
+
+### Model Switching
+
+The user can switch the LLM model for **any active agent** in the hierarchy mid-session:
+- Switch the dispatch agent's model during a conversation
+- Switch an orchestrator's or subagent's model while it is running
+- The agent continues with its existing context under the new model
+
+Model switches take effect immediately. Prior context is preserved and passed to the new model.
+
+### Chat History and Resumption
+
+All dispatch-level conversations are persisted and can be loaded later to continue where the user left off. Loading an old chat restores the **conversation history only** -- background orchestrators and subagents from the original session are not resumed.
+
+Loaded chats can be:
+- Continued with new messages
+- Forked from any point in the history
+- Searched/filtered by date, topic, or content
+
+## Human-in-the-Loop
+
+The system supports **configurable checkpoints** where execution pauses for human approval. Examples:
+
+- Approve a generated plan before implementation begins
+- Review proposed code changes before they are written
+- Confirm destructive operations (file deletions, large refactors)
+
+Checkpoints are configurable per orchestrator type. They can be enabled, disabled, or set to auto-approve with a timeout.
+
+## State Persistence
+
+The system persists state across sessions:
+
+- **Plans**: Generated plans are saved and can be resumed
+- **Research artifacts**: Research findings are stored for reuse
+- **Session state**: Interrupted orchestrators can be resumed from their last checkpoint
+- **History**: Past dispatches and their outcomes are queryable
+
+## Observability
+
+Basic logging is required:
+- Agent activity logs (what each agent did and when)
+- Error reporting with context
+- Optional: token usage tracking, cost estimates, decision traces
+
+Observability is a secondary concern -- basic logging is sufficient initially, with hooks for richer tracing later.
+
+## LLM Provider
+
+The system is **provider-agnostic**. It defines an abstract LLM interface that can be backed by any provider (Anthropic, OpenAI, local models, OpenRouter, etc.). Provider selection is configurable per agent or globally.
+
+### Key and Model Hierarchy
+
+Multiple API keys and models can be configured with a **fallback hierarchy**. When a key's quota or budget is exhausted, the system automatically falls through to the next key or model in the hierarchy.
+
+**Fallback triggers:**
+- API key quota or budget exhausted (daily, monthly, or total spend limits)
+
+**Configuration:**
+- Each API key has a configurable budget/quota limit. When reached, the system moves to the next key in the fallback chain.
+- Fallback chains are ordered lists: the system tries the first key/model, and on exhaustion moves to the second, and so on.
+- Fallback can cross providers (e.g., exhaust an Anthropic key, fall back to an OpenAI key).
+
+### Model Groups and Tags
+
+Models are organized into **groups** using tags. Tags allow orchestrators and subagents to request a model by capability rather than by name.
+
+**Built-in group tiers:**
+- `heavy` -- largest, most capable models (e.g., Claude Opus, GPT-4.5)
+- `medium` -- balanced capability/cost (e.g., Claude Sonnet, GPT-4o)
+- `light` -- fast, cheap models for simple tasks (e.g., Claude Haiku, GPT-4o-mini)
+
+**Task-specific tags:**
+- `coding` -- models best suited for code generation and editing
+- `review` -- models suited for code review and analysis
+- `research` -- models suited for search and synthesis
+- Custom tags can be defined freely in config
+
+**Resolution:** A model can have multiple tags (e.g., a model tagged `heavy, coding`). When an agent requests a tag, the system resolves it to the best available model matching that tag, respecting the key fallback hierarchy.
+
+```
+# Example config
+models:
+  keys:
+    - provider: anthropic
+      key: ${ANTHROPIC_KEY_1}
+      budget: $50/month
+      models:
+        claude-opus-4:
+          tags: [heavy, coding, review]
+        claude-sonnet-4:
+          tags: [medium, coding, research]
+    - provider: openai
+      key: ${OPENAI_KEY_1}
+      budget: $30/month
+      models:
+        gpt-4.5:
+          tags: [heavy, coding]
+        gpt-4o:
+          tags: [medium, research]
+        gpt-4o-mini:
+          tags: [light]
+
+  fallback_order:
+    - ${ANTHROPIC_KEY_1}
+    - ${OPENAI_KEY_1}
+```
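Tag resolution against a fallback chain might look like the sketch below. The types and the `resolveTag` name are hypothetical, budget tracking is reduced to a set of exhausted key ids, and "best available model" is interpreted as the first matching model in config order, which is an assumption since the requirements do not define "best".

```typescript
interface KeyConfig {
  id: string; // stands in for the key reference used in fallback_order
  provider: string;
  models: Record<string, string[]>; // model name -> tags
}

interface Resolution {
  keyId: string;
  provider: string;
  model: string;
}

// Walk the fallback chain in order, skip exhausted keys, and return the first
// model carrying the requested tag. null means nothing is currently usable,
// which is where an agent would enter its wait state.
function resolveTag(
  keys: KeyConfig[],
  fallbackOrder: string[],
  exhausted: Set<string>,
  tag: string,
): Resolution | null {
  for (const keyId of fallbackOrder) {
    if (exhausted.has(keyId)) continue;
    const key = keys.find((k) => k.id === keyId);
    if (key === undefined) continue;
    for (const model of Object.keys(key.models)) {
      const tags = key.models[model];
      if (tags !== undefined && tags.indexOf(tag) !== -1) {
        return { keyId, provider: key.provider, model };
      }
    }
  }
  return null;
}
```

Against the example config, a request for `coding` resolves to `claude-opus-4` while the Anthropic key has budget, and to `gpt-4.5` once that key is marked exhausted, so fallback crosses providers without the caller naming a model.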
+
+### Key Exhaustion Behavior
+
+When all configured keys for an agent's task are exhausted, the agent **does not fail**. Instead it enters a **wait state**, polling until any of its configured keys becomes available again, then resumes automatically from where it left off.
+
+**Behavior:**
+- Keys are used in priority order (highest priority first)
+- When the active key is exhausted, the system immediately falls through to the next key
+- When ALL configured keys are exhausted, the agent sleeps and polls for the first key to refresh
+- Whichever key refreshes first is used to resume -- priority order applies again from that point
+- Waiting is **per-agent**: other agents in the tree that still have available keys continue running unaffected
+- The agent's context and state are preserved across the wait -- resumption is seamless
+
+**Use case:** A complex overnight task drains all keys. The system sleeps until a rate window resets (e.g., a 5-hour cooldown expires), then picks up automatically. The user wakes up to a completed task.
+
+## Interface
+
+The system is **API-first** (HTTP + WebSocket) with an HTML frontend built alongside the backend from day one. The frontend is the primary testing and interaction surface.
+
+The core exposes a programmatic API that additional interfaces can be built on top of later:
+- Interactive CLI (REPL)
+- Command-based CLI
+- TUI
+- Desktop app
+
+## Language/Runtime
+
+**TypeScript / Node.js.** Chosen for:
+- Rich LLM SDK ecosystem (Vercel AI SDK, pi-ai, cline/llms)
+- Strong async/streaming support
+- Large pool of AI tooling libraries
+- Same language for backend and frontend
+
+**Library strategy:** Use existing battle-tested libraries heavily (Vercel AI SDK for LLM, etc.). Focus custom work on the novel parts -- hierarchy, orchestration, skills, permissions.
+
+## Key Design Principles
+
+1. **Emergent hierarchy** -- Agents are a single primitive. "Orchestrators" and "workers" emerge from the permissions and skills given at spawn time
+2. **Composability** -- Agent templates and skills are building blocks, combined via config
+3. **Parallelism** -- Subagents run concurrently by default; parent agents manage fan-out/fan-in
+4. **Isolation** -- Each agent operates in a scoped context with scoped tools and permissions
+5. **Resumability** -- Work can be interrupted and resumed, including across key exhaustion waits
+6. **Extensibility** -- New agent types, tool sets, and providers added via config, not code changes
diff --git a/notes/tool-runner-duplication-incident.md b/notes/tool-runner-duplication-incident.md
new file mode 100644
index 0000000..188110d
--- /dev/null
+++ b/notes/tool-runner-duplication-incident.md
@@ -0,0 +1,147 @@
+# Incident Report: Duplicated Tool Output in Dispatch Harness
+
+**Date:** 2025-05-30
+**Component:** Dispatch AI harness — tool runner / tool-result delivery
+**Severity:** Medium (no data corruption, but severe context pollution and wasted tokens/latency)
+**Status:** Observed, not root-caused
+
+---
+
+## Summary
+
+During the exploration phase of a code task (disabling the todo/tasks skill),
+a single assistant turn that batched a large number of independent,
+read-only tool calls came back with **massively duplicated tool results**.
+The same `read_file`, `list_files`, and `run_shell` invocations were echoed
+back dozens of times each, producing a multi-thousand-line "wall" of
+near-identical output for what should have been a few dozen distinct results.
+
+The actual file edits performed later in the task landed correctly — this was
+an **output/delivery problem**, not a state-corruption problem. But the
+duplication wasted a large amount of context window, made the transcript very
+hard to read, and could plausibly cause an agent to lose track of its plan or
+hit context limits on a larger task.
+
+---
+
+## What I observed
+
+I emitted **one** assistant message containing a large parallel batch of
+independent tool calls (the correct pattern for independent reads). The batch
+included things like:
+
+- `read_file packages/api/src/agent-manager.ts` (several distinct line ranges)
+- `read_file package.json`
+- `list_files packages/api/src`
+- various `run_shell` commands (`grep`, `sed`, `awk`, plus debug probes like
+ `echo hello`, `echo hi`, `pwd`)
+
+Instead of one result per call, the results stream contained the **same
+results repeated many times over**. Concrete examples from the transcript:
+
+| Tool call | Approx. times the identical result appeared |
+|---|---|
+| `read_file package.json` (lines 1-5) | ~150+ |
+| `read_file agent-manager.ts` (lines 75-126 / 76-127 / 78-117) | ~40+ |
+| `list_files packages/api/src` | ~20 |
+| `run_shell sed -n '75,127p' ...` | ~13 |
+| `run_shell awk 'NR>=75...'` | ~12 |
+| `run_shell echo hello` | ~16 |
+| `run_shell pwd` | ~20 |
+
+The duplicated blocks were **byte-for-byte identical** (same line ranges, same
+content, same exit codes). They were not re-reads of changed files — the
+content was stable across every repeat.
+
+### Notable characteristics
+
+1. **Read-only and idempotent calls were the ones duplicated.** The heavy
+ duplication clustered on `read_file` / `list_files` / trivial `run_shell`
+ probes. This is consistent with the harness retrying or re-emitting results
+ for calls it considered safe/idempotent.
+2. **The duplication count was wildly uneven.** `package.json` (a tiny file)
+ was echoed ~150 times while a single `read_file_slice` appeared once. The
+ multiplier was not constant across calls.
+3. **A genuinely valuable signal was nearly buried.** One real result hid in
+ the noise: `read_file agent-manager.ts` lines 75-126 truncated mid-line and
+ instructed me to use `read_file_slice` — easy to miss in a wall of
+ duplicates.
+4. **No effect on writes.** The later `write_file`/`run_shell` mutation steps
+ each returned exactly once and produced correct results, and the final
+ `git diff --stat` matched expectations (6 insertions, 168 deletions across
+ 8 files). So the glitch appears confined to result *delivery/echo*, not
+ tool *execution*.
+
+---
+
+## Impact
+
+- **Context pollution:** Thousands of redundant lines consumed the context
+ window. On a larger codebase or a longer task this could trigger truncation
+ or force premature summarization.
+- **Readability:** The human-facing transcript became almost unreadable for
+ the exploration phase; it's hard to audit what the agent actually looked at.
+- **Cost/latency:** Re-emitting and re-processing identical results wastes
+ tokens and time.
+- **Reasoning risk:** Heavy duplication can bias or confuse a model
+ (repetition can be read as emphasis), and the near-miss on the truncation
+ hint shows real signals can get lost.
+
+---
+
+## What I could NOT determine
+
+- Whether the duplication originated in:
+  - the **tool runner** (executing/emitting each call multiple times),
+  - the **result serializer/transport** (one execution, many echoes), or
+  - a **retry/timeout loop** (calls re-issued after a perceived stall).
+- Whether the trigger was the **large fan-out batch size** specifically, or
+ something incidental.
+- Whether other agents on the same harness see it (single-session
+ observation only).
+
+The byte-identical nature of the repeats (including stable `pwd` and `echo`
+output) leans toward an **echo/transport duplication** rather than genuine
+re-execution, but that is inference, not proof.
+
+---
+
+## Reproduction hypothesis
+
+Not reliably reproduced. The strongest correlated factor was **a single turn
+with a large number (dozens) of independent tool calls batched together**,
+heavily weighted toward `read_file` on small files. Suggested repro attempt:
+
+1. In one assistant turn, batch ~30+ independent `read_file` calls, several
+ targeting the same small file at different line ranges.
+2. Mix in a few cheap `run_shell` probes (`echo`, `pwd`).
+3. Observe whether results are echoed >1× and whether the multiplier varies
+ per call.
+
+---
+
+## Suggested follow-ups for the harness maintainers
+
+1. **Deduplicate / cap result emission per tool_call_id.** Each tool call has
+ a unique id; the result stream should emit exactly one result per id.
+ Assert this invariant and drop/merge duplicates.
+2. **Log the executor path.** Add a counter for "executions per tool_call_id"
+ vs "results delivered per tool_call_id" to distinguish re-execution from
+ re-echo.
+3. **Investigate the batch/fan-out path.** Check whether large parallel tool
+ batches trigger a retry or a fan-in bug that replays buffered results.
+4. **Surface truncation hints robustly.** When a `read_file` result is
+ truncated, ensure that notice survives any dedup/compaction so the agent
+ doesn't miss it.
+5. **Consider a per-turn result-size guardrail** that warns when duplicate
+ identical results are detected.
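Follow-ups 1 and 2 could start from something like the sketch below. It is illustrative only and not taken from the harness: the `ToolResult` shape and function name are hypothetical, and it assumes each result carries the id of the tool call that produced it.

```typescript
interface ToolResult {
  toolCallId: string;
  output: string;
}

interface DedupeReport {
  results: ToolResult[]; // exactly one result per tool call id, first seen wins
  repeats: Record<string, number>; // extra deliveries dropped, per id
}

// Enforce "one result per tool_call_id" on a result stream and record how many
// repeats were dropped, so the multiplier per call can be logged.
function dedupeByToolCallId(stream: ToolResult[]): DedupeReport {
  const seen = new Set<string>();
  const results: ToolResult[] = [];
  const repeats: Record<string, number> = {};

  for (const result of stream) {
    if (seen.has(result.toolCallId)) {
      repeats[result.toolCallId] = (repeats[result.toolCallId] ?? 0) + 1;
      continue;
    }
    seen.add(result.toolCallId);
    results.push(result);
  }
  return { results, repeats };
}
```

Keeping the first delivery verbatim, rather than merging, means a truncation notice embedded in that result survives the dedup step, which is the concern raised in follow-up 4. Counting repeats at the delivery layer only separates re-echo from re-execution if a matching counter exists on the executor side.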
+
+---
+
+## Notes
+
+- This document records a single observed session. Numbers above are
+ approximate counts read from the transcript, not exact instrumentation.
+- The underlying code task completed successfully and was verified
+ independently (targeted test suites passed, typechecks clean); this report
+ concerns only the harness behavior, not the task outcome.
diff --git a/notes/wishlist.md b/notes/wishlist.md
new file mode 100644
index 0000000..43e579b
--- /dev/null
+++ b/notes/wishlist.md
@@ -0,0 +1,26 @@
+# Wishlist
+
+- **Persist dashboard layout and chat history across sessions and devices.**
+ - Restore any tabs that were left open when revisiting the page, including their order and active state.
+ - If a chat was mid-generation (AI actively calling tools and streaming responses), automatically resume and continue from where it left off — even if the page was closed.
+ - Chats continue processing server-side even when the frontend is entirely closed, meaning the AI keeps generating responses and calling tools without any browser open.
+ - Start a chat on one device (e.g. desktop) and seamlessly pick it up later on another (e.g. phone).
+ - Sidebar remembers which views were open and in what order, restoring them exactly as they were.
+
+- **Edit chat history.** Click on any existing message in the chat history and choose to edit it — this applies to user messages, AI responses, and tool results.
+
+- **Update the way tools appear in the chat UI.** Improve the visual presentation of tool calls and their results — make them more readable, compact, and scannable.
+
+- **Show git diffs for edited files.** When the AI edits a file (write_file tool call), display a git diff in the UI rather than just the raw file content.
+
+- **Show live shell output in a collapsible block.** When a shell command is running, show live stdout/stderr in a collapsible shell block (similar to the thinking block), instead of requiring the user to expand the tool call and read raw JSON.
+
+- **ntfy push notifications.** Configurable ntfy.sh notifications — ping on chat completion, errors, permission prompts, and other events. Configure topic URL and which events trigger notifications.
+
+- **Fix the todo system.** The current task list tool and its UI have bugs or limitations that need addressing.
+
+- **Track token usage in a tab.** Display token usage (e.g. prompt/completion/total tokens) for the chat within each tab.
+
+- **Fix queue not being consumed after the AI finishes its turn.** When the AI completes its turn, a queued user message is just attached to the chat without continuing the conversation — the turn ends instead of consuming the queue and generating a response. The queued message should kick off a new turn.
+
+- **Compaction tool.** A tool to compact/summarize the conversation history to reduce context size while preserving important information.
\ No newline at end of file