chore(notes): collect loose root docs into notes/; add reconcile edge-cases note

Move all loose root-level .md files (plans, reports, gemini reviews, incident notes) into a single notes/ directory, and update the doc-reference breadcrumbs in code comments/test labels to the notes/ path. Add notes/queue-interrupt-reconcile-edge-cases.md: documents why the queue/interrupt/turn-sealed reconcile path keeps surfacing edge cases (a catalog of the four review-pass bugs, the no-loss/no-duplicate invariants, the recommended membership-based reconcile refactor, and interleaving-test guidance).
author: Adam Malczewski <[email protected]> 2026-05-30 23:15:18 +0900
committer: Adam Malczewski <[email protected]> 2026-05-30 23:15:18 +0900
commit: 4e636511ae748d606d8871f5068a2bd18b386bd0 (patch)
tree: a5e0726d71d9d88d09d938ea2318a61e36ade68f
parent: 624b808da0f2f8bbad8a4fbbcca3f82f24ecfc47 (diff)
download: dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.tar.gz
dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.zip
30 files changed, 1334 insertions, 11 deletions
diff --git a/cache-miss-report.md b/notes/cache-miss-report.md
index 03342af..03342af 100644
--- a/cache-miss-report.md
+++ b/notes/cache-miss-report.md
diff --git a/notes/changes-report.md b/notes/changes-report.md
new file mode 100644
index 0000000..f30b611
--- /dev/null
+++ b/notes/changes-report.md
@@ -0,0 +1,54 @@
+# Changes Report
+
+## Overview
+This report reviews the uncommitted changes in the `dispatch` repository. The changes can be broadly categorized into two parts:
+1. **Codebase Updates**: The removal of the `todo` (task list) tool and a fix to a `summon` tool test.
+2. **Documentation Additions**: Several new untracked markdown documents regarding planning and incident reports.
+
+## Per-File Analysis
+
+### Untracked Files
+- **`claude-auth-report.md`** & **`tool-runner-duplication-incident.md`**: New post-mortem and incident investigation reports.
+- **`cyberdeck/credentials.md`**: Documentation regarding credentials.
+- **`skill-plan/` directory**: Multiple markdown files outlining a multi-step plan for building a new "skill" architecture (tool definitions, registration, routes, etc.).
+
+**Assessment**: Adding detailed markdown documents for incidents and project planning is a very good practice. They are cleanly separated from the application source code.
+
+### Modified Files (Staged & Unstaged)
+
+#### 1. `packages/core/src/tools/task-list.ts`
+- **Change**: Removed the `createTaskListTool` function and the Zod/ToolDefinition imports. The `TaskList` class remains.
+- **Correctness**: The removal of the factory is clean.
+
+#### 2. `packages/api/src/agent-manager.ts`
+- **Change**: Removed `createTaskListTool` imports and invocations. Removed `todo` from the `TOOL_DESCRIPTIONS` and the lengthy `TODO_GUIDANCE` string from the agent's system prompt. 
+- **Correctness**: Consistently applies the removal of the `todo` tool. However, it still retains `tabAgent.taskList = new TaskList()`, which may now be dead state (see Issues section).
+
+#### 3. `packages/core/src/agents/loader.ts` & `packages/core/src/tools/summon.ts`
+- **Change**: Removed `"todo"` from the default tool arrays and documentation comments for child agents.
+- **Correctness**: Follows through with the removal of the `todo` tool across tool lists, ensuring child agents don't request a missing tool.
+
+#### 4. `packages/core/src/index.ts`
+- **Change**: Updated exports to remove `createTaskListTool`, while keeping `TaskList`.
+- **Correctness**: Correctly reflects the module changes.
+
+#### 5. Tests (`packages/api/tests/*.test.ts` & `packages/core/tests/agents/loader.test.ts`)
+- **Change**: Removed mocked `createTaskListTool` injections and updated assertions that previously verified the automatic injection of the `todo` tool.
+- **Correctness**: Properly aligns tests with the new implementation. Build/tests should pass cleanly.
+
+#### 6. `packages/core/tests/tools/summon.test.ts`
+- **Change**: Updated the "surfaces child errors when blocking" test to correctly anticipate the `agent_id: <id>` prefix before the child output. Added an excellent explanatory comment about why this prefix exists (for frontend `ToolCallDisplay` regex parsing).
+- **Correctness**: The change is highly robust. Explicitly asserting `.toContain()` before the final `.toBe()` provides great debugging signals if the test fails in the future. The comment is an excellent practice.
+
+## Correctness Assessment
+The changes successfully and cleanly remove the `todo` tool logic from the backend tools and system prompt. The test fix for `summon.test.ts` is technically sound and well-documented. The changes are internally consistent from a backend tooling perspective.
+
+## Issues & Concerns Found
+**Incomplete Refactoring (Dead Code)**: 
+While the `todo` tool was completely removed, the `TaskList` class itself and its instantiation in `AgentManager` (`tabAgent.taskList = new TaskList()`) were kept. Because the agent no longer has a tool to interact with this list, any tasks will remain empty. 
+Consequently, the UI component that relies on this (`TaskListPanel.svelte` in the `frontend` package) will now always render an empty state. This represents an incomplete feature removal/refactor.
+
+## Recommendations
+1. **Clean Up `TaskList`**: If the todo feature is permanently removed, you should also delete the `TaskList` class from `packages/core`, remove the `taskList` property from `tabAgent` in `packages/api`, and delete `TaskListPanel.svelte` (along with its imports in `SidebarPanel.svelte`) from the frontend.
+2. **Staging**: If the untracked markdown documents are ready, ensure they are intentionally added (`git add`) and committed. 
+3. **Commit the Changes**: The code removal of the `todo` tool is safe to commit. I recommend summarizing it as "refactor: remove todo tool from agent capabilities".
+\ No newline at end of file
diff --git a/changes.md b/notes/changes.md
index 66389d9..66389d9 100644
--- a/changes.md
+++ b/notes/changes.md
diff --git a/notes/claude-auth-report.md b/notes/claude-auth-report.md
new file mode 100644
index 0000000..db2c6e6
--- /dev/null
+++ b/notes/claude-auth-report.md
@@ -0,0 +1,282 @@
+# Claude Auth Report — "old tab 401s, new tab works"
+
+## Symptom
+
+A tab that has been open for a while starts failing **every** Claude request with:
+
+```
+Invalid authentication credentials
+status=401  authentication_error
+url=https://api.anthropic.com/v1/messages
+model=claude-opus-4-8   baseURL=https://api.anthropic.com/v1
+```
+
+Opening a **new tab** and selecting the same Claude key works fine. The old tab
+keeps 401-ing no matter what you send. Both tabs are configured against the same
+`anthropic` key (`claude-pro` / `claude-max`), so "they should be using the same
+auth" — but they are not.
+
+This is an **authentication-layer** problem, not an agent/tool-wiring/permission
+bug. The 401 comes back from Anthropic on the chat request itself.
+
+---
+
+## Decisive clue: new tab works, old tab doesn't
+
+This single fact rules out most hypotheses and points at exactly one mechanism:
+**the per-tab `Agent` instance is cached, and it freezes the access token it was
+constructed with.** A new tab builds a fresh `Agent`, re-resolves credentials
+from the DB, and gets a currently-valid token. The old tab never re-resolves.
+
+---
+
+## Root cause #1 (proximate, confirmed): the cached per-tab Agent freezes a stale access token
+
+### Where the token is captured
+
+Credentials are resolved **only when a new `Agent` is constructed**, inside
+`getOrCreateAgentForTab` in `packages/api/src/agent-manager.ts`:
+
+- The `anthropic` branch (lines ~579–636) resolves `claudeCredentials =
+  { accessToken: <token> }` and passes it into the `Agent` constructor
+  (lines ~686–700, `...(claudeCredentials ? { claudeCredentials } : {})`).
+- That token is stored on the Agent's `this.config.claudeCredentials` and never
+  updated for the life of the Agent.
+
+In `packages/core/src/agent/agent.ts`, every `run()` call rebuilds the provider
+from that frozen config:
+
+```ts
+const providerFactory = createProvider({
+    apiKey: this.config.apiKey,
+    baseURL: this.config.baseURL,
+    provider: this.config.provider,
+    claudeCredentials: this.config.claudeCredentials,   // <-- frozen at construction
+});
+```
+
+And in `packages/core/src/llm/provider.ts`, `createClaudeOAuthProvider` sends it
+as the bearer token:
+
+```ts
+authToken: config.claudeCredentials?.accessToken ?? config.apiKey,
+```
+
+So the access token used on the wire is whatever was captured when the tab's
+Agent was first built. Calling `run()` again does **not** refresh it.
+
+### Why the cache is never invalidated on expiry
+
+`getOrCreateAgentForTab` (agent-manager.ts lines ~361–368) only discards the
+cached Agent when the key, model, or permissions change:
+
+```ts
+if (
+    tabAgent.agent &&
+    (effectiveKeyId !== tabAgent.keyId ||
+        effectiveModelId !== tabAgent.modelId ||
+        permKey !== tabAgent._lastPermKey)
+) {
+    tabAgent.agent = null;
+}
+```
+
+**Token expiry is not one of the conditions.** A tab that keeps the same key +
+model + permissions will reuse the same `Agent` — and therefore the same frozen
+token — indefinitely. The credential-refresh code below this gate only runs when
+`tabAgent.agent` is null, which for a stable tab never happens again.
+
+### Why this produces the exact symptom
+
+- **Old tab:** built its `Agent` earlier with token **A**. Time passes. Token A
+  is no longer accepted by Anthropic (it either reached its ~8h TTL, or a refresh
+  elsewhere — a new tab, the Model Status panel, model listing — rotated the
+  credential in the DB and **revoked A** server-side). The cached Agent still
+  holds A. Cache gate sees no key/model/perm change → Agent reused → every
+  request sends the dead token A → `401 authentication_error`.
+- **New tab:** no cached Agent → runs the resolution path → reads the **current**
+  token from the DB (and/or refreshes) → token **B** → works.
+
+Both tabs "use the same key," but the old tab is pinned to a now-invalid
+*instance* of that key's token. That is the whole discrepancy.
+
+> Note on OAuth token rotation: Anthropic's refresh flow rotates the refresh
+> token and can invalidate the previously issued access token. So the moment any
+> code path successfully refreshes the credential (updating the DB), every
+> already-constructed Agent still holding the old access token is now holding a
+> *revoked* token — not merely an expired one. This is why the old tab can fail
+> even when its captured token's local `expiresAt` hasn't elapsed yet.
+
+---
+
+## Root cause #2 (contributing, confirmed by reference): the OAuth refresh call is malformed
+
+Even when the resolution path *does* run, the refresh half of it is broken, so
+the system can't reliably self-heal an expired token.
+
+`packages/core/src/credentials/claude.ts`:
+
+```ts
+const OAUTH_TOKEN_URL = "https://claude.ai/v1/oauth/token";   // line 27 — wrong host
+
+async function refreshViaOAuth(refreshToken: string) {
+    const body = new URLSearchParams({ grant_type: "refresh_token", client_id: OAUTH_CLIENT_ID, refresh_token: refreshToken });
+    const response = await fetch(OAUTH_TOKEN_URL, {
+        method: "POST",
+        headers: { "Content-Type": "application/x-www-form-urlencoded" },   // wrong content-type
+        body: body.toString(),                                              // form-encoded
+    });
+    if (!response.ok) return null;   // failure is swallowed silently
+    ...
+}
+```
+
+Two concrete defects, both contradicted by the working reference implementation
+in `references/oh-my-pi/packages/ai/src/utils/oauth/anthropic.ts`:
+
+1. **Wrong endpoint host.** The token exchange/refresh endpoint lives on the API
+   host, not the web-app host:
+   ```ts
+   const TOKEN_URL = "https://api.anthropic.com/v1/oauth/token";   // reference, line 11
+   ```
+   The reference's test suite asserts exactly this URL
+   (`references/oh-my-pi/packages/ai/test/anthropic-oauth.test.ts`). `claude.ai`
+   is correct only for the *authorize* step (`https://claude.ai/oauth/authorize`).
+
+2. **Wrong body format.** Anthropic's `/v1/oauth/token` expects **JSON**, not
+   form-encoding. The reference posts:
+   ```ts
+   headers: { "Content-Type": "application/json", Accept: "application/json" },
+   body: JSON.stringify(body),
+   ```
+
+Because both are wrong, `refreshViaOAuth` effectively always returns `null`. When
+a token has genuinely expired, the manager's `anthropic` branch hits its final
+`else` and does the worst possible thing — proceeds with the **stale** token:
+
+```ts
+console.warn(`dispatch: unable to refresh Claude credentials for "${account.label}" — using stale token`);
+claudeCredentials = { accessToken: account.credentials.accessToken };  // expired
+...
+useOverride = true;
+```
+
+So the contributing failure mode is:
+- New tabs work **only as long as the DB token is still valid** (e.g. because
+  Claude Code refreshed it on disk and it was re-imported, or it simply hasn't
+  expired yet). Dispatch's *own* refresh cannot renew it.
+- Once the DB token expires with no external renewal, even new tabs will 401,
+  and the warning above appears in the server log.
+
+**Diagnostic check:** look in the server logs for
+`dispatch: unable to refresh Claude credentials for "..." — using stale token`.
+Its presence confirms the refresh path is failing and the stale-token fallback
+fired.
+
+> History: both the wrong URL and the form-encoded body were introduced in the
+> original commit `8151447` ("feat: claude max oauth support…"). They have never
+> been correct in this repo.
+
+---
+
+## Secondary suspect (unverified): the chat provider omits `anthropic-beta`
+
+`createClaudeOAuthProvider` (provider.ts lines ~76–84) sets only:
+
+```ts
+headers: {
+    "anthropic-dangerous-direct-browser-access": "true",
+    "x-app": "cli",
+    "user-agent": "claude-cli/2.1.112 (external, sdk-cli)",
+}
+```
+
+It does **not** send `anthropic-beta: …,oauth-2025-04-20,…` or
+`anthropic-version`. Every other path in this codebase and in the reference does:
+
+- `getAnthropicHeaders()` (claude.ts:387) builds the full set, including
+  `oauth-2025-04-20`, and is used by the models/profile/usage calls.
+- The reference's chat path always sends `anthropic-beta: ANTHROPIC_OAUTH_BETA`
+  with the OAuth bearer (`references/oh-my-pi/.../provider-models/openai-compat.ts`,
+  `.../providers/anthropic.ts:1542`). Anthropic generally requires the
+  `oauth-2025-04-20` beta for Bearer/OAuth requests.
+
+This is flagged **secondary** because chat works at all on a fresh token, which
+suggests `@ai-sdk/anthropic` may inject the oauth beta itself for `authToken`
+requests. **I could not verify this** — the workspace deps are not installed in
+this environment (`node_modules/@ai-sdk/anthropic` is absent) and outbound
+network is blocked, so the SDK could not be inspected and the request could not
+be reproduced live. If fixing #1 and #2 doesn't fully resolve things, make this
+provider reuse `getAnthropicHeaders()` so the chat path carries the same
+`anthropic-beta` / `anthropic-version` as every other Anthropic call.
+
+---
+
+## Minor, unrelated observation
+
+`agent.ts:836` gates adaptive thinking on a hardcoded model string:
+
+```ts
+const isOpus47 = this.config.model === "claude-opus-4-7";
+```
+
+The failing model is `claude-opus-4-8`, so it silently takes the non-adaptive
+`thinking: { type: "enabled", budgetTokens }` branch. Not related to the 401, but
+that literal will need updating for opus-4-8 to get adaptive thinking.
+
+---
+
+## Recommended fixes
+
+In priority order:
+
+### 1. Stop freezing the token in the cached Agent (fixes the new-vs-old-tab bug)
+
+Pick one of:
+
+- **Re-resolve credentials per `run()` for `anthropic` keys**, instead of caching
+  them in the Agent's config. e.g. have the Agent pull the access token through a
+  callback/getter at request time (`() => refreshAccountCredentials(account)`),
+  so each `run()` uses the live DB token. This is the cleanest fix.
+- **Or** invalidate the cached Agent when its captured token is near/after
+  expiry. Add an expiry check to the cache-invalidation gate in
+  `getOrCreateAgentForTab` (store the captured `expiresAt` on `tabAgent` and null
+  the agent when `Date.now() > expiresAt - 60_000`). Coarser, but localized.
+- **Or** when a refresh rotates the DB token, proactively null every cached
+  `tabAgent.agent` whose `keyId` matches, forcing re-resolution on next send.
+
+The getter approach is preferred because it also survives token rotation
+(root-cause note above) without any cache bookkeeping.
+
+### 2. Fix the OAuth refresh call in `credentials/claude.ts`
+
+- `OAUTH_TOKEN_URL` → `https://api.anthropic.com/v1/oauth/token`.
+- In `refreshViaOAuth`, send JSON: `Content-Type: application/json`,
+  `Accept: application/json`, `body: JSON.stringify({ grant_type, client_id, refresh_token })`.
+- Don't swallow failures silently — log `response.status` and the body so the
+  next stale-token event is diagnosable. (Mirror the reference's `postJson`,
+  which throws with `status` + `body` included.)
+
+### 3. (If still needed) add `anthropic-beta` to the chat provider
+
+Have `createClaudeOAuthProvider` reuse `getAnthropicHeaders(token)` so the chat
+path carries `anthropic-beta` (incl. `oauth-2025-04-20`) and `anthropic-version`
+like every reference path and every other Anthropic call in this repo.
+
+---
+
+## What was verified vs. assumed
+
+- **Verified by reading the code:** the per-tab Agent cache, the cache-invalidation
+  gate that ignores token expiry, the frozen `claudeCredentials` flowing into the
+  provider, the wrong refresh URL + form-encoded body, the silent failure +
+  stale-token fallback, the missing `anthropic-beta` on the chat provider, the
+  `opus-4-7` hardcode.
+- **Verified against the reference** (`references/oh-my-pi`): correct token URL
+  (`api.anthropic.com/v1/oauth/token`), JSON body, and that the OAuth chat path
+  sends `anthropic-beta`.
+- **Could not run/reproduce:** deps aren't installed here and network is
+  sandboxed, so the live request, the SDK's default-header behavior, and the
+  exact DB token `expiresAt` values were not inspected. Root cause #1 fully
+  explains the reported new-vs-old-tab behavior on its own; #2 explains why the
+  system can't self-heal once the DB token lapses.
diff --git a/notes/claude-report.md b/notes/claude-report.md
new file mode 100644
index 0000000..6635fa4
--- /dev/null
+++ b/notes/claude-report.md
@@ -0,0 +1,114 @@
+# Cache Miss Analysis for Dispatch's Claude Code Integration
+
+## Executive Summary
+
+The massive token burn and 0% cache hit rate observed after upgrading to AI SDK v6 are the result of two interacting flaws in how the Dispatch harness constructs requests for Anthropic:
+
+1. **Missing Beta Header:** The Claude OAuth provider completely omits the required `anthropic-beta: prompt-caching-scope-2026-01-05` header. Without this specific beta header, the Anthropic API silently ignores all `cache_control` markers.
+2. **Suboptimal Breakpoint Placement:** The logic that assigns `cache_control` breakpoints misinterprets the structure of AI SDK v6 messages. Because each tool result is serialized as an independent `role: "tool"` message, the caching logic places markers on the last two *individual tool results* rather than the actual conversational turns, wasting breakpoints and failing to cache the preceding `assistant` message.
+
+The tool duplication incident (where tools were echoed 150+ times) is highly likely a symptom of the context window blowing up due to these cache misses, causing the model's generation loop to degenerate into repetitive tool hallucinations.
+
+---
+
+## Root Cause 1: The Missing Beta Header
+
+### Analysis
+Anthropic's prompt caching relies on explicit `cache_control` markers embedded in the request payload. However, for the Claude CLI ecosystem and OAuth flow, caching features are gated behind specific beta headers. 
+
+In `packages/core/src/llm/provider.ts`, the `createClaudeOAuthProvider` factory initializes the AI SDK Anthropic provider. While it correctly mimics the `x-app` and `user-agent` headers of the Claude CLI, it **fails to include the `anthropic-beta` header**:
+
+```typescript
+function createClaudeOAuthProvider(config: ProviderConfig): ModelFactory {
+    const anthropic = createAnthropic({
+        // ...
+        headers: {
+            "anthropic-dangerous-direct-browser-access": "true",
+            "x-app": "cli",
+            "user-agent": "claude-cli/2.1.112 (external, sdk-cli)",
+            // MISSING: "anthropic-beta": getAnthropicBetas().join(",")
+        },
+    });
+}
+```
+
+The AI SDK's `createAnthropic` constructor (in `@ai-sdk/anthropic`) adds `anthropic-version: 2023-06-01` but does *not* automatically inject `prompt-caching-scope-2026-01-05`. 
+
+Because this header is missing from the underlying fetch request, the Anthropic API treats the request as a standard (non-cached) `/messages` call and ignores the ephemeral cache breakpoints completely.
+
+---
+
+## Root Cause 2: Inefficient Caching Breakpoints
+
+### Analysis
+Even if the beta header were present, the current breakpoint strategy in `agent.ts` defeats the purpose of caching due to how the AI SDK v6 structures tool results.
+
+In `toModelMessages()`, the agent unpacks an internal `tool-batch` chunk by creating a separate AI SDK message for every single tool result:
+```typescript
+for (const tr of trailingToolResults) {
+    result.push({ role: "tool", content: [ { type: "tool-result", ... } ] });
+}
+```
+
+Immediately after, `applyAnthropicCaching()` attempts to apply rolling cache markers:
+```typescript
+const nonSystem = msgs.filter((m) => m.role !== "system").slice(-2);
+for (const m of nonSystem) targets.add(m);
+// applies cacheControl: { type: "ephemeral" }
+```
+
+If the assistant emitted a batch of 10 tool calls, `msgs` ends with 10 individual `role: "tool"` messages. The `.slice(-2)` logic grabs only the 9th and 10th tool result messages and applies `cacheControl` to them.
+
+When the AI SDK translates this to the Anthropic wire format (in `convert-to-anthropic-messages-prompt.ts`), it groups consecutive `tool` messages into a single `role: "user"` Anthropic block. The cache markers are applied strictly to the 9th and 10th tool result parts at the very end of this user block.
+
+**Why this fails:**
+1. **Wasted Breakpoints:** Anthropic requires at least 1,024 tokens between breakpoints to effectively cache the delta. Placing breakpoints on two consecutive tool results at the end of the chain wastes a breakpoint because the token delta between them is microscopic.
+2. **Missing Assistant Context:** The `assistant` message (which contains the expensive reasoning tokens and the tool calls themselves) receives NO cache marker. 
+
+### Comparison with `oh-my-pi`
+The `oh-my-pi` reference implementation avoids this by applying cache controls logically across the wire format:
+- `applyCacheControlToLastBlock(params.system)`
+- `applyCacheControlToLastBlock(params.tools)`
+- `penultimate user message`
+- `last user message`
+
+By targeting the end of distinct conversational phases (the system block, the tools definition, and the entire user response block), `oh-my-pi` maximizes the cached prefix length.
+
+---
+
+## Tool Duplication Incident Correlation
+
+The documented incident describes 150+ duplicated `read_file` results for `package.json` with wildly uneven counts. 
+
+This behavior is not indicative of the execution harness blindly retrying tools. In `agent.ts`, tool calls are harvested directly from the AI SDK stream (`event.type === "tool-call"`) and executed synchronously. If 150 identical `read_file` results were yielded, it is because **the LLM actively generated 150 identical `<tool_use>` blocks** in a single turn.
+
+This generation loop is a known failure mode for Claude when the context window becomes excessively bloated or loses coherence (often compounded by a lack of cached prefixes breaking the model's structural attention). Fixing the cache miss will likely resolve the generation instability.
+
+---
+
+## Recommendations
+
+Since this is a read-only analysis, no code has been edited. The following changes should be applied to fix the system:
+
+1. **Inject the Beta Headers:**
+   Modify `packages/core/src/llm/provider.ts` to import `getAnthropicBetas` from `../credentials/claude.js` and inject it into the Claude OAuth headers:
+   ```typescript
+   headers: {
+       "anthropic-beta": getAnthropicBetas().join(","),
+       "anthropic-dangerous-direct-browser-access": "true",
+       // ...
+   }
+   ```
+
+2. **Group Tool Results in `toModelMessages`:**
+   Instead of pushing an individual `role: "tool"` message for every single tool result, group all `trailingToolResults` into a single `content` array inside one `role: "tool"` message. This accurately represents the tool-batch as a single turn.
+
+3. **Revise the Breakpoint Strategy in `applyAnthropicCaching`:**
+   Instead of naively grabbing the last two elements of the array, specifically target:
+   - The first `system` message.
+   - The **last** `assistant` message in the array.
+   - The **last** `user` (or `tool`) message in the array.
+   This ensures that the entire prefix leading up to the most recent reasoning step is successfully cached.
+
+4. **Add a Tool Deduplication Safeguard (Optional but recommended):**
+   In `agent.ts`'s execution loop, hash the tool name and arguments. If an identical tool call appears in the exact same `stepToolCalls` batch, automatically copy the result from the first execution rather than re-running it or blowing up the LLM's next prompt with redundant shell outputs.
+\ No newline at end of file
diff --git a/context.md b/notes/context.md
index 896eed4..896eed4 100644
--- a/context.md
+++ b/notes/context.md
diff --git a/eviction-limitation.md b/notes/eviction-limitation.md
index a5b948f..91d1a88 100644
--- a/eviction-limitation.md
+++ b/notes/eviction-limitation.md
@@ -1,8 +1,18 @@
 # Known Limitation: Frontend eviction is whole-message, not per-chunk
 
-Status: **open / deal-with-later.** Documented from the append-only chunk-log
-work (see `plan-chunk-log.md`). The backend is not the problem — this is purely
-a frontend in-memory concern.
+Status: **RESOLVED.** Fixed by the chunk-native frontend store (see
+`plan-chunk-eviction.md`). The frontend's source of truth for history is now a
+flat `ChunkRow[]` (`tab.chunks`, real per-tab `seq`); the live turn is a
+transient tail (`tab.live`) reconciled into the sealed log on a `turn-sealed`
+event. Eviction (`evictChunks`) is a rolling per-chunk trim of the oldest rows —
+so a single oversized turn is trimmed chunk-by-chunk instead of pinned whole.
+Pagination loads raw chunks (`GET /tabs/:id/chunks`), deduped by `seq`. The
+historical analysis below is retained for context.
+
+---
+
+Documented from the append-only chunk-log work (see `plan-chunk-log.md`). The
+backend was not the problem — this was purely a frontend in-memory concern.
 
 ## TL;DR
 
diff --git a/notes/gemini-chunk-eviction-review-2.md b/notes/gemini-chunk-eviction-review-2.md
new file mode 100644
index 0000000..a2a1ac3
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review-2.md
@@ -0,0 +1,95 @@
+# Code Review (Pass 2): Chunk-Native Frontend Eviction
+
+## Executive Summary
+
+The fixes applied since Pass 1 successfully address the two major Blockers related to reconciling active turns and optimistic UI state. The decoupling of reconcile logic from `agentStatus` in favor of `liveTurnId` elegantly solves the race conditions around deferred reconciliations and interrupt boundaries. 
+
+However, a **NEW Blocker** was identified in how `turn_id`s are backfilled onto user messages when a turn starts. The loop indiscriminately tags pending `queued-` messages that belong to *future* turns, causing them to be wiped when the *current* turn finishes.
+
+**Verdict: DO NOT SHIP.** The new Blocker must be fixed. The fix is a trivial one-line condition.
+
+---
+
+## Pass 1 Fixes & Invariants
+
+* **Fix #1: Deferred reconcile wipes concurrent active turns.** 
+  **VERDICT: FIXED.** `reloadChunksFromApi`'s new `preserveTurnId` logic perfectly handles overlapping turns. It correctly preserves Turn B's in-flight chunks while dropping Turn A's, and cleanly nullifies state when appropriate. The Map `pendingReconcileTabs` ensures any intermediate turns are fetched comprehensively from the DB.
+* **Fix #2: Optimistic queued user messages are dropped on reconcile.**
+  **VERDICT: FIXED (Partially).** The condition `(m.turnId === undefined && m.role === "user")` inside `keptLive` effectively retains optimistic user messages. However, see the New Blocker below regarding `turnId` assignment.
+* **Cache Safety Invariant.**
+  **VERDICT: SAFE.** `toModelMessages` and Anthropic normalization logic were completely untouched. The backend's prompt caching remains strictly bound to DB-persisted rows.
+* **`messages` -> `renderGroups` Rename Completeness.**
+  **VERDICT: COMPLETE.** No stray references to `tab.messages` were found. Reactivity safely bounds derived state within `updateTab`.
+
+---
+
+## NEW Findings
+
+### 1. `turn-start` wrongly tags future queued messages (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:873-882` (inside `handleEvent` case `turn-start`)
+
+**Description:**
+When a turn starts, the frontend loops backwards through `tab.live` to tag untagged user messages with the new `turnId`. It breaks only when it hits a non-user or an already-tagged message. 
+
+If multiple messages are queued (e.g. `[queued-B, queued-C]`), and the backend dequeues `B`, `message-consumed` removes the prefix and puts `B` at the end of the array: `[queued-C, B]`. Immediately after, `turn-start` for Turn B fires, looping backward. It tags `B` with `"Turn B"`, but continues and ALSO tags `queued-C` with `"Turn B"`. When Turn B subsequently seals, `queued-C` is wiped from the UI because its `turnId` matches the sealing turn, and it loses the protection of `turnId === undefined`. It vanishes until Turn C actually finishes processing.
+
+This same bug happens if the user queues a message in the tiny race-window between sending their first prompt and the WS `turn-start` event arriving.
+
+**Direction:** The backfill loop must explicitly ignore queued messages. Add a check to prevent tagging them: `if (m && m.role === "user" && m.turnId === undefined && !m.id.startsWith("queued-"))`.
+
+---
+
+## Evaluation of New Regression Tests
+
+The two new regression tests are highly valuable, but miss critical permutations:
+
+1. **"preserves an optimistic queued user message..."**
+   * **Genuinely covers:** Fix #2. It proves `keptLive` preserves an untagged queued message during a reconcile.
+   * **Untested edge case:** It queues the message *after* `turn-start` fires. Therefore, it completely bypasses the `turn-start` backfill loop. A test should queue a message *before* `turn-start` fires (or queue *two* messages and consume one) to expose the new Blocker above.
+2. **"preserves a concurrent newer turn..."**
+   * **Genuinely covers:** Fix #1. It correctly validates `preserveTurnId` logic, the `currentAssistantId` mapping, and `liveTurnId` integrity when a deferred reconcile flushes across a newer streaming turn.
+   * **Untested edge case (Map last-write-wins):** It only seals *one* turn while scrolled up. A test should seal *two* turns while scrolled up, then scroll down, ensuring `pendingReconcileTabs` handles the latest `turnId` and the entire window is correctly loaded.
+
+---
+
+## Resolution (OpenCode)
+
+### Block #1 (`turn-start` wrongly tags future queued messages) — FIXED
+`packages/frontend/src/lib/tabs.svelte.ts` `handleEvent` case `turn-start`.
+
+Backend reality check (not in Gemini's model): a queued message **never** gets its
+own `turn-start`. `turn-start` is emitted exactly once per user-initiated
+`processMessage` (`agent-manager.ts:1128`), which persists exactly one user row via
+`explodeUserText`. Queued messages are drained *into* the running turn through
+`dequeueMessages`/`message-consumed` (`agent.ts:1241,1308`). So the turn's initiator
+is always the single most-recent **non-queued** untagged user row, and any `queued-`
+row in the live tail belongs to a future turn.
+
+Gemini's literal one-liner (`&& !m.id.startsWith("queued-")` with the existing
+`else break`) is **insufficient**: when the queued row is the trailing element the
+loop would `break` on it and never tag the real initiator, leaving the initiator
+untagged → it duplicates against its own sealed chunk row on reconcile.
+
+Implemented fix: the backfill now (a) stops at the first non-user row, (b) **skips
+past** (`continue`) pending `queued-` rows, and (c) tags exactly the one most-recent
+non-queued untagged user row, then breaks. `keptLive` is unchanged — untagged
+`queued-` rows remain untagged and are preserved on reconcile.
+
+### Test gap — ADDRESSED
+Added `"turn-start backfill skips a pending queued row trailing the turn initiator"`
+in `chat-store.test.ts`: constructs live = `[<plain initiator>, queued-q2]`, fires
+`turn-start`, asserts the initiator is tagged and `queued-q2` is **not**, then seals
+and asserts `queued-q2` survives with no duplicate initiator bubble
+(`renderGroups` roles = `[user, assistant, user]`). This exercises the `continue`
+(skip-queued) branch and covers the queue-present-before-`turn-start` race Gemini
+flagged as untested.
+
+### Nits (deferred, non-blocking)
+- `pendingReconcileTabs` last-write-wins doc/test (seal two turns while scrolled up):
+  acceptable as-is — the deferred flush refetches the full window from the DB, so the
+  latest sealed `turnId` correctly supersedes; left as a follow-up test.
+
+### Verification
+Full suite: **326 tests pass** (core 223 incl. 33-test agent/cache-stability suite,
+api 35, frontend 68). Biome clean; `tsc` core+api and `svelte-check` report 0 errors.
+Cache invariant remains untouched (no `agent.ts` wire/folding/caching changes).
+\ No newline at end of file
diff --git a/notes/gemini-chunk-eviction-review-3.md b/notes/gemini-chunk-eviction-review-3.md
new file mode 100644
index 0000000..3637946
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review-3.md
@@ -0,0 +1,118 @@
+# Code Review (Pass 3): Verify the `turn-start` backfill Block fix
+
+## Executive Summary
+
+The fix to the `turn-start` backfill loop successfully prevents pending `queued-` messages from being wiped by correctly skipping them. However, it exposes a critical flaw in how **consumed** messages are handled.
+
+When a queued message is consumed (`message-consumed`), its `queued-` prefix is stripped, leaving it as an untagged, plain user row in `live`. Because it lacks a `turnId`, it survives `reconcileSealedTurn` and lingers in the UI forever, floating to the bottom and duplicating the `[USER INTERRUPT]` text already present in the sealed chunks. Additionally, in multi-client scenarios with no local initiator, the backfill loop will incorrectly tag this lingering consumed message with a future turn's ID.
+
+**Verdict: DO NOT SHIP.** A new Blocker must be fixed. The fix is a one-line change in the `message-consumed` handler to bind consumed messages to the active turn.
+
+---
+
+## Detailed Findings
+
+### Q1. Backend Claims Verification
+**Confirmed.** The backend code in `packages/api/src/agent-manager.ts` and `packages/core/src/agent/agent.ts` confirms that:
+1. `turn-start` is emitted exactly once per `processMessage`.
+2. Exactly one user row draft is persisted via `explodeUserText`.
+3. Queued messages are pulled into a running turn via `dequeueMessages` and NEVER trigger a `turn-start`.
+
+### Q2. Block Fix Correctness
+The new backfill logic correctly skips `queued-` rows and tags exactly one initiator when a local initiator exists.
+* **(a) `[initiator, queued-X]`**: Correctly skips `queued-X`, tags `initiator`, and breaks.
+* **(b) `[queued-X, initiator]`**: Logically impossible; the initiator is appended before a queue can form.
+* **(c) `[queued-X, queued-Y]` consumed**: When `queued-X` is consumed, its prefix is stripped. It becomes an untagged user message. When the NEXT turn starts, if there is a local initiator, it tags the new initiator and ignores the consumed message.
+* **(d) Multi-client (no local initiator)**: If Client B starts a turn, Client A receives `turn-start` but has no local initiator in `live`. Client A's backfill loop will find the untagged consumed message from the *previous* turn and incorrectly tag it with the *new* turn's ID.
+
+### Q3. No Duplication / No Loss
+**Failed.** Because the backfill loop `break`s after tagging the new initiator, the consumed message from the previous turn remains permanently untagged (`turnId === undefined`). It survives the `turn-sealed` reconcile process and floats to the bottom of the live tail. Since the backend injects the consumed message's text into the `[USER INTERRUPT]` chunk row, the user sees BOTH the tool result chunk text AND a lingering user bubble.
+
+### Q4. `keptLive` Unchanged — Still Correct?
+**Failed.** `keptLive` preserves optimistic rows by checking `m.turnId === undefined && m.role === "user"`. This was intended for pending initiators and queues, but it inadvertently preserves consumed messages because they never received a `turnId`. Consumed messages MUST be bound to the turn that consumed them so they are cleanly dropped when that turn seals.
+
+### Q5. Test Adequacy
+**The new test simulates an impossible backend sequence.**
+The test creates `q1`, consumes it, and then fires `turn-start`, assuming `turn-start` applies to the consumed `q1`. The backend never emits a `turn-start` for a consumed message. By forcing the backfill loop to tag a consumed message, the test mistakenly verifies an action that in reality constitutes cross-turn ID theft.
+
+### Q6. Prompt-Caching Invariant
+**SAFE.** The backend logic in `packages/core/src/agent/agent.ts` was untouched. Cache stability is preserved.
+
+---
+
+## NEW Blockers
+
+### Block #1: Consumed messages linger and duplicate due to missing `turnId`
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:1226` (inside `handleEvent` case `"message-consumed"`)
+
+**Description:**
+When `message-consumed` extracts a queued message, it strips the prefix but leaves `turnId` undefined. The message survives `turn-sealed` and lingers in `live` forever, duplicating the chunk data. In multi-client scenarios, it also acts as a trap for the next `turn-start` backfill.
+
+**Direction:**
+In `message-consumed`, bind the consumed message to the actively streaming turn so `reconcileSealedTurn` drops it cleanly:
+```typescript
+if (mcEvent.messageIds.includes(queuedId)) {
+    consumed.push({ 
+        ...m, 
+        id: queuedId, 
+        turnId: mcTab.liveTurnId ?? undefined // Bind to active turn
+    });
+    continue;
+}
+```
+With this fix, the `turn-start` loop's `break` behavior is perfectly safe, as consumed messages will no longer be "untagged".
+
+---
+
+## Resolution (OpenCode)
+
+### Block #1 (consumed interrupt messages linger + duplicate) — FIXED
+`packages/frontend/src/lib/tabs.svelte.ts`, `handleEvent` case `"message-consumed"`.
+
+Verified the finding against the code: `keptLive` (in `reloadChunksFromApi`) preserves
+every `m.turnId === undefined && m.role === "user"` row, and `message-consumed` was
+pushing `{ ...m, id: queuedId }` with **no** `turnId`. So a consumed interrupt bubble
+was kept on every reconcile — lingering at the tail and duplicating the
+`[USER INTERRUPT]` text the backend folds into the sealed tool-result chunk
+(`agent.ts:1248-1255`). This contradicted the intended "collapse to persisted shape"
+behavior.
+
+Applied Gemini's direction (with the codebase's conditional-spread style): bind the
+consumed row to the in-flight turn so reconcile drops it on seal —
+```ts
+consumed.push({
+  ...m,
+  id: queuedId,
+  ...(mcTab.liveTurnId !== null ? { turnId: mcTab.liveTurnId } : {}),
+});
+```
+`liveTurnId` is always set while a turn runs (the only time a consume happens). The
+ChatPanel keyer (`${turnId}:${role}:${n}`, per-(turn,role) counter) gives the consumed
+row a distinct key from the initiator, so no key collision during the live interrupt
+split. This also makes the Pass-2 `turn-start` backfill `break` fully safe (no
+untagged consumed rows remain to be mis-tagged in the multi-client path Q2(d)).
+
+### Test fixes (addressing Q5)
+The Pass-2 test was rewritten — Gemini correctly flagged that consuming a message and
+then firing `turn-start` at it is not a real backend sequence. Replaced with two
+realistic tests in `chat-store.test.ts`:
+1. `"turn-start backfill skips a pending queued row (race), tags only the initiator"` —
+   drives the **real** race via `store.sendMessage`: an idle send (plain optimistic
+   row) followed by a running send (queued row), then a late `turn-start`. Asserts the
+   initiator is tagged, the queued row is NOT, and the queued row survives the seal
+   with no duplicate initiator.
+2. `"a consumed interrupt message collapses into the sealed turn (no lingering bubble)"`
+   — realistic interrupt flow (turn-start → stream → queue → consume → stream → seal);
+   asserts the consumed row is bound to the turn and dropped on reconcile. Fails on the
+   pre-fix code (row lingers untagged), passes after.
+
+### Verdict response
+- Q1 backend claims: confirmed by Gemini and re-confirmed here.
+- Q2(d) multi-client mis-tag: eliminated — consumed rows are no longer untagged.
+- Q3/Q4 lingering+duplication: FIXED.
+- Q6 cache invariant: SAFE — change is frontend store/test only; `agent.ts` untouched.
+- Q7: no new Blockers introduced.
+
+### Verification
+Frontend **69 tests pass** (chat-store 54 incl. the two new tests, sidebar 15);
+`svelte-check` 0 errors; Biome clean. Core (223) + api (35) unaffected → 327 total.
+\ No newline at end of file
diff --git a/notes/gemini-chunk-eviction-review.md b/notes/gemini-chunk-eviction-review.md
new file mode 100644
index 0000000..373917a
--- /dev/null
+++ b/notes/gemini-chunk-eviction-review.md
@@ -0,0 +1,69 @@
+# Code Review: Chunk-Native Frontend Eviction (Dispatch)
+
+## Executive Summary
+
+The rewrite to a chunk-native frontend store successfully resolves the unbounded memory pinning caused by oversized turns. The migration from grouped messages to a flat chunk log (`tab.chunks`) as the source of truth, with a decoupled `renderGroups` cache, correctly achieves rolling chunk-level eviction and seamless pagination. 
+
+However, while the backend invariants and cache safety are perfectly maintained, there are two **Blocker** regressions in the frontend's reconcile logic relating to the handling of concurrent/queued messages and deferred UI updates. `reloadChunksFromApi` assumes it is strictly processing a single sequential turn and acts destructively on in-flight UI state when edge cases overlap.
+
+**Verdict: DO NOT SHIP.** The cache invariant holds, but the Blocks must be fixed first. 
+
+---
+
+## Cache Safety Invariant
+
+**VERDICT: SAFE.**
+The diff contains **zero** modifications to `packages/core/src/agent/agent.ts`. The prompt-cache cohesion logic, `toModelMessages`, `applyAnthropicCaching`, and normalisation remain 100% server-side and entirely isolated from the frontend's transient render representations. Model-bound bytes are unchanged.
+
+---
+
+## Findings
+
+### 1. Deferred reconcile wipes concurrent active turns (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:658` (`reconcileSealedTurn`) and `:636` (`reloadChunksFromApi`)
+
+**Description:**
+If the user scrolls up, automatic eviction and reconciliation are suppressed. If Turn A finishes while scrolled up, its reconciliation is deferred (`pendingReconcileTabs.add`). If the agent then starts Turn B (e.g. via queue processor), it establishes a new live streaming bubble (`liveTurnId = 'Turn B'`). 
+When the user subsequently scrolls down, the deferred flush fires for Turn A, calling `reloadChunksFromApi`. This function blindly sets `live: []`, `liveTurnId: null`, and `currentAssistantId: null` — immediately deleting all in-flight chunks for the *currently active* Turn B. Turn B will then spawn a fresh disconnected live bubble on its next stream delta, permanently losing its prior chunks and its `turnId` tag (causing it to remount/flash on seal).
+
+**Direction:** `reloadChunksFromApi` must not unconditionally nuke `live` and `currentAssistantId` if a *new* turn is actively in-flight (`agentStatus === "running" && liveTurnId !== turnIdToReconcile`). The reconcile must be turn-aware.
+
+### 2. Optimistic queued user messages are dropped on reconcile (Block)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:636` (`reloadChunksFromApi`)
+
+**Description:**
+When a user sends a message while the agent is busy, it is added to `tab.live` as an optimistic unsealed bubble. When the current active turn completes and emits `turn-sealed`, `reloadChunksFromApi` wipes the entire `live` array (`live: []`). The queued user message vanishes from the UI entirely. While it remains safely in `queuedMessages` and will eventually be processed by the backend, the UI will not reflect it again until its own eventual `turn-sealed` event completes the backend loop. 
+
+**Direction:** `reloadChunksFromApi` must preserve unsealed optimistic user messages (e.g. those matching `queuedMessages` IDs) when clearing `live`.
+
+### 3. Edge-case fetch race on WS reconnect desync (Ship-with-followup)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:860` (`handleEvent` case `statuses`)
+
+**Description:**
+The backend emits `status: idle` immediately before making the synchronous SQLite `flushAssistant` write, and emits `turn-sealed` immediately after. If a WS reconnect happens exactly in this microsecond window, the frontend's `hydrateFromBackend` sees `backendStatus !== "running"` and triggers `reloadChunksFromApi`. Because the DB write hasn't landed, it loads a chunk window missing the just-finished turn.
+
+**Direction:** This is non-fatal because the backend will still emit `turn-sealed` milliseconds later, which triggers a second `reloadChunksFromApi` that corrects the UI state. It will manifest as a sub-second flicker. Safe to ship, but ideally, `statuses` should infer completion from the presence of unpersisted chunks rather than the raw agent status.
+
+### 4. Tool-batches undercount the live eviction budget (Nit)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:76` (`countLiveChunks`)
+
+**Description:**
+In `countLiveChunks`, a single `tool-batch` render bubble counts as `1` against the live budget (`m.chunks.length`). However, upon sealing, `explodeTurn` expands each batch into `N * 2` rows (a `tool_call` and `tool_result` for each parallel call). This means an in-flight turn with heavily parallelized tool execution will temporarily consume more memory than `chunkLimit` targets.
+
+**Direction:** Minor inconsistency. Once the turn seals, the correct DB row count will be respected. Acceptable trade-off for simplicity in the live rendering path.
+
+### 5. `trimLiveChunks` drops the prompt under extreme pressure (Nit)
+**Location:** `packages/frontend/src/lib/tabs.svelte.ts:431` (`trimLiveChunks`)
+
+**Description:**
+If a single streaming turn heavily exceeds `chunkLimit`, `trimLiveChunks` enforces a strict rolling window on `live`. Since it trims oldest-first indiscriminately, it will `shift()` away the user's prompt chunk from the UI entirely before it touches the assistant's streaming chunks. 
+
+**Direction:** This technically aligns with the design ("never the chunk currently being streamed"), but means the user's initiating context disappears mid-stream under memory pressure. Expected behavior for a raw chunk limit.
+
+---
+
+## Notable Correctness Risks Validated
+
+- **Seq Cursor Logic:** Perfect. `oldestLoadedSeq` rigidly derives from `ChunkRow.seq` via `minSeqOf(sealed)`. Live transient state is securely walled off from pagination logic.
+- **Transaction Wrapper:** Sound. `db.transaction()` around the `appendChunks` batch is synchronously robust and correctly yields monotonic `seq` allocations per turn.
+- **Scroll-up pagination:** Deduplication works flawlessly. `mergeChunksBySeq` efficiently handles overlap at the pagination boundary without duplicating message bubbles.
diff --git a/gemini-chunk-log-review.md b/notes/gemini-chunk-log-review.md
index e04f6c3..e04f6c3 100644
--- a/gemini-chunk-log-review.md
+++ b/notes/gemini-chunk-log-review.md
diff --git a/harness-comparison.md b/notes/harness-comparison.md
index 1fd7aec..1fd7aec 100644
--- a/harness-comparison.md
+++ b/notes/harness-comparison.md
diff --git a/plan-bg-restore.md b/notes/plan-bg-restore.md
index e622669..e622669 100644
--- a/plan-bg-restore.md
+++ b/notes/plan-bg-restore.md
diff --git a/notes/plan-chunk-eviction.md b/notes/plan-chunk-eviction.md
new file mode 100644
index 0000000..74f850d
--- /dev/null
+++ b/notes/plan-chunk-eviction.md
@@ -0,0 +1,251 @@
+# Plan: Chunk-Native Frontend (per-chunk eviction + pagination)
+
+Closes `eviction-limitation.md`. Kills the frontend `ChatMessage[]` "message"
+structure entirely. The frontend's state becomes a **flat chunk log**; memory is
+bounded by a **rolling per-chunk eviction**; history is **paginated as raw
+chunks** from the backend.
+
+## Locked decisions (from the design discussion)
+
+1. **No stored "message" structure on the frontend.** The store holds a flat
+   chunk list. Any grouping into bubbles is a *render-time* projection only
+   (see "Rendering" — the one item still to confirm).
+2. **Rolling per-chunk eviction.** When a chunk completes while streaming and the
+   in-memory chunk count is at `chunkLimit`, drop the oldest chunk. Pure
+   count-based window. The only chunk never evicted is the single one currently
+   receiving deltas (the open chunk). No turn pinning, no "keep last pair."
+3. **Per-turn persistence stays** (backend unchanged in timing). It is the more
+   performant choice for a low-RAM/low-IO backend: peak RAM is dominated by the
+   agent's in-memory conversation context (required for the cache prefix) which
+   both strategies share, while per-seal would multiply fsyncs across the
+   streaming loop. One cheap win: wrap the end-of-turn write in a single SQLite
+   transaction.
+4. **Mid-stream reload is deferred, not engineered away.** A still-streaming turn
+   isn't in the DB until it seals. If the user scrolls up into chunks that were
+   roll-evicted from the *current* turn, the refetch waits until the turn's write
+   lands (the running→idle signal). Accepted.
+
+## The dominating constraint (unchanged)
+
+**Cache cohesion is built 100% server-side and must not be touched.** The
+Anthropic wire prefix comes from `toModelMessages` + `applyAnthropicCaching` in
+`packages/core/src/agent/agent.ts`, built from the agent's in-memory history
+(rebuilt from the chunk log) + the live turn — never from anything the frontend
+holds or the WS sends. So **everything here is frontend + one additive raw API
+endpoint**; it cannot regress the cache. Out of scope, do not edit: `agent.ts`
+wire/folding/caching/interrupt code, `explodeTurn` row ordering, and the
+backend's use of `groupRowsToMessages` for *agent history rebuild*.
+
+## Already done (reuse, don't rebuild)
+
+- `getChunksForTab(tabId,{limit,before})` (`db/chunks.ts`) already paginates the
+  flat log by `seq`.
+- `explodeTurn` and `groupRowsToMessages` (`chunks/transform.ts`) are DB-free,
+  browser-importable, and orphan-tolerant (a turn split across a page boundary
+  groups correctly). `explodeTurn` emits a **role per draft** — we use that.
+- `appendEventToChunks` (`chunks/append.ts`) folds a delta stream into render
+  `Chunk[]` correctly (text/thinking/metadata/tool-result/shell-output ordering).
+- Chunk table schema already has `seq, turnId, step, role, type, data`.
+  **No DB migration, no chat-history reset needed.**
+
+---
+
+## Frontend architecture (chunk-native)
+
+### Store state (per tab) — `tabs.svelte.ts`
+
+```
+chunks         : ChunkRow[]        // SEALED history, real per-tab seq, oldest→newest.
+                                   // The growing / evictable / paginated structure.
+liveRender     : Chunk[]           // transient fold buffer for the CURRENT turn only,
+                                   // built by the EXISTING appendEventToChunks.
+liveTurnId     : string | null     // turn_id of the in-flight turn (for stable keys).
+liveAssistantId: string | null     // == currentAssistantId today.
+oldestLoadedSeq: number | null     // min seq in `chunks` (pagination cursor).
+totalChunks    : number            // backend total (drives "more to load?").
+// DELETED: messages, currentAssistantId(renamed→liveAssistantId), the turnId-merge state.
+// Untouched: queuedMessages, cacheStats, agentStatus, model/agent fields, etc.
+```
+
+There is **no `messages` field.** The live turn is not a "message" — it's a
+one-turn streaming buffer that is reconciled into `chunks` (as real-seq rows) the
+moment the turn seals.
+
+### Unified flat view (derived, for render + eviction count)
+
+```
+liveFlat = liveTurnId ? explodeTurn(liveTurnId, liveRender)
+                          .map((d,i) => ({...d, id:`${liveTurnId}:${i}`, seq:null, live:true}))
+                       : []
+log      = [...chunks, ...liveFlat]            // one flat, chunk-native list
+```
+
+`explodeTurn` is the *same* transform the backend persists with, so the live flat
+chunks are byte-for-byte what will land in the DB → reconciliation at seal is
+seamless and symmetric. `log` is what we render and count for eviction.
+
+### Streaming (reuse, no new reducer)
+
+- Content events (`text-delta`/`reasoning-*`/`tool-call`/`tool-result`/
+  `shell-output`/`error`/`notice`/`model-changed`/`config-reload`) fold into
+  `liveRender` via the existing `appendEventToChunks` + `$state.snapshot` clone
+  (the documented Svelte-5 proxy hazard — keep it).
+- `turn-start` (new tiny WS event, see backend) sets `liveTurnId` /
+  `liveAssistantId` so live-flat ids match the sealed rows' turn → no remount at
+  seal. Without it, a one-frame remount at seal with identical content; harmless.
+- Optimistic user message: on send, append a provisional user row to `chunks`
+  with `seq:null, live:true` (or carry it in a tiny `pendingUser` slot). It is
+  replaced by the real user row at reconcile.
+
+### Rendering — `ChatPanel.svelte` / `ChatMessage.svelte`
+
+```
+renderGroups = groupRowsToMessages(log)   // ephemeral, render-time ONLY
+{#each renderGroups as g (g.id)}  <ChatMessage .../>  {/each}
+```
+
+Grouping pairs `tool_call`+`tool_result` by `callId` and wraps a turn's
+assistant chunks into a bubble — **purely a view projection of the flat log**,
+never stored, never used for eviction/pagination. `ChatMessage.svelte` is largely
+unchanged (it already renders a grouped turn's `Chunk[]`).
+
+> **DECISION TO CONFIRM (rendering):** default = keep the current bubble look via
+> this render-time grouping (chunk-native state, same UX). Alternative = render
+> each chunk as a standalone element (fully flat, no assistant-bubble wrapper).
+> I'm proceeding with the default; say the word to go fully flat (smaller view
+> change, different look).
+
+### Eviction (`evictChunks`, replaces `evictMessages`)
+
+```
+limit = appSettings.chunkLimit; if !finite or ≤0 → return
+if scrolledUp and !force → return                       // unchanged suppression
+total = chunks.length + liveFlat.length
+while total > limit:
+    if chunks.length > 0 and chunks[0] is not the open chunk:
+        drop chunks[0]; total--                          // evict sealed front first
+    else if liveRender has a non-open leading chunk:
+        drop oldest liveRender chunk; recompute liveFlat; total = chunks.length+liveFlat.length
+    else break                                           // only the open chunk remains
+oldestLoadedSeq = min seq remaining in chunks (or unchanged if chunks emptied)
+```
+
+This trims a huge **old** turn chunk-by-chunk (the fix) and even trims within a
+huge **live** turn (bounding memory mid-stream), never evicting the open chunk.
+Triggered on each content event (chunk completion) and after load/reconcile.
+
+### Pagination (`loadOlderChunks`, replaces `loadMoreMessages`)
+
+```
+if oldestLoadedSeq == null or ≤ 0 → nothing older
+GET /tabs/:id/chunks?limit=50&before=oldestLoadedSeq
+merge rows into `chunks`, DEDUPE BY seq, keep seq-sorted
+oldestLoadedSeq = min seq held; totalChunks = resp.total
+```
+
+The old `turnId`-merge hack (`tabs.svelte.ts:440-462`) is **deleted** — raw rows
+in one seq-sorted array make `groupRowsToMessages` rejoin a boundary-split turn
+automatically. Overlap is fine; dedupe-by-seq is trivial.
+
+### Turn-completion reconcile (the one new flow) — on running→idle for the tab
+
+```
+refetch GET /tabs/:id/chunks?limit=chunkLimit         // tail, real seqs (write already landed)
+drop all live entries (liveRender=[], liveTurnId=null) and provisional user row
+merge refetched rows into `chunks`, dedupe by seq, seq-sort
+oldestLoadedSeq = min seq; evictChunks()
+if scrolledUp: defer this refetch until the user returns to bottom
+```
+
+This single mechanism (a) gives the just-completed turn **real seqs** (so the
+log never accumulates seq-less middles that break pagination), and (b) **recovers
+any live chunks roll-evicted mid-stream** — they come back from the DB now that
+the write landed. Exactly the "delay the reload until the write lands" model.
+Keying render groups by `turnId` makes the swap from live→sealed flicker-free.
+
+---
+
+## Backend changes (additive, cache-neutral, low-cost)
+
+1. **`routes/tabs.ts`** — add `GET /tabs/:id/chunks?limit&before` →
+   `{ chunks: ChunkRow[], total, oldestSeq }` (raw rows; no grouping). Leave
+   `/messages` as-is (unused by the new frontend; harmless).
+2. **`db/chunks.ts`** — wrap the `appendChunks` insert loop in a single
+   `db.transaction(...)` (one fsync per turn instead of N). Pure perf; no behavior
+   change.
+3. **`types/index.ts` + `agent-manager.ts`** — add a tiny
+   `{ type:"turn-start"; turnId; assistantId }` event, emitted at turn start
+   (`agent-manager.ts:1120-1122`). Optional but recommended (stable keys → no
+   seal remount). `appendEventToChunks` lists it in its no-op branch for
+   exhaustiveness.
+4. **WS:** nothing — `index.ts:25-26` forwards all emitted events to the browser.
+
+No change to persistence *timing*, `explodeTurn`, the agent wire, or the caching
+path. (Optional future optimization, not in this plan: a `chunk-rows` push at
+seal to avoid the per-turn reconcile GET. The reconcile-refetch is simpler and
+matches the agreed "reload" model, so we ship that.)
+
+---
+
+## Test plan
+
+Rewrite frontend tests from `tab.messages` assertions to **chunk-log** assertions
+(the store no longer has `messages`). Cover:
+
+- **Rolling eviction (headline):** stream a turn of `chunkLimit + N` chunks;
+  assert in-memory `log` length stays `≤ chunkLimit` (open chunk exempt) and the
+  oldest entries are dropped — including dropping the *leading chunks of a single
+  giant turn* (the exact case the limitation was about).
+- **Pagination + dedupe:** `loadOlderChunks` prepends older rows, dedupes by seq,
+  updates `oldestLoadedSeq`; a turn split at the window boundary renders as one
+  group (no merge hack, no duplication).
+- **Turn-completion reconcile:** stream a turn (live, seq=null) → running→idle →
+  assert live entries are replaced by real-seq rows from the refetch, `log`
+  content identical, render-group identity stable (no remount), and a mid-stream-
+  evicted chunk is recovered.
+- **Deferred reload while scrolled up:** reconcile is deferred until return-to-
+  bottom; nothing yanks the viewport.
+- **Streaming fidelity:** the existing delta→chunk behaviors (coalescing,
+  thinking metadata seal, tool-result fill, shell-output) still hold on
+  `liveRender` (these reuse `appendEventToChunks`, so port the assertions to read
+  the live flat view).
+- **Cache untouched:** the 3-step byte-identical cache-stability test in
+  `packages/core/tests/agent/agent.test.ts` passes unchanged; `routes.test.ts`
+  gains `/chunks` coverage.
+
+## Phasing (each phase: `bun run check` + `bun run test` green)
+
+- **P1 — Backend additive:** `GET /chunks` raw endpoint; wrap `appendChunks` in a
+  transaction; add `turn-start`. Tests for endpoint + event. (Cache test must
+  still pass byte-for-byte.)
+- **P2 — Frontend store core:** introduce `chunks`/`liveRender`/`liveTurnId`;
+  retarget every streaming handler from `messages` to `liveRender`; build the
+  derived `log` + render groups; delete the `messages` field. Port streaming
+  tests.
+- **P3 — Eviction + pagination + reconcile:** `evictChunks`, `loadOlderChunks`
+  (delete turnId-merge), running→idle reconcile-refetch, load paths
+  (`hydrateFromBackend`/`openAgentTab`/`reloadTabMessagesFromApi`) switched to
+  `GET /chunks`. New tests.
+- **P4 — Cleanup:** delete dead code (`oldestSeqOf` if unused, old eviction
+  comment, `loadMoreMessages` name), update `eviction-limitation.md` →
+  resolved, full `bun run test` + `bun run check`.
+
+## Risks / open items
+
+- **Rendering decision** (above) — the one thing to confirm; proceeding with
+  render-time grouping (preserves UX).
+- **Interrupt turns at reconcile.** The live UI shows an interrupt as a split
+  (`message-consumed` handler); the persisted turn keeps the interrupt inside a
+  tool result (cache-safe, prior decision). So at reconcile the bubble re-renders
+  to the persisted shape. Cosmetic; documented; not fixed (fixing touches the
+  wire/cache).
+- **shell-output live placement.** `appendEventToChunks` attaches live shell
+  output to the running tool call; `explodeTurn` places it on the `tool_result`
+  row. The live flat view and the post-reconcile rows therefore differ only in
+  where shellOutput hangs until seal — render reads it from the paired call/result
+  either way. Minor; assert in a test.
+- **Scrolled-up during reconcile** — deferred (handled above); verify no viewport
+  jump using ChatPanel's existing scroll-anchor logic.
+- **Renames ripple:** `totalMessages→totalChunks`, `currentAssistantId→
+  liveAssistantId`, `evictMessages→evictChunks`, `loadMoreMessages→
+  loadOlderChunks` touch a few call sites + tests.
diff --git a/plan-chunk-log.md b/notes/plan-chunk-log.md
index 3b39303..3b39303 100644
--- a/plan-chunk-log.md
+++ b/notes/plan-chunk-log.md
diff --git a/plan-chunk-refactor.md b/notes/plan-chunk-refactor.md
index 0a60237..0a60237 100644
--- a/plan-chunk-refactor.md
+++ b/notes/plan-chunk-refactor.md
diff --git a/plan-v6-upgrade.md b/notes/plan-v6-upgrade.md
index ad3d223..ad3d223 100644
--- a/plan-v6-upgrade.md
+++ b/notes/plan-v6-upgrade.md
diff --git a/plan.md b/notes/plan.md
index 1dca443..1dca443 100644
--- a/plan.md
+++ b/notes/plan.md
diff --git a/problem.md b/notes/problem.md
index a29f801..a29f801 100644
--- a/problem.md
+++ b/notes/problem.md
diff --git a/notes/queue-interrupt-reconcile-edge-cases.md b/notes/queue-interrupt-reconcile-edge-cases.md
new file mode 100644
index 0000000..e564895
--- /dev/null
+++ b/notes/queue-interrupt-reconcile-edge-cases.md
@@ -0,0 +1,183 @@
+# The Queue / Interrupt / Reconcile Path Is an Edge-Case Magnet
+
+> Status: living document. Started after the chunk-native frontend rewrite, when
+> three consecutive independent code-review passes each surfaced a *new* Blocker
+> in the same ~40 lines of code. This note explains **why** that area is fragile,
+> catalogs the bugs found so far, states the invariants that must hold, and
+> records the recommended longer-term fix so the next person doesn't relearn it
+> the hard way.
+
+## TL;DR
+
+The frontend turn-completion reconcile (`reconcileSealedTurn` →
+`reloadChunksFromApi`) decides which transient `live` rows to **keep** vs **drop**
+when a turn's durable chunks arrive. It makes that decision from a tangle of
+loosely-coupled signals — `turnId` present/absent, the `queued-` id prefix,
+`queuedMessages` membership, `liveTurnId`, scroll state — that are mutated by
+**six** asynchronous events arriving in non-deterministic order, sometimes from
+**other clients**. Every signal is a *proxy* for the real question ("is this row
+already durable in `chunks`?"), and every bug so far has been a case where the
+proxy disagreed with reality. The result: either a row is **lost** (dropped but
+not yet sealed) or **duplicated/lingers** (kept but already sealed).
+
+**If you touch this code, re-read "Invariants" and "Why it keeps breaking" below,
+and add a test for the exact interleaving you're changing.**
+
+---
+
+## The moving parts
+
+### State (per tab, frontend store `tabs.svelte.ts`)
+- `tab.chunks: ChunkRow[]` — SEALED, durable history (real DB `seq`). Source of truth.
+- `tab.live: ChatMessage[]` — TRANSIENT buffer for the in-flight turn + optimistic
+  / queued user rows not yet folded into `chunks`.
+- `tab.renderGroups` — DERIVED: `groupRowsToMessages(chunks) ++ live`. What the UI shows.
+- `tab.queuedMessages: QueuedMessage[]` — messages the user sent while the agent was
+  busy, awaiting consumption.
+- `tab.liveTurnId` / `tab.currentAssistantId` — the in-flight turn + its streaming row.
+- per-row `turnId` — set once the row is bound to a concrete turn; `undefined` while
+  still optimistic/pending.
+- per-row id convention — a still-pending queued row has id `queued-<queueId>`.
+
+### Events (arrive over WS, order NOT guaranteed relative to each other)
+- `turn-start { turnId }` — fired **once per user-initiated `processMessage`**
+  (`agent-manager.ts`), which persists **exactly one** user chunk row
+  (`explodeUserText`). **A queued message NEVER gets its own `turn-start`.**
+- `text-delta` / tool events — stream content into the live assistant row.
+- `message-queued { messageId }` — a send was queued (this client or another).
+- `message-consumed { messageIds }` — the agent drained queued message(s) **into the
+  running turn**: either injected as a `[USER INTERRUPT]` block inside a tool result,
+  or appended as trailing history (`agent.ts` — the two `dequeueMessages()` sites).
+- `status { idle | running | error }` — note: `idle` fires **before** the DB write.
+- `turn-sealed { turnId }` — fired **after** `flushAssistant` (the durable write).
+  This is the only safe trigger for reconcile.
+
+### The reconcile decision (`reloadChunksFromApi`)
+On `turn-sealed` (deferred while the user is scrolled up, via `pendingReconcileTabs`),
+refetch the chunk window and recompute `live` as `keptLive`:
+
+```
+keptLive = live.filter(m =>
+  (preserveTurnId !== null && m.turnId === preserveTurnId) // a NEWER in-flight turn
+  || (m.turnId === undefined && m.role === "user")         // optimistic/queued, not yet sealed
+)
+```
+
+Everything else is assumed already-durable in `chunks` and dropped.
+
+---
+
+## Why it keeps breaking
+
+The reconcile filter answers **"is this live row already in the sealed chunks?"**
+but it has no direct way to know. It infers the answer from `turnId` and the
+`queued-` prefix. That inference is correct **only if tagging is perfectly
+exhaustive and perfectly conservative**:
+
+- **Exhaustive:** every row that *is* (or will be) sealed into a turn's chunks must
+  carry that `turnId` *before the turn seals*. Miss one → it stays "untagged" → kept
+  → **duplicate** (it's also in `chunks`), and it lingers forever because nothing
+  ever tags it later.
+- **Conservative:** every row that is *not* part of any sealed turn (a pending queued
+  message, an optimistic initiator before `turn-start`) must stay untagged. Over-tag
+  one → it looks sealed → dropped → **the user's message vanishes**.
+
+Both failure modes hinge on the *same* `turnId`-presence bit, pulled in opposite
+directions, mutated by events whose ordering we don't control. That is the whole
+problem in one sentence. Add multi-client (events for turns this client never
+initiated) and deferred reconcile (state mutates *while* a reconcile is pending),
+and the interleaving space explodes.
+
+---
+
+## Catalog of bugs found (one per review pass, same ~40 lines)
+
+| # | Pass | Failure mode | Root cause | Fix |
+|---|------|--------------|-----------|-----|
+| A | 1 | A newer, still-streaming turn got **wiped** when an earlier turn's deferred reconcile flushed. | reconcile blindly cleared the whole live tail. | `preserveTurnId`: keep rows whose `turnId === liveTurnId` when `liveTurnId !== sealedTurnId`. |
+| B | 1 | Optimistic **queued** user bubble **dropped** on reconcile. | filter didn't keep untagged user rows. | keep `turnId === undefined && role === "user"`. |
+| C | 2 | `turn-start` backfill **over-tagged** trailing `queued-` rows with the new turnId → **wiped** on seal. | backfill tagged *every* trailing untagged user row. | skip `queued-` rows; tag only the single most-recent non-queued initiator; then `break`. |
+| D | 3 | **Consumed** interrupt bubble **lingered forever** + **duplicated** the `[USER INTERRUPT]` text in the sealed tool result. | `message-consumed` stripped the `queued-` prefix to a *plain untagged* row, which rule B then preserved on every reconcile. | bind the consumed row to `liveTurnId` so reconcile drops it (collapse to persisted shape). |
+
+Notice the chain reaction: the **rule B fix** ("keep untagged user rows") is the
+hinge that **both C and D** then bent. C was about *creating* wrongly-tagged rows;
+D was about *failing to tag* rows that should have been. Each fix narrowed the
+definition of "untagged ⇒ keep" without ever making it airtight.
+
+---
+
+## Invariants (the contract reconcile depends on)
+
+1. **No loss.** A live row that is not yet represented in `chunks` must survive
+   reconcile. (Pending queued messages, optimistic initiators pre-`turn-start`.)
+2. **No duplicate / no linger.** A live row whose content has been sealed into
+   `chunks` (as a user row OR folded into a tool result) must be dropped on the
+   sealing turn's reconcile.
+3. **Newer turn preserved.** A turn that started streaming *after* the one being
+   reconciled must not be touched; only the sealed turn folds into `chunks`.
+4. **Idempotent + deferral-safe.** Reconcile may run late (after the user scrolls
+   back to the bottom) and may be preceded by arbitrary further events. Re-running it
+   must not violate 1–3. Key reconcile off `turn-sealed` (post-write), never `status:idle`.
+
+Corollaries that are easy to forget:
+- A consumed/interrupt message is sealed **inside a tool result**, not as its own
+  user chunk row — yet its optimistic bubble must still be dropped (invariant 2).
+- A freshly-created tab defaults to `agentStatus: "running"`, so the *first* user
+  send may be treated as queued unless the tab is known-idle.
+
+---
+
+## Recommended longer-term fix (not yet done)
+
+Stop inferring "is this sealed?" from `turnId` presence. Make preservation key off
+**explicit, positive membership** instead of the absence of a tag:
+
+> Keep a live row iff it belongs to a turn we are explicitly preserving
+> (`turnId === preserveTurnId`) **or** it is still a *pending* queued message
+> (its `queueId` is still in `tab.queuedMessages`). Drop everything else.
+
+This flips the dangerous default. Today "untagged ⇒ keep" means *any* tagging gap
+causes a linger/dup (bugs C/D class). With membership-based keep, an untagged row
+that is *not* a pending queued message is dropped — which is correct, because the
+only rows that should outlive a reconcile are (a) the explicitly-preserved newer
+turn and (b) genuinely-pending queue entries, both of which have positive signals.
+It also makes the `turn-start` backfill purely cosmetic (stable render keys), so a
+tagging miss there can no longer lose or duplicate data.
+
+Until that refactor lands, treat this path as **high-touch**: any change needs a
+targeted interleaving test.
+
+---
+
+## Testing guidance
+
+Property/interleaving coverage beats one-off scenario tests here. Suggested:
+- A randomized sequence generator over `{turn-start, text-delta, message-queued,
+  message-consumed, status, turn-sealed, scrollUp/Down}` with multiple concurrent
+  turns and a simulated second client, asserting invariants 1–3 after every
+  `turn-sealed` + final state.
+- Scenario tests that already exist in `packages/frontend/tests/chat-store.test.ts`
+  and should stay green (named for the bugs above):
+  - deferred reconcile preserves a concurrent newer turn (A)
+  - optimistic queued user message survives an earlier turn's reconcile (B)
+  - `turn-start` backfill skips a pending queued row, tags only the initiator (C)
+  - a consumed interrupt message collapses into the sealed turn — no lingering bubble (D)
+- Always assert **both** directions: the kept row is still present (no loss) AND the
+  sealed rows are not duplicated (no linger).
+
+## Pointers
+
+- Frontend store: `packages/frontend/src/lib/tabs.svelte.ts`
+  (`reloadChunksFromApi`, `reconcileSealedTurn`, `handleEvent` cases
+  `turn-start` / `turn-sealed` / `message-queued` / `message-consumed` / `status`,
+  `setScrolledUp`, `pendingReconcileTabs`).
+- Render keying: `packages/frontend/src/lib/components/ChatPanel.svelte`
+  (`${turnId}:${role}:${n}`).
+- Backend event emission: `packages/api/src/agent-manager.ts` (`processMessage`,
+  `dequeueMessages`, `turn-start` / `turn-sealed`); `packages/core/src/agent/agent.ts`
+  (the two `dequeueMessages()` consumption sites).
+- Review records: `notes/gemini-chunk-eviction-review.md` (Pass 1, bugs A+B),
+  `notes/gemini-chunk-eviction-review-2.md` (Pass 2, bug C),
+  `notes/gemini-chunk-eviction-review-3.md` (Pass 3, bug D).
+</content>
+</invoke>
diff --git a/report.md b/notes/report.md
index 3c6d6f7..3c6d6f7 100644
--- a/report.md
+++ b/notes/report.md
diff --git a/requirements.md b/notes/requirements.md
index 96b2fb8..96b2fb8 100644
--- a/requirements.md
+++ b/notes/requirements.md
diff --git a/notes/tool-runner-duplication-incident.md b/notes/tool-runner-duplication-incident.md
new file mode 100644
index 0000000..188110d
--- /dev/null
+++ b/notes/tool-runner-duplication-incident.md
@@ -0,0 +1,147 @@
+# Incident Report: Duplicated Tool Output in Dispatch Harness
+
+**Date:** 2025-05-30
+**Component:** Dispatch AI harness — tool runner / tool-result delivery
+**Severity:** Medium (no data corruption, but severe context pollution and wasted tokens/latency)
+**Status:** Observed, not root-caused
+
+---
+
+## Summary
+
+During the exploration phase of a code task (disabling the todo/tasks skill),
+a single assistant turn that batched a large number of independent,
+read-only tool calls came back with **massively duplicated tool results**.
+The same `read_file`, `list_files`, and `run_shell` invocations were echoed
+back dozens of times each, producing a multi-thousand-line "wall" of
+near-identical output for what should have been a few dozen distinct results.
+
+The actual file edits performed later in the task landed correctly — this was
+an **output/delivery problem**, not a state-corruption problem. But the
+duplication wasted a large amount of context window, made the transcript very
+hard to read, and could plausibly cause an agent to lose track of its plan or
+hit context limits on a larger task.
+
+---
+
+## What I observed
+
+I emitted **one** assistant message containing a large parallel batch of
+independent tool calls (the correct pattern for independent reads). The batch
+included things like:
+
+- `read_file packages/api/src/agent-manager.ts` (several distinct line ranges)
+- `read_file package.json`
+- `list_files packages/api/src`
+- various `run_shell` commands (`grep`, `sed`, `awk`, plus debug probes like
+  `echo hello`, `echo hi`, `pwd`)
+
+Instead of one result per call, the results stream contained the **same
+results repeated many times over**. Concrete examples from the transcript:
+
+| Tool call | Approx. times the identical result appeared |
+|---|---|
+| `read_file package.json` (lines 1-5) | ~150+ |
+| `read_file agent-manager.ts` (lines 75-126 / 76-127 / 78-117) | ~40+ |
+| `list_files packages/api/src` | ~20 |
+| `run_shell sed -n '75,127p' ...` | ~13 |
+| `run_shell awk 'NR>=75...'` | ~12 |
+| `run_shell echo hello` | ~16 |
+| `run_shell pwd` | ~20 |
+
+The duplicated blocks were **byte-for-byte identical** (same line ranges, same
+content, same exit codes). They were not re-reads of changed files — the
+content was stable across every repeat.
+
+### Notable characteristics
+
+1. **Read-only and idempotent calls were the ones duplicated.** The heavy
+   duplication clustered on `read_file` / `list_files` / trivial `run_shell`
+   probes. This is consistent with the harness retrying or re-emitting results
+   for calls it considered safe/idempotent.
+2. **The duplication count was wildly uneven.** `package.json` (a tiny file)
+   was echoed ~150 times while a single `read_file_slice` appeared once. The
+   multiplier was not constant across calls.
+3. **A genuinely valuable signal was nearly buried.** One real result hid in
+   the noise: `read_file agent-manager.ts` lines 75-126 truncated mid-line and
+   instructed me to use `read_file_slice` — easy to miss in a wall of
+   duplicates.
+4. **No effect on writes.** The later `write_file`/`run_shell` mutation steps
+   each returned exactly once and produced correct results, and the final
+   `git diff --stat` matched expectations (6 insertions, 168 deletions across
+   8 files). So the glitch appears confined to result *delivery/echo*, not
+   tool *execution*.
+
+---
+
+## Impact
+
+- **Context pollution:** Thousands of redundant lines consumed the context
+  window. On a larger codebase or a longer task this could trigger truncation
+  or force premature summarization.
+- **Readability:** The human-facing transcript became almost unreadable for
+  the exploration phase; it's hard to audit what the agent actually looked at.
+- **Cost/latency:** Re-emitting and re-processing identical results wastes
+  tokens and time.
+- **Reasoning risk:** Heavy duplication can bias or confuse a model
+  (repetition can be read as emphasis), and the near-miss on the truncation
+  hint shows real signals can get lost.
+
+---
+
+## What I could NOT determine
+
+- Whether the duplication originated in:
+  - the **tool runner** (executing/emitting each call multiple times),
+  - the **result serializer/transport** (one execution, many echoes), or
+  - a **retry/timeout loop** (calls re-issued after a perceived stall).
+- Whether the trigger was the **large fan-out batch size** specifically, or
+  something incidental.
+- Whether other agents on the same harness see it (single-session
+  observation only).
+
+The byte-identical nature of the repeats (including stable `pwd` and `echo`
+output) leans toward an **echo/transport duplication** rather than genuine
+re-execution, but that is inference, not proof.
+
+---
+
+## Reproduction hypothesis
+
+Not reliably reproduced. The strongest correlated factor was **a single turn
+with a large number (dozens) of independent tool calls batched together**,
+heavily weighted toward `read_file` on small files. Suggested repro attempt:
+
+1. In one assistant turn, batch ~30+ independent `read_file` calls, several
+   targeting the same small file at different line ranges.
+2. Mix in a few cheap `run_shell` probes (`echo`, `pwd`).
+3. Observe whether results are echoed >1× and whether the multiplier varies
+   per call.
+
+---
+
+## Suggested follow-ups for the harness maintainers
+
+1. **Deduplicate / cap result emission per tool_call_id.** Each tool call has
+   a unique id; the result stream should emit exactly one result per id.
+   Assert this invariant and drop/merge duplicates.
+2. **Log the executor path.** Add a counter for "executions per tool_call_id"
+   vs "results delivered per tool_call_id" to distinguish re-execution from
+   re-echo.
+3. **Investigate the batch/fan-out path.** Check whether large parallel tool
+   batches trigger a retry or a fan-in bug that replays buffered results.
+4. **Surface truncation hints robustly.** When a `read_file` result is
+   truncated, ensure that notice survives any dedup/compaction so the agent
+   doesn't miss it.
+5. **Consider a per-turn result-size guardrail** that warns when duplicate
+   identical results are detected.
+
+---
+
+## Notes
+
+- This document records a single observed session. Numbers above are
+  approximate counts read from the transcript, not exact instrumentation.
+- The underlying code task completed successfully and was verified
+  independently (targeted test suites passed, typechecks clean); this report
+  concerns only the harness behavior, not the task outcome.
diff --git a/wishlist.md b/notes/wishlist.md
index 43e579b..43e579b 100644
--- a/wishlist.md
+++ b/notes/wishlist.md
diff --git a/packages/core/src/credentials/anthropic-betas.ts b/packages/core/src/credentials/anthropic-betas.ts
index f2ca7a6..802ebbd 100644
--- a/packages/core/src/credentials/anthropic-betas.ts
+++ b/packages/core/src/credentials/anthropic-betas.ts
@@ -7,7 +7,7 @@
 //
 // `prompt-caching-scope-2026-01-05` is load-bearing for cost: without it the
 // Anthropic API silently ignores every `cache_control` breakpoint we place,
-// producing a 0% cache hit rate (see claude-report.md). `oauth-2025-04-20`
+// producing a 0% cache hit rate (see notes/claude-report.md). `oauth-2025-04-20`
 // gates the Bearer/OAuth flow used by Claude Pro/Max subscriptions.
 
 const BASE_BETAS = [
diff --git a/packages/core/src/db/index.ts b/packages/core/src/db/index.ts
index 18dd1b5..29448bc 100644
--- a/packages/core/src/db/index.ts
+++ b/packages/core/src/db/index.ts
@@ -99,7 +99,7 @@ export function getDatabase(): Database {
 	// keyed by a per-tab monotonic `seq`. "Message" and "turn" are DERIVED
 	// groupings (see db/chunks.ts), never stored containers. This is what
 	// powers per-chunk frontend pagination AND the stable per-step wire
-	// format that fixes Anthropic prompt-cache churn (see plan-chunk-log.md).
+	// format that fixes Anthropic prompt-cache churn (see notes/plan-chunk-log.md).
 	//
 	//   role  : 'user' | 'assistant' | 'tool' | 'system'
 	//   type  : 'text' | 'thinking' | 'tool_call' | 'tool_result' | 'error' | 'system'
diff --git a/packages/core/src/llm/provider.ts b/packages/core/src/llm/provider.ts
index 5978196..9e8475f 100644
--- a/packages/core/src/llm/provider.ts
+++ b/packages/core/src/llm/provider.ts
@@ -81,7 +81,7 @@ export function createProvider(config: ProviderConfig): ModelFactory {
  * (computer-use, structured-outputs, etc.) — it does NOT add the prompt-caching
  * or oauth betas on its own. Without `prompt-caching-scope-2026-01-05` the API
  * silently ignores every `cache_control` breakpoint we attach to messages,
- * giving a 0% cache hit rate and a massive token burn (see claude-report.md).
+ * giving a 0% cache hit rate and a massive token burn (see notes/claude-report.md).
  * The SDK folds any `anthropic-beta` it finds on the provider's config headers
  * back into its own beta set (via `getBetasFromHeaders`), so the values here
  * are merged — not overwritten — with any tool-derived betas.
diff --git a/packages/core/src/tools/truncate.ts b/packages/core/src/tools/truncate.ts
index 2204a3f..8c62174 100644
--- a/packages/core/src/tools/truncate.ts
+++ b/packages/core/src/tools/truncate.ts
@@ -8,7 +8,7 @@ import { dirname, join } from "node:path";
 // /tmp/dispatch/tool-results/<tabId>/<callId>.txt and the model receives
 // HEAD_CHARS from the start + TAIL_CHARS from the end with a notice
 // in between. These are deliberate hardcoded defaults — see the design
-// discussion in plan.md for the rationale.
+// discussion in notes/plan.md for the rationale.
 
 export const MAX_CHARS = 10_000;
 export const MAX_LINES = 500;
diff --git a/packages/core/tests/agent/agent.test.ts b/packages/core/tests/agent/agent.test.ts
index 6c2b452..a12ea07 100644
--- a/packages/core/tests/agent/agent.test.ts
+++ b/packages/core/tests/agent/agent.test.ts
@@ -1038,7 +1038,7 @@ describe("Agent", () => {
 		expect(rc).toBeUndefined();
 	});
 
-	// ─── Prompt-caching: tool-result grouping & breakpoints (claude-report.md) ──
+	// ─── Prompt-caching: tool-result grouping & breakpoints (notes/claude-report.md) ──
 
 	it("groups a turn's tool results into a SINGLE role:'tool' message (Root Cause 2)", async () => {
 		// The agent batches three distinct read_file calls in one step. The
@@ -1158,7 +1158,7 @@ describe("Agent", () => {
 		}
 	});
 
-	// ─── Tool-call dedup (tool-runner-duplication-incident.md) ─────────────────
+	// ─── Tool-call dedup (notes/tool-runner-duplication-incident.md) ─────────────────
 
 	it("deduplicates byte-identical tool calls within a single batch", async () => {
 		// Claude can degenerate and emit the same tool call (same name + args)
@@ -1265,7 +1265,7 @@ describe("Agent", () => {
 		// A 3-step tool turn. The messages for steps 0 and 1 must serialize
 		// identically in the step-2 request and the step-3 request — that
 		// byte-stability is what lets Anthropic's rolling prompt cache extend
-		// instead of re-writing the whole prefix every step (cache-miss-report.md).
+		// instead of re-writing the whole prefix every step (notes/cache-miss-report.md).
 		// Uses the openai-compatible provider so no cacheControl markers (which
 		// intentionally move each step) obscure the content comparison.
 		let n = 0;
diff --git a/packages/core/tests/llm/provider.test.ts b/packages/core/tests/llm/provider.test.ts
index 9e6b2ad..12b6350 100644
--- a/packages/core/tests/llm/provider.test.ts
+++ b/packages/core/tests/llm/provider.test.ts
@@ -179,7 +179,7 @@ describe("createClaudeOAuthProvider", () => {
 		}
 	});
 
-	it("sends the anthropic-beta header so prompt-caching is honored (claude-report.md Root Cause 1)", () => {
+	it("sends the anthropic-beta header so prompt-caching is honored (notes/claude-report.md Root Cause 1)", () => {
 		// Without `anthropic-beta: ...,prompt-caching-scope-2026-01-05,...` the
 		// Anthropic API silently ignores every `cache_control` marker we attach
 		// to messages, producing a 0% cache hit rate. `@ai-sdk/anthropic` does
author	Adam Malczewski <[email protected]>	2026-05-30 23:15:18 +0900
committer	Adam Malczewski <[email protected]>	2026-05-30 23:15:18 +0900
commit	4e636511ae748d606d8871f5068a2bd18b386bd0 (patch)
tree	a5e0726d71d9d88d09d938ea2318a61e36ade68f
parent	624b808da0f2f8bbad8a4fbbcca3f82f24ecfc47 (diff)
download	dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.tar.gz dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.zip