diff options
| author | Adam Malczewski <[email protected]> | 2026-05-30 23:15:18 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-05-30 23:15:18 +0900 |
| commit | 4e636511ae748d606d8871f5068a2bd18b386bd0 (patch) | |
| tree | a5e0726d71d9d88d09d938ea2318a61e36ade68f | |
| parent | 624b808da0f2f8bbad8a4fbbcca3f82f24ecfc47 (diff) | |
| download | dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.tar.gz dispatch-4e636511ae748d606d8871f5068a2bd18b386bd0.zip | |
chore(notes): collect loose root docs into notes/; add reconcile edge-cases note
Move all loose root-level .md files (plans, reports, gemini reviews, incident
notes) into a single notes/ directory, and update the doc-reference breadcrumbs in
code comments/test labels to the notes/ path.
Add notes/queue-interrupt-reconcile-edge-cases.md: documents why the
queue/interrupt/turn-sealed reconcile path keeps surfacing edge cases (a catalog of
the four review-pass bugs, the no-loss/no-duplicate invariants, the recommended
membership-based reconcile refactor, and interleaving-test guidance).
| -rw-r--r-- | notes/cache-miss-report.md (renamed from cache-miss-report.md) | 0 | ||||
| -rw-r--r-- | notes/changes-report.md | 54 | ||||
| -rw-r--r-- | notes/changes.md (renamed from changes.md) | 0 | ||||
| -rw-r--r-- | notes/claude-auth-report.md | 282 | ||||
| -rw-r--r-- | notes/claude-report.md | 114 | ||||
| -rw-r--r-- | notes/context.md (renamed from context.md) | 0 | ||||
| -rw-r--r-- | notes/eviction-limitation.md (renamed from eviction-limitation.md) | 16 | ||||
| -rw-r--r-- | notes/gemini-chunk-eviction-review-2.md | 95 | ||||
| -rw-r--r-- | notes/gemini-chunk-eviction-review-3.md | 118 | ||||
| -rw-r--r-- | notes/gemini-chunk-eviction-review.md | 69 | ||||
| -rw-r--r-- | notes/gemini-chunk-log-review.md (renamed from gemini-chunk-log-review.md) | 0 | ||||
| -rw-r--r-- | notes/harness-comparison.md (renamed from harness-comparison.md) | 0 | ||||
| -rw-r--r-- | notes/plan-bg-restore.md (renamed from plan-bg-restore.md) | 0 | ||||
| -rw-r--r-- | notes/plan-chunk-eviction.md | 251 | ||||
| -rw-r--r-- | notes/plan-chunk-log.md (renamed from plan-chunk-log.md) | 0 | ||||
| -rw-r--r-- | notes/plan-chunk-refactor.md (renamed from plan-chunk-refactor.md) | 0 | ||||
| -rw-r--r-- | notes/plan-v6-upgrade.md (renamed from plan-v6-upgrade.md) | 0 | ||||
| -rw-r--r-- | notes/plan.md (renamed from plan.md) | 0 | ||||
| -rw-r--r-- | notes/problem.md (renamed from problem.md) | 0 | ||||
| -rw-r--r-- | notes/queue-interrupt-reconcile-edge-cases.md | 183 | ||||
| -rw-r--r-- | notes/report.md (renamed from report.md) | 0 | ||||
| -rw-r--r-- | notes/requirements.md (renamed from requirements.md) | 0 | ||||
| -rw-r--r-- | notes/tool-runner-duplication-incident.md | 147 | ||||
| -rw-r--r-- | notes/wishlist.md (renamed from wishlist.md) | 0 | ||||
| -rw-r--r-- | packages/core/src/credentials/anthropic-betas.ts | 2 | ||||
| -rw-r--r-- | packages/core/src/db/index.ts | 2 | ||||
| -rw-r--r-- | packages/core/src/llm/provider.ts | 2 | ||||
| -rw-r--r-- | packages/core/src/tools/truncate.ts | 2 | ||||
| -rw-r--r-- | packages/core/tests/agent/agent.test.ts | 6 | ||||
| -rw-r--r-- | packages/core/tests/llm/provider.test.ts | 2 |
30 files changed, 1334 insertions, 11 deletions
diff --git a/cache-miss-report.md b/notes/cache-miss-report.md index 03342af..03342af 100644 --- a/cache-miss-report.md +++ b/notes/cache-miss-report.md diff --git a/notes/changes-report.md b/notes/changes-report.md new file mode 100644 index 0000000..f30b611 --- /dev/null +++ b/notes/changes-report.md @@ -0,0 +1,54 @@ +# Changes Report + +## Overview +This report reviews the uncommitted changes in the `dispatch` repository. The changes can be broadly categorized into two parts: +1. **Codebase Updates**: The removal of the `todo` (task list) tool and a fix to a `summon` tool test. +2. **Documentation Additions**: Several new untracked markdown documents regarding planning and incident reports. + +## Per-File Analysis + +### Untracked Files +- **`claude-auth-report.md`** & **`tool-runner-duplication-incident.md`**: New post-mortem and incident investigation reports. +- **`cyberdeck/credentials.md`**: Documentation regarding credentials. +- **`skill-plan/` directory**: Multiple markdown files outlining a multi-step plan for building a new "skill" architecture (tool definitions, registration, routes, etc.). + +**Assessment**: Adding detailed markdown documents for incidents and project planning is a very good practice. They are cleanly separated from the application source code. + +### Modified Files (Staged & Unstaged) + +#### 1. `packages/core/src/tools/task-list.ts` +- **Change**: Removed the `createTaskListTool` function and the Zod/ToolDefinition imports. The `TaskList` class remains. +- **Correctness**: The removal of the factory is clean. + +#### 2. `packages/api/src/agent-manager.ts` +- **Change**: Removed `createTaskListTool` imports and invocations. Removed `todo` from the `TOOL_DESCRIPTIONS` and the lengthy `TODO_GUIDANCE` string from the agent's system prompt. +- **Correctness**: Consistently applies the removal of the `todo` tool. However, it still retains `tabAgent.taskList = new TaskList()`, which may now be dead state (see Issues section). + +#### 3. `packages/core/src/agents/loader.ts` & `packages/core/src/tools/summon.ts` +- **Change**: Removed `"todo"` from the default tool arrays and documentation comments for child agents. +- **Correctness**: Follows through with the removal of the `todo` tool across tool lists, ensuring child agents don't request a missing tool. + +#### 4. `packages/core/src/index.ts` +- **Change**: Updated exports to remove `createTaskListTool`, while keeping `TaskList`. +- **Correctness**: Correctly reflects the module changes. + +#### 5. Tests (`packages/api/tests/*.test.ts` & `packages/core/tests/agents/loader.test.ts`) +- **Change**: Removed mocked `createTaskListTool` injections and updated assertions that previously verified the automatic injection of the `todo` tool. +- **Correctness**: Properly aligns tests with the new implementation. Build/tests should pass cleanly. + +#### 6. `packages/core/tests/tools/summon.test.ts` +- **Change**: Updated the "surfaces child errors when blocking" test to correctly anticipate the `agent_id: <id>` prefix before the child output. Added an excellent explanatory comment about why this prefix exists (for frontend `ToolCallDisplay` regex parsing). +- **Correctness**: The change is highly robust. Explicitly asserting `.toContain()` before the final `.toBe()` provides great debugging signals if the test fails in the future. The comment is an excellent practice. + +## Correctness Assessment +The changes successfully and cleanly remove the `todo` tool logic from the backend tools and system prompt. The test fix for `summon.test.ts` is technically sound and well-documented. The changes are internally consistent from a backend tooling perspective. + +## Issues & Concerns Found +**Incomplete Refactoring (Dead Code)**: +While the `todo` tool was completely removed, the `TaskList` class itself and its instantiation in `AgentManager` (`tabAgent.taskList = new TaskList()`) were kept. Because the agent no longer has a tool to interact with this list, any tasks will remain empty. +Consequently, the UI component that relies on this (`TaskListPanel.svelte` in the `frontend` package) will now always render an empty state. This represents an incomplete feature removal/refactor. + +## Recommendations +1. **Clean Up `TaskList`**: If the todo feature is permanently removed, you should also delete the `TaskList` class from `packages/core`, remove the `taskList` property from `tabAgent` in `packages/api`, and delete `TaskListPanel.svelte` (along with its imports in `SidebarPanel.svelte`) from the frontend. +2. **Staging**: If the untracked markdown documents are ready, ensure they are intentionally added (`git add`) and committed. +3. **Commit the Changes**: The code removal of the `todo` tool is safe to commit. I recommend summarizing it as "refactor: remove todo tool from agent capabilities".
\ No newline at end of file diff --git a/changes.md b/notes/changes.md index 66389d9..66389d9 100644 --- a/changes.md +++ b/notes/changes.md diff --git a/notes/claude-auth-report.md b/notes/claude-auth-report.md new file mode 100644 index 0000000..db2c6e6 --- /dev/null +++ b/notes/claude-auth-report.md @@ -0,0 +1,282 @@ +# Claude Auth Report — "old tab 401s, new tab works" + +## Symptom + +A tab that has been open for a while starts failing **every** Claude request with: + +``` +Invalid authentication credentials +status=401 authentication_error +url=https://api.anthropic.com/v1/messages +model=claude-opus-4-8 baseURL=https://api.anthropic.com/v1 +``` + +Opening a **new tab** and selecting the same Claude key works fine. The old tab +keeps 401-ing no matter what you send. Both tabs are configured against the same +`anthropic` key (`claude-pro` / `claude-max`), so "they should be using the same +auth" — but they are not. + +This is an **authentication-layer** problem, not an agent/tool-wiring/permission +bug. The 401 comes back from Anthropic on the chat request itself. + +--- + +## Decisive clue: new tab works, old tab doesn't + +This single fact rules out most hypotheses and points at exactly one mechanism: +**the per-tab `Agent` instance is cached, and it freezes the access token it was +constructed with.** A new tab builds a fresh `Agent`, re-resolves credentials +from the DB, and gets a currently-valid token. The old tab never re-resolves. + +--- + +## Root cause #1 (proximate, confirmed): the cached per-tab Agent freezes a stale access token + +### Where the token is captured + +Credentials are resolved **only when a new `Agent` is constructed**, inside +`getOrCreateAgentForTab` in `packages/api/src/agent-manager.ts`: + +- The `anthropic` branch (lines ~579–636) resolves `claudeCredentials = + { accessToken: <token> }` and passes it into the `Agent` constructor + (lines ~686–700, `...(claudeCredentials ? { claudeCredentials } : {})`). +- That token is stored on the Agent's `this.config.claudeCredentials` and never + updated for the life of the Agent. + +In `packages/core/src/agent/agent.ts`, every `run()` call rebuilds the provider +from that frozen config: + +```ts +const providerFactory = createProvider({ + apiKey: this.config.apiKey, + baseURL: this.config.baseURL, + provider: this.config.provider, + claudeCredentials: this.config.claudeCredentials, // <-- frozen at construction +}); +``` + +And in `packages/core/src/llm/provider.ts`, `createClaudeOAuthProvider` sends it +as the bearer token: + +```ts +authToken: config.claudeCredentials?.accessToken ?? config.apiKey, +``` + +So the access token used on the wire is whatever was captured when the tab's +Agent was first built. Calling `run()` again does **not** refresh it. + +### Why the cache is never invalidated on expiry + +`getOrCreateAgentForTab` (agent-manager.ts lines ~361–368) only discards the +cached Agent when the key, model, or permissions change: + +```ts +if ( + tabAgent.agent && + (effectiveKeyId !== tabAgent.keyId || + effectiveModelId !== tabAgent.modelId || + permKey !== tabAgent._lastPermKey) +) { + tabAgent.agent = null; +} +``` + +**Token expiry is not one of the conditions.** A tab that keeps the same key + +model + permissions will reuse the same `Agent` — and therefore the same frozen +token — indefinitely. The credential-refresh code below this gate only runs when +`tabAgent.agent` is null, which for a stable tab never happens again. + +### Why this produces the exact symptom + +- **Old tab:** built its `Agent` earlier with token **A**. Time passes. Token A + is no longer accepted by Anthropic (it either reached its ~8h TTL, or a refresh + elsewhere — a new tab, the Model Status panel, model listing — rotated the + credential in the DB and **revoked A** server-side). The cached Agent still + holds A. Cache gate sees no key/model/perm change → Agent reused → every + request sends the dead token A → `401 authentication_error`. +- **New tab:** no cached Agent → runs the resolution path → reads the **current** + token from the DB (and/or refreshes) → token **B** → works. + +Both tabs "use the same key," but the old tab is pinned to a now-invalid +*instance* of that key's token. That is the whole discrepancy. + +> Note on OAuth token rotation: Anthropic's refresh flow rotates the refresh +> token and can invalidate the previously issued access token. So the moment any +> code path successfully refreshes the credential (updating the DB), every +> already-constructed Agent still holding the old access token is now holding a +> *revoked* token — not merely an expired one. This is why the old tab can fail +> even when its captured token's local `expiresAt` hasn't elapsed yet. + +--- + +## Root cause #2 (contributing, confirmed by reference): the OAuth refresh call is malformed + +Even when the resolution path *does* run, the refresh half of it is broken, so +the system can't reliably self-heal an expired token. + +`packages/core/src/credentials/claude.ts`: + +```ts +const OAUTH_TOKEN_URL = "https://claude.ai/v1/oauth/token"; // line 27 — wrong host + +async function refreshViaOAuth(refreshToken: string) { + const body = new URLSearchParams({ grant_type: "refresh_token", client_id: OAUTH_CLIENT_ID, refresh_token: refreshToken }); + const response = await fetch(OAUTH_TOKEN_URL, { + method: "POST", + headers: { "Content-Type": "application/x-www-form-urlencoded" }, // wrong content-type + body: body.toString(), // form-encoded + }); + if (!response.ok) return null; // failure is swallowed silently + ... +} +``` + +Two concrete defects, both contradicted by the working reference implementation +in `references/oh-my-pi/packages/ai/src/utils/oauth/anthropic.ts`: + +1. **Wrong endpoint host.** The token exchange/refresh endpoint lives on the API + host, not the web-app host: + ```ts + const TOKEN_URL = "https://api.anthropic.com/v1/oauth/token"; // reference, line 11 + ``` + The reference's test suite asserts exactly this URL + (`references/oh-my-pi/packages/ai/test/anthropic-oauth.test.ts`). `claude.ai` + is correct only for the *authorize* step (`https://claude.ai/oauth/authorize`). + +2. **Wrong body format.** Anthropic's `/v1/oauth/token` expects **JSON**, not + form-encoding. The reference posts: + ```ts + headers: { "Content-Type": "application/json", Accept: "application/json" }, + body: JSON.stringify(body), + ``` + +Because both are wrong, `refreshViaOAuth` effectively always returns `null`. When +a token has genuinely expired, the manager's `anthropic` branch hits its final +`else` and does the worst possible thing — proceeds with the **stale** token: + +```ts +console.warn(`dispatch: unable to refresh Claude credentials for "${account.label}" — using stale token`); +claudeCredentials = { accessToken: account.credentials.accessToken }; // expired +... +useOverride = true; +``` + +So the contributing failure mode is: +- New tabs work **only as long as the DB token is still valid** (e.g. because + Claude Code refreshed it on disk and it was re-imported, or it simply hasn't + expired yet). Dispatch's *own* refresh cannot renew it. +- Once the DB token expires with no external renewal, even new tabs will 401, + and the warning above appears in the server log. + +**Diagnostic check:** look in the server logs for +`dispatch: unable to refresh Claude credentials for "..." — using stale token`. +Its presence confirms the refresh path is failing and the stale-token fallback +fired. + +> History: both the wrong URL and the form-encoded body were introduced in the +> original commit `8151447` ("feat: claude max oauth support…"). They have never +> been correct in this repo. + +--- + +## Secondary suspect (unverified): the chat provider omits `anthropic-beta` + +`createClaudeOAuthProvider` (provider.ts lines ~76–84) sets only: + +```ts +headers: { + "anthropic-dangerous-direct-browser-access": "true", + "x-app": "cli", + "user-agent": "claude-cli/2.1.112 (external, sdk-cli)", +} +``` + +It does **not** send `anthropic-beta: …,oauth-2025-04-20,…` or +`anthropic-version`. Every other path in this codebase and in the reference does: + +- `getAnthropicHeaders()` (claude.ts:387) builds the full set, including + `oauth-2025-04-20`, and is used by the models/profile/usage calls. +- The reference's chat path always sends `anthropic-beta: ANTHROPIC_OAUTH_BETA` + with the OAuth bearer (`references/oh-my-pi/.../provider-models/openai-compat.ts`, + `.../providers/anthropic.ts:1542`). Anthropic generally requires the + `oauth-2025-04-20` beta for Bearer/OAuth requests. + +This is flagged **secondary** because chat works at all on a fresh token, which +suggests `@ai-sdk/anthropic` may inject the oauth beta itself for `authToken` +requests. **I could not verify this** — the workspace deps are not installed in +this environment (`node_modules/@ai-sdk/anthropic` is absent) and outbound +network is blocked, so the SDK could not be inspected and the request could not +be reproduced live. If fixing #1 and #2 doesn't fully resolve things, make this +provider reuse `getAnthropicHeaders()` so the chat path carries the same +`anthropic-beta` / `anthropic-version` as every other Anthropic call. + +--- + +## Minor, unrelated observation + +`agent.ts:836` gates adaptive thinking on a hardcoded model string: + +```ts +const isOpus47 = this.config.model === "claude-opus-4-7"; +``` + +The failing model is `claude-opus-4-8`, so it silently takes the non-adaptive +`thinking: { type: "enabled", budgetTokens }` branch. Not related to the 401, but +that literal will need updating for opus-4-8 to get adaptive thinking. + +--- + +## Recommended fixes + +In priority order: + +### 1. Stop freezing the token in the cached Agent (fixes the new-vs-old-tab bug) + +Pick one of: + +- **Re-resolve credentials per `run()` for `anthropic` keys**, instead of caching + them in the Agent's config. e.g. have the Agent pull the access token through a + callback/getter at request time (`() => refreshAccountCredentials(account)`), + so each `run()` uses the live DB token. This is the cleanest fix. +- **Or** invalidate the cached Agent when its captured token is near/after + expiry. Add an expiry check to the cache-invalidation gate in + `getOrCreateAgentForTab` (store the captured `expiresAt` on `tabAgent` and null + the agent when `Date.now() > expiresAt - 60_000`). Coarser, but localized. +- **Or** when a refresh rotates the DB token, proactively null every cached + `tabAgent.agent` whose `keyId` matches, forcing re-resolution on next send. + +The getter approach is preferred because it also survives token rotation +(root-cause note above) without any cache bookkeeping. + +### 2. Fix the OAuth refresh call in `credentials/claude.ts` + +- `OAUTH_TOKEN_URL` → `https://api.anthropic.com/v1/oauth/token`. +- In `refreshViaOAuth`, send JSON: `Content-Type: application/json`, + `Accept: application/json`, `body: JSON.stringify({ grant_type, client_id, refresh_token })`. +- Don't swallow failures silently — log `response.status` and the body so the + next stale-token event is diagnosable. (Mirror the reference's `postJson`, + which throws with `status` + `body` included.) + +### 3. (If still needed) add `anthropic-beta` to the chat provider + +Have `createClaudeOAuthProvider` reuse `getAnthropicHeaders(token)` so the chat +path carries `anthropic-beta` (incl. `oauth-2025-04-20`) and `anthropic-version` +like every reference path and every other Anthropic call in this repo. + +--- + +## What was verified vs. assumed + +- **Verified by reading the code:** the per-tab Agent cache, the cache-invalidation + gate that ignores token expiry, the frozen `claudeCredentials` flowing into the + provider, the wrong refresh URL + form-encoded body, the silent failure + + stale-token fallback, the missing `anthropic-beta` on the chat provider, the + `opus-4-7` hardcode. +- **Verified against the reference** (`references/oh-my-pi`): correct token URL + (`api.anthropic.com/v1/oauth/token`), JSON body, and that the OAuth chat path + sends `anthropic-beta`. +- **Could not run/reproduce:** deps aren't installed here and network is + sandboxed, so the live request, the SDK's default-header behavior, and the + exact DB token `expiresAt` values were not inspected. Root cause #1 fully + explains the reported new-vs-old-tab behavior on its own; #2 explains why the + system can't self-heal once the DB token lapses. diff --git a/notes/claude-report.md b/notes/claude-report.md new file mode 100644 index 0000000..6635fa4 --- /dev/null +++ b/notes/claude-report.md @@ -0,0 +1,114 @@ +# Cache Miss Analysis for Dispatch's Claude Code Integration + +## Executive Summary + +The massive token burn and 0% cache hit rate observed after upgrading to AI SDK v6 are the result of two interacting flaws in how the Dispatch harness constructs requests for Anthropic: + +1. **Missing Beta Header:** The Claude OAuth provider completely omits the required `anthropic-beta: prompt-caching-scope-2026-01-05` header. Without this specific beta header, the Anthropic API silently ignores all `cache_control` markers. +2. **Suboptimal Breakpoint Placement:** The logic that assigns `cache_control` breakpoints misinterprets the structure of AI SDK v6 messages. Because each tool result is serialized as an independent `role: "tool"` message, the caching logic places markers on the last two *individual tool results* rather than the actual conversational turns, wasting breakpoints and failing to cache the preceding `assistant` message. + +The tool duplication incident (where tools were echoed 150+ times) is highly likely a symptom of the context window blowing up due to these cache misses, causing the model's generation loop to degenerate into repetitive tool hallucinations. + +--- + +## Root Cause 1: The Missing Beta Header + +### Analysis +Anthropic's prompt caching relies on explicit `cache_control` markers embedded in the request payload. However, for the Claude CLI ecosystem and OAuth flow, caching features are gated behind specific beta headers. + +In `packages/core/src/llm/provider.ts`, the `createClaudeOAuthProvider` factory initializes the AI SDK Anthropic provider. While it correctly mimics the `x-app` and `user-agent` headers of the Claude CLI, it **fails to include the `anthropic-beta` header**: + +```typescript +function createClaudeOAuthProvider(config: ProviderConfig): ModelFactory { + const anthropic = createAnthropic({ + // ... + headers: { + "anthropic-dangerous-direct-browser-access": "true", + "x-app": "cli", + "user-agent": "claude-cli/2.1.112 (external, sdk-cli)", + // MISSING: "anthropic-beta": getAnthropicBetas().join(",") + }, + }); +} +``` + +The AI SDK's `createAnthropic` constructor (in `@ai-sdk/anthropic`) adds `anthropic-version: 2023-06-01` but does *not* automatically inject `prompt-caching-scope-2026-01-05`. + +Because this header is missing from the underlying fetch request, the Anthropic API treats the request as a standard (non-cached) `/messages` call and ignores the ephemeral cache breakpoints completely. + +--- + +## Root Cause 2: Inefficient Caching Breakpoints + +### Analysis +Even if the beta header were present, the current breakpoint strategy in `agent.ts` defeats the purpose of caching due to how the AI SDK v6 structures tool results. + +In `toModelMessages()`, the agent unpacks an internal `tool-batch` chunk by creating a separate AI SDK message for every single tool result: +```typescript +for (const tr of trailingToolResults) { + result.push({ role: "tool", content: [ { type: "tool-result", ... } ] }); +} +``` + +Immediately after, `applyAnthropicCaching()` attempts to apply rolling cache markers: +```typescript +const nonSystem = msgs.filter((m) => m.role !== "system").slice(-2); +for (const m of nonSystem) targets.add(m); +// applies cacheControl: { type: "ephemeral" } +``` + +If the assistant emitted a batch of 10 tool calls, `msgs` ends with 10 individual `role: "tool"` messages. The `.slice(-2)` logic grabs only the 9th and 10th tool result messages and applies `cacheControl` to them. + +When the AI SDK translates this to the Anthropic wire format (in `convert-to-anthropic-messages-prompt.ts`), it groups consecutive `tool` messages into a single `role: "user"` Anthropic block. The cache markers are applied strictly to the 9th and 10th tool result parts at the very end of this user block. + +**Why this fails:** +1. **Wasted Breakpoints:** Anthropic requires at least 1,024 tokens between breakpoints to effectively cache the delta. Placing breakpoints on two consecutive tool results at the end of the chain wastes a breakpoint because the token delta between them is microscopic. +2. **Missing Assistant Context:** The `assistant` message (which contains the expensive reasoning tokens and the tool calls themselves) receives NO cache marker. + +### Comparison with `oh-my-pi` +The `oh-my-pi` reference implementation avoids this by applying cache controls logically across the wire format: +- `applyCacheControlToLastBlock(params.system)` +- `applyCacheControlToLastBlock(params.tools)` +- `penultimate user message` +- `last user message` + +By targeting the end of distinct conversational phases (the system block, the tools definition, and the entire user response block), `oh-my-pi` maximizes the cached prefix length. + +--- + +## Tool Duplication Incident Correlation + +The documented incident describes 150+ duplicated `read_file` results for `package.json` with wildly uneven counts. + +This behavior is not indicative of the execution harness blindly retrying tools. In `agent.ts`, tool calls are harvested directly from the AI SDK stream (`event.type === "tool-call"`) and executed synchronously. If 150 identical `read_file` results were yielded, it is because **the LLM actively generated 150 identical `<tool_use>` blocks** in a single turn. + +This generation loop is a known failure mode for Claude when the context window becomes excessively bloated or loses coherence (often compounded by a lack of cached prefixes breaking the model's structural attention). Fixing the cache miss will likely resolve the generation instability. + +--- + +## Recommendations + +Since this is a read-only analysis, no code has been edited. The following changes should be applied to fix the system: + +1. **Inject the Beta Headers:** + Modify `packages/core/src/llm/provider.ts` to import `getAnthropicBetas` from `../credentials/claude.js` and inject it into the Claude OAuth headers: + ```typescript + headers: { + "anthropic-beta": getAnthropicBetas().join(","), + "anthropic-dangerous-direct-browser-access": "true", + // ... + } + ``` + +2. **Group Tool Results in `toModelMessages`:** + Instead of pushing an individual `role: "tool"` message for every single tool result, group all `trailingToolResults` into a single `content` array inside one `role: "tool"` message. This accurately represents the tool-batch as a single turn. + +3. **Revise the Breakpoint Strategy in `applyAnthropicCaching`:** + Instead of naively grabbing the last two elements of the array, specifically target: + - The first `system` message. + - The **last** `assistant` message in the array. + - The **last** `user` (or `tool`) message in the array. + This ensures that the entire prefix leading up to the most recent reasoning step is successfully cached. + +4. **Add a Tool Deduplication Safeguard (Optional but recommended):** + In `agent.ts`'s execution loop, hash the tool name and arguments. If an identical tool call appears in the exact same `stepToolCalls` batch, automatically copy the result from the first execution rather than re-running it or blowing up the LLM's next prompt with redundant shell outputs.
\ No newline at end of file diff --git a/context.md b/notes/context.md index 896eed4..896eed4 100644 --- a/context.md +++ b/notes/context.md diff --git a/eviction-limitation.md b/notes/eviction-limitation.md index a5b948f..91d1a88 100644 --- a/eviction-limitation.md +++ b/notes/eviction-limitation.md @@ -1,8 +1,18 @@ # Known Limitation: Frontend eviction is whole-message, not per-chunk -Status: **open / deal-with-later.** Documented from the append-only chunk-log -work (see `plan-chunk-log.md`). The backend is not the problem — this is purely -a frontend in-memory concern. +Status: **RESOLVED.** Fixed by the chunk-native frontend store (see +`plan-chunk-eviction.md`). The frontend's source of truth for history is now a +flat `ChunkRow[]` (`tab.chunks`, real per-tab `seq`); the live turn is a +transient tail (`tab.live`) reconciled into the sealed log on a `turn-sealed` +event. Eviction (`evictChunks`) is a rolling per-chunk trim of the oldest rows — +so a single oversized turn is trimmed chunk-by-chunk instead of pinned whole. +Pagination loads raw chunks (`GET /tabs/:id/chunks`), deduped by `seq`. The +historical analysis below is retained for context. + +--- + +Documented from the append-only chunk-log work (see `plan-chunk-log.md`). The +backend was not the problem — this was purely a frontend in-memory concern. ## TL;DR diff --git a/notes/gemini-chunk-eviction-review-2.md b/notes/gemini-chunk-eviction-review-2.md new file mode 100644 index 0000000..a2a1ac3 --- /dev/null +++ b/notes/gemini-chunk-eviction-review-2.md @@ -0,0 +1,95 @@ +# Code Review (Pass 2): Chunk-Native Frontend Eviction + +## Executive Summary + +The fixes applied since Pass 1 successfully address the two major Blockers related to reconciling active turns and optimistic UI state. The decoupling of reconcile logic from `agentStatus` in favor of `liveTurnId` elegantly solves the race conditions around deferred reconciliations and interrupt boundaries. + +However, a **NEW Blocker** was identified in how `turn_id`s are backfilled onto user messages when a turn starts. The loop indiscriminately tags pending `queued-` messages that belong to *future* turns, causing them to be wiped when the *current* turn finishes. + +**Verdict: DO NOT SHIP.** The new Blocker must be fixed. The fix is a trivial one-line condition. + +--- + +## Pass 1 Fixes & Invariants + +* **Fix #1: Deferred reconcile wipes concurrent active turns.** + **VERDICT: FIXED.** `reloadChunksFromApi`'s new `preserveTurnId` logic perfectly handles overlapping turns. It correctly preserves Turn B's in-flight chunks while dropping Turn A's, and cleanly nullifies state when appropriate. The Map `pendingReconcileTabs` ensures any intermediate turns are fetched comprehensively from the DB. +* **Fix #2: Optimistic queued user messages are dropped on reconcile.** + **VERDICT: FIXED (Partially).** The condition `(m.turnId === undefined && m.role === "user")` inside `keptLive` effectively retains optimistic user messages. However, see the New Blocker below regarding `turnId` assignment. +* **Cache Safety Invariant.** + **VERDICT: SAFE.** `toModelMessages` and Anthropic normalization logic were completely untouched. The backend's prompt caching remains strictly bound to DB-persisted rows. +* **`messages` -> `renderGroups` Rename Completeness.** + **VERDICT: COMPLETE.** No stray references to `tab.messages` were found. Reactivity safely bounds derived state within `updateTab`. + +--- + +## NEW Findings + +### 1. `turn-start` wrongly tags future queued messages (Block) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:873-882` (inside `handleEvent` case `turn-start`) + +**Description:** +When a turn starts, the frontend loops backwards through `tab.live` to tag untagged user messages with the new `turnId`. It breaks only when it hits a non-user or an already-tagged message. + +If multiple messages are queued (e.g. `[queued-B, queued-C]`), and the backend dequeues `B`, `message-consumed` removes the prefix and puts `B` at the end of the array: `[queued-C, B]`. Immediately after, `turn-start` for Turn B fires, looping backward. It tags `B` with `"Turn B"`, but continues and ALSO tags `queued-C` with `"Turn B"`. When Turn B subsequently seals, `queued-C` is wiped from the UI because its `turnId` matches the sealing turn, and it loses the protection of `turnId === undefined`. It vanishes until Turn C actually finishes processing. + +This same bug happens if the user queues a message in the tiny race-window between sending their first prompt and the WS `turn-start` event arriving. + +**Direction:** The backfill loop must explicitly ignore queued messages. Add a check to prevent tagging them: `if (m && m.role === "user" && m.turnId === undefined && !m.id.startsWith("queued-"))`. + +--- + +## Evaluation of New Regression Tests + +The two new regression tests are highly valuable, but miss critical permutations: + +1. **"preserves an optimistic queued user message..."** + * **Genuinely covers:** Fix #2. It proves `keptLive` preserves an untagged queued message during a reconcile. + * **Untested edge case:** It queues the message *after* `turn-start` fires. Therefore, it completely bypasses the `turn-start` backfill loop. A test should queue a message *before* `turn-start` fires (or queue *two* messages and consume one) to expose the new Blocker above. +2. **"preserves a concurrent newer turn..."** + * **Genuinely covers:** Fix #1. It correctly validates `preserveTurnId` logic, the `currentAssistantId` mapping, and `liveTurnId` integrity when a deferred reconcile flushes across a newer streaming turn. + * **Untested edge case (Map last-write-wins):** It only seals *one* turn while scrolled up. A test should seal *two* turns while scrolled up, then scroll down, ensuring `pendingReconcileTabs` handles the latest `turnId` and the entire window is correctly loaded. + +--- + +## Resolution (OpenCode) + +### Block #1 (`turn-start` wrongly tags future queued messages) — FIXED +`packages/frontend/src/lib/tabs.svelte.ts` `handleEvent` case `turn-start`. + +Backend reality check (not in Gemini's model): a queued message **never** gets its +own `turn-start`. `turn-start` is emitted exactly once per user-initiated +`processMessage` (`agent-manager.ts:1128`), which persists exactly one user row via +`explodeUserText`. Queued messages are drained *into* the running turn through +`dequeueMessages`/`message-consumed` (`agent.ts:1241,1308`). So the turn's initiator +is always the single most-recent **non-queued** untagged user row, and any `queued-` +row in the live tail belongs to a future turn. + +Gemini's literal one-liner (`&& !m.id.startsWith("queued-")` with the existing +`else break`) is **insufficient**: when the queued row is the trailing element the +loop would `break` on it and never tag the real initiator, leaving the initiator +untagged → it duplicates against its own sealed chunk row on reconcile. + +Implemented fix: the backfill now (a) stops at the first non-user row, (b) **skips +past** (`continue`) pending `queued-` rows, and (c) tags exactly the one most-recent +non-queued untagged user row, then breaks. `keptLive` is unchanged — untagged +`queued-` rows remain untagged and are preserved on reconcile. + +### Test gap — ADDRESSED +Added `"turn-start backfill skips a pending queued row trailing the turn initiator"` +in `chat-store.test.ts`: constructs live = `[<plain initiator>, queued-q2]`, fires +`turn-start`, asserts the initiator is tagged and `queued-q2` is **not**, then seals +and asserts `queued-q2` survives with no duplicate initiator bubble +(`renderGroups` roles = `[user, assistant, user]`). This exercises the `continue` +(skip-queued) branch and covers the queue-present-before-`turn-start` race Gemini +flagged as untested. + +### Nits (deferred, non-blocking) +- `pendingReconcileTabs` last-write-wins doc/test (seal two turns while scrolled up): + acceptable as-is — the deferred flush refetches the full window from the DB, so the + latest sealed `turnId` correctly supersedes; left as a follow-up test. + +### Verification +Full suite: **326 tests pass** (core 223 incl. 33-test agent/cache-stability suite, +api 35, frontend 68). Biome clean; `tsc` core+api and `svelte-check` report 0 errors. +Cache invariant remains untouched (no `agent.ts` wire/folding/caching changes).
\ No newline at end of file diff --git a/notes/gemini-chunk-eviction-review-3.md b/notes/gemini-chunk-eviction-review-3.md new file mode 100644 index 0000000..3637946 --- /dev/null +++ b/notes/gemini-chunk-eviction-review-3.md @@ -0,0 +1,118 @@ +# Code Review (Pass 3): Verify the `turn-start` backfill Block fix + +## Executive Summary + +The fix to the `turn-start` backfill loop successfully prevents pending `queued-` messages from being wiped by correctly skipping them. However, it exposes a critical flaw in how **consumed** messages are handled. + +When a queued message is consumed (`message-consumed`), its `queued-` prefix is stripped, leaving it as an untagged, plain user row in `live`. Because it lacks a `turnId`, it survives `reconcileSealedTurn` and lingers in the UI forever, floating to the bottom and duplicating the `[USER INTERRUPT]` text already present in the sealed chunks. Additionally, in multi-client scenarios with no local initiator, the backfill loop will incorrectly tag this lingering consumed message with a future turn's ID. + +**Verdict: DO NOT SHIP.** A new Blocker must be fixed. The fix is a one-line change in the `message-consumed` handler to bind consumed messages to the active turn. + +--- + +## Detailed Findings + +### Q1. Backend Claims Verification +**Confirmed.** The backend code in `packages/api/src/agent-manager.ts` and `packages/core/src/agent/agent.ts` confirms that: +1. `turn-start` is emitted exactly once per `processMessage`. +2. Exactly one user row draft is persisted via `explodeUserText`. +3. Queued messages are pulled into a running turn via `dequeueMessages` and NEVER trigger a `turn-start`. + +### Q2. Block Fix Correctness +The new backfill logic correctly skips `queued-` rows and tags exactly one initiator when a local initiator exists. +* **(a) `[initiator, queued-X]`**: Correctly skips `queued-X`, tags `initiator`, and breaks. +* **(b) `[queued-X, initiator]`**: Logically impossible; the initiator is appended before a queue can form. +* **(c) `[queued-X, queued-Y]` consumed**: When `queued-X` is consumed, its prefix is stripped. It becomes an untagged user message. When the NEXT turn starts, if there is a local initiator, it tags the new initiator and ignores the consumed message. +* **(d) Multi-client (no local initiator)**: If Client B starts a turn, Client A receives `turn-start` but has no local initiator in `live`. Client A's backfill loop will find the untagged consumed message from the *previous* turn and incorrectly tag it with the *new* turn's ID. + +### Q3. No Duplication / No Loss +**Failed.** Because the backfill loop `break`s after tagging the new initiator, the consumed message from the previous turn remains permanently untagged (`turnId === undefined`). It survives the `turn-sealed` reconcile process and floats to the bottom of the live tail. Since the backend injects the consumed message's text into the `[USER INTERRUPT]` chunk row, the user sees BOTH the tool result chunk text AND a lingering user bubble. + +### Q4. `keptLive` Unchanged — Still Correct? +**Failed.** `keptLive` preserves optimistic rows by checking `m.turnId === undefined && m.role === "user"`. This was intended for pending initiators and queues, but it inadvertently preserves consumed messages because they never received a `turnId`. Consumed messages MUST be bound to the turn that consumed them so they are cleanly dropped when that turn seals. + +### Q5. Test Adequacy +**The new test simulates an impossible backend sequence.** +The test creates `q1`, consumes it, and then fires `turn-start`, assuming `turn-start` applies to the consumed `q1`. The backend never emits a `turn-start` for a consumed message. By forcing the backfill loop to tag a consumed message, the test mistakenly verifies an action that in reality constitutes cross-turn ID theft. + +### Q6. Prompt-Caching Invariant +**SAFE.** The backend logic in `packages/core/src/agent/agent.ts` was untouched. Cache stability is preserved. + +--- + +## NEW Blockers + +### Block #1: Consumed messages linger and duplicate due to missing `turnId` +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:1226` (inside `handleEvent` case `"message-consumed"`) + +**Description:** +When `message-consumed` extracts a queued message, it strips the prefix but leaves `turnId` undefined. The message survives `turn-sealed` and lingers in `live` forever, duplicating the chunk data. In multi-client scenarios, it also acts as a trap for the next `turn-start` backfill. + +**Direction:** +In `message-consumed`, bind the consumed message to the actively streaming turn so `reconcileSealedTurn` drops it cleanly: +```typescript +if (mcEvent.messageIds.includes(queuedId)) { + consumed.push({ + ...m, + id: queuedId, + turnId: mcTab.liveTurnId ?? undefined // Bind to active turn + }); + continue; +} +``` +With this fix, the `turn-start` loop's `break` behavior is perfectly safe, as consumed messages will no longer be "untagged". + +--- + +## Resolution (OpenCode) + +### Block #1 (consumed interrupt messages linger + duplicate) — FIXED +`packages/frontend/src/lib/tabs.svelte.ts`, `handleEvent` case `"message-consumed"`. + +Verified the finding against the code: `keptLive` (in `reloadChunksFromApi`) preserves +every `m.turnId === undefined && m.role === "user"` row, and `message-consumed` was +pushing `{ ...m, id: queuedId }` with **no** `turnId`. So a consumed interrupt bubble +was kept on every reconcile — lingering at the tail and duplicating the +`[USER INTERRUPT]` text the backend folds into the sealed tool-result chunk +(`agent.ts:1248-1255`). This contradicted the intended "collapse to persisted shape" +behavior. + +Applied Gemini's direction (with the codebase's conditional-spread style): bind the +consumed row to the in-flight turn so reconcile drops it on seal — +```ts +consumed.push({ + ...m, + id: queuedId, + ...(mcTab.liveTurnId !== null ? { turnId: mcTab.liveTurnId } : {}), +}); +``` +`liveTurnId` is always set while a turn runs (the only time a consume happens). The +ChatPanel keyer (`${turnId}:${role}:${n}`, per-(turn,role) counter) gives the consumed +row a distinct key from the initiator, so no key collision during the live interrupt +split. This also makes the Pass-2 `turn-start` backfill `break` fully safe (no +untagged consumed rows remain to be mis-tagged in the multi-client path Q2(d)). + +### Test fixes (addressing Q5) +The Pass-2 test was rewritten — Gemini correctly flagged that consuming a message and +then firing `turn-start` at it is not a real backend sequence. Replaced with two +realistic tests in `chat-store.test.ts`: +1. `"turn-start backfill skips a pending queued row (race), tags only the initiator"` — + drives the **real** race via `store.sendMessage`: an idle send (plain optimistic + row) followed by a running send (queued row), then a late `turn-start`. Asserts the + initiator is tagged, the queued row is NOT, and the queued row survives the seal + with no duplicate initiator. +2. `"a consumed interrupt message collapses into the sealed turn (no lingering bubble)"` + — realistic interrupt flow (turn-start → stream → queue → consume → stream → seal); + asserts the consumed row is bound to the turn and dropped on reconcile. Fails on the + pre-fix code (row lingers untagged), passes after. + +### Verdict response +- Q1 backend claims: confirmed by Gemini and re-confirmed here. +- Q2(d) multi-client mis-tag: eliminated — consumed rows are no longer untagged. +- Q3/Q4 lingering+duplication: FIXED. +- Q6 cache invariant: SAFE — change is frontend store/test only; `agent.ts` untouched. +- Q7: no new Blockers introduced. + +### Verification +Frontend **69 tests pass** (chat-store 54 incl. the two new tests, sidebar 15); +`svelte-check` 0 errors; Biome clean. Core (223) + api (35) unaffected → 327 total.
\ No newline at end of file diff --git a/notes/gemini-chunk-eviction-review.md b/notes/gemini-chunk-eviction-review.md new file mode 100644 index 0000000..373917a --- /dev/null +++ b/notes/gemini-chunk-eviction-review.md @@ -0,0 +1,69 @@ +# Code Review: Chunk-Native Frontend Eviction (Dispatch) + +## Executive Summary + +The rewrite to a chunk-native frontend store successfully resolves the unbounded memory pinning caused by oversized turns. The migration from grouped messages to a flat chunk log (`tab.chunks`) as the source of truth, with a decoupled `renderGroups` cache, correctly achieves rolling chunk-level eviction and seamless pagination. + +However, while the backend invariants and cache safety are perfectly maintained, there are two **Blocker** regressions in the frontend's reconcile logic relating to the handling of concurrent/queued messages and deferred UI updates. `reloadChunksFromApi` assumes it is strictly processing a single sequential turn and acts destructively on in-flight UI state when edge cases overlap. + +**Verdict: DO NOT SHIP.** The cache invariant holds, but the Blocks must be fixed first. + +--- + +## Cache Safety Invariant + +**VERDICT: SAFE.** +The diff contains **zero** modifications to `packages/core/src/agent/agent.ts`. The prompt-cache cohesion logic, `toModelMessages`, `applyAnthropicCaching`, and normalisation remain 100% server-side and entirely isolated from the frontend's transient render representations. Model-bound bytes are unchanged. + +--- + +## Findings + +### 1. Deferred reconcile wipes concurrent active turns (Block) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:658` (`reconcileSealedTurn`) and `:636` (`reloadChunksFromApi`) + +**Description:** +If the user scrolls up, automatic eviction and reconciliation are suppressed. If Turn A finishes while scrolled up, its reconciliation is deferred (`pendingReconcileTabs.add`). If the agent then starts Turn B (e.g. via queue processor), it establishes a new live streaming bubble (`liveTurnId = 'Turn B'`). +When the user subsequently scrolls down, the deferred flush fires for Turn A, calling `reloadChunksFromApi`. This function blindly sets `live: []`, `liveTurnId: null`, and `currentAssistantId: null` — immediately deleting all in-flight chunks for the *currently active* Turn B. Turn B will then spawn a fresh disconnected live bubble on its next stream delta, permanently losing its prior chunks and its `turnId` tag (causing it to remount/flash on seal). + +**Direction:** `reloadChunksFromApi` must not unconditionally nuke `live` and `currentAssistantId` if a *new* turn is actively in-flight (`agentStatus === "running" && liveTurnId !== turnIdToReconcile`). The reconcile must be turn-aware. + +### 2. Optimistic queued user messages are dropped on reconcile (Block) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:636` (`reloadChunksFromApi`) + +**Description:** +When a user sends a message while the agent is busy, it is added to `tab.live` as an optimistic unsealed bubble. When the current active turn completes and emits `turn-sealed`, `reloadChunksFromApi` wipes the entire `live` array (`live: []`). The queued user message vanishes from the UI entirely. While it remains safely in `queuedMessages` and will eventually be processed by the backend, the UI will not reflect it again until its own eventual `turn-sealed` event completes the backend loop. + +**Direction:** `reloadChunksFromApi` must preserve unsealed optimistic user messages (e.g. those matching `queuedMessages` IDs) when clearing `live`. + +### 3. Edge-case fetch race on WS reconnect desync (Ship-with-followup) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:860` (`handleEvent` case `statuses`) + +**Description:** +The backend emits `status: idle` immediately before making the synchronous SQLite `flushAssistant` write, and emits `turn-sealed` immediately after. If a WS reconnect happens exactly in this microsecond window, the frontend's `hydrateFromBackend` sees `backendStatus !== "running"` and triggers `reloadChunksFromApi`. Because the DB write hasn't landed, it loads a chunk window missing the just-finished turn. + +**Direction:** This is non-fatal because the backend will still emit `turn-sealed` milliseconds later, which triggers a second `reloadChunksFromApi` that corrects the UI state. It will manifest as a sub-second flicker. Safe to ship, but ideally, `statuses` should infer completion from the presence of unpersisted chunks rather than the raw agent status. + +### 4. Tool-batches undercount the live eviction budget (Nit) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:76` (`countLiveChunks`) + +**Description:** +In `countLiveChunks`, a single `tool-batch` render bubble counts as `1` against the live budget (`m.chunks.length`). However, upon sealing, `explodeTurn` expands each batch into `N * 2` rows (a `tool_call` and `tool_result` for each parallel call). This means an in-flight turn with heavily parallelized tool execution will temporarily consume more memory than `chunkLimit` targets. + +**Direction:** Minor inconsistency. Once the turn seals, the correct DB row count will be respected. Acceptable trade-off for simplicity in the live rendering path. + +### 5. `trimLiveChunks` drops the prompt under extreme pressure (Nit) +**Location:** `packages/frontend/src/lib/tabs.svelte.ts:431` (`trimLiveChunks`) + +**Description:** +If a single streaming turn heavily exceeds `chunkLimit`, `trimLiveChunks` enforces a strict rolling window on `live`. Since it trims oldest-first indiscriminately, it will `shift()` away the user's prompt chunk from the UI entirely before it touches the assistant's streaming chunks. + +**Direction:** This technically aligns with the design ("never the chunk currently being streamed"), but means the user's initiating context disappears mid-stream under memory pressure. Expected behavior for a raw chunk limit. + +--- + +## Notable Correctness Risks Validated + +- **Seq Cursor Logic:** Perfect. `oldestLoadedSeq` rigidly derives from `ChunkRow.seq` via `minSeqOf(sealed)`. Live transient state is securely walled off from pagination logic. +- **Transaction Wrapper:** Sound. `db.transaction()` around the `appendChunks` batch is synchronously robust and correctly yields monotonic `seq` allocations per turn. +- **Scroll-up pagination:** Deduplication works flawlessly. `mergeChunksBySeq` efficiently handles overlap at the pagination boundary without duplicating message bubbles. diff --git a/gemini-chunk-log-review.md b/notes/gemini-chunk-log-review.md index e04f6c3..e04f6c3 100644 --- a/gemini-chunk-log-review.md +++ b/notes/gemini-chunk-log-review.md diff --git a/harness-comparison.md b/notes/harness-comparison.md index 1fd7aec..1fd7aec 100644 --- a/harness-comparison.md +++ b/notes/harness-comparison.md diff --git a/plan-bg-restore.md b/notes/plan-bg-restore.md index e622669..e622669 100644 --- a/plan-bg-restore.md +++ b/notes/plan-bg-restore.md diff --git a/notes/plan-chunk-eviction.md b/notes/plan-chunk-eviction.md new file mode 100644 index 0000000..74f850d --- /dev/null +++ b/notes/plan-chunk-eviction.md @@ -0,0 +1,251 @@ +# Plan: Chunk-Native Frontend (per-chunk eviction + pagination) + +Closes `eviction-limitation.md`. Kills the frontend `ChatMessage[]` "message" +structure entirely. The frontend's state becomes a **flat chunk log**; memory is +bounded by a **rolling per-chunk eviction**; history is **paginated as raw +chunks** from the backend. + +## Locked decisions (from the design discussion) + +1. **No stored "message" structure on the frontend.** The store holds a flat + chunk list. Any grouping into bubbles is a *render-time* projection only + (see "Rendering" — the one item still to confirm). +2. **Rolling per-chunk eviction.** When a chunk completes while streaming and the + in-memory chunk count is at `chunkLimit`, drop the oldest chunk. Pure + count-based window. The only chunk never evicted is the single one currently + receiving deltas (the open chunk). No turn pinning, no "keep last pair." +3. **Per-turn persistence stays** (backend unchanged in timing). It is the more + performant choice for a low-RAM/low-IO backend: peak RAM is dominated by the + agent's in-memory conversation context (required for the cache prefix) which + both strategies share, while per-seal would multiply fsyncs across the + streaming loop. One cheap win: wrap the end-of-turn write in a single SQLite + transaction. +4. **Mid-stream reload is deferred, not engineered away.** A still-streaming turn + isn't in the DB until it seals. If the user scrolls up into chunks that were + roll-evicted from the *current* turn, the refetch waits until the turn's write + lands (the running→idle signal). Accepted. + +## The dominating constraint (unchanged) + +**Cache cohesion is built 100% server-side and must not be touched.** The +Anthropic wire prefix comes from `toModelMessages` + `applyAnthropicCaching` in +`packages/core/src/agent/agent.ts`, built from the agent's in-memory history +(rebuilt from the chunk log) + the live turn — never from anything the frontend +holds or the WS sends. So **everything here is frontend + one additive raw API +endpoint**; it cannot regress the cache. Out of scope, do not edit: `agent.ts` +wire/folding/caching/interrupt code, `explodeTurn` row ordering, and the +backend's use of `groupRowsToMessages` for *agent history rebuild*. + +## Already done (reuse, don't rebuild) + +- `getChunksForTab(tabId,{limit,before})` (`db/chunks.ts`) already paginates the + flat log by `seq`. +- `explodeTurn` and `groupRowsToMessages` (`chunks/transform.ts`) are DB-free, + browser-importable, and orphan-tolerant (a turn split across a page boundary + groups correctly). `explodeTurn` emits a **role per draft** — we use that. +- `appendEventToChunks` (`chunks/append.ts`) folds a delta stream into render + `Chunk[]` correctly (text/thinking/metadata/tool-result/shell-output ordering). +- Chunk table schema already has `seq, turnId, step, role, type, data`. + **No DB migration, no chat-history reset needed.** + +--- + +## Frontend architecture (chunk-native) + +### Store state (per tab) — `tabs.svelte.ts` + +``` +chunks : ChunkRow[] // SEALED history, real per-tab seq, oldest→newest. + // The growing / evictable / paginated structure. +liveRender : Chunk[] // transient fold buffer for the CURRENT turn only, + // built by the EXISTING appendEventToChunks. +liveTurnId : string | null // turn_id of the in-flight turn (for stable keys). +liveAssistantId: string | null // == currentAssistantId today. +oldestLoadedSeq: number | null // min seq in `chunks` (pagination cursor). +totalChunks : number // backend total (drives "more to load?"). +// DELETED: messages, currentAssistantId(renamed→liveAssistantId), the turnId-merge state. +// Untouched: queuedMessages, cacheStats, agentStatus, model/agent fields, etc. +``` + +There is **no `messages` field.** The live turn is not a "message" — it's a +one-turn streaming buffer that is reconciled into `chunks` (as real-seq rows) the +moment the turn seals. + +### Unified flat view (derived, for render + eviction count) + +``` +liveFlat = liveTurnId ? explodeTurn(liveTurnId, liveRender) + .map((d,i) => ({...d, id:`${liveTurnId}:${i}`, seq:null, live:true})) + : [] +log = [...chunks, ...liveFlat] // one flat, chunk-native list +``` + +`explodeTurn` is the *same* transform the backend persists with, so the live flat +chunks are byte-for-byte what will land in the DB → reconciliation at seal is +seamless and symmetric. `log` is what we render and count for eviction. + +### Streaming (reuse, no new reducer) + +- Content events (`text-delta`/`reasoning-*`/`tool-call`/`tool-result`/ + `shell-output`/`error`/`notice`/`model-changed`/`config-reload`) fold into + `liveRender` via the existing `appendEventToChunks` + `$state.snapshot` clone + (the documented Svelte-5 proxy hazard — keep it). +- `turn-start` (new tiny WS event, see backend) sets `liveTurnId` / + `liveAssistantId` so live-flat ids match the sealed rows' turn → no remount at + seal. Without it, a one-frame remount at seal with identical content; harmless. +- Optimistic user message: on send, append a provisional user row to `chunks` + with `seq:null, live:true` (or carry it in a tiny `pendingUser` slot). It is + replaced by the real user row at reconcile. + +### Rendering — `ChatPanel.svelte` / `ChatMessage.svelte` + +``` +renderGroups = groupRowsToMessages(log) // ephemeral, render-time ONLY +{#each renderGroups as g (g.id)} <ChatMessage .../> {/each} +``` + +Grouping pairs `tool_call`+`tool_result` by `callId` and wraps a turn's +assistant chunks into a bubble — **purely a view projection of the flat log**, +never stored, never used for eviction/pagination. `ChatMessage.svelte` is largely +unchanged (it already renders a grouped turn's `Chunk[]`). + +> **DECISION TO CONFIRM (rendering):** default = keep the current bubble look via +> this render-time grouping (chunk-native state, same UX). Alternative = render +> each chunk as a standalone element (fully flat, no assistant-bubble wrapper). +> I'm proceeding with the default; say the word to go fully flat (smaller view +> change, different look). + +### Eviction (`evictChunks`, replaces `evictMessages`) + +``` +limit = appSettings.chunkLimit; if !finite or ≤0 → return +if scrolledUp and !force → return // unchanged suppression +total = chunks.length + liveFlat.length +while total > limit: + if chunks.length > 0 and chunks[0] is not the open chunk: + drop chunks[0]; total-- // evict sealed front first + else if liveRender has a non-open leading chunk: + drop oldest liveRender chunk; recompute liveFlat; total = chunks.length+liveFlat.length + else break // only the open chunk remains +oldestLoadedSeq = min seq remaining in chunks (or unchanged if chunks emptied) +``` + +This trims a huge **old** turn chunk-by-chunk (the fix) and even trims within a +huge **live** turn (bounding memory mid-stream), never evicting the open chunk. +Triggered on each content event (chunk completion) and after load/reconcile. + +### Pagination (`loadOlderChunks`, replaces `loadMoreMessages`) + +``` +if oldestLoadedSeq == null or ≤ 0 → nothing older +GET /tabs/:id/chunks?limit=50&before=oldestLoadedSeq +merge rows into `chunks`, DEDUPE BY seq, keep seq-sorted +oldestLoadedSeq = min seq held; totalChunks = resp.total +``` + +The old `turnId`-merge hack (`tabs.svelte.ts:440-462`) is **deleted** — raw rows +in one seq-sorted array make `groupRowsToMessages` rejoin a boundary-split turn +automatically. Overlap is fine; dedupe-by-seq is trivial. + +### Turn-completion reconcile (the one new flow) — on running→idle for the tab + +``` +refetch GET /tabs/:id/chunks?limit=chunkLimit // tail, real seqs (write already landed) +drop all live entries (liveRender=[], liveTurnId=null) and provisional user row +merge refetched rows into `chunks`, dedupe by seq, seq-sort +oldestLoadedSeq = min seq; evictChunks() +if scrolledUp: defer this refetch until the user returns to bottom +``` + +This single mechanism (a) gives the just-completed turn **real seqs** (so the +log never accumulates seq-less middles that break pagination), and (b) **recovers +any live chunks roll-evicted mid-stream** — they come back from the DB now that +the write landed. Exactly the "delay the reload until the write lands" model. +Keying render groups by `turnId` makes the swap from live→sealed flicker-free. + +--- + +## Backend changes (additive, cache-neutral, low-cost) + +1. **`routes/tabs.ts`** — add `GET /tabs/:id/chunks?limit&before` → + `{ chunks: ChunkRow[], total, oldestSeq }` (raw rows; no grouping). Leave + `/messages` as-is (unused by the new frontend; harmless). +2. **`db/chunks.ts`** — wrap the `appendChunks` insert loop in a single + `db.transaction(...)` (one fsync per turn instead of N). Pure perf; no behavior + change. +3. **`types/index.ts` + `agent-manager.ts`** — add a tiny + `{ type:"turn-start"; turnId; assistantId }` event, emitted at turn start + (`agent-manager.ts:1120-1122`). Optional but recommended (stable keys → no + seal remount). `appendEventToChunks` lists it in its no-op branch for + exhaustiveness. +4. **WS:** nothing — `index.ts:25-26` forwards all emitted events to the browser. + +No change to persistence *timing*, `explodeTurn`, the agent wire, or the caching +path. (Optional future optimization, not in this plan: a `chunk-rows` push at +seal to avoid the per-turn reconcile GET. The reconcile-refetch is simpler and +matches the agreed "reload" model, so we ship that.) + +--- + +## Test plan + +Rewrite frontend tests from `tab.messages` assertions to **chunk-log** assertions +(the store no longer has `messages`). Cover: + +- **Rolling eviction (headline):** stream a turn of `chunkLimit + N` chunks; + assert in-memory `log` length stays `≤ chunkLimit` (open chunk exempt) and the + oldest entries are dropped — including dropping the *leading chunks of a single + giant turn* (the exact case the limitation was about). +- **Pagination + dedupe:** `loadOlderChunks` prepends older rows, dedupes by seq, + updates `oldestLoadedSeq`; a turn split at the window boundary renders as one + group (no merge hack, no duplication). +- **Turn-completion reconcile:** stream a turn (live, seq=null) → running→idle → + assert live entries are replaced by real-seq rows from the refetch, `log` + content identical, render-group identity stable (no remount), and a mid-stream- + evicted chunk is recovered. +- **Deferred reload while scrolled up:** reconcile is deferred until return-to- + bottom; nothing yanks the viewport. +- **Streaming fidelity:** the existing delta→chunk behaviors (coalescing, + thinking metadata seal, tool-result fill, shell-output) still hold on + `liveRender` (these reuse `appendEventToChunks`, so port the assertions to read + the live flat view). +- **Cache untouched:** the 3-step byte-identical cache-stability test in + `packages/core/tests/agent/agent.test.ts` passes unchanged; `routes.test.ts` + gains `/chunks` coverage. + +## Phasing (each phase: `bun run check` + `bun run test` green) + +- **P1 — Backend additive:** `GET /chunks` raw endpoint; wrap `appendChunks` in a + transaction; add `turn-start`. Tests for endpoint + event. (Cache test must + still pass byte-for-byte.) +- **P2 — Frontend store core:** introduce `chunks`/`liveRender`/`liveTurnId`; + retarget every streaming handler from `messages` to `liveRender`; build the + derived `log` + render groups; delete the `messages` field. Port streaming + tests. +- **P3 — Eviction + pagination + reconcile:** `evictChunks`, `loadOlderChunks` + (delete turnId-merge), running→idle reconcile-refetch, load paths + (`hydrateFromBackend`/`openAgentTab`/`reloadTabMessagesFromApi`) switched to + `GET /chunks`. New tests. +- **P4 — Cleanup:** delete dead code (`oldestSeqOf` if unused, old eviction + comment, `loadMoreMessages` name), update `eviction-limitation.md` → + resolved, full `bun run test` + `bun run check`. + +## Risks / open items + +- **Rendering decision** (above) — the one thing to confirm; proceeding with + render-time grouping (preserves UX). +- **Interrupt turns at reconcile.** The live UI shows an interrupt as a split + (`message-consumed` handler); the persisted turn keeps the interrupt inside a + tool result (cache-safe, prior decision). So at reconcile the bubble re-renders + to the persisted shape. Cosmetic; documented; not fixed (fixing touches the + wire/cache). +- **shell-output live placement.** `appendEventToChunks` attaches live shell + output to the running tool call; `explodeTurn` places it on the `tool_result` + row. The live flat view and the post-reconcile rows therefore differ only in + where shellOutput hangs until seal — render reads it from the paired call/result + either way. Minor; assert in a test. +- **Scrolled-up during reconcile** — deferred (handled above); verify no viewport + jump using ChatPanel's existing scroll-anchor logic. +- **Renames ripple:** `totalMessages→totalChunks`, `currentAssistantId→ + liveAssistantId`, `evictMessages→evictChunks`, `loadMoreMessages→ + loadOlderChunks` touch a few call sites + tests. diff --git a/plan-chunk-log.md b/notes/plan-chunk-log.md index 3b39303..3b39303 100644 --- a/plan-chunk-log.md +++ b/notes/plan-chunk-log.md diff --git a/plan-chunk-refactor.md b/notes/plan-chunk-refactor.md index 0a60237..0a60237 100644 --- a/plan-chunk-refactor.md +++ b/notes/plan-chunk-refactor.md diff --git a/plan-v6-upgrade.md b/notes/plan-v6-upgrade.md index ad3d223..ad3d223 100644 --- a/plan-v6-upgrade.md +++ b/notes/plan-v6-upgrade.md diff --git a/problem.md b/notes/problem.md index a29f801..a29f801 100644 --- a/problem.md +++ b/notes/problem.md diff --git a/notes/queue-interrupt-reconcile-edge-cases.md b/notes/queue-interrupt-reconcile-edge-cases.md new file mode 100644 index 0000000..e564895 --- /dev/null +++ b/notes/queue-interrupt-reconcile-edge-cases.md @@ -0,0 +1,183 @@ +# The Queue / Interrupt / Reconcile Path Is an Edge-Case Magnet + +> Status: living document. Started after the chunk-native frontend rewrite, when +> three consecutive independent code-review passes each surfaced a *new* Blocker +> in the same ~40 lines of code. This note explains **why** that area is fragile, +> catalogs the bugs found so far, states the invariants that must hold, and +> records the recommended longer-term fix so the next person doesn't relearn it +> the hard way. + +## TL;DR + +The frontend turn-completion reconcile (`reconcileSealedTurn` → +`reloadChunksFromApi`) decides which transient `live` rows to **keep** vs **drop** +when a turn's durable chunks arrive. It makes that decision from a tangle of +loosely-coupled signals — `turnId` present/absent, the `queued-` id prefix, +`queuedMessages` membership, `liveTurnId`, scroll state — that are mutated by +**six** asynchronous events arriving in non-deterministic order, sometimes from +**other clients**. Every signal is a *proxy* for the real question ("is this row +already durable in `chunks`?"), and every bug so far has been a case where the +proxy disagreed with reality. The result: either a row is **lost** (dropped but +not yet sealed) or **duplicated/lingers** (kept but already sealed). + +**If you touch this code, re-read "Invariants" and "Why it keeps breaking" below, +and add a test for the exact interleaving you're changing.** + +--- + +## The moving parts + +### State (per tab, frontend store `tabs.svelte.ts`) +- `tab.chunks: ChunkRow[]` — SEALED, durable history (real DB `seq`). Source of truth. +- `tab.live: ChatMessage[]` — TRANSIENT buffer for the in-flight turn + optimistic + / queued user rows not yet folded into `chunks`. +- `tab.renderGroups` — DERIVED: `groupRowsToMessages(chunks) ++ live`. What the UI shows. +- `tab.queuedMessages: QueuedMessage[]` — messages the user sent while the agent was + busy, awaiting consumption. +- `tab.liveTurnId` / `tab.currentAssistantId` — the in-flight turn + its streaming row. +- per-row `turnId` — set once the row is bound to a concrete turn; `undefined` while + still optimistic/pending. +- per-row id convention — a still-pending queued row has id `queued-<queueId>`. + +### Events (arrive over WS, order NOT guaranteed relative to each other) +- `turn-start { turnId }` — fired **once per user-initiated `processMessage`** + (`agent-manager.ts`), which persists **exactly one** user chunk row + (`explodeUserText`). **A queued message NEVER gets its own `turn-start`.** +- `text-delta` / tool events — stream content into the live assistant row. +- `message-queued { messageId }` — a send was queued (this client or another). +- `message-consumed { messageIds }` — the agent drained queued message(s) **into the + running turn**: either injected as a `[USER INTERRUPT]` block inside a tool result, + or appended as trailing history (`agent.ts` — the two `dequeueMessages()` sites). +- `status { idle | running | error }` — note: `idle` fires **before** the DB write. +- `turn-sealed { turnId }` — fired **after** `flushAssistant` (the durable write). + This is the only safe trigger for reconcile. + +### The reconcile decision (`reloadChunksFromApi`) +On `turn-sealed` (deferred while the user is scrolled up, via `pendingReconcileTabs`), +refetch the chunk window and recompute `live` as `keptLive`: + +``` +keptLive = live.filter(m => + (preserveTurnId !== null && m.turnId === preserveTurnId) // a NEWER in-flight turn + || (m.turnId === undefined && m.role === "user") // optimistic/queued, not yet sealed +) +``` + +Everything else is assumed already-durable in `chunks` and dropped. + +--- + +## Why it keeps breaking + +The reconcile filter answers **"is this live row already in the sealed chunks?"** +but it has no direct way to know. It infers the answer from `turnId` and the +`queued-` prefix. That inference is correct **only if tagging is perfectly +exhaustive and perfectly conservative**: + +- **Exhaustive:** every row that *is* (or will be) sealed into a turn's chunks must + carry that `turnId` *before the turn seals*. Miss one → it stays "untagged" → kept + → **duplicate** (it's also in `chunks`), and it lingers forever because nothing + ever tags it later. +- **Conservative:** every row that is *not* part of any sealed turn (a pending queued + message, an optimistic initiator before `turn-start`) must stay untagged. Over-tag + one → it looks sealed → dropped → **the user's message vanishes**. + +Both failure modes hinge on the *same* `turnId`-presence bit, pulled in opposite +directions, mutated by events whose ordering we don't control. That is the whole +problem in one sentence. Add multi-client (events for turns this client never +initiated) and deferred reconcile (state mutates *while* a reconcile is pending), +and the interleaving space explodes. + +--- + +## Catalog of bugs found (one per review pass, same ~40 lines) + +| # | Pass | Failure mode | Root cause | Fix | +|---|------|--------------|-----------|-----| +| A | 1 | A newer, still-streaming turn got **wiped** when an earlier turn's deferred reconcile flushed. | reconcile blindly cleared the whole live tail. | `preserveTurnId`: keep rows whose `turnId === liveTurnId` when `liveTurnId !== sealedTurnId`. | +| B | 1 | Optimistic **queued** user bubble **dropped** on reconcile. | filter didn't keep untagged user rows. | keep `turnId === undefined && role === "user"`. | +| C | 2 | `turn-start` backfill **over-tagged** trailing `queued-` rows with the new turnId → **wiped** on seal. | backfill tagged *every* trailing untagged user row. | skip `queued-` rows; tag only the single most-recent non-queued initiator; then `break`. | +| D | 3 | **Consumed** interrupt bubble **lingered forever** + **duplicated** the `[USER INTERRUPT]` text in the sealed tool result. | `message-consumed` stripped the `queued-` prefix to a *plain untagged* row, which rule B then preserved on every reconcile. | bind the consumed row to `liveTurnId` so reconcile drops it (collapse to persisted shape). | + +Notice the chain reaction: the **rule B fix** ("keep untagged user rows") is the +hinge that **both C and D** then bent. C was about *creating* wrongly-tagged rows; +D was about *failing to tag* rows that should have been. Each fix narrowed the +definition of "untagged ⇒ keep" without ever making it airtight. + +--- + +## Invariants (the contract reconcile depends on) + +1. **No loss.** A live row that is not yet represented in `chunks` must survive + reconcile. (Pending queued messages, optimistic initiators pre-`turn-start`.) +2. **No duplicate / no linger.** A live row whose content has been sealed into + `chunks` (as a user row OR folded into a tool result) must be dropped on the + sealing turn's reconcile. +3. **Newer turn preserved.** A turn that started streaming *after* the one being + reconciled must not be touched; only the sealed turn folds into `chunks`. +4. **Idempotent + deferral-safe.** Reconcile may run late (after the user scrolls + back to the bottom) and may be preceded by arbitrary further events. Re-running it + must not violate 1–3. Key reconcile off `turn-sealed` (post-write), never `status:idle`. + +Corollaries that are easy to forget: +- A consumed/interrupt message is sealed **inside a tool result**, not as its own + user chunk row — yet its optimistic bubble must still be dropped (invariant 2). +- A freshly-created tab defaults to `agentStatus: "running"`, so the *first* user + send may be treated as queued unless the tab is known-idle. + +--- + +## Recommended longer-term fix (not yet done) + +Stop inferring "is this sealed?" from `turnId` presence. Make preservation key off +**explicit, positive membership** instead of the absence of a tag: + +> Keep a live row iff it belongs to a turn we are explicitly preserving +> (`turnId === preserveTurnId`) **or** it is still a *pending* queued message +> (its `queueId` is still in `tab.queuedMessages`). Drop everything else. + +This flips the dangerous default. Today "untagged ⇒ keep" means *any* tagging gap +causes a linger/dup (bugs C/D class). With membership-based keep, an untagged row +that is *not* a pending queued message is dropped — which is correct, because the +only rows that should outlive a reconcile are (a) the explicitly-preserved newer +turn and (b) genuinely-pending queue entries, both of which have positive signals. +It also makes the `turn-start` backfill purely cosmetic (stable render keys), so a +tagging miss there can no longer lose or duplicate data. + +Until that refactor lands, treat this path as **high-touch**: any change needs a +targeted interleaving test. + +--- + +## Testing guidance + +Property/interleaving coverage beats one-off scenario tests here. Suggested: +- A randomized sequence generator over `{turn-start, text-delta, message-queued, + message-consumed, status, turn-sealed, scrollUp/Down}` with multiple concurrent + turns and a simulated second client, asserting invariants 1–3 after every + `turn-sealed` + final state. +- Scenario tests that already exist in `packages/frontend/tests/chat-store.test.ts` + and should stay green (named for the bugs above): + - deferred reconcile preserves a concurrent newer turn (A) + - optimistic queued user message survives an earlier turn's reconcile (B) + - `turn-start` backfill skips a pending queued row, tags only the initiator (C) + - a consumed interrupt message collapses into the sealed turn — no lingering bubble (D) +- Always assert **both** directions: the kept row is still present (no loss) AND the + sealed rows are not duplicated (no linger). + +## Pointers + +- Frontend store: `packages/frontend/src/lib/tabs.svelte.ts` + (`reloadChunksFromApi`, `reconcileSealedTurn`, `handleEvent` cases + `turn-start` / `turn-sealed` / `message-queued` / `message-consumed` / `status`, + `setScrolledUp`, `pendingReconcileTabs`). +- Render keying: `packages/frontend/src/lib/components/ChatPanel.svelte` + (`${turnId}:${role}:${n}`). +- Backend event emission: `packages/api/src/agent-manager.ts` (`processMessage`, + `dequeueMessages`, `turn-start` / `turn-sealed`); `packages/core/src/agent/agent.ts` + (the two `dequeueMessages()` consumption sites). +- Review records: `notes/gemini-chunk-eviction-review.md` (Pass 1, bugs A+B), + `notes/gemini-chunk-eviction-review-2.md` (Pass 2, bug C), + `notes/gemini-chunk-eviction-review-3.md` (Pass 3, bug D). +</content> +</invoke> diff --git a/report.md b/notes/report.md index 3c6d6f7..3c6d6f7 100644 --- a/report.md +++ b/notes/report.md diff --git a/requirements.md b/notes/requirements.md index 96b2fb8..96b2fb8 100644 --- a/requirements.md +++ b/notes/requirements.md diff --git a/notes/tool-runner-duplication-incident.md b/notes/tool-runner-duplication-incident.md new file mode 100644 index 0000000..188110d --- /dev/null +++ b/notes/tool-runner-duplication-incident.md @@ -0,0 +1,147 @@ +# Incident Report: Duplicated Tool Output in Dispatch Harness + +**Date:** 2025-05-30 +**Component:** Dispatch AI harness — tool runner / tool-result delivery +**Severity:** Medium (no data corruption, but severe context pollution and wasted tokens/latency) +**Status:** Observed, not root-caused + +--- + +## Summary + +During the exploration phase of a code task (disabling the todo/tasks skill), +a single assistant turn that batched a large number of independent, +read-only tool calls came back with **massively duplicated tool results**. +The same `read_file`, `list_files`, and `run_shell` invocations were echoed +back dozens of times each, producing a multi-thousand-line "wall" of +near-identical output for what should have been a few dozen distinct results. + +The actual file edits performed later in the task landed correctly — this was +an **output/delivery problem**, not a state-corruption problem. But the +duplication wasted a large amount of context window, made the transcript very +hard to read, and could plausibly cause an agent to lose track of its plan or +hit context limits on a larger task. + +--- + +## What I observed + +I emitted **one** assistant message containing a large parallel batch of +independent tool calls (the correct pattern for independent reads). The batch +included things like: + +- `read_file packages/api/src/agent-manager.ts` (several distinct line ranges) +- `read_file package.json` +- `list_files packages/api/src` +- various `run_shell` commands (`grep`, `sed`, `awk`, plus debug probes like + `echo hello`, `echo hi`, `pwd`) + +Instead of one result per call, the results stream contained the **same +results repeated many times over**. Concrete examples from the transcript: + +| Tool call | Approx. times the identical result appeared | +|---|---| +| `read_file package.json` (lines 1-5) | ~150+ | +| `read_file agent-manager.ts` (lines 75-126 / 76-127 / 78-117) | ~40+ | +| `list_files packages/api/src` | ~20 | +| `run_shell sed -n '75,127p' ...` | ~13 | +| `run_shell awk 'NR>=75...'` | ~12 | +| `run_shell echo hello` | ~16 | +| `run_shell pwd` | ~20 | + +The duplicated blocks were **byte-for-byte identical** (same line ranges, same +content, same exit codes). They were not re-reads of changed files — the +content was stable across every repeat. + +### Notable characteristics + +1. **Read-only and idempotent calls were the ones duplicated.** The heavy + duplication clustered on `read_file` / `list_files` / trivial `run_shell` + probes. This is consistent with the harness retrying or re-emitting results + for calls it considered safe/idempotent. +2. **The duplication count was wildly uneven.** `package.json` (a tiny file) + was echoed ~150 times while a single `read_file_slice` appeared once. The + multiplier was not constant across calls. +3. **A genuinely valuable signal was nearly buried.** One real result hid in + the noise: `read_file agent-manager.ts` lines 75-126 truncated mid-line and + instructed me to use `read_file_slice` — easy to miss in a wall of + duplicates. +4. **No effect on writes.** The later `write_file`/`run_shell` mutation steps + each returned exactly once and produced correct results, and the final + `git diff --stat` matched expectations (6 insertions, 168 deletions across + 8 files). So the glitch appears confined to result *delivery/echo*, not + tool *execution*. + +--- + +## Impact + +- **Context pollution:** Thousands of redundant lines consumed the context + window. On a larger codebase or a longer task this could trigger truncation + or force premature summarization. +- **Readability:** The human-facing transcript became almost unreadable for + the exploration phase; it's hard to audit what the agent actually looked at. +- **Cost/latency:** Re-emitting and re-processing identical results wastes + tokens and time. +- **Reasoning risk:** Heavy duplication can bias or confuse a model + (repetition can be read as emphasis), and the near-miss on the truncation + hint shows real signals can get lost. + +--- + +## What I could NOT determine + +- Whether the duplication originated in: + - the **tool runner** (executing/emitting each call multiple times), + - the **result serializer/transport** (one execution, many echoes), or + - a **retry/timeout loop** (calls re-issued after a perceived stall). +- Whether the trigger was the **large fan-out batch size** specifically, or + something incidental. +- Whether other agents on the same harness see it (single-session + observation only). + +The byte-identical nature of the repeats (including stable `pwd` and `echo` +output) leans toward an **echo/transport duplication** rather than genuine +re-execution, but that is inference, not proof. + +--- + +## Reproduction hypothesis + +Not reliably reproduced. The strongest correlated factor was **a single turn +with a large number (dozens) of independent tool calls batched together**, +heavily weighted toward `read_file` on small files. Suggested repro attempt: + +1. In one assistant turn, batch ~30+ independent `read_file` calls, several + targeting the same small file at different line ranges. +2. Mix in a few cheap `run_shell` probes (`echo`, `pwd`). +3. Observe whether results are echoed >1× and whether the multiplier varies + per call. + +--- + +## Suggested follow-ups for the harness maintainers + +1. **Deduplicate / cap result emission per tool_call_id.** Each tool call has + a unique id; the result stream should emit exactly one result per id. + Assert this invariant and drop/merge duplicates. +2. **Log the executor path.** Add a counter for "executions per tool_call_id" + vs "results delivered per tool_call_id" to distinguish re-execution from + re-echo. +3. **Investigate the batch/fan-out path.** Check whether large parallel tool + batches trigger a retry or a fan-in bug that replays buffered results. +4. **Surface truncation hints robustly.** When a `read_file` result is + truncated, ensure that notice survives any dedup/compaction so the agent + doesn't miss it. +5. **Consider a per-turn result-size guardrail** that warns when duplicate + identical results are detected. + +--- + +## Notes + +- This document records a single observed session. Numbers above are + approximate counts read from the transcript, not exact instrumentation. +- The underlying code task completed successfully and was verified + independently (targeted test suites passed, typechecks clean); this report + concerns only the harness behavior, not the task outcome. diff --git a/wishlist.md b/notes/wishlist.md index 43e579b..43e579b 100644 --- a/wishlist.md +++ b/notes/wishlist.md diff --git a/packages/core/src/credentials/anthropic-betas.ts b/packages/core/src/credentials/anthropic-betas.ts index f2ca7a6..802ebbd 100644 --- a/packages/core/src/credentials/anthropic-betas.ts +++ b/packages/core/src/credentials/anthropic-betas.ts @@ -7,7 +7,7 @@ // // `prompt-caching-scope-2026-01-05` is load-bearing for cost: without it the // Anthropic API silently ignores every `cache_control` breakpoint we place, -// producing a 0% cache hit rate (see claude-report.md). `oauth-2025-04-20` +// producing a 0% cache hit rate (see notes/claude-report.md). `oauth-2025-04-20` // gates the Bearer/OAuth flow used by Claude Pro/Max subscriptions. const BASE_BETAS = [ diff --git a/packages/core/src/db/index.ts b/packages/core/src/db/index.ts index 18dd1b5..29448bc 100644 --- a/packages/core/src/db/index.ts +++ b/packages/core/src/db/index.ts @@ -99,7 +99,7 @@ export function getDatabase(): Database { // keyed by a per-tab monotonic `seq`. "Message" and "turn" are DERIVED // groupings (see db/chunks.ts), never stored containers. This is what // powers per-chunk frontend pagination AND the stable per-step wire - // format that fixes Anthropic prompt-cache churn (see plan-chunk-log.md). + // format that fixes Anthropic prompt-cache churn (see notes/plan-chunk-log.md). // // role : 'user' | 'assistant' | 'tool' | 'system' // type : 'text' | 'thinking' | 'tool_call' | 'tool_result' | 'error' | 'system' diff --git a/packages/core/src/llm/provider.ts b/packages/core/src/llm/provider.ts index 5978196..9e8475f 100644 --- a/packages/core/src/llm/provider.ts +++ b/packages/core/src/llm/provider.ts @@ -81,7 +81,7 @@ export function createProvider(config: ProviderConfig): ModelFactory { * (computer-use, structured-outputs, etc.) — it does NOT add the prompt-caching * or oauth betas on its own. Without `prompt-caching-scope-2026-01-05` the API * silently ignores every `cache_control` breakpoint we attach to messages, - * giving a 0% cache hit rate and a massive token burn (see claude-report.md). + * giving a 0% cache hit rate and a massive token burn (see notes/claude-report.md). * The SDK folds any `anthropic-beta` it finds on the provider's config headers * back into its own beta set (via `getBetasFromHeaders`), so the values here * are merged — not overwritten — with any tool-derived betas. diff --git a/packages/core/src/tools/truncate.ts b/packages/core/src/tools/truncate.ts index 2204a3f..8c62174 100644 --- a/packages/core/src/tools/truncate.ts +++ b/packages/core/src/tools/truncate.ts @@ -8,7 +8,7 @@ import { dirname, join } from "node:path"; // /tmp/dispatch/tool-results/<tabId>/<callId>.txt and the model receives // HEAD_CHARS from the start + TAIL_CHARS from the end with a notice // in between. These are deliberate hardcoded defaults — see the design -// discussion in plan.md for the rationale. +// discussion in notes/plan.md for the rationale. export const MAX_CHARS = 10_000; export const MAX_LINES = 500; diff --git a/packages/core/tests/agent/agent.test.ts b/packages/core/tests/agent/agent.test.ts index 6c2b452..a12ea07 100644 --- a/packages/core/tests/agent/agent.test.ts +++ b/packages/core/tests/agent/agent.test.ts @@ -1038,7 +1038,7 @@ describe("Agent", () => { expect(rc).toBeUndefined(); }); - // ─── Prompt-caching: tool-result grouping & breakpoints (claude-report.md) ── + // ─── Prompt-caching: tool-result grouping & breakpoints (notes/claude-report.md) ── it("groups a turn's tool results into a SINGLE role:'tool' message (Root Cause 2)", async () => { // The agent batches three distinct read_file calls in one step. The @@ -1158,7 +1158,7 @@ describe("Agent", () => { } }); - // ─── Tool-call dedup (tool-runner-duplication-incident.md) ───────────────── + // ─── Tool-call dedup (notes/tool-runner-duplication-incident.md) ───────────────── it("deduplicates byte-identical tool calls within a single batch", async () => { // Claude can degenerate and emit the same tool call (same name + args) @@ -1265,7 +1265,7 @@ describe("Agent", () => { // A 3-step tool turn. The messages for steps 0 and 1 must serialize // identically in the step-2 request and the step-3 request — that // byte-stability is what lets Anthropic's rolling prompt cache extend - // instead of re-writing the whole prefix every step (cache-miss-report.md). + // instead of re-writing the whole prefix every step (notes/cache-miss-report.md). // Uses the openai-compatible provider so no cacheControl markers (which // intentionally move each step) obscure the content comparison. let n = 0; diff --git a/packages/core/tests/llm/provider.test.ts b/packages/core/tests/llm/provider.test.ts index 9e6b2ad..12b6350 100644 --- a/packages/core/tests/llm/provider.test.ts +++ b/packages/core/tests/llm/provider.test.ts @@ -179,7 +179,7 @@ describe("createClaudeOAuthProvider", () => { } }); - it("sends the anthropic-beta header so prompt-caching is honored (claude-report.md Root Cause 1)", () => { + it("sends the anthropic-beta header so prompt-caching is honored (notes/claude-report.md Root Cause 1)", () => { // Without `anthropic-beta: ...,prompt-caching-scope-2026-01-05,...` the // Anthropic API silently ignores every `cache_control` marker we attach // to messages, producing a 0% cache hit rate. `@ai-sdk/anthropic` does |
