diff options
| author | Adam Malczewski <[email protected]> | 2026-06-11 14:11:13 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-06-11 14:11:13 +0900 |
| commit | 7ffb6b28f5b6bdbfc53ebed94fc68af557612189 (patch) | |
| tree | e66d9ea9d326ef771cc473d81ca5716ff78b08a8 /frontend-cache-warming-handoff.md | |
| parent | 763e5fb1c7fbfb4c7bbd43ffb935e42e5f5b5a42 (diff) | |
| download | dispatch-7ffb6b28f5b6bdbfc53ebed94fc68af557612189.tar.gz dispatch-7ffb6b28f5b6bdbfc53ebed94fc68af557612189.zip | |
fix(cache-warming): accurate cache rate + expectedCacheRate (retention) metric
The Claude cache % read 100% whenever anything was cached, because the metric's
denominator (inputTokens) excluded cached tokens on Anthropic. Fixed upstream in
../claude/provider-anthropic (inputTokens = total prompt); this commit adds the
companion retention metric and exposes it:
- transport-contract: WarmResponse += expectedCacheRate
- transport-http: POST /chat/warm returns expectedCacheRate = cacheRead/(cacheRead+cacheWrite)
- cache-warming: computeExpectedCacheRate + a per-conversation 'cache retention' surface stat
- handoff: documents the fix + cache-rate vs expected-cache (cross-turn) for the FE
Live-verified vs claude haiku: real turn cache rate 61% (was inflated 100%);
warm within TTL expectedCacheRate=100%, after expiry=0%.
Diffstat (limited to 'frontend-cache-warming-handoff.md')
| -rw-r--r-- | frontend-cache-warming-handoff.md | 59 |
1 files changed, 51 insertions, 8 deletions
diff --git a/frontend-cache-warming-handoff.md b/frontend-cache-warming-handoff.md index e5d50b3..64b94d6 100644 --- a/frontend-cache-warming-handoff.md +++ b/frontend-cache-warming-handoff.md @@ -69,7 +69,8 @@ scoped** (state differs per conversation, e.g. cache-warming). To support the la |---|---|---|---| | `toggle` | enabled on/off | warming on for this conversation | `cache-warming/toggle` | | `number` | refresh interval | **seconds** (`unit:"s"`, `min:1`, `step:1`, no `max` = free value) | `cache-warming/set-interval` | - | `stat` | last cache % | most recent warm's hit % (`"—"` when none yet) | — (read-only) | + | `stat` | last cache rate | most recent warm's `cachePct` (`"—"` when none yet) | — (read-only) | + | `stat` | cache retention | most recent warm's `expectedCacheRate` — the **health** signal (~100% = cache stayed warm; 0% = it expired) | — (read-only) | - **Invoke payloads:** - `cache-warming/toggle` → **flips** the current enabled state. Send `{ type: "invoke", surfaceId: "cache-warming", actionId: "cache-warming/toggle", conversationId }` (payload is ignored — it @@ -86,17 +87,21 @@ For an on-demand warm (e.g. a button) without waiting for the automatic timer: ``` POST /chat/warm body WarmRequest { conversationId: string; model?: string; cwd?: string } - 200 WarmResponse { inputTokens; outputTokens; cacheReadTokens; cacheWriteTokens; cachePct } + 200 WarmResponse { inputTokens; outputTokens; cacheReadTokens; cacheWriteTokens; + cachePct; expectedCacheRate } 409 { error } // the conversation is currently generating — try again when idle 400 { error } // missing/invalid conversationId ``` - Pass the **same `model`** (`<credentialName>/<model>`) the conversation chats with, so the warm request's prefix matches the real turn (that's what makes the cache hit). `cwd` only matters if the conversation uses cwd-scoped tools. -- `cachePct` = `round(clamp(cacheReadTokens / inputTokens, 0, 1) * 100)` — show it as the "last - warming" hit indicator. The warm is **never** persisted or streamed and is **never** folded into - the conversation's real usage/cache-rate (keep it visually distinct from the real cache rate in - §`frontend-cache-rate-handoff.md`). +- `cachePct` = `round(cacheReadTokens / inputTokens * 100)` — the cache RATE of the warm request. +- `expectedCacheRate` = `round(cacheReadTokens / (cacheReadTokens + cacheWriteTokens) * 100)` — the + **retention / health** signal: ~**100%** when the cache was still warm (read back, ~nothing + rewritten), **0%** when it had expired (rewrote everything). This is the one to headline for a + "is warming working?" indicator. +- The warm is **never** persisted or streamed and is **never** folded into the conversation's real + usage/cache-rate (keep it visually distinct from the real cache rate in §F / `frontend-cache-rate-handoff.md`). - Types live in `@dispatch/transport-contract` (`WarmRequest`, `WarmResponse`). ## E. Behavior model (for the UX) @@ -108,8 +113,46 @@ POST /chat/warm - Verified live against Claude (`claude/claude-haiku-4-5-...`): an idle conversation's warm reports ~100% cache read once its prefix exceeds the provider's min-cacheable size. +## F. Cache-rate metric — a correctness fix + the "expected cache" metric (READ THIS) +A backend bug made the cache-hit % read **100% on Claude whenever anything was cached** (it inflated). +Root cause: Anthropic's `input_tokens` is the *uncached remainder*, with cache read/creation reported +separately — but the wire `Usage.inputTokens` convention (which the flash/OpenAI-compat provider +already follows) is the **TOTAL prompt incl. cached**. Fixed in `../claude/provider-anthropic` +(`inputTokens = input + cacheRead + cacheWrite`). **No FE change needed** — your existing +`cacheRead/inputTokens` math (see `frontend-cache-rate-handoff.md`) now yields the *true* rate on +Claude. (Note: that older handoff's caveat "cacheWriteTokens is usually absent" is **not** true for +Claude — it reports both.) + +Two distinct cache numbers — show them as different things: +- **Cache rate** = `cacheReadTokens / inputTokens` — *what fraction of THIS turn's prompt came from + cache*. It legitimately **drops when a turn adds a lot of new content** (e.g. a turn that pastes a + big file reads back the old prefix but also writes the new file → rate < 100%). This is the + per-turn efficiency number, available on every `usage`/`done` event and in persisted metrics. +- **Expected cache (retention)** = *of the cache that existed going into this turn, how much did we + read back* — ideally **~100% every turn after the first** (you re-read the entire prefix you + cached). It is a **cross-turn** derivation: + ``` + expectedCacheRate(turn N) = cacheRead_N / (cacheRead_{N-1} + cacheWrite_{N-1}) // clamp [0,1] + ``` + (denominator = the prior turn's cached prefix = what it read + what it wrote). **<100% means the + cache busted/expired** between turns. The FE derives this from two consecutive turns' usage (which + you already have, live + persisted). For the WARM endpoint/surface this same idea is the single-shot + `expectedCacheRate` (§C/§D) the backend already computes. + +**Worked example (live, Claude haiku), one chat, two real turns:** +| turn | inputTokens (total) | cacheRead | cacheWrite | cache rate `cr/input` | expected cache (cross-turn) | +|---|---|---|---|---|---| +| 1 (fresh) | 5149 | 0 | 5146 | 0% | — | +| 2 (new msg) | 8462 | 5146 | 3313 | **61%** | `5146/(0+5146)` = **100%** | + +So on turn 2 the prompt was 61% cache (the rest was the new message), yet you successfully read back +**100%** of what turn 1 cached — two true, complementary signals. (Pre-fix, the rate wrongly showed +100% because the denominator excluded the 5146 cached tokens.) + ## Versions / type references - `@dispatch/ui-contract`: `NumberField` (new `SurfaceField` variant); `conversationId?` on `SubscribeMessage`/`UnsubscribeMessage`/`InvokeMessage`/`SurfaceMessage`/`SurfaceUpdate`. -- `@dispatch/transport-contract`: `WarmRequest`, `WarmResponse`. -- Cache-% math + the real (non-warming) cache rate: see `frontend-cache-rate-handoff.md` (unchanged). +- `@dispatch/transport-contract`: `WarmRequest`, `WarmResponse` (now incl. `expectedCacheRate`). +- Cache-% fix: `../claude/provider-anthropic` now reports `inputTokens` as the total prompt — the + real (non-warming) cache rate in `frontend-cache-rate-handoff.md` becomes accurate on Claude with + no FE change; ignore that doc's "cacheWriteTokens usually absent" caveat for Claude. |
