fix(cache-warming): accurate cache rate + expectedCacheRate (retention) metric

The Claude cache % read 100% whenever anything was cached, because the metric's denominator (inputTokens) excluded cached tokens on Anthropic. Fixed upstream in ../claude/provider-anthropic (inputTokens = total prompt); this commit adds the companion retention metric and exposes it:

- transport-contract: WarmResponse += expectedCacheRate
- transport-http: POST /chat/warm returns expectedCacheRate = cacheRead/(cacheRead+cacheWrite)
- cache-warming: computeExpectedCacheRate + a per-conversation 'cache retention' surface stat
- handoff: documents the fix + cache-rate vs expected-cache (cross-turn) for the FE

Live-verified vs claude haiku: real turn cache rate 61% (was inflated 100%); warm within TTL expectedCacheRate=100%, after expiry=0%.
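
The metrics named in this message can be sketched roughly as follows. This is an illustrative sketch, not the code in this commit: `TurnUsage`, `toPct`, `cacheRate`, `warmRetention` and `crossTurnRetention` are hypothetical names. The token counts are the two live turns from the handoff's worked example, and the pre-fix `inputTokens` of 3 is derived from that table (8462 - 5146 - 3313), not separately measured.

```typescript
// Hypothetical usage shape; field names mirror the wire Usage fields in the handoff.
interface TurnUsage {
  inputTokens: number; // TOTAL prompt incl. cached tokens (post-fix convention)
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// Clamp a ratio into [0, 1] and express it as a whole percent (0 if the denominator is <= 0).
function toPct(numerator: number, denominator: number): number {
  if (denominator <= 0) return 0;
  return Math.round(Math.max(0, Math.min(1, numerator / denominator)) * 100);
}

// Cache rate: fraction of THIS request's prompt that was served from cache.
function cacheRate(u: TurnUsage): number {
  return toPct(u.cacheReadTokens, u.inputTokens);
}

// Single-shot retention for a warm request: read back vs. (re)written.
function warmRetention(u: TurnUsage): number {
  return toPct(u.cacheReadTokens, u.cacheReadTokens + u.cacheWriteTokens);
}

// Cross-turn retention: how much of the prior turn's cached prefix this turn read back.
function crossTurnRetention(prev: TurnUsage, cur: TurnUsage): number {
  return toPct(cur.cacheReadTokens, prev.cacheReadTokens + prev.cacheWriteTokens);
}

// The two live turns from the handoff's worked example.
const turn1: TurnUsage = { inputTokens: 5149, cacheReadTokens: 0, cacheWriteTokens: 5146 };
const turn2: TurnUsage = { inputTokens: 8462, cacheReadTokens: 5146, cacheWriteTokens: 3313 };

// Pre-fix, inputTokens was only the uncached remainder, so the ratio exceeded 1 and clamped.
const turn2PreFix: TurnUsage = { ...turn2, inputTokens: 8462 - 5146 - 3313 };

console.log(cacheRate(turn2PreFix)); // 100 (the inflated reading)
console.log(cacheRate(turn2)); // 61
console.log(crossTurnRetention(turn1, turn2)); // 100
```

The same clamp explains both symptoms: with the uncached remainder as denominator, any cache read pushes the ratio above 1, so the displayed rate sticks at 100%.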
author: Adam Malczewski <[email protected]> 2026-06-11 14:11:13 +0900
committer: Adam Malczewski <[email protected]> 2026-06-11 14:11:13 +0900
commit: 7ffb6b28f5b6bdbfc53ebed94fc68af557612189 (patch)
tree: e66d9ea9d326ef771cc473d81ca5716ff78b08a8
parent: 763e5fb1c7fbfb4c7bbd43ffb935e42e5f5b5a42 (diff)
13 files changed, 280 insertions, 19 deletions
diff --git a/frontend-cache-warming-handoff.md b/frontend-cache-warming-handoff.md
index e5d50b3..64b94d6 100644
--- a/frontend-cache-warming-handoff.md
+++ b/frontend-cache-warming-handoff.md
@@ -69,7 +69,8 @@ scoped** (state differs per conversation, e.g. cache-warming). To support the la
   |---|---|---|---|
   | `toggle` | enabled on/off | warming on for this conversation | `cache-warming/toggle` |
   | `number` | refresh interval | **seconds** (`unit:"s"`, `min:1`, `step:1`, no `max` = free value) | `cache-warming/set-interval` |
-  | `stat`   | last cache % | most recent warm's hit % (`"—"` when none yet) | — (read-only) |
+  | `stat`   | last cache rate | most recent warm's `cachePct` (`"—"` when none yet) | — (read-only) |
+  | `stat`   | cache retention | most recent warm's `expectedCacheRate` — the **health** signal (~100% = cache stayed warm; 0% = it expired) | — (read-only) |
 - **Invoke payloads:**
   - `cache-warming/toggle` → **flips** the current enabled state. Send `{ type: "invoke", surfaceId:
     "cache-warming", actionId: "cache-warming/toggle", conversationId }` (payload is ignored — it
@@ -86,17 +87,21 @@ For an on-demand warm (e.g. a button) without waiting for the automatic timer:
 ```
 POST /chat/warm
   body  WarmRequest  { conversationId: string; model?: string; cwd?: string }
-  200   WarmResponse { inputTokens; outputTokens; cacheReadTokens; cacheWriteTokens; cachePct }
+  200   WarmResponse { inputTokens; outputTokens; cacheReadTokens; cacheWriteTokens;
+                       cachePct; expectedCacheRate }
   409   { error }    // the conversation is currently generating — try again when idle
   400   { error }    // missing/invalid conversationId
 ```
 - Pass the **same `model`** (`<credentialName>/<model>`) the conversation chats with, so the warm
   request's prefix matches the real turn (that's what makes the cache hit). `cwd` only matters if the
   conversation uses cwd-scoped tools.
-- `cachePct` = `round(clamp(cacheReadTokens / inputTokens, 0, 1) * 100)` — show it as the "last
-  warming" hit indicator. The warm is **never** persisted or streamed and is **never** folded into
-  the conversation's real usage/cache-rate (keep it visually distinct from the real cache rate in
-  §`frontend-cache-rate-handoff.md`).
+- `cachePct` = `round(cacheReadTokens / inputTokens * 100)` — the cache RATE of the warm request.
+- `expectedCacheRate` = `round(cacheReadTokens / (cacheReadTokens + cacheWriteTokens) * 100)` — the
+  **retention / health** signal: ~**100%** when the cache was still warm (read back, ~nothing
+  rewritten), **0%** when it had expired (rewrote everything). This is the one to headline for a
+  "is warming working?" indicator.
+- The warm is **never** persisted or streamed and is **never** folded into the conversation's real
+  usage/cache-rate (keep it visually distinct from the real cache rate in §F / `frontend-cache-rate-handoff.md`).
 - Types live in `@dispatch/transport-contract` (`WarmRequest`, `WarmResponse`).
 
 ## E. Behavior model (for the UX)
@@ -108,8 +113,46 @@ POST /chat/warm
 - Verified live against Claude (`claude/claude-haiku-4-5-...`): an idle conversation's warm reports
   ~100% cache read once its prefix exceeds the provider's min-cacheable size.
 
+## F. Cache-rate metric — a correctness fix + the "expected cache" metric (READ THIS)
+A backend bug made the cache-hit % read **100% on Claude whenever anything was cached** (it inflated).
+Root cause: Anthropic's `input_tokens` is the *uncached remainder*, with cache read/creation reported
+separately — but the wire `Usage.inputTokens` convention (which the flash/OpenAI-compat provider
+already follows) is the **TOTAL prompt incl. cached**. Fixed in `../claude/provider-anthropic`
+(`inputTokens = input + cacheRead + cacheWrite`). **No FE change needed** — your existing
+`cacheRead/inputTokens` math (see `frontend-cache-rate-handoff.md`) now yields the *true* rate on
+Claude. (Note: that older handoff's caveat "cacheWriteTokens is usually absent" is **not** true for
+Claude — it reports both.)
+
+Two distinct cache numbers — show them as different things:
+- **Cache rate** = `cacheReadTokens / inputTokens` — *what fraction of THIS turn's prompt came from
+  cache*. It legitimately **drops when a turn adds a lot of new content** (e.g. a turn that pastes a
+  big file reads back the old prefix but also writes the new file → rate < 100%). This is the
+  per-turn efficiency number, available on every `usage`/`done` event and in persisted metrics.
+- **Expected cache (retention)** = *of the cache that existed going into this turn, how much did we
+  read back* — ideally **~100% every turn after the first** (you re-read the entire prefix you
+  cached). It is a **cross-turn** derivation:
+  ```
+  expectedCacheRate(turn N) = cacheRead_N / (cacheRead_{N-1} + cacheWrite_{N-1})   // clamp [0,1]
+  ```
+  (denominator = the prior turn's cached prefix = what it read + what it wrote). **<100% means the
+  cache busted/expired** between turns. The FE derives this from two consecutive turns' usage (which
+  you already have, live + persisted). For the WARM endpoint/surface this same idea is the single-shot
+  `expectedCacheRate` (§C/§D) the backend already computes.
+
+**Worked example (live, Claude haiku), one chat, two real turns:**
+| turn | inputTokens (total) | cacheRead | cacheWrite | cache rate `cr/input` | expected cache (cross-turn) |
+|---|---|---|---|---|---|
+| 1 (fresh) | 5149 | 0 | 5146 | 0% | — |
+| 2 (new msg) | 8462 | 5146 | 3313 | **61%** | `5146/(0+5146)` = **100%** |
+
+So on turn 2 the prompt was 61% cache (the rest was the new message), yet you successfully read back
+**100%** of what turn 1 cached — two true, complementary signals. (Pre-fix, the rate wrongly showed
+100% because the denominator excluded the 5146 cached tokens.)
+
 ## Versions / type references
 - `@dispatch/ui-contract`: `NumberField` (new `SurfaceField` variant); `conversationId?` on
   `SubscribeMessage`/`UnsubscribeMessage`/`InvokeMessage`/`SurfaceMessage`/`SurfaceUpdate`.
-- `@dispatch/transport-contract`: `WarmRequest`, `WarmResponse`.
-- Cache-% math + the real (non-warming) cache rate: see `frontend-cache-rate-handoff.md` (unchanged).
+- `@dispatch/transport-contract`: `WarmRequest`, `WarmResponse` (now incl. `expectedCacheRate`).
+- Cache-% fix: `../claude/provider-anthropic` now reports `inputTokens` as the total prompt — the
+  real (non-warming) cache rate in `frontend-cache-rate-handoff.md` becomes accurate on Claude with
+  no FE change; ignore that doc's "cacheWriteTokens usually absent" caveat for Claude.
diff --git a/packages/cache-warming/src/extension.ts b/packages/cache-warming/src/extension.ts
index 26d429b..802618a 100644
--- a/packages/cache-warming/src/extension.ts
+++ b/packages/cache-warming/src/extension.ts
@@ -77,7 +77,12 @@ export function activate(host: HostAPI): void {
 			return buildDefaultSpec();
 		}
 		const state = warmer.getState(convId);
-		return buildConversationSpec(state.enabled, state.intervalMs, state.lastPct);
+		return buildConversationSpec(
+			state.enabled,
+			state.intervalMs,
+			state.lastPct,
+			state.lastExpectedPct,
+		);
 	}
 
 	async function invoke(
diff --git a/packages/cache-warming/src/index.ts b/packages/cache-warming/src/index.ts
index d77f4ec..88cab3b 100644
--- a/packages/cache-warming/src/index.ts
+++ b/packages/cache-warming/src/index.ts
@@ -5,6 +5,7 @@ export {
 	type ConversationSettings,
 	type ConversationState,
 	computeCachePct,
+	computeExpectedCacheRate,
 	DEFAULT_INTERVAL_MS,
 	isTokenCurrent,
 	MIN_INTERVAL_MS,
diff --git a/packages/cache-warming/src/pure.test.ts b/packages/cache-warming/src/pure.test.ts
index 1c912f2..f5e2f1d 100644
--- a/packages/cache-warming/src/pure.test.ts
+++ b/packages/cache-warming/src/pure.test.ts
@@ -4,6 +4,7 @@ import {
 	buildConversationSpec,
 	buildDefaultSpec,
 	computeCachePct,
+	computeExpectedCacheRate,
 	isTokenCurrent,
 	MIN_INTERVAL_MS,
 	msToSeconds,
@@ -29,6 +30,20 @@ describe("computeCachePct", () => {
 	});
 });
 
+describe("computeExpectedCacheRate", () => {
+	it("cacheRead/(cacheRead+cacheWrite) rounded", () => {
+		expect(computeExpectedCacheRate(800, 200)).toBe(80);
+		expect(computeExpectedCacheRate(500, 500)).toBe(50);
+		expect(computeExpectedCacheRate(1000, 0)).toBe(100);
+		expect(computeExpectedCacheRate(0, 1000)).toBe(0);
+		expect(computeExpectedCacheRate(333, 667)).toBe(33);
+	});
+
+	it("0 when cacheRead+cacheWrite is 0", () => {
+		expect(computeExpectedCacheRate(0, 0)).toBe(0);
+	});
+});
+
 describe("shouldWarm", () => {
 	it("returns true when enabled, idle, and token matches", () => {
 		const state: ConversationState = {
@@ -36,6 +51,7 @@ describe("shouldWarm", () => {
 			intervalMs: 240_000,
 			active: false,
 			lastPct: null,
+			lastExpectedPct: null,
 			token: 5,
 		};
 		expect(shouldWarm(state, 5)).toBe(true);
@@ -47,6 +63,7 @@ describe("shouldWarm", () => {
 			intervalMs: 240_000,
 			active: false,
 			lastPct: null,
+			lastExpectedPct: null,
 			token: 5,
 		};
 		expect(shouldWarm(state, 5)).toBe(false);
@@ -58,6 +75,7 @@ describe("shouldWarm", () => {
 			intervalMs: 240_000,
 			active: true,
 			lastPct: null,
+			lastExpectedPct: null,
 			token: 5,
 		};
 		expect(shouldWarm(state, 5)).toBe(false);
@@ -69,6 +87,7 @@ describe("shouldWarm", () => {
 			intervalMs: 240_000,
 			active: false,
 			lastPct: null,
+			lastExpectedPct: null,
 			token: 5,
 		};
 		expect(shouldWarm(state, 6)).toBe(false);
@@ -162,12 +181,12 @@ describe("parseIntervalPayload", () => {
 });
 
 describe("buildConversationSpec", () => {
-	it("builds a per-conversation spec with toggle + number(interval) + last-% fields", () => {
-		const spec = buildConversationSpec(true, 240_000, 80);
+	it("builds a per-conversation spec with toggle + number(interval) + last-% + retention fields", () => {
+		const spec = buildConversationSpec(true, 240_000, 80, 95);
 		expect(spec.id).toBe("cache-warming");
 		expect(spec.region).toBe("side");
 		expect(spec.title).toBe("Cache Warming");
-		expect(spec.fields).toHaveLength(3);
+		expect(spec.fields).toHaveLength(4);
 
 		const toggle = spec.fields[0];
 		expect(toggle).toEqual({
@@ -194,20 +213,33 @@ describe("buildConversationSpec", () => {
 			label: "Last Cache %",
 			value: "80%",
 		});
+
+		const retention = spec.fields[3];
+		expect(retention).toEqual({
+			kind: "stat",
+			label: "Cache retention",
+			value: "95%",
+		});
 	});
 
-	it("shows — when lastPct is null", () => {
-		const spec = buildConversationSpec(true, 240_000, null);
+	it("shows — when lastPct and lastExpectedPct are null", () => {
+		const spec = buildConversationSpec(true, 240_000, null, null);
 		const stat = spec.fields[2];
 		expect(stat).toEqual({
 			kind: "stat",
 			label: "Last Cache %",
 			value: "—",
 		});
+		const retention = spec.fields[3];
+		expect(retention).toEqual({
+			kind: "stat",
+			label: "Cache retention",
+			value: "—",
+		});
 	});
 
 	it("reflects disabled state", () => {
-		const spec = buildConversationSpec(false, 120_000, 50);
+		const spec = buildConversationSpec(false, 120_000, 50, 75);
 		const toggle = spec.fields[0];
 		expect(toggle).toEqual({
 			kind: "toggle",
diff --git a/packages/cache-warming/src/pure.ts b/packages/cache-warming/src/pure.ts
index 7b91b11..ab6fc79 100644
--- a/packages/cache-warming/src/pure.ts
+++ b/packages/cache-warming/src/pure.ts
@@ -17,6 +17,7 @@ export interface ConversationSettings {
 export interface ConversationState extends ConversationSettings {
 	readonly active: boolean;
 	readonly lastPct: number | null;
+	readonly lastExpectedPct: number | null;
 	readonly token: number;
 }
 
@@ -43,6 +44,21 @@ export function computeCachePct(inputTokens: number, cacheReadTokens: number): n
 }
 
 /**
+ * Compute expected cache retention rate from token counts.
+ * Of the cacheable prefix the warm touched, how much was still warm (read back)
+ * vs. had to be (re)written.
+ * Returns an integer in [0, 100]. cacheRead + cacheWrite ≤ 0 → 0.
+ */
+export function computeExpectedCacheRate(
+	cacheReadTokens: number,
+	cacheWriteTokens: number,
+): number {
+	const total = cacheReadTokens + cacheWriteTokens;
+	if (total <= 0) return 0;
+	return Math.round((cacheReadTokens / total) * 100);
+}
+
+/**
  * Decide whether a conversation should be warmed right now.
  * Requires: enabled, idle (not active), and the token is current (not superseded).
  */
@@ -120,8 +136,10 @@ export function buildConversationSpec(
 	enabled: boolean,
 	intervalMs: number,
 	lastPct: number | null,
+	lastExpectedPct: number | null,
 ): SurfaceSpec {
 	const pctDisplay = lastPct === null ? "—" : `${lastPct}%`;
+	const retentionDisplay = lastExpectedPct === null ? "—" : `${lastExpectedPct}%`;
 	const toggle: ToggleField = {
 		kind: "toggle",
 		label: "Enabled",
@@ -142,11 +160,16 @@ export function buildConversationSpec(
 		label: "Last Cache %",
 		value: pctDisplay,
 	};
+	const retentionStat: StatField = {
+		kind: "stat",
+		label: "Cache retention",
+		value: retentionDisplay,
+	};
 	return {
 		id: "cache-warming",
 		region: "side",
 		title: "Cache Warming",
-		fields: [toggle, interval, stat],
+		fields: [toggle, interval, stat, retentionStat],
 	};
 }
 
diff --git a/packages/cache-warming/src/warmer.test.ts b/packages/cache-warming/src/warmer.test.ts
index 9865877..86908a2 100644
--- a/packages/cache-warming/src/warmer.test.ts
+++ b/packages/cache-warming/src/warmer.test.ts
@@ -182,6 +182,30 @@ describe("CacheWarmer", () => {
 		expect(state.lastPct).toBe(80);
 	});
 
+	it("a completed warm stores both lastPct (rate) and lastExpectedPct (retention)", async () => {
+		const timers = fakeTimers();
+		const warmer = createCacheWarmer({
+			warm: async () => ({
+				inputTokens: 1000,
+				outputTokens: 10,
+				cacheReadTokens: 700,
+				cacheWriteTokens: 300,
+			}),
+			storage: memStorage(),
+			logger: makeLogger(),
+			timers,
+			onSurfaceChange: () => {},
+		});
+
+		warmer.onTurnSettled("conv-1", {});
+		timers.flush();
+
+		await new Promise((r) => setTimeout(r, 10));
+		const state = warmer.getState("conv-1");
+		expect(state.lastPct).toBe(70);
+		expect(state.lastExpectedPct).toBe(70);
+	});
+
 	it("re-arms timer after warm completes", async () => {
 		const timers = fakeTimers();
 		let warmCount = 0;
@@ -316,4 +340,27 @@ describe("CacheWarmer", () => {
 		await warmer.setIntervalMs("conv-1", 30_000);
 		expect(changeCount).toBe(2);
 	});
+
+	it("the per-conversation spec includes a cache-retention stat", async () => {
+		const timers = fakeTimers();
+		const warmer = createCacheWarmer({
+			warm: async () => ({
+				inputTokens: 1000,
+				outputTokens: 10,
+				cacheReadTokens: 900,
+				cacheWriteTokens: 100,
+			}),
+			storage: memStorage(),
+			logger: makeLogger(),
+			timers,
+			onSurfaceChange: () => {},
+		});
+
+		warmer.onTurnSettled("conv-1", {});
+		timers.flush();
+		await new Promise((r) => setTimeout(r, 10));
+
+		const state = warmer.getState("conv-1");
+		expect(state.lastExpectedPct).toBe(90);
+	});
 });
diff --git a/packages/cache-warming/src/warmer.ts b/packages/cache-warming/src/warmer.ts
index 31dd41e..f50f346 100644
--- a/packages/cache-warming/src/warmer.ts
+++ b/packages/cache-warming/src/warmer.ts
@@ -5,6 +5,7 @@ import {
 	type ConversationSettings,
 	type ConversationState,
 	computeCachePct,
+	computeExpectedCacheRate,
 	DEFAULT_INTERVAL_MS,
 	isTokenCurrent,
 	MIN_INTERVAL_MS,
@@ -63,6 +64,7 @@ const DEFAULT_STATE: ConversationState = {
 	intervalMs: DEFAULT_INTERVAL_MS,
 	active: false,
 	lastPct: null,
+	lastExpectedPct: null,
 	token: 0,
 };
 
@@ -145,11 +147,13 @@ export function createCacheWarmer(deps: CacheWarmerDeps): CacheWarmer {
 			});
 		} else {
 			const pct = computeCachePct(result.inputTokens, result.cacheReadTokens);
-			setState(conversationId, { ...currentState, lastPct: pct });
+			const expectedPct = computeExpectedCacheRate(result.cacheReadTokens, result.cacheWriteTokens);
+			setState(conversationId, { ...currentState, lastPct: pct, lastExpectedPct: expectedPct });
 			deps.onSurfaceChange();
 			deps.logger.debug("cache-warming: warm complete", {
 				conversationId,
 				pct,
+				expectedPct,
 			});
 		}
 
diff --git a/packages/transport-contract/src/index.ts b/packages/transport-contract/src/index.ts
index fbb61fc..95111ae 100644
--- a/packages/transport-contract/src/index.ts
+++ b/packages/transport-contract/src/index.ts
@@ -192,10 +192,21 @@ export interface WarmResponse {
 	readonly cacheReadTokens: number;
 	readonly cacheWriteTokens: number;
 	/**
-	 * Cache-hit percent: `round(clamp(cacheReadTokens / inputTokens, 0, 1) * 100)`
-	 * (0 when `inputTokens <= 0`).
+	 * **Cache rate** — what fraction of THIS request's prompt was served from cache:
+	 * `round(cacheReadTokens / inputTokens * 100)` (0 when `inputTokens <= 0`).
+	 * (`inputTokens` is the TOTAL prompt incl. cached, so this is in [0,100].)
 	 */
 	readonly cachePct: number;
+	/**
+	 * **Expected cache (retention)** — of the cacheable prefix this warm touched, how
+	 * much was still warm and read back vs. had to be (re)written:
+	 * `round(cacheReadTokens / (cacheReadTokens + cacheWriteTokens) * 100)` (0 when the
+	 * sum is 0). For a healthy warm this is ~**100%** (the whole prefix was still
+	 * cached); it drops toward 0 as the cache expires/busts and the warm has to rewrite
+	 * it. This is the warming HEALTH signal — distinct from `cachePct` (which a warm's
+	 * tiny fresh probe makes ~equal, but which on a real turn reflects new content).
+	 */
+	readonly expectedCacheRate: number;
 }
 
 // ─── WebSocket chat ops ───────────────────────────────────────────────────────
diff --git a/packages/transport-http/src/app.test.ts b/packages/transport-http/src/app.test.ts
index 7352b5d..22b26fc 100644
--- a/packages/transport-http/src/app.test.ts
+++ b/packages/transport-http/src/app.test.ts
@@ -449,12 +449,64 @@ describe("POST /chat/warm", () => {
 			cacheReadTokens: number;
 			cacheWriteTokens: number;
 			cachePct: number;
+			expectedCacheRate: number;
 		};
 		expect(body.inputTokens).toBe(1000);
 		expect(body.outputTokens).toBe(200);
 		expect(body.cacheReadTokens).toBe(800);
 		expect(body.cacheWriteTokens).toBe(100);
 		expect(body.cachePct).toBe(80);
+		expect(body.expectedCacheRate).toBe(89);
+	});
+
+	it("POST /chat/warm returns expectedCacheRate = round(cacheRead/(cacheRead+cacheWrite)*100)", async () => {
+		const app = createApp({
+			conversationStore: createFakeConversationStore(),
+			orchestrator: createFakeOrchestrator([]),
+			credentialStore: createFakeCredentialStore([]),
+			warmService: createFakeWarmService({
+				inputTokens: 500,
+				outputTokens: 100,
+				cacheReadTokens: 400,
+				cacheWriteTokens: 100,
+			}),
+			logger: noopLogger,
+		});
+
+		const res = await app.request("/chat/warm", {
+			method: "POST",
+			headers: { "Content-Type": "application/json" },
+			body: JSON.stringify({ conversationId: "conv1" }),
+		});
+
+		expect(res.status).toBe(200);
+		const body = (await res.json()) as { expectedCacheRate: number };
+		expect(body.expectedCacheRate).toBe(80);
+	});
+
+	it("POST /chat/warm returns expectedCacheRate = 0 when cacheRead+cacheWrite is 0", async () => {
+		const app = createApp({
+			conversationStore: createFakeConversationStore(),
+			orchestrator: createFakeOrchestrator([]),
+			credentialStore: createFakeCredentialStore([]),
+			warmService: createFakeWarmService({
+				inputTokens: 100,
+				outputTokens: 50,
+				cacheReadTokens: 0,
+				cacheWriteTokens: 0,
+			}),
+			logger: noopLogger,
+		});
+
+		const res = await app.request("/chat/warm", {
+			method: "POST",
+			headers: { "Content-Type": "application/json" },
+			body: JSON.stringify({ conversationId: "conv1" }),
+		});
+
+		expect(res.status).toBe(200);
+		const body = (await res.json()) as { expectedCacheRate: number };
+		expect(body.expectedCacheRate).toBe(0);
 	});
 
 	it("POST /chat/warm returns 409 when the warm service reports the conversation is generating", async () => {
diff --git a/packages/transport-http/src/app.ts b/packages/transport-http/src/app.ts
index a8cef51..84c7d20 100644
--- a/packages/transport-http/src/app.ts
+++ b/packages/transport-http/src/app.ts
@@ -10,6 +10,7 @@ import { Hono } from "hono";
 import { cors } from "hono/cors";
 import {
 	computeCachePct,
+	computeExpectedCacheRate,
 	isParseError,
 	isSinceSeqError,
 	parseChatBody,
@@ -284,6 +285,7 @@ export function createApp(opts: CreateServerOptions): Hono {
 			cacheReadTokens: result.cacheReadTokens,
 			cacheWriteTokens: result.cacheWriteTokens,
 			cachePct: computeCachePct(result.inputTokens, result.cacheReadTokens),
+			expectedCacheRate: computeExpectedCacheRate(result.cacheReadTokens, result.cacheWriteTokens),
 		};
 		return c.json(response, 200);
 	});
diff --git a/packages/transport-http/src/logic.test.ts b/packages/transport-http/src/logic.test.ts
index 1e33f40..19b47ef 100644
--- a/packages/transport-http/src/logic.test.ts
+++ b/packages/transport-http/src/logic.test.ts
@@ -1,6 +1,7 @@
 import type { AgentEvent } from "@dispatch/kernel";
 import { describe, expect, it } from "vitest";
 import {
+	computeExpectedCacheRate,
 	isParseError,
 	isSinceSeqError,
 	parseChatBody,
@@ -197,3 +198,26 @@ describe("serializeEventLine", () => {
 		expect(parsed.reason).toBe("stop");
 	});
 });
+
+describe("computeExpectedCacheRate", () => {
+	it("returns round(cacheRead/(cacheRead+cacheWrite)*100)", () => {
+		expect(computeExpectedCacheRate(800, 200)).toBe(80);
+	});
+
+	it("returns 0 when cacheRead+cacheWrite is 0", () => {
+		expect(computeExpectedCacheRate(0, 0)).toBe(0);
+	});
+
+	it("returns 100 when all tokens are cacheRead", () => {
+		expect(computeExpectedCacheRate(500, 0)).toBe(100);
+	});
+
+	it("returns 0 when all tokens are cacheWrite", () => {
+		expect(computeExpectedCacheRate(0, 500)).toBe(0);
+	});
+
+	it("rounds to nearest integer", () => {
+		expect(computeExpectedCacheRate(1, 2)).toBe(33);
+		expect(computeExpectedCacheRate(2, 1)).toBe(67);
+	});
+});
diff --git a/packages/transport-http/src/logic.ts b/packages/transport-http/src/logic.ts
index bb827e2..bddedf0 100644
--- a/packages/transport-http/src/logic.ts
+++ b/packages/transport-http/src/logic.ts
@@ -113,3 +113,12 @@ export function computeCachePct(inputTokens: number, cacheReadTokens: number): n
 	if (inputTokens <= 0) return 0;
 	return Math.round(Math.max(0, Math.min(1, cacheReadTokens / inputTokens)) * 100);
 }
+
+export function computeExpectedCacheRate(
+	cacheReadTokens: number,
+	cacheWriteTokens: number,
+): number {
+	const denom = cacheReadTokens + cacheWriteTokens;
+	if (denom <= 0) return 0;
+	return Math.round((cacheReadTokens / denom) * 100);
+}
diff --git a/tasks.md b/tasks.md
index c94b156..6fd3676 100644
--- a/tasks.md
+++ b/tasks.md
@@ -162,6 +162,14 @@ arm-on-settle/cancel-on-start; `pct = round(clamp(cacheRead/input,0,1)*100)`).
 - **LIVE-VERIFIED against Claude haiku:** automatic timer warm → journal `warm complete pct:100`;
   manual `POST /chat/warm` → `cacheReadTokens:6799, cachePct:100` (100% hit), HTTP 200. The external
   `../claude` provider-anthropic is loaded via `bin/up` (`DISPATCH_EXTERNAL_EXTENSIONS`).
+- **Cache-metric fix + retention metric:** `provider-anthropic` (in `../claude`, commit `0e9d118`)
+  now reports `Usage.inputTokens` as the TOTAL prompt (was the uncached remainder → the cache rate
+  inflated/clamped to 100% on Claude). So `cacheRead/inputTokens` is now the true rate (live: a turn
+  adding new content reads 61%, not 100%). Added **`expectedCacheRate`** = `cacheRead/(cacheRead+
+  cacheWrite)` (retention/health, ~100% when warm, 0% when the cache expired) to `WarmResponse` +
+  `POST /chat/warm` + the cache-warming surface (a "cache retention" stat). Live-verified: warm
+  within TTL → 100%; warm after >5 min idle → 0% (cache expired). FE handoff updated with both
+  metrics + the cross-turn real-turn `expectedCache = cacheRead_N/(cacheRead_{N-1}+cacheWrite_{N-1})`.
 - **Surface framework extended (DONE):** added `NumberField` to `ui-contract` + per-conversation
   surface scoping (optional `conversationId` on subscribe/unsubscribe/invoke + surface/update; new
   `SurfaceContext` on `SurfaceProvider.getSpec/invoke`; transport-ws keys subscriptions by