frontend-cache-warming-handoff.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91

# FE handoff — cache warming: cache-rate fix + "expected cache" metric

> **Courier doc** (backend → `../dispatch-web`, via the user). Per ORCHESTRATOR §7 the backend does
> NOT write the FE repo. `lsp references` does not span the two repos.
> Backend commits: `7ffb6b2` (arch-rewrite), `0e9d118` (`../claude/provider-anthropic`).

## Status — most of the original handoff is DONE (removed)
Per the FE's `backend-handoff.md` (2026-06-11), the frontend has already consumed the bulk of the
earlier version of this doc — those sections are **removed**:
- ✅ `NumberField` (`kind:"number"`) renderer.
- ✅ Conversation-scoped surface subscriptions (focused `conversationId` on subscribe/invoke +
  staleness rule; re-scope on conversation switch).
- ✅ The "Cache Warming" sidebar view: enabled toggle, minutes+seconds interval (`cache-warming/
  set-interval`), `cache-warming/toggle`, manual **Warm now** (`POST /chat/warm`), live countdown,
  hit-% history.
- ✅ `warmNow()` posting `/chat/warm` with the conversation's model.

What remains below is the ONE piece the FE has not yet consumed: a cache-rate **correctness fix** and
a new **retention** metric.

## Cache-rate metric — a correctness fix + the "expected cache" metric (TO CONSUME)
A backend bug made the cache-hit % read **100% on Claude whenever anything was cached** (it inflated).
Root cause: Anthropic's `input_tokens` is the *uncached remainder*, with cache read/creation reported
separately — but the wire `Usage.inputTokens` convention (which the flash/OpenAI-compat provider
already follows) is the **TOTAL prompt incl. cached**. Fixed in `../claude/provider-anthropic`
(`inputTokens = input + cacheRead + cacheWrite`). **No FE change needed for the fix itself** — your
existing `cacheRead/inputTokens` math (in `frontend-cache-rate-handoff.md`) now yields the *true* rate
on Claude. (That older handoff's caveat "cacheWriteTokens is usually absent" is **not** true for
Claude — it reports both.)

Show two distinct cache numbers:
- **Cache rate** = `cacheReadTokens / inputTokens` — *what fraction of THIS turn's prompt came from
  cache*. Legitimately **drops when a turn adds a lot of new content** (e.g. pasting a big file: reads
  the old prefix back but also writes the new file → rate < 100%). Per-turn efficiency; on every
  `usage`/`done` event + persisted metrics.
- **Expected cache (retention)** = *of the cache that existed going into this turn, how much we read
  back* — ideally **~100% every turn after the first**. **<100% = the cache busted/expired.** It is a
  **cross-turn** derivation (FE-side, from two consecutive turns' usage you already have):
  ```
  expectedCache(turn N) = clamp01( cacheRead_N / (cacheRead_{N-1} + cacheWrite_{N-1}) )
  ```
  (denominator = the prior turn's cached prefix = what it read + what it wrote).

**Worked example (live, Claude haiku), one chat, two real turns:**
| turn | inputTokens (total) | cacheRead | cacheWrite | cache rate `cr/input` | expected cache (cross-turn) |
|---|---|---|---|---|---|
| 1 (fresh) | 5149 | 0 | 5146 | 0% | — |
| 2 (new msg) | 8462 | 5146 | 3313 | **61%** | `5146/(0+5146)` = **100%** |

So on turn 2 the prompt was 61% cache (the rest was the new message), yet you read back **100%** of
what turn 1 cached — two true, complementary signals. (Pre-fix, the rate wrongly showed 100% because
the denominator excluded the 5146 cached tokens.)

### Warming-specific (already on the wire — small additions)
For the warming feature, the backend now also reports a **single-shot** retention so you don't have to
track cross-turn state there:
- **`WarmResponse.expectedCacheRate`** (new field on `POST /chat/warm`) =
  `round(cacheReadTokens / (cacheReadTokens + cacheWriteTokens) * 100)` — ~**100%** when the warm
  found the cache still warm, **0%** when it had expired (rewrote everything). This is the **"is
  warming working?"** signal — headline this for the Warm-now result rather than `cachePct`.
- The conversation-scoped `cache-warming` surface gained a matching **`stat` "cache retention"** field
  (alongside the existing "last cache rate" stat). It's a generic `stat`, so your existing renderer
  already shows it — just relabel/position as desired.

Types: `@dispatch/transport-contract` `WarmResponse` now carries `expectedCacheRate` (additive).

## CR-3 — DONE (next-warm timestamps + manual-warm resets the timer)
Both asks from `backend-handoff-cache-warming-timer.md` are implemented (commit `bfbad3a`). No
contract bump (uses the `custom` escape hatch, as you suggested).

**Ask 1 — authoritative timestamps on the `cache-warming` surface.** The conversation-scoped spec now
includes a `custom` field:
```ts
{ kind: "custom", rendererId: "cache-warming-timer",
  payload: { nextWarmAt: number | null, lastWarmAt: number | null } }   // epoch-ms
```
- `nextWarmAt` = epoch-ms the next AUTOMATIC warm will fire, or `null` when not scheduled (disabled,
  or a turn is generating so the timer is cancelled). Drive your countdown off this directly.
- `lastWarmAt` = epoch-ms of the most recent completed warm, or `null` if none. Use its changes for
  the history. (The hit-% for that warm is the `last cache rate` / `cache retention` stats in the
  same spec.)
- Pushed via the normal surface `update` on every change (warm complete, toggle, interval, turn
  start/settle). You can drop the FE-side best-effort countdown anchor.

**Ask 2 — a manual `POST /chat/warm` now resets the cycle + refreshes the surface.** Implemented via
an inversion (no new endpoint, no change to the `/chat/warm` request/response): the backend's warm
service emits an internal event that the cache-warming extension consumes, so a manual warm now
re-arms the automatic timer (new `nextWarmAt`), updates `lastPct`/`lastWarmAt`, and **pushes a surface
`update`**. So after a "Warm now" click you'll get an authoritative surface `update` — you can drop the
workaround of reading the % from the HTTP response (though the HTTP `WarmResponse` is still returned and
fine to use for immediate feedback). Live-verified against Claude haiku.