


# FE handoff — cache hit/miss + percentage (calculation guide)

> **Courier doc** (backend → `../dispatch-web`, via the user). Per ORCHESTRATOR §7
> the backend does not write the FE repo. This describes ONLY how to compute cache
> hit/miss + percentages from data the backend ALREADY exposes — **no UI design here**
> (the look is specified separately) and **no backend change is required**.
> Contracts: `@dispatch/wire` + `@dispatch/transport-contract` `0.4.0`.

## TL;DR
The cache hit rate is `cacheReadTokens / inputTokens`. Everything you need is already
on the `usage` + `done` live events and in `GET /conversations/:id/metrics`. There is
**no separate cache endpoint or boolean** — it's derived from token counts, exactly as
the old `CacheRatePanel` did.

## The data shape (`Usage`, from `@dispatch/wire`)
```ts
interface Usage {
  inputTokens: number;       // TOTAL prompt tokens this step/turn, INCLUDING cached ones
  outputTokens: number;
  cacheReadTokens?: number;  // input tokens served FROM cache (the "hit" count). Optional.
  cacheWriteTokens?: number; // cache-creation count. Optional; usually ABSENT (see caveats).
}
```
Field semantics that matter for the math:
- `inputTokens` is the **whole** prompt, so `cacheReadTokens ≤ inputTokens` and the rate is in `[0,1]`.
- The cache fields are **optional** — treat `undefined` as `0` in all arithmetic.

## Formulas
```ts
const read  = u.cacheReadTokens  ?? 0;
const write = u.cacheWriteTokens ?? 0;

const isHit   = read > 0;                               // hit vs miss
const hitRate = u.inputTokens > 0 ? read / u.inputTokens : 0;   // 0..1  (guard /0)
const hitPct  = Math.round(hitRate * 100);
const fresh   = Math.max(0, u.inputTokens - read - write);      // uncached input tokens
```
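(These reproduce the old `CacheRatePanel.svelte` formulas: hit rate =
`cacheReadTokens/inputTokens` clamped; uncached = `max(0, input − read − write)`. The snippet
above has no explicit clamp because `cacheReadTokens ≤ inputTokens` already keeps the ratio in
`[0,1]`; wrap it in `Math.min(1, hitRate)` if you want to guard against a misreporting provider.)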

## Where to get `Usage` — three granularities, two channels

| Scope | LIVE (WS `chat.delta` / NDJSON) | REPLAY (`GET /conversations/:id/metrics`) |
|---|---|---|
| **Per step** | `usage` event (`type:"usage"`, carries `stepId`, `usage`) | `TurnMetrics.steps[].usage` (each has `stepId`) |
| **Per turn** (authoritative aggregate) | `done` event (`type:"done"`, carries `usage`, `durationMs`) | `TurnMetrics.usage` |
| **Cumulative** (conversation) | Σ of each turn's `done.usage` | Σ of `turns[].usage` |

Notes:
- The **per-turn aggregate IS the sum of its steps** (the runtime aggregates). So when
  summing a cumulative figure, pick ONE granularity — sum `done.usage`/`TurnMetrics.usage`
  per turn, **or** sum all steps — never both (double-count).
- `done.usage` is the authoritative per-turn total. (`turn-sealed` does NOT carry usage in
  this backend — it's just `{conversationId, turnId}`; the numbers ride the immediately
  preceding `done` event.)
- `step-complete` is timing only (ttft/decode) — no tokens; ignore it for cache.
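
The double-count warning above can be illustrated with a small sketch. `sumUsage` is a hypothetical helper (not part of `@dispatch/wire`), the `Usage` interface is a local copy of the shape shown earlier, and the token numbers are made up for illustration only.

```ts
// Local copy of the `Usage` shape shown earlier (normally imported from `@dispatch/wire`).
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

// Hypothetical helper: field-wise sum, treating absent cache fields as 0.
function sumUsage(usages: Usage[]): Required<Usage> {
  const total: Required<Usage> = {
    inputTokens: 0,
    outputTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
  };
  for (const u of usages) {
    total.inputTokens += u.inputTokens;
    total.outputTokens += u.outputTokens;
    total.cacheReadTokens += u.cacheReadTokens ?? 0;
    total.cacheWriteTokens += u.cacheWriteTokens ?? 0;
  }
  return total;
}

// Illustrative numbers only: one turn made of two steps.
const steps: Usage[] = [
  { inputTokens: 1000, outputTokens: 50, cacheReadTokens: 200 },
  { inputTokens: 1200, outputTokens: 80, cacheReadTokens: 1000 },
];
// Stand-in for what `done.usage` would carry, given that the turn is the sum of its steps.
const turnUsage: Usage = sumUsage(steps);

const perTurn = sumUsage([turnUsage]);          // option A: sum turn aggregates
const perStep = sumUsage(steps);                // option B: sum steps
const mixed = sumUsage([turnUsage, ...steps]);  // wrong: every raw count is doubled

console.log(perTurn.inputTokens, perStep.inputTokens, mixed.inputTokens);
```

Options A and B give the same totals; the mixed sum inflates every raw token count, which matters as soon as the panel shows absolute numbers next to the percentage.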

## Live accumulation + reconcile (recommended pattern)
1. **In-progress turn (optional live counter):** as `usage` events stream, you may sum
   `read`/`input` across the turn's steps to show a live-updating hit % for the current turn.
2. **Turn finished:** take that turn's authoritative totals from its `done.usage`. Use it as
   the turn's final value (replace any live partial for that turn).
3. **Cumulative (session/conversation):** add each completed turn's `done.usage` to a running
   total. Compute the cumulative hit % from the running totals (`ΣcacheRead / Σinput`).
4. **"Last request" rate:** the most recent turn's `done.usage` (or most recent step's `usage`
   if you want per-round-trip granularity).
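
The four steps above could be wired together roughly as follows. This is a minimal sketch, not the FE implementation: `CacheRateTracker` and its method names are hypothetical, it assumes only one turn is in progress at a time, and it keys live steps by `stepId` so a repeated `usage` event for the same step replaces rather than adds (an assumption about delivery, not something the contract states).

```ts
// Local copy of the `Usage` shape shown earlier (normally imported from `@dispatch/wire`).
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

interface Totals {
  input: number;
  read: number;
}

// Same guarded formula as in the Formulas section, applied to summed totals.
function pct(t: Totals): number {
  return t.input > 0 ? Math.round((t.read / t.input) * 100) : 0;
}

class CacheRateTracker {
  private liveSteps = new Map<string, Usage>(); // in-progress turn, keyed by stepId
  private cumulative: Totals = { input: 0, read: 0 };
  private lastTurn: Usage | undefined = undefined;

  // Step 1: a `usage` event arrived for the in-progress turn.
  onUsage(stepId: string, usage: Usage): void {
    this.liveSteps.set(stepId, usage);
  }

  // Steps 2 and 3: `done.usage` replaces the live partial and feeds the running total.
  onDone(usage: Usage): void {
    this.liveSteps.clear();
    this.lastTurn = usage;
    this.cumulative.input += usage.inputTokens;
    this.cumulative.read += usage.cacheReadTokens ?? 0;
  }

  // Live hit % for the current turn, summed across the steps seen so far.
  liveTurnPct(): number {
    const t: Totals = { input: 0, read: 0 };
    this.liveSteps.forEach((u) => {
      t.input += u.inputTokens;
      t.read += u.cacheReadTokens ?? 0;
    });
    return pct(t);
  }

  cumulativePct(): number {
    return pct(this.cumulative);
  }

  // Step 4: "last request" rate at turn granularity; undefined before the first `done`.
  lastRequestPct(): number | undefined {
    if (this.lastTurn === undefined) return undefined;
    return pct({
      input: this.lastTurn.inputTokens,
      read: this.lastTurn.cacheReadTokens ?? 0,
    });
  }
}

// Usage with illustrative numbers (not captured data).
const tracker = new CacheRateTracker();
tracker.onUsage("s1", { inputTokens: 1000, outputTokens: 40, cacheReadTokens: 250 });
tracker.onUsage("s2", { inputTokens: 3000, outputTokens: 60, cacheReadTokens: 2750 });
const liveBeforeDone = tracker.liveTurnPct();
tracker.onDone({ inputTokens: 4000, outputTokens: 100, cacheReadTokens: 3000 });
console.log(liveBeforeDone, tracker.cumulativePct(), tracker.lastRequestPct());
```

The sketch collapses `undefined` cache fields to 0, so it does not by itself cover the "provider doesn't report cache" state described under Caveats; a real panel would track that separately.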

## Replay / reopening a conversation
On open, `GET /conversations/:id/metrics` → `ConversationMetricsResponse { turns: TurnMetrics[] }`.
Seed the cumulative totals from `Σ turns[].usage`, the "last request" from `turns.at(-1).usage`,
and you can render a per-turn (and per-step, via `steps[]`) breakdown — a superset of what the
old session-cumulative-only panel could show.
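
A minimal sketch of that seeding step, assuming the response has already been fetched and parsed. The interfaces below are trimmed local stand-ins holding only the fields this document mentions (the real types in `@dispatch/wire` and `@dispatch/transport-contract` may carry more), and `seedFromMetrics` / `SeededCacheState` are hypothetical names.

```ts
// Trimmed local stand-ins; only the fields mentioned in this document.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}
interface StepMetrics {
  stepId: string;
  usage: Usage;
}
interface TurnMetrics {
  usage: Usage;
  steps: StepMetrics[];
}
interface ConversationMetricsResponse {
  turns: TurnMetrics[];
}

// Hypothetical shape for the state the panel starts from after a reopen.
interface SeededCacheState {
  cumulative: { input: number; read: number };
  lastRequest: Usage | undefined;
}

function seedFromMetrics(resp: ConversationMetricsResponse): SeededCacheState {
  const cumulative = { input: 0, read: 0 };
  // Sum at turn granularity only; adding steps[] as well would double-count.
  for (const turn of resp.turns) {
    cumulative.input += turn.usage.inputTokens;
    cumulative.read += turn.usage.cacheReadTokens ?? 0;
  }
  // Index-based equivalent of `turns.at(-1)`; undefined for an empty conversation.
  const lastRequest = resp.turns[resp.turns.length - 1]?.usage;
  return { cumulative, lastRequest };
}

// Usage with illustrative numbers (not captured data).
const seeded = seedFromMetrics({
  turns: [
    { usage: { inputTokens: 1000, outputTokens: 40, cacheReadTokens: 100 }, steps: [] },
    { usage: { inputTokens: 3000, outputTokens: 60, cacheReadTokens: 2900 }, steps: [] },
  ],
});
console.log(seeded.cumulative, seeded.lastRequest);
```

After seeding, live `done` events can keep adding to the same `cumulative` totals, so replayed and live turns end up in one running figure.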

## Caveats (be honest in the UI)
- **`cacheWriteTokens` is usually absent.** The current provider is OpenAI-compatible
  (OpenCode Go): it reports a cache **read** count (`cached_tokens`) but **no cache-creation**
  count. So the old panel's separate "write" row will be 0/empty. Hit/miss and the read
  percentage are unaffected. It would populate only if an Anthropic-native (or
  `cache_write`-reporting) provider is added.
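- **Optional fields:** any of the cache fields can be `undefined` (provider-dependent). Default
  to 0 in arithmetic and never assume presence, but keep `undefined` distinct from `0` when
  deciding what to display (see the next bullet).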
- **A legitimate 0% is not a bug.** OpenAI-style providers auto-cache (no `cache_control`
  breakpoints), and short prompts below the provider's cache threshold simply won't be cached —
  `cacheReadTokens: 0` is a real "miss", not missing data. Cache reads grow as a conversation's
  resent prefix gets large enough.
- **Provider doesn't report cache at all — distinguish from 0.** Some providers (e.g.
  **Umans**) never include `cache_read_tokens` / `cache_write_tokens` in their usage
  payload. In that case `cacheReadTokens` is `undefined` — the provider can't tell you
  whether cache was hit or missed. This is **different from `cacheReadTokens: 0`**,
  which means "cache was checked and there were 0 hits" (a real miss).

  The FE should distinguish these three states:

  | `cacheReadTokens` | Meaning | FE display |
  |---|---|---|
  | `undefined` | Provider doesn't report cache | Hide cache panel, or show "N/A" |
  | `0` | Provider reports cache; this request had 0 hits | Show "0%" (genuine miss) |
  | `> 0` | Cache hit | Show percentage |

  ```ts
  function cacheDisplay(u: Usage): { kind: "not-reported" } | { kind: "reported"; hitPct: number } {
    if (u.cacheReadTokens === undefined) return { kind: "not-reported" };
    const read = u.cacheReadTokens;
    const hitRate = u.inputTokens > 0 ? read / u.inputTokens : 0;
    return { kind: "reported", hitPct: Math.round(hitRate * 100) };
  }
  ```

  When `kind === "not-reported"`, do NOT show "0%" — that's misleading. Either hide the
  cache panel entirely or show "Cache: not reported". This also applies to `cacheWriteTokens`
  (if `undefined`, don't show a write row).

## Worked example (real numbers, captured live against OpenCode Go flash)
| Turn | inputTokens | cacheReadTokens | hit % |
|---|---|---|---|
| 1 | 2669 | 384 | 14% |
| 2 (history resent) | 2737 | 2560 | **94%** |

Cumulative: read `2944` / input `5406` → **54%**. These exact values appear both on the live
`done.usage` stream and in `GET /conversations/:id/metrics` (`turns[].usage`).
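
To re-derive the figures in this section, the table's token counts can be fed through the same `Math.round` formula from the Formulas section. This is a small sketch: `CacheCounts`, `hitPct`, and the variable names are hypothetical, and only the two fields listed in the table are modelled.

```ts
// Only the two fields the table lists; the other `Usage` fields are left out.
type CacheCounts = { inputTokens: number; cacheReadTokens?: number };

// Guarded hit percentage, same arithmetic as the Formulas section.
function hitPct(u: CacheCounts): number {
  const read = u.cacheReadTokens ?? 0;
  return u.inputTokens > 0 ? Math.round((read / u.inputTokens) * 100) : 0;
}

// Token counts copied from the worked-example table.
const exampleTurns: CacheCounts[] = [
  { inputTokens: 2669, cacheReadTokens: 384 },
  { inputTokens: 2737, cacheReadTokens: 2560 },
];

let totalInput = 0;
let totalRead = 0;
for (const t of exampleTurns) {
  totalInput += t.inputTokens;
  totalRead += t.cacheReadTokens ?? 0;
}

const perTurnPct = exampleTurns.map((t) => hitPct(t));
const cumulativePct = hitPct({ inputTokens: totalInput, cacheReadTokens: totalRead });

console.log({ perTurnPct, totalRead, totalInput, cumulativePct });
```

Note that the cumulative figure is computed from the summed totals, not by averaging the per-turn percentages.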

## Type references
- `@dispatch/wire`: `Usage`, `TurnUsageEvent` (`usage`), `TurnDoneEvent` (`done`),
  `TurnMetrics`, `StepMetrics`.
- `@dispatch/transport-contract`: `ConversationMetricsResponse`, and the WS `chat.delta`
  envelope carrying each `AgentEvent`.