# Frontend handoff — live turn metrics (tokens + timing)

> From: arch-rewrite (backend) orchestrator · For: the dispatch-web FE team.
> Status: **LIVE on the stream now** (backend committed + live-verified). Consume via the pinned
> contracts `@dispatch/wire` + `@dispatch/transport-contract` (reference snapshots
> regenerated in `dispatch-web/.dispatch/{wire,transport-contract}.reference.md`).

## 1. What you can now access
The backend's **authoritative** token + timing metrics are now on the live turn stream:

| Metric | Where | Field(s) |
|---|---|---|
| Per-step tokens | `usage` event | `usage` (`inputTokens`/`outputTokens`/`cacheReadTokens?`/`cacheWriteTokens?`) + new `stepId?` |
| Per-step **TTFT** | new `step-complete` event | `ttftMs?` |
| Per-step **decode** time | new `step-complete` event | `decodeMs?` |
| Per-step total generation | new `step-complete` event | `genTotalMs?` |
| **Tool execution** time | `tool-result` event | `durationMs?` |
| **Turn** wall-clock | `done` event | `durationMs?` |
| **Turn** total tokens | `done` event | `usage?` |
| **Tokens/sec** (TPS) | derive | `usage.outputTokens / (step-complete.decodeMs / 1000)` |
| Context-size proxy | `usage` event | `usage.inputTokens` (prompt size as counted by the model; `cacheReadTokens` is the cached portion of it) |

"Authoritative" means measured by the backend runtime, not by client wall-clock. The values will
differ from anything you time in the browser because they exclude network transit and client-side
buffering.

## 2. How they're delivered
**Inline, in the same chat stream you already consume** — WS `chat.delta` frames (and the
`POST /chat` NDJSON stream) carry the `AgentEvent` union; metrics are additional event types /
fields in that union. **No new endpoint, no subscription/negotiation.** You already `switch` on
`event.type`; route the metric events to a telemetry handler and ignore any you don't render
(zero cost). They do **not** appear in message content — keep your transcript rendering as-is.

These events are **low-frequency** (one `step-complete` per step, one `done` per turn, a
`durationMs` per tool result) — not per-token — so there's no stream-volume concern.
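
A minimal sketch of that routing, under stated assumptions: `routeFor`, `routeAgentEvent` and `EventHandlers` are hypothetical names, not part of the contracts, and treating `tool-result` and `done` as relevant to both the transcript and telemetry is a guess about your existing handler.

```typescript
// Where an event should go. "both" covers events that carry transcript-relevant
// data AND the new metric fields.
type Route = "telemetry" | "transcript" | "both";

// Classify by event type only. Unknown types fall through to the transcript
// handler, i.e. whatever you already do today (assumption).
function routeFor(eventType: string): Route {
  switch (eventType) {
    case "usage":
    case "step-complete":
      return "telemetry"; // metric-only events, never part of message content
    case "tool-result":
    case "done":
      return "both"; // existing events that gained optional metric fields
    default:
      return "transcript";
  }
}

// Hypothetical handler pair; plug in your real transcript and telemetry stores.
interface EventHandlers<E> {
  onTelemetry(event: E): void;
  onTranscript(event: E): void;
}

// Fan a single decoded AgentEvent out to the right handler(s).
function routeAgentEvent<E extends { type: string }>(
  event: E,
  handlers: EventHandlers<E>,
): void {
  const route = routeFor(event.type);
  if (route !== "transcript") handlers.onTelemetry(event);
  if (route !== "telemetry") handlers.onTranscript(event);
}
```

If you don't render a given metric, drop its case from `routeFor` and the event is ignored by the telemetry side.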

## 3. The new/changed events (shapes)
All new fields are **optional** — see §5. Every event still carries `conversationId` + `turnId`.

```ts
// NEW variant in AgentEvent — emitted once per step, AT STEP END (timing is final here)
interface TurnStepCompleteEvent {
  type: "step-complete";
  conversationId: string;
  turnId: string;
  stepId: StepId;        // join key to the step's `usage` event + tool events
  ttftMs?: number;       // time to first token (stream start → first text|reasoning delta)
  decodeMs?: number;     // first token → stream end  (== genTotalMs - ttftMs)
  genTotalMs?: number;   // whole-step generation (present even if no first token was seen)
}

// usage event — now labeled by step
interface TurnUsageEvent {
  type: "usage";
  conversationId: string; turnId: string;
  stepId?: StepId;       // NEW — attribute tokens to a step / join to step-complete
  usage: Usage;          // { inputTokens, outputTokens, cacheReadTokens?, cacheWriteTokens? }
}

// tool-result — now carries execution time
interface TurnToolResultEvent {
  type: "tool-result";
  conversationId: string; turnId: string;
  stepId: StepId; toolCallId: string; toolName: string;
  content: string; isError: boolean;
  durationMs?: number;   // NEW — tool execution time (dispatch → result)
}

// done — now carries turn totals
interface TurnDoneEvent {
  type: "done";
  conversationId: string; turnId: string;
  reason: string;
  durationMs?: number;   // NEW — whole-turn wall-clock
  usage?: Usage;         // NEW — aggregate turn tokens (so you needn't sum the usage events)
}
```

## 4. Correlation & derived metrics
Keys: `turnId` groups a turn; `stepId` groups a step within it; `toolCallId` pairs a tool call
with its result. A turn has **one `step-complete` (and usually one `usage`) per step**.

- **Per-step TPS** = `usage.outputTokens / (step-complete.decodeMs / 1000)` — join `usage` and
  `step-complete` by `stepId`. (Use `decodeMs`, not `genTotalMs`, for decode-rate TPS; it excludes
  first-token latency. See "which TPS" caveat below.)
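- **Turn TPS** = `done.usage.outputTokens / (Σ step-complete.decodeMs / 1000)`. A step with no
  `decodeMs` (§5) can still contribute to `done.usage.outputTokens`, so this figure can read high
  on tool-heavy turns.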
- **Generation total per step** = `genTotalMs` (or `ttftMs + decodeMs`).
- **Turn-visible first-token latency** = the `ttftMs` of **step 0** (the first `step-complete`).
- **Total prefill overhead** = `Σ ttftMs` across steps; **pure generation** = `Σ decodeMs`.
- **Tool time** = `tool-result.durationMs` per call; sum per `stepId` for a batch.

"Which TPS": `decodeMs` runs from first token to stream end, so TPS over it is the pure decode rate
with first-token latency removed. For an end-to-end rate that includes the wait, divide by
`genTotalMs` (= `ttftMs + decodeMs`) instead.

## 5. Optionality — you MUST tolerate absence
- `step-complete` is always emitted per step, but its **timing fields are present only when the
  server runs with a clock** (it does in normal operation). `ttftMs`/`decodeMs` are additionally
  absent for a step that produced **no text/reasoning token** (e.g. a tool-call-only step) —
  `genTotalMs` is still present in that case.
- `usage.stepId`, `tool-result.durationMs`, `done.durationMs`, `done.usage` are all optional.
- Render gracefully when a value is missing (omit the figure; don't show `NaN`/`undefined`).
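
One way to keep `NaN`/`undefined` out of the UI is to have every formatter return `null` for a missing or unusable value and let the caller skip it. A minimal sketch; `formatDuration`, `formatTps` and `stepBadge` are hypothetical helpers, and the units and rounding are arbitrary choices.

```typescript
// Format a millisecond duration, or null if the value is absent or unusable.
function formatDuration(ms?: number): string | null {
  if (ms === undefined || !Number.isFinite(ms) || ms < 0) return null;
  return ms < 1000 ? `${Math.round(ms)} ms` : `${(ms / 1000).toFixed(2)} s`;
}

// Format decode-rate TPS, or null if either input is absent or the duration is not positive.
function formatTps(outputTokens?: number, decodeMs?: number): string | null {
  if (outputTokens === undefined || decodeMs === undefined) return null;
  if (!Number.isFinite(outputTokens) || !Number.isFinite(decodeMs) || decodeMs <= 0) return null;
  return `${(outputTokens / (decodeMs / 1000)).toFixed(1)} tok/s`;
}

// Build a per-step badge from whatever is available; missing figures are simply omitted.
function stepBadge(step: {
  outputTokens?: number;
  ttftMs?: number;
  decodeMs?: number;
  genTotalMs?: number;
}): string {
  const parts: string[] = [];
  const ttft = formatDuration(step.ttftMs);
  if (ttft !== null) parts.push(`TTFT ${ttft}`);
  const gen = formatDuration(step.genTotalMs);
  if (gen !== null) parts.push(`gen ${gen}`);
  const tps = formatTps(step.outputTokens, step.decodeMs);
  if (tps !== null) parts.push(tps);
  return parts.join(" | ");
}
```

A tool-call-only step (only `genTotalMs` present) then renders as just `gen 500 ms`, and a step with no timing at all renders as an empty string.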

## 6. What is NOT available yet (deferred — Pass 2)
**Metrics are LIVE-ONLY.** They are **not persisted**, so:
- `GET /conversations/:id` (history) returns messages/chunks but **no tokens/timing**. Reopening a
  past conversation will show content without metrics.
- If you need historical metrics (e.g. show TPS on a reloaded conversation), that's the planned
  **Pass 2** (persist per-turn metrics + a read path) — see `tasks.md` "Pass 2 — DEFERRED". Tell
  us if you need it and we'll prioritize.
- TPS is not sent pre-computed (derive it, §4). No per-token timing (metrics are per-step/per-turn).

## 7. Integration checklist
1. Refresh deps: `bun run typecheck` in dispatch-web (picks up the pinned `@dispatch/wire` / `@dispatch/transport-contract` versions).
2. Extend your `chat.delta` event handler: add a `case "step-complete"` and read the new optional
   fields on `usage`/`tool-result`/`done`. (No exhaustive-switch break — these are additive.)
3. Keep a per-turn (and per-step, keyed by `stepId`) telemetry accumulator alongside the transcript
   store; fold metric events into it; render where you want (e.g. a turn footer / per-step badges).
4. Treat every metric field as optional (§5).
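
Step 3 could look roughly like the accumulator below. It is a sketch, not the contract: `TurnTelemetryStore` and its record types are hypothetical, `StepId` is stood in by `string`, and the `MetricEvent` union is trimmed to the fields the store actually reads (import the real types from the pinned contracts instead). If a step ever emits more than one `usage` event, the last one wins here.

```typescript
type StepId = string; // assumption: stand-in for the wire package's StepId

interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

// Trimmed-down view of the four metric-bearing events.
type MetricEvent =
  | { type: "usage"; turnId: string; stepId?: StepId; usage: Usage }
  | { type: "step-complete"; turnId: string; stepId: StepId; ttftMs?: number; decodeMs?: number; genTotalMs?: number }
  | { type: "tool-result"; turnId: string; stepId: StepId; toolCallId: string; durationMs?: number }
  | { type: "done"; turnId: string; durationMs?: number; usage?: Usage };

interface StepTelemetry {
  usage?: Usage;
  ttftMs?: number;
  decodeMs?: number;
  genTotalMs?: number;
  toolMs: Map<string, number>; // toolCallId -> durationMs
}

interface TurnTelemetry {
  steps: Map<StepId, StepTelemetry>;
  unattributedUsage: Usage[]; // usage events that arrived without a stepId
  durationMs?: number;
  usage?: Usage;
  done: boolean;
}

class TurnTelemetryStore {
  private readonly turns = new Map<string, TurnTelemetry>();

  get(turnId: string): TurnTelemetry | undefined {
    return this.turns.get(turnId);
  }

  // Fold one metric event into the per-turn / per-step records.
  fold(e: MetricEvent): void {
    const turn = this.turnFor(e.turnId);
    switch (e.type) {
      case "usage":
        if (e.stepId === undefined) turn.unattributedUsage.push(e.usage);
        else this.stepFor(turn, e.stepId).usage = e.usage;
        break;
      case "step-complete": {
        const step = this.stepFor(turn, e.stepId);
        if (e.ttftMs !== undefined) step.ttftMs = e.ttftMs;
        if (e.decodeMs !== undefined) step.decodeMs = e.decodeMs;
        if (e.genTotalMs !== undefined) step.genTotalMs = e.genTotalMs;
        break;
      }
      case "tool-result":
        if (e.durationMs !== undefined) {
          this.stepFor(turn, e.stepId).toolMs.set(e.toolCallId, e.durationMs);
        }
        break;
      case "done":
        turn.done = true;
        if (e.durationMs !== undefined) turn.durationMs = e.durationMs;
        if (e.usage !== undefined) turn.usage = e.usage;
        break;
    }
  }

  private turnFor(turnId: string): TurnTelemetry {
    let turn = this.turns.get(turnId);
    if (turn === undefined) {
      turn = { steps: new Map(), unattributedUsage: [], done: false };
      this.turns.set(turnId, turn);
    }
    return turn;
  }

  private stepFor(turn: TurnTelemetry, stepId: StepId): StepTelemetry {
    let step = turn.steps.get(stepId);
    if (step === undefined) {
      step = { toolMs: new Map() };
      turn.steps.set(stepId, step);
    }
    return step;
  }
}
```

Keeping this store separate from the transcript store means a missing or late metric event never affects message rendering.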

## 8. Carrier facts (unchanged)
HTTP 24203 (`POST /chat` NDJSON, `GET /conversations/:id`, `GET /models`), WS 24205 (one socket,
`chat.delta` carries each `AgentEvent`), CORS `*`. Same events on both carriers.
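
If you consume the `POST /chat` carrier, network chunks will not align with line boundaries, so events have to be reassembled before they reach your `switch`. A minimal sketch, assuming one JSON-encoded `AgentEvent` per line; `NdjsonDecoder` is a hypothetical helper, not something the contracts ship.

```typescript
// Incremental NDJSON splitter: feed decoded text chunks, get back the values
// of every line completed so far. A partial trailing line is buffered.
class NdjsonDecoder {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is either "" (chunk ended on a newline) or a partial line.
    this.buffer = lines.pop() ?? "";
    const out: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed !== "") out.push(JSON.parse(trimmed));
    }
    return out;
  }

  // Call once the stream ends, in case the final line had no trailing newline.
  flush(): unknown[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest === "" ? [] : [JSON.parse(rest)];
  }
}
```

Feed it the output of a streaming `TextDecoder` over the response body and pass each returned value to the same event handler you use for WS `chat.delta` frames, since both carriers deliver the same events.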