diff options
| author | Adam Malczewski <[email protected]> | 2026-06-25 18:09:12 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-06-25 18:09:12 +0900 |
| commit | 4c42ec9c7df067e0e0e309610b61e25752d73f9f (patch) | |
| tree | b8ad7c766dd2b0756f5015e09d1d79055446d3b8 /notes | |
| parent | c1bc7bfaaca7bdf4d9b2973f5dc88605217a7866 (diff) | |
| download | dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.tar.gz dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.zip | |
feat(kernel): retry-with-backoff on retryable provider errors
When the upstream LLM API returns a retryable error (HTTP 429 / 5xx
"overloaded"), the kernel now retries provider.stream() with a stepped
backoff, visibly, until the 8h cumulative-sleep budget is exhausted — then
emits the final error and seals the turn. Retries fire only when no content
was emitted yet this step (safety invariant: never duplicate partial output).
- wire: new transient TurnProviderRetryEvent AgentEvent variant (emitted
before each sleep; not persisted to model history).
- kernel contracts: RetryStrategy (pure delayFor + injected sleep) + optional
retry? on RunTurnInput (omit = no retry, backward-compatible).
- kernel run-turn: retry loop in executeStep; providerRetryEvent constructor.
Kernel imports no timer (sleep injected).
- session-orchestrator: concrete schedule (5s..30m, repeat 30m, 8h budget) +
abortable setTimeout sleep, wired into RunTurnInput.retry.
tsc -b EXIT 0; biome clean; 1574 vitest pass (+16 new: 11 kernel retry tests
with injected fake sleep + pure delayFor, zero @dispatch/* mocks; 5 schedule
tests). Transports unchanged (transport-ws forwards AgentEvent verbatim in
chat.delta; transport-http is generic JSON.stringify).
Plan: notes/retry-with-backoff-plan.md. tasks.md updated with milestone +
optional CLI-renderer roadmap follow-up.
Diffstat (limited to 'notes')
| -rw-r--r-- | notes/retry-with-backoff-plan.md | 138 |
1 files changed, 138 insertions, 0 deletions
diff --git a/notes/retry-with-backoff-plan.md b/notes/retry-with-backoff-plan.md new file mode 100644 index 0000000..99f6f5d --- /dev/null +++ b/notes/retry-with-backoff-plan.md @@ -0,0 +1,138 @@ +# Plan — Retry-with-backoff on retryable provider errors (FINALIZED) + +**Goal:** When the upstream LLM API returns a retryable error (e.g. "server +overloaded"), retry the request with a stepped backoff, visibly, until the +budget is exhausted. + +## The error (from the prod DB) — detection is already done + +- **HTTP 429** (46×) and **HTTP 502** (1×), **no 503s**. +- Body: `{"error":{"type":"overloaded_error","message":"The service is temporarily overloaded. Please retry."}}` +- `packages/openai-stream/src/stream.ts:201` **already sets** + `retryable: response.status >= 500 || response.status === 429` on the error + event, and `ProviderErrorEvent` (`kernel/contracts/provider.ts:72`) **already + declares `retryable?: boolean`**. The kernel's `processEvent` just ignores it. +- The error is **emitted (not thrown) and before any content** → retrying + `provider.stream()` is safe (no partial chunks to roll back). + +## Decision 1 — the backoff schedule + +`5s, 10s, 30s, 60s, 5m, 10m, 15m, 30m`, then **repeat 30m** until **8h of +cumulative retry-wait** is reached, then give up (emit the final error + seal). + +Pure function of the attempt index (0 = first retry): +```ts +const SCHEDULE_MS = [5_000, 10_000, 30_000, 60_000, 300_000, 600_000, 900_000, 1_800_000]; +const TAIL_MS = 1_800_000; // 30m +const BUDGET_MS = 8 * 60 * 60 * 1000; // 8h + +// pure, deterministic, no I/O +function delayFor(attempt: number): number | undefined { + const delay = attempt < SCHEDULE_MS.length ? SCHEDULE_MS[attempt] : TAIL_MS; + if (cumulativeSleepMs(attempt) > BUDGET_MS) return undefined; // over budget → stop + return delay; +} +``` +- `cumulativeSleepMs(attempt)` = sum of delay[0..attempt]; head (8 steps) sums to + 3,705s, then +1,800s per extra step. 8h (28,800s) is reached at attempt ~21 + → ~21 retries, ~7h32m of sleeping, then give up. +- Budget = cumulative *scheduled sleep* (pure/testable). If you prefer wall-clock + since first error, it switches to using the injected `now` — easy change. + +## Decision 2 — visible (yellow system-message warning) + 5d3f handoff + +Add a new **transient** `AgentEvent` variant (emitted to the frontend, NOT +persisted into the model's message history — so it never pollutes the prompt): + +```ts +// @dispatch/wire (AgentEvent union gains this member) +export interface TurnProviderRetryEvent { + readonly type: "provider-retry"; + readonly conversationId: string; + readonly turnId: string; + /** 0-based: this is the Nth retry about to happen. */ + readonly attempt: number; + /** ms the client should expect to wait before the retry fires. */ + readonly delayMs: number; + /** The endpoint's error verbatim (e.g. "HTTP 429: {…overloaded_error…}"). */ + readonly message: string; + /** The HTTP code when known (e.g. "429"). */ + readonly code?: string; +} +``` +- Emitted once per scheduled retry, BEFORE the sleep, so the UI shows + "⚠ Server overloaded — retrying in 5s…" immediately. +- When retries are exhausted (8h), the existing `error` event is emitted (as + today) and the turn seals — so the final failure is still a persisted error. + +**Frontend handoff to 5d3f:** render `provider-retry` as a yellow warning +system-message bubble showing `message` (+ `code`), with the countdown. (I do +the backend; 5d3f does the renderer — handoff via dispatch CLI.) + +## Decision 3 — retry ANY retryable error + +Retry trigger (both paths), **only when no content has been emitted yet** +(the safety invariant — never duplicate partial output): + +- **Emitted** `error` ProviderEvent with `retryable === true` → retry. (429/502/5xx + network fetch errors — all pre-content.) +- **Thrown** error (mid-stream, caught in `executeStep`'s `catch`) → treated as **retryable-by-default when pre-content** (most mid-stream throws are transient network/SSE issues). A thrown error after content is emitted is NOT retried (can't safely). + +So "if it's retryable, retry it" = the `retryable` flag drives emitted errors; +thrown errors default to retryable when nothing was streamed yet. Non-retryable +emitted errors (`retryable: false`/absent) end the step as today. + +## Architecture — kernel provides the HOOK, shell provides POLICY + I/O + +(Constitution: kernel touches no I/O; effects injected; decision pure.) + +### Kernel contract (`kernel/src/contracts/runtime.ts`) — add to `RunTurnInput`: +```ts +export interface RetryStrategy { + /** Pure: attempt → delay ms, or undefined to stop (budget exhausted). */ + readonly delayFor: (attempt: number) => number | undefined; + /** Injected effect: actually sleep. Kernel imports no timer. Abortable. */ + readonly sleep: (ms: number, signal: AbortSignal) => Promise<void>; +} +export interface RunTurnInput { + // …existing… + /** Optional injected retry. Omit = no retry (backward-compatible). */ + readonly retry?: RetryStrategy; +} +``` + +### Kernel loop (`kernel/src/runtime/run-turn.ts`, `executeStep`): +Wrap stream consumption in a retry loop: +- track `hadContent` (any text/reasoning/tool-call/usage seen); +- on a retryable error (emitted `retryable:true` OR thrown) with `!hadContent`: + - `delay = retry.delayFor(attempt)`; if `undefined` → give up (emit the + suppressed error, end step); + - else emit `providerRetryEvent(attempt, delay, message, code)`, `await + retry.sleep(delay, signal)`, `attempt++`, re-call `provider.stream()`; +- on abort during sleep → reject, seal turn `aborted` (existing flow). + +### Shell wiring (`session-orchestrator/src/orchestrator.ts`): +- Provide the concrete `RetryStrategy`: `delayFor` = the schedule + 8h budget + above; `sleep` = abortable `setTimeout`-based promise. +- Pass `retry` into the `RunTurnInput` it builds (line 589). + +## Build breakdown by unit (execution) + +| Unit (owner) | Change | +|---|---| +| `@dispatch/wire` | add `TurnProviderRetryEvent` to `AgentEvent` union | +| `kernel` contracts | add `RetryStrategy` + `retry?` on `RunTurnInput` | +| `kernel` events.ts | `providerRetryEvent(...)` constructor | +| `kernel` run-turn.ts | retry loop in `executeStep` (the core logic) | +| `kernel` run-turn.test.ts | pure tests: fake `sleep` + pure `delayFor`; assert schedule, no-after-content retry, give-up emits error, abort-during-sleep | +| `session-orchestrator` | wire concrete schedule + real `setTimeout` sleep | +| `transport-ws` | if it has an exhaustive `switch(event.type)`, add the `provider-retry` case | +| `transport-http` (mine) | **no change** — `serializeEventLine` is generic `JSON.stringify` | +| frontend (5d3f) | render `provider-retry` as a yellow warning system message | + +## Open items +- **8h budget = cumulative scheduled sleep** (pure). Confirm OK vs wall-clock. +- **Thrown errors default retryable-when-pre-content.** Confirm (vs only the + flagged emitted path). +- **Execution mode:** this spans kernel + wire + orchestrator (outside my + transport-http unit). Build it directly across units, or dispatch each slice + to its unit owner via the dispatch CLI? |
