feat(kernel): retry-with-backoff on retryable provider errors

When the upstream LLM API returns a retryable error (HTTP 429 / 5xx "overloaded"), the kernel now retries provider.stream() with a stepped backoff, visibly, until the 8h cumulative-sleep budget is exhausted — then emits the final error and seals the turn. Retries fire only when no content was emitted yet this step (safety invariant: never duplicate partial output). - wire: new transient TurnProviderRetryEvent AgentEvent variant (emitted before each sleep; not persisted to model history). - kernel contracts: RetryStrategy (pure delayFor + injected sleep) + optional retry? on RunTurnInput (omit = no retry, backward-compatible). - kernel run-turn: retry loop in executeStep; providerRetryEvent constructor. Kernel imports no timer (sleep injected). - session-orchestrator: concrete schedule (5s..30m, repeat 30m, 8h budget) + abortable setTimeout sleep, wired into RunTurnInput.retry. tsc -b EXIT 0; biome clean; 1574 vitest pass (+16 new: 11 kernel retry tests with injected fake sleep + pure delayFor, zero @dispatch/* mocks; 5 schedule tests). Transports unchanged (transport-ws forwards AgentEvent verbatim in chat.delta; transport-http is generic JSON.stringify). Plan: notes/retry-with-backoff-plan.md. tasks.md updated with milestone + optional CLI-renderer roadmap follow-up.
author: Adam Malczewski <[email protected]> 2026-06-25 18:09:12 +0900
committer: Adam Malczewski <[email protected]> 2026-06-25 18:09:12 +0900
commit: 4c42ec9c7df067e0e0e309610b61e25752d73f9f (patch)
tree: b8ad7c766dd2b0756f5015e09d1d79055446d3b8 /notes
parent: c1bc7bfaaca7bdf4d9b2973f5dc88605217a7866 (diff)
download: dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.tar.gz
dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.zip
1 files changed, 138 insertions, 0 deletions
diff --git a/notes/retry-with-backoff-plan.md b/notes/retry-with-backoff-plan.md
new file mode 100644
index 0000000..99f6f5d
--- /dev/null
+++ b/notes/retry-with-backoff-plan.md
@@ -0,0 +1,138 @@
+# Plan — Retry-with-backoff on retryable provider errors (FINALIZED)
+
+**Goal:** When the upstream LLM API returns a retryable error (e.g. "server
+overloaded"), retry the request with a stepped backoff, visibly, until the
+budget is exhausted.
+
+## The error (from the prod DB) — detection is already done
+
+- **HTTP 429** (46×) and **HTTP 502** (1×), **no 503s**.
+- Body: `{"error":{"type":"overloaded_error","message":"The service is temporarily overloaded. Please retry."}}`
+- `packages/openai-stream/src/stream.ts:201` **already sets**
+  `retryable: response.status >= 500 || response.status === 429` on the error
+  event, and `ProviderErrorEvent` (`kernel/contracts/provider.ts:72`) **already
+  declares `retryable?: boolean`**. The kernel's `processEvent` just ignores it.
+- The error is **emitted (not thrown) and before any content** → retrying
+  `provider.stream()` is safe (no partial chunks to roll back).
+
+## Decision 1 — the backoff schedule
+
+`5s, 10s, 30s, 60s, 5m, 10m, 15m, 30m`, then **repeat 30m** until **8h of
+cumulative retry-wait** is reached, then give up (emit the final error + seal).
+
+Pure function of the attempt index (0 = first retry):
+```ts
+const SCHEDULE_MS = [5_000, 10_000, 30_000, 60_000, 300_000, 600_000, 900_000, 1_800_000];
+const TAIL_MS = 1_800_000;            // 30m
+const BUDGET_MS = 8 * 60 * 60 * 1000;  // 8h
+
+// pure, deterministic, no I/O
+function delayFor(attempt: number): number | undefined {
+  const delay = attempt < SCHEDULE_MS.length ? SCHEDULE_MS[attempt] : TAIL_MS;
+  if (cumulativeSleepMs(attempt) > BUDGET_MS) return undefined; // over budget → stop
+  return delay;
+}
+```
+- `cumulativeSleepMs(attempt)` = sum of delay[0..attempt]; head (8 steps) sums to
+  3,705s, then +1,800s per extra step. 8h (28,800s) is reached at attempt ~21
+  → ~21 retries, ~7h32m of sleeping, then give up.
+- Budget = cumulative *scheduled sleep* (pure/testable). If you prefer wall-clock
+  since first error, it switches to using the injected `now` — easy change.
+
+## Decision 2 — visible (yellow system-message warning) + 5d3f handoff
+
+Add a new **transient** `AgentEvent` variant (emitted to the frontend, NOT
+persisted into the model's message history — so it never pollutes the prompt):
+
+```ts
+// @dispatch/wire (AgentEvent union gains this member)
+export interface TurnProviderRetryEvent {
+  readonly type: "provider-retry";
+  readonly conversationId: string;
+  readonly turnId: string;
+  /** 0-based: this is the Nth retry about to happen. */
+  readonly attempt: number;
+  /** ms the client should expect to wait before the retry fires. */
+  readonly delayMs: number;
+  /** The endpoint's error verbatim (e.g. "HTTP 429: {…overloaded_error…}"). */
+  readonly message: string;
+  /** The HTTP code when known (e.g. "429"). */
+  readonly code?: string;
+}
+```
+- Emitted once per scheduled retry, BEFORE the sleep, so the UI shows
+  "⚠ Server overloaded — retrying in 5s…" immediately.
+- When retries are exhausted (8h), the existing `error` event is emitted (as
+  today) and the turn seals — so the final failure is still a persisted error.
+
+**Frontend handoff to 5d3f:** render `provider-retry` as a yellow warning
+system-message bubble showing `message` (+ `code`), with the countdown. (I do
+the backend; 5d3f does the renderer — handoff via dispatch CLI.)
+
+## Decision 3 — retry ANY retryable error
+
+Retry trigger (both paths), **only when no content has been emitted yet**
+(the safety invariant — never duplicate partial output):
+
+- **Emitted** `error` ProviderEvent with `retryable === true` → retry. (429/502/5xx + network fetch errors — all pre-content.)
+- **Thrown** error (mid-stream, caught in `executeStep`'s `catch`) → treated as **retryable-by-default when pre-content** (most mid-stream throws are transient network/SSE issues). A thrown error after content is emitted is NOT retried (can't safely).
+
+So "if it's retryable, retry it" = the `retryable` flag drives emitted errors;
+thrown errors default to retryable when nothing was streamed yet. Non-retryable
+emitted errors (`retryable: false`/absent) end the step as today.
+
+## Architecture — kernel provides the HOOK, shell provides POLICY + I/O
+
+(Constitution: kernel touches no I/O; effects injected; decision pure.)
+
+### Kernel contract (`kernel/src/contracts/runtime.ts`) — add to `RunTurnInput`:
+```ts
+export interface RetryStrategy {
+  /** Pure: attempt → delay ms, or undefined to stop (budget exhausted). */
+  readonly delayFor: (attempt: number) => number | undefined;
+  /** Injected effect: actually sleep. Kernel imports no timer. Abortable. */
+  readonly sleep: (ms: number, signal: AbortSignal) => Promise<void>;
+}
+export interface RunTurnInput {
+  // …existing…
+  /** Optional injected retry. Omit = no retry (backward-compatible). */
+  readonly retry?: RetryStrategy;
+}
+```
+
+### Kernel loop (`kernel/src/runtime/run-turn.ts`, `executeStep`):
+Wrap stream consumption in a retry loop:
+- track `hadContent` (any text/reasoning/tool-call/usage seen);
+- on a retryable error (emitted `retryable:true` OR thrown) with `!hadContent`:
+  - `delay = retry.delayFor(attempt)`; if `undefined` → give up (emit the
+    suppressed error, end step);
+  - else emit `providerRetryEvent(attempt, delay, message, code)`, `await
+    retry.sleep(delay, signal)`, `attempt++`, re-call `provider.stream()`;
+- on abort during sleep → reject, seal turn `aborted` (existing flow).
+
+### Shell wiring (`session-orchestrator/src/orchestrator.ts`):
+- Provide the concrete `RetryStrategy`: `delayFor` = the schedule + 8h budget
+  above; `sleep` = abortable `setTimeout`-based promise.
+- Pass `retry` into the `RunTurnInput` it builds (line 589).
+
+## Build breakdown by unit (execution)
+
+| Unit (owner) | Change |
+|---|---|
+| `@dispatch/wire` | add `TurnProviderRetryEvent` to `AgentEvent` union |
+| `kernel` contracts | add `RetryStrategy` + `retry?` on `RunTurnInput` |
+| `kernel` events.ts | `providerRetryEvent(...)` constructor |
+| `kernel` run-turn.ts | retry loop in `executeStep` (the core logic) |
+| `kernel` run-turn.test.ts | pure tests: fake `sleep` + pure `delayFor`; assert schedule, no-after-content retry, give-up emits error, abort-during-sleep |
+| `session-orchestrator` | wire concrete schedule + real `setTimeout` sleep |
+| `transport-ws` | if it has an exhaustive `switch(event.type)`, add the `provider-retry` case |
+| `transport-http` (mine) | **no change** — `serializeEventLine` is generic `JSON.stringify` |
+| frontend (5d3f) | render `provider-retry` as a yellow warning system message |
+
+## Open items
+- **8h budget = cumulative scheduled sleep** (pure). Confirm OK vs wall-clock.
+- **Thrown errors default retryable-when-pre-content.** Confirm (vs only the
+  flagged emitted path).
+- **Execution mode:** this spans kernel + wire + orchestrator (outside my
+  transport-http unit). Build it directly across units, or dispatch each slice
+  to its unit owner via the dispatch CLI?
author	Adam Malczewski <[email protected]>	2026-06-25 18:09:12 +0900
committer	Adam Malczewski <[email protected]>	2026-06-25 18:09:12 +0900
commit	4c42ec9c7df067e0e0e309610b61e25752d73f9f (patch)
tree	b8ad7c766dd2b0756f5015e09d1d79055446d3b8 /notes
parent	c1bc7bfaaca7bdf4d9b2973f5dc88605217a7866 (diff)
download	dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.tar.gz dispatch-4c42ec9c7df067e0e0e309610b61e25752d73f9f.zip