diff options
| author | Adam Malczewski <[email protected]> | 2026-06-25 18:36:08 +0900 |
|---|---|---|
| committer | Adam Malczewski <[email protected]> | 2026-06-25 18:36:08 +0900 |
| commit | de022cee7ac66c95d7ed6a35d4e00f8e2d92cbbc (patch) | |
| tree | 041dcb1017e544a405526443cb578baa974bec0e /notes | |
| parent | fc1c3a54c3075990ec0dd0f97901bd46fe142923 (diff) | |
| parent | 649fc4f66f40f7743683546f81d3320e7394e597 (diff) | |
| download | dispatch-de022cee7ac66c95d7ed6a35d4e00f8e2d92cbbc.tar.gz dispatch-de022cee7ac66c95d7ed6a35d4e00f8e2d92cbbc.zip | |
Merge branch 'dev' into feature/ssh-support
Brings dev's retry-with-backoff (the transient `provider-retry` AgentEvent the
web frontend consumes) + the LSP-dead-server per-edit-hang fix into the SSH
feature branch, alongside the SSH waves 0-5c.
All code files auto-merged cleanly (run-turn.ts, orchestrator.ts, runtime.ts,
wire/index.ts, tool-edit-file/extension.ts, run-turn.test.ts — both computerId
threading and retry-with-backoff coexist). Only tasks.md conflicted (status
section — orchestrator-resolved; both feature sections kept).
Verified post-merge: tsc -b EXIT 0, biome clean (391 files), 1730 vitest pass
+6 sshd-integration skipped (was 1690; +40 from dev's retry/LSP tests).
Wire dist rebuilt so the FE can re-sync the pinned @dispatch/wire dep and pick
up BOTH provider-retry AND the SSH Computer/defaultComputerId types.
No merge or push (into dev or otherwise).
Diffstat (limited to 'notes')
| -rw-r--r-- | notes/conv-list-by-worktree-research.md | 247 | ||||
| -rw-r--r-- | notes/retry-with-backoff-plan.md | 138 |
2 files changed, 385 insertions, 0 deletions
diff --git a/notes/conv-list-by-worktree-research.md b/notes/conv-list-by-worktree-research.md new file mode 100644 index 0000000..636f06e --- /dev/null +++ b/notes/conv-list-by-worktree-research.md @@ -0,0 +1,247 @@ +# Research — List conversations filtered by worktree / workspace + +> **Branch:** `feature/conv-list-by-worktree` (research notes only — no code changes). +> **Date:** 2026-06-25. **Repo:** `dispatch-backend` worktree +> `worktrees/conv-list-by-worktree/backend`. +> +> **Research question:** Does the Dispatch backend currently support getting a +> list of *open* conversations filtered by a specific worktree (or workspace)? + +## TL;DR + +**Yes — fully supported, via the `workspace` concept.** `GET /conversations` +accepts a composable `?workspaceId=` query filter, and every conversation +already carries a `workspaceId` (default `"default"`). Combined with the +`?status=` filter, "open conversations in workspace X" is: + +``` +GET /conversations?workspaceId=X&status=active,idle +``` + +("open" = not `closed`; the lifecycle statuses are `active | idle | closed`, +where `closed` is the archived state.) + +**Important terminology caveat:** **"worktree" is NOT a Dispatch domain term** — +it appears nowhere in `packages/` (zero matches; it only occurs in `ORCHESTRATOR.md` +and `notes/restructure-plan.md` as part of file *paths*). The two closest canonical +concepts are: + +| Canonical term (GLOSSARY) | Meaning | "Worktree" as directory? | +|---|---|---| +| **workspace** | A named, URL-driven *logical grouping* of conversations that owns a default cwd. Every conversation belongs to exactly one. (`@dispatch/wire` `Workspace`) | No — it's a slug, not a path | +| **working directory (cwd)** | The per-conversation *filesystem directory* tools/language-servers operate within. | **Yes** — a git worktree *is* a directory | + +The existing `GET /conversations` filter supports `?workspaceId=` but **not** +`?cwd=`. So if the intent is "conversations whose working directory is a *specific +git worktree path*", that is NOT supported today (see §3b). If the intent is the +logical grouping, it's fully supported (see §1–§2). + +--- + +## 1. Is there an existing endpoint? — YES + +`GET /conversations` lists all known conversations and supports three +**composable** query filters: + +| Filter | Values | Effect | +|---|---|---| +| `?workspaceId=<slug>` | workspace slug | Restrict to conversations in that workspace | +| `?status=<csv>` | `active`,`idle`,`closed` | Restrict to those lifecycle statuses | +| `?q=<prefix>` | id prefix | Short-id resolution; id-prefix match (applied in-memory) | + +**Open conversations in a workspace** = `?status=active,idle` (excludes `closed`) +combined with `?workspaceId=`. + +There is also a `GET /workspaces` endpoint that lists workspaces (each with a +`conversationCount`), and per-workspace CRUD — see §4. + +--- + +## 2. How it works (endpoint, parameters, response shape) + +### 2a. The route — `packages/transport-http/src/app.ts:775-814` + +```ts +app.get("/conversations", async (c) => { + // ?status= comma-separated (e.g. "active,idle"). Default: all. Invalid dropped. + const statusFilter = parseStatusFilter(c.req.query("status")); + // ?workspaceId= . Missing/empty/whitespace → ignored (all workspaces). + const workspaceId = rawWorkspaceId !== undefined && rawWorkspaceId.trim().length > 0 + ? rawWorkspaceId.trim() : undefined; + const filter = (statusFilter !== undefined || workspaceId !== undefined) + ? { ...(statusFilter ? { status: statusFilter } : {}), + ...(workspaceId ? { workspaceId } : {}) } + : undefined; + const all = await opts.conversationStore.listConversations(filter); + // ?q= prefix filter applied in-memory on the result. + const conversations = q.length > 0 ? all.filter((m) => m.id.startsWith(q)) : all; + return c.json({ conversations }, 200); // 500 on store error +}); +``` + +- `?status=` parsing: `parseStatusFilter` (`packages/transport-http/src/logic.ts:24-38`). + Valid values are `"active" | "idle" | "closed"` (`VALID_STATUSES`, logic.ts:16). + Invalid values are silently dropped; if *all* values are invalid → no filter + (returns all). +- `?workspaceId=` is whitespace-trimmed; empty/whitespace-only is ignored. +- The `filter` object passed to the store is `{ status?, workspaceId? }` — either + or both optional. + +### 2b. Response shape + +`200` → `ConversationListResponse` (`packages/transport-contract/src/index.ts:710-713`, +re-exported from `@dispatch/wire`): + +```ts +interface ConversationListResponse { + readonly conversations: readonly ConversationMeta[]; +} + +interface ConversationMeta { // @dispatch/wire, wire/src/index.ts:518-536 + readonly id: string; + readonly createdAt: number; // epoch-ms, set on first write + readonly lastActivityAt: number; // epoch-ms, updated on every append + readonly title: string; // first user message (truncated 80) or PUT /title + readonly status: ConversationStatus; // "active" | "idle" | "closed" + readonly workspaceId: string; // always present; "default" fallback + readonly compactedFrom?: string; // present iff post-compaction +} +``` + +- Sorted by `lastActivityAt` **descending** (most recent first); stable sort keeps + first-seen (index) order for ties. +- `workspaceId` is always present on the response — conversations never assigned + read as `"default"` (see `toMeta`, `store.ts:329-337`). +- Errors: store failure → `500 { error: "Failed to list conversations" }`. + +### 2c. The store filter — `packages/conversation-store/src/store.ts` + +`ConversationStore.listConversations` filter type (`store.ts:91-94`): + +```ts +readonly listConversations: (filter?: { + readonly status?: readonly ConversationStatus[]; + readonly workspaceId?: string; +}) => Promise<readonly ConversationMeta[]>; +``` + +Implementation (`store.ts:695-733`): reads the conversation index, dedups +(first-seen order), reads each conversation's meta row, applies both filters, sorts +desc by `lastActivityAt`. Specifically: + +- `statusFilter` → membership test against `row.status`. +- `workspaceFilter` → equality against `row.workspaceId ?? DEFAULT_WORKSPACE_ID` + (so legacy rows with no stored workspaceId match `"default"`). + +`DEFAULT_WORKSPACE_ID` is `"default"`. + +--- + +## 3. "What would it take to add it?" + +### 3a. If "worktree" means the logical **workspace** → already done. Nothing to add. + +The capability, data model, contract types, and tests all exist. Pin +`@dispatch/transport-contract` (currently `0.22.0`) and `@dispatch/wire` +(`0.12.0`) and call `GET /conversations?workspaceId=<slug>&status=active,idle`. + +### 3b. If "worktree" literally means a **filesystem directory / git worktree path** → +NOT supported today; small, well-contained change. + +The directory concept maps to **working directory (cwd)**, which is per-conversation +(`conversation-store` `getCwd`/`setCwd`, keyed per conversation; `GET/PUT +/conversations/:id/cwd`). The list endpoint does NOT support a `?cwd=` filter, and +`ConversationMeta` does NOT carry a `cwd` field (confirmed: `wire/src/index.ts:518-536` +has no `cwd`; filter type `store.ts:91-94` has no `cwd`). + +To add a `?cwd=` filter (filter conversations by their working directory), the change +touches three layers, all additive: + +1. **Contract (`@dispatch/wire`)** — (optional) add `readonly cwd?: string | null` to + `ConversationMeta` so the caller can see each conversation's directory. Additive + type bump (e.g. `0.12.0 → 0.13.0`). Not strictly required if filtering is + server-side-only, but useful for the FE to render. +2. **Store (`conversation-store/src/store.ts`)**: + - Widen the `listConversations` filter to `{ status?, workspaceId?, cwd? }`. + - In the scan loop (`store.ts:718-728`), for each candidate call `getCwd(id)` + (or the effective cwd via `getEffectiveCwd`) and compare. **Cost note:** this is + an extra storage read per conversation (cwd is stored in its own key, not in the + meta row) — fine for typical counts; if scale matters, add a `cwd` field to + `ConversationMetaRow` (populated on `setCwd`) so the filter is a row comparison + with no extra reads. The latter is the better design and mirrors how + `workspaceId` was added to the meta row. + - Decide exact-match (path equality) vs. prefix/normalized match (e.g. a worktree + and its subdirs). Equality is simplest and probably sufficient. +3. **Transport (`transport-http/src/app.ts` + `logic.ts`)** — parse `?cwd=` (trim; + empty → ignore, mirroring `?workspaceId=`) and pass `cwd` into the store filter. + Update `ConversationListResponse` doc. + +The worktree-as-directory case is the only one that requires new code; the +workspace case requires none. + +--- + +## 4. Related workspace capabilities (for context) + +The workspace model is fully built out (courier doc `frontend-workspaces-handoff.md` +→ implemented in `@dispatch/[email protected]` / `@dispatch/transport-contract`): + +- `GET /workspaces` → `WorkspaceListResponse` (workspaces sorted by `lastActivityAt` + desc, each with `conversationCount`). The `"default"` workspace is always + synthesized/present. +- `PUT /workspaces/:id` (create-on-miss, idempotent), `GET /workspaces/:id` (pure + read, 404 if missing), `PUT /workspaces/:id/title`, `PUT /workspaces/:id/default-cwd`, + `DELETE /workspaces/:id` (closes all its conversations → `status="closed"`, + reassigns them to `"default"`, returns `closedCount`; 409 for `"default"`). +- Workspace slug regex: `^[a-z0-9](?:[a-z0-9-]{0,38}[a-z0-9])?$` (1–40, lowercase, + digits, internal hyphens). Validated by `isValidWorkspaceSlug`. +- `workspaceId` is auto-created on turn start if missing (`title = id`, + `defaultCwd = null`). Auto-assigned on `/chat` and `POST /conversations/:id/queue`. +- `DELETE /conversations/:id/cwd` clears the explicit per-conversation cwd + (falls back to the workspace `defaultCwd`, then the server default). +- cwd resolution: explicit per-conversation cwd → workspace `defaultCwd` → server + default (`process.cwd()`). `GET /conversations/:id/cwd` returns the explicit cwd + only; `GET /conversations/:id/lsp` roots at the *effective* cwd. + +`Workspace` type (`wire/src/index.ts:566-577`): `{ id, title, defaultCwd: string|null, +createdAt, lastActivityAt }`. `WorkspaceEntry extends Workspace { conversationCount }`. + +--- + +## 5. Test coverage (verified green in-repo) + +- `conversation-store/src/store-workspace.test.ts:369` — "listConversations filtered + by workspaceId" (asserts work-a returns `[a1,a2]`, work-b returns `[b1]`). +- `conversation-store/src/store.test.ts:1463` — "listConversations filters by status", + incl. `{ status: ["active","idle"] }` (= "open") returns `[conv1, conv3]` + (excludes the `closed` conv2). Also `store.test.ts:1485` — status persists across a + fresh store instance. +- `transport-http/src/app.test.ts:3696` — "GET /conversations?workspaceId= filters" + (asserts the store receives `{ workspaceId: "proj" }` and responds `200`). +- `transport-http/src/app.test.ts` (q-filter block ~3133-3156) — `?q=` prefix + + empty/whitespace handling; 500 on store throw. + +The two filters are independently testable and composable at the store level +(same `filter` object carries both); HTTP-level composition of `?status=…&workspaceId=…` +is not asserted by an explicit combined test, but the handler builds one merged +`filter` object so composition is structural. + +--- + +## 6. Verdict & recommendation + +- **For the workspace interpretation (the canonical reading of "grouping of + conversations"): the feature already exists.** Use + `GET /conversations?workspaceId=<slug>&status=active,idle`. No code changes, no + data-model changes, no contract bumps needed. +- **For the literal "git worktree as a directory" interpretation: not supported** as a + list filter; the closest concept is per-conversation **cwd**. Adding a `?cwd=` filter + is a small additive change across `@dispatch/wire` (optional `cwd` on + `ConversationMeta`), `conversation-store` (`listConversations` filter + ideally a + `cwd` column on the meta row for cheap filtering), and `transport-http` (parse + `?cwd=`). See §3b. +- **Terminology:** recommend the user confirm which concept they mean. If they mean + the logical grouping, "workspace" is already the canonical GLOSSARY term — no new + vocabulary. If they truly need directory-based grouping, that is a distinct feature + from workspaces and should be scoped as such (it overlaps with, but is not, a + workspace; a single directory could be shared by multiple workspaces or none). diff --git a/notes/retry-with-backoff-plan.md b/notes/retry-with-backoff-plan.md new file mode 100644 index 0000000..99f6f5d --- /dev/null +++ b/notes/retry-with-backoff-plan.md @@ -0,0 +1,138 @@ +# Plan — Retry-with-backoff on retryable provider errors (FINALIZED) + +**Goal:** When the upstream LLM API returns a retryable error (e.g. "server +overloaded"), retry the request with a stepped backoff, visibly, until the +budget is exhausted. + +## The error (from the prod DB) — detection is already done + +- **HTTP 429** (46×) and **HTTP 502** (1×), **no 503s**. +- Body: `{"error":{"type":"overloaded_error","message":"The service is temporarily overloaded. Please retry."}}` +- `packages/openai-stream/src/stream.ts:201` **already sets** + `retryable: response.status >= 500 || response.status === 429` on the error + event, and `ProviderErrorEvent` (`kernel/contracts/provider.ts:72`) **already + declares `retryable?: boolean`**. The kernel's `processEvent` just ignores it. +- The error is **emitted (not thrown) and before any content** → retrying + `provider.stream()` is safe (no partial chunks to roll back). + +## Decision 1 — the backoff schedule + +`5s, 10s, 30s, 60s, 5m, 10m, 15m, 30m`, then **repeat 30m** until **8h of +cumulative retry-wait** is reached, then give up (emit the final error + seal). + +Pure function of the attempt index (0 = first retry): +```ts +const SCHEDULE_MS = [5_000, 10_000, 30_000, 60_000, 300_000, 600_000, 900_000, 1_800_000]; +const TAIL_MS = 1_800_000; // 30m +const BUDGET_MS = 8 * 60 * 60 * 1000; // 8h + +// pure, deterministic, no I/O +function delayFor(attempt: number): number | undefined { + const delay = attempt < SCHEDULE_MS.length ? SCHEDULE_MS[attempt] : TAIL_MS; + if (cumulativeSleepMs(attempt) > BUDGET_MS) return undefined; // over budget → stop + return delay; +} +``` +- `cumulativeSleepMs(attempt)` = sum of delay[0..attempt]; head (8 steps) sums to + 3,705s, then +1,800s per extra step. 8h (28,800s) is reached at attempt ~21 + → ~21 retries, ~7h32m of sleeping, then give up. +- Budget = cumulative *scheduled sleep* (pure/testable). If you prefer wall-clock + since first error, it switches to using the injected `now` — easy change. + +## Decision 2 — visible (yellow system-message warning) + 5d3f handoff + +Add a new **transient** `AgentEvent` variant (emitted to the frontend, NOT +persisted into the model's message history — so it never pollutes the prompt): + +```ts +// @dispatch/wire (AgentEvent union gains this member) +export interface TurnProviderRetryEvent { + readonly type: "provider-retry"; + readonly conversationId: string; + readonly turnId: string; + /** 0-based: this is the Nth retry about to happen. */ + readonly attempt: number; + /** ms the client should expect to wait before the retry fires. */ + readonly delayMs: number; + /** The endpoint's error verbatim (e.g. "HTTP 429: {…overloaded_error…}"). */ + readonly message: string; + /** The HTTP code when known (e.g. "429"). */ + readonly code?: string; +} +``` +- Emitted once per scheduled retry, BEFORE the sleep, so the UI shows + "⚠ Server overloaded — retrying in 5s…" immediately. +- When retries are exhausted (8h), the existing `error` event is emitted (as + today) and the turn seals — so the final failure is still a persisted error. + +**Frontend handoff to 5d3f:** render `provider-retry` as a yellow warning +system-message bubble showing `message` (+ `code`), with the countdown. (I do +the backend; 5d3f does the renderer — handoff via dispatch CLI.) + +## Decision 3 — retry ANY retryable error + +Retry trigger (both paths), **only when no content has been emitted yet** +(the safety invariant — never duplicate partial output): + +- **Emitted** `error` ProviderEvent with `retryable === true` → retry. (429/502/5xx + network fetch errors — all pre-content.) +- **Thrown** error (mid-stream, caught in `executeStep`'s `catch`) → treated as **retryable-by-default when pre-content** (most mid-stream throws are transient network/SSE issues). A thrown error after content is emitted is NOT retried (can't safely). + +So "if it's retryable, retry it" = the `retryable` flag drives emitted errors; +thrown errors default to retryable when nothing was streamed yet. Non-retryable +emitted errors (`retryable: false`/absent) end the step as today. + +## Architecture — kernel provides the HOOK, shell provides POLICY + I/O + +(Constitution: kernel touches no I/O; effects injected; decision pure.) + +### Kernel contract (`kernel/src/contracts/runtime.ts`) — add to `RunTurnInput`: +```ts +export interface RetryStrategy { + /** Pure: attempt → delay ms, or undefined to stop (budget exhausted). */ + readonly delayFor: (attempt: number) => number | undefined; + /** Injected effect: actually sleep. Kernel imports no timer. Abortable. */ + readonly sleep: (ms: number, signal: AbortSignal) => Promise<void>; +} +export interface RunTurnInput { + // …existing… + /** Optional injected retry. Omit = no retry (backward-compatible). */ + readonly retry?: RetryStrategy; +} +``` + +### Kernel loop (`kernel/src/runtime/run-turn.ts`, `executeStep`): +Wrap stream consumption in a retry loop: +- track `hadContent` (any text/reasoning/tool-call/usage seen); +- on a retryable error (emitted `retryable:true` OR thrown) with `!hadContent`: + - `delay = retry.delayFor(attempt)`; if `undefined` → give up (emit the + suppressed error, end step); + - else emit `providerRetryEvent(attempt, delay, message, code)`, `await + retry.sleep(delay, signal)`, `attempt++`, re-call `provider.stream()`; +- on abort during sleep → reject, seal turn `aborted` (existing flow). + +### Shell wiring (`session-orchestrator/src/orchestrator.ts`): +- Provide the concrete `RetryStrategy`: `delayFor` = the schedule + 8h budget + above; `sleep` = abortable `setTimeout`-based promise. +- Pass `retry` into the `RunTurnInput` it builds (line 589). + +## Build breakdown by unit (execution) + +| Unit (owner) | Change | +|---|---| +| `@dispatch/wire` | add `TurnProviderRetryEvent` to `AgentEvent` union | +| `kernel` contracts | add `RetryStrategy` + `retry?` on `RunTurnInput` | +| `kernel` events.ts | `providerRetryEvent(...)` constructor | +| `kernel` run-turn.ts | retry loop in `executeStep` (the core logic) | +| `kernel` run-turn.test.ts | pure tests: fake `sleep` + pure `delayFor`; assert schedule, no-after-content retry, give-up emits error, abort-during-sleep | +| `session-orchestrator` | wire concrete schedule + real `setTimeout` sleep | +| `transport-ws` | if it has an exhaustive `switch(event.type)`, add the `provider-retry` case | +| `transport-http` (mine) | **no change** — `serializeEventLine` is generic `JSON.stringify` | +| frontend (5d3f) | render `provider-retry` as a yellow warning system message | + +## Open items +- **8h budget = cumulative scheduled sleep** (pure). Confirm OK vs wall-clock. +- **Thrown errors default retryable-when-pre-content.** Confirm (vs only the + flagged emitted path). +- **Execution mode:** this spans kernel + wire + orchestrator (outside my + transport-http unit). Build it directly across units, or dispatch each slice + to its unit owner via the dispatch CLI? |
