summaryrefslogtreecommitdiffhomepage
path: root/notes
diff options
context:
space:
mode:
Diffstat (limited to 'notes')
-rw-r--r--notes/ntfy-notifications-handoff.md311
-rw-r--r--notes/wake-schedule-handoff.md442
2 files changed, 753 insertions, 0 deletions
diff --git a/notes/ntfy-notifications-handoff.md b/notes/ntfy-notifications-handoff.md
new file mode 100644
index 0000000..fde84c8
--- /dev/null
+++ b/notes/ntfy-notifications-handoff.md
@@ -0,0 +1,311 @@
+# Handoff — n2/ntfy-notifications
+
+## Summary
+
+Adds **ntfy.sh push notifications** to Dispatch: a configurable per-event
+notification dispatcher that POSTs to a user-supplied ntfy topic URL when
+notable events happen in the running agent process.
+
+The architecture is intentionally layered so a future transport (email,
+Slack webhook, custom backend) plugs in without touching call sites:
+
+```
+AgentManager.onEvent ─┐ ┌─→ sendNtfy (fetch)
+ ├─→ NotificationDispatcher.notify(event)
+PermissionMgr ────────┘ (filter / dedupe) └─→ (other transports later)
+.onPromptAdded
+```
+
+### Event taxonomy
+
+The user toggles each one independently in Settings:
+
+| event | trigger | default | priority | tags |
+|-----------------------|--------------------------------------------------------------------------|---------|----------|------------------|
+| `turn-completed` | assistant `done` event (one per cleanly-finished turn) | on | 3 | white_check_mark |
+| `turn-error` | assistant `error` event (final, after all fallback retries) | on | 4 | rotating_light |
+| `permission-required` | `PermissionManager` newly admits a prompt to its pending list | on | 4 | lock |
+| `agent-spawned` | `tab-created` for a **top-level user agent** (parent=null, slug present) | off | 2 | sparkles |
+
+Each notification carries a short tab tag (`tab-<first8>`) so multi-tab users
+can tell which conversation pinged them.
+
+**Subagent gating**: `turn-completed` and `turn-error` from subagent tabs
+(any tab with a `parentTabId`) are suppressed by default — a parent
+agent that spawns 8 subagents would otherwise push 9 "Turn complete"
+notifications per round. Toggle on "Include subagent tabs" in Settings
+to opt in. `permission-required` is deliberately NOT gated: a
+subagent's permission prompt still needs a human tap to proceed, so
+suppressing it would silently hang the subagent. `agent-spawned` is
+already top-level-only by construction.
+
+### Design notes
+
+- **Non-blocking**: `dispatcher.notify` does `void Promise.resolve(send(...)).catch(warn)`.
+ A slow or unreachable ntfy server never stalls a turn. Worst case is a
+ 10s per-request abort timeout in the transport.
+- **Dedupe**: 5 s in-memory window keyed by `dedupeKey`. Used for
+ `permission-required` because the permission system rebroadcasts the
+ whole pending list on every change (we'd otherwise re-fire on every
+ unrelated mutation).
+- **Master switch + per-event toggle**: both must allow before a send.
+ Disabled config is a fast no-op (no fetch, no `loadConfig` work past
+ the early return).
+- **Single global config**: matches the rest of the codebase's settings
+ table (`perm_*`, `title_model_*` are also global). Stored as one JSON
+ blob under `settings.key = 'ntfy_config'`.
+- **Auth token round-trip**: `GET /notifications` redacts the token but
+ surfaces `hasAuthToken: boolean`. `PUT /notifications` semantics:
+ `authToken === undefined` keeps the stored value, `""` clears it, any
+ other string replaces it. The frontend's "Clear stored token" button
+ uses the explicit-`""` path **and awaits the server response** before
+ flipping local UI state (post-review fix — see Review section).
+- **Auth header scheme**: tokens that already start with a scheme
+ (`Bearer foo`, `Basic dXNlcjpwYXNz`) pass through verbatim; bare tokens
+ get a `Bearer ` prefix automatically. Lets users of private ntfy
+ servers use any HTTP auth scheme without code changes.
+- **Topic-URL validation**: tightened to ntfy's actual constraints —
+ exactly one path segment, 1–64 chars, `[A-Za-z0-9_-]` only. Catches
+ topics that would silently 404 at publish time.
+- **Header injection guard**: CR/LF and control chars are stripped from
+ every header value the transport writes (`Title`, `Tags`, `Click`,
+ `Authorization`).
+- **Permission "added" detection**: `PermissionManager.broadcastPending`
+ now diffs the current pending-id set against an `announcedPromptIds`
+ set and fires `onPromptAdded` only for genuinely new ids. Resolved ids
+ are pruned. This keeps the contract "one notification per prompt".
+
+## Files changed / added
+
+```
+packages/core/src/notifications/types.ts (new)
+packages/core/src/notifications/ntfy.ts (new)
+packages/core/src/notifications/config.ts (new)
+packages/core/src/notifications/dispatcher.ts (new)
+packages/core/src/notifications/index.ts (new)
+packages/core/src/index.ts (barrel re-export)
+packages/core/tests/notifications/ntfy.test.ts (new, 22 tests)
+packages/core/tests/notifications/config.test.ts (new, 10 tests)
+packages/core/tests/notifications/dispatcher.test.ts (new, 13 tests)
+
+packages/api/src/permission-manager.ts (+ onPromptAdded contract)
+packages/api/src/routes/notifications.ts (new — GET/PUT/POST routes)
+packages/api/src/app.ts (wire dispatcher + mount routes)
+packages/api/tests/permission-manager.test.ts (new, 4 tests)
+packages/api/tests/routes.test.ts (add mocks for new core exports)
+
+packages/frontend/src/lib/components/SettingsPanel.svelte (new ntfy section)
+```
+
+Seven commits on `n2/ntfy-notifications`:
+
+```
+<new> docs: update HANDOFF.md with notifySubagents
+9c93086 feat(notifications): add notifySubagents toggle to suppress subagent turn pings
+4185789 docs: update HANDOFF.md with Gemini review triage + post-fix state
+1870d0b fix(notifications): address Gemini review — tighten validation, sanitize Click, support Basic auth, non-optimistic UI clear
+29bdd00 docs: add HANDOFF.md for ntfy notifications feature
+786bc43 feat(frontend): ntfy.sh settings block in SettingsPanel
+21cdb11 feat(api): wire notification dispatcher into app + /notifications routes
+5e72191 feat(core): ntfy.sh notification dispatcher module
+```
+
+## Public surface added
+
+### New config (persisted in `settings` table)
+
+- `settings.key = "ntfy_config"` → JSON-serialized `NtfyConfig`:
+ ```ts
+ {
+ enabled: boolean,
+ topicUrl: string,
+ authToken: string,
+ events: {
+ "turn-completed": boolean,
+ "turn-error": boolean,
+ "permission-required": boolean,
+ "agent-spawned": boolean,
+ },
+ notifySubagents: boolean, // default false; gates turn-* from subagent tabs
+ }
+ ```
+
+### New API routes
+
+- `GET /notifications` →
+ `{ config: NtfyConfig & { hasAuthToken: boolean }, eventTypes: string[], defaults: NtfyConfig }`
+ (authToken is always returned as `""`; `hasAuthToken` reflects what's stored)
+- `PUT /notifications` → accepts partial `NtfyConfig`. Validates topic URL
+ when `enabled === true`. Returns the saved (redacted) config or `400`.
+- `POST /notifications/test` → sends a `turn-completed`-typed test
+ notification using the saved config. Returns `{ ok, status?, error? }`,
+ or `400` if disabled / invalid topic / event-type disabled, or `502` on
+ ntfy server failure.
+
+### New core exports (via `@dispatch/core` barrel)
+
+Types: `NotificationEvent`, `NotificationEventType`, `NtfyConfig`,
+`NtfyPriority`, `NtfySendResult`, `FetchLike`, `DispatcherOptions`,
+`AgentEventSource`, `PermissionPromptSource`, `TabTitleLookup`.
+
+Values: `NotificationDispatcher`, `sendNtfy`, `validateTopicUrl`,
+`loadNtfyConfig`, `saveNtfyConfig`, `clearNtfyConfig`,
+`normalizeNtfyConfig`, `defaultNtfyConfig`, `redactNtfyConfig`,
+`NTFY_EVENT_TYPES`, `NTFY_DEFAULT_EVENTS`, `NTFY_DEFAULT_PRIORITIES`,
+`NTFY_DEFAULT_TAGS`, `NTFY_CONFIG_KEY`.
+
+### New API surface on `PermissionManager`
+
+- `onPromptAdded(listener) => unsubscribe` — fires exactly once per
+ genuinely-new pending prompt id (with `{ id, permission, description, metadata }`).
+
+### New exported singleton in `packages/api/src/app.ts`
+
+- `notificationDispatcher: NotificationDispatcher` — already wired to
+ the module-level `agentManager` and `permissionManager`. Exposed so
+ tests / future callers can `dispose()` or `notify(...)` directly.
+
+### Frontend
+
+No new exported props — the change is entirely inside `SettingsPanel.svelte`
+and uses its existing `{ keys, apiBase }` props.
+
+## Verification status
+
+### `bun run check`
+
+```
+$ biome check .
+Checked 150 files in 158ms. No fixes applied.
+```
+
+✅ Pass (0 errors, 0 warnings).
+
+### `bun run test`
+
+```
+Test Files 28 passed (28)
+ Tests 451 passed (451)
+ Duration 2.85s
+```
+
+✅ Pass. Baseline was 393 tests in 24 files; this branch adds 58 tests
+across 4 new files (`notifications/ntfy.test.ts` ×22,
+`notifications/config.test.ts` ×13, `notifications/dispatcher.test.ts` ×19,
+`permission-manager.test.ts` ×4) and modifies 0 existing tests.
+
+### Per-package strict typecheck
+
+```
+@dispatch/core tsc --noEmit — 0 errors
+@dispatch/api tsc --noEmit — 0 errors
+@dispatch/frontend svelte-check — 0 errors, 0 warnings
+```
+
+### Manual smoke test
+
+Verified end-to-end against the real `ntfy.sh` server twice (initial
+build + post-review):
+
+```
+$ bun -e 'import { sendNtfy } from "./packages/core/src/notifications/ntfy.js"; ...'
+validate: null
+{"ok":true,"status":200}
+```
+
+Topics were throwaway and only used for these smoke tests. Full UI flow
+(Settings → topic URL → Save → Send test → push lands in ntfy app)
+was not executed because that requires a live `bun run dev:api` plus
+`dev:frontend` plus a phone with the ntfy app — but the same code path
+that the "Send test" button exercises is what the smoke test above hit,
+and the route logic on top of it is covered by unit tests.
+
+## Second-opinion review (Gemini)
+
+After the initial implementation I ran a broad, open-ended code review
+via `gemini-3-flash-preview` in YOLO mode (read-only, instructed to find
+bugs/flaws/edge-cases). Findings were triaged as follows:
+
+| # | Severity | Finding | Action |
+|---|----------|---------|--------|
+| 1 | Medium | "Duplicate error+done notifications on LLM fallback retries" | **Rejected.** Verified against `agent-manager.ts:1525–1532` and `:1611`: inner per-attempt errors set `attemptError` and `break` out of the stream loop; only the final terminal error is `this.emit({type:"error"})`-ed. The dispatcher cannot see intermediate retry errors. |
+| 2 | Medium | "Loose topic URL validation" | **Fixed.** `validateTopicUrl` now enforces one segment + `[A-Za-z0-9_-]{1,64}`. |
+| 3 | Low | "`Click` header not sanitized" | **Fixed.** Passed through `sanitizeHeader`. |
+| 4 | Low | "Use JSON publishing for UTF-8 safety in headers" | **Deferred.** Per ntfy docs UTF-8 in `Title` is supported and works in practice; the only realistic risk is non-ASCII tab titles being mangled by an exotic intermediate proxy. Switching to JSON-body publishing would re-architect the transport (and invalidate all header-shape tests) for a hypothetical edge case. Worth a follow-up if anyone reports mangled titles. |
+| 5 | Low | "Optimistic UI clear of auth token" | **Fixed.** `clearNtfyAuthToken` now awaits the response and only mutates local state on success; failures surface via the existing error banner. Added a `ntfyClearingToken` loading flag so the button disables + spins during the request. |
+| 6 | Nit | "DB read on every notify()" | **Skipped.** SQLite reads are sub-millisecond, events are human-scale infrequent (one per turn at most, dominated by an LLM round-trip taking seconds). Cache adds invalidation complexity for no measurable win. |
+| OQ | — | "Support Basic auth, not just Bearer" | **Fixed.** Tokens with a scheme prefix already (`Bearer xyz`, `Basic dXNlcjpwYXNz`) now pass through verbatim; bare tokens still get the `Bearer ` prefix. |
+
+The other open questions: click-URL deep link is still a known gap
+below; subagent noise was follow-up-fixed in commit 9c93086 (the
+`notifySubagents` flag, off by default).
+
+## Assumptions / known gaps
+
+Decisions made without product input (the spec said "ask if ambiguous";
+each of these felt unambiguous in the context of Dispatch's current
+single-user, single-process design):
+
+1. **Single global config, not per-user.** The existing `settings` table
+ is global (e.g. `title_model_*`, `perm_*` — all single-tenant). When
+ Dispatch grows real multi-tenancy this'll need a `user_id` column and
+ a load-by-user helper, but that's a much bigger refactor than this
+ feature.
+
+2. **Auth token persisted in plain text.** Same as the existing
+ `credentials` / `api_keys` tables in this DB; SQLite at-rest
+ encryption isn't a thing in this codebase. Token never leaves the
+ DB on the read path (`GET /notifications` redacts).
+
+3. **No rate-limiting or burst grouping** beyond the 5 s permission
+ dedupe. Notification-worthy events are human-scale infrequent (one
+ per turn, one per permission prompt). If someone hammers `summon` and
+ ships 50 user agents in 10 seconds, they'll get 50 pushes — that
+ matches "agent-spawned is off by default" being the right call.
+
+4. **No click-URL deep-link to the originating tab.** The frontend
+ doesn't currently route tabs by URL (`router.svelte.ts` just toggles
+ between `dashboard` and `agent-builder`), so I left `clickUrl`
+ plumbing in the transport layer (now sanitized as of the Gemini-fix
+ commit) but didn't synthesize one in the dispatcher. A future "open
+ this tab" router change would make this a 4-line addition in
+ `buildTurnCompleted` / etc.
+
+5. **Event taxonomy is intentionally small.** I considered `model-changed`
+ and `queue-overflow`/`auto-wake-budget-exhausted` notices but they
+ felt like "annoying push" rather than "useful push"; easy to add
+ later by extending `NotificationEventType`, `NTFY_DEFAULT_EVENTS`,
+ and adding a builder + dispatch hook.
+
+6. **No topic validation client-side.** The Settings field accepts any
+ non-empty string as the topic name and the transport posts to
+ `https://ntfy.sh/<topic>` (URL-encoded). Earlier revisions enforced
+ ntfy.sh's documented `^[A-Za-z0-9_-]{1,64}$` rule, but the project
+ relaxes those rules over time (issue #1451) and a regex here just
+ locks users out of valid configurations. The server is the final
+ authority; any rejection surfaces through the "Send test" button or
+ the first real notification.
+
+7. **Header-based publishing (vs. JSON-body publishing).** Per the
+ Gemini-review triage above, the transport sends `Title`/`Tags`/`Click`
+ as HTTP headers. UTF-8 titles work against ntfy.sh itself; non-ASCII
+ tab titles through an exotic intermediate proxy could theoretically
+ be mangled. Switching to ntfy's JSON publish mode (`Content-Type:
+ application/json`, body `{topic, message, title, ...}`) would side-
+ step that entirely — leaving as a follow-up if anyone hits it.
+
+Working tree is clean; seven commits on `n2/ntfy-notifications`; nothing
+merged.
+
+## Update — topic-only input (post-merge of this branch's seventh commit)
+
+The `topicUrl` field was replaced with `topic`. The user now enters just
+the ntfy topic name (e.g. `my-secret-topic`); the transport always posts
+to `https://ntfy.sh/<topic>`. `validateTopicUrl` is gone — only an empty
+check remains (server-side, and only when `enabled === true`). This
+eliminates the "string does not match the expected pattern" error users
+hit when entering a bare topic. Tests, the `/notifications` PUT route,
+the persisted JSON shape, and the SettingsPanel UI were updated together.
+Also fixed a small pre-existing bug: the `/notifications` PUT handler now
+honours `notifySubagents` on save (previously it was silently dropped
+because the field wasn't passed to `normalizeNtfyConfig`).
diff --git a/notes/wake-schedule-handoff.md b/notes/wake-schedule-handoff.md
new file mode 100644
index 0000000..3ea711a
--- /dev/null
+++ b/notes/wake-schedule-handoff.md
@@ -0,0 +1,442 @@
+# Claude Reset / Wake Schedule — Fix Handoff
+
+**Branch:** `r1/claude-reset-fix` (off `dev`)
+**Worktree:** `/home/tradam/projects/dispatch/r1-claude-reset-fix`
+**Commits:** 4 atomic commits (see `git log r1/claude-reset-fix ^dev --oneline`).
+
+---
+
+## Summary
+
+The "Claude Wake Schedule" panel (`ClaudeReset.svelte` + `/models/wake-schedule*`
+routes + the in-memory backend scheduler) had several real bugs that would
+silently lose wakes, drift over time, or behave erratically in the UI. It
+also only probed once per marked hour, which is unreliable at rate-window
+edges.
+
+This branch fixes the bugs *and* upgrades probing to 4× per hour
+(`:00 / :15 / :30 / :45`), with same-tick coalescing so the upstream still
+sees a single call. Schema changed destructively (per direction); migration
+code drops the old `wake_schedule` table.
+
+### What was broken
+
+#### Backend (`packages/api/src/routes/models.ts`)
+1. **Missed wakes silently lost.** `loadScheduleFromDB` saw any past
+ `next_wake_at`, rewrote it to the next occurrence using server-local TZ,
+ and never fired the missed wake. So if the API was down when a wake was
+ due (overnight container restart), the user lost it entirely.
+2. **Server-TZ drift.** `nextOccurrenceAt15(hour)` used `new Date().setHours()`
+ — *server* local time. The client sends absolute Unix ms (user's local
+ wall-clock intent). On a UTC Docker host running for a PST user, each
+ reschedule re-anchored to the wrong TZ and slowly migrated the fire time.
+3. **Retry storm.** Every failed wake pushed a new entry into a
+ `pendingRetries[]` array, all converging at the same `+5min` instant.
+4. **Retry/fire race within a tick.** A freshly fired wake AND a due retry
+ could both hit `wakeAllClaudeAccounts()` back-to-back.
+5. **No status surface.** Nothing told the user whether scheduled wakes
+ actually succeeded.
+6. **Only one probe per hour.** A single fire at `:15` can land 14 min off
+ the actual rate-window reset moment.
+
+#### Frontend (`packages/frontend/src/lib/components/ClaudeReset.svelte`)
+7. **`fadedHours` returned a function, not a Set.** The `$derived` had shape
+ `(): Set<number> => {...}` — `blockClass` then called `fadedHours()`
+ once per of the 24 buttons, rebuilding the Set 24× per render.
+8. **`currentHour` was frozen.** `const currentHour = $derived(new Date().getHours())`
+ — `new Date()` is not a reactive read; the value never updated. After
+ midnight (or any hour boundary) the "now" highlight stayed on the wrong
+ block until reload.
+9. **Out-of-order toggles.** Rapid double-clicks fired multiple requests;
+ the *last response* won, not the *last click* — so a slow add followed
+ by a fast remove could land in the wrong order.
+10. **No success/failure feedback.** No surface for whether the most-recent
+ wake actually worked.
+
+### What I changed
+
+| # | Bug / Feature | Fix |
+|---|---|---|
+| 1 | Missed wake silently lost | New `recoverScheduleEntry()` helper: if missed by ≤ 2h fire on next tick; either way roll forward by 24h-multiple steps. |
+| 2 | Server-TZ drift | Removed server-local `nextOccurrenceAt15`; rescheduling now uses `nextDailyAfter(previous, now)` — adds 24h × N from the *client-supplied* original ms. |
+| 3 | Retry storm | Replaced `pendingRetries: []` with a single shared `pendingRetry: PendingRetry \| null` whose budget resets on subsequent failures. |
+| 4 | Retry/fire race | Retry processing skipped on any tick where a fresh wake fired. |
+| 5 | No status surface | `GET /wake-schedule` now returns `{ schedule, resetOffsetHours, probeSlotMinutes, lastWake, pendingRetry }`. |
+| 6 | One probe/hour | A marked hour expands to 4 slots (`:00 :15 :30 :45`), each its own row. Multiple due slots in the same tick coalesce into one upstream wake. |
+| 7 | `fadedHours` was a fn | Now `$derived.by(() => Set)`; passed as a value to `blockClass`. Window length is `resetOffsetHours - 1` (no longer hardcoded 4). |
+| 8 | Frozen `currentHour` | Backed by `nowMs = $state(Date.now())`, bumped every 30s via `setInterval`, cleaned up in `onDestroy`. |
+| 9 | Out-of-order toggles | Per-hour sequence counter (`inFlightSeq`) + `pendingHours: Set<number>` that disables in-flight buttons; stale responses dropped. |
+| 10 | No feedback | New status row: "✓ Last wake N min ago" or "✗ Last wake N min ago — <error>"; pending retry row shows retries-left + next-attempt countdown. |
+
+Also extracted `CLAUDE_RESET_OFFSET_HOURS = 5` and `PROBE_SLOT_MINUTES = [0,15,30,45]`
+to a single source of truth in `packages/api/src/wake-scheduler.ts`; the
+frontend learns both from the server snapshot.
+
+---
+
+## Files changed
+
+- **New:** `packages/api/src/wake-scheduler.ts` — pure helpers
+ (`nextDailyAfter`, `recoverScheduleEntry`, `resetHourFor`,
+ `isProbeSlotMinute`, `CLAUDE_RESET_OFFSET_HOURS`,
+ `MISSED_WAKE_GRACE_MS`, `DAILY_INTERVAL_MS`, `PROBE_SLOT_MINUTES`,
+ `ProbeSlotMinute`).
+- **New:** `packages/api/tests/wake-scheduler.test.ts` — 12 unit tests for
+ the pure helpers (grace boundaries, multi-day skip, custom grace,
+ midnight wraparound).
+- **Modified:** `packages/api/src/routes/models.ts` — full rewrite of the
+ wake-scheduler section (~280 LoC). Routes preserved
+ (`POST /models/wake`, `POST /models/wake-schedule/toggle`,
+ `GET /models/wake-schedule`) but request/response payloads expanded.
+- **Modified:** `packages/api/tests/routes.test.ts` — +12 HTTP tests for
+ the wake-schedule routes (was +9 in the prior commit; rewritten for
+ the 4-slot payload).
+- **Modified:** `packages/core/src/db/index.ts` — `wake_schedule` schema
+ changed; destructive migration drops the old table if the
+ `slot_minute` column is missing. Nothing else touched.
+- **Modified:** `packages/frontend/src/lib/components/ClaudeReset.svelte`
+ — full rewrite of the script section; markup updated for the marked-hour
+ summary + the status footer.
+
+---
+
+## Public surface changes
+
+### Database schema
+
+```sql
+-- BEFORE
+CREATE TABLE wake_schedule (
+ hour INTEGER PRIMARY KEY CHECK (hour BETWEEN 0 AND 23),
+ next_wake_at INTEGER NOT NULL
+)
+
+-- AFTER
+CREATE TABLE wake_schedule (
+ hour INTEGER NOT NULL CHECK (hour BETWEEN 0 AND 23),
+ slot_minute INTEGER NOT NULL CHECK (slot_minute IN (0, 15, 30, 45)),
+ next_wake_at INTEGER NOT NULL,
+ PRIMARY KEY (hour, slot_minute)
+)
+```
+
+Migration on boot: if `PRAGMA table_info(wake_schedule)` lacks a
+`slot_minute` column, `DROP TABLE IF EXISTS wake_schedule` then `CREATE`
+the new shape. **No other table is touched** (credentials, api_keys,
+usage_cache, tabs, chunks, settings preserved).
+
+### API: `GET /models/wake-schedule`
+
+```json
+{
+ "schedule": {
+ "9": { "0": 1700001500000, "15": 1700002400000, "30": 1700003300000, "45": 1700004200000 }
+ },
+ "resetOffsetHours": 5,
+ "probeSlotMinutes": [0, 15, 30, 45],
+ "lastWake": {
+ "firedAt": 1700000000000,
+ "ok": true,
+ "results": [{ "label": "personal", "ok": true }]
+ } | null,
+ "pendingRetry": {
+ "retriesLeft": 5,
+ "nextRetryAt": 1700000300000,
+ "reason": "scheduled probe(s) 9:15"
+ } | null
+}
+```
+
+### API: `POST /models/wake-schedule/toggle`
+
+**Add** (when hour is not yet marked):
+
+```json
+{
+ "hour": 9,
+ "timestamps": { "0": 1700001500000, "15": 1700002400000, "30": 1700003300000, "45": 1700004200000 }
+}
+```
+
+All four slot keys are required; each value must be a future Unix ms.
+Returns the same expanded snapshot as `GET`.
+
+**Remove** (when hour *is* marked):
+
+```json
+{ "hour": 9 }
+```
+
+Same shape as before. Deletes all 4 slots for that hour atomically.
+
+Validation: hour must be an integer 0–23. Non-integer / out-of-range
+hours → 400. Missing or non-object `timestamps` on add → 400. Missing,
+non-finite, or past timestamp in any slot → 400.
+
+### Component props
+
+`ClaudeReset.svelte` props unchanged: still `{ apiBase?: string }`.
+
+### New exported helpers
+
+`packages/api/src/wake-scheduler.ts` exports `nextDailyAfter`,
+`recoverScheduleEntry`, `resetHourFor`, `isProbeSlotMinute`,
+`CLAUDE_RESET_OFFSET_HOURS`, `DAILY_INTERVAL_MS`,
+`MISSED_WAKE_GRACE_MS`, `PROBE_SLOT_MINUTES`, `ProbeSlotMinute`,
+`RecoveredEntry`. Not re-exported from `@dispatch/core` — they live in
+`@dispatch/api` and aren't intended cross-package surface.
+
+---
+
+## End-to-end traces
+
+### Happy path: mark 9 AM
+1. User clicks the "9" AM block.
+2. Frontend computes 4 timestamps in **user's local TZ** for the next
+ occurrence of `9:00`, `9:15`, `9:30`, `9:45`.
+3. `POST /models/wake-schedule/toggle { hour: 9, timestamps: { "0":…, "15":…, "30":…, "45":… } }`.
+4. Backend writes 4 rows to `wake_schedule`, returns snapshot. Button
+ turns primary, +4 trailing blocks fade.
+5. Tick loop runs every 30s. At `9:00` the `0`-minute slot becomes due;
+ the tick advances its `next_wake_at` to tomorrow `9:00`, persists,
+ and fires *one* coalesced wake. Same dance at 9:15, 9:30, 9:45 —
+ each is a separate upstream call (different 15-min windows).
+6. If two slots happen to come due in the *same* 30s tick (e.g. the
+ scheduler was paused), they coalesce into ONE upstream wake.
+
+### Recovery: API was down when 9:15 fired
+1. Server boots at 11:00. Reads 4 rows for hour 9. The `9:00`, `9:15`,
+ `9:30`, `9:45` slots all have `next_wake_at` ≤ now, all overdue by
+ ≤ 2h → all `shouldFireNow: true`.
+2. Each slot's `next_wake_at` is advanced to tomorrow's equivalent
+ wall-clock via `nextDailyAfter`. The boot-fire flag is set.
+3. First tick runs immediately, sees `needsBootFire`, fires ONE coalesced
+ wake. `lastWake` shows ✓ on the panel.
+
+### Recovery: API was down for two days
+1. Server boots Wed at 14:00. Slots for Mon 9:00/15/30/45 are overdue by
+ > 48h → `shouldFireNow: false`.
+2. Each `nextWakeAt` jumps forward by `nextDailyAfter` (ceil-div, single
+ step — not a 48-iteration loop) to Thu 9:00/15/30/45.
+3. Schedule preserved; no spurious wake; entry resumes normally.
+
+### Rapid double-click
+1. User clicks "9" → request A (add, seq=1) in flight; button disabled.
+2. User clicks "9" again → request B (remove, seq=2) in flight.
+3. Response B arrives → `inFlightSeq[9] === 2` → applied.
+4. Response A arrives later → `inFlightSeq[9] !== 1` → dropped.
+
+---
+
+## Verification
+
+### `bun run check`
+```
+$ biome check .
+Checked 142 files in 155ms. No fixes applied.
+```
+
+### `bun run test`
+```
+Test Files 25 passed (25)
+ Tests 417 passed (417)
+ Start at 09:52:16
+ Duration 2.80s
+```
+
+(Was 393 tests at branch base; +24 net = 12 helper unit tests + 12 HTTP
+route tests for the 4-slot wake schedule.)
+
+### TypeScript strict checks
+- `bun --bun tsc -p packages/api/tsconfig.json --noEmit` → exit 0
+- `bun --bun tsc -p packages/core/tsconfig.json --noEmit` → exit 0
+- `bun run --cwd packages/frontend typecheck` → svelte-check 0 errors, 0 warnings
+
+---
+
+## Assumptions / known gaps
+
+1. **TZ behavior:** absolute `timestamp` from the toggle request is the
+ source of truth for *first* fire. On reschedule the slot advances by
+ exactly 24h × N from its previous `next_wake_at`. Preserves the user's
+ local wall-clock intent regardless of server TZ. **DST transition days
+ can drift the fire by ±1h**; self-corrects when the user next toggles
+ the hour. A more thorough fix would store hour + IANA TZ and recompute
+ each cycle — punted; requires UI for TZ selection.
+
+2. **Missed-wake grace = 2h.** Picked because Claude's typical session
+ window is ~5h. Tunable in `wake-scheduler.ts:MISSED_WAKE_GRACE_MS`;
+ `recoverScheduleEntry` accepts a custom value (exercised in tests).
+
+3. **Same-tick coalescing.** Hitting `:00 :15 :30 :45` produces 4 wakes
+ per hour at steady state. Two slots due in the *same* 30s tick
+ coalesce into one upstream call — there's no value in 2 simultaneous
+ probes. The advancement-then-fire ordering means a slow upstream
+ call can't cause re-firing on the next tick.
+
+4. **"Reset" semantics:** I interpreted the "Reset by HH:00" label as a
+ display hint (wake + 5h ≈ when Claude's session window resets), not
+ a separate event. The scheduler only fires *wakes*. The `+5h`
+ constant lives in `CLAUDE_RESET_OFFSET_HOURS` if it ever needs
+ changing.
+
+5. **Recurring daily.** Matches prior behavior; no UI for one-shot
+ wakes.
+
+6. **`nowMs` ticker = 30s on the frontend.** Current-hour ring updates
+ within at most 30s of the hour boundary. Status "X min ago" labels
+ refresh at the same cadence.
+
+7. **Retry budget = 6 × 5min = 30min.** Unchanged from before, just
+ consolidated to a single shared slot.
+
+8. **Snapshot polling:** frontend refreshes the snapshot on mount and
+ after toggles. `lastWake` / `pendingRetry` rows are therefore stale
+ between user actions; the displayed *relative* timestamps DO refresh
+ live (driven by the same `nowMs` ticker). Adding a 30s poll would be
+ a one-liner if desired — left off to avoid quiet background traffic
+ for a panel that's typically only opened intentionally.
+
+9. **Destructive migration.** Per direction: no back-compat. Any
+ existing rows in `wake_schedule` from before this branch are dropped
+ on first boot. Users will need to re-mark their hours. No other
+ tables are touched.
+
+10. **No backend test for `loadScheduleFromDB` recovery branch.** The
+ module-level scheduler state is initialized at import time; covering
+ the boot-path recovery from a Vitest module requires either DI for
+ the DB or spinning up real SQLite. I covered the pure logic via
+ `recoverScheduleEntry` unit tests and the route surface end-to-end
+ via HTTP tests. A follow-up could refactor `loadScheduleFromDB` to
+ take a `db` parameter and write a fixture-backed integration test.
+
+---
+
+## Review followup (Gemini review pass — `notes/claude-reset-review.md`)
+
+A Gemini code-review pass after the initial 4-slot work surfaced 3 real
+bugs (2 High, 1 Medium) and 2 nits. All are now fixed on this branch.
+
+| # | Sev | Where | Symptom | Fix |
+|---|---|---|---|---|
+| 1 | High | `models.ts` POST toggle | `raw <= now` rejected legitimate toggles whenever client clock skew or request latency made an imminent slot land in the past → 400 → UI toggle silently fails | Dropped the `<= now` check; kept `Number.isFinite`. The scheduler's `recoverScheduleEntry` already fires within `MISSED_WAKE_GRACE_MS` and rolls forward. |
+| 2 | High | `ClaudeReset.svelte` | Per-hour `inFlightSeq` couldn't stop an older snapshot from clobbering a newer one when the two requests covered *different* hours (or initial-load racing a click). `applySnapshot` replaces the whole `schedule` → newest click vanishes. | Replaced per-hour counter with a single global `SnapshotSequencer` (`src/lib/snapshot-sequencer.ts`) used by `loadFromServer` AND `postToggle`. Older `accept(seq)` calls return false and are dropped. |
+| 3 | Med | `models.ts` `persistSchedule` | `DELETE` + N `INSERT`s with no transaction; an insert failure left the table empty (DELETE already committed) → schedule silently wiped on next boot, error swallowed. | Wrapped both in `db.transaction(...)`. On failure the DELETE rolls back and the previously persisted snapshot stays intact. |
+| 4 | Nit | `ClaudeReset.svelte` | `inFlightSeq` was effectively dead code for user clicks (the `pendingHours.has(hour)` early-return blocks them) but still mattered for initial-load vs first-click. | Subsumed by the new global `SnapshotSequencer` (cleaner than two parallel mechanisms). |
+| 5 | Nit | `models.ts` `schedulerTick` | Boot-recovery `reason` was masked whenever boot recovery + due slots coincided in the same tick. | Capture `bootFireRequested` before clearing the flag and append `" (boot recovery)"` to the reason. |
+
+### Files added/changed in the followup
+
+- **New:** `packages/frontend/src/lib/snapshot-sequencer.ts` — 47 LoC, the
+ reusable "most-recent request wins" race guard. Pure class, no Svelte
+ deps; usable from any component that fans out snapshot-style HTTP calls.
+- **New:** `packages/frontend/tests/snapshot-sequencer.test.ts` — 8 unit
+ tests covering the core race, the initial-load-vs-click race, monotonic
+ ordering, equal-seq idempotency, and the watermark inspector.
+- **New:** `notes/claude-reset-review.md` — the original review (kept for
+ audit trail).
+- **Modified:** `packages/api/src/routes/models.ts` — fixes #1, #3, #5
+ (~30 LoC delta).
+- **Modified:** `packages/frontend/src/lib/components/ClaudeReset.svelte`
+ — fix #2 / nit #4 (~20 LoC delta).
+- **Modified:** `packages/api/tests/routes.test.ts` — replaced the
+ "POST toggle rejects past timestamp" test with two new tests:
+ - "POST toggle ACCEPTS a slightly-past timestamp (clock skew / latency)"
+ regression-guards finding #1.
+ - "POST toggle rejects NaN / Infinity / non-number slot values" guards
+ that we still reject *malformed* inputs.
+ - Plus "snapshot remains consistent across toggle round-trips" guards
+ finding #3 (the transactional persist path).
+
+### Verification (after followup)
+
+```
+$ bun run check
+Checked 144 files in 167ms. No fixes applied.
+
+$ bun run test
+Test Files 26 passed (26)
+ Tests 427 passed (427)
+
+$ bun run --cwd packages/frontend typecheck
+svelte-check found 0 errors and 0 warnings
+```
+
+### Still deferred (not addressed in followup)
+
+These were noted in the review as design pushback rather than bugs and
+remain as documented in §"Assumptions / known gaps" above:
+
+- **Snapshot polling.** UI may show stale "Retrying…" forever if a retry
+ eventually succeeds in the background without user interaction. Adding
+ a slow 60s poll is still a one-liner; left off to avoid quiet background
+ traffic. (Documented in gap #8.)
+- **DST drift.** Adding `24h` to an absolute Unix ts ignores DST
+ transitions; documented in gap #1.
+
+---
+
+## Review followup — Round 2 (Gemini review pass — `notes/claude-reset-review-2.md`)
+
+After the round-1 fixes shipped, a second Gemini review pass surfaced one
+**Critical** and one **High** finding which together exposed a real desync
+hazard: the round-1 SnapshotSequencer only protected against RESPONSE
+reordering, but the toggle endpoint was vulnerable to REQUEST reordering,
+and the toggle endpoint itself ignored client intent — combining into
+"clicks feel inverted" when the UI got desynced.
+
+Both are now fixed on this branch.
+
+| # | Sev | Where | Symptom | Fix |
+|---|---|---|---|---|
+| R2-1 | Critical | `ClaudeReset.svelte` + `snapshot-sequencer.ts` | If two concurrent toggle POSTs reorder on the WIRE (B reaches server first), the server's truer post-A snapshot carries the OLDER client seq → SnapshotSequencer discards it as stale → UI permanently desyncs from server. Round-1's per-hour-counter fix was replaced with a global sequencer but BOTH had this blind spot. | Replaced per-hour `pendingHours: Set<number>` with a single global `pendingHour: number \| null` mutation lock. While any POST is in flight, ALL toggle buttons are disabled — mutations are now serialized on the client, so the server never sees two concurrent toggle requests. Sequencer retained for the GET-on-mount vs first-click race (which the global lock doesn't cover). |
+| R2-2 | High | `models.ts` POST `/wake-schedule/toggle` | Server decided add-vs-remove from its own in-memory state instead of an explicit request field. Any UI desync (from R2-1 or any future cause) → user clicks to turn on an hour the UI shows off, server sees it on, deletes it → "click inverted" UX, recoverable only by reload. | Toggle endpoint now requires explicit `action: 'on' \| 'off'`. Idempotent: `'off'` on already-off is a no-op success; `'on'` on already-on REPLACES timestamps (so a recovering UI can re-assert wall-clock intent without a delete-then-add round trip). Missing/invalid action → 400. |
+| R2-3 | Low (deferred) | `models.ts` `wakeAllClaudeAccounts` / `processPendingRetry` | If 1 of N accounts fails a probe, the 6 × 5min retry loop re-probes ALL accounts (including the ones that already succeeded). Wastes bandwidth, but the probe payload is tiny (~16 tok) and the constant 30-min budget caps the blast radius. | **Not fixed** — explicit deliberate trade-off; per-account success tracking inside `PendingRetry` would meaningfully complicate the retry path for marginal savings. Noted in §"Assumptions / known gaps". |
+
+### Files changed in round 2
+
+- **Modified:** `packages/api/src/routes/models.ts` — toggle endpoint
+ rewritten to require explicit `action` (`+27 / -6` LoC, idempotency rules
+ documented inline).
+- **Modified:** `packages/api/tests/routes.test.ts` — `toggle()` helper
+ auto-derives `action` from `timestamps` presence so the existing 12 tests
+ stayed terse; one test (`POST toggle rejects missing timestamps on add`)
+ was renamed to `rejects action='on' with missing timestamps` and now
+ passes `action` explicitly. **+4 new contract tests** (29 / 29 routes
+ tests pass):
+ - `POST toggle requires explicit action: 'on' | 'off'` (rejects missing
+ action, rejects non-`'on'/'off'` strings/numbers/`null`).
+ - `POST toggle action='off' is idempotent on an already-off hour`.
+ - `POST toggle action='on' on an already-on hour REPLACES timestamps`
+ (the recovery-from-desync scenario).
+ - `POST toggle action='off' ignores timestamps payload`.
+- **Modified:** `packages/frontend/src/lib/components/ClaudeReset.svelte`
+ — `pendingHours: Set<number>` → `pendingHour: number | null`; all 4 row
+ buttons gated by the global lock; `toggleHour` derives `action` from
+ local state; `postToggle` sends it on the wire. Per-hour
+ `cursor-wait` class is preserved for the in-flight hour as a UX cue.
+- **New:** `notes/claude-reset-review-2.md` — the round-2 review (kept
+ for audit trail, parallel to `notes/claude-reset-review.md`).
+
+### Verification (after round-2 followup)
+
+```
+$ bun run check
+Checked 144 files in 161ms. No fixes applied.
+
+$ bun run test
+Test Files 26 passed (26)
+ Tests 431 passed (431)
+
+$ bun run --cwd packages/frontend typecheck
+svelte-check found 0 errors and 0 warnings
+```
+
+(`+4` tests vs round 1: the four explicit-action contract tests.)
+
+### What's NOT addressed in round 2
+
+- **DST drift** — unchanged design trade-off (gap #1).
+- **No snapshot polling** — unchanged design trade-off (gap #8).
+- **R2-3 (retry storm re-probes succeeded accounts)** — deliberate trade-
+ off, see table above.