# Observability & Debugging — Design Scratch

> **Status:** IDEATION / scratch. NOT the plan, NOT decided-to-build. Captures a
> live design discussion so decisions don't live only in chat (per ORCHESTRATOR
> "write up before pivoting"). Promote settled parts into
> `notes/restructure-plan.md` + `GLOSSARY.md` when we commit to building.
> **Scope:** backend only. **Driver:** old Dispatch had bugs that were near-
> impossible to diagnose because an agent had *no real data to read*. This fixes
> that at the root.
>
> **Read order (fresh agent picking this up):** `ORCHESTRATOR.md` →
> `notes/restructure-plan.md` (the §-refs below point into it) → `GLOSSARY.md` →
> this file. **Mode = IDEATION with the user** (discuss/design, do NOT build yet);
> the user owns the boundary + vocabulary calls (§5.2 / §5.6).
> **Where we left off:** **Logger ABI = A, LOCKED** (evolve in place, §8). **Now
> PLANNING Phase A** (logging substrate, §10): A1 kernel Logger/Span ABI → A2
> `journal-sink` package (in-process, host-bin-injected) → A3 runtime span
> instrumentation. Phase A captures the **"BEFORE"** (verbatim pre-mutation prompt on
> the step span, via a new `LogRecord.body` blob); the **"AFTER"** (provider.request
> verbatim, D5) is the **next step** → completes full round-trip rebuild + the
> before↔after diff. **2 processes total, no more:** process 1 = main app *incl. the
> in-process journal sink* + the journal file; process 2 = the **collector** (Phase B).
> A package ≠ a process — the sink runs in-process (cheap non-blocking append); the
> file-seam + out-of-process collector is what delivers the crash-safety.
> **Redaction = each ext self-redacts IN ISOLATION** (no shared helper; P1/P7).
> **D10** = logs are one-way. **§9** extension-logging rule → `.dispatch/rules/`
> (placement decided; authored with the substrate). Crash-safety: journal file IS the
> durable queue (§6/D3). **Mode: PLANNING Phase A (not building yet).** Settled:
> D1–D10 (§2); open per-phase: §6.

---

## 0. Goal in one paragraph

First-class, **agent-first** debugging. When something goes wrong with a chat or
feature, you grab an **ID** and hand it to a **debugging agent**; it pinpoints the
*source* of the problem from a **complete, queryable trace** of what actually
happened, then hands its findings up to the orchestrator, which dispatches fixers.
The debugging agent is trained ONLY to find root cause — it does not fix.

---

## 1. Principles specific to this subsystem

- **Agent-first.** The primary trace *consumer is an LLM*, not a human dashboard.
  This single fact killed the OpenTelemetry path (see D1): its entire payoff is
  human visualization, which is useless to the agent. Every field we capture is
  justified by *"what query must the debugging agent run?"*
- **"Complete" = completely *captured & queryable*, never completely *loaded into
  context*.** Noise reduction is done by deterministic **query/filter tools**
  (plain code), so the agent always reads a small, targeted slice (a timeline
  skeleton → drill into one span; a diff; a validator verdict). **Store fat, serve
  thin.** This is the harness's progressive-disclosure idea (P5) applied to
  runtime data, and it's what protects context + usage.
- **Subordinate & fail-safe.** Observability must NEVER break or slow a turn. A
  tracing error, full disk, or downed collector → drop/sample + warn, never stall.
  (The §3.7 fault-containment principle: the turn is sovereign.)
- **The debugger must survive the crash it's debugging** (see D3). Robustness of
  the collector is paramount — it's the explicit reason it gets an exception to
  the in-process architecture.

---

## 2. Decisions settled so far (with the named problem each solves — P4)

### D1 — Agent-first; OpenTelemetry dropped
No OTLP exporter, no Jaeger/Tempo/Grafana, no OTel SDK, no "stay compatible"
constraint. Its value is human dashboards + cross-service fleet tracing — we are
one process whose reader is an agent. **Kept** (not as "OTel" but because it's the
right abstraction for *"what caused X"*): the **causal-trace model** — spans with
**parent + links + attributes**. Designed for agent queryability, not rendering.
- *Rejected by a named tradeoff, not NIH:* the OTel SDK's ambient "current span"
  (`AsyncLocalStorage`) is exactly the hidden global P3 bans; we pass correlation
  explicitly. Dropping it removes that tax + a heavy dep tree.

### D2 — Boundary: kernel owns types+mechanism; one `observability` extension owns the rest
(User's granularity call, §5.2.)
- **Kernel (ABI):** the structured-log + span **types**, the scoped logger/trace
  **handle** shape, and the **`onAny`** firehose (reserved in §5.4 / ORCHESTRATOR
  §6, **not yet built**). Kernel still touches no I/O and names no feature.
- **`observability` extension:** capture (onAny tap + AgentEvent tap + the scoped
  handles), the store, the query/filter tools, redaction, and the debugging-agent
  harness (P7).

### D3 — The collector runs OUT OF PROCESS (the sanctioned exception)
Single-process app (§3.7): a buggy extension can crash/`exit`/OOM everything. An
in-process collector buffering in memory loses its evidence **at the exact moment
you need it** (the crash). So the **sink** behind the logging handle is a separate
OS process. Three agent-first wins only this enables:
1. **Query a chat that killed the main process** — the collector stays alive, so
   the agent can debug a dead app. (Decisive.)
2. **Attribute the crash** — collector sees pipe EOF / heartbeat loss → stamps a
   synthetic terminal span "process exited unexpectedly after span X." In-process
   that final fact dies with the recorder.
3. **Isolate the query workload** from the live turn.

**Robust shape (write-ahead logging for telemetry):**
```
main process ──(cheap, non-blocking, append-only line)──▶ journal file (durable buffer)
                                                              │ tails
                                              standalone collector process
                                                normalize → SQLite trace store → query API ◀── debugging agent
```
- Hot path only **appends a line to an OS-buffered fd** — no SQLite, no blocking;
  buffer full → drop/sample (fail-safe).
- The **journal file is the queue**: if both processes die, flushed bytes are on
  disk; collector resumes tailing on restart. Worst case loses the last partial
  line (the §3.4 "last in-flight" bound).
- Collector is **deliberately dead-simple** (read→normalize→store→serve): the
  less it does, the less it can crash.
- **Exception scoped tightly:** extensions still emit through a normal
  `host.logger`/trace handle — *the contract is unchanged*; only the sink's
  *implementation* is out-of-process. This is the first real use of §3.7's
  pre-designed "move an extension to a worker without a rewrite" (serializable
  payloads). It fits the architecture rather than fighting it.
- *Cost (honest):* a second process that host-bin spawns first and drains last,
  plus a tiny line protocol. Justified by the "must be robust" priority.
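
A minimal sketch of the hot-path emit described above, under stated assumptions: `JournalWriter`, `createJournalSink`, and `droppedCount` are hypothetical names, and the writer callback is a stand-in for whatever appends to the journal fd (it is not Node's stream API). It only illustrates the fail-safe rule: serialize one line, hand it off, and drop instead of waiting when the buffer is full or anything throws.

```ts
// Hypothetical writer: returns false when the buffer is full and the line was NOT accepted.
type JournalWriter = (line: string) => boolean;

interface JournalSink {
  emit(record: Readonly<Record<string, unknown>>): void;
  droppedCount(): number;
}

function createJournalSink(write: JournalWriter): JournalSink {
  let dropped = 0;
  return {
    emit(record) {
      try {
        // One record per line; JSON.stringify escapes embedded newlines.
        const accepted = write(JSON.stringify(record) + "\n");
        if (!accepted) dropped += 1; // buffer full: drop, never wait
      } catch {
        dropped += 1; // serialization or write failure is swallowed; the turn is sovereign
      }
    },
    droppedCount: () => dropped,
  };
}
```

The drop counter is one way to surface the "warn" half of drop + warn without adding work to the hot path; how it gets reported is left open here.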

### D4 — Correlation model (reuses existing keys — P8, no invented id)
- **turn = one trace** (`turnId` ≈ trace id).
- **conversationId = an indexed attribute** grouping a thread's turn-traces.
- **span = one occurrence** of work (its own span id), with `parentSpanId`.
- **extension.id = an auto-stamped attribute** — the host hands each extension a
  handle pre-tagged with its `manifest.id`; nobody can spoof or invent an id
  scheme. (This is the clean version of "each extension broadcasts its id.")
- **Cross-feature causality = span links**, recorded **at the handoff moment**
  (enqueue→dequeue, summon→subagent-turn) — *causal edges, not co-occurrence.*
- Every span carries the correlation keys of what it acts on (e.g. a cross-chat
  op records both `source.conversationId` and `target.conversationId`).
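
To make the correlation keys concrete, here is a hedged sketch of what a stored span row and a causal walk could look like. `SpanRecord`, `SpanLink`, and `causalChain` are hypothetical names, not a settled schema; the walk simply follows `parentSpanId` and link edges backwards from a span of interest.

```ts
interface SpanLink {
  readonly spanId: string;
  readonly turnId?: string;
  readonly reason?: string;
}

interface SpanRecord {
  readonly spanId: string;
  readonly turnId: string;          // turn = one trace
  readonly conversationId: string;  // indexed attribute
  readonly extensionId: string;     // auto-stamped from manifest.id
  readonly name: string;
  readonly parentSpanId?: string;
  readonly links: readonly SpanLink[];
}

// Breadth-first walk over parent and link edges, starting at `startId`.
function causalChain(spans: readonly SpanRecord[], startId: string): SpanRecord[] {
  const byId = new Map<string, SpanRecord>();
  for (const s of spans) byId.set(s.spanId, s);
  const out: SpanRecord[] = [];
  const seen = new Set<string>();
  const queue: string[] = [startId];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    if (seen.has(id)) continue;
    seen.add(id);
    const span = byId.get(id);
    if (span === undefined) continue; // edge points outside the loaded slice
    out.push(span);
    if (span.parentSpanId !== undefined) queue.push(span.parentSpanId);
    for (const link of span.links) queue.push(link.spanId);
  }
  return out;
}
```

Starting from a `deliver` span, this would surface the `dequeue` parent and then the linked `enqueue` in another turn, which is the "causal edges, not co-occurrence" property in practice.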

### D5 — Verbatim provider I/O capture (the highest-value capture)
The cure for "the agent had no real data." Rule:
- **Capture the request verbatim at the fetch edge — AFTER all transforms** (OAuth
  body rewrite, tool-name prefixing, normalization). Corrupted-history bugs are
  often *introduced by* a transform; capture before the bytes hit the wire = ground
  truth (URL, model, params, full messages array, tool schemas, cache markers).
- **Capture the raw response/error too** (status + error JSON — it usually *names*
  the defect, e.g. `MissingToolResultsError`).
- **Redact only secrets** (auth headers, vault-injected fields) — never the body.
  **Technique = partial masking, NOT removal** (user-decided): reveal a few first +
  last real chars with `…redacted…` between (e.g. `sk-…redacted…f9a`), **graduated by
  length** so short secrets reveal less (tiers in §6). The field stays present and
  **diffable** (tells you if a secret changed/rotated — serves the §3.1 cache-bust
  diff) without storing the live value.
  **Mechanism (user-decided): each extension self-redacts IN ISOLATION — no shared
  helper.** Logs are self-generated, so they are self-redacted: the extension that
  owns the data — it alone knows what is secret and *how* to censor it — masks the
  value in **its own code, at the log call-site**, before the record is built. There
  is **no shared `redact()` utility and nothing in the kernel** for this: a shared
  helper would couple every secret-handling extension to one algorithm (violates
  **P1** feature-as-a-library + kernel-minimalism) and contradicts "the code decides
  *how* to censor." Consistency comes from the **harness, not shared code** (**P7**):
  the §9 rule makes self-redaction mandatory and documents the default tiers; **each
  extension reimplements them locally**. Duplicating a tiny mask across the few
  secret-handling exts (providers, auth) is **intentional — isolation over DRY**.
  Result: raw secrets never leave the producing extension's code — they reach neither
  the sink nor the journal.
  *(Supersedes the earlier "shared helper / mask at mint" sketches. Refines D2: the
  observability ext owns redaction POLICY/config + store/query; the redaction ACT is
  each source extension's own, self-contained.)*
  **Caveat (honest):** no scheme can mask a secret nobody marked — so the real
  enforcement is the §9 harness rule; an optional cheap sink-side pattern-scan
  (`sk-…`) may backstop obvious leaks but is not relied upon.
- **Retain successful requests in-window** (not error-only) so "diff failing vs a
  known-good chat" works. Cost: bodies are large → retention/rotation + compression.
- **Volume control (cache warming):** stamp the cheap `prefix.fingerprint` +
  cache-token counts on *every* request, but persist the **full body only when the
  fingerprint changed** (or on error) — keeps the bust findable without storing
  every warming send verbatim.
- **Store fat, serve thin:** the blob lives in the store; the agent gets a **diff**
  or a **validator verdict**, never the raw 200KB.
- *Future affordance (free from the data):* replay a stored request to reproduce.
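
The volume-control rule above can be sketched as a small stateful predicate. This is an illustration under assumptions, not the decided mechanism: `RequestStamp` and `createBodyRetentionPolicy` are hypothetical names, and where this state lives (producing extension vs. collector) is not settled by this note.

```ts
interface RequestStamp {
  readonly conversationId: string;
  readonly fingerprint: string;   // the cheap prefix.fingerprint, stamped on every request
  readonly isError: boolean;
}

// Returns a predicate: "should the full verbatim body be persisted for this request?"
function createBodyRetentionPolicy(): (stamp: RequestStamp) => boolean {
  const lastFingerprint = new Map<string, string>(); // per conversation
  return (stamp) => {
    const previous = lastFingerprint.get(stamp.conversationId);
    lastFingerprint.set(stamp.conversationId, stamp.fingerprint);
    // The first request has no previous fingerprint, so its body is kept as the baseline.
    return stamp.isError || previous !== stamp.fingerprint;
  };
}
```

Repeated warming sends with an unchanged prefix then cost only the fingerprint and token counts, while any bust or error still leaves two full bodies to diff.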

### D6 — Per-extension observability: free by default, rich by opt-in (P4)
- **Default = free:** the auto-scoped structured handle (D4) makes *every*
  extension observable for nothing — no per-extension logging contract to write.
- **Opt-in = a typed debug surface** only where an extension has domain-specific
  diagnostics worth standardizing (queue's enqueue/dequeue/deliver lifecycle).
  Lives in that extension's own **P7 harness**, loaded by the debugging agent only
  when investigating that extension. Mandating a hand-written contract per
  extension would be the boilerplate P4 warns against.
- **Worked boundary (who computes an attribute):** the `prefix.fingerprint` (§3.1)
  can't be computed by the generic collector — only the cache-warming/provider
  extension knows where the cacheable prefix ends. So that extension stamps the
  attribute; the collector just stores it. The template for every "who computes
  this attribute?" question.

### D7 — Performance posture
Negligible. An agent turn is **LLM-network-bound (seconds)**; span/attribute work
is microseconds — in the noise. The only real cost is **write I/O + volume**,
solved by: non-blocking append (D3) + out-of-process normalize + **don't trace
token deltas** (high-frequency *and* redundant — the chunk log already has final
text) + levels/sampling + fail-safe drop.

### D8 — The "easy view": one compact projection, served to agent AND human
A single-line-per-event **transcript skeleton** of how a chat was built — prompt
assembly, thinking, tool calls, timings, sizes — each line collapsing to a summary
(`ai thought 1.8s`, `tool read_file ran 0.5s → ok (2.0k)`), expandable to the
verbatim span. Key realizations:
- **It's a *projection*, not new infra** — a pure `spans → text` formatter in the
  query layer (P2: zero I/O, exhaustively unit-testable). NOT a frontend, NOT a
  graphical timeline (that's out-of-scope frontend later, which can consume the
  same projection).
- **Same artifact for both consumers** — this compact skeleton IS the
  "store-fat-serve-thin" *thin overview* the agent reads first before drilling
  into a span. The human "easy view" is that same text rendered to a terminal/
  markdown. So it *reinforces* agent-first; it does NOT reopen the human-viz
  question (D1) — no heavy infra, no design deformation.
- **One genuinely new capture it forces: prompt-assembly provenance** (see §4) —
  the context-filter chain (§3.2) records each contribution as a *segment* so we
  can show `persona(1180) · tool:read_file(300) · skill:web(400) · user(412)`.
  Cheap: *segmentation metadata over the verbatim body we already store* (D5),
  not a copy.
- **Interlocks with cache debugging (§3.1):** a prefix-fingerprint bust attributes
  to the exact contributor whose segment changed ("skill:web 400→420 between warm
  and real send").
- **Doubles as a completeness test:** if the easy view reconstructs how the chat
  was built end-to-end, the §4 taxonomy captured enough — every skeleton line must
  be backed by a real span/segment.
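
A sketch of the pure `spans → text` projection, assuming a simplified span shape. `SkeletonSpan` and the function names are hypothetical; only the two summary forms quoted above are reproduced.

```ts
interface SkeletonSpan {
  readonly kind: "step" | "tool-call";
  readonly name: string;         // tool name for tool-call spans; ignored for steps
  readonly durationMs: number;
  readonly isError: boolean;
  readonly resultChars: number;  // size of the verbatim result held in the store
}

function formatSeconds(ms: number): string {
  return (ms / 1000).toFixed(1) + "s";
}

function formatSize(chars: number): string {
  return chars < 1000 ? String(chars) : (chars / 1000).toFixed(1) + "k";
}

// One span in, one summary line out. No I/O, so it can be unit-tested exhaustively.
function renderSkeletonLine(span: SkeletonSpan): string {
  if (span.kind === "step") return `ai thought ${formatSeconds(span.durationMs)}`;
  const status = span.isError ? "error" : "ok";
  return `tool ${span.name} ran ${formatSeconds(span.durationMs)} → ${status} (${formatSize(span.resultChars)})`;
}

function renderSkeleton(spans: readonly SkeletonSpan[]): string {
  return spans.map((s) => renderSkeletonLine(s)).join("\n");
}
```

Sizes here are character counts, which are free to compute; token counts would need a tokenizer and are an open thread in §6.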

### D9 — Optimization analytics = derived aggregates, NOT a separate metrics pillar
A third consumption pattern over the SAME spans (after incident-debugging and the
D8 easy view): longitudinal roll-ups to tune Dispatch itself — token size per turn,
model generation time, tool-call durations, tool/skill usage frequency, error
rates. **P4 call: these are `GROUP BY` queries over the span store, not a separate
counters/histograms pipeline.** Everything needed is already a span attribute (turn
tokens, step/tool durations, tool-call counts, `isError`, D8 segment sizes) — no
new capture. (The "metrics pillar" machinery earns nothing here: no dashboards, no
alerting — agent-first.)
- **Cost/benefit interlock (the high-value one):** cross D8 `prompt.assembly`
  segment sizes — the *standing cost* a tool/skill/persona imposes on EVERY turn
  (context budget + cache pressure) — against usage frequency + success — the
  *realized benefit*. Ranks contributors by cost/benefit → directly answers
  "should we keep this tool in the definitions?" A rarely-used tool with a fat
  definition is pure overhead and a cache-bust risk.
- **Objective vs. interpreted:** counts/durations/error-rates are objective SQL.
  "Is the tool's description good enough for the model to know when to use it?" is
  a *hypothesis* an analysis agent forms from those signals + samples (defined but
  never called; called then result ignored; high input-schema error rate) — not a
  pure metric. Agent-interpreted, consistent with agent-first.
- **Retention asymmetry (the one real new requirement):** fat verbatim data (D5)
  rotates out fast; cheap aggregates are rolled up and kept long for trend signal.
  A periodic rollup `scheduledJob` (§2.3 scheduler) writes compact daily/weekly
  summaries that survive raw-trace pruning.
- **Three consumers, one capture:** incident-debug (one trace) · D8 easy view (one
  chat) · D9 analytics (many turns) — all the same spans. The capture is the asset;
  every consumer is a cheap projection/aggregation. Validates the D2 boundary.
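
To illustrate "aggregates are `GROUP BY` queries over spans", here is an in-memory equivalent of a per-tool roll-up. The real thing would presumably be SQL against the span store; `ToolCallSpan`, `ToolRollup`, and `rollupByTool` are hypothetical names used only for this sketch.

```ts
interface ToolCallSpan {
  readonly name: string;
  readonly durationMs: number;
  readonly isError: boolean;
}

interface ToolRollup {
  readonly name: string;
  readonly calls: number;
  readonly errors: number;
  readonly meanDurationMs: number;
}

// Roughly: SELECT name, COUNT(*), SUM(isError), AVG(durationMs) ... GROUP BY name ORDER BY name
function rollupByTool(spans: readonly ToolCallSpan[]): ToolRollup[] {
  const groups = new Map<string, { calls: number; errors: number; totalMs: number }>();
  for (const span of spans) {
    const group = groups.get(span.name) ?? { calls: 0, errors: 0, totalMs: 0 };
    group.calls += 1;
    if (span.isError) group.errors += 1;
    group.totalMs += span.durationMs;
    groups.set(span.name, group);
  }
  return Array.from(groups.entries())
    .map(([name, g]) => ({ name, calls: g.calls, errors: g.errors, meanDurationMs: g.totalMs / g.calls }))
    .sort((a, b) => a.name.localeCompare(b.name));
}
```

Nothing here needs a new capture: every input is an attribute the tool-call span already carries.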

### D10 — Logs/spans are a one-way emission, not a feature channel (clarifies D2/D3/§5.4)
Answering "do logs pass through other extensions?": **no.** Two distinct flows that
never mix:
- **Regular (feature) data** — extension→extension via typed contracts / hooks /
  services (§3.5, §5.4): in-band where it matters (filters awaited, services return
  values, the provider stream consumed by the kernel), type-anchored, **affects the
  turn outcome**.
- **Logged (observability) data** — ANY extension → its auto-scoped `host.logger`/
  span → kernel mints the record (stamps `extensionId`/ids/timing; secrets are
  already self-redacted by the source, D5) → `sink.emit` → **journal** →
  out-of-process **collector** → store →
  debugging agent. **One-way, fire-and-forget** (D3/D7): never awaited, never changes
  a turn, swallowed on error.

**Each extension authors its OWN logs** (handle auto-stamped with its id, D4). A
sibling feature extension **never receives** another's logs — that would be feature
coupling *through logs* (an anti-pattern; cross-feature reaction is what hooks are
for). The **sole reader** is the `observability` extension, off the (out-of-process)
collector/store + the in-process **`onAny`** firehose + the AgentEvent tap — the
§5.4 "observability only, never feature code" exception. **Many independent one-way
producers, exactly one privileged consumer** — so logging can never become a
backdoor feature bus.

---

## 3. Validation — the bug catalog (P4: earn it against real failures)

Each row is a real (old-Dispatch) failure → the query/diff that finds it. This is
the acid test that the design pays for itself.

| Failure | How the trace finds it |
|---|---|
| **Chat crashed** | Spans written incrementally → trace ends at the last span before death; collector's EOF stamp shows "process exited after span X". Read the tail. |
| **API rejected corrupted history** | `provider.request` span holds the exact post-transform body + raw rejection; an incomplete tool call is literally a tool-call span with no matching tool-result span; a `reconcile.repair` span shows any auto-fix. |
| **Queued message shown twice** | Two `deliver` spans for one message id (a count query). |
| **Queued message in wrong chat** | The enqueue span's `target.conversationId` ≠ where it delivered; filter by message id across conversations → the misroute is explicit. |
| **Prompt cache returns 0% hit** *(new)* | See §3.1 — diff consecutive verbatim requests' cacheable prefixes; a `prefix.fingerprint` attribute flags the bust; span timestamps reveal a cache-warming gap > TTL. |
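
Two of the rows above reduce to very small deterministic queries. The sketch below shows them over an in-memory list, assuming a flattened row shape; `CatalogSpan` and both function names are hypothetical, and the `tool-result` kind follows the wording of the table row rather than a finalized taxonomy.

```ts
interface CatalogSpan {
  readonly kind: string;          // e.g. "deliver", "tool-call", "tool-result"
  readonly messageId?: string;
  readonly toolCallId?: string;
}

// "Queued message shown twice": message ids with more than one deliver span.
function duplicateDeliveries(spans: readonly CatalogSpan[]): string[] {
  const counts = new Map<string, number>();
  for (const span of spans) {
    if (span.kind !== "deliver" || span.messageId === undefined) continue;
    counts.set(span.messageId, (counts.get(span.messageId) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n > 1)
    .map(([id]) => id);
}

// "API rejected corrupted history": tool calls that never got a matching result.
function unmatchedToolCalls(spans: readonly CatalogSpan[]): string[] {
  const resolved = new Set<string>();
  for (const span of spans) {
    if (span.kind === "tool-result" && span.toolCallId !== undefined) resolved.add(span.toolCallId);
  }
  const missing: string[] = [];
  for (const span of spans) {
    if (span.kind === "tool-call" && span.toolCallId !== undefined && !resolved.has(span.toolCallId)) {
      missing.push(span.toolCallId);
    }
  }
  return missing;
}
```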

### 3.1 Worked example — the prompt-cache 0%-hit bug (cache warming)
**The system (cache warming):** provider prompt caches expire after ~5 min. To keep a
chat's cache warm without a user message, periodically **resend the (rewound)
conversation** to refresh the cached prefix — staying warm until the real next
message lands a hit.
**The bug:** occasionally the API reported **0% cache read** with no way to debug
why. Cache hits depend on **byte-exact prefix identity**, so a miss has many silent
causes, all invisible before now:
- the cacheable prefix changed (tool schemas reserialized in different key order; a
  volatile value crept into the system prompt — timestamp/date/nonce; a
  skill/agent injection changed; attachment re-encoded);
- the `cache_control` breakpoint moved or wasn't set;
- the **cache-warming request's prefix wasn't byte-identical** to the prefix the
  next real message extends → you warmed the *wrong* cache;
- provider-side: the warming send fired late (gap > TTL), eviction, model change.

**Why this design nails it:**
- Verbatim capture (D5) of **every** provider request — *including cache-warming
  sends, flagged `warm` vs `real`* — plus **cache-token counts from the response**
  (`cache_read`, `cache_creation`).
- A **`prefix.fingerprint`** attribute (hash of the cacheable prefix up to the
  `cache_control` breakpoint): a cache bust = the fingerprint changed unexpectedly
  between a warming send and the next real send. One grouped query flags it; then
  diff the two bodies to see *which bytes* diverged.
- **Bonus insight this surfaces:** it makes **non-deterministic serialization** (a
  Map's key order, unstable tool-schema generation) — an otherwise invisible
  cache-killer — show up as a visible diff. Capture turns "0% and no idea why" into
  "these 14 bytes of the prefix changed, introduced by transform Y."

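A hedged sketch of the fingerprint check described in this section. Assumptions: the fingerprint hashes the *verbatim* prefix string (so a reserialization with a different key order changes it, which is the point), FNV-1a over UTF-16 code units is only a dependency-free stand-in for whatever hash is eventually chosen, and `ProviderSend`, `prefixFingerprint`, and `findCacheBusts` are hypothetical names.

```ts
// FNV-1a (32-bit) over UTF-16 code units; a small stand-in hash for the sketch.
function prefixFingerprint(prefix: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < prefix.length; i++) {
    hash ^= prefix.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface ProviderSend {
  readonly kind: "warm" | "real";
  readonly prefix: string; // verbatim bytes up to the cache_control breakpoint
}

// Indexes of real sends whose fingerprint differs from the warming send just before them.
function findCacheBusts(sends: readonly ProviderSend[]): number[] {
  const busts: number[] = [];
  let previous: ProviderSend | undefined;
  sends.forEach((send, index) => {
    if (
      previous !== undefined &&
      previous.kind === "warm" &&
      send.kind === "real" &&
      prefixFingerprint(previous.prefix) !== prefixFingerprint(send.prefix)
    ) {
      busts.push(index);
    }
    previous = send;
  });
  return busts;
}
```

Once an index is flagged, the two stored bodies are diffed to see which bytes diverged; the fingerprint only says *that* the prefix changed, not where.
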
---

## 4. Capture taxonomy (draft — the §-next thread)

Span kinds, driven by the bug list (attributes are illustrative, not final):
- **turn** (root) — `conversationId`, `turnId`, model, status, token usage.
- **step** — one LLM round-trip within a turn.
- **provider.request** *(the star, D5)* — verbatim body (post-transform), headers
  (redacted), raw response/error, `cache_control` presence, `prefix.fingerprint`,
  `warm|real`, cache-read/creation tokens, latency.
- **prompt.assembly** *(D8)* — ordered composition segments of the request:
  `{contributor extension.id, kind (persona|tool-def|skill|agent-profile|history|
  user-msg), role, length, contentRef→verbatim body}`. Powers the easy-view
  prompt-assembly render + per-contributor cache-bust attribution.
- **tool-call** — `toolCallId`, name, input, result, isError, duration.
- **reconcile.repair** *(§3.4)* — what the load-time repair changed.
- **queue.enqueue / dequeue / deliver** — message id, source/target conversation,
  links to the turn it caused. *(queue/session-features ext not built yet.)*
- **process.lifecycle** — boot / shutdown / **crash** (collector-synthesized).
- **structured log** — leveled, attributed, correlated line (the evolved Logger).
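
For the `prompt.assembly` kind, a small sketch of how segment metadata could drive both the easy-view render and per-contributor bust attribution. `Segment`, `renderAssembly`, and `changedSegments` are hypothetical; comparing lengths is a cheap proxy that misses same-length edits, so a real check would presumably compare content via `contentRef`.

```ts
interface Segment {
  readonly contributor: string; // e.g. "persona", "tool:read_file", "skill:web", "user"
  readonly length: number;      // size of this contribution in the verbatim body
}

// Render the ordered composition on one line.
function renderAssembly(segments: readonly Segment[]): string {
  return segments.map((s) => `${s.contributor}(${s.length})`).join(" · ");
}

// Name the contributors whose segment size differs between a warming send and the real send.
function changedSegments(warm: readonly Segment[], real: readonly Segment[]): string[] {
  const warmLengths = new Map<string, number>();
  for (const s of warm) warmLengths.set(s.contributor, s.length);
  const changes: string[] = [];
  for (const s of real) {
    const before = warmLengths.get(s.contributor);
    if (before !== undefined && before !== s.length) {
      changes.push(`${s.contributor} ${before}→${s.length}`);
    }
  }
  return changes;
}
```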

---

## 5. Vocabulary (user-approved; promote to GLOSSARY.md when this lands)

Approved this session: **trace**, **span**, **attribute**, **structured log**,
**observability**, **redaction**, **debugging agent**. Standard/training-baked —
they cost zero glossary justification (P6).
- **trace** — the full correlated record of one operation (≠ *history*/chunk log).
- **span** — one timed unit of work in a trace (parent + links + attributes).
- **attribute** — typed key/value on a span/log. *(aliases to avoid: tag, field, metadata)*
- **structured log** — a leveled, attributed, correlated record. *(avoid: debug message)*
- **span link** — a causal edge to another span/trace (cross-feature).
- **collector** — the out-of-process sink that normalizes + stores + serves traces.
- **journal** — the append-only durable buffer file between app and collector.
- **cache warming** — periodically resending a (rewound) conversation to keep its
  provider prompt cache warm within the ~5-min TTL. *(aliases to avoid: reheat,
  cache reheating)* — **user-decided.** Distinct from **wake** (the Claude
  wake-probe scheduler in old code) — keep the two concepts separate.
- **redaction** — replacing a secret's middle with `…redacted…`, keeping up to the
  first 3 + last 3 real chars, graduated by length so short secrets reveal less
  (tiers in §6). Partial masking keeps the field present + diffable; it is never
  dropped. *(aliases to avoid: censoring, scrubbing, masking-as-removal)* —
  technique **user-decided.**
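
What one extension's *local* mask might look like, using the graduated tiers recorded in §6. This is deliberately not a shared helper (D5): it is a sketch of the few lines each secret-handling extension would carry itself, and `revealCount` / `maskSecret` are hypothetical names.

```ts
const REDACTED = "…redacted…"; // fixed-width token, so the hidden length is not leaked

// Tiers: >=13 chars reveal 3 per side, 11-12 reveal 2, 8-10 reveal 1, <=7 full mask.
function revealCount(length: number): number {
  if (length >= 13) return 3;
  if (length >= 11) return 2;
  if (length >= 8) return 1;
  return 0;
}

// Partial masking: the field stays present and diffable, the live value is never stored.
function maskSecret(secret: string): string {
  const n = revealCount(secret.length);
  if (n === 0) return REDACTED;
  return secret.slice(0, n) + REDACTED + secret.slice(-n);
}
```

Because the revealed edges are stable for a given secret, two masked values that differ still signal a rotation in a diff.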

---

## 6. Open threads (not yet decided)

- **Span/attribute vocabulary** — finalize §4 (next up).
- **Structured-`Logger` ABI change** — *DECIDED:* **A, evolve in place** (LOCKED,
  per "Where we left off" at the top; shape in §8). Today it's unstructured
  `(message, ...args)`; it evolves to leveled + attributed + auto-scoped +
  correlated. Kernel owns `Logger` → this is an ABI change with `lsp references`
  fan-out. The rejected alternative (B) was a parallel structured channel.
- **Retention/rotation sizing** — verbatim bodies are large; TTL + size cap +
  compression; keep successes long enough for diffing.
- **Journal / IPC line protocol** — NDJSON to a pipe vs. unix socket vs. append
  file; framing; backpressure (drop-oldest/sample). **The journal file IS the durable
  queue** — there is **no in-memory log queue** in either process (that's the
  crash-safety win): a normal app crash/OOM does NOT lose lines already handed to
  `write()` (bytes live in the OS buffer, not the process heap); the collector
  resumes tailing from its last offset on restart. *Open:* **fsync cadence** —
  per-line (durable, slow) vs. periodic vs. none (fast); only a
  kernel-panic/power-loss risks lines written since the last fsync (with per-line
  fsync that is the single last line, the §3.4 bound).
- **Redaction policy** — *DECIDED:* (b) **short-secret guard** = graduated tiers by
  length: **≥13 → reveal 3** each side · **11–12 → 2** · **8–10 → 1** · **≤7 → full
  mask** (the `10` overlap resolved conservatively to reveal-1; `…redacted…` token is
  fixed-width so it never leaks the hidden length). (c) **who redacts** = **each
  extension self-redacts in isolation — NO shared helper, nothing in the kernel**
  (D5/§9): the producing ext masks in its own code at the call-site. *Still open:*
  (a) the exact secret-field/header list — declared **per-extension** (it knows its
  own; e.g. openai-compat: `authorization` header + any vault-injected body field).
- **Cache-token fields** — *RESOLVED by inspection:* `Usage` already carries
  `cacheReadTokens?` / `cacheWriteTokens?` (`packages/kernel/src/contracts/provider.ts`)
  — **no provider CR needed**; only confirm the openai-compat provider *populates*
  them (instrumentation detail, not a contract gap).
- **Collector supervision** — host-bin spawns first / drains last; restart +
  resume-tail; what if the collector dies.
- **Levels & default capture set** — what's on by default (deltas off).
- **Easy-view rendering (D8)** — the projection format + delivery surface (CLI
  command vs. transport route returning markdown; frontend later), and char-counts
  (free) vs token-counts (needs a tokenizer) in the skeleton.
- **Analytics roll-ups (D9)** — rollup table shape + retention asymmetry (raw
  traces short, aggregates long); `GROUP BY` indexes (tool_name, model, kind, time).
  *(The periodic-job mechanism already exists: `host.scheduler.register` —
  `packages/kernel/src/contracts/extension.ts`; only the table shape + retention +
  indexes remain to design.)*
- **`onAny` firehose (kernel)** — reserved (§5.4) but unbuilt: confirm it's the
  capture tap and define its shape (one listener; payload = hook id + payload +
  correlation).
- **Debugging-agent delivery + the "grab an ID" entry point** — how an id is handed
  to the agent; rides on the (unbuilt) `agents` extension + its P7 harness.
- **Sequencing / dependencies** — build the substrate now (Logger ABI, onAny,
  collector, store, core span kinds); instrument the rest as their features land
  (queue spans ← session-features; `prompt.assembly` ← context-filter chain;
  debugging agent ← agents ext).
- **Extension-logging rule placement** — *DECIDED:* `.dispatch/rules/extension-logging.md`
  (user's call); wired into the ORCHESTRATOR.md §3 scoping map for extension agents.
  Content drafted (§9, tribal-knowledge only); authored when the substrate is built.
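
For the journal / line-protocol thread above, a sketch of the resume-tail bookkeeping, assuming newline-delimited records and a byte offset persisted by the collector. `consumeChunk` and `TailStep` are hypothetical names. The key property is that the offset only advances past *complete* lines, so a partial last line is re-read on the next pass instead of being parsed half-written.

```ts
interface TailStep {
  readonly nextOffset: number;        // byte offset to resume from
  readonly completeBytes: Uint8Array; // whole lines only, ready to decode and parse
}

// `chunk` holds the bytes read from the journal starting at `offset`.
function consumeChunk(offset: number, chunk: Uint8Array): TailStep {
  // 0x0a never occurs inside a multi-byte UTF-8 sequence, so scanning bytes is safe.
  const lastNewline = chunk.lastIndexOf(0x0a);
  if (lastNewline === -1) {
    return { nextOffset: offset, completeBytes: new Uint8Array(0) }; // only a partial line so far
  }
  return {
    nextOffset: offset + lastNewline + 1,
    completeBytes: chunk.subarray(0, lastNewline + 1),
  };
}
```

If the main process died mid-line, the trailing fragment simply never becomes complete; the collector can then stamp its synthetic terminal span after the last whole record.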

## 7. Deferred / out of scope
- Replay-to-reproduce (the data supports it; build later).
- Adversary/multi-tenant isolation (we defend faults, not adversaries — §3.7).
- Human dashboards / metrics viz (agent-first; revisit only if a human need
  appears).

---

## 8. Logger ABI — proposed shape (resolving open-question #1; awaiting A/B)

> **STATUS: fork resolved, A LOCKED** (evolve in place, per "Where we left off" at
> the top); the types below are the working sketch. Forced by: structured records ·
> auto-scoped to extension (D4/D6) · explicit correlation / no ambient (P3) ·
> spans (D8/D9) · fire-and-forget non-blocking emit (D3) · sink-injected (purity).

### Proposed types (kernel ABI)
```ts
type Level = "debug" | "info" | "warn" | "error";                              // keep 4 (P6)
type Attributes = Readonly<Record<string, string | number | boolean | null>>;  // flat: serializable (D3) + queryable (D9)

interface LogContext {
  readonly extensionId: string;      // auto-stamped by host (D4) — not caller-supplied
  readonly conversationId?: string;
  readonly turnId?: string;
  readonly spanId?: string;
}

interface Logger {
  readonly debug: (msg: string, attrs?: Attributes) => void;
  readonly info:  (msg: string, attrs?: Attributes) => void;
  readonly warn:  (msg: string, attrs?: Attributes) => void;
  readonly error: (msg: string, attrs?: Attributes & { err?: unknown }) => void;
  readonly child: (ctx: Partial<LogContext> & { attrs?: Attributes }) => Logger; // explicit value, passed down (P3)
  readonly span:  (name: string, attrs?: Attributes) => Span;
}

interface Span {
  readonly id: string;
  readonly log: Logger;                                                          // pre-bound to this span
  readonly setAttributes: (attrs: Attributes) => void;
  readonly addLink: (target: { spanId: string; turnId?: string }, reason?: string) => void; // D4 causal edges
  readonly child: (name: string, attrs?: Attributes) => Span;                    // step → tool-call nesting
  readonly end: (outcome?: { err?: unknown; attrs?: Attributes }) => void;       // records duration + status
}
```

### Why this is P3-safe (the OTel contrast)
`child()` / `span()` return **explicit values you pass down** (orchestrator →
`runTurn` → `ctx.log` into tools). No hidden "current span"; correlation travels
as an argument — exactly why we could drop OTel's `AsyncLocalStorage`.

### Purity story (a "kernel logger" that touches no I/O)
`Logger`/`Span` are **pure record-builders over an injected `LogSink {
emit(record): void }`**. The kernel mints records (auto-stamps `extensionId`,
ids/timing) and calls `sink.emit`; the **sink is a host-bin bootstrap dependency**
(like the storage factory — available before any extension activates, per the
"Logger always available" contract). The sink writes the **journal** (D3); the
observability **collector owns the other end**. Kernel testable with a fake sink,
zero mocks.
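
A minimal sketch of the purity story above, under simplifying assumptions: the record shape (`SketchRecord`), the factory name `makeLogger`, and the `MiniLogger`/`Bound` helper types are hypothetical and exist only for illustration. Only `LogSink.emit`, the injected `now`, and the auto-stamped `extensionId` come from these notes; spans and `newId` are left out to keep it short.

```typescript
type Level = "debug" | "info" | "warn" | "error";
type Attributes = Readonly<Record<string, string | number | boolean | null>>;

// Simplified log record for illustration only; the real LogRecord is defined in §10.
interface SketchRecord {
  readonly level: Level;
  readonly msg: string;
  readonly timestamp: number;
  readonly extensionId: string;
  readonly turnId: string | undefined;
  readonly attributes: Attributes;
}

interface LogSink {
  emit(record: SketchRecord): void;
}

interface LoggerDeps {
  readonly sink: LogSink;
  readonly now: () => number; // injected clock, so tests are deterministic
}

// Correlation carried explicitly by each logger value (hypothetical helper type).
interface Bound {
  readonly turnId: string | undefined;
  readonly attrs: Attributes;
}

interface MiniLogger {
  readonly info: (msg: string, attrs?: Attributes) => void;
  readonly error: (msg: string, attrs?: Attributes) => void;
  readonly child: (ctx: { turnId?: string; attrs?: Attributes }) => MiniLogger;
}

function makeLogger(
  deps: LoggerDeps,
  extensionId: string,
  bound: Bound = { turnId: undefined, attrs: {} },
): MiniLogger {
  const write = (level: Level, msg: string, attrs?: Attributes): void => {
    // Build the record: no I/O happens here, only a call into the injected sink.
    const record: SketchRecord = {
      level,
      msg,
      timestamp: deps.now(),
      extensionId, // stamped by the builder, never supplied by the caller
      turnId: bound.turnId,
      attributes: { ...bound.attrs, ...attrs },
    };
    try {
      deps.sink.emit(record);
    } catch {
      // Fire-and-forget: a failing sink must not break the caller.
    }
  };
  return {
    info: (msg, attrs) => write("info", msg, attrs),
    error: (msg, attrs) => write("error", msg, attrs),
    // child() returns a new explicit value; nothing ambient is mutated.
    child: (ctx) =>
      makeLogger(deps, extensionId, {
        turnId: ctx.turnId ?? bound.turnId,
        attrs: { ...bound.attrs, ...ctx.attrs },
      }),
  };
}
```

A test can pass an array-backed sink and a constant clock, then assert on the collected records; no mocking library is involved.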

### The fork — DECIDE THIS (open-question #1)
- **A — evolve `Logger` in place (recommended).** Replace the string logger; the
  simple call `info("msg")` still compiles (attrs optional) but flows through the
  same correlated logger → **A subsumes B's ergonomics**. Con: ABI fan-out, but it is
  tiny now (a few `logger.error("...", err)` sites → `{ err }`) and only grows the
  longer the switch is deferred.
- **B — parallel structured channel, keep the string logger.** Zero churn, but two
  paths → drift, and casual `logger.info` stays uncorrelated / invisible to the
  store — defeats "everything queryable" and grates against P8.

### Sub-decisions baked into the sketch
- Inject `{ sink, now, newId }` — deterministic ids/timing in tests (§3.6).
- Flat scalar attributes (nested → stringify) — keeps D9 `GROUP BY`/indexes clean.
- `error(msg, { err })` normalizes today's positional `Error` arg.
- **Downstream contract change:** tools get **`ctx.log`** (span-bound) so they log
  correlated without a global (P3) — a `ToolContract` ctx addition (§3.3 ctx
  already carries `signal`/`onOutput`).
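
The "flat scalar attributes (nested → stringify)" and `{ err }` sub-decisions could look roughly like the sketch below. The helper names (`toFlatAttributes`, `errorAttributes`, `safeStringify`) and the `err.name` / `err.message` keys are assumptions made for illustration, not part of the proposed ABI.

```typescript
type Scalar = string | number | boolean | null;
type Attributes = Readonly<Record<string, Scalar>>;

// Stringify without throwing: cyclic values and BigInt fall back to String().
function safeStringify(value: unknown): string {
  try {
    const json: string | undefined = JSON.stringify(value);
    return json === undefined ? String(value) : json;
  } catch {
    return String(value);
  }
}

// Keep scalars as-is, stringify anything nested, and drop undefined values.
function toFlatAttributes(input: Readonly<Record<string, unknown>>): Attributes {
  const out: Record<string, Scalar> = {};
  for (const key of Object.keys(input)) {
    const value = input[key];
    if (value === undefined) continue;
    if (
      value === null ||
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      out[key] = value;
    } else {
      out[key] = safeStringify(value);
    }
  }
  return out;
}

// Turn a caught value (today's positional Error argument) into flat attributes.
function errorAttributes(err: unknown): Attributes {
  if (err instanceof Error) {
    return { "err.name": err.name, "err.message": err.message };
  }
  return { "err.message": typeof err === "string" ? err : safeStringify(err) };
}
```

Flattening at the logger boundary keeps every stored attribute a scalar, which is what the D9 `GROUP BY`/index story relies on.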

---

## 9. Harness artifact — the extension-logging rule (tribal-knowledge ONLY, P6/P7)

> **Status:** planned deliverable (user-requested). Extension-scoped (P7): the
> knowledge lives in the **harness**, and **each extension implements logging +
> redaction in its OWN isolated code** — there is NO shared logging/redaction helper
> to couple them (P1; isolation over DRY). Loaded by every extension-author agent.
> **Placement DECIDED** (user): `.dispatch/rules/extension-logging.md`, wired into the
> ORCHESTRATOR.md §3 scoping map for extension agents. **Write it when we build the
> substrate, not before.** P6 governs hard: state ONLY what a frontier model would
> get wrong about THIS system; omit anything inferable.

**MUST state (non-inferable, project-specific):**
- **Self-redact your own secrets before logging — in your own code.** There is no
  shared `redact()` and nothing in the kernel does it for you. YOU decide what is
  secret and how to censor it. Default censoring = the §6 tiered partial mask
  (reimplement it locally; deviate if your secret type warrants).
- Use the injected `host.logger` / the span / `ctx.log` you're handed — never
  `console.*`, never a hand-rolled logger or your own correlation ids.
- **Don't set `extensionId` / invent an id scheme** — it is auto-stamped (D4).
- **Attributes are flat scalars** (nested → stringify) — D9 GROUP BY + journal
  serialization need it.
- **Don't log token deltas / per-chunk streaming** (D7) — redundant + noisy.
- **Logs are one-way (D10)** — never read another extension's logs; cross-feature
  reaction is a hook, not a log.
- Edge I/O (providers): capture the request **verbatim, post-transform, at the fetch
  edge** (D5), self-redacting secret headers/fields in your own code.
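
As an illustration of the self-redaction rule, an extension-local helper might look like the sketch below. The tier thresholds here are placeholders and are NOT the §6 values (which are not reproduced in this section); the header list and the function names are likewise hypothetical. Each extension would keep its own copy rather than import a shared one.

```typescript
// Placeholder tiers for illustration; substitute the real §6 tiers locally.
function maskSecret(secret: string): string {
  if (secret.length < 8) return "***"; // too short to reveal anything safely
  if (secret.length < 20) return `***${secret.slice(-2)}`; // reveal a short suffix
  return `${secret.slice(0, 4)}***${secret.slice(-4)}`; // reveal prefix and suffix
}

// Keep an auth scheme such as "Bearer" readable and mask only the credential.
function maskHeaderValue(value: string): string {
  const space = value.indexOf(" ");
  if (space === -1) return maskSecret(value);
  return `${value.slice(0, space)} ${maskSecret(value.slice(space + 1))}`;
}

// Hypothetical list: each extension decides which of its own headers are secret.
const SECRET_HEADER_NAMES = new Set(["authorization", "x-api-key"]);

// Returns a copy that is safe to log; the input object is not modified.
function redactHeaders(headers: Readonly<Record<string, string>>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value === undefined) continue;
    out[key] = SECRET_HEADER_NAMES.has(key.toLowerCase()) ? maskHeaderValue(value) : value;
  }
  return out;
}
```

The redacted copy is what goes into the log record; the original headers still go to `fetch` unchanged.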

**MUST NOT state (inferable; omit, P6):**
- What logging/levels are for; info/warn/error semantics; "log your errors."
- How to call `logger.info("msg", {attrs})` (obvious from the type).
- Generic "don't log sensitive data" platitudes (replaced by the concrete
  self-redaction rule) or any SQL/GROUP BY explanation.

---

## 10. Build plan — Phase A (logging substrate)

> **Status:** PLANNED, prompts drafted (`prompts/phase-a-{kernel-logging,journal-sink}.md`),
> awaiting final user review before summon. **Goal:** every extension + every turn
> emits structured records durably into the **journal file**. Phase A = **1 process**
> (main app + in-process sink); the collector is process 2 (Phase B). Logger ABI = **A**.

### Record contract — frozen FIRST (both units depend on it)
`packages/kernel/src/contracts/logging.ts` (kernel owns it). Agent finalizes the
exact shape; load-bearing constraints:
- **`LogRecord`** = flat, JSON-serializable **discriminated union**: `log` line |
  `span-open` | `span-close`. Spans emitted **incrementally** (open at `span()`,
  close at `end()`) so a crashed turn is reconstructable from the journal (D3).
  Every variant carries correlation (`extensionId` auto-stamped + optional
  `conversationId`/`turnId`/`spanId`/`parentSpanId`), `timestamp`, flat `attributes`
  (queryable scalars), **and an optional `body` blob** (string) for large verbatim
  payloads — the pre-mutation prompt now, the verbatim provider request later
  (store-fat-serve-thin: query tools serve diffs/slices, not the raw blob).
- **`LogSink { emit(record): void }`** — fire-and-forget; the kernel never lets a
  sink error escape into a turn (D7).
- **`Logger` / `Span`** — the §8 shapes (leveled/attributed/auto-scoped/correlated;
  `child()`/`span()` return explicit values — P3, no ambient).
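
One possible reading of these constraints, as a sketch only: the exact shape is left to the implementing agent, so the discriminant name `kind` and the fields `name`, `durationMs`, and `status` are assumptions. The `unclosedSpans` helper (also hypothetical) shows why incremental open/close records make a crashed turn reconstructable: any `span-open` without a matching `span-close` was still in flight.

```typescript
type Attributes = Readonly<Record<string, string | number | boolean | null>>;

// Fields shared by every variant: correlation, time, flat attributes, optional body.
interface RecordBase {
  readonly timestamp: number;
  readonly extensionId: string;
  readonly conversationId?: string;
  readonly turnId?: string;
  readonly spanId?: string;
  readonly parentSpanId?: string;
  readonly attributes: Attributes;
  readonly body?: string; // large verbatim payload, kept out of attributes
}

interface LogLine extends RecordBase {
  readonly kind: "log";
  readonly level: "debug" | "info" | "warn" | "error";
  readonly msg: string;
}

interface SpanOpen extends RecordBase {
  readonly kind: "span-open";
  readonly spanId: string;
  readonly name: string;
}

interface SpanClose extends RecordBase {
  readonly kind: "span-close";
  readonly spanId: string;
  readonly durationMs: number;
  readonly status: "ok" | "error";
}

type LogRecord = LogLine | SpanOpen | SpanClose;

// Spans that were opened but never closed, e.g. because the process crashed.
function unclosedSpans(journal: readonly LogRecord[]): SpanOpen[] {
  const stillOpen = new Map<string, SpanOpen>();
  for (const record of journal) {
    switch (record.kind) {
      case "span-open":
        stillOpen.set(record.spanId, record);
        break;
      case "span-close":
        stillOpen.delete(record.spanId);
        break;
      case "log":
        break; // log lines do not change span state
    }
  }
  return Array.from(stillOpen.values());
}
```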

### Unit 1 — `kernel-logging` (ONE coordinated kernel owner; single-writer over kernel)
Interlocked ABI change (cf. the `tabId→conversationId` rename). Stage commits
(**contract checkpoint → host → runtime**) so the contract freezes early for Unit 2.
Owns: `contracts/logging.ts` (new), `contracts/extension.ts` (evolve `Logger`),
`contracts/tool.ts` (+ `ctx.log`), `contracts/index.ts` (re-export), `host/host.ts`
(+ `logSink` on `HostDeps`, mirroring `storageFactory` @ host.ts:34/60; build each
extension's auto-scoped logger that stamps `manifest.id`) + `host.test.ts`,
`runtime/{run-turn,dispatch}.ts` (open/close turn/step/tool-call spans; thread
`ctx.log`) + tests. **The `step` span records the verbatim pre-mutation prompt**
(messages + tools + opts) in its `body` — the **"BEFORE"** capture. Pure
record-builder injected with `{ now, newId }`.

### Unit 2 — `journal-sink` (bootstrap library, NOT an extension)
`packages/journal-sink/`. Implements kernel `LogSink`: pure `record → one NDJSON
line` core + thin fs append edge + rotation/backpressure (drop-oldest + warn — D3
fail-safe). Imports the frozen `LogRecord`/`LogSink` TYPES; never redefines them.
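
The pure core and the drop-oldest backpressure could be sketched as below. The names `encodeNdjsonLine` and `createDropOldestBuffer` are hypothetical, the capacity is counted in lines purely for simplicity, and the fs append edge and rotation are deliberately left out.

```typescript
// Pure core: one record in, exactly one newline-terminated JSON line out.
// JSON.stringify escapes newlines inside strings, so the output never spans lines.
// Assumes `record` is a plain JSON-serializable object.
function encodeNdjsonLine(record: object): string {
  return `${JSON.stringify(record)}\n`;
}

interface DrainResult {
  readonly lines: string[];
  readonly dropped: number; // lines discarded since the previous drain
}

interface DropOldestBuffer {
  push(line: string): void;
  drain(): DrainResult;
}

// Bounded in-memory queue between emit() and the fs append edge.
function createDropOldestBuffer(capacity: number): DropOldestBuffer {
  const limit = Math.max(1, capacity); // always keep room for at least one line
  let lines: string[] = [];
  let dropped = 0;
  return {
    push(line) {
      if (lines.length >= limit) {
        lines.shift(); // fail-safe: discard the oldest line, never block the caller
        dropped += 1;
      }
      lines.push(line);
    },
    drain() {
      const result = { lines, dropped };
      lines = [];
      dropped = 0;
      return result;
    },
  };
}
```

The append edge would call `drain()`, write the lines, and emit a single warning whenever `dropped` is greater than zero.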

### host-bin wiring (orchestrator CR)
Construct the sink, add `logSink` to `deps: HostDeps` (main.ts:77) so `host.logger`
works before any extension activates. Root tsconfig ref + host-bin dep on
`@dispatch/journal-sink` + `bun install` + biome import-sort (same wiring as
tool-read-file, Step 2).

### Order & parallelism
Sequential for safety (record type is a hard dep): **Unit 1 (freeze contract →
finish) → Unit 2 → host-bin wiring**. Optional overlap: start Unit 2 once Unit 1's
`logging.ts` checkpoint is committed.

### Open sub-decisions (lock while finalizing prompts)
- Package name `journal-sink` (vs `log-journal`).
- fsync cadence (default periodic).
- Journal file path + rotation policy (size cap; on-disk location).

**Verify:** `typecheck`/`test`/`check` clean; live boot → `host.logger` lines land in
the journal file.

### Phase A.2 — the "AFTER" capture (build plan)
`provider.request` verbatim post-transform (D5) inside `provider-openai-compat`:
exact serialized request + response status/cache-tokens/raw-error, auth self-redacted.
Completes full round-trip rebuild + the **before↔after diff**.
- **Contract (DONE, orchestrator):** `ProviderStreamOptions.logger?: Logger`
  (`contracts/provider.ts`) — threads the step's correlated logger into `stream()` so
  the `provider.request` span is a child of the step span (before↔after share
  turnId/parentSpanId). Optional = non-breaking.
- **Unit K — kernel run-turn** (owner: kernel): pass the step span's logger into
  `provider.stream(msgs, tools, { ...opts, logger })`. One file (`runtime/run-turn.ts`).
- **Unit P — provider-openai-compat** (owner): at the fetch edge, if `opts.logger`,
  open a `provider.request` child span; capture the verbatim post-transform request
  (URL, model, params, serialized body) + `cache_control` presence; on response,
  status + cache-read/creation tokens (Usage) + (on error) raw error; **self-redact
  the auth header in its own code** (graduated tiers, §6). First hermetic provider
  HTTP test (`stream.test.ts`, mock `fetch` + real-capture fixtures).
- **Order:** contract frozen (done) → Unit K ∥ Unit P (disjoint: kernel vs provider).
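
A reduced sketch of why the optional `logger` option is non-breaking and how the `provider.request` span ends up as a child of the step span. Everything here is a stand-in: `SpanLogger`, `makeSpanLogger`, `RecordedSpan`, and `streamSketch` are hypothetical, and the real `Logger`/`Span` types and `stream()` signature carry more than shown.

```typescript
interface RecordedSpan {
  readonly id: string;
  readonly name: string;
  readonly parentSpanId: string | undefined;
}

interface SpanHandle {
  readonly id: string;
  readonly log: SpanLogger; // logger pre-bound to this span
  end(): void;
}

interface SpanLogger {
  // Opens a span whose parent is whatever span this logger is bound to.
  span(name: string): SpanHandle;
}

function makeSpanLogger(
  recorded: RecordedSpan[],
  newId: () => string,
  boundSpanId?: string,
): SpanLogger {
  return {
    span(name) {
      const id = newId();
      recorded.push({ id, name, parentSpanId: boundSpanId });
      return { id, log: makeSpanLogger(recorded, newId, id), end() {} };
    },
  };
}

interface StreamOptionsSketch {
  readonly model: string;
  readonly logger?: SpanLogger; // optional, so existing callers keep compiling
}

function streamSketch(opts: StreamOptionsSketch): string {
  // Only instrument when a correlated logger was threaded in.
  const requestSpan = opts.logger?.span("provider.request");
  // The fetch edge would capture the verbatim, self-redacted request here.
  requestSpan?.end();
  return `streamed:${opts.model}`;
}
```

Passing the step span's `log` is what links before and after: the kernel hands over a value, and the provider never looks anything up ambiently.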

**body-channel ABI — RESOLVED (Phase A.3):** added optional `body?` to
`Logger.span` / `Span.child` / `Span.end` → `LogRecord.body`; and moved `createLogger`
out of `contracts/` so `contracts/logging.ts` is pure types again. The before
(`prompt` span) and after (`provider.request` span) now carry their verbatim payloads
in `body`, not stringified attributes — attributes stay thin/queryable (D9).

*(Full per-extension prompt-segment provenance — D8 — comes later, with the
context-filter chain.)*

---

## 11. Phase B preview — collector flush into SQLite (NOT built; model only)

> Captured so it's not lost; the exact knobs are open Phase B sub-decisions.

The flush into the SQLite **store** is the **collector's** job (process 2), async +
continuous, **decoupled from turns** — the app only appends to the journal and never
waits. Per tick the collector:
1. **Tails** the journal — reads new complete lines since its last committed
   byte-offset (short poll loop or fs-watch).
2. **Batches** them into **one SQLite transaction**, commits.
3. **Advances a persisted consume-offset.**

- **Cadence:** per batch/tick, bounded by *interval OR batch-size, whichever first*
  (e.g. ~250 ms or N records) — never one-txn-per-record (avoids per-record fsync),
  never per-turn. Sub-second / near-real-time.
- **Crash-safety = at-least-once + idempotent:** resume from the persisted offset on
  restart → may re-read a few lines → writes are idempotent (`INSERT OR IGNORE` on a
  record/span id). No loss; harmless reprocess. This is what lets you query a chat
  **after the app that produced it crashed** (D3).
- **Queryability lag** ≈ one batch interval; post-mortem still works (journal is
  durable; the collector consumes whenever it's up).
- **Open (Phase B):** poll vs fs-watch; interval/batch-size; offset storage (store
  metadata vs sidecar); dedup key; + the store schema/indexes (§6).
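
The tick model above, sketched with in-memory stand-ins so the at-least-once + idempotent property is visible. Assumptions: the journal is a string and offsets are character indexes (a real collector would track byte offsets in a file), a `Map` stands in for the SQLite table, each record carries an `id` used as the dedup key (still an open question above), and malformed lines are not handled.

```typescript
interface TailResult {
  readonly lines: string[];
  readonly nextOffset: number;
}

// Read only complete (newline-terminated) lines at or after `offset`.
function readCompleteLines(journal: string, offset: number): TailResult {
  const lastNewline = journal.lastIndexOf("\n");
  if (lastNewline < offset) return { lines: [], nextOffset: offset };
  const chunk = journal.slice(offset, lastNewline);
  const lines = chunk.split("\n").filter((line) => line.length > 0);
  return { lines, nextOffset: lastNewline + 1 };
}

// Flush when either bound is hit, whichever comes first.
function shouldFlush(
  pendingCount: number,
  elapsedMs: number,
  cfg: { readonly maxBatch: number; readonly intervalMs: number },
): boolean {
  return pendingCount > 0 && (pendingCount >= cfg.maxBatch || elapsedMs >= cfg.intervalMs);
}

// One tick: tail, apply the whole batch, return the offset to persist.
function collectorTick(
  journal: string,
  committedOffset: number,
  store: Map<string, string>,
): number {
  const { lines, nextOffset } = readCompleteLines(journal, committedOffset);
  for (const line of lines) {
    const parsed = JSON.parse(line) as { id: string };
    // Analogue of INSERT OR IGNORE: a re-read line is a harmless no-op.
    if (!store.has(parsed.id)) store.set(parsed.id, line);
  }
  return nextOffset;
}
```

If the process dies after applying a batch but before persisting the returned offset, the next run re-reads the same lines and the store ends up unchanged.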

---

## 12. Phase C — body de-dup + retention (DECIDED; building)

> Resolves the §6 "Retention/rotation sizing" + "dedup key" open threads + the
> tasks.md "dedup / storage growth" item. **User-gated decisions (this session):**
> extend the existing pipeline (no new extension); scope = **de-dup + retention/
> rotation** (D9 analytics roll-ups DEFERRED); dedup = **content-addressed bodies**.

**The problem.** D5 verbatim provider-I/O capture stores large request/response
bodies; cache-warming resends near-identical bodies every few minutes → the trace
store grows without bound.

**Decision — content-addressed bodies (supersedes the §3.1/D5 "fingerprint-gated
persistence" sketch).** Dedup keys on the **body content hash**, NOT on
`prefix.fingerprint`:
- The store hashes each verbatim `body`, stores it **once** in a content-addressed
  bodies table keyed by hash, and references it by hash from the span/record row.
  Identical bodies (the cache-warming case, and any other repeat) collapse to one
  stored copy. Robust against ALL duplicate bodies, not just prefix matches; and it
  is **stateless** (no "prior fingerprint per conversation" bookkeeping).
- This **decouples `prefix.fingerprint` from storage.** Fingerprint + `warm|real`
  revert to their original job: queryable **cache-bust debugging** attributes (§3.1).
  They are NOT needed for dedup/retention and are **deferred** to a later cache-bust-
  debugging milestone — also because **cache-warming is not built yet**, so a request
  cannot honestly be flagged `warm` vs `real` today (declare-reality, §extension brief).
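
A dependency-free sketch of content-addressed body storage. Assumptions: in-memory collections stand in for the SQLite tables, the names (`BodyStoreSketch`, `insertWithBody`, `readBody`) are hypothetical, and the hash is a small FNV-1a-style function over UTF-16 code units chosen only so the example runs without imports. A real store would presumably use a cryptographic hash such as SHA-256 to make collisions a non-issue.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; illustration only.
function fnv1aHex(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface BodyRow {
  readonly recordId: string;
  readonly bodyHash: string;
}

interface BodyStoreSketch {
  readonly bodies: Map<string, string>; // hash → body, each body stored once
  readonly rows: BodyRow[]; // span/record rows referencing a body by hash
}

// Store the body only if its hash is new, then reference it from the row.
function insertWithBody(store: BodyStoreSketch, recordId: string, body: string): string {
  const bodyHash = fnv1aHex(body);
  if (!store.bodies.has(bodyHash)) store.bodies.set(bodyHash, body);
  store.rows.push({ recordId, bodyHash });
  return bodyHash;
}

// Read path is transparent: callers ask by record and get the full body back.
function readBody(store: BodyStoreSketch, recordId: string): string | undefined {
  const row = store.rows.find((r) => r.recordId === recordId);
  return row === undefined ? undefined : store.bodies.get(row.bodyHash);
}
```

Note that the insert needs no knowledge of earlier requests in the same conversation, which is the "stateless" property called out above.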

**Retention/rotation.** The store exposes `prune(policy)` enforcing a
`RetentionPolicy` ({ maxAgeMs?, maxTotalBodyBytes? }): delete records/bodies past
`maxAgeMs`; evict **oldest** bodies (drop-oldest) until under `maxTotalBodyBytes`;
garbage-collect orphaned bodies (no remaining referencing row). Bodies above a size
threshold are **compressed** at rest (gzip), transparently decompressed on read.
The `trace-store` package also exports a `DEFAULT_RETENTION` constant.
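
A pure-function sketch of the prune pass, written against plain arrays rather than SQLite. Several choices here are assumptions rather than decided behaviour: bodies age out by losing their last referencing row (instead of by their own timestamp), eviction order uses a hypothetical `storedAt` field, and rows whose body was evicted keep existing without a body reference. Compression and concrete `DEFAULT_RETENTION` values are not shown.

```typescript
interface RetentionPolicy {
  readonly maxAgeMs?: number;
  readonly maxTotalBodyBytes?: number;
}

interface StoredBody {
  readonly hash: string;
  readonly bytes: number;
  readonly storedAt: number;
}

interface StoredRow {
  readonly recordId: string;
  readonly timestamp: number;
  readonly bodyHash?: string;
}

interface StoreState {
  readonly rows: readonly StoredRow[];
  readonly bodies: readonly StoredBody[];
}

function prune(state: StoreState, policy: RetentionPolicy, now: number): StoreState {
  let rows = state.rows.slice();
  let bodies = state.bodies.slice();

  // 1. Age: drop records older than maxAgeMs.
  if (policy.maxAgeMs !== undefined) {
    const cutoff = now - policy.maxAgeMs;
    rows = rows.filter((r) => r.timestamp >= cutoff);
  }

  // 2. Garbage-collect bodies that no remaining row references.
  const referenced = new Set<string>();
  for (const r of rows) {
    if (r.bodyHash !== undefined) referenced.add(r.bodyHash);
  }
  bodies = bodies.filter((b) => referenced.has(b.hash));

  // 3. Size cap: evict oldest bodies until the total fits.
  if (policy.maxTotalBodyBytes !== undefined) {
    const cap = policy.maxTotalBodyBytes;
    const oldestFirst = bodies.slice().sort((a, b) => a.storedAt - b.storedAt);
    let total = oldestFirst.reduce((sum, b) => sum + b.bytes, 0);
    let firstKept = 0;
    while (firstKept < oldestFirst.length && total > cap) {
      const evicted = oldestFirst[firstKept];
      if (evicted === undefined) break;
      total -= evicted.bytes;
      firstKept += 1;
    }
    bodies = oldestFirst.slice(firstKept);
    // Rows survive eviction but no longer point at a missing body.
    const kept = new Set(bodies.map((b) => b.hash));
    rows = rows.map((r) =>
      r.bodyHash !== undefined && !kept.has(r.bodyHash)
        ? { recordId: r.recordId, timestamp: r.timestamp }
        : r,
    );
  }

  return { rows, bodies };
}
```

Because the function takes `now` as an argument and returns a new state, the policy logic can be tested without a database or a clock.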

**Who triggers prune.** The **collector** (process 2) calls `store.prune(policy)` on
a cadence in its existing tick loop — NOT a `scheduledJob` (the scheduler is for
extensions; trace-store/collector are supporting packages). Retention policy values
default to `DEFAULT_RETENTION`; host-bin env-override wiring is a deferred follow-up.

**Units / waves.**
- **Wave 1 — `trace-store`:** content-addressed body storage + compression +
  `prune`/`RetentionPolicy`/`DEFAULT_RETENTION` on the `TraceStore` surface, read
  paths transparent. (`bun:sqlite` → `bun test`.)
- **Wave 2 — `observability-collector`:** call the new `prune` on a cadence;
  confirm body inserts flow through the content-addressed path. Depends on Wave 1's
  surface.

**Deferred (still open):** D9 analytics roll-ups (§2 D9 / §6) — rollup table shape,
`GROUP BY` indexes, retention asymmetry, the periodic rollup job. And the
`prefix.fingerprint` / `warm|real` cache-bust attributes (above).