Embeddings & guidelines

Evaluation findings are extracted as facts, embedded, and clustered to surface recurring patterns per repo. The clusters drive two surfaces: the Curation tab in the dashboard (so an admin can promote the lessons learned), and the SessionStart guideline injection (so Claude picks up curated guidance automatically at the start of every session).

This page covers the admin configuration: which embedding provider runs, what happens when you switch, and how the curation workflow gates what gets injected.

These settings live under /admin/settings → Embeddings section, plus per-repo controls under Curation on each repo’s page.

| Provider | Block | Notes |
|----------|-------|-------|
| Voyage | Embedding:Voyage | Default. Higher-quality code embeddings; Voyage offers a generous free tier. |
| OpenAI | Embedding:OpenAI | If you already standardise on OpenAI. Dimensions configurable on text-embedding-3-large. |

Pick exactly one via Embedding:Provider. Per-vendor blocks:

| Key | Required | Description |
|-----|----------|-------------|
| Embedding:Voyage:ApiKey | yes (if provider is Voyage) | Voyage API key. |
| Embedding:Voyage:Model | yes | Voyage model, e.g. voyage-code-3. |
| Embedding:OpenAI:ApiKey | yes (if provider is OpenAI) | OpenAI API key. |
| Embedding:OpenAI:Model | yes | OpenAI embedding model, e.g. text-embedding-3-large. |
| Embedding:OpenAI:Dimensions | optional | Dimension count (only meaningful for text-embedding-3-*). |
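
The rules in the table can be checked before a deploy with a small script. The following is a minimal sketch, not part of the product: `validate_embedding_config` is a hypothetical helper, it assumes the settings have been loaded into a flat mapping keyed by the names above, and it assumes ApiKey and Model are only enforced for the selected provider's block.

```python
from typing import Mapping

PROVIDERS = ("Voyage", "OpenAI")


def validate_embedding_config(cfg: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    provider = cfg.get("Embedding:Provider")
    if provider not in PROVIDERS:
        return [f"Embedding:Provider must be one of {PROVIDERS}, got {provider!r}"]

    problems = []
    # Assumption: ApiKey and Model are required only for the selected provider.
    for field in ("ApiKey", "Model"):
        key = f"Embedding:{provider}:{field}"
        if not cfg.get(key):
            problems.append(f"{key} is required when the provider is {provider}")

    # Dimensions is optional and only meaningful for text-embedding-3-* models.
    dims = cfg.get("Embedding:OpenAI:Dimensions")
    if provider == "OpenAI" and dims is not None:
        model = cfg.get("Embedding:OpenAI:Model", "")
        if not model.startswith("text-embedding-3-"):
            problems.append("Embedding:OpenAI:Dimensions only applies to text-embedding-3-* models")
        elif not str(dims).isdigit() or int(dims) <= 0:
            problems.append("Embedding:OpenAI:Dimensions must be a positive integer")
    return problems
```

An empty result means the mapping satisfies the table; each string in a non-empty result names the offending key.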

Switching the provider or the model triggers a one-time vector clear at the next app startup. Stored vectors are dropped, and the backfill service re-embeds retained facts under the new configuration in the background.

Practical consequences:

  • Curation and guideline injection are unavailable for the few minutes the backfill runs.
  • The Curation tab will show no clusters until the backfill completes.
  • Promoted decisions (see below) are not lost — they’re stored independently of the vector index.

You can monitor the backfill from /admin/settings; the embeddings panel shows last-rebuild timestamp and counts.
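
To make the "clear on switch" behaviour concrete, here is one way such a startup check could work. This is an illustrative sketch only: the page does not describe the actual mechanism, and `embedding_fingerprint` and `needs_vector_clear` are hypothetical names. It fingerprints the provider and model, the two settings the page says trigger a clear; whether Dimensions also participates is not stated, so it is left out.

```python
import hashlib
import json
from typing import Optional


def embedding_fingerprint(provider: str, model: str) -> str:
    """Stable hash of the settings that determine which vectors are comparable."""
    payload = json.dumps({"provider": provider, "model": model}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_vector_clear(stored: Optional[str], current: str) -> bool:
    """True when vectors were built under a different provider/model pair."""
    # No stored fingerprint means nothing has been embedded yet, so nothing to clear.
    return stored is not None and stored != current
```

In this sketch the app would persist the fingerprint alongside the vector index and compare it once at startup, which matches the one-time nature of the clear.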

Two layers of review: facts, then clusters

Findings travel through two distinct review surfaces before they reach guideline injection. Both are visible per repo from the dashboard.

Layer 1: the Facts tab (fact-level review)

The Facts tab lists every individual judge finding, filterable by category. Anyone with repo visibility can act on it. Two surface toggles control what’s shown — Show muted and Show deleted — and four actions are available per fact:

  • Mute / Unmute — set aside noise. Muted facts are excluded from clustering. Reversible, no reason required. Anyone with repo access can mute or unmute.
  • Delete — soft-delete a wrong finding. Requires a one-line reason. The fact is removed from the dashboard, the cluster pipeline, and guideline injection. Gated to the fact’s retainer (the user whose session produced it) or any user owning a session in the repo.
  • Restore — undo a delete. The deleter sees their own deletions in Show deleted and can restore them.

The endpoints behind this are POST /api/repositories/{hash}/judge-facts/{category}/{factHash}/mute|unmute and …/delete|undelete. The Facts tab is the user-facing wrapper; the API is the same one used by future automation.

Fact-level review is where the noise gets filtered out. By the time something reaches the Curation tab, it has survived this pass.
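
The effect of muting and deleting on each surface can be summarised as two predicates. This is a sketch under assumptions: the `Fact` type and both function names are hypothetical, the per-user scoping of Show deleted (deleters see their own deletions) is ignored, and a fact that is both muted and deleted is treated as deleted.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    fact_hash: str
    category: str
    muted: bool = False
    deleted: bool = False


def visible_in_facts_tab(fact: Fact, show_muted: bool = False, show_deleted: bool = False) -> bool:
    """Mirror the two surface toggles on the Facts tab."""
    if fact.deleted:
        return show_deleted
    if fact.muted:
        return show_muted
    return True


def eligible_for_clustering(fact: Fact) -> bool:
    """Muted and soft-deleted facts never reach the cluster pipeline."""
    return not fact.muted and not fact.deleted
```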

Layer 2: the Curation tab (cluster-level promotion)

The Curation tab is admin-only and operates on clusters of related facts, not individual findings. There’s a lot on the page; the section below walks the actual UI top to bottom.

The first control on the tab is a radio for SessionStart guideline injection:

  • Promoted-only (default) — only clusters an admin has explicitly promoted with injection get injected. Safe default for established repos.
  • Promoted + top-weighted uncurated — turns on the per-repo auto_inject_uncurated flag. Top-weighted clusters inject without explicit promotion. Use for solo or brand-new repos where waiting on curation is friction.

Both modes still respect Dismissed — a dismissed cluster never injects, regardless of mode. The label below the radio shows when the setting was last changed.
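
The two modes can be read as a selection rule over clusters. The sketch below is an interpretation, not the product's implementation: the `Cluster` type, the status strings, the `"injection"` target label, and the `top_n` cutoff are all assumptions (the page does not say how many uncurated clusters count as "top-weighted"), and "uncurated" is taken to mean clusters still in Pending.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: str
    status: str  # "pending", "promoted", or "dismissed"
    weight: float
    targets: frozenset = frozenset()


def clusters_to_inject(clusters, auto_inject_uncurated=False, top_n=3):
    """Pick the clusters that would be injected at SessionStart."""
    # Promoted-only: the cluster must be promoted with the injection target picked.
    injected = [c for c in clusters if c.status == "promoted" and "injection" in c.targets]
    if auto_inject_uncurated:
        # Add the heaviest pending clusters; dismissed clusters are never considered.
        pending = sorted(
            (c for c in clusters if c.status == "pending"),
            key=lambda c: c.weight,
            reverse=True,
        )
        injected.extend(pending[:top_n])
    return injected
```

Because dismissed clusters match neither branch, they stay out of the result in both modes.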

Below the toggle, a chip set narrows the queue by decision state:

  • Pending — clusters waiting for a decision. The default view.
  • Promoted — clusters you’ve already promoted. Each card shows which target kinds were picked.
  • Dismissed — clusters you’ve dismissed. Each card shows the reason (if any) and when.
  • Regressions (warning-coloured) — clusters that have grown noticeably since you promoted them. The card shows the delta (e.g. +4 since promotion), so you can spot patterns that have got worse even after writing guidance for them. There are two cases to investigate: the injected guidance isn’t landing, or something in the repo has changed.

A second chip set restricts to a category: All / Safety / Plan adherence / Quality / Efficiency. Combines with the status filter — e.g. “Pending Safety” shows only undecided safety clusters.

Each cluster renders as a card:

┌─ "Agent ran destructive bash without confirmation" ────────────────┐
│                                                                    │
│ Weight 12 · 7 phrasings · Safety · last seen 2026-05-22            │
│                                                                    │
│ [ Promote ] [ Dismiss ]                                            │
└────────────────────────────────────────────────────────────────────┘

Anatomy:

  • Cluster text — the curated phrasing if you’ve already promoted (your PromotedText), otherwise the best representative judge finding for the cluster. The text you read when deciding whether to promote.
  • Weight — the cluster’s raw score, roughly the number of evidence-weighted facts in it. Higher means “this pattern shows up more often” or “more recent sessions hit it”.
  • Effective weight — appears in parentheses for stale clusters (Weight 8 (effective 4.5)) where age decays the signal. Use this to spot clusters that are mostly history.
  • Regression delta — shown as (+4 since promotion) on cards in the Regressions view, indicating how much the cluster has grown since you originally promoted it.
  • N phrasings — how many distinct judge findings cluster together. A cluster with 1 phrasing is a single fact; 7 phrasings means the judges flagged the same pattern seven ways.
  • Category — Safety / Plan adherence / Quality / Efficiency.
  • Last seen — date of the most recent fact in the cluster. Stale clusters with no recent activity can often be dismissed without a close read.
  • Actions — depend on status:
    • Pending — Promote (opens a dialog asking which target kinds: CLAUDE.md / Memory / Injection — pick any subset). Dismiss (asks for an optional reason).
    • Promoted — shows a green Promoted chip plus outline chips for each target kind you picked. Revoke button to undo the promotion (cluster returns to Pending).
    • Dismissed — shows when you dismissed it and the reason (if any). Revoke button to bring it back to Pending.

Promote is multi-target: a single cluster can be marked for CLAUDE.md and Injection — the targets are independent.
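
The per-status actions above amount to a small state machine: Pending can move to Promoted or Dismissed, and Revoke returns either to Pending. The following sketch models that; `ClusterDecision` is a hypothetical class, and the requirement that at least one target be picked is an assumption, since the page only says "pick any subset".

```python
TARGET_KINDS = frozenset({"CLAUDE.md", "Memory", "Injection"})


class ClusterDecision:
    """Tracks one cluster's decision state: pending, promoted, or dismissed."""

    def __init__(self) -> None:
        self.status = "pending"
        self.targets = frozenset()
        self.dismiss_reason = None

    def promote(self, targets) -> None:
        chosen = frozenset(targets)
        if self.status != "pending":
            raise ValueError("only pending clusters can be promoted")
        # Assumption: at least one target is required, and only known kinds are allowed.
        if not chosen or not chosen <= TARGET_KINDS:
            raise ValueError(f"targets must be a non-empty subset of {sorted(TARGET_KINDS)}")
        self.status, self.targets = "promoted", chosen

    def dismiss(self, reason=None) -> None:
        if self.status != "pending":
            raise ValueError("only pending clusters can be dismissed")
        self.status, self.dismiss_reason = "dismissed", reason

    def revoke(self) -> None:
        if self.status == "pending":
            raise ValueError("nothing to revoke")
        # Revoking either decision returns the cluster to Pending.
        self.status, self.targets, self.dismiss_reason = "pending", frozenset(), None
```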

The two layers (Facts and Curation) are operationally independent — you can leave fact review to anyone in the repo and only personally curate clusters, or shape clusters upstream by aggressively muting individual facts. Most teams settle into “any session owner can mute, admins curate clusters” without explicit coordination.

For automation, the same two layers are exposed as HTTP endpoints under your tenant:

  • GET /api/repositories/{hash}/judge-facts — list facts for a repo (with optional ?category=).
  • POST /api/repositories/{hash}/judge-facts/{category}/{factHash}/{mute|unmute|delete|undelete} — fact-level actions.
  • GET /api/repositories/{hash}/curation — list clusters (status + category filters via query string).
  • POST /api/repositories/{hash}/curation/{clusterId}/{promote|dismiss|revoke} — cluster-level actions.
  • GET / PUT /api/repositories/{hash}/guideline-settings — read and update the per-repo auto_inject_uncurated flag.

The dashboard UI is a thin wrapper over these endpoints.
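
For scripting against these routes, URL construction might look like the sketch below. Only the paths come from the list above; the helper names, the base URL, and the `status` and `category` query parameter names for the curation list are assumptions. Authentication and request bodies are not documented on this page, so they are omitted.

```python
from typing import Optional
from urllib.parse import quote, urlencode

FACT_ACTIONS = ("mute", "unmute", "delete", "undelete")
CLUSTER_ACTIONS = ("promote", "dismiss", "revoke")


def _repo_root(base_url: str, repo_hash: str) -> str:
    return f"{base_url.rstrip('/')}/api/repositories/{quote(repo_hash, safe='')}"


def fact_action_url(base_url: str, repo_hash: str, category: str, fact_hash: str, action: str) -> str:
    """URL for a POST fact-level action (mute, unmute, delete, undelete)."""
    if action not in FACT_ACTIONS:
        raise ValueError(f"unknown fact action: {action}")
    root = _repo_root(base_url, repo_hash)
    return f"{root}/judge-facts/{quote(category, safe='')}/{quote(fact_hash, safe='')}/{action}"


def cluster_action_url(base_url: str, repo_hash: str, cluster_id: str, action: str) -> str:
    """URL for a POST cluster-level action (promote, dismiss, revoke)."""
    if action not in CLUSTER_ACTIONS:
        raise ValueError(f"unknown cluster action: {action}")
    return f"{_repo_root(base_url, repo_hash)}/curation/{quote(cluster_id, safe='')}/{action}"


def curation_list_url(base_url: str, repo_hash: str,
                      status: Optional[str] = None, category: Optional[str] = None) -> str:
    """URL for the GET cluster list; the query parameter names are assumed."""
    params = {k: v for k, v in {"status": status, "category": category}.items() if v}
    url = f"{_repo_root(base_url, repo_hash)}/curation"
    return f"{url}?{urlencode(params)}" if params else url
```

Validating the action name before building the URL keeps typos from turning into requests against routes that do not exist.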

Individual users can disable SessionStart guideline injection for their own sessions:

kapacitor config set disable_session_guidelines true

Useful for users who want to A/B-test agent behaviour with and without injection. The opt-out is per-profile, so users with multiple tenants can disable on one and keep it on for others.

Related pages:

  • Evaluations — the user-facing scoring feature and the Facts tab where fact-level review lives.
  • Eval runner — the judge that produces the evidence the embedder ingests.