Eval runner

The server-side eval runner lets users dispatch evaluations from the dashboard without running a local daemon. The runner uses LLM credentials the admin configures here; users don’t need their own. This page is admin-only.

If no server runner is configured, dashboard evaluations fall back to the daemon runner, which uses the user’s own local Claude/Codex install. Either path produces the same evaluation output.

Where it lives

/admin/settings → Eval runner section.

Vendors

The runner supports two LLM vendors, configured independently:

Vendor	Block	Use when
Anthropic	`Evals:ServerRunner:Anthropic`	You want Claude judges (sonnet or opus). Recommended default.
OpenAI	`Evals:ServerRunner:OpenAI`	You want GPT judges, or your team standardises on OpenAI for billing/governance reasons.

Configure either or both. Each block has its own credentials, so an admin can keep them on separate billing accounts.

Settings

Per vendor:

Key	Required	Description
`ApiKey`	yes	The vendor API key. Stored encrypted at rest. The block is treated as enabled only when an API key is set.
`Model`	yes	The judge model. For Anthropic, e.g. `claude-sonnet-4-6` or `claude-opus-4-7`. For OpenAI, e.g. `gpt-4o`.
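
As a rough sketch of how the per-vendor keys fit together, the Python below models the two blocks as a nested mapping and applies the rule that a block counts as enabled only when its `ApiKey` is set. The mapping layout, the placeholder key value, and the `enabled_vendors` helper are assumptions for illustration, not the product's actual storage format or API.

```python
# Hypothetical in-memory view of the two vendor blocks. The nesting mirrors
# the `Evals:ServerRunner:<Vendor>` key paths; real storage may differ.
settings = {
    "Evals": {
        "ServerRunner": {
            "Anthropic": {"ApiKey": "sk-ant-placeholder", "Model": "claude-sonnet-4-6"},
            "OpenAI": {"ApiKey": "", "Model": "gpt-4o"},
        }
    }
}

VENDORS = ("Anthropic", "OpenAI")


def enabled_vendors(settings: dict) -> list[str]:
    """Return the vendors whose block has a non-empty ApiKey."""
    runner = settings.get("Evals", {}).get("ServerRunner", {})
    enabled = []
    for vendor in VENDORS:
        block = runner.get(vendor) or {}
        api_key = (block.get("ApiKey") or "").strip()
        if api_key:
            enabled.append(vendor)
    return enabled


print(enabled_vendors(settings))  # ['Anthropic']
```

Here the OpenAI block has a `Model` but an empty `ApiKey`, so only Anthropic is treated as enabled.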

And one runner-wide setting:

Key	Default	Description
`RunTimeoutSeconds`	300	Per-evaluation timeout, in seconds. Long sessions run with a raised `--threshold` may need a higher value.
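
One way to picture a per-evaluation timeout is a wrapper that cancels a run once `RunTimeoutSeconds` elapses. The sketch below uses Python's `asyncio.wait_for`. `run_with_timeout` and `fake_judge` are hypothetical names, and this page does not describe how the runner actually enforces the limit.

```python
import asyncio

RUN_TIMEOUT_SECONDS = 300  # documented default for RunTimeoutSeconds


async def run_with_timeout(evaluation, timeout_seconds: float = RUN_TIMEOUT_SECONDS) -> str:
    """Await one evaluation coroutine, reporting 'timed_out' if it overruns."""
    try:
        await asyncio.wait_for(evaluation, timeout=timeout_seconds)
        return "completed"
    except asyncio.TimeoutError:
        return "timed_out"


async def fake_judge(duration: float) -> None:
    # Stand-in for a judge call (assumption); it only sleeps.
    await asyncio.sleep(duration)


# Short timeouts keep the demo fast; production would use the configured value.
print(asyncio.run(run_with_timeout(fake_judge(0.01), timeout_seconds=1)))   # completed
print(asyncio.run(run_with_timeout(fake_judge(1), timeout_seconds=0.05)))   # timed_out
```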

How users pick a vendor

In the dashboard’s eval dialog:

If exactly one vendor is configured, evals dispatch to it implicitly.
If both are configured, the dialog shows a vendor selector and the user picks. The `EvalRunRequest.Vendor` field is required in this case.
If neither is configured, the Run evaluation button only routes to the daemon runner; if the user has no daemon, the button is hidden.
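
The three rules above can be sketched as a small decision function. This is a minimal illustration, not the dashboard's code: `resolve_dispatch`, its parameters, and the returned strings are hypothetical, and it does not model how a connected daemon is prioritised when a server vendor is also available.

```python
from typing import Optional


def resolve_dispatch(enabled: list[str], requested: Optional[str], has_daemon: bool) -> str:
    """Return where an eval would go under the three rules above.

    Result is 'server:<Vendor>', 'daemon', or 'hidden' (no Run button).
    """
    if len(enabled) == 1:
        # Exactly one vendor: dispatch to it implicitly; `requested` is ignored.
        return f"server:{enabled[0]}"
    if len(enabled) >= 2:
        # Both configured: the request must name one of them.
        if requested not in enabled:
            raise ValueError("Vendor is required when both vendors are configured")
        return f"server:{requested}"
    # Neither configured: daemon only, or no button at all.
    return "daemon" if has_daemon else "hidden"


print(resolve_dispatch(["Anthropic"], None, False))               # server:Anthropic
print(resolve_dispatch(["Anthropic", "OpenAI"], "OpenAI", True))  # server:OpenAI
print(resolve_dispatch([], None, True))                           # daemon
print(resolve_dispatch([], None, False))                          # hidden
```

Raising an error when both vendors are configured and none is named mirrors the statement that `EvalRunRequest.Vendor` is required in that case. The real API may report this differently.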

Why two runners exist

There are two ways an evaluation can run, and either is fine:

Daemon runner: the user’s daemon connects to the tenant over SignalR and runs the judges using whichever Claude/Codex install the daemon has. No tenant-side credentials are needed, and there is no extra API cost if the user already has a Claude Pro or Codex subscription.
Server runner: the tenant calls Anthropic/OpenAI directly using the credentials configured here. Billing is centralised, no daemon is needed, and it works for users without a local Claude/Codex install (read-only reviewers, mobile users, etc.).

Mixing both is common: most evals run on daemons; the server runner exists for reviewers and for teams who want the consistency of a single billable account.

Related pages

Evaluations: the user-facing feature.
Embeddings & guidelines: the embedding service that ingests eval evidence into judge-fact clusters.