Eval runner
The server-side eval runner lets users dispatch evaluations from the dashboard without running a local daemon. The runner uses LLM credentials the admin configures here; users don’t need their own. This page is admin-only.
If no server runner is configured, dashboard evaluations fall back to the daemon runner, which uses the user’s own local Claude/Codex install. Either path produces the same evaluation output.
Where it lives
/admin/settings → Eval runner section.
Vendors
The runner supports two LLM vendors, configured independently:
| Vendor | Block | Use when |
|---|---|---|
| Anthropic | Evals:ServerRunner:Anthropic | You want Claude judges (sonnet or opus). Recommended default. |
| OpenAI | Evals:ServerRunner:OpenAI | You want GPT judges, or your team standardises on OpenAI for billing/governance reasons. |
Configure either or both. Each block has its own credentials, so an admin can keep them on separate billing accounts.
Settings
Per vendor:
| Key | Required | Description |
|---|---|---|
| ApiKey | yes | The vendor API key. Stored encrypted at rest. The block is treated as enabled only when an API key is set. |
| Model | yes | The judge model. For Anthropic, e.g. claude-sonnet-4-6 or claude-opus-4-7. For OpenAI, e.g. gpt-4o. |
And one runner-wide setting:
| Key | Default | Description |
|---|---|---|
| RunTimeoutSeconds | 300 | Per-evaluation timeout. Long sessions with --threshold raised may need a higher value. |
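The settings above can be pictured as a small data model. The following is only an illustrative Python sketch, not the product's implementation: the class names `VendorBlock` and `ServerRunnerSettings` are hypothetical, and treating a whitespace-only key as unset is an assumption. What happens when ApiKey is set but Model is missing is not stated on this page, so the sketch does not model it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VendorBlock:
    """Hypothetical mirror of one vendor block, e.g. Evals:ServerRunner:Anthropic."""
    api_key: Optional[str] = None  # ApiKey
    model: Optional[str] = None    # Model

    @property
    def enabled(self) -> bool:
        # Per the table: the block is enabled only when an API key is set.
        # Assumption: a whitespace-only key counts as "not set".
        return bool(self.api_key and self.api_key.strip())


@dataclass
class ServerRunnerSettings:
    """Hypothetical mirror of the Evals:ServerRunner section."""
    anthropic: VendorBlock = field(default_factory=VendorBlock)
    openai: VendorBlock = field(default_factory=VendorBlock)
    run_timeout_seconds: int = 300  # RunTimeoutSeconds default from the table

    def enabled_vendors(self) -> List[str]:
        # Collect the names of the vendor blocks that have an API key.
        vendors = []
        if self.anthropic.enabled:
            vendors.append("Anthropic")
        if self.openai.enabled:
            vendors.append("OpenAI")
        return vendors


settings = ServerRunnerSettings(
    anthropic=VendorBlock(api_key="sk-example", model="claude-sonnet-4-6"),
)
print(settings.enabled_vendors())
print(settings.run_timeout_seconds)
```

Here only the Anthropic block has a key, so it is the only vendor the sketch reports as enabled, and the timeout stays at its default.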
How users pick a vendor
In the dashboard’s eval dialog:
- If exactly one vendor is configured, evals dispatch to it implicitly.
- If both are configured, the dialog shows a vendor selector and the user picks. The EvalRunRequest.Vendor field is required in this case.
- If neither is configured, the Run evaluation button only routes to the daemon runner; if the user has no daemon, the button is hidden.
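The three rules above amount to a small decision function. The sketch below is a hedged Python illustration, not the dashboard's actual code: `resolve_dispatch`, its parameters, and the `"server:<Vendor>"` / `"daemon"` return strings are hypothetical names chosen for the example. Ignoring a requested vendor when only one is configured is also an assumption, since the page does not say how that case is handled.

```python
from typing import Optional, Sequence


def resolve_dispatch(
    configured: Sequence[str],
    requested_vendor: Optional[str],
    has_daemon: bool,
) -> Optional[str]:
    """Return where a dashboard eval would go, or None if the button is hidden.

    configured       -- names of vendors with an API key set (hypothetical input)
    requested_vendor -- stands in for EvalRunRequest.Vendor
    has_daemon       -- whether the user has a connected daemon
    """
    if len(configured) == 0:
        # Neither vendor configured: daemon runner only, or no button at all.
        return "daemon" if has_daemon else None

    if len(configured) == 1:
        # Exactly one vendor: dispatch to it implicitly.
        # Assumption: any requested vendor is ignored in this case.
        return f"server:{configured[0]}"

    # Both configured: the vendor field is required.
    if requested_vendor is None:
        raise ValueError("EvalRunRequest.Vendor is required when both vendors are configured")
    if requested_vendor not in configured:
        raise ValueError(f"Unknown or unconfigured vendor: {requested_vendor}")
    return f"server:{requested_vendor}"


print(resolve_dispatch(["Anthropic"], None, has_daemon=False))
print(resolve_dispatch(["Anthropic", "OpenAI"], "OpenAI", has_daemon=True))
print(resolve_dispatch([], None, has_daemon=True))
print(resolve_dispatch([], None, has_daemon=False))
```

Raising an error for a missing vendor is one way to model "required"; a real API might instead return a validation response.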
Why two runners exist
There are two ways an evaluation can run, and either is fine:
- Daemon runner — the user’s daemon connects to the tenant over SignalR and runs the judges using whichever Claude/Codex install the daemon has. No tenant-side credentials. Free if the user has Claude Pro or a Codex subscription.
- Server runner — the tenant calls Anthropic/OpenAI directly using the credentials configured here. Centralised billing, no daemon needed, works for users without a local Claude/Codex install (read-only reviewers, mobile, etc.).
Mixing both is common: most evals run on daemons; the server runner exists for reviewers and for teams who want the consistency of a single billable account.
Related
- Evaluations: the user-facing feature.
- Embeddings & guidelines: the embedding service that ingests eval evidence into judge-fact clusters.