Why systematic evaluation matters
AI systems fail in non-deterministic ways. The same prompt can produce different results, and edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing becomes impossible to scale. Systematic evaluation solves this by:

- Establishing baselines: Measure current performance before making changes
- Preventing regressions: Catch quality degradation before it reaches production
- Enabling experimentation: Compare different models, prompts, or architectures
- Building confidence: Deploy changes knowing they improve aggregate performance
Evaluation approaches
Axiom supports two complementary approaches:

- Offline evaluations test your capability against a curated collection of inputs with expected outputs (ground truth). Run them before deploying to catch regressions.
- Online evaluations score live production traffic with reference-free scorers. Run them after deploying to monitor quality continuously.
Both approaches share the same Scorer API: the scorers you write for one context work in the other.
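As an illustration of how one scorer can serve both contexts, here is a minimal sketch. The argument and result shapes (`ScorerArgs`, `Score`) and the scorer name are assumptions for this example, not Axiom's actual Scorer API:

```typescript
// Illustrative shapes only; Axiom's real Scorer API may differ.
type ScorerArgs = {
  input: string;
  output: string;
  expected?: string; // present in offline evals, absent in online evals
};

type Score = { name: string; score: number };

// One scorer usable in both contexts: compares against `expected`
// when ground truth exists, otherwise falls back to a reference-free check.
function conciseAnswer({ output, expected }: ScorerArgs): Score {
  if (expected !== undefined) {
    return { name: "conciseAnswer", score: output.trim() === expected.trim() ? 1 : 0 };
  }
  return { name: "conciseAnswer", score: output.length <= 280 ? 1 : 0 };
}
```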
Which evaluation approach to use
Use offline evaluations when you need to test against known-good answers before shipping. Use online evaluations when you want to continuously monitor production quality. You can use both approaches together to get the best of both worlds.

| | Offline evaluations | Online evaluations |
|---|---|---|
| When | Development, before deploy | Production, on live traffic |
| Expected values | Requires expected output per case | No ground truth needed |
| Scorers | Can compare output to expected | Reference-free |
| Execution | CLI runner with vitest | Fire-and-forget inside your app |
| Sampling | Runs every case | Per-scorer sampling rate |
| Telemetry | OTel spans in eval dataset | OTel spans linked to production traces |
Offline evaluation workflow
Offline evaluations test your capability against a curated dataset before you deploy. Axiom’s evaluation framework follows a simple pattern:

Create a collection
Build a set of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time.
Define scorers
Write functions that compare your capability’s output against the expected result. Use custom logic or prebuilt scorers from libraries like autoevals.

Run evaluations
Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost.
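The three steps above can be sketched as a plain loop. Everything here (`cases`, `capability`, `exactMatch`, `runEval`) is a hypothetical stand-in to show the pattern, not Axiom's CLI runner:

```typescript
// A tiny collection of test cases with ground truth.
type Case = { input: string; expected: string };

const cases: Case[] = [
  { input: "2+2", expected: "4" },
  { input: "capital of France", expected: "Paris" },
];

// Stand-in for the capability under test (normally an LLM call).
function capability(input: string): string {
  return input === "2+2" ? "4" : "Paris";
}

// Scorer: compares output to the expected value.
function exactMatch(output: string, expected: string): number {
  return output.trim() === expected.trim() ? 1 : 0;
}

// Run the capability over the collection and aggregate a pass rate.
function runEval(): { passRate: number } {
  const scores = cases.map((c) => exactMatch(capability(c.input), c.expected));
  return { passRate: scores.reduce((a, b) => a + b, 0) / scores.length };
}
```

In practice the runner also records per-case telemetry and cost, but the core loop is the same: execute, score against ground truth, aggregate.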
Online evaluation workflow
Online evaluations score live production traffic continuously after you deploy. They use the same Scorer API as offline evaluations, but without expected values.
Write reference-free scorers
Create scorers that assess output quality using only the input and output, with no ground truth required. Use heuristic checks for format and structure, or LLM-as-judge patterns for semantic quality.
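Two reference-free heuristic scorers might look like the following sketch. The function names and checks are illustrative, not prebuilt Axiom scorers:

```typescript
// Structural check: does the output parse as JSON?
function validJson(output: string): number {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

// Tone check: penalize apology boilerplate in the response.
function hasNoApology(output: string): number {
  return /\b(sorry|as an ai)\b/i.test(output) ? 0 : 1;
}
```

Neither scorer needs an expected value, so both can run against arbitrary production traffic.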
Attach scorers to your capability
Call onlineEval inside your capability code to run scorers as fire-and-forget operations that don’t affect your response latency.

Control sampling
Set per-scorer sampling rates to balance coverage and cost. Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic.
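The sampling decision itself is simple. This sketch shows the idea with hypothetical names (`Sampled`, `shouldSample`, `pickScorers`); `onlineEval`'s real options may look different:

```typescript
// A scorer paired with its sampling rate (0 = never, 1 = every request).
type Sampled = { scorer: (output: string) => number; rate: number };

// Injectable random source makes the decision testable.
function shouldSample(rate: number, rand: () => number = Math.random): boolean {
  return rand() < rate;
}

// Per request, keep only the scorers whose sample roll succeeded,
// e.g. cheap heuristics at rate 1 and an LLM judge at rate 0.05.
function pickScorers(scorers: Sampled[], rand: () => number = Math.random): Sampled[] {
  return scorers.filter((s) => shouldSample(s.rate, rand));
}
```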
What’s next?
- To set up your environment and authenticate, see Quickstart.
- To learn how to write scoring functions that work in both offline and online evaluations, see Scorers.
- To learn how to write evaluation functions, see Write offline evaluations.
- To understand flags and experiments, see Flags and experiments.
- To view results in the Console, see Analyze results.
- To learn how to write and run online evaluation functions, see Write and run online evaluations.
- To view results in the Console, see Analyze online evaluation results.