Evaluation is the systematic process of measuring how well your AI performs.

Why systematic evaluation matters

AI systems fail in non-deterministic ways. The same prompt can produce different results. Edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing becomes impossible to scale. Systematic evaluation solves this by:
  • Establishing baselines: Measure current performance before making changes
  • Preventing regressions: Catch quality degradation before it reaches production
  • Enabling experimentation: Compare different models, prompts, or architectures
  • Building confidence: Deploy changes knowing they improve aggregate performance

Evaluation approaches

Axiom supports two complementary approaches:
  • Offline evaluations test your capability against a curated collection of inputs with expected outputs (ground truth). Run them before deploying to catch regressions.
  • Online evaluations score live production traffic with reference-free scorers. Run them after deploying to monitor quality continuously.
Both approaches use the same Scorer API. The scorers you write for one context work in the other.

Which evaluation approach to use

Use offline evaluations when you need to test against known-good answers before shipping. Use online evaluations when you want to continuously monitor production quality. You can also use both together: run offline evaluations before each deploy, then use online evaluations to watch the traffic that follows.
|  | Offline evaluations | Online evaluations |
| --- | --- | --- |
| When | Development, before deploy | Production, on live traffic |
| Expected values | Requires expected output per case | No ground truth needed |
| Scorers | Can compare output to expected | Reference-free |
| Execution | CLI runner with vitest | Fire-and-forget inside your app |
| Sampling | Runs every case | Per-scorer sampling rate |
| Telemetry | OTel spans in eval dataset | OTel spans linked to production traces |

Offline evaluation workflow

Offline evaluations test your capability against a curated dataset before you deploy. Axiom’s evaluation framework follows a simple pattern:
1. Create a collection

Build a set of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time.
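A collection can start as a plain list of input/expected pairs. The sketch below is illustrative only; the `TestCase` shape is an assumption for this example, not Axiom's actual collection type:

```typescript
// Illustrative shape for a ground-truth collection (not Axiom's actual type).
interface TestCase {
  input: string;    // what you send to the capability
  expected: string; // the known-good answer (ground truth)
}

const collection: TestCase[] = [
  { input: "What is 2 + 2?", expected: "4" },
  { input: "What is the capital of France?", expected: "Paris" },
  // ...grow this to 10-20 cases over time
];
```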
2. Define scorers

Write functions that compare your capability’s output against the expected result. Use custom logic or prebuilt scorers from libraries like autoevals.
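At its simplest, a scorer is a function from the capability's output and the expected value to a score. This exact-match scorer is a minimal sketch; the argument shape is an assumption for illustration, not Axiom's Scorer API:

```typescript
// Compares output to the expected value and returns a score in [0, 1].
// The argument shape here is illustrative, not Axiom's Scorer API.
interface ScorerArgs {
  output: string;
  expected: string;
}

function exactMatch({ output, expected }: ScorerArgs): { score: number } {
  // Trim whitespace so trailing newlines don't cause spurious failures.
  return { score: output.trim() === expected.trim() ? 1 : 0 };
}
```

For fuzzier comparisons (semantic similarity, factuality), a prebuilt scorer from a library like autoevals can replace the hand-rolled logic.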
3. Run evaluations

Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost.
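Aggregate metrics like pass rate fall out of the per-case scores. A hypothetical helper (not part of Axiom's runner) might look like:

```typescript
// Fraction of cases whose score meets the threshold (illustrative helper,
// not part of Axiom's runner, which reports these metrics for you).
function passRate(scores: number[], threshold = 1): number {
  if (scores.length === 0) return 0;
  const passed = scores.filter((s) => s >= threshold).length;
  return passed / scores.length;
}
```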
4. Compare and iterate

Review results in the Axiom Console. Compare against baselines. Identify failures. Make improvements and re-evaluate.

Online evaluation workflow

Online evaluations score live production traffic continuously after you deploy. They use the same Scorer API as offline evaluations, but without expected values.
1. Write reference-free scorers

Create scorers that assess output quality using only the input and output; no ground truth is required. Use heuristic checks for format and structure, or LLM-as-judge patterns for semantic quality.
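A reference-free scorer judges the output on its own terms. Here is a minimal heuristic sketch that checks whether the output is valid JSON; the argument shape is an assumption for illustration, not Axiom's Scorer API:

```typescript
// Reference-free heuristic scorer: no expected value needed.
// Scores 1 if the output parses as JSON, 0 otherwise.
function validJson({ output }: { output: string }): { score: number } {
  try {
    JSON.parse(output);
    return { score: 1 };
  } catch {
    return { score: 0 };
  }
}
```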
2. Attach scorers to your capability

Call onlineEval inside your capability code to run scorers as fire-and-forget operations that don’t affect your response latency.
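"Fire-and-forget" means the scoring work is never awaited on the request path. The generic pattern looks like the sketch below; `scoreInBackground` is a hypothetical stand-in used only to show the pattern, not Axiom's `onlineEval` signature:

```typescript
// Generic fire-and-forget pattern: kick off async scoring without
// blocking the response. `scoreInBackground` is a stand-in, not Axiom's API.
async function scoreInBackground(task: () => Promise<void>): Promise<void> {
  try {
    await task();
  } catch (err) {
    // Swallow errors so background scoring never breaks the request path.
    console.error("online scorer failed:", err);
  }
}

function handleRequest(input: string, output: string): string {
  // `void` discards the promise: the response returns immediately.
  void scoreInBackground(async () => {
    // run reference-free scorers on { input, output } here
  });
  return output;
}
```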
3. Control sampling

Set per-scorer sampling rates to balance coverage and cost. Run cheap heuristic scorers on every request and expensive LLM judges on a fraction of traffic.
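A per-scorer sampling rate reduces to a simple probabilistic gate. This sketch is illustrative (Axiom applies sampling for you when a rate is configured); the injectable `rng` parameter exists only to make the example testable:

```typescript
// Run a scorer on roughly `rate` fraction of requests (rate in [0, 1]).
// `rng` is injectable for deterministic testing; defaults to Math.random.
function shouldSample(rate: number, rng: () => number = Math.random): boolean {
  return rng() < rate;
}

// e.g. gate a cheap heuristic at 1.0 and an LLM judge at 0.05:
// if (shouldSample(0.05)) { /* run the expensive LLM-as-judge scorer */ }
```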
4. Monitor and iterate

Review online evaluation scores in the Axiom Console alongside your production traces. Use the insights to add targeted offline test cases and refine your capability.

What’s next?

  • To set up your environment and authenticate, see Quickstart.
  • To learn how to write scoring functions that work in both offline and online evaluations, see Scorers.