CarysBench and Evals

CarysBench is our evaluation platform for measuring how well Carys performs, tracking quality over time, and guiding ongoing improvements.

Evals are structured quality checks. Carys is asked a consistent set of business questions, then the results are scored against clear standards. This gives us repeatable evidence of quality instead of relying on one-off demos.
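
CarysBench's internals are not shown on this page, but the basic shape of an eval is easy to sketch. The Python below is a minimal illustration under that caveat: EvalCase, ask_carys, run_eval, and the sample question are hypothetical names invented here, not the real interface.

```python
# Minimal sketch only: EvalCase, ask_carys, and run_eval are hypothetical
# names for illustration, not CarysBench's actual code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str                    # one fixed business question
    score: Callable[[str], float]    # scoring rule: answer -> 0.0..1.0

def run_eval(ask_carys: Callable[[str], str],
             cases: list[EvalCase]) -> dict[str, float]:
    """Ask the same questions every run; score answers the same way."""
    return {case.question: case.score(ask_carys(case.question))
            for case in cases}

# Example case: the "clear standard" here is that the known-correct
# total appears in the answer (the figure itself is made up).
CASES = [
    EvalCase(
        question="What were total Q3 sales?",
        score=lambda answer: 1.0 if "1,204,000" in answer else 0.0,
    ),
]
```

Because the question set and scoring rules stay fixed, two runs of the same cases are directly comparable, which is what makes the evidence repeatable.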

Each run is saved and compared with previous runs, so we can see whether quality is improving, staying steady, or slipping. If a score drops, we investigate the cause and fix it.

The goal of CarysBench is simple: quality should be measured continuously, not assumed.

What Evals Help Us Do

Catch Regressions Early

We rerun known scenarios and compare scores between releases. If performance drops, we catch the regression early, before it reaches more users.
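
As a rough sketch of what that comparison involves (again, not the real tooling): if each saved run is a map from scenario to score, a regression check reduces to diffing two runs. The find_regressions name and the 0.05 tolerance are illustrative assumptions.

```python
# Hypothetical regression check: a saved run is treated as a simple
# {scenario: score} map, and scores are diffed between two releases.
def find_regressions(previous: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Flag scenarios whose score dropped by more than `tolerance`."""
    return [scenario for scenario, score in current.items()
            if scenario in previous
            and previous[scenario] - score > tolerance]

last_release = {"Q3 total sales": 1.00, "Top region by margin": 0.90}
this_release = {"Q3 total sales": 1.00, "Top region by margin": 0.60}
print(find_regressions(last_release, this_release))
# -> ['Top region by margin']: investigate before shipping further
```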

Track Quality Over Time

CarysBench keeps run history so we can track trend lines over time, rather than relying on subjective impressions.

Target Fixes Precisely

Low-scoring runs show us where quality is weakest. Teams can prioritize fixes, rerun evals, and confirm whether changes actually improved outcomes.

Maintain Trust at Scale

As Carys evolves, eval discipline helps keep quality standards stable across releases.

How We Score Quality

We score Carys using six high-level quality factors that reflect what matters in real decision workflows (an illustrative sketch of how one run's scores might be recorded follows the last factor):

Numerical Correctness

Are the numbers right? Totals, percentages, and derived figures should be mathematically correct.

Evidence-Based Claims

Are claims supported by data? Key statements should be grounded in clear evidence, not vague language.

Internal Consistency

Does the report stay consistent from start to finish? Numbers and conclusions should not contradict each other.

Coverage and Completeness

Does the analysis answer the full question? The output should cover all major parts of what the user asked.

Clarity and Usefulness

Is the result understandable and actionable? Reports should be clear, direct, and useful for making decisions.

Stability Over Time

Is performance reliable across repeated runs and releases? Quality should hold steady, not fluctuate unpredictably.
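
To make the rubric concrete, here is one hypothetical way a single run's factor scores could be recorded. The factor names mirror the list above; the 0-to-1 scale and the example values are illustrative assumptions, not CarysBench's actual schema.

```python
# Hypothetical shape for one run's factor scores; factor names come from
# this page, the 0-1 scale and the values are illustrative assumptions.
FACTORS = (
    "numerical_correctness",
    "evidence_based_claims",
    "internal_consistency",
    "coverage_and_completeness",
    "clarity_and_usefulness",
    "stability_over_time",
)

run_scores = {
    "numerical_correctness": 0.95,      # totals and percentages check out
    "evidence_based_claims": 0.80,      # most key statements cite data
    "internal_consistency": 1.00,       # no contradictory figures
    "coverage_and_completeness": 0.70,  # one sub-question left unanswered
    "clarity_and_usefulness": 0.85,
    "stability_over_time": 0.90,
}
assert set(run_scores) == set(FACTORS)  # every factor gets a score
```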

How Scores Are Tracked Over Time

Each quality factor is scored, then rolled up into overall run results. CarysBench compares those results over time so we can track trend direction and verify whether changes improved quality.
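
This page does not specify the exact aggregation, so as a sketch only: assume the rollup is an equally weighted mean of the six factor scores, and the trend is read by comparing overall scores across saved runs.

```python
from statistics import mean

# Illustrative only: an equally weighted mean is assumed as the rollup;
# the real aggregation is not described on this page.
def overall(factor_scores: dict[str, float]) -> float:
    """Roll one run's per-factor scores up into a single run score."""
    return mean(factor_scores.values())

latest = overall({
    "numerical_correctness": 0.95, "evidence_based_claims": 0.80,
    "internal_consistency": 1.00, "coverage_and_completeness": 0.70,
    "clarity_and_usefulness": 0.85, "stability_over_time": 0.90,
})

# Comparing overall scores across saved runs gives the trend direction.
history = [("run_1", 0.81), ("run_2", 0.84), ("run_3", round(latest, 2))]
for (_, prev), (run_id, cur) in zip(history, history[1:]):
    trend = "up" if cur > prev else "down" if cur < prev else "steady"
    print(run_id, trend)
```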

How This Drives Improvement

CarysBench supports a continuous loop: run evals, inspect weak spots, implement fixes, and rerun. This is how we maintain and improve Carys performance over time with evidence rather than guesswork.