Evals · Innotalent

A useful eval set is small, curated and boring on purpose: a few dozen to a few hundred real inputs with known expected behaviour, run automatically on every change. Some checks are deterministic (does the JSON parse, does the answer contain the required field), some are graded by another model, and some need a human in the loop for the genuinely judgemental cases. The shape matters less than the discipline of running them.
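As a rough illustration of the deterministic kind of check, here is a minimal sketch in Python. The names (`EvalCase`, `parses_as_json`, `has_field`, `run_evals`) and the hard-coded outputs are hypothetical stand-ins; in a real harness the output would come from calling the model under test.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    model_output: str  # assumption: captured from the model in a real run
    check: Callable[[str], bool]


def parses_as_json(output: str) -> bool:
    """Deterministic check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_field(field: str) -> Callable[[str], bool]:
    """Build a check that the output is a JSON object containing `field`."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and field in data
    return check


def run_evals(cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case and report pass/fail by name."""
    return {case.name: case.check(case.model_output) for case in cases}


# Hypothetical examples; a real set would hold dozens of real inputs.
CASES = [
    EvalCase("valid_json", '{"answer": "42"}', parses_as_json),
    EvalCase("has_answer_field", '{"answer": "42"}', has_field("answer")),
    EvalCase("missing_field", '{"reply": "42"}', has_field("answer")),
    EvalCase("broken_json", '{"answer": ', parses_as_json),
]

results = run_evals(CASES)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Wiring a script like this into CI, so that it runs on every change, is what turns a list of examples into an eval set. Model-graded and human-reviewed checks could sit behind the same `check` interface.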

Evals sit one layer above prompt engineering and one layer below MLOps. They are how you catch a regression when a fine-tuned model behaves differently on edge cases, or when a new prompt accidentally raises the hallucination rate on a known-hard slice of inputs.
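One way to catch that kind of regression is to compare failure rates per slice between a baseline run and a candidate run. The sketch below assumes each result is a `(slice_name, passed)` pair; the slice labels, the sample data, and the 0.05 tolerance are illustrative assumptions, not recommended values.

```python
from collections import defaultdict


def failure_rate_by_slice(results):
    """results: iterable of (slice_name, passed) pairs from one eval run."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        if not passed:
            failures[slice_name] += 1
    return {name: failures[name] / totals[name] for name in totals}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return slices whose failure rate rose by more than `tolerance`."""
    base = failure_rate_by_slice(baseline)
    cand = failure_rate_by_slice(candidate)
    regressions = {}
    for name, rate in cand.items():
        if name in base and rate - base[name] > tolerance:
            regressions[name] = (base[name], rate)
    return regressions


# Hypothetical runs: the new prompt is unchanged on "easy" inputs
# but fails more often on the "known_hard" slice.
baseline = [("easy", True)] * 4 + [
    ("known_hard", True), ("known_hard", True),
    ("known_hard", True), ("known_hard", False),
]
candidate = [("easy", True)] * 4 + [
    ("known_hard", True), ("known_hard", False),
    ("known_hard", False), ("known_hard", False),
]

regressions = find_regressions(baseline, candidate)
for name, (before, after) in regressions.items():
    print(f"{name}: failure rate {before:.0%} -> {after:.0%}")
```

Reporting per slice matters because an aggregate score can stay flat while a small, hard subset gets noticeably worse.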

The honest take: evals are the most under-invested part of most AI projects, and the most diagnostic. Teams will spend weeks tuning a prompt by vibe before they spend a day building the test set that would have told them whether any of it worked. Shipping LLM features without evals is shipping blind. The rule of thumb is to build the eval set on day one, even if it only has twenty examples, and grow it every time something breaks.
