eva // weekly legal tech digest

Evals & AI Quality in Legal

Measuring whether legal AI tools actually work — beyond demos and vendor claims.

Current understanding

"Evals & Benchmarking Legal AI with Anna Guo" is the central piece: most procurement decisions for legal AI rest on demos and vendor relationships, backed by vague accuracy claims and almost no rigorous measurement. Guo argues that the field needs evals that match the actual work, not generic LLM benchmarks. The corpus broadly agrees but offers little on what those evals would look like in production.

"The View from the Interface with Kevin Cohn" supplies an unexpected data source: Brightflag's invoice data is effectively a panoramic eval of how legal work flows, what gets billed, and where AI is (or isn't) replacing hours. The dataset reveals AI-assisted billing patterns that look "not the most above board": suspiciously robotic six-minute increments. This is one of the few empirical anchors in the corpus.

The corpus has not yet produced a benchmark, eval framework, or shared dataset that the legal AI community can rally around. The gap is named but not filled.

Tensions

Mino relevance

Mino needs eval scaffolding for its agents to be credible, and this is an opportunity, not a cost. Publishing per-agent evals (what the agent is good at, where it fails, and against what benchmark) is a differentiator because the rest of the market refuses to do it. Strategic implication: build evals in from day one, publish them, and reference Anna Guo's framing in public messaging. Long term, Mino could host a shared eval registry for specialist legal agents, which would be a community good and a moat at the same time.
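
As a rough illustration of what a published per-agent eval could contain, the sketch below rolls graded test cases up into a scorecard: pass rate per category of work, per agent, per benchmark. Everything in it is an assumption made for illustration. The `EvalCase` fields, the function names, and the 0.8 threshold are hypothetical and come from neither the sources nor any existing Mino code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One graded task. All field names are hypothetical."""
    agent: str       # which agent produced the answer
    benchmark: str   # name of the task set the case belongs to
    category: str    # kind of work being tested
    passed: bool     # verdict from a reviewer or automated grader


def scorecard(cases):
    """Return {(agent, benchmark): {category: (passed, total, pass_rate)}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for case in cases:
        cell = counts[(case.agent, case.benchmark)][case.category]
        cell[0] += int(case.passed)  # passes in this category
        cell[1] += 1                 # cases seen in this category
    return {
        key: {cat: (p, n, p / n) for cat, (p, n) in cats.items()}
        for key, cats in counts.items()
    }


def failures(card, threshold=0.8):
    """List (agent, benchmark, category, pass_rate) below threshold, worst first."""
    rows = [
        (agent, benchmark, category, rate)
        for (agent, benchmark), cats in card.items()
        for category, (_, _, rate) in cats.items()
        if rate < threshold
    ]
    return sorted(rows, key=lambda row: row[3])
```

Calling `scorecard(cases)` on a list of graded cases gives the "what the agent is good at" view, and `failures(card)` gives the "where it fails" view. Keying both by benchmark name keeps the third element, "against what benchmark", attached to every number that gets published.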

Sources

2
