2 sources · Last updated May 19, 2026

Evals & AI Quality in Legal

Measuring whether legal AI tools actually work — beyond demos and vendor claims.

Current understanding

Evals & Benchmarking Legal AI with Anna Guo is the central piece: most procurement decisions for legal AI are made on demos and vendor relationships, with vague accuracy claims and almost no rigorous measurement. Anna Guo's argument is that the field needs evals that match the actual work, not generic LLM benchmarks. The corpus broadly agrees but offers little on what those evals would look like in production.

The View from the Interface with Kevin Cohn supplies an unexpected data source: Brightflag's invoice data is effectively a panoramic eval of how legal work flows, what's billed, and where AI is (or isn't) replacing hours. The dataset reveals AI-assisted billing patterns that look "not the most above board": suspiciously robotic six-minute increments. This is one of the few empirical anchors in the corpus.

The corpus has not yet produced a benchmark, eval framework, or shared dataset that the legal AI community can rally around. The gap is named but not filled.

Tensions

Generic legal benchmarks (bar exam questions, contract clause classification) don't predict performance on the work lawyers actually do. But task-specific evals don't generalize. The market needs both and has neither.
Vendors won't publish evals that make them look bad. Buyers won't pay for evals before purchase. The information asymmetry persists.
The Brightflag data is one firm's view. There's no industry-wide measurement infrastructure for legal AI value capture.

Mino relevance

Mino has to provide eval scaffolding for its agents to be credible, and this is an opportunity, not a cost. Publishing per-agent evals (what the agent is good at, where it fails, against what benchmark) is a differentiator because the rest of the market refuses to do it. Strategic implication: build evals in from day one, publish them, and reference Anna Guo's framing in the public messaging. Long-term, Mino could host a shared eval registry for specialist legal agents, which would be both a community good and a moat.
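
As a rough illustration of what a published per-agent eval could record (what the agent is good at, where it fails, against what benchmark), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names, the field names, the 0.9 threshold, and the example agent, tasks, and numbers are hypothetical and do not come from the sources or describe any existing Mino format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Outcome of one eval task. All field names are illustrative."""
    task: str        # hypothetical task name, e.g. "clause-extraction"
    benchmark: str   # the benchmark or test set the task was scored against
    passed: int      # number of cases the agent got right
    total: int       # number of cases attempted

    @property
    def pass_rate(self) -> float:
        # Guard against an empty task so the report never divides by zero.
        return self.passed / self.total if self.total else 0.0


@dataclass
class AgentEvalReport:
    """A publishable per-agent summary built from task results."""
    agent: str
    results: list[TaskResult] = field(default_factory=list)

    def strengths(self, threshold: float = 0.9) -> list[str]:
        # Tasks at or above the (assumed) threshold count as strengths.
        return [r.task for r in self.results if r.pass_rate >= threshold]

    def weaknesses(self, threshold: float = 0.9) -> list[str]:
        # Tasks below the threshold are reported as known failure areas.
        return [r.task for r in self.results if r.pass_rate < threshold]

    def to_text(self) -> str:
        lines = [f"Eval report: {self.agent}"]
        for r in self.results:
            lines.append(
                f"  {r.task} ({r.benchmark}): "
                f"{r.passed}/{r.total} = {r.pass_rate:.0%}"
            )
        return "\n".join(lines)


# Made-up example data, for illustration only.
report = AgentEvalReport(
    agent="contract-review-agent",
    results=[
        TaskResult("clause-extraction", "internal-nda-set", 47, 50),
        TaskResult("governing-law-identification", "internal-nda-set", 30, 50),
    ],
)
print(report.to_text())
```

The point of the sketch is that failures are part of the published record rather than something filtered out: `weaknesses()` is as much a first-class output as `strengths()`. A shared registry could, under the same assumptions, be a collection of such reports keyed by agent and benchmark.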

Sources

ai-trust-and-output-verification vibe-coding-and-self-built-tools law-firm-business-model
