Evals & Benchmarking Legal AI with Anna Guo

Most conversations about legal AI tools end the same way: impressive demos, vague claims about accuracy, and procurement decisions made on vibes and vendor relationships.

January 21, 2026 · 569 words

[Read on Substack](https://lawwhatsnext.substack.com/p/evals-and-benchmarking-legal-ai-with) · 2026-01-21 · Law What's Next

---

Happy New Year, friends 👋

A quick reminder: Law://WhatsNext is our vehicle to explore, through dialogue (or occasional reflection), how leading lawyers, educators and technologists are using emerging tech to evolve how we practice and administer legal services. No hype, just practical conversations.

🎙️ This week we were lucky enough to spend some time with Anna Guo, a Singapore-based lawyer, startup advisor, and founder of LegalBenchmarks.ai, who has quietly built one of the most rigorous practitioner-driven evaluation frameworks for legal AI tools in the industry. Her community now spans close to 900 legal and AI professionals, and her research has produced findings that challenge industry assumptions, including:

- that legal-specific AI tools don’t always outperform general-purpose models,
- that accuracy isn’t actually the top driver of lawyer adoption, and
- that in some drafting tasks, AI is already matching or exceeding human reliability.

## Listen Now

Available here or on Spotify, Apple Podcasts, or wherever you enjoy your podcasts.

This is a watch-don’t-only-listen episode. Anna shares her screen throughout, running us through a live, double-blind benchmarking exercise where we rank outputs from legal AI, general-purpose AI, and human lawyers without knowing which is which. She also demonstrates how prompt injection attacks can bypass AI guardrails using techniques as simple as low-resource languages (Vietnamese or ASCII code?), surfacing security risks that become particularly acute as we move closer toward widespread agentic AI adoption.
## What You’ll Learn

- The Three Dimensions of Tool Evaluation: Why measuring accuracy alone misses the point, and how Anna assesses output reliability, output usefulness, and platform workflow support as distinct layers
- What Actually Drives Adoption: Survey data revealing that lawyers prioritise context management and verification over raw accuracy when choosing AI tools
- Where Humans Still Win: High-judgment, context-sparse tasks requiring commercial reasoning remain firmly in human territory; routine, context-complete work is where AI excels
- Prompt Injection in Practice: Live demonstrations of how attackers can trick AI models into revealing harmful information using low-resource languages and clever framing

## Key References from Our Conversation

- LegalBenchmarks.ai: Anna’s practitioner-driven platform for evaluating AI tool performance on real legal work. The community has grown to nearly 900 legal and AI professionals and has published two major benchmarking reports.
- Benchmarking Humans & AI in Contract Drafting: The September 2025 report Anna references throughout the conversation. Key finding: the top-performing AI tool (Gemini 2.5 Pro) achieved 73.3% reliability, marginally outperforming the best human lawyer at 70%.
- GDPval (OpenAI): OpenAI’s benchmark for evaluating AI on real-world, economically valuable tasks across 44 occupations, including legal work. Tom references this as an example of how the industry is moving toward measuring deliverables rather than just text outputs.
- Prompt Injection: Ranked #1 in OWASP’s 2025 Top 10 for LLM Applications. Anna demonstrates how attackers can use low-resource languages and clever framing to bypass AI guardrails, a risk that becomes particularly acute as AI systems gain more autonomy and access.
- The Art of Modern Legal Warfare: Anna’s fictional story series (developed with security collaborators including Brock and George Zeller) illustrating how AI can be exploited for harmful purposes. Each story is backed by academic research on the relevant vulnerabilities.

## Connect with Anna

On LinkedIn or through the LegalBenchmarks.ai Community.

We hope you enjoy this conversation as much as we did! Do share it with others who you feel would benefit from listening and would enjoy Anna’s insight and cutting-edge work 🤗

Tom & Alex