eva // weekly legal tech digest
reddit · r/legaltech · AcanthisittaHorror86

We designed an AI rulebook for contract review instead of just prompting GPT. Here's what we learned after a year of iteration.

May 7, 2026
Implementation Story

Post

Bit of a long one, but hopefully useful for anyone building in this space or evaluating tools.

Background: I've been working on AI-assisted contract due diligence for about a year now. Not as a lawyer, more on the engineering and product side, working closely with in-house legal teams. What I'm sharing here isn't theory, it's stuff we got wrong first and fixed later.

**Why the obvious approach breaks down**

The instinct when you first start is to give the model a checklist. Flag unlimited liability. Flag warranty terms over 3 years. Flag missing governing law. Prompt-engineer your way to a solution.

Works okay on demo contracts. Falls apart on real ones.

The problem is the model doesn't know why something is a risk. So when it hits an edge case, a clause that's technically fine in isolation but problematic given the rest of the contract, it either misses it or flags it without useful context. A lawyer reading "flagged: warranty clause" can't do anything with that. They need to know whether this specific clause in this specific deal is actually a problem for their business.

Generic AI treats all contracts the same. Real contracts are not the same.

**The shift that actually helped: teaching the WHY**

We restructured how we encoded legal knowledge. Instead of a flat list of rules, every rule now has three components.

Definition: the precise linguistic pattern that triggers a flag. Not "long warranty" but "warranty duration stated or implied to exceed 36 months". Specific enough that the model can pattern match reliably.

Rationale: the business logic behind the rule. Why does warranty duration matter past 36 months? Because it creates open-ended exposure for latent defects that surface after the normal product lifecycle. This isn't for the lawyer, it's for the model. When the agent understands why a rule exists, its output rationale actually tracks back to real business risk instead of generic legal commentary.
Examples: 1 to 2 actual sentence excerpts from real contracts that represent the pattern. This was the biggest unlock for us. Abstract rule definitions are hard for models to apply consistently. Concrete linguistic examples are much easier to match against.

When you structure it this way, the model stops matching on surface patterns alone and starts reasoning about the clause. Small distinction in theory, massive difference in output quality.

**Ditch binary classification**

"Good clause, bad clause" is not how legal risk actually works. We moved to three tiers.

Green: favors your business and is low risk, so you move on.

Orange: acceptable under certain conditions and needs a human decision.

Red: non-negotiable, so you push back, redline, or walk.

The middle tier is where most of the interesting work lives. A 2x liability cap might be fine on a $500K contract and completely unacceptable on a $10M one. The rule has to encode that conditionality or the model can't make a meaningful call. It'll just flag everything orange and create more work than it saves.

**The multi-agent thing nobody talks about**

A single model reviewing an entire contract hits a ceiling pretty fast. The context window problem is obvious, but the less obvious problem is specialization. One model can't be an expert in warranty law, IP indemnification, data privacy obligations, force majeure, and governing law simultaneously, at least not with the precision you need for real due diligence. It knows a bit about everything. You need something that knows a lot about one thing.

We ended up splitting into domain-specific agents. Each one only analyzes clauses in its category. Warranty agent looks at warranty clauses. IP agent looks at IP clauses. And so on.

Then you need an orchestration layer, because a single paragraph in a contract can trigger three different agents at once and you can't flag everything at the same severity.
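To make the orchestration problem concrete, here is a minimal Python sketch of one way overlapping flags could be resolved. It is not our production logic. The `Finding` fields, the agent names, the 0.6 confidence cutoff, and the "most severe tier wins, ties broken by confidence" policy are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    GREEN = 0   # favors your business, low risk
    ORANGE = 1  # acceptable under conditions, needs a human decision
    RED = 2     # non-negotiable: push back, redline, or walk


@dataclass(frozen=True)
class Finding:
    agent: str         # hypothetical agent name, e.g. "warranty"
    paragraph_id: str  # contract paragraph the finding refers to
    tier: Tier
    confidence: float  # 0.0 to 1.0, self-reported by the agent (assumed)
    rationale: str


def resolve(findings: list[Finding], min_confidence: float = 0.6) -> dict[str, Finding]:
    """Pick one finding to surface per paragraph.

    Assumed policy: ignore findings below min_confidence, then prefer
    the most severe tier, breaking ties by higher confidence.
    """
    surfaced: dict[str, Finding] = {}
    for f in findings:
        if f.confidence < min_confidence:
            continue
        best = surfaced.get(f.paragraph_id)
        if best is None or (f.tier, f.confidence) > (best.tier, best.confidence):
            surfaced[f.paragraph_id] = f
    return surfaced


# Three agents fire on the same paragraph; a fourth flags a different one.
findings = [
    Finding("warranty", "p12", Tier.ORANGE, 0.82, "placeholder rationale"),
    Finding("ip", "p12", Tier.RED, 0.71, "placeholder rationale"),
    Finding("privacy", "p12", Tier.RED, 0.40, "placeholder rationale"),
    Finding("governing_law", "p30", Tier.GREEN, 0.95, "placeholder rationale"),
]
result = resolve(findings)
```

In this toy version a low-confidence red is simply dropped. A real system would more likely route it to a human than discard it, so treat the cutoff as a stand-in for whatever escalation policy you choose.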
The orchestrator compares what each agent found, looks at their confidence levels, and decides what to surface. That conflict resolution step is unglamorous and took longer to get right than anything else in the system.

**What's still genuinely hard**

Confidence calibration. Getting an agent to say "I'm 70% confident this is high risk" in a way that's consistent and meaningful across different contract types is still not solved cleanly. A model's internal uncertainty and real-world legal risk don't map to each other neatly.

Rulebook maintenance. As business conditions change and case law shifts, the knowledge base needs updating. This is a human process. Anyone claiming their system keeps itself current automatically is exaggerating.

Getting lawyers to trust it. This is maybe the hardest one. The output has to be explainable: not just "this clause is risky" but here's why, here's the specific language, here's what we'd suggest instead. Without that, adoption stalls regardless of how accurate the model is.

Curious whether anyone here has experimented with graph-based knowledge representations instead of flat vector search for legal reasoning. Been thinking about whether relationship modeling between clauses changes anything meaningfully.
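For anyone who wants something concrete, here is a rough Python sketch of the three-part rule and the conditional tiering from earlier in the post. It is an illustration under stated assumptions, not our actual schema. The field names, the `to_prompt` rendering, the sample contract sentence (invented, not from a real contract), and the dollar thresholds are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    GREEN = 0
    ORANGE = 1
    RED = 2


@dataclass(frozen=True)
class Rule:
    name: str
    definition: str            # precise linguistic trigger
    rationale: str             # business logic, written for the model
    examples: tuple[str, ...]  # 1 to 2 excerpts that show the pattern

    def to_prompt(self) -> str:
        """Render the rule as text an agent could be given (assumed format)."""
        lines = [
            f"Rule: {self.name}",
            f"Definition: {self.definition}",
            f"Rationale: {self.rationale}",
            "Examples:",
        ]
        lines += [f"- {ex}" for ex in self.examples]
        return "\n".join(lines)


WARRANTY_DURATION = Rule(
    name="warranty_duration",
    definition="Warranty duration stated or implied to exceed 36 months.",
    rationale=(
        "Creates open-ended exposure for latent defects that surface "
        "after the normal product lifecycle."
    ),
    # Invented sentence for illustration only.
    examples=("Seller warrants the Goods for sixty (60) months from delivery.",),
)


def liability_cap_tier(contract_value_usd: float, cap_multiple: float) -> Tier:
    """Tier a liability cap by total exposure, not by the multiple alone.

    The $1M and $5M thresholds are hypothetical; a real rulebook would
    take them from the business's own risk appetite.
    """
    exposure = contract_value_usd * cap_multiple
    if exposure <= 1_000_000:
        return Tier.GREEN
    if exposure <= 5_000_000:
        return Tier.ORANGE
    return Tier.RED


prompt = WARRANTY_DURATION.to_prompt()
small_deal = liability_cap_tier(500_000, 2.0)
large_deal = liability_cap_tier(10_000_000, 2.0)
```

The point of `liability_cap_tier` is that the same 2x cap lands in different tiers depending on deal size, which is the conditionality a flat "flag liability caps" rule can't express.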

Top comments · 7