We designed an AI rulebook for contract review instead of just prompting GPT. Here's what we learned after a year of iteration.
Implementation Story
Bit of a long one but hopefully useful for anyone building in this space or evaluating tools.
Background: I've been working on AI assisted contract due diligence for about a year now. Not as a lawyer, more on the engineering and product side, working closely with in house legal teams. What I'm sharing here isn't theory, it's stuff we got wrong first and fixed later.
**Why the obvious approach breaks down**
The instinct when you first start is to give the model a checklist. Flag unlimited liability. Flag warranty terms over 3 years. Flag missing governing law. Prompt engineer your way to a solution.
Works okay on demo contracts. Falls apart on real ones.
The problem is the model doesn't know why something is a risk. So when it hits an edge case, a clause that's technically fine in isolation but problematic given the rest of the contract, it either misses it or flags it without useful context. A lawyer reading "flagged: warranty clause" can't do anything with that. They need to know whether this specific clause in this specific deal is actually a problem for their business.
Generic AI treats all contracts the same. Real contracts are not the same.
**The shift that actually helped: teaching the WHY**
We restructured how we encoded legal knowledge. Instead of a flat list of rules, every rule now has three components.
Definition, which is the precise linguistic pattern that triggers a flag. Not "long warranty" but "warranty duration stated or implied to exceed 36 months." Specific enough that the model can match it reliably.
Rationale, which is the business logic behind the rule. Why does warranty duration matter past 36 months? Because it creates open ended exposure for latent defects that surface after the normal product lifecycle. This isn't for the lawyer, it's for the model. When the agent understands why a rule exists, its output rationale actually tracks back to real business risk instead of generic legal commentary.
Examples, which are 1 to 2 actual sentence excerpts from real contracts that represent the pattern. This was the biggest unlock for us. Abstract rule definitions are hard for models to apply consistently. Concrete linguistic examples are much easier to match against.
When you structure it this way the model stops matching on surface patterns alone and starts reasoning about whether the risk behind the rule actually applies. Small distinction in theory, massive difference in output quality.
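To make the three part structure concrete, here's a minimal sketch of what one rulebook entry could look like as data. This is not our actual schema. The `Rule` class, its field names, the `to_prompt` helper, and the rule id are all made up for illustration, and the example excerpt is invented, not taken from a real contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical rulebook entry: definition, rationale, examples."""
    rule_id: str
    definition: str            # precise linguistic trigger
    rationale: str             # business logic, written for the model
    examples: tuple[str, ...]  # 1 to 2 sentence excerpts

    def __post_init__(self) -> None:
        # Enforce the "1 to 2 excerpts" convention described above.
        if not 1 <= len(self.examples) <= 2:
            raise ValueError("each rule should carry 1 to 2 example excerpts")

    def to_prompt(self) -> str:
        """Render the rule as a plain-text fragment for an agent prompt."""
        lines = [
            f"Rule {self.rule_id}",
            f"Definition: {self.definition}",
            f"Rationale: {self.rationale}",
            "Examples:",
        ]
        lines += [f'- "{ex}"' for ex in self.examples]
        return "\n".join(lines)


warranty_rule = Rule(
    rule_id="WARRANTY-DURATION",  # hypothetical identifier
    definition="Warranty duration stated or implied to exceed 36 months.",
    rationale=(
        "Creates open ended exposure for latent defects that surface "
        "after the normal product lifecycle."
    ),
    # Invented excerpt for illustration only.
    examples=(
        "Seller warrants the Goods against defects for sixty (60) months.",
    ),
)

print(warranty_rule.to_prompt())
```

Keeping the rationale as its own field means it can be injected into the agent's context separately from the trigger pattern, which is the part that lets the output explanation point back at a business reason.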
**Ditch binary classification**
"Good clause, bad clause" is not how legal risk actually works.
We moved to three tiers. Green, which favors your business and is low risk, so you move on. Orange, which is acceptable under certain conditions and needs a human decision. Red, which is non negotiable, so you push back, redline, or walk.
The middle tier is where most of the interesting work lives. A 2x liability cap might be fine on a $500K contract and completely unacceptable on a $10M one. The rule has to encode that conditionality or the model can't make a meaningful call. It'll just flag everything orange and create more work than it saves.
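Here's a rough sketch of what encoding that conditionality could look like for the liability cap example. The dollar thresholds are placeholders I picked so the two cases above land in different tiers. They are not real guidance, and `liability_cap_tier` is a hypothetical function name.

```python
from enum import Enum


class Tier(Enum):
    GREEN = "green"    # low risk, move on
    ORANGE = "orange"  # acceptable under conditions, needs a human decision
    RED = "red"        # push back, redline, or walk


# Placeholder thresholds on total capped exposure, in dollars (assumptions).
GREEN_MAX_EXPOSURE = 1_000_000
ORANGE_MAX_EXPOSURE = 5_000_000


def liability_cap_tier(cap_multiple: float, contract_value: float) -> Tier:
    """Tier a liability cap by the absolute exposure it implies.

    The same multiple produces a different tier depending on deal size,
    which is the conditionality a flat rule cannot express.
    """
    exposure = cap_multiple * contract_value
    if exposure <= GREEN_MAX_EXPOSURE:
        return Tier.GREEN
    if exposure <= ORANGE_MAX_EXPOSURE:
        return Tier.ORANGE
    return Tier.RED


print(liability_cap_tier(2, 500_000))     # Tier.GREEN
print(liability_cap_tier(2, 10_000_000))  # Tier.RED
```

In practice the condition would likely depend on more than deal size, but even one explicit variable gives the model something to evaluate instead of defaulting to orange.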
**The multi agent thing nobody talks about**
Single model reviewing an entire contract hits a ceiling pretty fast. The context window problem is obvious but the less obvious problem is specialization.
One model can't be an expert in warranty law, IP indemnification, data privacy obligations, force majeure, and governing law simultaneously, at least not with the precision you need for real due diligence. It knows a bit about everything. You need something that knows a lot about one thing.
We ended up splitting into domain specific agents. Each one only analyzes clauses in its category. Warranty agent looks at warranty clauses. IP agent looks at IP clauses. And so on.
Then you need an orchestration layer, because a single paragraph in a contract can trigger three different agents at once and you can't flag everything at the same severity. The orchestrator compares what each agent found, looks at their confidence levels, and decides what to surface. That conflict resolution step is unglamorous and took longer to get right than anything else in the system.
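A minimal sketch of that conflict resolution step, under a deliberately simple policy that is an assumption and not what we actually run: drop findings below a confidence floor, then for each clause surface the most severe remaining finding, breaking ties by confidence. `Finding`, `resolve`, and the 0.5 floor are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Ordering used to compare tiers; higher means more severe.
SEVERITY = {"green": 0, "orange": 1, "red": 2}


@dataclass(frozen=True)
class Finding:
    agent: str         # which domain agent produced this
    clause_id: str     # which clause or paragraph it refers to
    tier: str          # "green", "orange", or "red"
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


def resolve(findings: list[Finding], min_confidence: float = 0.5) -> dict[str, Finding]:
    """Pick one finding to surface per clause.

    Assumed policy: ignore findings under min_confidence, then take the
    most severe tier, using confidence as the tie breaker.
    """
    by_clause: dict[str, list[Finding]] = defaultdict(list)
    for f in findings:
        if f.confidence >= min_confidence:
            by_clause[f.clause_id].append(f)
    return {
        clause_id: max(group, key=lambda f: (SEVERITY[f.tier], f.confidence))
        for clause_id, group in by_clause.items()
    }


findings = [
    Finding("warranty", "7.2", "orange", 0.8),
    Finding("ip", "7.2", "red", 0.6),
    Finding("privacy", "7.2", "red", 0.3),  # below the floor, ignored
]
surfaced = resolve(findings)
print(surfaced["7.2"].agent, surfaced["7.2"].tier)  # ip red
```

A real orchestrator would probably keep the suppressed findings around for audit rather than discarding them, since lawyers will want to see what was not surfaced and why.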
**What's still genuinely hard**
Confidence calibration. Getting an agent to say "I'm 70% confident this is high risk" in a way that's consistent and meaningful across different contract types is still not solved cleanly. A model's internal uncertainty and real world legal risk don't map to each other neatly.
Rulebook maintenance. As business conditions change, as case law shifts, the knowledge base needs updating. This is a human process. Anyone claiming their system keeps itself current automatically is exaggerating.
Getting lawyers to trust it. This is maybe the hardest one. The output has to be explainable: not just "this clause is risky" but here's why, here's the specific language, here's what we'd suggest instead. Without that, adoption stalls regardless of how accurate the model is.
Curious whether anyone here has experimented with graph based knowledge representations instead of flat vector search for legal reasoning. Been thinking about whether relationship modeling between clauses changes anything meaningfully.
Top comments · 7
- 5↑ u/RexDane: Thanks for the helpful insights. I’m a corporate lawyer using AI for a lot of tasks. Based on your findings, what practical advice would you recommend for making the best use of things like Harvey and Legora?
- 2↑ u/marryhaguire100: Totally agree with what you're saying about relationship modelling between clauses. That is one of the major things current tools fail to understand at a broad level, because they need to be fed vast amounts of context before they can link clauses together. But if we first teach the tool the basics of an agreement, meaning how inter-clause relationships work at a basic contractual level, and give it 2-3 different scenarios to start with, it would begin to understand the variability of each deal by itself. After some time you would only need to provide deal-specific info and it could give you contextual answers. Wdyt?
- 1↑ u/dreamlegal_legaltech: This is one of the better explanations of where “AI contract review” actually becomes useful instead of just impressive in demos. A few things here feel especially important:
  * The “why” layer matters more than the rule itself. Without rationale, models flag patterns. With rationale, they start connecting clauses to actual business exposure.
  * The orange tier is the real work. Most legal review is conditional judgment, not obvious red/green classification.
  * Specialization makes sense. A focused warranty/IP/privacy agent is probably more reliable than one general reviewer trying to reason about everything simultaneously.
  * Explainability is underrated. Lawyers trust systems that show language, reasoning, and consequences, not just labels.

  On the graph point, relationship modeling probably matters most where clauses interact indirectly. Limitation of liability, indemnity, insurance, and termination clauses can completely change each other’s practical meaning even if each clause individually looks acceptable. That feels like one of the bigger unsolved gaps in current contract AI systems.
- 1↑ u/ArmOfRickAllen: Even with the setup as you describe, I would have to imagine that there is enough flagged in any given contract, combined with low enough confidence levels, that I would just want to review the contract myself. I think that is a hurdle all contract review products will have to overcome for any self-respecting attorney (and any attorney attempting to comply with the rules of professional responsibility).
- 1↑ u/respeckKnuckles: You've rediscovered rule based expert systems. AI from the 1990s is back baby!
- 1↑ u/tommytmopar: Teaching the why makes so much more sense than just feeding it a checklist. Generic AI flags stuff that looks wrong on paper but misses the actual business risk. Your approach sounds like it actually helps lawyers think instead of just giving them more garbage to read. The graph idea for clause relationships would be a real next step.
- 1↑ u/room9guy: I’m an in house commercial lawyer at a tech company and working with our engineering teams. We are early in the same journey you’ve taken as described in your post. Your write up is thought provoking and insightful. The system of domain specific agents and the orchestration layer that you designed seems a better way to scale in a high volume enterprise environment and across multiple deal types. Do you think it will hold up over time as model intelligence expands? Specifically, I’m curious if the more complex scaffolding that you’ve designed might be limiting to LLMs that are smart enough to be an expert in multiple domains and also weigh conflicts/priorities. Some might argue that today’s best models can already do that.