5 tiers of AI system design for lawyers and small businesses, sorted by privacy tolerance: LegalBench leaders and tradeoffs
*As a legal AI startup, we keep seeing confusion when lawyers, business owners, and other professionals try to work out how to evaluate and pick AI tools for their legal tasks. So we put together an overview of the options, framed around privacy - a simple framework for comparing them.*
LegalBench scores below are from vals.ai (April 2026 update). I list the top 3 models in each tier where a comparable benchmark applies.
# Tier 1: Agentic AI "co-workers"
**What it is:** Tools that take action - read your screen, navigate the browser, click through documents, draft inside Word. They run as a desktop app or browser extension and have access to your local files, online accounts, and live web.
**Examples:** Claude in Chrome / Computer Use, Perplexity Comet, ChatGPT Atlas / Operator, Cursor (for desk research)
**Models behind them:** Whatever the vendor wires in - typically Claude Sonnet 4.6, GPT-5.x, Gemini 3 Pro
**Setup:** Easy. Install extension or app, sign in, grant permissions. \~5 minutes.
**Cost:** $20–$200/user/mo
**Privacy:** **Lowest.** Agents screenshot, read local files, and stream them to the vendor's cloud. Some offer enterprise tiers with no-training guarantees, but you're trusting a third party with raw work product. Verify your firm's policies before letting one of these touch a client folder.
**Productivity:** **Highest.** Actual work gets done - not just text suggestions.
**Support:** Easy. Vendor handles it.
**Best fit:** Solo practitioners, in-house teams with permissive data policies, anything pre-discovery or non-confidential.
# Tier 2: General-purpose proprietary chat
**What it is:** Direct chat interfaces: ChatGPT, Claude, Gemini app, Grok. You paste, you ask, you copy back.
**Top 3 by LegalBench:**
* **Gemini 3.1 Pro Preview** \- 87.40% ($2 / $12 per 1M tokens)
* **Gemini 3 Pro** \- 87.04% ($2 / $12)
* **Gemini 3 Flash** \- 86.86% ($0.50 / $3) ← best price/performance
For reference: GPT 5.5 ranks 4th (86.52%, $5/$30), Claude Opus 4.6 (Thinking) ranks 8th (85.30%, $5/$25).
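To make the "best price/performance" label concrete, here is a rough Python sketch that blends input and output prices and divides the LegalBench score by the result. The scores and prices are the ones quoted above; the 75/25 input-to-output token split and the "points per dollar" metric are assumptions for illustration, not something vals.ai publishes.

```python
# (name, LegalBench %, input $/1M tokens, output $/1M tokens), copied from the post
MODELS = [
    ("Gemini 3.1 Pro Preview", 87.40, 2.00, 12.00),
    ("Gemini 3 Pro", 87.04, 2.00, 12.00),
    ("Gemini 3 Flash", 86.86, 0.50, 3.00),
    ("GPT 5.5", 86.52, 5.00, 30.00),
    ("Claude Opus 4.6 (Thinking)", 85.30, 5.00, 25.00),
]

# Assumption: 75% of billed tokens are input (long documents, short answers).
INPUT_SHARE = 0.75


def blended_price(input_price, output_price, input_share=INPUT_SHARE):
    """Weighted $/1M tokens for a given input/output token mix."""
    return input_price * input_share + output_price * (1 - input_share)


def rank_by_value(models, input_share=INPUT_SHARE):
    """Sort models by LegalBench points per blended dollar, best first."""
    rows = [
        (name, score, blended_price(inp, out, input_share))
        for name, score, inp, out in models
    ]
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


for name, score, price in rank_by_value(MODELS):
    print(f"{name:28s} {score:6.2f}%  ${price:5.2f}/1M blended  {score / price:6.1f} pts per $")
```

Gemini 3 Flash stays on top for any token split, because it costs a quarter of the Pro price on both input and output while scoring about half a point lower. The relative order of the more expensive models does depend on the split you assume.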
**Setup:** Easy. Sign up, log in.
**Cost:** $20–30/mo on consumer plans; $25–200/user/mo on enterprise tiers
**Privacy:** **Low–medium.** Consumer tiers often train on your inputs unless you opt out. Enterprise/Team tiers contractually exclude training and offer DPAs (sometimes BAAs). None of these will sign a no-sub-processor commitment - you're transitively trusting OpenAI/Anthropic/Google's vendor stack.
**Productivity:** High. Frontier-grade models, broad capability, no legal-specific tuning.
**Support:** Easy. Vendor handles it.
**Best fit:** Non-confidential research, public-data analysis, drafting boilerplate, learning. Not appropriate for client work without enterprise contracts and a documented policy review.
# Tier 3: Privacy-improved or legal-specific platforms
**What it is:** Vendors that wrap proprietary or open models with stricter data handling - DPAs by default, no-training defaults, sometimes EU-only hosting, sometimes legal-specific tuning (clause libraries, redlining, citation grounding).
**Examples:**
* *Legal-specific:* Harvey, Thomson Reuters CoCounsel, Spellbook, Justee AI
* *General privacy-first:* Lumo (Proton), Brave Leo
**Models behind them:** Often a mix. Some vendors fine-tune open-weight models on legal corpora; others route different tasks to different models - frontier models for drafting, cheap models for classification, specialized models for citation grounding - picking the optimal model per product layer. This flexibility is one reason a well-built Tier 3 platform can outperform Tier 2 on legal tasks despite drawing from the same underlying base models.
**What's different from Tier 2:** the data layer (what's logged, retained, trained on) and the application layer (legal-specific UX, evals, domain logic).
**Setup:** Easy–medium. Sign up, sometimes SSO/onboarding. 5–60 minutes.
**Cost:** Wide range: free at the low end, \~$19 to $600+ per user per month for paid plans. Free tiers exist (Justee has a free tier with paid plans from $19/user/mo - one of the most affordable solutions for SMB on the market; Lumo and Brave Leo are free for individuals). Paid plans run from \~$19/user/mo at the consumer end up to $500+/user/mo for full legal-specific enterprise tools (Harvey, CoCounsel).
**Privacy:** **Medium–high.** Real DPAs, no training on inputs, often regional hosting, published sub-processor lists. Still cloud - your data leaves your network - but with contractual guardrails and (for the better vendors) audit trails.
**Productivity:** High when the platform is genuinely tuned for legal workflows; only marginally better than Tier 2 if it's a thin wrapper.
**Support:** Easy. Vendor handles it.
**Best fit:** Firms and in-house teams that need cloud convenience but require contracts and policies that consumer chat can't satisfy.
# Tier 4: Self-hosted in your own cloud
**What it is:** You run the models in your own AWS, Google Cloud, or Azure account - via AWS Bedrock, Google Vertex AI, Azure OpenAI / Azure ML - or by deploying open-weight models on your own VMs.
**Top 3 open-weight by LegalBench:**
* **Qwen 3.5 Plus** \- 85.10% ($0.40 / $2.40 via API; deployable)
* **Kimi K2.6** \- 84.74% ($0.95 / $4 via API; deployable)
* **GLM 5.1** \- 84.39% ($1 / $3.20 via API; deployable)
Honest caveat: these are 100B+ parameter MoE models. "Self-hosting" them realistically means a managed service (AWS Bedrock, Google Vertex AI Model Garden, Azure ML, Together AI) inside your cloud account - not literally on-prem unless you have datacenter GPUs.
**Setup:** **Hard.** Cloud account, model deployment, API wrapper, application layer, evals. Days to weeks.
**Cost:** Pay per token + infrastructure. Typically $0.10–$5 per 1M tokens at scale, plus engineering time.
**Privacy:** **High.** Data stays in your cloud account. Sub-processors are limited to your cloud vendor (AWS / Azure / GCP) - typically already covered by your existing vendor approvals.
**Productivity:** Depends entirely on the application layer you build or buy. The model is there; the workflow isn't.
**Support:** Hard. You + cloud vendor + (optionally) the model provider's enterprise tier.
**Best fit:** Firms with engineering capacity and high data-sensitivity requirements, or those with strict GDPR / data-residency constraints.
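As a back-of-the-envelope check on the per-token range in the Cost line, this sketch multiplies a monthly token volume by a price per 1M tokens. The $0.10 and $5 endpoints come from the post; the document count and tokens-per-document figures are hypothetical placeholders, and infrastructure and engineering time are deliberately left out.

```python
def monthly_token_cost(docs_per_month, tokens_per_doc, price_per_million):
    """Token spend in dollars for one month at a flat $/1M-token price."""
    total_tokens = docs_per_month * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million


# Price range quoted in the post, in $ per 1M tokens.
LOW_PRICE, HIGH_PRICE = 0.10, 5.00

# Hypothetical workload: 2,000 documents a month at ~20,000 tokens each
# (prompt plus response). Replace with your own numbers.
DOCS_PER_MONTH = 2_000
TOKENS_PER_DOC = 20_000

low = monthly_token_cost(DOCS_PER_MONTH, TOKENS_PER_DOC, LOW_PRICE)
high = monthly_token_cost(DOCS_PER_MONTH, TOKENS_PER_DOC, HIGH_PRICE)
print(f"Token spend: ${low:,.2f} to ${high:,.2f} per month")
```

For that hypothetical workload the token bill lands between $4 and $200 a month, which suggests the engineering time mentioned above, not the tokens, is the line item to budget for.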
# Tier 5: Local AI
**What it is:** Models running on your own hardware. Nothing leaves the workstation.
**Tools:** Ollama, LM Studio, llama.cpp, vLLM - desktop apps and inference runtimes that load and run models locally.
**Models that actually fit consumer/prosumer hardware:** smaller Llama, Qwen, Mistral, Gemma variants. The frontier models on the LegalBench leaderboard mostly don't fit on a laptop. Realistic options for a 32–64 GB workstation are Llama-class 70B quantized or Qwen 32B-class - these aren't in the top 20 of LegalBench. **Expect a 10–15 percentage-point drop from frontier accuracy.**
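A quick way to sanity-check what fits: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below applies that rule of thumb. It counts weights only (KV cache and runtime overhead come on top), and the 80% usable-memory fraction is an assumption for illustration.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB.

    1B parameters at 8 bits per weight is about 1 GB, so scale from there.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, ram_gb, usable_fraction=0.8):
    """True if the weights fit in the assumed usable share of memory."""
    return weight_memory_gb(params_billions, bits_per_weight) <= ram_gb * usable_fraction


for label, params in [("70B-class", 70), ("32B-class", 32)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        verdict_32 = "fits" if fits(params, bits, 32) else "too big"
        verdict_64 = "fits" if fits(params, bits, 64) else "too big"
        print(f"{label} @ {bits:2d}-bit: {size:6.1f} GB  32 GB: {verdict_32:8s} 64 GB: {verdict_64}")
```

On these numbers a 4-bit 32B-class model is the practical ceiling for a 32 GB machine, and a 4-bit 70B-class model needs the 64 GB end of the range.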
**Setup:** **Hardest.** Hardware procurement, software install, model download, prompt engineering, your own UI. Hours to days minimum.
**Cost:** Hardware ($2K–$10K for a capable workstation; more for multi-GPU) + electricity. No per-token cost.
**Privacy:** **Highest.** Nothing leaves your machine.
**Productivity:** Lower than frontier - model quality is meaningfully worse, and you're building the workflow on top yourself.
**Support:** Hardest. You + open-source community.
**Best fit:** Highly sensitive matters, classified/government work, jurisdictions with strict data residency, or anyone unwilling to extend third-party trust at all.
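For a sense of where "building the workflow yourself" starts, here is a minimal sketch that sends one prompt to a locally running Ollama server over its HTTP API, using only the Python standard library. It assumes Ollama is listening on its default port 11434 and that a model has already been pulled; the model tag `qwen2.5:32b` and the helper names `build_request` and `ask_local` are placeholders, not a recommendation.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumption: swap in whatever model tag you have actually pulled.
DEFAULT_MODEL = "qwen2.5:32b"


def build_request(prompt, model=DEFAULT_MODEL):
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt, model=DEFAULT_MODEL, timeout=300):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `ask_local("Summarize this clause: ...")` returns the model's text. Everything after that (document parsing, prompt templates, a UI, evals) is the part you build yourself.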
# Aside: Wearable AI
Limitless, Plaud, Friend, Rabbit, Bee. Niche for legal work - most are meeting-capture devices, not document workflow. Privacy varies wildly (some local-only, most pipe to vendor cloud). Useful for client-meeting note synthesis if your jurisdiction's recording rules allow it. Not a substitute for any tier above.
# Quick comparison
|Tier|Privacy|Productivity|Setup|Cost|Support|
|:-|:-|:-|:-|:-|:-|
|1. Agentic co-workers|Lowest|Highest|Easy|$$|Easy|
|2. General chat|Low–Med|High|Easy|$|Easy|
|3. Privacy / legal-specific|Med–High|High|Easy–Med|$$–$$$|Easy|
|4. Own-cloud|High|Depends|Hard|$ at scale|Hard|
|5. Local|Highest|Lower|Hardest|$$$ upfront|Hardest|
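The table can also be read as a filter: set a privacy floor, decide whether you have engineers to run things, and see what is left. A minimal sketch, assuming the tier number doubles as the privacy rank (as in the ordering of this post) and that "no engineering capacity" rules out any tier whose Support column is not Easy. Both rules are simplifications for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    number: int  # also used as the privacy rank: 1 = least private, 5 = most
    name: str
    productivity: str
    setup: str
    support: str


# Rows transcribed from the comparison table.
TIERS = [
    Tier(1, "Agentic co-workers", "Highest", "Easy", "Easy"),
    Tier(2, "General chat", "High", "Easy", "Easy"),
    Tier(3, "Privacy / legal-specific", "High", "Easy–Med", "Easy"),
    Tier(4, "Own-cloud", "Depends", "Hard", "Hard"),
    Tier(5, "Local", "Lower", "Hardest", "Hardest"),
]


def candidate_tiers(min_privacy_tier, has_engineers):
    """Tiers at or above the privacy floor that the team can actually support."""
    picks = []
    for tier in TIERS:
        if tier.number < min_privacy_tier:
            continue  # below the privacy floor
        if not has_engineers and tier.support != "Easy":
            continue  # nobody in-house to run it
        picks.append(tier)
    return picks


# Example: a firm that needs at least Tier 3 privacy and has no engineers.
print([tier.name for tier in candidate_tiers(min_privacy_tier=3, has_engineers=False)])
```

A firm that demands Tier 4 privacy or better but has no engineering capacity gets an empty list back, which is the same tradeoff the table describes.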
# A few honest takes
* **Everyone wants Tier 1 productivity at Tier 5 privacy.** That product doesn't exist. Pick a tradeoff and document why.
* **"No training" is necessary but not sufficient.** Read sub-processor lists. Most "private" tools still send data to AWS / Anthropic / OpenAI / Google - they just don't train on it. The data is still leaving your network.
* **Local AI is overhyped for serious legal work.** The quality gap vs. frontier is real. It's a fit for narrow tasks (PII redaction, classification, summarization), not full contract review or research.
* **The frontier moves fast.** This leaderboard will look different in three months. Pick a tier (architecture), not a specific model - models are swappable, architectures aren't.
Happy to go deeper on any of these in follow-up posts.
Top comments · 8
- 6↑ u/DepoGenius: I’d like to caution those who try #5 that it will make you want to beat your head into a wall. We run a ton of data through 100% local fine-tuned models, and the software setup is daunting if you don’t have experience. Buying a large card and throwing a model on it is pretty dang easy. Once you get to that point and type your first questions in and see the responses fly in, it is exciting. Then you realize nothing after that is easy.
- 5↑ u/pontymython: IT teams reading this and breaking out into a cold sweat at the idea that a user can just sign up to an AI tool without going through 3 approval boards and a mandatory security and usability testing phase, roll out plan and procurement process.
- 2↑ u/tempfoot: Decent rundown. 👍
- 2↑ u/dreamlegal_legaltech: This is a really clean way to frame it, especially using privacy as the anchor instead of just “which tool is best.” One thing that stands out is how the real decision isn’t about the model anymore, it’s about where your data sits and who touches it. Most people still evaluate tools on output quality, but in practice the bigger constraint is internal policy and risk tolerance. Also feels like Tier 3 is where a lot of firms will land for now. It gives enough control to pass internal checks, but without the overhead of building everything in-house. Tier 4 sounds great in theory, but most teams underestimate the effort needed to actually make it usable day to day. The point about everyone wanting Tier 1 productivity with Tier 5 privacy is probably the most accurate takeaway here. That tension is basically shaping the entire market right now. Curious how often you’re seeing teams move backwards between tiers after trying something more advanced.
- 2↑ u/iLiveForTruth: The privacy framing is honestly more useful than another generic model ranking list. Most small firms I’ve seen want Tier 1 convenience while acting like they’re operating a classified government network. Then IT gets dragged in halfway through after someone already pasted client docs into three random AI tools.
- 2↑ u/Deep_Ad1959: my read on tier 4 after watching this play out: the spot it breaks isn't deployment, it's the eval. spinning up qwen 3.5 plus or kimi on bedrock is a week of yaml. what kills the project six months in is nobody built a tagged failure-mode rubric (citation mismatch, jurisdiction wrong, hallucinated cite, hedged-when-a-confident-answer-existed, refused-when-shouldn't), so when policy review asks 'is this at least as accurate as the tier 2 baseline we already trust', there is no answer. without it the firm quietly drifts back to enterprise chatgpt because at least someone else owns the liability. legalbench doesn't reflect the firm's actual matter mix, so the in-house rubric is the work that compounds across model swaps; the productivity gap between tier 4 and tier 2 isn't the model, it's whether you can rerun a bakeoff in a week without redoing six months of human labels. written with ai
- 2↑ u/taibojames: Lawyers and judges can’t think straight about AI privacy. They’re terrified of feeding data to Claude or ChatGPT, then gleefully pour everything into half a dozen SaaS platforms that can keep it forever. Even if you do only the minimum - sign up for Pro/Max and opt out of training at Anthropic or OpenAI, etc. - you get a max 30-day retention window, encryption (allegedly), and relatively narrow 3rd-party disclosure rules. Those are better terms than most of the tech already in use. Sigh.
- 2↑ u/TriVincibleEsq: The tier framework is genuinely useful, the tradeoffs are real, and the “no training” point is important. Data still leaves your network in most tools. I built an app myself and for myself over the past few weeks using AI coding tools (mainly Claude Code / Cursor + Codex-style assistance). It’s a private custom workspace with matter management, secure document upload + parsing + pgvector RAG, a reusable playbooks/clauses library, cited generation, and strong privacy controls (RLS, private Supabase buckets scoped to the user). Using the framework above, it sits as a custom tier 3 / lightweight tier 4 hybrid - structured legal workflows and full control over the app layer, without being generic chat or agentic desktop software. You're right about tiers, but who wants to pay thousands per year for something a motivated non-coder can now build themselves if they’re willing to put in some time? The economics have changed. It’s not as polished as mature SaaS yet, and I’m still responsible for maintenance, but it already does exactly what I need at a fraction of the cost with full ownership.