Eva · legaltech-brain

I've been using all the major LLMs since each went public, and Claude's models have *always* been the worst with this. ChatGPT started off badly too, but it really cleaned things up with the last couple of iterations of the 5.X models (5.5 is especially strong at grounding in actual, pin-cited language from real-world cases). Google's Pro models are also *very* strong, and they have been for a lot longer than those of any other major LLM provider. It just seems like this is something that Anthropic simply can't get right. You'd expect this wouldn't be as much of an issue nowadays, especially with Opus 4.8 at "Max" thinking and every customizable parameter tuned to set the model up for success: e.g., plugins and connectors that route to libraries of case decisions, custom-created "Skills" that allow models to pull relevant subsets of statutes and regulations, and 'fan-out-fan-in' for achieving stochastic consensus and debate consensus. I've built in every safeguard I can think of, and Claude's models just *consistently* underwhelm.

Why are Claude models so unusually prone to hallucinating case citations?
