Legal AI is often evaluated by scale. Bigger models. More data. Longer lists of capabilities. Demos emphasize volume: how many questions a system can answer, how many issues it can spot, how fast it can respond.
That framing misses the real constraint.
The problem with most legal AI tools is not that they are insufficiently powerful. It is that they are insufficiently grounded in realistic scenarios. More AI does not compensate for shallow context.
This became clear during a series of classroom pilots run through Product Law Hub using an AI-based legal coach called Frankie. The pilots were designed to observe how users engage with AI while learning judgment-based legal skills. The findings draw on quantitative engagement data and qualitative interviews conducted during and after the course.
The signal was consistent. Fewer, richer scenarios produced deeper engagement, stronger reasoning, and higher trust than high-volume question sets ever did.
Volume Looks Impressive. Scenarios Do The Work.
In demos, volume is persuasive. A system that can answer dozens of questions in seconds feels powerful. Buyers infer competence from speed and breadth.
In the classroom, that illusion collapsed quickly.
When students were presented with large numbers of short, repetitive prompts, engagement dropped. Sessions shortened. Follow-up questions declined. Interviews revealed a common reaction: the interactions felt mechanical, even when the content was correct.
By contrast, when students were given fewer scenarios with richer context, they stayed longer and worked harder. They revisited assumptions, asked clarifying questions, and refined their analysis. The difference was not the sophistication of the model. It was the quality of the situation.
Ambiguity Invites Judgment
The most effective scenarios had one feature in common. They were ambiguous.
Exercises that included stakeholder disagreement, incomplete information, or competing incentives consistently outperformed cleaner hypotheticals. Students leaned in when they had to decide what mattered, not when they were asked to identify what applied.
Quantitative data showed higher completion rates and longer session times for these scenarios. Qualitative interviews confirmed that students found them more credible and more useful. The exercises felt closer to real work.
Legal judgment does not emerge from clean facts. It emerges from tension. AI that avoids ambiguity to simplify interactions undermines the very skill it claims to support.
Repetition Erodes Trust Faster Than Difficulty
One of the more counterintuitive findings was how users responded to difficulty versus repetition. Hard problems did not drive disengagement. Repetitive ones did.
When scenarios reused the same structure or language, users quickly lost trust. Even minor variations felt shallow. The system appeared inattentive, as though it were pattern-matching rather than reasoning.
In contrast, users tolerated complexity and uncertainty when the scenario felt authentic. They did not expect the AI to make the problem easier. They expected it to take the problem seriously.
This distinction matters for buyers evaluating tools. A demo that showcases dozens of similar questions may signal capability, but it does not predict sustained use.
Realism Is Not About Polish
It is tempting to equate realism with polish. Better UX. Cleaner flows. More reassuring language. The pilots suggest the opposite.
Realism came from friction. Stakeholders who disagreed. Constraints that could not be optimized away. Tradeoffs that had no clean resolution. When the AI engaged with those elements instead of smoothing them over, users trusted it more.
This mirrors real legal work. Lawyers trust colleagues who acknowledge uncertainty and wrestle with it. They distrust those who offer tidy answers to messy problems.
AI that prioritizes smoothness over substance feels less credible, not more.
Scenario Quality Shapes Learning And Trust
The classroom setting made visible something that is harder to detect in practice. Scenario quality shapes not just learning outcomes, but trust in the system itself.
When scenarios felt generic, users disengaged cognitively. When scenarios felt grounded, users attributed more intelligence to the system, even when its responses were constrained.
Trust followed attention. Systems that appeared to understand the situation earned credibility. Systems that recycled patterns lost it.
This has implications beyond education. In firms, scenario quality influences whether lawyers treat AI as a serious tool or a novelty. High-volume outputs cannot compensate for shallow context.
Why Buyers Should Rethink Evaluation Criteria
Legal tech buyers often ask how many use cases a tool supports. A better question is how well it handles one difficult case.
The Product Law Hub pilots suggest that depth beats breadth in judgment-based work. Tools that invest in realistic, high-fidelity scenarios deliver more value than tools that chase coverage.
That may require different procurement thinking. Scenario design is harder to evaluate than feature lists. It does not demo well in five minutes. But it predicts long-term usefulness far better than model size.
The Quiet Cost Of Shallow Scenarios
The cost of shallow scenarios is not just wasted time. It is missed development.
Junior lawyers do not build judgment by answering dozens of simplified questions. They build it by grappling with realistic situations that force prioritization and explanation. AI that substitutes volume for realism accelerates output without accelerating growth.
The classroom data made this visible early. In practice, the cost shows up later as stalled development and diminished confidence.
The Takeaway Vendors Do Not Want To Hear
The uncomfortable takeaway from the pilots is that scenario design matters more than AI sophistication. Bigger models will not fix shallow context. Faster answers will not build judgment.
Legal AI that succeeds will not be defined by how much it can do, but by how well it can inhabit realistic situations and resist the urge to oversimplify them.
More AI is easy to sell. Better scenarios are harder to build. The data suggests they are worth the effort.
Olga V. Mack is the CEO of TermScout, where she builds legal systems that make contracts faster to understand, easier to operate, and more trustworthy in real business conditions. Her work focuses on how legal rules allocate power, manage risk, and shape decisions under uncertainty. A serial CEO and former General Counsel, Olga previously led a legal technology company through acquisition by LexisNexis. She teaches at Berkeley Law and is a Fellow at CodeX, the Stanford Center for Legal Informatics. She has authored several books on legal innovation and technology and delivered six TEDx talks; her insights regularly appear in Forbes, Bloomberg Law, VentureBeat, TechCrunch, and Above the Law. Her work treats law as essential infrastructure, designed for how organizations actually operate.