The 20% problem 

April 28, 2026 - Ryan Erickson

Most AI tools today are built the same way. Take a general capability, wrap a product around it, and ship it. And to be fair, it works. You can summarize documents, draft memos, redline contracts, and analyze data. The outputs are fast, coherent, and often impressive. But they all hit the same ceiling.

Generalized tools get you about 70-80% of the way there. That last 20% is where things break down. Not because the model is failing, but because the definition of "good" is too vague. The output looks right at a glance, but when you actually try to use it in a real workflow, it doesn't hold up.

We ran into this while building an NDA redlining agent with a client. Out of the box, the model could redline. It understood the structure, flagged risks, and made edits. But the feedback wasn't "this is wrong." It was that the edits were too heavy and not surgical enough; they rewrote things that didn't need to be rewritten and didn't match how the team actually works. That's a different problem entirely. You're no longer asking whether something is correct. You're asking whether it feels like your work.

At that point, switching models doesn't help. Better prompting doesn't help. What matters is building a way to measure what "right" actually looks like for that team. We took real NDAs that had already been redlined and used them as the benchmark. Then we built an evaluation loop around it, measuring not just correctness but how closely the output matched the human edits, how much unnecessary change it introduced, and how precise those edits were. And we iterated against that standard.
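The core of such a loop can be sketched in a few lines. This is an illustrative version, not the client system: it diffs a model redline and a human redline against the same original, then scores how much the model's edits overlap the human's and how many extra changes it made. All names (`changed_spans`, `score_redline`) are assumptions for the example.

```python
# Illustrative edit-overlap metric for redline evaluation (a sketch, not the
# production system). Uses stdlib difflib to find which lines of the original
# each redline touched, then compares the two sets of edits.
import difflib


def changed_spans(original: str, edited: str) -> set[int]:
    """Return the indices of original lines that the edit replaced or deleted."""
    matcher = difflib.SequenceMatcher(a=original.splitlines(), b=edited.splitlines())
    touched: set[int] = set()
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            touched.update(range(i1, i2))
    return touched


def score_redline(original: str, human: str, model: str) -> dict:
    """Score a model redline against a human benchmark redline."""
    human_edits = changed_spans(original, human)
    model_edits = changed_spans(original, model)
    overlap = human_edits & model_edits
    # Precision: of the lines the model changed, how many did the human also change?
    precision = len(overlap) / len(model_edits) if model_edits else 1.0
    # Recall: of the lines the human changed, how many did the model catch?
    recall = len(overlap) / len(human_edits) if human_edits else 1.0
    # Unnecessary change: lines the model rewrote that the human left alone.
    unnecessary = len(model_edits - human_edits)
    return {"precision": precision, "recall": recall, "unnecessary_edits": unnecessary}


original = "Term: 5 years.\nGoverning law: Delaware.\nNo assignment."
human = "Term: 2 years.\nGoverning law: Delaware.\nNo assignment."
model = "Term: 2 years.\nGoverning law: New York.\nNo assignment."
print(score_redline(original, human, model))
# The model caught the term change but also rewrote a clause the human left alone.
```

A real loop would score at the clause or token level rather than by line, but the shape is the same: a fixed benchmark of human-redlined documents, a handful of metrics, and iteration until the gap closes.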

This is the part most teams skip because it takes time. It's faster to rely on a general tool and accept "pretty good." But that approach stalls out. You get outputs that are helpful but not usable. Drafts that still require cleanup. Work that doesn't quite match how your team operates.

When you invest in the evaluation loop, the system starts to converge on how the work is actually done. Not just logically correct but stylistically aligned. Not just useful but trusted. That's the difference between something that looks good in a demo and something people actually use every day.

The tradeoff is straightforward. You can move fast and ship something that looks impressive, or you can spend more time upfront defining the standard, building the feedback loop, and iterating until the gap closes. The second path is slower. It's also the only one that produces results people rely on.
