Once a week someone forwards us a call recording of "an AI agent we tried, it was terrible." We listen and grade it on the same five-part rubric we use internally. The failures are almost always the same five things, and none of them is subtle: every one is audible inside ten seconds.
1. The opening is too long
The bad ones say: "Hi! Thanks for calling Smith's Plumbing. This is the AI assistant. I can help with appointments, billing, technical questions, or transfer you to a real person. What can I help you with today?"
That's 7 seconds of preamble, and the caller has already lost interest. The good version is what a good receptionist says: "Hi, Smith's Plumbing, how can I help?" The AI should sound like a person picking up the phone, not a phone tree.
The fix
Cap the opening at 8 words. If you can't say it in 8 words, you're overdesigning.
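The cap is easy to check mechanically. Here is a minimal sketch in Python, assuming the greeting exists as plain text before it goes to speech synthesis. The function names are made up for illustration and do not come from any particular voice platform.

```python
MAX_OPENING_WORDS = 8  # the cap suggested in the fix above


def opening_word_count(greeting: str) -> int:
    """Count whitespace-separated words in a greeting."""
    return len(greeting.split())


def opening_is_short_enough(greeting: str, limit: int = MAX_OPENING_WORDS) -> bool:
    """True when the greeting fits inside the word cap."""
    return opening_word_count(greeting) <= limit


good = "Hi, Smith's Plumbing, how can I help?"
bad = (
    "Hi! Thanks for calling Smith's Plumbing. This is the AI assistant. "
    "I can help with appointments, billing, technical questions, or transfer "
    "you to a real person. What can I help you with today?"
)
```

A check like this can run wherever greetings are authored, so an overlong opening is caught before a caller ever hears it.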
2. Confirmations are too literal
Bad agent: "You said Tuesday at 2pm. Is that correct?"
This sounds like a Turing test for the caller. A human receptionist confirms by integrating the information into the next sentence: "Great, Tuesday at 2 — I just need your address." Same confirmation, no robot energy.
The fix
Confirm by use, not by question. If the next thing you say uses the information correctly, you've confirmed it.
3. They don't say "I don't know"
Bad agents will hallucinate before they will admit a gap. Ask about a service the business doesn't offer and you'll get a confident, made-up answer. The good ones say "I'm not certain — let me get someone who can confirm" and trigger a handoff.
Confidence calibration is a real product feature. Build it. Show it in the dashboard. Reward the model (and the persona designer) for refusing to answer when uncertain.
The instinct to always have an answer is the single biggest reason AI voice agents lose trust.
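One way to make "I don't know" a first-class behavior is to gate every drafted answer on a confidence score. The sketch below assumes your stack can attach a 0-to-1 confidence to each draft (how that score is produced is out of scope here). `Draft`, `respond`, and the 0.75 cutoff are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

# Spoken when the agent is not confident enough to answer.
HANDOFF_LINE = "I'm not certain. Let me get someone who can confirm."


@dataclass
class Draft:
    text: str          # what the agent would like to say
    confidence: float  # assumed 0.0-1.0 score supplied by your own pipeline


def respond(draft: Draft, min_confidence: float = 0.75) -> tuple:
    """Return (line_to_speak, should_hand_off)."""
    if draft.confidence < min_confidence:
        # Admit the gap and escalate instead of guessing.
        return HANDOFF_LINE, True
    return draft.text, False
```

Counting how often the gate fires gives you a number to put on the dashboard.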
4. They repeat themselves under pressure
When the caller interrupts, gets emotional, or asks something off-script, bad agents loop. "I understand you're frustrated. I understand you're frustrated. To help you better, can you tell me..."
This is the worst one because it happens at the moment the caller most needs to feel heard. Loop detection is non-negotiable. If you've said something equivalent in the last two turns, you cannot say it again. You must either advance or escalate.
The fix
Track turn-level similarity at runtime. If a candidate response's similarity to a recent agent turn exceeds a threshold, force a new path, usually a handoff.
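A rough sketch of that check, using only Python's standard library. It compares the candidate line against the agent's last two turns with `difflib.SequenceMatcher`, which measures character-level similarity. A production system would more likely use embeddings, and the 0.8 threshold and the function names here are assumptions to tune, not tested values.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial variations still match."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity between two turns, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_loop(candidate: str, agent_turns: list, window: int = 2,
            threshold: float = 0.8) -> bool:
    """True if the candidate is too close to any of the last `window` agent turns."""
    recent = agent_turns[max(0, len(agent_turns) - window):]
    return any(similarity(candidate, turn) > threshold for turn in recent)


def next_action(candidate: str, agent_turns: list) -> str:
    """Speak the candidate, or escalate if it would repeat a recent turn."""
    return "escalate" if is_loop(candidate, agent_turns) else "speak"
```

The check runs on what the agent is about to say, before it is spoken, so the repeated line never reaches the caller.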
5. The handoff is a transfer
I wrote a separate post on this so I'll keep it short: a handoff that requires the customer to repeat themselves isn't a handoff. It's a transfer. The receiving human needs to know everything the AI knew, on screen, before the call connects. If they don't, you don't have a handoff — you have a failover, and you've made things worse than not having an AI at all.
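To make "on screen, before the call connects" concrete, here is one possible shape for the context that travels with the call, plus a gate that refuses to connect until it is filled in. Every field name below is hypothetical. Your own list depends on what the agent actually collects.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    # Hypothetical fields: whatever the AI learned during the call.
    caller_name: str = ""
    callback_number: str = ""
    reason_for_call: str = ""
    summary: str = ""  # short recap the human can read in seconds
    collected: dict = field(default_factory=dict)  # e.g. address, preferred time


# Fields the receiving human must see before the call connects (an assumption).
REQUIRED = ("caller_name", "callback_number", "reason_for_call", "summary")


def missing_fields(ctx: HandoffContext) -> list:
    """Names of required fields that are still empty."""
    return [name for name in REQUIRED if not getattr(ctx, name).strip()]


def can_connect(ctx: HandoffContext) -> bool:
    """Only connect the call once the context is complete."""
    return not missing_fields(ctx)
```

If `can_connect` returns False, the agent still has something to gather or summarize before it brings a human in.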
The rubric
When we grade an AI voice agent (ours or a competitor's), we score 1–5 on each of these. Anything below 4 on any single dimension is a fail for that agent — not "fix it later," a real fail. The customer will hear it.
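The pass/fail rule is simple enough to write down. A small sketch, assuming one integer score per dimension. The dimension keys are shorthand labels for the five sections above, not an official naming.

```python
# Shorthand labels for the five sections above (assumed names).
DIMENSIONS = ("opening", "confirmation", "uncertainty", "repetition", "handoff")
PASS_MARK = 4  # anything below this on any dimension fails the agent


def grade(scores: dict) -> dict:
    """Apply the rule: one dimension below PASS_MARK fails the whole agent."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError("scores must cover exactly the five dimensions")
    for name, value in scores.items():
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5")
    failed = [d for d in DIMENSIONS if scores[d] < PASS_MARK]
    return {"passed": not failed, "failed_dimensions": failed}
```

There is deliberately no averaging: a 5 on the opening does not buy back a 3 on the handoff.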
If you're evaluating a vendor, run the rubric. They won't push back, and you'll save yourself a quarter.
