
Building Voice AI for Clinic Intake

September 25, 2024 · 4 min read

Voice AI for clinic intake is a harder problem than it looks from the outside. A chat-based AI intake can pause while the model thinks. A voice agent cannot - the pause becomes a dead line, and dead lines make callers hang up. This is the account of our first voice AI deployment for a multi-branch clinic, what the technical constraints actually looked like, and how we evaluate whether the system is working.

The Problem It Was Solving

The client was a clinic group with four branches. Their front desk staff were spending a significant portion of each day handling inbound calls for appointment booking, rescheduling, basic FAQs, and referrals to the right branch for specific services. Staffing the phone line across branch hours was expensive and inconsistent - after-hours calls were missed, and busy periods created hold times that callers did not stay for.

The brief was a voice agent that could handle first-contact calls: collect caller information, identify the purpose of the call, book appointments directly into the clinic's scheduling system for straightforward cases, and escalate to a human for anything complex or urgent.

Tagalog, English, and the Real Caller Population

Our first challenge was language. The caller population for a Philippine clinic is not monolingual. Some callers speak English. Many speak Filipino or Tagalog. Most mix languages mid-sentence in ways that do not follow a clean code-switching pattern.

We tested multiple speech recognition approaches against real call recordings the clinic provided. Recognition accuracy on Tagalog-dominant speech and mixed-language speech was significantly lower than on English-dominant speech in our initial tests. The gap was large enough that a single-model approach produced an unacceptable error rate on the mixed-language inputs.
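For illustration, this is roughly the shape of the comparison, with jiwer standing in for whatever word-error-rate tooling you prefer. The input format is an assumption, and the numbers in the closing comment are illustrative, not our measured results.

```python
from jiwer import wer

def wer_by_language(calls):
    """Group (reference, hypothesis, dominant_language) tuples and score each bucket."""
    buckets = {}
    for reference, hypothesis, language in calls:
        buckets.setdefault(language, []).append((reference, hypothesis))
    return {
        language: wer([r for r, _ in pairs], [h for _, h in pairs])
        for language, pairs in buckets.items()
    }

# Illustrative shape of the output, not our measured numbers:
# {"en": 0.08, "tl": 0.21, "mixed": 0.27}
```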

Our solution was a pre-classification step that detected the likely primary language of the first few seconds of speech and routed accordingly, combined with a fallback that prompted the caller to repeat if recognition confidence fell below a threshold. This added latency to the opening of the call but produced better overall accuracy.
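A minimal sketch of that routing step, assuming recognizers that return a transcript with a confidence score. The names and the 0.6 threshold are placeholders, not our production values.

```python
from typing import Callable, Dict, Optional, Tuple

# A recognizer takes raw audio and returns (transcript, confidence).
Recognizer = Callable[[bytes], Tuple[str, float]]

CONFIDENCE_THRESHOLD = 0.6  # placeholder value, not the production setting

def route_transcription(
    opening_audio: bytes,
    full_audio: bytes,
    classify_language: Callable[[bytes], str],
    recognizers: Dict[str, Recognizer],
) -> Optional[str]:
    # Detect the likely primary language from the first few seconds of speech.
    language = classify_language(opening_audio)
    recognizer = recognizers.get(language, recognizers["en"])
    text, confidence = recognizer(full_audio)
    # Below threshold, return None so the dialog layer re-prompts instead of
    # acting on a likely mis-transcription.
    if confidence < CONFIDENCE_THRESHOLD:
        return None
    return text
```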

The fallback prompt was tuned carefully. "Sorry, could you repeat that?" repeated twice in a row on an intake call kills the experience. We limited the agent to one re-prompt per exchange and had it reframe the question rather than repeat it verbatim.
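The rule itself is simple enough to sketch; the prompt strings here are illustrative.

```python
def handle_low_confidence(reprompts_this_exchange: int, question: str):
    """One reframed re-prompt per exchange, then hand off to a human."""
    if reprompts_this_exchange >= 1:
        # A second failure in the same exchange escalates rather than loops.
        return "escalate", "Let me connect you to one of our staff."
    # Reframe instead of repeating verbatim.
    return "reprompt", f"Just to make sure I get this right - {question}"
```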

Latency Is the User Experience

In a voice interaction, the response time between a caller finishing speaking and the agent responding is everything. If the gap is too long, callers assume the call has dropped or that they are talking to a broken system. The tolerance is narrow: a pause of a second or two reads as natural, while anything much beyond that reads as a dead line.

We set an internal target of under two seconds from caller end-of-speech to agent start-of-speech for the majority of exchanges. Hitting that target required work on both the infrastructure side and the prompt side.

On the infrastructure side: we minimized round-trip latency between the call platform, the speech-to-text layer, the language model, and the text-to-speech synthesis. Each of those steps adds time, and because each stage consumes the previous stage's output, the latencies stack rather than overlap - the total is additive unless partial results are streamed between stages.
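A stripped-down sketch of how per-stage timing can be instrumented, with the stage callables standing in for the real integrations:

```python
import time

def timed_turn(audio, stt, llm, tts):
    """Run one caller exchange through the chain and record per-stage latency."""
    timings = {}

    start = time.monotonic()
    transcript = stt(audio)
    timings["stt"] = time.monotonic() - start

    start = time.monotonic()
    reply_text = llm(transcript)
    timings["llm"] = time.monotonic() - start

    start = time.monotonic()
    reply_audio = tts(reply_text)
    timings["tts"] = time.monotonic() - start

    # The stages run back to back, so the caller-perceived gap is the sum.
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return reply_audio, timings
```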

On the prompt side: shorter responses help. A voice agent that generates a 200-word response takes longer to synthesize and delivers a worse caller experience than one that generates 40 words. We rewrote every agent response pattern to be as concise as possible without losing the information the caller needed.
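One simple guard that falls out of this, sketched with an illustrative word cap rather than our production value:

```python
from typing import Callable

MAX_WORDS = 60  # illustrative cap, not the production value

def enforce_brevity(response: str, condense: Callable[[str], str]) -> str:
    # Long responses take longer to synthesize and lose callers; ask the
    # model to condense rather than truncating mid-sentence.
    if len(response.split()) <= MAX_WORDS:
        return response
    return condense(response)
```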

Evaluating Quality

We use three metrics to evaluate call performance, reviewed weekly with the clinic. A sketch of how they can be computed from the call log follows the three definitions below.

Containment rate: the percentage of calls the agent handled without escalation. A high containment rate is good only if the escalations that do happen are the right ones - genuinely complex cases, not cases where the agent failed.

Escalation appropriateness: we sample escalated calls and categorize why they escalated. This is manual review, not automated. We look for patterns: is the agent escalating things it should be able to handle? Are callers forcing escalation by expressing frustration?

Caller sentiment signals: we flag calls where the caller used specific language patterns that correlate with frustration - repeated "hello," repeated questions, explicit requests for a human. These are not definitive measures but they surface calls worth listening to.
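A minimal sketch of the weekly computation, assuming a call log with escalation flags and transcripts. The field names and the regex patterns are assumptions, not our production rules; they mirror the signals described above.

```python
import re

# Language patterns that correlate with frustration; illustrative, not the
# production rule set.
FRUSTRATION_PATTERNS = [
    r"\bhello\b.*\bhello\b",                          # repeated "hello"
    r"\b(speak|talk)\s+to\s+(a\s+)?(human|person|agent|staff)\b",
]

def weekly_metrics(calls):
    """calls: list of dicts with 'call_id', 'escalated', and 'transcript' keys."""
    total = len(calls)
    contained = sum(1 for call in calls if not call["escalated"])
    flagged = [
        call["call_id"]
        for call in calls
        if any(
            re.search(pattern, call["transcript"], re.IGNORECASE | re.DOTALL)
            for pattern in FRUSTRATION_PATTERNS
        )
    ]
    return {
        "containment_rate": contained / total if total else 0.0,
        "escalations_to_review": [c["call_id"] for c in calls if c["escalated"]],
        "frustration_flags": flagged,
    }
```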

The first month of live deployment produced a containment rate in the range we had targeted. The escalation patterns showed one category of call the agent was not handling well - insurance and HMO verification questions that required looking up information the agent did not have access to. We expanded the integration scope to include that lookup in the second iteration.

Voice AI for clinic intake is viable. It requires more careful engineering than chat-based AI features and more active monitoring post-launch. If your operation has a phone intake problem worth solving, the conversation is worth having.


Need this built for your business?

Let's scope it together.

Start a project