A Guide to Evaluating Voice AI Agents
A comprehensive guide to production and development evaluation of Voice AI applications based on a conversation with Brooke Hopkins from Coval.
As Voice AI applications continue to advance, developers face complex challenges in testing, evaluating, and monitoring their voice agents.
In this blog post, we’ll explore how you can create more robust and reliable voice applications. We’ll draw on setups we have seen from our users at Langfuse and the recent discussion (full video) between Brooke (Co-Founder of Coval) and me to provide a comprehensive guide to Voice AI evaluation.
The Evolution of Voice AI Testing
Voice AI applications present complexities that extend beyond traditional LLM implementations. In addition to challenges caused by the non-deterministic nature of language models, developers must also handle:
- Audio Quality and Metrics
- User Interruptions
- Speech-to-Text (STT) Accuracy
- Text-to-Speech (TTS) Output Quality
- Real-Time Streaming Interactions
As voice applications mature, the need for both high-level integration testing and detailed component evaluation becomes critical.
Understanding the Voice AI Testing Pyramid
Developing effective voice applications requires a dual approach to evaluation strategies:
- Online Evaluation: Focuses on real-time production monitoring, performance tracking, and analyzing user interactions.
- Offline Evaluation: Involves development testing, ranging from end-to-end agent testing to granular unit tests and validating conversation flows.
Combining both layers of this testing pyramid is essential for effective Voice AI evaluation and ensures your voice agents perform reliably in live environments.
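To make the online side concrete, here is a minimal sketch of logging a production call as a Langfuse trace and attaching an evaluation score to it. It assumes the v2-style Langfuse Python SDK; the trace name, score name, and value are illustrative.

```python
from langfuse import Langfuse

# Assumes the v2-style Langfuse Python SDK and LANGFUSE_* env vars being set.
langfuse = Langfuse()

# Online evaluation: every production call is logged as a trace ...
trace = langfuse.trace(name="voice-agent-call", metadata={"channel": "phone"})

# ... and evaluation results (user feedback, automated checks on the
# transcript, latency budgets, etc.) are attached as scores on that trace.
langfuse.score(
    trace_id=trace.id,
    name="user-satisfaction",  # illustrative score name
    value=0.8,
)
```

The same pattern applies to other online checks, for example interruption counts or STT confidence, as long as they can be expressed as a score on the trace.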
Single-Message vs. Conversation-Level Evaluations
The second duality we see in voice agent evaluation is between evaluating single messages and evaluating the conversation as a whole.
Single-turn evaluations (see the tracing sketch after this list):
- Trace the step-by-step execution of a single message
- Monitor the tool calls and other application logic used by the voice agent
- Analyze stream-based, real-time interactions
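To illustrate what single-turn tracing can look like, here is a sketch that traces one voice turn (STT → agent LLM with tool calls → TTS) as nested observations. It assumes the v2-style Langfuse Python SDK; `transcribe_audio`, `run_agent`, and `synthesize_speech` are stand-in stubs for your own STT, agent, and TTS calls.

```python
from langfuse import Langfuse

langfuse = Langfuse()  # v2-style Langfuse Python SDK (assumption)

# Stand-in stubs: replace with your STT provider, agent logic, and TTS provider.
def transcribe_audio(audio: bytes) -> str:
    return "I'd like to book an appointment for Friday."

def run_agent(transcript: str) -> tuple[str, list[dict]]:
    return "Sure, what time on Friday works for you?", [{"name": "check_calendar"}]

def synthesize_speech(text: str) -> bytes:
    return text.encode()

def handle_user_message(audio_chunk: bytes) -> bytes:
    """Handle one voice turn (STT -> LLM + tools -> TTS), traced step by step."""
    trace = langfuse.trace(name="voice-turn")

    # 1. Speech-to-text
    stt = trace.span(name="stt", input={"audio_bytes": len(audio_chunk)})
    transcript = transcribe_audio(audio_chunk)
    stt.end(output={"transcript": transcript})

    # 2. LLM step, including the tool calls the agent decided to make
    gen = trace.generation(name="agent-llm", model="gpt-4o-mini", input=transcript)
    reply, tool_calls = run_agent(transcript)
    gen.end(output={"reply": reply, "tool_calls": tool_calls})

    # 3. Text-to-speech
    tts = trace.span(name="tts", input={"text": reply})
    audio_out = synthesize_speech(reply)
    tts.end(output={"audio_bytes": len(audio_out)})

    return audio_out
```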
Multi-turn evaluations (see the conversation-scoring sketch after this list):
- Perform end-to-end simulation testing on the whole conversation
- Test for regressions caused by different prompt versions or model changes
- Classify and detect anomalies in the conversation flow
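As a sketch of a conversation-level evaluation, the following scores the full transcript with an LLM-as-a-judge and writes the result back to the trace, so prompt or model changes can be compared across versions. The judge prompt, judge model, and score name are illustrative assumptions.

```python
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

JUDGE_PROMPT = """You are evaluating a voice agent conversation.
Rate from 0 to 1 whether the agent resolved the caller's request without
unnecessary repetition or dead ends. Answer with a number only.

Conversation:
{transcript}"""

def evaluate_conversation(trace_id: str, turns: list[dict]) -> float:
    """Score the whole conversation with an LLM-as-a-judge and log the result."""
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in turns)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an assumption
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
    )
    score = float(response.choices[0].message.content.strip())

    # Attach the conversation-level score to the trace so different prompt
    # and model versions can be compared against each other.
    langfuse.score(trace_id=trace_id, name="conversation-resolution", value=score)
    return score
```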
Integration Best Practices and Development Workflow
Usually, there are two phases in the voice agent development workflow:
Early development stages:
- Quick integration tests and online evaluations
- Trace and debug individual components of the conversation
Application running in production:
- Implement specific unit tests for cases spotted in development
- Detailed performance monitoring and conversation level evaluations
- Ongoing regression testing (see the unit-test sketch after this list)
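For the regression-testing step, here is a sketch of a pytest unit test for specific cases spotted earlier. `my_voice_agent.run_agent` is a hypothetical entry point that returns the agent's reply and the tool calls it made for a single utterance.

```python
import pytest

# Hypothetical entry point into your agent: returns the reply text and the
# list of tool calls the agent made for a single user utterance.
from my_voice_agent import run_agent

@pytest.mark.parametrize("utterance, expected_tool", [
    ("I need to cancel my appointment on Friday", "cancel_appointment"),
    ("Can you move my booking to next Tuesday?", "reschedule_appointment"),
])
def test_agent_calls_expected_tool(utterance, expected_tool):
    # Regression guard: these utterances must be routed to the right tool
    # instead of being answered in free text.
    _, tool_calls = run_agent(utterance)
    assert any(call["name"] == expected_tool for call in tool_calls)
```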
The type of evaluation also depends on the type of voice application. Some applications might require closer monitoring of model costs, whereas others might focus on the conversation flow and the accuracy of tool calls:
Transactional Voice Applications (e.g., Appointment Scheduling):
- Trace individual function calls and apply evaluations to single messages.
- Perform end-to-end testing of complete user journeys (see the journey-test sketch after this list).
Complex Applications (e.g., Virtual Assistants):
- Focus on conversation-level testing and monitor conversation arcs.
- Monitor tool calls and application logic.
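For a transactional application, an end-to-end journey test might look like this sketch. `simulate_conversation` is a hypothetical helper that plays scripted user turns against the agent and returns the tool calls it collected; the booking tool name and arguments are assumptions.

```python
# simulate_conversation() is a hypothetical helper that plays scripted user
# turns against the agent and returns the collected tool calls.
from my_voice_agent import simulate_conversation

def test_appointment_scheduling_journey():
    result = simulate_conversation([
        "Hi, I'd like to book a dentist appointment.",
        "Friday at 10am works.",
        "Yes, please confirm.",
    ])
    # The journey should end with exactly one booking call for the agreed slot.
    booking_calls = [c for c in result.tool_calls if c["name"] == "book_appointment"]
    assert len(booking_calls) == 1
    assert booking_calls[0]["arguments"]["time"] == "Friday 10:00"
```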
We are excited that Coval will natively integrate with Langfuse. With this integration, Langfuse users can use Coval to perform end-to-end simulation testing on the whole conversation of their voice agents. Reach out if you are interested in trying it.
Resources
- Watch the full discussion with Brooke Hopkins and Marc Klingen here.
- Learn more about Langfuse:
- Tracing LLM applications
- LLM Observability Challenges
- Tracking Model Usage and Cost
- Performing LLM-as-a-Judge Evaluations
- Check out the Coval docs.