GenAI QA Evaluator

AI Travel Assistant Quality Analytics: From Manual Reviews to Real-Time Intelligence

Case Study Summary

Client: Leading AI Travel Assistant Platform
Industry: Travel Technology / Conversational AI

Impact Metrics:

  • Conversation Analysis Speed: Hours -> Seconds (99.9% reduction)
  • QA Coverage: ~5% manual sampling -> 100% automated evaluation
  • Insight Generation: Weekly reports -> Real-time dashboards
  • Engineer Productivity: 80% reduction in manual review time
  • Pattern Detection: no structured analysis -> 11 scoring dimensions per conversation
  • Scalability: tens of conversations per hour -> thousands per hour

By implementing a sophisticated dual-mode evaluation system with strategic and tactical scorers, we transformed how AI engineers and QA teams understand and improve conversational AI quality at scale.

Executive Summary

Developed a comprehensive multi-model LLM evaluation platform for a leading AI travel assistant, enabling both real-time agent self-reflection and offline analytics. The system employs 11 specialized scorers across strategic business metrics and tactical interaction analysis, processing thousands of conversations automatically to surface insights that would take human teams weeks to identify manually. This freed AI and QA engineers from tedious manual review while providing unprecedented visibility into conversation quality patterns.

[Figure: Strategic KPI Dashboard]

The Challenge

The AI travel assistant platform faced a critical blind spot: with thousands of daily user conversations, teams relied on manual spot-checks covering less than 5% of interactions. When users unsubscribed or provided negative feedback, engineers spent hours manually reviewing conversation logs trying to identify root causes. This reactive approach meant:

  • Invisible Failures: 95% of problematic conversations went unanalyzed.
  • Delayed Insights: Pattern recognition took weeks of manual analysis.
  • Resource Drain: Engineers spent 60% of their time on manual QA reviews.
  • Limited Metrics: Basic technical metrics missed user experience quality.
  • No Real-Time Adaptation: The assistant couldn't self-correct during conversations.

As the sole architect and engineer on this project, I designed and implemented a solution that would transform their quality assurance from reactive manual processes to proactive automated intelligence.

Why It Matters

In the competitive travel AI market, conversation quality directly impacts user retention and business growth. Every failed interaction, whether from misunderstood constraints, unmet preferences, or poor emotional handling, risks losing customers to competitors. The platform needed to:

  • Understand User Satisfaction: Beyond simple task completion metrics.
  • Identify Failure Patterns: Across thousands of conversations simultaneously.
  • Enable Proactive Improvement: Before users complain or unsubscribe.
  • Empower Engineering Teams: With actionable insights, not raw data.
  • Support Real-Time Adaptation: During active conversations.

Without automated evaluation, the platform was flying blind, unable to systematically improve or even understand where conversations were failing.

Technical Approach

I designed a dual-mode evaluation architecture that serves both real-time and analytical needs:

Dual-Purpose Architecture

  1. Offline Mode (Analytics): Comprehensive batch processing for pattern detection, deep analysis, and insight generation across large conversation datasets.
  2. Online Mode (Real-Time): Evaluation during active conversations for agent self-reflection.

Cognitive Evaluation Framework

The system employs 11 specialized scorers divided into two complementary streams (a sketch of a shared result schema follows the two lists below):

Strategic Scorers (Business Metrics):

  • Net Promoter Score (NPS): Estimates how likely users are to recommend the assistant, as a proxy for overall satisfaction.
  • User Effort Score (UES): Measures how easy it was for users to achieve their goals.
  • Goal Achievement Score (GAS): Assesses whether user objectives were met during the conversation.
  • Emotional Journey Progression (EJP): Tracks user emotional states throughout the interaction.
  • Booking Intent Development (BID): Evaluates the evolution of user intent to book travel services.

Tactical Scorers (Interaction Analysis):

  • User Journey Phase Mapping: Maps each interaction to a stage of the travel planning and booking funnel.
  • User Preferences & Fulfillment: Assesses the AI assistant's ability to capture and meet user travel preferences.
  • User Requirements & Violations: Checks the AI assistant's adherence to user trip constraints and flags violations.
  • Interaction Type Classification: Classifies user actions across journey stages.
  • Cognitive State Analysis: Evaluates the AI assistant's ability to manage the user's information processing, conversational engagement, and progress satisfaction.
  • Emotional State Detection: Detects user emotional states throughout the conversation.

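The exact scorer contracts are internal to the platform, but conceptually every scorer emits a comparable result object. A minimal sketch of what such a shared schema might look like, with hypothetical class and field names:

```python
from enum import Enum
from pydantic import BaseModel, Field


class ScorerStream(str, Enum):
    STRATEGIC = "strategic"  # business metrics: NPS, UES, GAS, EJP, BID
    TACTICAL = "tactical"    # interaction analysis: phases, preferences, states


class ScorerResult(BaseModel):
    """Normalized output shared by all 11 scorers (hypothetical schema)."""
    scorer_name: str                      # e.g. "goal_achievement_score"
    stream: ScorerStream
    score: float = Field(ge=0.0, le=1.0)  # normalized 0-1 value
    rationale: str                        # LLM-generated justification
    evidence_turn_ids: list[str] = []     # conversation turns supporting the score


class ConversationEvaluation(BaseModel):
    """Aggregate of all scorer results for one conversation."""
    conversation_id: str
    results: list[ScorerResult]

    def by_stream(self, stream: ScorerStream) -> list[ScorerResult]:
        return [r for r in self.results if r.stream == stream]
```
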
[Figure: Conversation Summary Dashboard]

Multi-Model LLM Integration

Built a provider-agnostic architecture supporting OpenAI, Anthropic, Google Gemini, AWS Bedrock, and local Ollama deployments, enabling cost optimization and redundancy.
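
The case study doesn't expose the abstraction layer itself; as a rough illustration, provider-agnostic routing with fallback could look like the sketch below (the `LLMClient` protocol and adapter names are assumptions):

```python
from typing import Protocol


class LLMClient(Protocol):
    """Minimal interface each provider adapter would implement (an assumption)."""
    async def complete(self, prompt: str, *, model: str) -> str: ...


class ProviderRegistry:
    """Routes a completion call to the preferred provider, falling back on failure."""

    def __init__(self, clients: dict[str, LLMClient], fallback_order: list[str]):
        self._clients = clients
        self._fallback_order = fallback_order

    async def complete(self, prompt: str, *, model: str) -> str:
        last_error: Exception | None = None
        for name in self._fallback_order:
            try:
                return await self._clients[name].complete(prompt, model=model)
            except Exception as exc:  # rate limits, outages, provider errors
                last_error = exc
        raise RuntimeError("All configured LLM providers failed") from last_error


# Hypothetical wiring: adapters for OpenAI, Anthropic, Gemini, Bedrock, or a
# local Ollama instance would each implement LLMClient.complete().
# registry = ProviderRegistry(
#     clients={"openai": OpenAIAdapter(), "ollama": OllamaAdapter()},
#     fallback_order=["openai", "ollama"],
# )
```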

Key Implementation Decisions

1. Chain of Responsibility with Parallel Processing

Rather than sequential evaluation, I implemented a parallel processing architecture where strategic and tactical scorers execute concurrently. This reduced evaluation time from a potential 20+ seconds to under 5 seconds in online mode while maintaining comprehensive analysis depth.
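
A minimal sketch of that concurrency pattern, assuming asyncio-based scorer coroutines (names are illustrative, not the production code):

```python
import asyncio


async def evaluate_conversation(conversation, strategic_scorers, tactical_scorers):
    """Run both scorer streams concurrently instead of one after the other."""
    strategic = asyncio.gather(*(score(conversation) for score in strategic_scorers))
    tactical = asyncio.gather(*(score(conversation) for score in tactical_scorers))
    # Total latency is bounded by the slowest scorer, not the sum of all 11 calls.
    strategic_results, tactical_results = await asyncio.gather(strategic, tactical)
    return {"strategic": strategic_results, "tactical": tactical_results}
```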

2. Structured Output Validation

Each scorer uses Pydantic models with XML-formatted prompts stored in Langfuse. This ensures consistent, parsable outputs across different LLM providers while enabling prompt version control and A/B testing without code changes.
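
As an illustration of the pattern, a single scorer could pair a Langfuse-managed prompt with a Pydantic response model; the prompt name and output fields below are hypothetical:

```python
from pydantic import BaseModel, Field
from langfuse import Langfuse


class GoalAchievementOutput(BaseModel):
    """Structured output the LLM must return for the GAS scorer (fields are hypothetical)."""
    score: float = Field(ge=0.0, le=1.0, description="Degree to which user goals were met")
    unmet_goals: list[str] = Field(default_factory=list)
    rationale: str


langfuse = Langfuse()  # credentials read from environment variables


def build_gas_prompt(conversation_xml: str) -> str:
    # The prompt text lives in Langfuse (versioned, editable without a redeploy);
    # the prompt name here is an assumption for illustration.
    prompt = langfuse.get_prompt("scorers/goal-achievement")
    return prompt.compile(conversation=conversation_xml)


# The compiled prompt is sent to the selected LLM with instructions to emit JSON
# matching GoalAchievementOutput, then validated before aggregation, e.g.:
# result = GoalAchievementOutput.model_validate_json(llm_response_text)
```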

3. VAD Emotional Quantification

Implemented Valence-Arousal-Dominance (VAD) scoring across emotional and cognitive states, providing dimensional analysis beyond simple categorical labels. This enables nuanced understanding of user emotional trajectories throughout conversations.
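
A minimal sketch of how VAD coordinates might be represented and tracked across turns, assuming scores normalized to [-1, 1] (the actual scale and fields are internal):

```python
from dataclasses import dataclass


@dataclass
class VADScore:
    """Valence-Arousal-Dominance coordinates for a single conversation turn.

    valence:   negative (-1.0) to positive (+1.0) affect
    arousal:   calm (-1.0) to highly activated (+1.0)
    dominance: feeling controlled (-1.0) to feeling in control (+1.0)
    """
    valence: float
    arousal: float
    dominance: float


def valence_drift(turn_scores: list[VADScore]) -> list[float]:
    """Turn-over-turn valence change; a sustained negative drift can flag
    growing frustration before the user complains explicitly."""
    return [later.valence - earlier.valence
            for earlier, later in zip(turn_scores, turn_scores[1:])]
```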

Results & Impact

Operational Transformation

  • QA Coverage: From ~5% manual sampling to 100% automated evaluation.
  • Analysis Speed: What took engineers hours now completes in seconds.
  • Pattern Detection: Systematic issues surface immediately vs. weeks later.
  • Engineer Liberation: 80% reduction in manual conversation review time.
  • Scalability: From analyzing dozens to thousands of conversations per hour.

Intelligence Capabilities

The platform now enables:

Systematic Failure Analysis: QA teams identify patterns like "73% of booking modification requests show high user effort" within minutes instead of weeks.

Proactive Improvement: Engineering teams can see in real-time dashboards when specific KPIs (e.g. NPS) start to degrade consistently, enabling targeted fixes before users complain.

Executive Visibility: Strategic dashboards provide the C-suite and Product Managers with insights into customer satisfaction trends and business impact metrics.

Real-Time Agent Adaptation: Assistants can now detect and respond to user frustration or confusion during conversations, adjusting their approach based on live evaluation feedback.

Technical Deep-Dive

Evaluation Pipeline Architecture

The system implements a sophisticated workflow orchestration pattern:

```txt
Event Stream → Task Context → Parallel Evaluation → Result Aggregation → Multi-Channel Output
                                    ↓
                        Strategic Stream | Tactical Stream
                                    ↓
                        Business Metrics | Interaction Quality
```
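
In code, the stages above map roughly onto an orchestration skeleton like the following; every function here is an illustrative stub rather than the production implementation:

```python
import asyncio


def build_task_context(event: dict) -> dict:
    """Task Context: attach the transcript and metadata to the raw event."""
    return {"conversation_id": event["conversation_id"],
            "conversation": event.get("transcript", [])}


async def evaluate(context: dict) -> dict:
    """Parallel Evaluation: fan out to the strategic and tactical streams
    (see the asyncio sketch under 'Chain of Responsibility with Parallel Processing')."""
    await asyncio.sleep(0)  # stands in for the real scorer fan-out
    return {"strategic": [], "tactical": []}


def aggregate(evaluation: dict) -> dict:
    """Result Aggregation: merge both streams into a single evaluation record."""
    return {"results": evaluation["strategic"] + evaluation["tactical"]}


async def publish(record: dict, channels: list[str]) -> None:
    """Multi-Channel Output: analytics store, agent callback, alerting, etc."""
    for channel in channels:
        print(f"-> {channel}: {record}")


async def run_pipeline(event: dict) -> dict:
    context = build_task_context(event)
    record = aggregate(await evaluate(context))
    await publish(record, channels=["analytics_store", "agent_callback"])
    return record
```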

FastAPI Service Layer

The REST API handles both synchronous (online) and asynchronous (offline) evaluation requests. Key features include:

Request Routing: Intelligent routing between real-time and batch processing paths based on urgency flags.

Timeout Management: Configurable timeouts ensure online evaluations never block assistant responses.

Result Caching: Redis-backed caching prevents redundant evaluations of identical conversations.
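
A minimal FastAPI sketch of the routing and timeout behavior described above; the endpoint path, request fields, and helper functions are assumptions for illustration:

```python
import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class EvaluationRequest(BaseModel):
    conversation_id: str
    transcript: list[dict]
    realtime: bool = False  # urgency flag used for routing (hypothetical field)


async def run_online_evaluation(request: EvaluationRequest) -> dict:
    """Stub for the real-time scorer fan-out."""
    await asyncio.sleep(0)
    return {"status": "ok", "conversation_id": request.conversation_id}


def enqueue_offline_evaluation(request: EvaluationRequest) -> str:
    """Stub for dispatching a batch task to the workers; returns a task id."""
    return f"task-{request.conversation_id}"


@app.post("/evaluate")
async def evaluate(request: EvaluationRequest) -> dict:
    if request.realtime:
        # Online path: a hard timeout guarantees the assistant is never blocked.
        try:
            return await asyncio.wait_for(run_online_evaluation(request), timeout=5.0)
        except asyncio.TimeoutError:
            return {"status": "timeout", "partial": True}
    # Offline path: hand off to the batch workers and return immediately.
    return {"status": "queued", "task_id": enqueue_offline_evaluation(request)}
```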

Celery Worker Architecture

For offline batch processing, Celery workers provide:

Horizontal Scalability: Additional workers dynamically handle increased load.

Priority Queuing: Urgent analyses jump ahead of routine batch jobs.

Fault Tolerance: Failed evaluations retry with exponential backoff.
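
A hedged sketch of what such a Celery setup could look like, using illustrative task and queue names (the broker URL and retry settings are placeholders):

```python
from celery import Celery

# Broker URL is a placeholder for illustration.
celery_app = Celery("evaluations", broker="redis://localhost:6379/0")

# Route urgent analyses and routine batch jobs to separate queues so that
# priority work is picked up ahead of the backlog.
celery_app.conf.task_routes = {
    "evaluate_urgent": {"queue": "urgent"},
    "evaluate_batch": {"queue": "batch"},
}


@celery_app.task(
    name="evaluate_batch",
    autoretry_for=(Exception,),  # fault tolerance: retry transient failures
    retry_backoff=True,          # exponential backoff between attempts
    retry_backoff_max=600,
    max_retries=5,
)
def evaluate_batch(conversation_ids: list[str]) -> dict:
    """Evaluate a batch of conversations offline (scoring logic elided here)."""
    return {"evaluated": len(conversation_ids)}


# Horizontal scaling is a matter of adding workers, e.g.:
#   celery -A worker_module worker --queues=batch --concurrency=8
```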

Streamlit Analytics Dashboard

The dashboard transforms raw evaluation data into actionable insights through:

Multi-Dimensional Filtering: Combine strategic and tactical criteria to isolate specific conversation types.

Time Series Analysis: Track metric evolution to measure improvement impact.

Conversation Deep-Dives: Examine individual interactions with full scorer breakdowns.

Export Capabilities: Generate reports for stakeholder presentations.
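
As an example of the filtering, time-series, and export views, a compact Streamlit page might look like this; the data source and column names are hypothetical:

```python
import pandas as pd
import streamlit as st

# Hypothetical evaluation store: one row per evaluated conversation.
df = pd.read_parquet("evaluations.parquet")

st.title("Conversation Quality Analytics")

# Multi-dimensional filtering: combine strategic and tactical criteria.
phases = st.sidebar.multiselect("Journey phase", sorted(df["journey_phase"].unique()))
min_nps = st.sidebar.slider("Minimum NPS", 0.0, 1.0, 0.0)
filtered = df[df["nps"] >= min_nps]
if phases:
    filtered = filtered[filtered["journey_phase"].isin(phases)]

# Time-series view of strategic KPIs to measure improvement impact.
st.line_chart(filtered.set_index("date")[["nps", "user_effort", "goal_achievement"]])

# Conversation deep-dive table and stakeholder export.
st.dataframe(filtered)
st.download_button("Export CSV", filtered.to_csv(index=False), "evaluations.csv")
```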

LLM-Powered Analysis Modules

Three specialized engines provide automated insights:

  1. Misalignment Root Cause Analyzer: Identifies systematic failures in meeting user needs.
  2. Product Insights Engine: Discovers feature usage patterns and improvement opportunities.
  3. Customer Behavior Analyzer: Generates user insights based on preferences, actions, and cognitive states.

Lessons Learned

The Power of Dual-Mode Architecture

Separating real-time and analytical evaluation unlocked unexpected benefits:

  1. Performance Optimization: Online mode could prioritize speed over completeness.
  2. Depth vs. Speed Trade-offs: Offline mode enables expensive deep analysis.
  3. Different Stakeholder Needs: Real-time for agents, analytics for humans.
  4. Resource Allocation: Separate infrastructure scaling for each mode.

Structured Prompts as Configuration

Storing prompts in Langfuse rather than code provided crucial flexibility:

  1. Rapid Iteration: Prompt improvements deploy without service restarts.
  2. Domain Expert Access: Non-engineers can refine evaluation criteria.
  3. Version Control: Roll back problematic changes instantly.
  4. Provider Portability: Same prompts work across different LLMs.

Parallel Processing Complexity

Implementing concurrent evaluation required careful consideration (a minimal error-isolation sketch follows this list):

  1. Context Isolation: Each scorer maintains independent state.
  2. Error Boundaries: Individual scorer failures don't cascade.
  3. Result Aggregation: Careful handling of partial results.
  4. Resource Limits: Preventing LLM API rate limit exhaustion.
  5. Debugging Challenges: Comprehensive logging for concurrent flows.
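
A minimal sketch of the error-boundary and partial-aggregation approach, assuming asyncio-based scorers (names are illustrative):

```python
import asyncio


async def run_scorers_isolated(conversation: dict, scorers: dict) -> dict:
    """Run scorers concurrently while keeping failures contained.

    `scorers` maps a scorer name to an async callable; names here are illustrative.
    """
    names = list(scorers)
    outcomes = await asyncio.gather(
        *(scorers[name](conversation) for name in names),
        return_exceptions=True,  # error boundary: one failure doesn't cancel the rest
    )
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = repr(outcome)  # logged for debugging concurrent flows
        else:
            results[name] = outcome
    # Partial results remain useful: aggregation marks missing scorers explicitly.
    return {"results": results, "errors": errors}
```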

Next Steps

Ready to transform your conversational AI quality assurance from reactive manual processes to proactive automated intelligence? Let's discuss how multi-dimensional evaluation can unlock insights hiding in your conversation data.


This case study demonstrates how sophisticated evaluation architectures can liberate engineering teams from manual drudgery while providing unprecedented visibility into conversational AI quality. Every design decision focused on creating actionable intelligence from raw conversation data.

  • Let's have a virtual coffee together!


    Want to explore how automated conversation analysis can transform your AI quality assurance? Schedule a free 30-minute strategy session to discuss your evaluation challenges.

    Book Free Intro Call