Sage Group Conversational RAG: From POC to 90% Accuracy Through Agentic Architecture

Case Study Summary

Client: Sage Group
Website: Sage
Industry: Software Development

Impact Metrics:

  • Answer Accuracy: <20% → 90%
  • Response Time: 18s → <7s (61% reduction)
  • Query Volume: 10,000/week across 6 products
  • Infrastructure Savings: 95% reduction ($60k → $2k/month)
  • Customer Satisfaction: 89% positive feedback
  • Human Escalations: 80% → <10% of queries

Sage Group's ambitious vision to revolutionize customer support through AI required transforming a failing proof-of-concept into an enterprise-grade conversational platform. By implementing an innovative "reasoning on rails" architecture, we delivered immediate production value while establishing the foundation for their multi-product AI support ecosystem.

Executive Summary

Transformed Sage Group's proof-of-concept RAG implementation into a production-grade, multi-tenant conversational AI platform designed to serve their 40+ accounting software products. By evolving from vanilla RAG to an innovative "reasoning on rails" agentic architecture, we achieved 90% answer accuracy, reduced response times by 61%, and cut human escalations from 80% of queries to under 10%, all while meeting enterprise security requirements and reducing infrastructure costs by 95%.

The Challenge

When I joined Sage Group in April 2024 as Principal Generative AI Consultant, their AWS consultant-led semantic search POC faced significant limitations. The system, a basic RAG pipeline using Excel Q&A files, had not passed security reviews and achieved less than 20% answer accuracy. With 80% of customer queries requiring escalation to human agents, Sage needed a production-ready solution to deliver on their AI-powered customer support vision across their 40+ product portfolio.

The technical gaps were substantial: no document ingestion pipeline, no security guardrails, no relevance filtering, and each product requiring its own $1,500/month OpenSearch instance. As the primary AI engineer on the project, I implemented over 90% of the non-DevOps engineering work to address these challenges.

Why It Matters

Sage Group serves millions of accounting professionals globally through products like Sage Intacct and Sage 50. Their customers need instant, accurate answers about complex accounting software, from tax form specifications to troubleshooting integration issues. Every unsuccessful query meant frustrated customers, overwhelmed support teams, and lost productivity during critical business periods.

The POC limitations represented more than technical shortcomings. They prevented Sage from meeting market expectations in an industry rapidly adopting AI-powered support. With enterprise clients demanding low-latency responses and stringent security compliance, delivering a production-ready solution was essential.

Technical Approach

I iteratively redesigned the architecture, fracturing the monolithic RAG pattern into specialized, composable cognitive nodes, each handling atomic tasks within a larger agentic workflow.

Progressive Architecture Evolution

  1. Foundation Building: Implemented proper document ingestion, parsing, and chunking pipelines to replace error-prone Excel files (a chunking sketch follows this list)
  2. Security-First Design: Added query preprocessing with semantic validation and injection attack prevention
  3. Hybrid Retrieval: Combined lexical and semantic search with metadata filtering for multi-tenancy
  4. Intelligent Generation: Created intent extraction and guardrail reasoning workflows
  5. Relevance Optimization: Implemented LLM-based relevance classification and re-ranking
  6. Personalized Responses: Added query type classification for context-appropriate answers
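
The chunking stage of that foundation, as a minimal sketch. The fixed-window strategy, sizes, and field names here are illustrative assumptions rather than Sage's actual pipeline; the key point is that tenant metadata is attached to every chunk at ingestion time:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    product_id: str  # tenant metadata attached at ingestion time
    text: str

def chunk_document(doc_id: str, product_id: str, text: str,
                   max_chars: int = 1200, overlap: int = 200) -> list[Chunk]:
    """Split a parsed document into overlapping fixed-size chunks,
    carrying the tenant metadata that retrieval later filters on."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(Chunk(doc_id, product_id, text[start:end]))
        if end == len(text):
            break
        start = end - overlap  # overlap softens mid-sentence cuts
    return chunks
```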

Key Implementation Decisions

1. "Reasoning on Rails" Architecture

Instead of letting LLMs freely determine execution flow, I created a controlled agentic pattern where models make decisions through structured outputs, but the workflow orchestration remains deterministic. Each decision includes mandatory justification fields, creating complete auditability while maintaining security guarantees.
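
A minimal sketch of one railed decision point, assuming Pydantic v2 for schema enforcement; the route names and logging stub are illustrative, not the production schema:

```python
from enum import Enum
from pydantic import BaseModel

class Route(str, Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    REFUSE = "refuse"

class GuardrailDecision(BaseModel):
    route: Route
    justification: str  # mandatory: every decision explains itself

def audit_log(decision: GuardrailDecision) -> None:
    # stand-in for the DynamoDB/Langfuse writes described later
    print(decision.model_dump_json())

def next_step(llm_output: str) -> Route:
    """The LLM proposes; the program disposes. Control flow is a
    deterministic dispatch on a validated enum, never on free text."""
    try:
        decision = GuardrailDecision.model_validate_json(llm_output)
    except ValueError:  # pydantic.ValidationError subclasses ValueError
        return Route.REFUSE  # malformed output fails closed
    audit_log(decision)
    return decision.route
```

Because the model can only select one of the predeclared routes, a prompt injection can at worst pick the wrong rail; it cannot invent a new execution path.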

2. Multi-Tenant Cost Optimization

By implementing metadata-based filtering, we consolidated from 40+ individual OpenSearch instances to a single multi-tenant cluster. Product-specific queries filter documents at the search level, maintaining isolation while reducing monthly costs from $60,000+ to under $2,000.
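
A sketch of how such a filter might look, assuming the opensearch-py client, a `product_id` keyword field, and the OpenSearch k-NN plugin; the index name, host, and clause weighting are hypothetical:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "search.internal", "port": 9200}])

def search_tenant(query_text: str, query_vector: list[float],
                  product_id: str, k: int = 10) -> dict:
    """Hybrid lexical + semantic retrieval, with tenant isolation
    enforced as a hard filter inside the search itself."""
    body = {
        "size": k,
        "query": {
            "bool": {
                # tenancy boundary: only this product's documents match
                "filter": [{"term": {"product_id": product_id}}],
                "should": [
                    {"match": {"text": query_text}},              # lexical
                    {"knn": {"embedding": {"vector": query_vector,
                                           "k": k}}},             # semantic
                ],
            }
        },
    }
    return client.search(index="support-docs", body=body)
```

One cluster then serves every tenant, and onboarding a new product becomes a metadata change rather than another $1,500/month instance.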

3. Performance Through Concurrency

The complex workflow initially took 18 seconds per query. By identifying independent cognitive tasks and implementing parallel execution paths, we achieved sub-7-second response times without sacrificing accuracy, meeting enterprise SLAs while maintaining quality.
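
The shape of that change, sketched with asyncio and placeholder nodes (the real nodes call Bedrock; these stubs just make the sketch runnable):

```python
import asyncio

REFUSAL = "Sorry, I can't help with that."

# Illustrative stand-ins for the real cognitive nodes.
async def guardrail_check(q): return True
async def extract_intent(q): return "how_to"
async def classify_query_type(q): return "troubleshooting"
async def retrieve_documents(q): return ["doc1", "doc2"]
async def generate_answer(q, intent, qtype, docs): return f"answer from {docs}"

async def handle_query(query: str) -> str:
    # Serial gate: security checks always run first, in order.
    if not await guardrail_check(query):
        return REFUSAL
    # Independent nodes fan out concurrently; total latency becomes
    # the slowest branch, not the sum of all branches.
    intent, qtype, docs = await asyncio.gather(
        extract_intent(query),
        classify_query_type(query),
        retrieve_documents(query),
    )
    return await generate_answer(query, intent, qtype, docs)

print(asyncio.run(handle_query("How do I file a 1099?")))
```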

Results & Impact

Key Metrics

  • Answer Accuracy: <20% → 90%
  • Response Time: 18s → <7s (61% reduction)
  • Query Volume: 10,000/week across 6 products
  • Infrastructure Savings: 95% reduction ($60k → $2k/month)
  • Customer Satisfaction: 89% positive feedback
  • Human Escalations: 80% → <10% of queries

The platform now serves production traffic for 6 Sage products with additional products in the onboarding pipeline. 89% of users report the AI assistant significantly improves their experience compared to traditional support channels.

Technical Deep-Dive

Reasoning on Rails: Structured Control Flow

The architecture treats LLM outputs as structured decisions within a deterministic program flow. Each cognitive node in the workflow operates independently, processing specific aspects of the query through carefully designed prompts that enforce structured XML or JSON outputs.

[Figure: cognitive architecture diagram — the workflow orchestrator coordinating serial guardrail nodes and parallel classification nodes]

The workflow orchestrator manages execution order while allowing parallel processing where dependencies permit. Critical security and validation nodes execute serially before any response generation, while independent classification tasks run concurrently.
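
A toy dependency-driven orchestrator in the same spirit, assuming each node is an async function that reads and writes a shared context; real scheduling, timeouts, and retries are omitted:

```python
import asyncio

async def run_dag(nodes: dict, deps: dict, ctx: dict) -> dict:
    """Start every node as a task; each task first awaits its
    prerequisites, so serial chains stay serial and independent
    nodes overlap automatically."""
    tasks: dict[str, asyncio.Task] = {}

    async def run(name: str) -> None:
        await asyncio.gather(*(tasks[d] for d in deps.get(name, ())))
        ctx[name] = await nodes[name](ctx)

    for name in nodes:  # no await here, so `tasks` fills before any run
        tasks[name] = asyncio.create_task(run(name))
    await asyncio.gather(*tasks.values())
    return ctx

# Example wiring: guardrails gate everything, intent extraction and
# retrieval then overlap, and generation waits on both:
#   deps = {"guardrail": set(), "intent": {"guardrail"},
#           "retrieve": {"guardrail"}, "generate": {"intent", "retrieve"}}
```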

This pattern delivers several technical advantages:

Deterministic Security: Guardrail checks always execute in the correct order, preventing bypass through prompt injection or unexpected model behavior.

Granular Observability: Each atomic decision logs to DynamoDB and Langfuse with full context, enabling debugging of complex multi-step reasoning chains.

Composable Architecture: New cognitive capabilities integrate as additional nodes without modifying existing workflow logic.

Fault Isolation: Individual node failures trigger specific error handling rather than cascading system failures.

Concurrency-Driven Agentic Workflow

The performance optimization from 18 seconds to under 7 seconds required analyzing the directed acyclic graph (DAG) of cognitive dependencies and maximizing parallel execution.

Key optimizations included:

Persistent Connection Pools: ECS Fargate maintains long-lived connections to DynamoDB, OpenSearch, and Bedrock, eliminating per-request connection overhead.

Intelligent Caching: Frequently asked questions bypass several workflow stages through response caching, while embedding caches prevent redundant vector generation.

Early Termination: Inappropriate or out-of-domain queries exit immediately after preprocessing, avoiding unnecessary downstream processing.

Resource Allocation: Fargate tasks configured with 4GB memory and 2 vCPUs handle concurrent LLM calls without memory pressure or CPU throttling.
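
Two of those optimizations in miniature, assuming boto3 and an in-process embedding cache (a production system would more likely use a shared cache, and the Titan model ID is an assumption):

```python
import functools
import json
import boto3

# Created once per Fargate task, not per request: boto3 clients keep
# persistent HTTP connection pools to AWS services.
bedrock = boto3.client("bedrock-runtime")

@functools.lru_cache(maxsize=4096)
def cached_embedding(text: str) -> tuple:
    """Embedding cache: identical inputs never call Bedrock twice.
    Request/response shape assumes Titan text embeddings."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return tuple(json.loads(response["body"].read())["embedding"])
```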

The concurrent architecture required careful handling of shared state and error propagation. Each parallel branch maintains isolated context, with results aggregated only at synchronization points. This prevents race conditions while maximizing throughput.
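
One way to get that isolation at a synchronization point, sketched with asyncio's `return_exceptions`; the error policy here is illustrative:

```python
import asyncio
import logging

async def fan_out(query: str, branches: list) -> list:
    """Run branches concurrently, each with its own context dict, and
    aggregate only here: one failed node degrades the answer instead
    of crashing the whole request."""
    results = await asyncio.gather(
        *(branch({"query": query}) for branch in branches),
        return_exceptions=True,
    )
    ok = []
    for branch, result in zip(branches, results):
        if isinstance(result, Exception):
            # fault isolation: log and continue without this branch
            logging.warning("node %s failed: %s", branch.__name__, result)
        else:
            ok.append(result)
    return ok
```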

Lessons Learned

From POC to Production: The Reality Gap

Proofs of concept lie by omission. The vanilla RAG that worked in demos failed in production because:

  1. Security isn't optional: Enterprise deployments require comprehensive guardrails from day one.
  2. Specialized domains break assumptions: Generic embeddings and re-rankers fail on dense, narrow, technical content.
  3. Users are unpredictable: Real queries include typos, commands, inappropriate requests, and edge cases.
  4. Performance compounds: Each workflow step adds latency; only careful orchestration keeps it manageable.
  5. Observability is critical: Without detailed tracing, debugging complex workflows becomes impossible.

Security-First Patterns for Enterprise AI

Building AI systems for accounting software taught crucial security lessons:

  1. Input Validation is Paramount: Every user input must pass semantic coherence and security checks before processing.
  2. Structured Outputs Enforce Control: Free-form LLM responses are security risks; XML/JSON outputs with validation schemas provide safety (see the sketch after this list).
  3. Audit Everything: Store every decision, classification, and transformation for compliance and debugging.
  4. Fail Closed, Not Open: When uncertain, refuse to answer rather than risk inappropriate responses.
  5. Defense in Depth: Multiple guardrail layers catch what individual checks miss.
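
Points 2 through 4 combined into one last-mile check, as a sketch assuming the jsonschema library; the schema, confidence floor, and audit stub are illustrative:

```python
import json
from jsonschema import validate, ValidationError

ANSWER_SCHEMA = {
    "type": "object",
    "required": ["answer", "confidence", "justification"],
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "justification": {"type": "string"},
    },
}

REFUSAL = "Sorry, I can't help with that request."

def audit(event: str, detail: object) -> None:
    print(event)  # stand-in for the DynamoDB/Langfuse audit trail

def release_response(raw_llm_output: str) -> str:
    """Final guardrail layer before anything reaches the user:
    validate the structured output and fail closed on any doubt."""
    try:
        payload = json.loads(raw_llm_output)
        validate(payload, ANSWER_SCHEMA)       # layer: schema check
    except (json.JSONDecodeError, ValidationError):
        audit("malformed_output", raw_llm_output)
        return REFUSAL                         # fail closed, not open
    if payload["confidence"] < 0.7:            # layer: confidence floor
        audit("low_confidence", payload)
        return REFUSAL
    audit("released", payload)                 # audit everything
    return payload["answer"]
```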

These patterns enabled meeting security requirements that the original implementation had not addressed.

Next Steps

Ready to transform your AI proof-of-concept into a production-grade system that delivers real business value? Let's discuss how architectural innovation and engineering rigor can solve your most challenging AI problems.


This case study demonstrates how the right expertise can transform early-stage AI initiatives into enterprise-grade solutions. From security-first design to innovative architectural patterns, every decision focused on creating sustainable, scalable value.

Let's have a virtual coffee together!

Want to see if we're a match? Let's have a chat and find out. Schedule a free 30-minute strategy session to discuss your AI challenges and explore how we can work together.

Book Free Intro Call