October 2025
Building a RAG System for
Customer Support - L1
Agentic AI
Repetitive Query Handling
Support teams face many repeated customer queries, which slow
down response times and increase workload.
Inconsistent Responses
Manual answering leads to inconsistent information, which can
negatively impact customer satisfaction and trust in support teams.
Need for Intelligent Systems
Accurate, intelligent information retrieval systems are needed to
streamline support and ensure reliable, consistent communication.
Problem Statement
Diagram: an AI Agent providing Level 1 (L1) coverage for the Customer Support System.
Training From Scratch
Building an LLM from the ground up is rare due to vast data needs and intensive
tuning, making it resource-heavy.
Domain Fine-Tuning
Fine-tuning adapts existing models with domain-specific data, lowering costs but
relying on high-quality examples.
Retrieval-Augmented Generation
RAG combines existing models with external knowledge sources to enhance
responses and flexibility.
Out-of-the-Box Solutions
Pre-built models offer rapid deployment and ease of use but are limited in
customization.
LLM Design Hierarchy
Retrieval-Augmented Generation (RAG) combines:
o Ingestion: loads and preprocesses documents into the vector database for indexing.
o Retrieval: fetches relevant documents from the vector database.
o Generation: uses LLMs to create accurate, context-aware answers.
This ensures answers are grounded in trusted internal data.
What is RAG?
Combining Retrieval and Generation
RAG merges document retrieval and text generation, allowing
language models (LLMs) to access and use relevant information on
demand.
Context from Vector Databases
RAG fetches relevant documents from a vector database, providing
valuable context for generating precise and informed answers.
Grounded and Trustworthy Output
By grounding answers in trusted internal data, RAG increases the
reliability and trustworthiness of language model outputs.
What is RAG?
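To make the retrieval and generation steps concrete, here is a minimal sketch of the query path in this kind of stack. It assumes Node.js 18+ (for the built-in fetch), Ollama and Qdrant running on their default local ports, and a Qdrant collection named support_docs whose payload carries the chunk text; none of these names are taken from the actual repository.

```typescript
// Minimal query-path sketch: embed the question, retrieve context from Qdrant,
// and generate a grounded answer with a local Ollama model.
// Assumes Ollama on :11434 and Qdrant on :6333 (their defaults); the
// collection name "support_docs" and payload shape are placeholders.

const OLLAMA = "http://localhost:11434";
const QDRANT = "http://localhost:6333";
const COLLECTION = "support_docs";

async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

async function retrieve(vector: number[], topK = 5) {
  const res = await fetch(`${QDRANT}/collections/${COLLECTION}/points/search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ vector, limit: topK, with_payload: true }),
  });
  // Each hit carries a similarity score and the original chunk in its payload.
  return (await res.json()).result as { score: number; payload: { text: string } }[];
}

async function generate(question: string, context: string): Promise<string> {
  const prompt = `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
  const res = await fetch(`${OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.2", prompt, stream: false }),
  });
  return (await res.json()).response;
}

async function answer(question: string) {
  const hits = await retrieve(await embed(question));
  const context = hits.map((h) => h.payload.text).join("\n---\n");
  return { answer: await generate(question, context), sources: hits };
}

answer("How do I reset my password?").then(console.log);
```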
Data Collection and Indexing
Content is extracted from local .md and .txt CMS sources, then
processed by a custom Node.js indexer using LlamaIndex.
Vector Database and Embeddings
Processed data is stored in Qdrant, a vector database, using
embeddings generated by the nomic-embed-text model.
API Orchestration and User Interface
The backend is handled by a Node.js REST API, while Gradio
powers the interactive frontend for users.
Architecture Overview
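As a rough illustration of the ingestion path described above (not the project's actual LlamaIndex-based indexer), the sketch below walks a folder of local .md and .txt files, embeds each chunk with nomic-embed-text via Ollama, and upserts the vectors into a Qdrant collection. The folder path, chunking strategy, and collection name are placeholder assumptions.

```typescript
// Ingestion sketch (illustrative, not the project's LlamaIndex-based indexer):
// read local .md/.txt files, split into chunks, embed, and upsert into Qdrant.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const OLLAMA = "http://localhost:11434";
const QDRANT = "http://localhost:6333";
const COLLECTION = "support_docs"; // placeholder name

async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

// Naive fixed-size chunking; a real indexer would split on headings/sentences.
const chunk = (text: string, size = 1000): string[] =>
  text.match(new RegExp(`[\\s\\S]{1,${size}}`, "g")) ?? [];

async function ingest(dir: string) {
  const files = readdirSync(dir).filter((f) => f.endsWith(".md") || f.endsWith(".txt"));
  let id = 0;
  for (const file of files) {
    const text = readFileSync(join(dir, file), "utf8");
    for (const part of chunk(text)) {
      const vector = await embed(part);
      await fetch(`${QDRANT}/collections/${COLLECTION}/points`, {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          points: [{ id: id++, vector, payload: { text: part, filename: file } }],
        }),
      });
    }
  }
}

ingest("./cms_exports").catch(console.error);
```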
Architecture Overview
LLM models
• Data Source: Local .md & .txt files (CMS extracts)
• Indexing: Custom Node.js indexer + LlamaIndex
• Vector Database: Qdrant
Ollama is a local runtime for large language models (LLMs) 🧠
that makes it easy to download, run, and manage open-source
LLMs directly on your machine — without needing cloud APIs.
• 🔒 Privacy & control — since everything runs locally, no data
leaves your system.
• Models (Ollama):
• Embeddings: nomic-embed-text (274MB)
• Default LLM: llama3.2
• Supported:
▪ llama3.1 (8b), llama3.2 (3b),
▪ gpt-oss (20b),
▪ gemma3 (4b, 12b)
• Backend: Node.js server (REST API + orchestration)
• Frontend/UI: Gradio
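One way to read this slide is as a single configuration surface. The sketch below gathers the listed pieces into one object; the URLs are simply the tools' default local ports and the data directory is a placeholder, not values confirmed from the repository.

```typescript
// Illustrative configuration for the stack on this slide; the URLs are the
// tools' default local ports (Ollama 11434, Qdrant 6333), not confirmed values.
interface RagConfig {
  dataDir: string;          // local .md & .txt CMS extracts
  ollamaUrl: string;        // local LLM runtime
  qdrantUrl: string;        // vector database
  embeddingModel: string;
  defaultLlm: string;
  supportedLlms: string[];
}

const config: RagConfig = {
  dataDir: "./cms_exports", // placeholder path
  ollamaUrl: "http://localhost:11434",
  qdrantUrl: "http://localhost:6333",
  embeddingModel: "nomic-embed-text",
  defaultLlm: "llama3.2",
  supportedLlms: ["llama3.1", "llama3.2", "gpt-oss", "gemma3"],
};

export default config;
```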
All Models Run Locally
Running models locally removes API dependencies, providing users
with greater privacy and control over data processing.
Flexible Model Support
Ollama supports multiple LLM families, such as Llama, Gemma, and
GPT-OSS, enhancing model flexibility.
GPT-5 variants are used via the OpenAI platform.
Enhanced Privacy and Performance
Local execution delivers improved privacy, cost efficiency, and faster
response times compared to cloud-based solutions.
Local Model Execution with Ollama
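For illustration, here is a minimal sketch of talking to the local Ollama runtime over its HTTP API on the default port 11434: listing the locally installed models and generating a completion with one of them. The example prompt is purely a placeholder.

```typescript
// Sketch: list locally installed Ollama models and run a prompt against one.
// Uses Ollama's local HTTP API (default port 11434); no data leaves the machine.
const OLLAMA = "http://localhost:11434";

async function listModels(): Promise<string[]> {
  const res = await fetch(`${OLLAMA}/api/tags`);
  const { models } = await res.json();
  return models.map((m: { name: string }) => m.name); // e.g. "llama3.2:latest"
}

async function ask(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${OLLAMA}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  return (await res.json()).response;
}

listModels()
  .then((models) => ask(models[0] ?? "llama3.2", "Summarize our return policy in one sentence."))
  .then(console.log);
```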
Chroma: Developer-Friendly Simplicity
Chroma is a lightweight, developer-focused vector database designed for rapid
prototyping and smaller-scale applications, with a simple API and quick setup.
Pinecone: Managed Scalability
Pinecone is a managed vector database offering high scalability and performance, but it
comes with a higher operational cost.
Weaviate: Feature-Rich and Modular
Weaviate provides a flexible and modular platform with rich features suited for various
vector search needs.
Milvus: Enterprise Clustering
Milvus supports enterprise-grade clustering, making it suitable for large-scale
deployments and robust performance requirements.
Qdrant: Efficient Open-Source
Qdrant is an open-source, efficient vector database ideal for self-hosted environments
and the preferred option for our needs.
Architecture Overview
Vector Database Choices
Open Source and Cost-Effective
Qdrant provides an accessible solution for vector similarity search
with no licensing costs, making it budget-friendly for all users.
High Performance Search
It enables rapid and accurate searches even on massive datasets,
ensuring excellent performance for demanding applications.
Easy Integration and Scalability
Qdrant integrates seamlessly with popular ecosystems like Node.js
and Python, and scales flexibly to support growing data and user
needs.
Architecture Overview
Why Choose Qdrant?
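As a small sketch of that Node.js integration, the snippet below creates a Qdrant collection over its REST API on the default port 6333. The vector size of 768 assumes nomic-embed-text's usual output dimensionality and should be checked against the embeddings actually produced; the collection name is a placeholder.

```typescript
// Sketch: create a Qdrant collection for the document embeddings over its REST
// API (default port 6333). Vector size 768 assumes nomic-embed-text's default
// dimensionality; verify against the embeddings actually produced.
const QDRANT = "http://localhost:6333";
const COLLECTION = "support_docs"; // placeholder name

async function createCollection() {
  await fetch(`${QDRANT}/collections/${COLLECTION}`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ vectors: { size: 768, distance: "Cosine" } }),
  });
}

async function collectionInfo() {
  const res = await fetch(`${QDRANT}/collections/${COLLECTION}`);
  return (await res.json()).result; // points count, status, config, ...
}

createCollection().then(collectionInfo).then(console.log);
```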
Backend Architecture
Core of the system: Modular document search with Elasticsearch + Qdrant
Indexes .md & .txt documents for search and RAG
Provides a RESTful API to support search and RAG workflows
Key Features:
• Exact & fuzzy search via Elasticsearch
• Vector similarity search & RAG responses via Qdrant
• REST API endpoints for search, update, and response generation
• Optional Gradio UI for interactive access
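A minimal sketch of what such a REST surface could look like, assuming an Express server (the repository's actual framework, route names, and index names are not shown in the slides): a /search route for exact or fuzzy keyword search against Elasticsearch on its default port 9200, and a /rag placeholder for the vector + LLM flow sketched earlier.

```typescript
// Sketch of the REST surface described above, assuming an Express server.
// /search -> exact or fuzzy keyword search via Elasticsearch (default port 9200)
// /rag    -> placeholder for the Qdrant + Ollama flow sketched earlier
import express from "express";

const app = express();
app.use(express.json());
const ES = "http://localhost:9200";
const INDEX = "support_docs"; // placeholder index name

app.get("/search", async (req, res) => {
  const q = String(req.query.q ?? "");
  const fuzzy = req.query.type === "fuzzy";
  const esRes = await fetch(`${ES}/${INDEX}/_search`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: { match: { content: fuzzy ? { query: q, fuzziness: "AUTO" } : { query: q } } },
    }),
  });
  const body = await esRes.json();
  res.json(body.hits.hits.map((h: any) => ({ id: h._id, score: h._score, ...h._source })));
});

app.post("/rag", async (req, res) => {
  // Here the backend would embed req.body.question, query Qdrant, and call
  // Ollama, as in the earlier query-path sketch.
  res.json({ answer: "...", sources: [] });
});

app.listen(3000, () => console.log("search/RAG API on :3000"));
```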
Customizable Input Options
Users can input questions and choose
context sources, enabling tailored and
flexible interactions for various needs.
LLM Model Selection
A dropdown lets users select different
language models, clearly displaying which
model generated each response.
Transparent Source Listings
The UI shows source information with
relevance scores, helping users understand
and trust the provided answers.
Frontend & UI
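The Gradio frontend itself is Python, but the exchange between the UI and the backend can be illustrated with the request and response shapes below. The endpoint, field names, and example values are assumptions drawn from the screenshots described in the Editor's Notes, not confirmed API definitions.

```typescript
// Illustrative request/response shapes for the UI's call to the backend
// (endpoint name, field names, and the gpt-5-mini option are assumptions
// drawn from the screenshots, not confirmed API definitions).
interface RagRequest {
  question: string;
  contextSources: number;  // how many retrieved chunks to pass as context
  model: string;           // e.g. "llama3.2" locally or "gpt-5-mini" online
  showModelUsed: boolean;
}

interface RagResponse {
  answer: string;
  modelUsed: string;
  sources: { filename: string; score: number }[]; // shown with relevance scores
}

async function askBackend(req: RagRequest): Promise<RagResponse> {
  const res = await fetch("http://localhost:3000/rag", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  return res.json();
}

askBackend({
  question: "Can MicroPython play an mp3 on an ESP32 board?",
  contextSources: 10,
  model: "llama3.2",
  showModelUsed: true,
}).then((r) => console.log(r.modelUsed, r.answer, r.sources));
```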
• Improved Accuracy:
Responses are grounded in .md and .txt files extracted from the CMS, ensuring that answers are consistent and based on verified information.
• Faster Response Times:
By retrieving relevant context and running models locally, the system provides fast answers, reducing wait times for both support agents and end-users.
• Operational Efficiency:
Automates repetitive queries and reduces manual effort, allowing support teams to focus on more complex issues instead of answering FAQs.
• Privacy and Cost Control:
Running models locally with Ollama means no data leaves the environment and external API costs are avoided. This setup ensures compliance with data
policies and predictable costs.
• Scalability and Flexibility:
The modular design supports adding new data sources, expanding the knowledge base, and experimenting with multiple LLMs, making it future-proof and
adaptable.
Key Solution Benefits
Content / CMS
Automated Content Synchronization
Daily sync ensures the .md and .txt files are always up-to-date,
enhancing content reliability and workflow efficiency (a minimal
scheduling sketch follows this slide).
Multi-Language Accessibility
Supporting multiple languages increases accessibility,
allowing users worldwide to enjoy an improved experience.
Tailored Insights and Interactivity
Specific models deliver precise insights; a chatbot and an analytics
dashboard support interactive, data-driven decision-making.
• Chatbot Integration: Enables end-users or support
agents to interact with the knowledge base in real time,
asking follow-up questions and drilling deeper into topics.
• Analytics Dashboard: Provides visibility into query
trends, frequent issues, and system performance. This
makes the support workflow data-driven, helping teams
continuously improve processes and anticipate customer
needs.
Unlocking Future
Potential
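As one possible shape for the daily content sync mentioned above (purely illustrative; the project might instead use cron, a CI job, or another scheduler), a small Node.js sketch that re-runs an indexer script every 24 hours:

```typescript
// Illustrative daily sync: re-run the indexer once a day so the .md/.txt
// extracts stay current. A production setup might use cron or a CI job instead.
import { execFile } from "node:child_process";

const DAY_MS = 24 * 60 * 60 * 1000;

function syncKnowledgeBase() {
  // Placeholder command; the real indexer entry point lives in the repository.
  execFile("node", ["indexer.js", "--source", "./cms_exports"], (err, stdout) => {
    if (err) console.error("sync failed:", err);
    else console.log("knowledge base updated:", stdout.trim());
  });
}

syncKnowledgeBase();                    // run once at startup
setInterval(syncKnowledgeBase, DAY_MS); // then once every 24 hours
```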
Thank you for your presence.
Bogdan Mustata
Software Architect | Cloud & AI
bmustata@yahoo.com
GitHub Repo: https://github.com/bmustata/rag_exploration
Extra (slides 17-19): application screenshots of the Enhanced Document Search & RAG Explorer, described in the Editor's Notes below.

Editor's Notes

  • #17 Screen Explanation
    The screen shows a web application called "Enhanced Document Search & RAG Explorer", designed to search a knowledge base using multiple retrieval methods.
    1. Header: A large title, "Enhanced Document Search & RAG Explorer", with a description explaining that the tool searches the knowledge base using Elasticsearch, RAG (Retrieval-Augmented Generation), and LLM responses. The knowledge source shown is micropython.
    2. Navigation Tabs: Elasticsearch Search (currently selected), RAG Vector Search, and RAG LLM Response. These let the user choose which search method to use.
    3. Search Controls: Inside the "Traditional Elasticsearch Search" section there is a Search Query input box containing `esp32`, a Search Type selector with two options (normal and fuzzy, with fuzzy selected), and two buttons: Search (orange, prominent) and Update Knowledge Base (grey).
    4. Search Results: A message shows that the system found 37 documents. The first result displays a header link "8.2 MicroPython tutorial for ESP32", metadata (Type: micropython, Score: 8.41, ID: b7e838da-9831-406e-b6ad-a6b2a7725229), and a content preview.
    5. Document Content Preview: Below the metadata, the interface shows an excerpt of the document, including "8.2 MicroPython tutorial for ESP32", "8.2.1 Getting started with MicroPython on the ESP32", and a Requirements section describing the needed hardware (an ESP32 board and its GPIO pins).
    Summary: The screen is a document search interface showing the search input fields, a fuzzy Elasticsearch search for the keyword "esp32", the list of matched documents, and the beginning of a MicroPython tutorial as the top result, displayed in clean text sections resembling a knowledge-base browser.
  • #18 Screen Explanation – RAG Vector Search Page
    This screen displays the RAG Vector Search tab of the Enhanced Document Search & RAG Explorer, used to find semantically similar documents via vector embeddings.
    1. Header: The title "Enhanced Document Search & RAG Explorer" and a short description noting support for Elasticsearch search, RAG, and LLM-based responses. The knowledge source is micropython. The three tabs appear below, with RAG Vector Search highlighted.
    2. RAG Vector Similarity Search Section: Explains that the vector search retrieves semantically similar content based on embeddings. Inputs: a Search Query box containing `esp32` and a Top K Results field set to 10. Buttons: RAG Search (orange) and Update RAG Knowledge Base (grey).
    3. Search Results: A label reads "RAG Vector Search - Found 10 documents". The first result is titled "The ESP32 port also supports the machine.ADC API:", with metadata Score: 0.3755, Type: micropython, Filename: micropython-docs8-13.md. Its content preview shows a code snippet for the ADC API (ADC.atten(atten), equivalent to ADC.init(atten=atten); ADC.width(bits), equivalent to ADC.block().init(bits=bits)) plus notes on ESP32 chip resolution, supported ADC widths, and deinitializing the ADC driver.
    4. Additional Extracted Sections: The preview continues with more ESP32 documentation, including a list of ADC width constants per chip and section "8.13 Pulse Counter (pin pulse/edge counting)", which explains that the ESP32 provides up to 8 pulse counter peripherals that can detect rising/falling edges.
    Summary: This tab shows a vector-based semantic search for "esp32", the top 10 most similar documents, the first result with code snippets and ESP32 documentation content, and options to update the vector knowledge base. It focuses on meaning-based similarity rather than keyword matching.
  • #19 Screen Explanation – RAG LLM Response Page
    This screen shows the RAG LLM Response tab, where the user asks questions and gets AI-generated answers that use the knowledge base as context.
    1. Header: The title "Enhanced Document Search & RAG Explorer" with a description of searching the knowledge base via Elasticsearch or RAG. The knowledge source is micropython. The RAG LLM Response tab is selected.
    2. Question Input Section: Fields include Your Question ("Can micropython play a mp3 on an esp32 board?"), Context Sources set to 10 (how many RAG documents are used as context), an LLM Model dropdown showing "online gpt-5-mini", and an enabled "Show Model Used in Response" checkbox. Buttons: Generate Response (large, orange) and Update RAG Knowledge Base (grey).
    3. RAG LLM Response Section: Restates the question, shows the model used (gpt-5-mini), and summarizes the answer: MicroPython cannot play MP3 files directly because it does not include an MP3 decoder, so an ESP32 running standard MicroPython cannot decode MP3 out of the box. Three practical options are listed: (1) convert MP3 to 8-16-bit PCM/WAV on a computer and stream the WAV frames from MicroPython to a DAC or I2S peripheral; (2) attach an external hardware MP3 decoder chip via SPI/UART/I2S and let MicroPython send it the MP3 bytes; (3) custom-compile MicroPython firmware with a native MP3 decoder module (advanced). A conceptual example is mentioned: stream WAV in small chunks, send the data to the DAC/I2S, and play the audio.
    Summary: The page demonstrates asking a question with RAG + LLM using the MicroPython documentation as context, selecting the number of context sources and the LLM model, and receiving a detailed model-generated explanation that combines MicroPython's technical constraints with practical ways to achieve MP3 playback on ESP32 hardware.