Presented by:
Calvin Hendryx-Parker
CTO and Co-Founder
sixfeetup.com calvin@sixfeetup.com Fishers, Indiana
TODAY'S AGENDA
1. How to Pick an LLM
2. Making the Model Aware of your Data
3. Picking an Application Framework
4. Choosing a Deployment Strategy
5. Scaling your Application
6. Key Take-Aways
7. Q&A
HOW TO PICK AN LLM

Service or API?
Using an LLM service is simpler and quicker, but offers less custom control. An API allows deeper tuning, at the cost of more setup and maintenance. Both options come with terms of service that can limit usage, data handling, or intellectual property rights, so review them carefully before choosing.
Evaluating Trustworthiness
Trustworthy Leaderboard
Decoding Trust Overview
The Open Source Question
Open source LLMs vary in how much they share. “Open Weights”
means you can use the pretrained model files but not fully
reproduce it. Truly open means you also have the training code
and dataset, allowing you to replicate and retrain the model from
scratch.
Evaluating Performance
Open LLM Leaderboard
Chatbot Arena LLM Leaderboard
Artificial Analysis LLM Performance Leaderboard
MAKING THE MODEL AWARE OF YOUR DATA

Choose your embedding model carefully because different models have different training sets and specialties—some focus on plain English, others on multilingual content, and some even handle code or addresses. Using a default model often yields mediocre outcomes, so be sure to align your data type with the right embedding.

Pick an appropriate transformer model
Evaluate the shape of your data
Review your data
EMBEDDING MODEL LEADERBOARD
https://huggingface.co/spaces/mteb/leaderboard
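For a concrete feel of how much the choice matters, here is a minimal sketch that scores the same query against the same documents under two candidate embedding models. It assumes the sentence-transformers library; both model names are illustrative picks from the leaderboard, not endorsements.

```python
# Compare two candidate embedding models on a tiny sample of your own data.
from sentence_transformers import SentenceTransformer, util

docs = ["Invoice #123 is 30 days past due.", "Payment received for order #456."]
query = "Which invoices are unpaid?"

for name in ["all-MiniLM-L6-v2", "BAAI/bge-large-en-v1.5"]:
    model = SentenceTransformer(name)
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    query_vec = model.encode(query, normalize_embeddings=True)
    print(name, util.cos_sim(query_vec, doc_vecs))  # higher = closer match
```

Run the same comparison on a sample of your real corpus; how models rank on your data matters more than the leaderboard average.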
EMBEDDING MODEL SELECTION

Use Case              | Recommended Models                            | Rationale
General Purpose RAG   | OpenAI text-embedding-3-small + BAAI reranker | Cost-effective with 70%+ accuracy via re-ranking
Financial Analysis    | Voyage-finance-2                              | 22% higher precision on SEC filings than general models
Multilingual Search   | Vectara Boomerang / Cohere-embed-v3           | Superior cross-lingual NDCG scores
Self-hosted Solutions | Nomic-embed-text / BGE-large                  | Zero API fees; 71-75% accuracy on custom corpora
FINE-TUNING MAY OR MAY NOT BE THE BEST CHOICE

DATA HUNGRY
Fine-tuning demands large amounts of data because the model must see enough varied examples to learn new patterns without forgetting its previous knowledge. Insufficient or low-quality data can lead to overfitting or underperformance, emphasizing the need for abundant, relevant training material.

BEST FOR CLASSIFICATION
Fine-tuned models are tailored to the specific domain or
task, capturing relevant patterns with higher precision. This
specialization boosts accuracy while reducing computational
overhead during inference. By focusing on the data that
matters, fine-tuning can yield more efficient processing and
potentially lower inference costs.
COSTLY
Fine-tuning is resource-intensive, requiring specialized
hardware and extensive compute time. As model sizes
grow, so do energy and infrastructure costs. Gathering
and cleaning enough data adds extra expenses, and
multiple training runs further drive up the final bill.
TIME CONSUMING
Fine-tuning takes time because it requires multiple
training and validation cycles with extensive data. Each
iteration involves careful hyperparameter adjustments,
extending the overall process.
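To make the cost and time points concrete, here is a minimal classification fine-tune sketch with Hugging Face Transformers. The dataset, base model, and hyperparameters are illustrative placeholders; a real project repeats runs like this many times over far more data.

```python
# Toy classification fine-tune: even this small run needs labeled data,
# GPU time, and multiple epochs. All names below are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train = load_dataset("imdb", split="train[:2000]").map(
    lambda batch: tok(batch["text"], truncation=True, padding="max_length"),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()  # each hyperparameter tweak means another run like this
```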
FIGHTING HALLUCINATIONS AND HALF-TRUTHS WITH RAG

Vectors
RAG uses vector-based retrieval to fetch relevant context from a knowledge base. By relying on semantic similarities, it helps reduce hallucinations.
Data Storage
RAG’s data storage can use either open source solutions or commercial platforms. Each option comes with trade-offs in cost, scalability, and control.
RAG DATA STORAGE OPTIONS

Open Source | Commercial
ChromaDB    | Pinecone
pg_vector   | Snowflake
Elastic     | Weaviate
Mongo       | Milvus
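As a minimal sketch of the open source path, here is ChromaDB (one of the options above) used as an in-memory vector store; the documents and ids are placeholders.

```python
# Load a tiny knowledge base into ChromaDB. By default Chroma embeds the
# documents with its built-in embedding function; swap in your own to match
# the embedding-model advice earlier in the deck.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient for disk
collection = client.create_collection("kb")
collection.add(
    documents=["Refunds are issued within 30 days of purchase.",
               "Support is available 24/7 via chat."],
    ids=["doc-1", "doc-2"],
)
```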
Retrieval
RAG relies on converting the user query into an embedding, then searching a knowledge base for matching vectors. It combines the retrieved context with the prompt in the model's context window and may factor in recent chat history to ensure continuity and relevance.
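Continuing the ChromaDB sketch from the storage section (reusing its `collection`), retrieval then looks roughly like this; the prompt template is illustrative, not prescriptive.

```python
# Embed the user query, fetch the closest chunks, and splice them into the
# prompt that goes into the LLM's context window.
question = "What is the refund policy?"
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
# Prepend recent chat history here if the conversation needs continuity.
```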
RAG DEMO
Feedback Loop
RAG combats hallucinations by rating responses and refining them for accuracy. A/B testing model settings uncovers half-truths, leading to more reliable outputs.
USING GUARDRAILS WITH PRE-GENERATED ANSWERS

Pre-Generated Answers
A method where common responses are produced in advance and served quickly.

Less Creativity
PGA is less creative because it relies on fixed responses, offering limited adaptability. It can’t spontaneously craft new content or adjust to unique queries in real time.

More Control
PGA provides vetted, consistent responses, reducing the risk of unverified or unsuitable outputs. This tighter control is crucial for regulated markets like healthcare, where accuracy and compliance are paramount.
SEE BLOG POST
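A minimal sketch of the idea: semantic matching against vetted Q&A pairs, with fallback to the live model. The library, model, and threshold here are illustrative choices, not the approach from the blog post.

```python
# Serve a vetted, pre-generated answer when the query is close enough to a
# known question; otherwise signal that the live model should handle it.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
canned = {
    "How do I reset my password?": "Use the 'Forgot password' link on the sign-in page.",
    "What are your support hours?": "Support is available 24/7 via chat.",
}
questions = list(canned)
q_vecs = model.encode(questions, normalize_embeddings=True)

def answer(user_query: str, threshold: float = 0.8) -> str | None:
    vec = model.encode(user_query, normalize_embeddings=True)
    scores = util.cos_sim(vec, q_vecs)[0]
    best = int(scores.argmax())
    if float(scores[best]) >= threshold:
        return canned[questions[best]]  # vetted, pre-generated answer
    return None  # no safe match; route to the live model (with guardrails)
```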
HOW TO PICK AN APPLICATION FRAMEWORK

Agent Apps
LangChain, Haystack, and LlamaIndex are agent apps designed to orchestrate LLM operations. They simplify tasks like retrieval, data management, and chaining prompts, enabling more advanced AI-driven workflows.
Framework  | Type                | Focus                        | Strengths                               | Weaknesses                          | Best for
LangChain  | General-Purpose     | LLM Orchestration            | Flexible workflow design                | Steeper learning curve              | Customizable multi-step agent workflows
LlamaIndex | Data Framework      | RAG & Indexing               | Optimized for efficient data retrieval  | Limited to indexing/retrieval tasks | Domain-specific RAG pipelines
Haystack   | Search-Oriented NLP | Semantic Search              | Built-in document preprocessing         | Less flexible for non-search tasks  | Enterprise search applications
LangGraph  | Stateful Agents     | Complex Decision-Making      | Handles loops/human-in-the-loop         | Requires LangChain integration      | Dynamic customer support/approval systems
CrewAI     | Multi-Agents        | Collaborative Task Execution | Role-based agent collaboration          | Early-stage tooling                 | Research/data analysis teams
Conversational AI Frameworks
Simple REPL and WebUI wrappers (like ChainLit) are frameworks for quickly testing, iterating, and deploying conversational AI. They provide user-friendly interfaces and straightforward setups, making it easier to refine prompts and manage dialogue flows.
Getting to First Principles
Do you even need a framework? Are you solving a unique
problem or just following a trend? Sometimes a simpler,
framework-less approach can be more flexible and transparent.
By stripping down to first principles and building only what you
really need, you gain control and avoid unnecessary overhead.
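For comparison, the framework-less baseline is small. Here is a minimal sketch using the official openai client; the model name is illustrative.

```python
# Direct provider call with hand-rolled prompt assembly: no orchestration
# layer, full visibility into every token that goes over the wire.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [
    {"role": "system", "content": "You are a concise enterprise assistant."},
    {"role": "user", "content": "Summarize retrieval-augmented generation."},
]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```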
HOW TO CHOOSE A DEPLOYMENT STRATEGY
Service Proxy
A service proxy approach (like LiteLLM) acts as a layer between
your app and the LLM provider, granting granular control over
data handling and API usage. It helps protect sensitive data and
manage costs through features like encryption, caching, and
rate limiting.
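A minimal sketch of the pattern, assuming a LiteLLM proxy already running locally (for example via `litellm --config config.yaml`); the port, key, and model alias are illustrative.

```python
# The app talks to the proxy's OpenAI-compatible endpoint instead of the
# provider; the proxy applies routing, caching, and rate limits centrally.
from openai import OpenAI

client = OpenAI(api_key="proxy-key", base_url="http://localhost:4000")
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # alias resolved by the proxy's model list
    messages=[{"role": "user", "content": "Ping through the proxy."}],
)
print(resp.choices[0].message.content)
```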
On-Premises
On-premises deployments give you full control of your data but
demand GPUs and skilled setup. Popular inference engines like
Ollama, GPT4All, vLLM, and exo provide private hosting, but
the hardware and maintenance costs can be significant.
You will need more RAM and VRAM than you think
Additional networking is needed, not just storage and compute.
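A minimal sketch of calling a locally hosted model through Ollama's REST API; it assumes a server on the default port with a model already pulled (e.g. `ollama pull llama3`), and the model name is illustrative.

```python
# Query a local Ollama server; data never leaves your infrastructure.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why host LLMs on-prem?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```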
Cloud Options
Self-Hosted
Managed Hosting
CLOUD OPTIONS

Self-Hosted | Hosted Models        | Endpoints as a Service
EKS         | Azure hosted ChatGPT | Bedrock
AKS         | Fireworks            | HuggingFace Spaces
GKE         | HuggingFace Endpoint | Azure ML
Etc.        | Together.ai          | Google Cloud AI
            | Replicate            | Etc.
            | OpenRouter           |
YOUR DATA IN THE CLOUD

What terms have you signed up for? What shadow terms has your staff signed you up for?

Set up an AI AUP
Enterprises need an AI Acceptable Use Policy (AUP) to clarify permissible AI activities, ensure regulatory compliance, and reduce risks. It helps manage data handling, ethical considerations, and user rights, preventing legal or reputational pitfalls.

Perform a Threat Analysis
Conducting a threat analysis reveals any hidden or risky terms your team may have unknowingly accepted. By reviewing agreements and commitments, you can prevent unforeseen compliance or security problems.
SCALING YOUR AI APPLICATION: HIGH LATENCY - HIGH COSTS

FIRST: GET REAL METRICS
Gather key performance indicators (KPIs) through load testing and platform instrumentation to spot bottlenecks and cost drivers.
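A minimal load-testing sketch to start collecting those numbers; the endpoint and payload are placeholders for your own service.

```python
# Fire concurrent requests at the chat endpoint and report p50/p95 latency.
import concurrent.futures
import statistics
import time

import requests

def one_call(_):
    start = time.perf_counter()
    requests.post("http://localhost:8000/chat", json={"message": "ping"}, timeout=60)
    return time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    latencies = sorted(pool.map(one_call, range(64)))

print("p50:", statistics.median(latencies))
print("p95:", statistics.quantiles(latencies, n=100)[94])
```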
SECOND: ADD HARDWARE
Next, add more hardware to reduce latency and
handle load while you diagnose deeper issues.
This temporary fix buys time but won’t solve
the underlying problems.
THIRD: DITCH ABSTRACTIONS
Removing frameworks and abstractions cuts overhead, exposes inefficiencies, and enables targeted optimizations for better control and performance.
INVEST IN MLOPS AND LLMOPS
Deploying AI apps is still software deployment
at its core. Continuous deployment and
monitoring remain essential for maintaining
quality and performance.
KEY TAKE-AWAYS

1. Choosing the Defaults Will Give You Very Average or Even Poor Results
2. Build Observability into your Solutions
3. Deploying AI is Deploying Software
4. Create an AI User Policy and Ethics Guide
5. Establish an AI Guild
6. Challenge your Teams to Use AI
THANK YOU
Presented by Calvin Hendryx-Parker
Come see me to talk further
sixfeetup.com calvin@sixfeetup.com Fishers, Indiana
