LLMs are inherently somewhat unpredictable in their output. Their knowledge is also limited by static training data, meaning there is a specific cut-off date for the information they possess, so they may produce outdated or inaccurate answers.
This is HALLUCINATION!
Common issues encountered with LLMs include:
* Providing incorrect information when they lack a direct answer.
* Delivering outdated or overly general responses when a user requires something specific and current.
* Generating content based on unreliable or non-authoritative sources.
* Producing inaccurate replies stemming from confused terminology, where different training materials might use the same terms for distinct concepts.
Hence RAG (Retrieval-Augmented Generation).
RAG is an AI/ML architecture that enhances a model's output by leveraging authoritative, external data to boost its accuracy, relevance, and overall usefulness.
It has two key components, a Retriever and a Generator, and the overall process can be divided into 4 main steps:
+--------------------+      +---------------------+
|    Company Docs    |      |  Public Knowledge   |
|   Internal PDFs,   |      |  (e.g., Wikipedia)  |
|  APIs, Wikis, etc. |      +----------+----------+
+---------+----------+                 |
          |                            |
          v                            v
   [ INGESTION ]    --> Embed into vector DB (Pinecone, FAISS, etc.).
          |
          v
   [ RETRIEVAL ]    --> Search relevant chunks using the user query.
          |
          v
   [ AUGMENTATION ] --> Combine query + retrieved chunks into a rich prompt.
          |
          v
   [ GENERATION ]   --> LLM generates smart, grounded, fact-friendly output.
In short, this is a simple RAG pipeline:
1. Indexing Phase:
Document ingestion: Load PDFs, docs, logs.
Chunking: Divide content into paragraphs or token windows.
Embedding: Convert text to vectors using models like Sentence-BERT/OpenAI.
Store in vector DB: Tools like Pinecone, Weaviate, Milvus, etc. ensure fast similarity searches.
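To make the indexing phase concrete, here is a minimal sketch assuming sentence-transformers and an in-memory FAISS index; the model name, file names, and paragraph-based chunking are illustrative choices rather than requirements:

```python
# Indexing phase: load documents, chunk them, embed the chunks, store them in a vector index.
# Assumes: pip install sentence-transformers faiss-cpu numpy
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical toy corpus; in practice this would be your PDFs, docs, and logs.
documents = [open(path, encoding="utf-8").read() for path in ["runbook.txt", "faq.txt"]]

# Naive chunking: split on blank lines (paragraphs); token windows are another common choice.
chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]

# Embed each chunk; normalized vectors let inner product act as cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# Store the vectors in an in-memory FAISS index (a hosted vector DB works the same way conceptually).
index = faiss.IndexFlatIP(chunk_vectors.shape[1])
index.add(np.asarray(chunk_vectors, dtype="float32"))
```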
2. Inference Phase:
Embed the query using the same embedding model.
Retrieve top-k chunks via ANN or cosine similarity search.
Compose prompt: Inject the relevant chunks and the original question.
Generate answer with the LLM.
Post-process: Filter content, rerank, format answers with citations.
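A matching sketch of the inference phase, continuing from the index, embedder, and chunks built above; generate() is only a placeholder name for whichever LLM client you actually use:

```python
# Inference phase: embed the query, retrieve top-k chunks, and build an augmented prompt.
query = "How do I restart a container?"
query_vector = embedder.encode([query], normalize_embeddings=True)

# Similarity search over the stored chunk vectors (exact search here; ANN at larger scale).
top_k = 3
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), top_k)
retrieved = [chunks[i] for i in ids[0]]

# Augmentation: inject the retrieved chunks plus the original question into one prompt.
prompt = (
    "Answer the question using only the context below, and cite the chunk you used.\n\n"
    "Context:\n" + "\n---\n".join(retrieved) + "\n\n"
    f"Question: {query}\nAnswer:"
)

# Generation: send the prompt to the LLM of your choice (placeholder call, not a real API).
# answer = generate(prompt)
```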
Some of the use cases could be:
1. Chatbots grounded in a company's internal docs, SOPs, reports, etc.
2. Technical assistant bot - you can feed historical SNOW or JIRA data to automate ticket acknowledgement or root cause analysis (RCA). It could also help assign tickets to a suitable agent based on their availability, expertise, and performance rating.
3. Technical chatbots that query historical RHEL or OpenShift KBs to answer basic, general troubleshooting questions, such as how to create a network bond or how to restart a container.
4. Legal, policy, or compliance chatbots - for complex material like GDPR, copyright, or patent law.
RAG is useful because it directs the LLM to retrieve specific, real-time information from your chosen source (or sources) of truth. RAG can save money by providing a custom experience without the expense of model training and fine-tuning. It can also save resources by sending only the most relevant information (rather than lengthy documents) when querying an LLM.
You thought this was all? Last but not least, RAG has also been extended and improved:
Reranking: This involves refining the relevance of retrieved information by re-scoring the top data chunks before they are sent to the Large Language Model (LLM); a short sketch follows this list.
Dynamic RAG (DRAGIN): This pattern allows the system to fetch more context during the generation process itself, which is particularly useful for tasks involving multiple turns or steps.
Secure RAG: This focuses on protecting sensitive data by encrypting vectors and managing access controls to the information.
GraphRAG: This approach integrates knowledge graphs with text chunks to provide a richer, more contextual understanding of the information being used.
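As a small illustration of the first of these patterns, reranking can be as simple as re-scoring the retrieved chunks with a cross-encoder before they reach the LLM. This sketch continues from the query and retrieved variables above and assumes sentence-transformers; the model name is just one commonly used choice:

```python
# Reranking: re-score the retrieved chunks with a cross-encoder and keep only the best ones.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, chunk) for chunk in retrieved]   # (query, candidate) pairs
rerank_scores = reranker.predict(pairs)           # higher score = more relevant

# Sort candidates by score, highest first, and keep the top 2 for the final prompt.
ranked = sorted(zip(rerank_scores, retrieved), key=lambda pair: pair[0], reverse=True)
best_chunks = [chunk for _, chunk in ranked[:2]]
```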
WARNING - Proceed at your own risk !!!