LLMs are inherently somewhat unpredictable in their output. Their knowledge is also limited by static training data, meaning there is a specific cut-off date for the information they possess, so they may produce outdated or inaccurate answers.
This is HALLUCINATION!
Common issues encountered with LLMs include:
* Providing incorrect information when they lack a direct answer.
* Delivering outdated or overly general responses when a user requires something specific and current.
* Generating content based on unreliable or non-authoritative sources.
* Producing inaccurate replies stemming from confused terminology, where different training materials might use the same terms for distinct concepts.
Hence RAG (Retrieval-Augmented Generation).
RAG is an AI/ML architecture that enhances a model's output by leveraging authoritative, external data to boost its accuracy, relevance, and overall usefulness.
It has two key components, a Retriever and a Generator, and the overall process can be divided into 4 main steps:
+--------------------+      +---------------------+
|    Company Docs    |      |  Public Knowledge   |
|   Internal PDFs,   |      |  (e.g., Wikipedia)  |
|  APIs, Wikis, etc. |      +----------+----------+
+---------+----------+                 |
          |                            |
          v                            v
   [ INGESTION ]    --> Embed into vector DB (Pinecone, FAISS, etc.).
          |
          v
   [ RETRIEVAL ]    --> Search relevant chunks using the user query.
          |
          v
   [ AUGMENTATION ] --> Combine query + retrieved chunks into a rich prompt.
          |
          v
   [ GENERATION ]   --> LLM generates smart, grounded, fact-friendly output.
In short, this is a simple RAG pipeline:
1. Indexing Phase:
Document ingestion: Load PDFs, docs, logs.
Chunking: Divide content into paragraphs or token windows.
Embedding: Convert text to vectors using models like Sentence-BERT/OpenAI.
Store in vector DB: Tools like Pinecone, Weaviate, Milvus, etc. ensure fast similarity searches.
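To make the indexing phase concrete, here is a minimal sketch assuming sentence-transformers and an in-memory FAISS index; the model name, file names, and paragraph-based chunking are illustrative choices rather than requirements:

```python
# Indexing phase: load documents, chunk them, embed the chunks, store them in a vector index.
# Assumes: pip install sentence-transformers faiss-cpu numpy
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical toy corpus; in practice this would be your PDFs, docs, and logs.
documents = [open(path, encoding="utf-8").read() for path in ["runbook.txt", "faq.txt"]]

# Naive chunking: split on blank lines (paragraphs); token windows are another common choice.
chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]

# Embed each chunk; normalized vectors let inner product act as cosine similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# Store the vectors in an in-memory FAISS index (a hosted vector DB works the same way conceptually).
index = faiss.IndexFlatIP(chunk_vectors.shape[1])
index.add(np.asarray(chunk_vectors, dtype="float32"))
```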
2. Inference Phase:
Embed the query using the same embedding model.
Retrieve top-k chunks via ANN or cosine similarity search.
Compose prompt: Inject the relevant chunks and the original question.
Generate answer with the LLM.
Post-process: Filter content, rerank, format answers with citations.
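A matching sketch of the inference phase, continuing from the index, embedder, and chunks built above; generate() is only a placeholder name for whichever LLM client you actually use:

```python
# Inference phase: embed the query, retrieve top-k chunks, and build an augmented prompt.
query = "How do I restart a container?"
query_vector = embedder.encode([query], normalize_embeddings=True)

# Similarity search over the stored chunk vectors (exact search here; ANN at larger scale).
top_k = 3
scores, ids = index.search(np.asarray(query_vector, dtype="float32"), top_k)
retrieved = [chunks[i] for i in ids[0]]

# Augmentation: inject the retrieved chunks plus the original question into one prompt.
prompt = (
    "Answer the question using only the context below, and cite the chunk you used.\n\n"
    "Context:\n" + "\n---\n".join(retrieved) + "\n\n"
    f"Question: {query}\nAnswer:"
)

# Generation: send the prompt to the LLM of your choice (placeholder call, not a real API).
# answer = generate(prompt)
```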
Some of the use cases could be:
1. Chatbots grounded in a company's internal docs, SOPs, reports, etc.
2. Technical assistant bot - you can feed historical SNOW or JIRA data to automate ticket acknowledgement or root cause analysis (RCA). It could also help assign tickets to a suitable agent based on their availability, expertise, and performance rating.
3. Technical chatbots that query historical RHEL or OpenShift KBs to answer basic, general troubleshooting questions, such as how to create a network bond or how to restart a container.
4. Legal, policy, or compliance chatbots - for complex material like GDPR, copyright, or patent law.
RAG is useful because it directs the LLM to retrieve specific, real-time information from your chosen source (or sources) of truth. RAG can save money by providing a custom experience without the expense of model training and fine-tuning. It can also save resources by sending only the most relevant information (rather than lengthy documents) when querying an LLM.
You thought this was all? Last but not least, RAG has also been extended and improved:
Reranking: This involves refining the relevance of retrieved information by re-scoring the top data chunks before they are sent to the Large Language Model (LLM); a short sketch follows this list.
Dynamic RAG (DRAGIN): This pattern allows the system to fetch more context during the generation process itself, which is particularly useful for tasks involving multiple turns or steps.
Secure RAG: This focuses on protecting sensitive data by encrypting vectors and managing access controls to the information.
GraphRAG: This approach integrates knowledge graphs with text chunks to provide a richer, more contextual understanding of the information being used.
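As a small illustration of the first of these patterns, reranking can be as simple as re-scoring the retrieved chunks with a cross-encoder before they reach the LLM. This sketch continues from the query and retrieved variables above and assumes sentence-transformers; the model name is just one commonly used choice:

```python
# Reranking: re-score the retrieved chunks with a cross-encoder and keep only the best ones.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, chunk) for chunk in retrieved]   # (query, candidate) pairs
rerank_scores = reranker.predict(pairs)           # higher score = more relevant

# Sort candidates by score, highest first, and keep the top 2 for the final prompt.
ranked = sorted(zip(rerank_scores, retrieved), key=lambda pair: pair[0], reverse=True)
best_chunks = [chunk for _, chunk in ranked[:2]]
```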
WARNING - Proceed at your own risk !!!