Retrieval-augmented generation

Large language models (LLMs) have changed how we interact with information, but they often struggle with factual accuracy and with staying current. Retrieval-augmented generation (RAG) addresses this by letting a model consult external, up-to-date sources before answering. It is like giving a very smart but sometimes forgetful student access to a large, well-organized library, so that answers can be more precise and verifiable.

Source: Wikipedia

AI Summary

RAG empowers LLMs to access and integrate real-time or domain-specific information, significantly reducing factual inaccuracies and 'hallucinations'.
By retrieving relevant documents, RAG reduces the need to retrain LLMs, saving substantial computational and financial resources.
The RAG process involves indexing external data, retrieving relevant pieces, augmenting the LLM's prompt, and finally generating a more informed response.

Introducing Retrieval-Augmented Generation

Imagine a highly capable assistant who can answer almost any question, but sometimes makes up facts or struggles with very recent events. That's a bit like a large language model (LLM). Retrieval-augmented generation (RAG) is a technique designed to reduce these shortcomings.

At its core, RAG enables an LLM to look up information from a specified set of documents or databases before formulating a response. Instead of relying solely on its pre-existing training data, which can be vast but quickly outdated, the LLM consults external, authoritative sources.

Think of it as pairing the LLM with a 'research assistant'. This allows LLMs to use current, domain-specific information (internal company policies, the latest scientific discoveries, or real-time news) that wasn't available during their initial training.

This 'lookup' capability drastically improves LLM performance, particularly in terms of factual accuracy. It's crucial for minimizing those notorious 'AI hallucinations' where chatbots confidently present non-existent facts or make up legal cases.

Beyond accuracy, RAG brings significant efficiency gains. Instead of costly, continuous retraining of massive models with new data, you simply update the external knowledge base. It also enhances transparency, as many RAG-powered systems can cite their sources, allowing users to verify information.

The Genesis of RAG

The concept of retrieval-augmented generation wasn't always part of the LLM landscape. The term RAG was introduced in a 2020 research paper led by researchers at Facebook AI Research (Facebook has since been renamed Meta), setting the stage for more reliable AI systems.

Why RAG is Essential: LLM Limitations Unveiled

Even the most advanced LLMs can stumble. A famous example occurred when Google's Bard LLM provided incorrect information about the James Webb Space Telescope during its initial demonstration. This single error reportedly contributed to a significant drop in Google's parent company's stock value.

While RAG is a powerful preventative measure, it's not a complete panacea for all LLM errors. Models can still misinterpret context, even when provided with factually correct sources. For instance, an LLM might mistakenly state, 'The United States has had one Muslim president, Barack Hussein Obama.'

This particular error stemmed from the LLM pulling a phrase from an academic book rhetorically titled Barack Hussein Obama: America’s First Muslim President? The model retrieved the title but failed to grasp its nuanced, questioning context, leading to a false assertion.

RAG works by what's sometimes called 'prompt stuffing'—a programming technique where additional, relevant context is dynamically added to a user's original query. This instructs the LLM to prioritize the supplied data over its pre-existing training knowledge when generating a response.

The RAG Process: A Four-Step Dance

Retrieval-Augmented Generation is a sophisticated mechanism, working through a clear four-step process. When you ask a question, RAG springs into action to ensure the LLM delivers the most accurate and relevant answer possible.

1. Indexing

Before any retrieval can happen, external data—like your company's documents or a knowledge base—must be prepared. This usually involves converting the information into numerical representations called 'embeddings'.

These embeddings capture the semantic meaning of the text and are stored in specialized 'vector databases'. This setup allows for incredibly fast and efficient searching of relevant content later on, regardless of whether the original data was unstructured text, semi-structured, or a highly organized knowledge graph.
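The indexing step can be sketched in a few lines. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into a fixed-size vector and captures no real semantics), and a plain Python list plays the role of the vector database.

```python
import math
import zlib

def toy_embed(text: str, dim: int = 32) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def build_index(chunks: list[str]) -> list[tuple[list[float], str]]:
    """Store each chunk next to its vector; a list stands in for a vector database."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]

index = build_index([
    "Employees accrue vacation days every month.",
    "The office closes early on Fridays.",
])
```

In a real system, the embedding function would be a trained model and the list would be replaced by a store that supports fast similarity search.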

2. Retrieval

When a user submits a query, the first thing RAG does is activate its 'document retriever'. This component searches the vector database using the user's query to find the most semantically similar and relevant documents or chunks of information.
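As a rough sketch of what a document retriever does, the example below ranks stored chunks by cosine similarity to a query vector and returns the top `k`. The three-dimensional vectors and chunk labels are made-up values for illustration; real embeddings have hundreds or thousands of dimensions, and the exhaustive scan here would normally be replaced by an approximate nearest neighbor search.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec: list[float], index: list[tuple[list[float], str]], k: int = 2) -> list[str]:
    """Return the text of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Hypothetical pre-computed (vector, chunk) pairs.
index = [
    ([0.9, 0.1, 0.0], "vacation policy"),
    ([0.0, 0.2, 0.9], "telescope facts"),
    ([0.7, 0.6, 0.1], "sick leave policy"),
]
top = retrieve([1.0, 0.2, 0.0], index, k=2)
```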

3. Augmentation

Once the relevant information is retrieved, it's not simply handed to the LLM as a raw search result. Instead, this context is carefully integrated into the user's original query using 'prompt engineering'. This augmented prompt now gives the LLM the specific knowledge it needs to craft a precise answer.
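The augmentation step amounts to assembling a new prompt from the retrieved chunks and the original question. The sketch below shows one possible template; the wording of the instructions and the function name `build_augmented_prompt` are assumptions for illustration, not a standard format.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Place numbered context chunks ahead of the question so the model sees them first."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "How many vacation days do new employees get?",
    ["New employees receive 20 vacation days per year.", "Unused days expire in March."],
)
```

Numbering the chunks makes it easier for a system to ask the model to cite which chunk supports each statement.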

4. Generation

Finally, with the enriched prompt in hand, the LLM synthesizes its output. It combines its inherent linguistic capabilities with the freshly retrieved information to generate an engaging, accurate, and contextually grounded answer tailored to the user's request.

Evolving RAG: Continuous Improvements

The foundational RAG process has spawned a wealth of innovative improvements, each designed to make the system smarter, faster, and more accurate at various stages of the flow.

Encoder Enhancements

Many improvements focus on how text is 'encoded' into those numerical vectors. We have 'dense' vectors, which are compact and capture meaning, and 'sparse' vectors, which are longer and encode specific word identities.

Optimizing how vector similarities are calculated—using techniques like 'dot products' or efficient 'approximate nearest neighbor' (ANN) searches—significantly boosts retrieval performance. Hybrid approaches combine the best of both dense and sparse vectors for balanced efficiency and accuracy.

The dot product, for example, is a fundamental operation used to measure the similarity between two vectors. A higher dot product usually indicates greater similarity in meaning or context between the query and the retrieved document chunk.

$$\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^{n} A_i B_i$$
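A small worked example may help. The vectors below are arbitrary illustrative values, not real embeddings. The second line divides by the vector lengths to obtain the cosine similarity, which is often preferred because a raw dot product also grows with vector magnitude.

```latex
\mathbf{A} = (1, 2, 3), \quad \mathbf{B} = (4, 5, 6)
\mathbf{A} \cdot \mathbf{B} = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 4 + 10 + 18 = 32
\cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{32}{\sqrt{14} \, \sqrt{77}} \approx 0.975
```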

Retriever-Centric Methods

Other advancements zero in on making the document retriever itself more intelligent. This includes 'pre-training' the retriever to learn optimal retrieval patterns or using 'progressive data augmentation' to sample challenging negative examples during training, making the retriever more robust.

Techniques like 'supervised retriever optimization' fine-tune the retriever to align its selections with what the language model actually finds most useful for generation. 'Reranking' also plays a vital role, ensuring that the most relevant documents among the retrieved set are prioritized.

Language Model Redesign

Some advanced approaches even involve redesigning the language model itself with the retriever in mind. Projects like 'Retro' and 'Retro++' aim to make smaller LLMs achieve comparable performance to much larger ones by integrating domain knowledge during initial training, shifting the focus of the LLM's internal weights towards language semantics rather than brute-force memorization of facts.

Smart Chunking Strategies

How you 'chunk' or break down documents into smaller, searchable units is critical. Simple methods use 'fixed length with overlap' to maintain context. More sophisticated strategies consider natural linguistic breaks or even the document's file format.

These varied chunking methods ensure that the retrieved pieces of information are coherent and meaningful, leading to better augmentation and generation.

• Fixed length with overlap: Fast and easy, helps maintain semantic context.
• Syntax-based chunks: Breaks documents into sentences, paragraphs, or other logical units using NLP libraries.
• File format-based chunking: Respects the inherent structure of files, chunking code by functions, HTML by elements, or PDFs by sections.
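The first strategy, fixed length with overlap, can be sketched as follows. This version counts characters for simplicity, which is an assumption made for the example; practical systems often count tokens instead. The parameter names `size` and `overlap` are illustrative.

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Split text into chunks of `size` characters, each sharing `overlap` characters with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks

chunks = chunk_fixed("abcdefghij", size=4, overlap=2)
```

Here `chunks` is `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one neighbor when the overlap is large enough.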

Knowledge Graphs as Sources

Instead of raw documents, RAG can leverage 'Knowledge Graphs'. These structured representations convert text into a network of entities and relationships, providing a more semantically rich source for retrieval. This 'GraphRAG' approach often leads to more relevant and precise factual retrieval.

Hybrid Search Approaches

Sometimes, pure vector searches can miss niche but critical facts. 'Hybrid search' combines the semantic power of vector searches with traditional keyword-based text searches. This ensures a broader net is cast, capturing both conceptual relevance and exact keyword matches before augmentation.
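One way to merge a vector-search ranking with a keyword-search ranking is reciprocal rank fusion (RRF). The text above does not prescribe a particular merging rule, so RRF is used here only as one common option; the document IDs are hypothetical and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic search results
keyword_hits = ["doc_d", "doc_a"]           # hypothetical keyword search results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

A document found by both searches (`doc_a`) rises to the top, while a document found only by the keyword search (`doc_d`) is still kept for augmentation.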

Late-interaction Search

For even higher accuracy, 'late-interaction search' can be employed. This involves creating embeddings at a more granular token level and comparing them later in the process. While computationally more intensive, it can yield significantly more precise results by allowing finer-grained comparisons.
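A common way to score token-level embeddings is the "MaxSim" rule used by ColBERT-style models: each query token is matched with its best document token, and the best matches are summed. The text above does not name a specific scoring rule, so treat this as one illustrative choice; the two-dimensional token vectors are made up.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token take its best-matching document token, then sum those maxima."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical token embeddings: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_1 = [[0.9, 0.1], [0.1, 0.8]]   # has a close match for each query token
doc_2 = [[0.7, 0.7]]               # a single "averaged" token
score_1 = maxsim_score(query, doc_1)
score_2 = maxsim_score(query, doc_2)
```

In this toy case `doc_1` scores higher than `doc_2`, because it offers a strong match for each query token individually. The cost is that one vector per token must be stored and compared instead of one vector per chunk.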

The Road Ahead: Challenges for RAG

Despite its remarkable capabilities, RAG is not a magic bullet. It significantly mitigates hallucinations, but it doesn't eliminate them entirely. The LLM can still 'hallucinate around the source material,' embellishing or misinterpreting the retrieved facts.

One key challenge is the LLM's 'self-awareness'. Without specific training, models may struggle to recognize when they lack sufficient information to provide a reliable response. They might generate an answer even when they should express uncertainty or explicitly state, 'I don't have enough information.'

Article

Retrieval-augmented generation

Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from external data sources. With RAG, LLMs first refer to a specified set of documents, then respond to user queries. These documents supplement information from the LLM's pre-existing training data. This allows LLMs to use domain-specific and/or updated information that is not available in the training data. For example, this helps LLM-based chatbots access internal company data or generate responses based on authoritative sources.

RAG improves LLMs by incorporating information retrieval before generating responses. Unlike LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to Ars Technica, "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have caused chatbots to describe policies that don't exist, or to recommend nonexistent legal cases to lawyers who are looking for citations to support their arguments.

RAG also reduces the need to retrain LLMs with new data, saving on computational and financial costs. Beyond efficiency gains, RAG also allows LLMs to include sources in their responses, so users can verify the cited sources. This provides greater transparency, as users can cross-check retrieved content to ensure accuracy and relevance.

The term RAG was first introduced in a 2020 research paper.

RAG and LLM limitations

LLMs can provide incorrect information. For example, when Google first demonstrated its LLM tool "Google Bard" (later rebranded as Gemini), the LLM provided incorrect information about the James Webb Space Telescope. This error contributed to a $100 billion decline in the stock value of Google's parent company, Alphabet. RAG is used to prevent these errors, but it does not solve all the problems. For example, LLMs can generate misinformation even when pulling from factually correct sources if they misinterpret the context. MIT Technology Review gives the example of an AI-generated response stating, "The United States has had one Muslim president, Barack Hussein Obama." The model retrieved this from an academic book rhetorically titled Barack Hussein Obama: America’s First Muslim President? The LLM did not "know" or "understand" the context of the title, and so generated a false statement.

LLMs with RAG are programmed to prioritize new information. This technique has been called "prompt stuffing." Without prompt stuffing, the LLM's input is generated by a user; with prompt stuffing, additional relevant context is added to this input to guide the model’s response. This approach provides the LLM with key information early in the prompt, encouraging it to prioritize the supplied data over pre-existing training knowledge.

Process

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating an information-retrieval mechanism that allows models to access and utilize additional data beyond their original training set. Ars Technica notes that "when new information becomes available, rather than having to retrain the model, all that’s needed is to augment the model’s external knowledge base with the updated information" ("augmentation"). IBM states that "in the generative phase, the LLM draws from the augmented prompt and its internal representation of its training data to synthesize" an answer.

RAG key stages

Overview of RAG process, combining external documents and user input into an LLM prompt to get tailored output

Typically, the data to be referenced is converted into LLM embeddings: numerical representations of the content as vectors in a high-dimensional space. These embeddings are then stored in a vector database to allow for document retrieval. RAG can be used on unstructured (usually text), semi-structured, or structured data (for example, knowledge graphs).

Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query. Relevance can be assessed using a variety of methods, which depend in part on the type of indexing used.

The model feeds this relevant retrieved information into the LLM via prompt engineering of the user's original query. Newer implementations (as of 2023) can also incorporate specific augmentation modules with abilities such as expanding queries into multiple domains and using memory and self-improvement to learn from previous retrievals.

Finally, the LLM can generate output based on both the query and the retrieved documents. Some models incorporate extra steps to improve output, such as the re-ranking of retrieved information, context selection, and fine-tuning.

Improvements

Improvements to the basic process above can be applied at different stages in the RAG flow.

Encoder

These methods focus on the encoding of text as either dense or sparse vectors. Sparse vectors, which encode the identity of a word, are typically dictionary-length and contain mostly zeros. Dense vectors, which encode meaning, are more compact and contain fewer zeros. Various enhancements can improve the way similarities are calculated in the vector stores (databases).

• Performance improves by optimizing how vector similarities are calculated. Dot products enhance similarity scoring, while approximate nearest neighbor (ANN) searches improve retrieval efficiency over K-nearest neighbors (KNN) searches.
• Accuracy may be improved with Late Interactions, which allow the system to compare words more precisely after retrieval. This helps refine document ranking and improve search relevance.
• Hybrid vector approaches may be used to combine dense vector representations with sparse one-hot vectors, taking advantage of the computational efficiency of sparse dot products over dense vector operations.
• Other retrieval techniques focus on improving accuracy by refining how documents are selected. Some retrieval methods combine sparse representations, such as SPLADE, with query expansion strategies to improve search accuracy and recall.
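The efficiency of sparse dot products mentioned above comes from skipping the zeros. As a minimal sketch, a sparse vector can be stored as a dictionary from dimension index to non-zero value, so that only shared non-zero dimensions contribute to the result. The indices and weights below are arbitrary examples.

```python
def sparse_dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {dimension index: non-zero value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer non-zero entries
    return sum(value * b[i] for i, value in a.items() if i in b)

# Hypothetical sparse vectors over a dictionary-sized vocabulary.
query = {3: 1.0, 17: 1.0}
doc = {3: 2.0, 8: 1.0, 17: 0.5, 42: 3.0}
score = sparse_dot(query, doc)
```

The work is proportional to the number of non-zero entries in the shorter vector rather than to the vocabulary size, which is why a dictionary-length sparse vector can still be cheap to compare.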

Retriever-centric methods

These methods aim to enhance the quality of document retrieval in vector databases:

• Pre-training the retriever using the Inverse Cloze Task (ICT), a technique that helps the model learn retrieval patterns by predicting masked text within documents.
• Supervised retriever optimization aligns retrieval probabilities with the generator model’s likelihood distribution. This involves retrieving the top-k vectors for a given prompt, scoring the generated response’s perplexity, and minimizing KL divergence between the retriever’s selections and the model’s likelihoods to refine retrieval.
• Reranking techniques can refine retriever performance by prioritizing the most relevant retrieved documents during training.
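The KL-divergence objective in supervised retriever optimization can be illustrated numerically. The sketch below only computes the loss value for one query; it does not perform any training. The scores are invented, and the direction of the divergence (retriever distribution relative to the language model distribution) is an assumption, since the description above does not fix it.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn arbitrary scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same items."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

retriever_scores = [2.0, 1.0, 0.1]        # hypothetical similarity scores for the top-3 chunks
lm_log_likelihoods = [-1.2, -0.4, -3.0]   # hypothetical log-likelihood of the answer given each chunk

p_retriever = softmax(retriever_scores)
q_lm = softmax(lm_log_likelihoods)
loss = kl_divergence(p_retriever, q_lm)
```

Here the retriever prefers the first chunk while the language model finds the second most useful, so the loss is positive; training the retriever to lower it would shift its ranking toward the chunks the generator benefits from.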

Language model

By redesigning the language model with the retriever in mind, a network 25 times smaller can achieve perplexity comparable to that of much larger counterparts. Because it is trained from scratch, this method (Retro) incurs the high cost of training runs that the original RAG scheme avoided. The hypothesis is that by giving domain knowledge during training, Retro needs less focus on the domain and can devote its smaller weight resources only to language semantics. The redesigned language model is shown here.

It has been reported that Retro is not reproducible, so modifications were made to make it reproducible. The more reproducible version is called Retro++ and includes in-context RAG.

Chunking

Chunking involves various strategies for breaking up the data into pieces that are then embedded as vectors, so the retriever can find details in it.

• Fixed length with overlap. This is fast and easy. Overlapping consecutive chunks helps to maintain semantic context across chunks.
• Syntax-based chunks can break the document up into sentences. Libraries such as spaCy or NLTK can help with this.
• File format-based chunking. Certain file types have natural chunks built in, and it's best to respect them. For example, code files are best chunked and vectorized as whole functions or classes. HTML files should leave <table> or base64 encoded <img> elements intact. Similar considerations should be taken for PDF files. Libraries such as Unstructured or LangChain can assist with this method.
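As an illustration of file format-based chunking for code, the sketch below splits Python source into one chunk per top-level function or class using the standard library's `ast` module. It is a simplified example: module-level statements outside functions and classes are ignored, and decorators are not included in the extracted text.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Return the source text of each top-level function or class as its own chunk."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks

example = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
chunks = chunk_python_source(example)
```

Each chunk is a complete definition, so a retrieved function is never cut off in the middle of its body the way a fixed-length split might cut it.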

Hybrid search

Sometimes vector database searches can miss key facts needed to answer a user's question. One way to mitigate this is to do a traditional text search, add those results to the text chunks linked to the retrieved vectors from the vector search, and feed the combined hybrid text into the language model for generation.

The adoption of RAG in consumer-facing web search products has given rise to new content optimization disciplines, as practitioners have noted that content retrievability in RAG systems depends on factors like semantic structure, passage-level authority signals, and entity clarity rather than traditional search ranking signals such as backlinks.

Evaluation and benchmarks

RAG systems are commonly evaluated using benchmarks designed to test retrievability, retrieval accuracy and generative quality. Popular datasets include BEIR, a suite of information retrieval tasks across diverse domains, and Natural Questions or Google QA for open-domain QA.
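Retrieval accuracy on such benchmarks is often reported with rank-based metrics. The sketch below computes recall@k, the fraction of known relevant documents that appear in the top `k` results. The choice of metric is an assumption for illustration, since the paragraph above does not name one, and the document IDs are invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear among the top-k retrieved results."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)

retrieved = ["d3", "d1", "d7", "d2"]   # hypothetical ranked output of a retriever
relevant = {"d1", "d2"}                # hypothetical ground-truth labels for the query

recall_2 = recall_at_k(retrieved, relevant, k=2)
recall_4 = recall_at_k(retrieved, relevant, k=4)
```

With these values, half of the relevant documents are found in the top two results and all of them in the top four. A benchmark score is typically the average of such a metric over many queries.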

Challenges

RAG does not prevent hallucinations in LLMs. According to Ars Technica, "It is not a direct solution because the LLM can still hallucinate around the source material in its response."

While RAG improves the accuracy of large language models (LLMs), it does not eliminate all challenges. One limitation is that while RAG reduces the need for frequent model retraining, it does not remove it entirely. Additionally, LLMs may struggle to recognize when they lack sufficient information to provide a reliable response. Without specific training, models may generate answers even when they should indicate uncertainty. According to IBM, this issue can arise when the model lacks the ability to assess its own knowledge limitations.

RAG poisoning

RAG systems may retrieve factually correct but misleading sources, leading to errors in interpretation. In some cases, an LLM may extract statements from a source without considering its context, resulting in an incorrect conclusion. Additionally, when faced with conflicting information, RAG models may struggle to determine which source is accurate. In the worst case, the model may combine details from multiple sources, producing responses that merge outdated and updated information in a misleading manner. According to the MIT Technology Review, these issues occur because RAG systems may misinterpret the data they retrieve.