llm-design-patterns

Retrieval-Augmented Generation (RAG)

Intent

Improve the performance of LLMs on knowledge-intensive tasks by combining them with a knowledge retriever.

Motivation

Knowledge-Intensive Tasks

Knowledge-intensive tasks are those that require a deep understanding of facts and contextual information to provide accurate and meaningful responses. The Retrieval-Augmented Generation (RAG) pattern is particularly useful for such tasks because it combines a pre-trained parametric model (the LLM) with non-parametric memory (such as a dense vector index of a large-scale knowledge source) to enhance the model's ability to access factual knowledge.

Traditional LLM Limitations on Knowledge-Intensive Tasks

When it comes to knowledge-intensive tasks, traditional LLMs still exhibit certain limitations that hinder their overall effectiveness.

  1. Incomplete and Static Knowledge: An LLM's knowledge is fixed at pre-training time, so it is inevitably incomplete and grows increasingly outdated.

  2. Inaccurate Knowledge Retrieval: LLMs may generate text that is contextually incorrect or not fully aligned with the given facts, particularly when addressing questions or generating content that requires deep understanding.

  3. Lack of Provenance: Traditional LLMs do not cite sources for the factual knowledge they use, which reduces the credibility and verifiability of the generated content.

  4. Difficulty in Updating Knowledge: Updating the knowledge stored in an LLM is computationally expensive and time-consuming, since it generally requires retraining or fine-tuning.

Given these limitations, traditional LLMs tend to underperform on knowledge-intensive tasks compared to task-specific architectures that are designed to access and manipulate external knowledge sources.

Structure

[Diagram: indexing documents into the Knowledge Base]

Embedding Generator

The main role of the embedding generator is to convert input text (e.g., an input document or an end-user query) into a continuous dense vector representation. This is typically achieved with a pre-trained neural encoder that has been trained to capture the semantic relationships between pieces of text.
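
As a minimal sketch of this component (assuming the sentence-transformers library and the all-MiniLM-L6-v2 encoder; the pattern itself does not mandate either choice):

from sentence_transformers import SentenceTransformer

# Any pre-trained text encoder works; all-MiniLM-L6-v2 is a small, widely used choice.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Documents and queries are mapped into the same dense vector space.
doc_vectors = encoder.encode([
    "Paul Graham wrote essays on startups.",
    "RAG combines retrieval with generation.",
])
query_vector = encoder.encode("What did Paul Graham write about?")
print(doc_vectors.shape)  # (2, 384): one 384-dimensional vector per document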

Knowledge Base

The Knowledge Base is a fundamental component of the Retrieval-Augmented Generation (RAG) design pattern, serving as the external, non-parametric memory that stores a vast amount of factual information. Its primary role is to supply the RAG model with accurate, up-to-date information that can be used to augment its responses in knowledge-intensive tasks.
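
Conceptually, the knowledge base pairs each document with its dense vector. A toy in-memory sketch with numpy (the class name is illustrative; a real deployment would use a vector database):

import numpy as np

class InMemoryKnowledgeBase:
    """Toy non-parametric memory: parallel lists of texts and their vectors."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        # Store the raw text together with its embedding.
        self.texts.append(text)
        self.vectors.append(np.asarray(vector, dtype=np.float32))

    def as_matrix(self):
        # Stack the vectors into an (n_docs, dim) matrix for fast similarity search.
        return np.stack(self.vectors)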

[Diagram: RAG inference query flow]

Query Engine

The main role of the query engine is to convert the end-user query into the same vector space as the documents indexed in the knowledge base, and then to retrieve the documents most similar to the query vector.
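
A sketch of the retrieval step that follows that conversion, using cosine similarity as the relevance measure (the function name is illustrative):

import numpy as np

def retrieve_top_k(query_vector, doc_matrix, k=3):
    """Return the indices of the k documents most similar to the query.

    query_vector: (dim,) embedding of the end-user query.
    doc_matrix:   (n_docs, dim) matrix of document embeddings.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_vector / np.linalg.norm(query_vector)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Highest-scoring documents first.
    return np.argsort(scores)[::-1][:k]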

LLM Generator

The LLM generator is responsible for producing the final response to the end-user query. It takes the retrieved documents and the original query as input, typically embedded in a prompt template, and generates the response.
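
A sketch of this component using the OpenAI Python client (the model name and prompt wording are illustrative assumptions; any capable instruction-following LLM can fill this role):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(query, retrieved_docs, model="gpt-4o-mini"):
    # Insert the retrieved documents and the query into a single prompt.
    context = "\n".join(retrieved_docs)
    prompt = (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        f"Given only the context information, answer the question: {query}\n"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content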

Implementation and Relevant Tools

Sample LLM Prompt Template

# Default template: lets the model judge whether the retrieved context is relevant.
PROMPT_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Assess if the context information is relevant and accordingly answer the question: {query_str}\n"
)

# Stricter template: instructs the model to ignore its parametric (prior) knowledge.
PROMPT_TMPL_TO_LIMIT_PRIOR_KNOWLEDGE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given only the context information and not any prior knowledge, "
    "answer the question factually: {query_str}\n"
)

# Summarization template: condenses the retrieved context around the query.
PROMPT_TMPL_TO_SUMMARIZE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Summarize the above context to refine knowledge about this query: {query_str}\n"
)
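
Filling a template is plain Python string formatting; the chunks below are placeholders for whatever the retriever actually returns:

retrieved_docs = ["...first retrieved chunk...", "...second retrieved chunk..."]
prompt = PROMPT_TMPL_TO_LIMIT_PRIOR_KNOWLEDGE.format(
    context_str="\n".join(retrieved_docs),
    query_str="What did the author do growing up?",
)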

Sample Code

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# Read documents from the data directory.
# This example assumes the data directory contains the HTML from http://paulgraham.com/worked.html
documents = SimpleDirectoryReader('data').load_data()
# Ingest documents into an in-memory knowledge base.
# GPTVectorStoreIndex has a default OpenAI embedding generator that produces dense
# vectors for the documents before indexing them into a simple dictionary index.
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
# The query engine first uses the same embedding generator to produce a dense vector for the input query,
# then queries the index for the documents most similar to the query vector.
# It then inserts the retrieved documents and the query into a prompt template and feeds it to the LLM.
# The LLM generates a response to the query, which the query engine returns.
response = query_engine.query("What did the author do growing up?")
print(response)
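
To plug one of the earlier templates into this pipeline, the llama_index version imported above lets you pass a custom text_qa_template to the query engine; a sketch (treat the exact API as an assumption tied to that version):

from llama_index import Prompt

# Wrap the raw template string and hand it to the query engine,
# overriding the library's default question-answering prompt.
qa_template = Prompt(PROMPT_TMPL_TO_LIMIT_PRIOR_KNOWLEDGE)
query_engine = index.as_query_engine(text_qa_template=qa_template)
response = query_engine.query("What did the author do growing up?")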

Consequences

Results

  1. Improved performance on knowledge-intensive tasks

  2. More control over how specific, diverse, and factual the generated language is

  3. Provenance for generated content

Trade-offs

  1. Latency: Considerably higher latency than a plain LLM call, since each request involves query embedding generation, document retrieval from the KB, and the final LLM inference.
  2. Costs: Higher costs due to the additional components involved in the RAG pattern, such as the vector DB and embedding model calls.
  3. Complexity: Higher development and management complexity given the number of components involved and the calls between them; it also makes evaluating the system more complex.
  4. Dependency on external knowledge sources: The performance of the RAG pattern depends heavily on the quality of the knowledge base used.

Known Uses

  1. Fact-based Text Generation: Generating text that requires incorporating accurate factual information, like writing a summary of an event, creating an informative article, or producing a detailed description of a specific topic.

  2. Conversational AI: Building chatbots or virtual assistants that can provide detailed and accurate responses (often backed by a knowledge base) in natural language conversations, demonstrating an understanding of context and factual knowledge.

  3. Open-domain Question Answering (QA): Answering questions that span a wide range of topics and require access to a vast knowledge base, such as answering questions about history, science, literature, or general trivia.

  4. Knowledge Base Population: Automatically populating a knowledge base with new entities and relationships by extracting and synthesizing information from multiple sources, such as web pages, news articles, or scientific papers.