Introduction to Retrieval Augmented Generation (RAG) for Language Models
In this article, you will discover how to develop a Retrieval-Augmented Generation (RAG) application that can seamlessly interact with your PDFs or other data sources. This application is particularly effective for managing substantial amounts of textual data, such as books and lecture notes, enabling the creation of chatbots that can answer queries relevant to the provided data. The best part is that we will leverage an open-source model, thus eliminating the need for costly API access.
RAG has become immensely popular and is one of the top AI frameworks for building tailored chatbots. Additionally, it serves as a robust tool for developing knowledge-centric AI applications.
Consider RAG as an AI assistant well-versed in both your data and human language. When posed with a question, it taps into a reservoir of information to deliver accurate and detailed responses. In effect, it pairs an information retrieval system with a strong large language model (LLM).
Retrieval-Augmented Generation (RAG) significantly enhances accuracy by fetching pertinent information from various external knowledge sources, thereby improving the context and accuracy of the generated responses. By utilizing factual data during the retrieval phase, RAG can mitigate hallucinations—a frequent issue faced by large language models. Furthermore, RAG enhances contextual learning by accessing specific, up-to-date data, making it exceptionally suited for applications like Q&A, document summarization, and interactive workflows.
For instance, a research paper on YOLO v9 was used as the data source, with Llama-2-7b-chat serving as the LLM. We posed several questions regarding the research paper, which yielded relevant responses as detailed below.
response=query_engine.query("What is YOLOv9")
YOLOv9 is a novel object detection algorithm that refines existing methods in various ways. Firstly, it employs a new architecture that minimizes the number of parameters and computations while retaining accuracy. Secondly, it utilizes a new technique called PGI to effectively retain and extract necessary information for mapping the data to the target. Finally, it has demonstrated significantly improved performance compared to existing methods across all dimensions.
response_2=query_engine.query("what is GELAN architecture?")
GELAN architecture is a deep neural network crafted for object detection tasks. It incorporates multiple prediction branches linked to a deep supervision branch. The deep supervision branch provides supplementary information to enhance the main branch's accuracy for object detection. Additionally, the GELAN architecture features an integration network that combines gradients returned from various prediction heads, enabling the main branch to gather comprehensive information about all target objects, addressing the broken information issue in deep supervision, thereby elevating overall object detection performance.
response_3=query_engine.query("What is new with YOLO v9?")
YOLO v9 unveils several new features, including:
1. Conventional convolution: YOLO v9 uses conventional convolution rather than depth-wise convolution, enhancing parameter utilization.
2. PGI: This technique, Progressive Gating and Integration, is employed in YOLO v9 to accurately retain and extract essential information needed for proper data mapping.
3. Large models: YOLO v9 shows considerable advantages with large models, requiring only 66% of the parameters while maintaining accuracy comparable to RT DETR-X.
4. Comparison with state-of-the-art methods: YOLO v9 showcases significant improvements when matched against other real-time object detectors trained from scratch across all dimensions.
Please let me know if you have any further questions or if there's anything else I can assist you with.
We even tested the application with some personal data, resulting in the following output.
response=query_engine.query("Who is Shaoni")
Shaoni Mukherjee is an experienced Technical Writer and AI Specialist with an enthusiasm for Generative AI and its transformative abilities. With more than four years in data science and a solid foundation in AI/ML technologies, she excels in crafting in-depth technical content that elucidates complex concepts. Currently contributing to DigitalOcean, Shaoni focuses on topics like GPU acceleration, deep learning, and large language models (LLMs), ensuring that developers and businesses can harness the latest technology. Her forte lies in distilling technical innovations into understandable, actionable insights, establishing her as a reputable figure in the AI domain.
Prerequisites
- Machine Learning Fundamentals: Understanding of concepts including embeddings, retrieval systems, and transformers.
- DigitalOcean Account: Create an account with DigitalOcean to access GPU Droplets.
- DigitalOcean GPU Droplets: Set up and configure GPU Droplets optimized for ML workloads.
- Transformers Library: Utilize the transformers library from Hugging Face for loading pre-trained models and adapting them for RAG.
- Code Editor/IDE: Prepare an IDE like VS Code or Jupyter Notebook for development tasks.
How Does Retrieval-Augmented Generation (RAG) Work?
We know that large language models (LLMs) excel at generating responses, yet they often struggle with questions about, for example, a company's finances, and may confidently return inaccurate information. This challenge arises because LLMs have no access to our private, up-to-date data. By adding retrieval-augmented generation (RAG) functionality to a foundation model, we can supply the LLM with that private, current data. We can then ask the application any financial question, and it will answer based on the accurate data we provide. Integrating retrieval-augmented features into an LLM changes how the model finds answers: instead of relying solely on its pre-existing knowledge, it can consult more accurate, external information.
Here’s a breakdown of the process:
- User Input: A user submits a question.
- Retrieval Step: The LLM begins by consulting the data store for relevant information concerning the user’s question.
- Response Generation: Following the retrieval of information, the LLM blends it with its knowledge to produce a more precise and well-informed response.
This methodology lets the model enhance its responses by weaving in additional information rather than relying solely on existing knowledge. RAG also removes the need to retrain the model on new data: if new insights or documents emerge, we simply add them to the existing resources, and when a user asks a question the model can tap into this updated content without going through the entire training process again. This ensures the model consistently provides current, relevant responses based on the latest data.
The implementation of this approach greatly decreases the chance of the model generating incorrect information. It also allows the model to recognize when it lacks an answer, should it fail to find a satisfactory response within the data store. However, if the retriever fails to furnish high-quality information, the model might miss answering a question that it could have otherwise resolved.
1. User Input (Query)
A user asks a question or provides input for an augmented prompt, which could be a statement, query, or task.
2. Query Encoding
The user’s input is transformed into a machine-readable format using an embedding model. The embedding represents the query’s meaning numerically, so it can be compared against the document vectors already stored in a vector database.
3. Retriever
- Search for Relevant Data: The encoded query is sent to a retrieval system that scans the vector database. The retriever seeks relevant chunks of text, documents, or data that align closely with the query.
- Sources of data can include knowledge bases, articles, or company-specific documentation.
- Prompts in RAG help meld retrieval systems with generative models, ensuring the model produces accurate and relevant answers.
- Return Results: The retriever presents the top-ranked documents or information corresponding to the user’s query, often referred to as “documents” or “passages.”
4. Combination of Retrieval and Model Knowledge
- The retrieved data is conveyed to a generative language model (such as GPT or another LLM). This model merges the retrieved information with its pre-existing knowledge to formulate a response.
- Grounding the Response: Unlike relying solely on internal knowledge (acquired during training), the model utilizes fresh, retrieved data to provide a more informed and accurate answer.
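Before moving to the full implementation below, it can help to see the retrieve-then-generate loop stripped down to plain Python. The following is only a conceptual sketch: embed and generate are hypothetical stand-ins for whichever embedding model and LLM you wire in, and the actual tutorial code uses llama_index instead.

import numpy as np

def cosine_similarity(a, b):
    # Similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, chunks, chunk_vectors, embed, top_k=3):
    # Rank pre-embedded chunks against the embedded query and keep the best matches.
    query_vec = embed(query)
    scored = sorted(zip(chunks, chunk_vectors),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def answer(query, chunks, chunk_vectors, embed, generate):
    # Ground the prompt in the retrieved context before calling the LLM.
    context = "\n\n".join(retrieve(query, chunks, chunk_vectors, embed))
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return generate(prompt)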
Code Demo and Explanation
It’s advisable to set up a GPU Droplet before executing the code; the link in the references section walks you through creating and configuring a GPU Droplet with VSCode. First, ensure you have a PDF, Markdown, or other documentation file ready, and create a dedicated folder to store it.
Start by installing all necessary packages.
!pip install pypdf
!pip install -U bitsandbytes
!pip install langchain
!pip install -U langchain-community
!pip install sentence_transformers
!pip install llama_index
!pip install llama-index-llms-huggingface
!pip install llama-index-llms-huggingface-api
!pip install llama-index-embeddings-langchain
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.prompts.prompts import SimpleInputPrompt
from llama_index.llms.huggingface import HuggingFaceLLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import torch
documents=SimpleDirectoryReader("your/pdf/location/data").load_data()
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
query_wrapper_prompt=SimpleInputPrompt("{query_str}")
# Llama 2 is a gated model; log in with a Hugging Face token that has been granted access.
!huggingface-cli login
# Load Llama-2-7b-chat in 8-bit with deterministic (greedy) decoding.
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.0, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    model_kwargs={"torch_dtype": torch.float16, "load_in_8bit": True}
)

# Open-source embedding model used for both documents and queries.
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Global settings: chunking strategy, output length, and context window.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Settings.num_output = 512
Settings.context_window = 3900

# Build the vector index over the loaded documents and query it.
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model
)
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("what is GELAN architecture?")
print(response)
Once the data is in place, it needs to be divided into chunks. The code below loads the data and configures how it is segmented.
documents=SimpleDirectoryReader("//your repo path/data").load_data()
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
A document may encompass substantial content or text along with metadata. Since a document can be lengthy, it is essential to divide each document into smaller sections. This is part of the preprocessing step to prepare the data for RAG. These more focused pieces of information assist the system in accurately retrieving relevant context and details. By clearly segmenting documents, locating domain-specific information—such as passages or facts—is simplified, elevating the RAG application’s performance. In our scenario, we use “SentenceSplitter” from “llama_index.core.node_parser.”
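To see what these chunks look like, the same SentenceSplitter can be applied directly to the loaded documents. This is a small, optional inspection step, assuming documents has already been loaded with SimpleDirectoryReader as shown above.

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
# Each node is one chunk of text plus metadata about its source document.
print(len(documents), "documents ->", len(nodes), "chunks")
print(nodes[0].get_content()[:300])

An alternative is LangChain's RecursiveCharacterTextSplitter, which splits on character counts with overlap rather than sentence boundaries: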
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
For further information on RecursiveCharacterTextSplitter, refer to the link in the references section.
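As a quick, hypothetical illustration of how this splitter behaves, you can run it on a short piece of raw text (the sample string below is only for demonstration; longer documents produce many more chunks):

sample_text = (
    "Retrieval-Augmented Generation combines a retriever with a language model. "
    "The retriever finds relevant chunks of your documents, and the model uses "
    "those chunks as context to answer questions."
)
# split_text returns a list of overlapping string chunks.
for i, chunk in enumerate(text_splitter.split_text(sample_text)):
    print(i, chunk)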
Let’s explore embeddings!
Embeddings are numerical representations of text that encapsulate the underlying meaning of the data. They transform data into vectors, making it more comprehensible for machine learning models.
With text embeddings (like word or sentence embeddings), vectors are structured so that words or phrases with similar meanings cluster together in vector space. For instance, “king” and “queen” would have vectors closer to each other, whereas “king” and “apple” would be positioned farther apart. The distance between these vectors can be quantified using cosine similarity or Euclidean distance.
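You can verify this clustering intuition with the sentence_transformers package installed earlier; this short check is optional and uses the same embedding model we adopt below.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
vectors = model.encode(["king", "queen", "apple"])
# Semantically related words score noticeably higher than unrelated ones.
print(util.cos_sim(vectors[0], vectors[1]))  # king vs queen
print(util.cos_sim(vectors[0], vectors[2]))  # king vs apple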
For this example, we will employ “sentence-transformers/all-mpnet-base-v2” from HuggingFaceEmbeddings.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
Here we select a pre-trained model, ‘sentence-transformers/all-mpnet-base-v2’, to create the embeddings, chosen for its compact size and robust performance. Models from the Sentence Transformers library map sentences and paragraphs to a 768-dimensional dense vector space, which suits tasks like clustering and semantic search.
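As an optional sanity check, assuming the embed_model defined above, you can confirm the dimensionality of the vectors it produces:

vector = embed_model.embed_query("What is GELAN architecture?")
print(len(vector))  # 768 dimensions for all-mpnet-base-v2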
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model
)
The same embedding model will then be used to generate embeddings for the documents during the index construction and for any queries concerning the query engine.
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("Who is Shaoni")
print(response)
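By default the query engine retrieves only a few top-ranked chunks. If you want to tune that number or inspect which chunks grounded an answer, a sketch like the following works, assuming the index and llm defined above:

# Retrieve the three most similar chunks for each query.
query_engine = index.as_query_engine(llm=llm, similarity_top_k=3)
response = query_engine.query("Who is Shaoni")
print(response)
# Inspect the retrieved chunks and their similarity scores.
for source in response.source_nodes:
    print(source.score, source.node.get_content()[:200])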
Now, let’s discuss our LLM; we are using Llama 2, with a 7B fine-tuned model in our example. Meta has developed and released the Llama 2 family of large language models (LLMs), which encompass pre-trained and fine-tuned generative text models ranging from 7 billion to 70 billion parameters. These models consistently outperform many open-source chat models and are on par with popular closed-source models like ChatGPT and PaLM.
Key Details
- Model Developers: Meta
- Variations: Llama 2 comes in sizes 7B, 13B, and 70B, with both pre-trained and fine-tuned options available.
- Input/Output: The models accept text input and yield text output.
- Architecture: Llama 2 employs an auto-regressive transformer architecture; the fine-tuned variants use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to better align with human preferences for helpfulness and safety.
Feel free to select any other model that suits your needs. Many open-source models from Hugging Face require a short introduction before each prompt, referred to as a system_prompt. Queries may also need a wrapper around the query_str.
system_prompt="""
You are a Q&A assistant. Your goal is to answer questions as
accurately as possible based on the instructions and context provided.
"""
query_wrapper_prompt=SimpleInputPrompt("{query_str}")
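If you notice the model echoing instructions or drifting off-topic, one optional tweak (not required by this setup) is to wrap each query in Llama-2-chat's instruction tags:

# Llama-2-chat models expect user turns wrapped in [INST] ... [/INST] tags.
query_wrapper_prompt=SimpleInputPrompt("[INST] {query_str} [/INST]")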
Now, with our LLM, embedding model, and documents in place, we can ask questions about them using the provided code.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents=SimpleDirectoryReader("//your repo path/data").load_data()
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model
)
query_engine = index.as_query_engine(llm=llm)
response=query_engine.query("what are the drawbacks discussed in YOLO v9?")
print(response)
YOLOv9 has several drawbacks discussed in the paper, including:
1. Computational complexity: While YOLOv9 is Pareto optimal in terms of accuracy and computational complexity among all models of various scales, it still has a relatively high computational complexity compared to other state-of-the-art methods.
2. Parameter utilization: YOLOv9 employing conventional convolution has lower parameter utilization compared to YOLO MS which uses depth-wise convolution, and moreover, large models of YOLOv9 demonstrate lower parameter utilization than RT DETR using the ImageNet pretrained model.
3. Training time: YOLOv9 demands longer training times than other state-of-the-art methods, proving to be a limitation for real-time object detection applications.
Please let me know if you have any further questions or if there's anything else I can assist you with.
Why Use GPU Droplets to Build Next-Gen AI-Powered Applications?
While this tutorial does not necessitate a high-end GPU, standard CPUs may not efficiently handle the computations required. Thus, complex operations—such as generating vector embeddings or leveraging large language models—could be significantly slower and may lead to performance challenges. To achieve optimal performance and faster outcomes, using a capable GPU is advisable, especially when confronted with vast document repositories or datasets, or when utilizing advanced LLMs like Falcon 180b.
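Before loading the 7B model, it is worth confirming that PyTorch can actually see a GPU; a quick check, assuming torch is installed as above:

import torch

print(torch.cuda.is_available())          # should print True on a GPU Droplet
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the attached NVIDIA GPU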
Employing DigitalOcean’s GPU Droplets to build a Retrieval-Augmented Generation (RAG) application provides several advantages:
- Speed: GPU Droplets are tailored for swift handling of complex calculations, critical for processing large data volumes, markedly reducing the time needed to generate embeddings for substantial datasets.
- Efficiency with Large Models: RAG applications, as demonstrated in our tutorial, utilize large language models (LLMs) to generate responses fueled by retrieved information. The H100 GPUs are well-suited to efficiently run these models, facilitating tasks such as context comprehension and text generation.
- Better Performance: The H100’s advanced architecture guarantees superior performance in handling vector embeddings and LLMs, leading to more relevant and contextually precise responses in your RAG application.
- Scalability: As applications expand and require the ability to manage increasing users or data, scalability of H100 GPU Droplets addresses these demands effectively, alleviating concerns about performance bottlenecks as the application gains popularity.
Concluding Thoughts
In summary, Retrieval-Augmented Generation (RAG) is a pivotal AI framework that greatly amplifies the capabilities of large language models (LLMs) for application development. By cohesively merging the strengths of information retrieval with the power of LLMs, RAG systems can furnish accurate, contextually pertinent, and informative responses. This synergy enhances the quality of engagements across numerous domains—such as customer support, content generation, and tailored recommendations—enabling organizations to exploit vast data effectively. As demand escalates for intelligent, responsive applications, RAG will emerge as a formidable framework to assist developers in creating more capable systems that cater to users’ needs. Its versatility and efficacy position it as a critical player in the realm of AI-driven solutions.
Additional References
- Setting Up the GPU Droplet Environment for AI/ML Coding – Jupyter Labs
- Recursively Split by Character
- Embeddings
- HuggingFace LLM – StableLM