12 July 2024
This post takes a deep dive into RAG techniques.
This is the second article in a series on How to Build & Serve Private LLMs. You can read the first post here.
In our previous post, we covered three strategies for building private LLMs: Retrieval-Augmented Generation (RAG), fine-tuning, and quantization and attention operation optimization. In this post, we will focus on RAG.
RAG is a technique that enhances a language model's text generation by connecting it to external knowledge databases. This allows the model to retrieve relevant information and ground its responses in facts.
Like any machine learning system, RAG comes with its own set of challenges: the retriever can surface irrelevant documents, and compressing text into embedding vectors can lose information. To address these challenges, techniques that preprocess queries and post-process results are widely used; HyDE and reranking are two such techniques.
HyDE generates a hypothetical answer to the query using an LLM and then searches the database with this answer instead of the original query. This hypothetical answer often serves as a better query for retrieving relevant documents, potentially improving the context and relevance of the generated response.
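The idea can be sketched in a few lines. This is a minimal illustration, not the implementation used in the benchmark: `fake_llm` stands in for a real LLM call, and the bag-of-words embedding stands in for a neural embedding model.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words embedding; a real system would use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_text, documents, k=1):
    # Rank documents by embedding similarity to the query text.
    q = embed(query_text)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def hyde_retrieve(question, documents, llm, k=1):
    # HyDE: generate a hypothetical answer first, then search with it
    # instead of the original question.
    hypothetical_answer = llm(question)
    return retrieve(hypothetical_answer, documents, k)

docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
# Stand-in for a real LLM call (hypothetical helper).
fake_llm = lambda q: "The Eiffel Tower was completed in 1889 in Paris, France."
```

The hypothetical answer shares far more vocabulary (and, with a real model, semantics) with the target document than the short question does, which is exactly why it makes a better search key.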
After the initial document retrieval, reranking uses another LLM to reorder the retrieved documents by their relevance to the query. This mitigates the information lost when documents are compressed into vectors, at the cost of additional retrieval time.
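A reranking stage reduces, in essence, to scoring and re-sorting the first-stage candidates. In this sketch, `keyword_overlap_score` is a hypothetical stand-in for the LLM judge that would score each query-document pair in a real pipeline:

```python
def rerank(query, candidates, relevance_scorer, top_n=3):
    # Second stage: re-order first-stage candidates by a relevance score
    # and keep only the top_n. `relevance_scorer` stands in for an LLM
    # judge that returns a higher number for more relevant documents.
    scored = sorted(candidates, key=lambda doc: relevance_scorer(query, doc), reverse=True)
    return scored[:top_n]

# Hypothetical scorer: counts query words that also appear in the document.
def keyword_overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

The `top_n` parameter here is the same knob discussed in the benchmark results below: retrieving more first-stage candidates and keeping fewer after reranking trades latency for precision.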
You might wonder how much HyDE and reranking actually improve RAG performance. Let's look at the benchmark results.
Before diving into the benchmark details, let's discuss how to measure RAG performance. We will use the Ragas framework, which evaluates answer relevance and accuracy using various metrics.
Ragas uses an LLM as a judge, so its evaluation results are not completely reliable, but it is far more efficient than manually evaluating generated responses.
We will use a part of the FEVER (Fact Extraction and VERification) dataset, which consists of 5 million Wikipedia documents and 400,000 claims. The task involves determining the veracity of these claims by leveraging the Wikipedia dataset stored in a vector database.
To keep the benchmark manageable, we used a subset of the data: instead of the full 5 million documents, we extracted 200,000, and benchmarked against 2,010 claims for a more focused and efficient evaluation.
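The post doesn't specify how the subset was drawn; one reproducible way to carve out such a slice (with a hypothetical helper name and a fixed seed, both assumptions) would be:

```python
import random

def sample_benchmark_subset(documents, claims, n_docs=200_000, n_claims=2_010, seed=42):
    # Draw a fixed-seed random subset so benchmark runs are reproducible.
    rng = random.Random(seed)
    doc_subset = rng.sample(documents, min(n_docs, len(documents)))
    claim_subset = rng.sample(claims, min(n_claims, len(claims)))
    return doc_subset, claim_subset
```

Fixing the seed matters: without it, each run would evaluate against a different slice of the corpus and the metric comparisons below would not be apples-to-apples.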
We will use Ragas to compare various RAG architectures based on four key metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall. However, for architectures that rely solely on prompt engineering and lack a retrieval process, these four metrics cannot be applied. Therefore, we will include an additional metric, Answer Correctness, to ensure a comprehensive comparison.
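To make the retrieval metrics concrete, here are simplified stand-ins for two of them. These are not Ragas' actual formulas (Ragas uses an LLM judge, and its context precision is rank-weighted); the predicates `is_relevant` and `is_supported` abstract over that judge:

```python
def context_precision(retrieved_contexts, is_relevant):
    # Simplified context precision: share of retrieved chunks judged
    # relevant to the question.
    if not retrieved_contexts:
        return 0.0
    return sum(map(is_relevant, retrieved_contexts)) / len(retrieved_contexts)

def context_recall(ground_truth_statements, is_supported):
    # Simplified context recall: share of ground-truth statements that
    # the retrieved context supports.
    if not ground_truth_statements:
        return 0.0
    return sum(map(is_supported, ground_truth_statements)) / len(ground_truth_statements)
```

Intuitively, precision penalizes a retriever that pads the context with noise, while recall penalizes one that misses evidence the answer needs, which is why the two are reported together.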
We used Chroma as the vector database to load Wikipedia documents from the FEVER dataset. Using Langchain, we created different RAG chains and executed them, evaluating the results with Ragas. For the full code, please refer to our official GitHub repository.
First, let’s compare the performance evaluation results for each type of RAG. In all metrics, RAG with HyDE showed a significant improvement over naive RAG. This supports the hypothesis that hypothetical answers enhance retrieval performance more effectively than the original questions.
Meanwhile, RAG with reranking showed a slight performance improvement over RAG without reranking, but the difference was not significant. However, this could improve further with adjustments to reranking parameters, such as the number of retrieved documents and the top_n setting.
Next, we compared the performance evaluation results of raw LLM without RAG and RAG based on Answer Correctness. OpenAI's GPT-4o delivered the best results. Interestingly, the naive RAG, despite being based on the same Llama-3 8B model, performed worse than the raw Llama-3. However, similar to the earlier findings, applying HyDE resulted in significant performance improvements, bringing it close to the performance level of GPT-4o.
We have explored what RAG is, the challenges in building a RAG system, and techniques like HyDE and reranking to overcome these challenges. Benchmark results demonstrate how these methods improve RAG's performance.
In the next post, we'll cover how to enhance model performance through fine-tuning. Please stay tuned!