12 July 2024

How To Build & Serve Private LLMs - (2) RAG

This post takes a deep dive into RAG techniques.

This is the second article in a series on How to Build & Serve Private LLMs. You can read the first post here↗.

In our previous post, we covered three strategies for building private LLMs: Retrieval-Augmented Generation (RAG), fine-tuning, and quantization and attention operation optimization. In this post, we will focus on RAG.

What is Retrieval-Augmented Generation (RAG)?

RAG is a technique that enhances a language model's text generation by connecting it to external knowledge bases. The model can then retrieve relevant information at query time, which improves its understanding and helps it generate factual responses.

How RAG Works

  • Knowledge Base Creation: External knowledge (documents) is vectorized and stored in a database.
  • Query Encoding: When a user asks a question, it is encoded into a vector format.
  • Document Retrieval: The encoded query is used to search for and retrieve the most relevant documents from the knowledge database based on vector similarity.
  • Context Augmentation: Retrieved documents are added to the model's input to enhance context.
  • Answer Generation: The language model generates a response based on the augmented context.
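
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in our benchmark: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory `index` list stands in for a vector database, and the sample documents and function names are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Step 1. Knowledge base creation: vectorize the documents and store them.
documents = [
    "Paris is the capital of France.",
    "The Great Wall of China is thousands of kilometers long.",
    "Llama 3 is a family of open-weight language models.",
]
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = embed(query)  # Step 2. Query encoding
    # Step 3. Document retrieval: rank stored documents by vector similarity.
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    # Step 4. Context augmentation: prepend the retrieved documents to the question.
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


query = "What is the capital of France?"
prompt = build_prompt(query, retrieve(query, k=1))
# Step 5. Answer generation: `prompt` would now be sent to the language model.
print(prompt)
```

In a real system, only `embed` and the storage behind `index` change; the flow of encode, retrieve, augment, and generate stays the same.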

Advantages of RAG:

  • Cost Efficiency: Instead of constantly retraining the model to include the latest information, you update the external knowledge base, and the model generates answers from the most current data available there.
  • Reliable Sources: Using retrieved documents as a foundation for answers enhances the accuracy and trustworthiness of the responses, which is especially crucial in fields like finance and law.

Challenges in Building a RAG System

Like any machine learning system, RAG comes with its own set of challenges:

  • Document Processing: Documents may contain tables or images, or come in formats like PDF, which makes text extraction and parsing difficult. Poorly processed documents won't provide meaningful information.
  • Cost of Document Ingestion: Documents must be split into chunks before ingestion, which can be resource-intensive, and the chunk size is a trade-off. Larger chunks carry more surrounding context but slow down retrieval, while smaller chunks speed up retrieval but may miss key context.
  • Reliability of Search Results: Retrieval methods may not always fetch the most relevant information, as they rely on vector similarity rather than factual accuracy.
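
To make the chunk-size trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It splits on characters for simplicity; the function name and the default values of `chunk_size` and `overlap` are illustrative assumptions, not the settings from our benchmark.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap repeats the tail of each chunk at the head of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk, at the price of storing and embedding more text.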

Enhancing RAG with HyDE and Reranking

To address the challenges above, techniques that preprocess queries and post-process results are widely used. HyDE and reranking are two such techniques.

Hypothetical Document Embeddings (HyDE)

HyDE generates a hypothetical answer to the query using an LLM and then searches the database with this answer instead of the original query. This hypothetical answer often serves as a better query for retrieving relevant documents, potentially improving the context and relevance of the generated response.
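
The idea can be sketched as follows. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` returns a canned passage in place of a real LLM call, and the documents and the prompt wording are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(search_text: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to search_text."""
    vector = embed(search_text)
    return sorted(documents, key=lambda d: cosine(vector, embed(d)), reverse=True)[:k]


def hyde_retrieve(query: str, documents: list[str],
                  generate: Callable[[str], str], k: int = 1) -> list[str]:
    """HyDE: search with an LLM-written hypothetical answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers the question: {query}")
    return retrieve(hypothetical, documents, k)


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; a real system would query a model here.
    return "Water boils at 100 degrees Celsius at sea level pressure."


documents = [
    "At sea level, water reaches its boiling point at 100 degrees Celsius.",
    "How hot is too hot? A guide to summer heat safety.",
]
query = "How hot does water need to get before it bubbles?"

print("naive:", retrieve(query, documents))
print("HyDE: ", hyde_retrieve(query, documents, fake_llm))
```

In this example the raw question shares more words with the heat-safety guide, so naive retrieval picks the wrong document, while the hypothetical answer is phrased like the relevant passage and retrieves it. The hypothetical answer does not need to be correct; it only needs to resemble the documents that contain the answer.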

Reranking

After the initial document retrieval, reranking uses another LLM to reorder the retrieved documents by their relevance to the query. Because the reranker scores each document against the query directly, it compensates for information lost when text is compressed into a vector, at the cost of longer retrieval time.
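
A minimal sketch of this second stage is shown below. The `toy_score` function is only a word-overlap stand-in for a real reranking model, and the candidate documents are invented; the point is the shape of the step, where a larger candidate list from the vector search is rescored and cut down to `top_n`.

```python
import re
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Reorder first-stage candidates by a (query, document) relevance score."""
    return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)[:top_n]


def toy_score(query: str, doc: str) -> float:
    """Stand-in for a reranking model: fraction of query words found in the document."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    doc_words = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Candidates in the order an initial vector search might have returned them.
candidates = [
    "The Eiffel Tower restaurant menu and opening hours.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Gustave Eiffel also designed bridges.",
]
query = "When was the Eiffel Tower completed?"

print(rerank(query, candidates, toy_score, top_n=2))
```

The two knobs are how many candidates the first stage returns and how many survive reranking (`top_n`). Retrieving more candidates gives the reranker a better chance of finding the right document but adds scoring time for every query.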

You might wonder how much HyDE and reranking improve RAG performance. Let us show you the benchmark results.

Evaluating RAG Performance

Before diving into benchmark details, let's discuss how to measure RAG's performance. We will use the Ragas↗ framework, which evaluates answer relevance and accuracy using various metrics.

Ragas uses an LLM as a judge, so its scores are not completely reliable, but it is much more efficient than manually evaluating generated responses.

Key Metrics in Ragas

  • Faithfulness: Consistency of the generated answer with the given context.
  • Answer Relevancy: Pertinence of the answer to the prompt.
  • Context Precision: Whether the relevant items in the retrieved context are ranked above the irrelevant ones.
  • Context Recall: How well the retrieved context covers the annotated (ground-truth) answer.
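
To give a feel for how these scores are computed, here is a simplified sketch of three of them, based on our reading of the Ragas documentation. In Ragas the true/false verdicts come from an LLM judge; here they are passed in directly as lists of booleans, and the function names are our own, not the Ragas API. Answer Relevancy is omitted because it depends on an embedding model.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in the generated answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall(reference_statement_found: list[bool]) -> float:
    """Share of ground-truth statements that can be attributed to the retrieved context."""
    if not reference_statement_found:
        return 0.0
    return sum(reference_statement_found) / len(reference_statement_found)


def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    Relevant chunks near the top of the ranking raise the score.
    """
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_is_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


# Same two relevant chunks, ranked well vs. ranked poorly.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
print(faithfulness([True, True, False, True]))
```

The two `context_precision` calls contain the same number of relevant chunks, but the second ranking scores lower because the relevant chunks appear after an irrelevant one.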

Benchmarking

Benchmarking Task

We use part of the FEVER (Fact Extraction and VERification)↗ dataset, which consists of 5 million Wikipedia documents and 400,000 claims. The task involves determining the veracity of these claims by leveraging the Wikipedia dataset stored in a vector database.

For our benchmarking process, we used a subset of the data to make the task more manageable. Instead of the full 5 million documents, we extracted 200,000 documents. We then used these documents to benchmark against 2,010 claims, providing a more focused and efficient evaluation.

Benchmarking Architectures

  • GPT-4o with Prompt Engineering
  • Llama-3 8B with Prompt Engineering
  • Llama-3 8B with Naive RAG
  • Llama-3 8B with HyDE
  • Llama-3 8B with Reranking
  • Llama-3 8B with HyDE and Reranking

Metrics for Comparison

We will use Ragas to compare various RAG architectures based on four key metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall. However, for architectures that rely solely on prompt engineering and lack a retrieval process, these four metrics cannot be applied. Therefore, we will include an additional metric, Answer Correctness, to ensure a comprehensive comparison.
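
As a rough illustration of how Answer Correctness works, the sketch below combines a factual F1 score with a semantic similarity score. It follows our understanding of the Ragas documentation; the 0.75/0.25 weighting is assumed to be the default and should be checked against the Ragas version you use. The statement counts would normally come from an LLM judge and the similarity from an embedding model, so both are plain inputs here.

```python
def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       factual_weight: float = 0.75,
                       semantic_weight: float = 0.25) -> float:
    """Weighted mix of factual F1 and semantic similarity (weights are an assumption).

    tp: statements present in both the answer and the ground truth
    fp: statements in the answer that the ground truth does not support
    fn: ground-truth statements missing from the answer
    """
    denominator = tp + 0.5 * (fp + fn)
    f1 = tp / denominator if denominator else 0.0
    return factual_weight * f1 + semantic_weight * semantic_similarity


# Two correct statements, one unsupported, one missing, similarity of 0.8.
print(round(answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.8), 3))
```

Because this metric only needs the generated answer and the ground truth, it applies equally to architectures with and without a retrieval step.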

Code

We used Chroma↗ as the vector database to load Wikipedia documents from the FEVER dataset. Using LangChain↗, we built the different RAG chains, ran them, and evaluated the results with Ragas. For the full code, please refer to our official GitHub repository↗.

Results

First, let’s compare the performance evaluation results for each type of RAG. On all metrics, RAG with HyDE showed a significant improvement over naive RAG. This supports the hypothesis that hypothetical answers work better as retrieval queries than the original questions do.

Meanwhile, RAG with reranking showed a slight performance improvement over RAG without reranking, but the difference was not significant. However, this could improve further with adjustments to reranking parameters, such as the number of retrieved documents and the top_n setting.

Next, we compared the raw LLMs without RAG against the RAG architectures on Answer Correctness. OpenAI's GPT-4o delivered the best results. Interestingly, naive RAG performed worse than raw Llama-3 8B, even though both use the same model. However, as in the earlier comparison, applying HyDE improved performance significantly, bringing it close to the level of GPT-4o.

Conclusion

We have explored what RAG is, the challenges in building a RAG system, and techniques like HyDE and reranking that address these challenges. In our benchmark, HyDE improved RAG's performance significantly, while reranking produced only a slight gain.

In the next post, we'll cover how to enhance model performance through fine-tuning. Please stay tuned!

Ian Lee

Solutions Engineer

Jay Chun

CTO

Wayne Kim

Technical Communicator
