02 August 2024
This post delves deep into the fine-tuning of LLMs.
This is the third article in a series on How to Build & Serve Private LLMs. You can read the previous posts here:
(1) Introduction↗
(2) Retrieval-Augmented Generation (RAG)↗
In our previous post, we covered three strategies for building private LLMs: RAG, fine-tuning, and quantization with caching. In this post, we will focus on fine-tuning.
LLM fine-tuning is the process of taking a pre-trained model and further training it with new data to perform a specific task more effectively. While pre-trained models can handle general tasks well, they may struggle with tasks requiring precise knowledge from a specific domain. By training the model with a relatively small, domain-specific dataset, you can significantly improve its performance for those particular tasks.
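As a rough illustration of this process, here is a minimal supervised fine-tuning sketch using Hugging Face Transformers. The model name, dataset file, and hyperparameters are placeholders for illustration, not values used in this post.

```python
# Minimal full fine-tuning sketch with Hugging Face Transformers.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"              # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# A small, domain-specific corpus; assumes a JSONL file with a "text" column.
dataset = load_dataset("json", data_files="domain_corpus.jsonl", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ft-out", per_device_train_batch_size=1,
        gradient_accumulation_steps=8, num_train_epochs=1,
        learning_rate=2e-5, bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Note that this updates all of the model's parameters, which is exactly the cost that the techniques below aim to reduce.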
When you want to adjust the responses of an LLM using external knowledge, RAG is usually sufficient. However, there are scenarios where fine-tuning can dramatically enhance LLM performance:
When the LLM needs to use specialized terminology that a generally trained model does not know, such as terms from the legal, medical, or educational domains.
Training a model typically requires substantial computational resources and time, but various fine-tuning techniques can help reduce these costs. Here are several widely used fine-tuning techniques we will introduce:
Instead of training the entire neural network, PEFT↗ involves training only a subset of parameters while keeping most of them fixed. This method drastically reduces the computational resources and time required compared to full model training.
PEFT techniques include:
One widely-used reparameterization technique is LoRA↗, which uses low-rank matrices to approximate the changes needed during fine-tuning. This method leverages the insight that the space of task-specific modifications is often much smaller than the full parameter space of the model. By adapting only these low-rank matrices, LoRA significantly reduces the computational resources and memory required for fine-tuning, making it a highly efficient approach.
The main advantage of LoRA is that it allows LLMs to be fine-tuned with minimal computational overhead while maintaining performance close to that of full fine-tuning. Additionally, LoRA’s ability to keep most of the original model parameters fixed ensures that the pre-trained knowledge is retained, while the low-rank updates provide the necessary flexibility to adapt to specific tasks.
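To make this concrete, here is a minimal LoRA sketch using the Hugging Face PEFT library. The base model, rank, and target modules below are illustrative assumptions, not recommendations from this post.

```python
# Attach LoRA adapters to a pre-trained model; only the adapters are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder

lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update matrices
    lora_alpha=32,            # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```

The wrapped model can then be passed to the same Trainer setup as before; only the adapter weights receive gradient updates.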
For further cost reduction, Quantized LoRA (QLoRA↗) applies quantization to the LoRA process. We are planning to cover quantization in our next post.
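A QLoRA-style setup can be sketched by loading the base model in 4-bit with bitsandbytes and then attaching LoRA adapters on top. The model name and quantization settings below are illustrative assumptions.

```python
# QLoRA-style sketch: 4-bit quantized base model + LoRA adapters.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16
    bnb_4bit_use_double_quant=True,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config, device_map="auto")
base = prepare_model_for_kbit_training(base)  # make the quantized model trainable with adapters

model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```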
Recently, a new method called DoRA↗ has emerged. DoRA builds on the principles of LoRA by further decomposing the pre-trained model's weights into two components: magnitude and direction. It applies low-rank adaptations only to the direction component, which represents the orientation of the weight vectors in the parameter space.
The key benefit of DoRA is its ability to enhance the stability and performance of low-rank adaptations. By focusing on directional adjustments, DoRA minimizes the potential disruptions to the pre-trained model's inherent knowledge while still providing the flexibility needed to adapt to new tasks. This method also mitigates some of the accuracy trade-offs typically associated with aggressive quantization or low-rank adaptations.
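Recent versions of the PEFT library expose DoRA through a flag on LoraConfig, so switching from LoRA to DoRA is a small configuration change. The settings below are illustrative.

```python
# DoRA sketch: decompose weights into magnitude and direction, adapt the direction.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,            # enable weight decomposition on top of the low-rank update
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, dora_config)
```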
ReFT (Reinforced Fine-Tuning) introduces the principles of reinforcement learning into the fine-tuning process of LLMs. ReFT is a two-step process. It starts with a warm-up phase in which the model undergoes supervised fine-tuning on a dataset of question and chain-of-thought (CoT) pairs. The second phase employs reinforcement learning, specifically the Proximal Policy Optimization (PPO) algorithm: the model explores various reasoning paths and updates its parameters based on the accuracy of its answers compared to ground-truth data.
This combination of supervised and reinforcement learning allows the model to learn from diverse CoT annotations, improving its ability to generalize to new scenarios. ReFT has demonstrated significant performance improvements over traditional supervised fine-tuning across various datasets, showcasing its effectiveness in enhancing the generalization and problem-solving capabilities of LLMs.
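To make the second phase more concrete, here is a rough sketch of reward-driven PPO updates using TRL's PPOTrainer. TRL's PPO interface has changed across releases, and the data iterator (question_batches) and answer-matching helper (check_answer) are hypothetical placeholders, not part of the ReFT paper or of TRL.

```python
# Sketch of ReFT's reinforcement phase with TRL's (older) PPOTrainer API.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_path = "path/to/sft-warmed-up-model"           # model after the CoT warm-up phase
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
ppo_trainer = PPOTrainer(PPOConfig(batch_size=8, mini_batch_size=2), model, tokenizer=tokenizer)

for batch in question_batches:                        # hypothetical iterable of tokenized batches
    queries = batch["query_tensors"]                  # list of tokenized questions
    # Sample chain-of-thought reasoning paths from the current policy.
    responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=512)
    # Reward 1.0 when the extracted final answer matches the ground truth, else 0.0.
    rewards = [torch.tensor(1.0 if check_answer(tokenizer.decode(r), gold) else 0.0)
               for r, gold in zip(responses, batch["answers"])]
    ppo_trainer.step(queries, responses, rewards)     # PPO update of the policy
```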
Lastly, we want to introduce Unsloth↗, a library designed to accelerate LLM training with optimized GPU kernel implementations. By manually deriving the compute-heavy math steps and handwriting GPU kernels, Unsloth enables faster training with reduced GPU memory usage. Despite some stability issues in our tests, it increased training speed by more than 2x.
NOTE: Open-source Unsloth only supports single-GPU training. You must subscribe to a paid plan to train on multiple GPUs.
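For reference, a minimal Unsloth sketch looks like the following; the model name and options are illustrative, so check the library's documentation for the exact arguments in your version.

```python
# Unsloth sketch: load a pre-quantized base model and attach LoRA adapters
# through Unsloth's optimized code path.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",   # pre-quantized base model
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# The returned model and tokenizer can then be passed to a standard
# Hugging Face / TRL training loop.
```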
To evaluate how fine-tuning enhances LLM performance and reduces costs, we conducted a simple experiment using the ljp_criminal subset of the lbox/lbox_open↗ dataset. The evaluation results are as follows:
In this post, we explored the concept of fine-tuning and various techniques to make it more cost-effective. If you want to try fine-tuning Llama-3-8B using QLoRA, you can easily start the process on VESSL Hub↗. Give it a try!
Our next post will delve into quantization and attention operation optimization techniques to further enhance the efficiency of training and inference processes. Stay tuned!