This is the third article in a series on How to Build & Serve Private LLMs. You can read the previous posts here:
(1) Introduction↗
(2) Retrieval-Augmented Generation (RAG)↗
In our previous post, we covered three strategies for building private LLMs: RAG, fine-tuning, and quantization & caching. In this post, we will focus on fine-tuning.
What Is LLM Fine-Tuning?
LLM fine-tuning is the process of taking a pre-trained model and further training it with new data to perform a specific task more effectively. While pre-trained models can handle general tasks well, they may struggle with tasks requiring precise knowledge from a specific domain. By training the model with a relatively small, domain-specific dataset, you can significantly improve its performance for those particular tasks.
When & Why Do You Need Fine-Tuning?
When you want to adjust the responses of an LLM using external knowledge, RAG is usually sufficient. However, there are scenarios where fine-tuning can dramatically enhance LLM performance:
- Domain-specific terminology:
- When the LLM needs to use specialized terms unknown to a generally trained model, as in the legal, medical, or educational domains.
- Structured response formats:
- When you need responses in a specific format, such as JSON or YAML.
- Especially useful for API integrations.
- Consistent tone control:
- Useful for maintaining a specific “AI persona,” such as Character.ai↗.
- Helps tailor responses to the expectations of a target audience, for example, having the model explicitly state that no information was found when retrieval fails.
How to Fine-Tune Efficiently
Training a model typically requires substantial computational resources and time, but various fine-tuning techniques can help reduce these costs. Below, we introduce several well-known techniques for efficient fine-tuning:
Parameter-Efficient Fine-Tuning (PEFT)
Instead of training the entire neural network, PEFT↗ involves training only a subset of parameters while keeping most of them fixed. This method drastically reduces the computational resources and time required compared to full model training.
PEFT techniques include:
- Partial Fine-tuning: Only a portion of the model parameters are trained (see the sketch after this list).
- Additive Fine-tuning: New parameters are added to the model and trained while keeping the pre-trained parameters fixed.
- Reparameterization: Fine-tuning is performed in a low-rank subspace to minimize the number of trainable parameters.
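As a minimal illustration of the first category, partial fine-tuning amounts to freezing the whole model and re-enabling gradients for only a few layers. The sketch below uses Hugging Face Transformers; the choice of model and of which layers to unfreeze is purely illustrative, and attribute paths such as `model.model.layers` vary between architectures.

```python
from transformers import AutoModelForCausalLM

# Load a pre-trained causal LM (model name is illustrative).
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Freeze every parameter first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two transformer blocks and the LM head.
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.1%} of {total:,}")
```

Even this naive variant typically cuts optimizer memory and training time substantially, since gradients and optimizer states are only kept for the unfrozen layers.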
Low-Rank Adaptation (LoRA)
One widely-used reparameterization technique is LoRA↗, which uses low-rank matrices to approximate the changes needed during fine-tuning. This method leverages the insight that the space of task-specific modifications is often much smaller than the full parameter space of the model. By adapting only these low-rank matrices, LoRA significantly reduces the computational resources and memory required for fine-tuning, making it a highly efficient approach.
The main advantage of LoRA is that it allows LLMs to be fine-tuned with minimal computational overhead while maintaining performance close to that of full fine-tuning. Additionally, LoRA’s ability to keep most of the original model parameters fixed ensures that the pre-trained knowledge is retained, while the low-rank updates provide the necessary flexibility to adapt to specific tasks.
For further cost reduction, Quantized LoRA (QLoRA↗) can be used, which quantizes the frozen base model while training the LoRA adapters. We are planning to cover quantization in more detail in our next post.
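A minimal sketch of LoRA, and of its quantized variant, using the Hugging Face transformers, peft, and bitsandbytes libraries. The hyperparameters and target modules below are illustrative rather than recommendations, and 4-bit loading assumes a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit to cut memory further.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
)

# LoRA: train small low-rank adapter matrices instead of the full weights.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

From here, training proceeds as with any Transformers model; only the adapter weights receive gradient updates, and they can be saved and shipped separately from the base model.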
Weight-Decomposed Low-Rank Adaptation (DoRA)
Recently, a new method called DoRA↗ has emerged. DoRA builds on the principles of LoRA by further decomposing the pre-trained model's weights into two components: magnitude and direction. It applies low-rank adaptations only to the direction component, which represents the orientation of the weight vectors in the parameter space.
The key benefit of DoRA is its ability to enhance the stability and performance of low-rank adaptations. By focusing on directional adjustments, DoRA minimizes the potential disruptions to the pre-trained model's inherent knowledge while still providing the flexibility needed to adapt to new tasks. This method also mitigates some of the accuracy trade-offs typically associated with aggressive quantization or low-rank adaptations.
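If you are using the peft library, DoRA can be enabled on top of a LoRA configuration; below is a minimal sketch, assuming a peft release recent enough to support the `use_dora` flag (hyperparameters are illustrative).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# DoRA decomposes each adapted weight into magnitude and direction,
# applying the low-rank update only to the direction component.
dora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,   # switch the LoRA adapters to DoRA (requires a recent peft release)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
```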
Reinforced Fine-Tuning (ReFT)
ReFT introduces the principles of reinforcement learning into the fine-tuning of LLMs. It is a two-stage process: first, a warm-up phase in which the model undergoes supervised fine-tuning on a dataset of question and chain-of-thought (CoT) pairs; second, a reinforcement learning phase using the Proximal Policy Optimization (PPO) algorithm, in which the model explores various reasoning paths and updates its parameters based on the accuracy of its answers compared to ground-truth data.
This combination of supervised and reinforcement learning allows the model to learn from diverse CoT annotations, improving its ability to generalize to new scenarios. ReFT has demonstrated significant performance improvements over traditional supervised fine-tuning across various datasets, showcasing its effectiveness in enhancing the generalization and problem-solving capabilities of LLMs.
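Below is a simplified sketch of the reward signal that drives the second phase. The answer format it assumes and the surrounding training loop (summarized in comments) are our own illustration, not the paper's implementation; a real setup would typically build on an RL library such as TRL.

```python
import re

def extract_final_answer(cot: str) -> str:
    """Pull the final answer from a sampled chain of thought.

    Assumes the CoT ends with a phrase like 'The answer is 42.' (illustrative).
    """
    match = re.search(r"answer is\s*([^\s.]+)", cot, flags=re.IGNORECASE)
    return match.group(1) if match else ""

def reward(sampled_cot: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the ground truth."""
    return 1.0 if extract_final_answer(sampled_cot) == ground_truth else 0.0

# Conceptual training loop (PPO machinery omitted):
# 1. Warm-up: supervised fine-tuning on (question, CoT) pairs.
# 2. For each batch: sample several CoTs per question from the current policy,
#    score them with reward(), and update the policy with PPO so that
#    reasoning paths leading to correct answers become more likely.
```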
Accelerating Even More With Optimized Kernel: Unsloth
Lastly, we want to introduce Unsloth↗, a library designed to accelerate LLM training with optimized GPU kernel implementations. By manually deriving the compute-heavy math steps and handwriting GPU kernels, Unsloth enables faster training with reduced GPU memory usage. Despite some stability issues during our tests, it increased training speed by more than 2x.
NOTE: Open-source Unsloth only supports single-GPU training. You must subscribe to a paid plan to train on multiple GPUs.
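Switching to Unsloth requires only small changes to the familiar Transformers/peft workflow. The sketch below follows the usage pattern in Unsloth's documentation at the time of writing; the model name and hyperparameters are illustrative.

```python
from unsloth import FastLanguageModel

# Load the base model with Unsloth's optimized kernels (optionally in 4-bit).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters through Unsloth's patched peft integration.
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# From here, training proceeds as usual, e.g. with a standard SFT trainer.
```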
Evaluating The Efficiency Of Fine-Tuning
To evaluate how fine-tuning enhances LLM performance and reduces costs, we conducted a simple experiment:
- Dataset: ljp_criminal subset of the lbox/lbox_open↗ dataset.
- Task: Determine the type and severity of punishment (fine/imprisonment) based on the context and background.
- Data Size: 8400 chunks, 10M tokens (equivalent to 6-7 books).
- Output: Forced JSON format for data processing (a hypothetical example is sketched after this list).
- GPU Used: NVIDIA L4
- Cost: Approximately $1 per hour in the cloud.
- VRAM: 24GB, primarily used for inference.
- Model: mistralai/Mistral-7B-Instruct-v0.2↗
- Configuration: rank=8, lora_alpha=32.
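To make the forced JSON output concrete, here is a hypothetical target completion and the downstream parsing it enables. The field names and values are our own illustration, not the actual schema or labels used in the experiment.

```python
import json

# Hypothetical target completion the fine-tuned model is trained to emit.
target_completion = """
{
  "punishment_type": "imprisonment",
  "severity": "8 months"
}
"""

# Because the output is constrained to JSON, downstream processing is a plain parse.
record = json.loads(target_completion)
print(record["punishment_type"], record["severity"])
```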
The evaluation results are as follows:
- Performance Assessment:
- Fine-tuning significantly improved the structured output of responses.
- Many errors from the baseline model were due to failures in producing JSON output.
- GPT-4 followed prompts well but often gave incorrect sentencing or irrelevant text when it lacked information.
- Comparison of QLoRA, QDoRA, and Unsloth:
- Training costs for LoRA and DoRA were similar.
- Unsloth achieved double the training speed of LoRA and DoRA while using only 50% of the VRAM.
Conclusion
In this post, we explored the concept of fine-tuning and various techniques to make it more cost-effective. If you want to try fine-tuning Llama-3-8B using QLoRA, you can easily start the process on VESSL Hub↗. Give it a try!
Our next post will delve into quantization and attention operation optimization techniques to further enhance the efficiency of training and inference processes. Stay tuned!