For developers, one of the most familiar use cases of “ChatGPT for X” has been Docs AI, a virtual assistant that lets them pose natural-language queries and get responses with a summary and links to relevant docs. These assistants help developers spend less time reading and more time building.
Across industries, companies are exploring similar LLM-powered document Q&A chatbots, both internal and customer-facing, often built on OpenAI APIs such as GPT, Assistants, and Embeddings. However, these APIs come with inherent limitations.
- Companies want to self-host the models and embeddings without third-party APIs and vendor lock-in. Some even have to host them offline for privacy reasons.
- API costs add up. For example, the Assistants API, planned to be priced at $0.20 / GB / assistant / day, can quickly outpace the up-front investment of self-hosting LLMs.
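To make the pricing concrete, here is a rough back-of-the-envelope sketch. It only applies the flat $0.20 / GB / assistant / day rate quoted above; the workload size and the up-front budget are hypothetical numbers, and the estimate ignores token charges on the API side and ongoing GPU costs on the self-hosting side.

```python
def assistants_storage_cost(gb: float, assistants: int, days: int,
                            rate_per_gb_day: float = 0.20) -> float:
    """Storage cost in USD at a flat rate per GB, per assistant, per day."""
    return rate_per_gb_day * gb * assistants * days


def breakeven_days(upfront_usd: float, gb: float, assistants: int,
                   rate_per_gb_day: float = 0.20) -> float:
    """Days until cumulative storage cost equals a one-off self-hosting spend."""
    daily = assistants_storage_cost(gb, assistants, days=1,
                                    rate_per_gb_day=rate_per_gb_day)
    return upfront_usd / daily


# Hypothetical workload: 10 GB of documents shared by 5 assistants.
monthly = assistants_storage_cost(gb=10, assistants=5, days=30)
print(f"30-day storage cost: ${monthly:,.2f}")

# Hypothetical up-front self-hosting budget of $3,000.
print(f"Break-even after {breakeven_days(3000, gb=10, assistants=5):.0f} days")
```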
In this post, we’ll walk through a quick recipe for a self-hosted, LLM-powered Q&A chatbot. Here’s what we are going to use.
- We’ll use the pre-trained Llama 2 checkpoint and the instructor-large embedding model for our customized embedding. These two replace the OpenAI APIs for the GPT and Embedding models.
- We’ll use the raw markdown files of VESSL AI Docs as our data, and index and ingest them with LlamaIndex. Simply put, LlamaIndex acts as the data interface for our guide.
- We use VESSL Serve as our serving and deployment infrastructure. VESSL Serve abstracts everything you need to deploy a model — pulling models from the model registry, building web APIs, port forwarding, autoscaling, and more — into a single command.
1. Getting started
Preparing the model — Llama 2
The easiest way to start with pre-trained Llama 2 is to convert the model checkpoint to the Hugging Face format using this conversion script.
To prepare the model for deployment with VESSL Serve, upload the converted checkpoint to our model registry. You can do this by uploading your local files under Models or by using the VESSL CLI.
Preparing the dataset — VESSL AI Docs
Here, we will be using the raw text dataset from our documentation. We’ve crawled our docs into separate .txt files and uploaded them to our S3. LlamaIndex is designed to take raw data like ours; however, data preprocessing becomes essential when dealing with diverse or specialized data formats such as PDFs and web crawls. For now, let’s upload the .zip file to VESSL Dataset. VESSL Dataset will help us record the lineage and snapshots of the dataset every time we run the model. This can also be done using the VESSL CLI.
2. Configuring LlamaIndex
In serve.py, we’ve prepared a custom LlamaIndex session that sets the pre-trained model, embedding model, dataset, and more. We are pulling the instructor-large embedding model from Hugging Face. We’ve also set model_name to Llama 2 7B. You can point to other LLMs hosted on Hugging Face Models, such as Zephyr 7B, by changing the value to HuggingFaceH4/zephyr-7b-beta. Lastly, we are also loading our dataset from our managed storage.
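The configuration described above can be sketched roughly as follows. This is an illustration rather than the contents of serve.py: it assumes a pre-0.10 llama_index release (ServiceContext, llama_index.llms.HuggingFaceLLM, llama_index.embeddings.InstructorEmbedding) with the InstructorEmbedding package installed, and the model IDs, data directory, and numeric defaults are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass
class DocsAIConfig:
    # All defaults are illustrative assumptions, not the values in serve.py.
    model_name: str = "meta-llama/Llama-2-7b-hf"  # or "HuggingFaceH4/zephyr-7b-beta"
    embed_model_name: str = "hkunlp/instructor-large"
    data_dir: str = "./data"                      # hypothetical dataset mount
    context_window: int = 2048
    num_output: int = 256
    chunk_size: int = 512
    use_fp16: bool = False                        # enable on small GPUs


def build_query_engine(cfg: DocsAIConfig):
    """Index the documents and return a query engine (needs llama_index < 0.10)."""
    from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.embeddings import InstructorEmbedding
    from llama_index.llms import HuggingFaceLLM

    model_kwargs = {}
    if cfg.use_fp16:
        import torch
        model_kwargs["torch_dtype"] = torch.float16  # halves weight memory

    llm = HuggingFaceLLM(
        model_name=cfg.model_name,
        tokenizer_name=cfg.model_name,
        context_window=cfg.context_window,
        max_new_tokens=cfg.num_output,
        device_map="auto",
        model_kwargs=model_kwargs,
    )
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=InstructorEmbedding(model_name=cfg.embed_model_name),
        chunk_size=cfg.chunk_size,
        context_window=cfg.context_window,
        num_output=cfg.num_output,
    )
    documents = SimpleDirectoryReader(cfg.data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    return index.as_query_engine()


# Swapping the LLM only means changing one field:
zephyr_cfg = DocsAIConfig(model_name="HuggingFaceH4/zephyr-7b-beta")
```

Keeping the settings in one dataclass means the serving code only has to call build_query_engine(cfg) once at startup.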
Since we are using HuggingFaceLLM to define our LLM, we can also play around with LLM-specific parameters like context_window (the size of the context window), num_output (the maximum number of output tokens), and chunk_size (the size of each text chunk). If you are working with limited GPU memory, first uncomment torch.float16 and keep these values relatively low. Refer to the LlamaIndex documentation to find out more about these parameters.
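To see how these parameters interact, here is a rough sizing sketch. It is a simplification under stated assumptions: the prompt_overhead allowance for the question and prompt template is a made-up figure, the parameter count is a nominal 7 billion, and the memory estimate covers weights only (no activations or KV cache).

```python
def chunks_that_fit(context_window: int, num_output: int, chunk_size: int,
                    prompt_overhead: int = 200) -> int:
    """Rough count of retrieved chunks that fit beside the reserved output tokens."""
    available = context_window - num_output - prompt_overhead
    return max(available // chunk_size, 0)


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Smaller chunks leave room for more retrieved passages in the same window.
for cw, out, cs in [(4096, 256, 1024), (2048, 256, 1024), (2048, 256, 512)]:
    print(f"context_window={cw} num_output={out} chunk_size={cs} "
          f"-> {chunks_that_fit(cw, out, cs)} chunk(s)")

# float32 uses 4 bytes per parameter, float16 uses 2.
print(f"float32: {weight_memory_gb(7e9, 4):.0f} GB, "
      f"float16: {weight_memory_gb(7e9, 2):.0f} GB")
```

The second calculation is why switching to torch.float16 is the first thing to try on a smaller GPU: it halves the memory the weights need before any other parameter is tuned.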
3. Serving our Docs AI
With our LlamaIndex code all set, let’s make it serviceable using BentoML. Here, we use BentoML’s Runnable class. This class defines our custom function generate, which receives user input text and returns inference results. Then, we package the model into an API.
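A minimal sketch of this pattern is shown below. It assumes the BentoML 1.x Runner API (bentoml.Runnable, bentoml.Runner, bentoml.Service, bentoml.io.Text); the names DocsAnswerer, DocsRunnable, create_service, and make_query_engine are hypothetical and not taken from serve.py.

```python
class DocsAnswerer:
    """Framework-free wrapper: turns user text into an answer string."""

    def __init__(self, query_engine):
        self.query_engine = query_engine  # anything with a .query(text) method

    def generate(self, input_text: str) -> str:
        text = input_text.strip()
        if not text:
            return "Please enter a question."
        return str(self.query_engine.query(text))


def create_service(make_query_engine, name: str = "docs_ai"):
    """Wrap DocsAnswerer in a BentoML runner and expose it as a text API."""
    import bentoml                 # lazy import; assumes the BentoML 1.x API
    from bentoml.io import Text

    class DocsRunnable(bentoml.Runnable):
        SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
        SUPPORTS_CPU_MULTI_THREADING = False

        def __init__(self):
            # Built inside the runner so the model loads once per worker.
            self.answerer = DocsAnswerer(make_query_engine())

        @bentoml.Runnable.method(batchable=False)
        def generate(self, input_text: str) -> str:
            return self.answerer.generate(input_text)

    runner = bentoml.Runner(DocsRunnable, name="docs_runner")
    svc = bentoml.Service(name, runners=[runner])

    @svc.api(input=Text(), output=Text())
    async def generate(input_text: str) -> str:
        return await runner.generate.async_run(input_text)

    return svc
```

In a real service module you would assign the result at module level (for example, svc = create_service(...)) so that BentoML can import it. Keeping the answer logic in a plain class also makes it testable without loading a model.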
Next, we’ll set up the serving infrastructure using VESSL Serve. VESSL Serve abstracts the cloud infrastructure for serving large-scale AI models into a unified YAML definition. After defining the YAML, we can serve the model on the cloud with a single command.
Under resources, you can define the cloud or GPU specifications for the task. Here, we are using a V100 instance from our managed AWS cloud, and the definition also accounts for peak usage. Under import, we pull the code from GitHub, and the model and dataset from the managed model registry and storage we uploaded them to earlier. The YAML also opens a dedicated port for inference.
Finally, we can deploy the model into production with a single-line command.
4. Docs AI in action
After successfully submitting the serving workload to the cloud, you can see our Docs AI in action using the serving API window configured with BentoML. BentoML’s Swagger UI page lists the APIs you can use for our app. Here, we have a curl command that sends a user query and uses the API to return the generated text.
You can also use Python to send requests.
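As a rough sketch, a standard-library client could look like this. The base URL and the /generate route are assumptions (use the endpoint shown on your Swagger UI page), and the request body is sent as plain text on the assumption that the API takes a text input.

```python
import urllib.request


def build_request(base_url: str, question: str,
                  route: str = "/generate") -> urllib.request.Request:
    """Build a POST request carrying the question as plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=question.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def ask(base_url: str, question: str, timeout: float = 60.0) -> str:
    """Send the question to the serving endpoint and return the generated text."""
    request = build_request(base_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage with a placeholder endpoint:
# print(ask("http://localhost:3000", "How do I create a dataset?"))
```

Generation on a single GPU can take a while, so the timeout is deliberately generous.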
In this guide, we explored how you can set up an MVP LLM-powered document Q&A with LlamaIndex, BentoML, and VESSL Serve. We’ve modified LlamaIndex to use a custom embedder, instructor-large, alongside pre-trained Llama 2 7B. The dataset we used here consists of minimally processed text files crawled from our documentation, which leaves our Docs AI prone to hallucination. You can see our app in action at VESSL Hub.
In our next post, we will explore (1) how controlling parameters like the number of tokens and chunk size can mitigate hallucination, and (2) how companies can use our LLM Suite to easily process data to improve our app.
Yong Hee Lee, Growth Manager
David Oh, ML Engineer
Sunghyun Moon, ML Lead