30 October 2023
Scalable infrastructure for putting LLMs in production faster at scale — without the massive cost
The release of milestone models like Llama 2 and Mistral 7B has largely commoditized the model stack. Frameworks like LlamaIndex and LangChain have made it easier for companies to plug in and experiment with these open-source models using proprietary datasets. However, building production-grade custom LLMs remains a massive challenge.
Companies often ask, 'We have the models and datasets. Now, how do we scale them into production?' Over the past few months, our interactions with customers↗ exploring LLMOps, and the emerging language model stack↗ more broadly, have taught us that building and deploying production-grade LLMs is more of a challenge in the compute and infrastructure stack than in the models themselves.
In this post, we share how VESSL AI solves these problems through a series of 3-minute guides. Along the way, we also discuss the critical components that make up a Llama-scale AI infrastructure.
With VESSL AI, you can run pre-built open-source LLMs on any cloud with a single command. Take a look at the following example of LLM as Chatbot↗.
vessl run create -f llm-chatbot.yaml
# llm-chatbot.yaml
name: LLM-As-Chatbot
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
imports:
  /root/examples: git://github.com/deep-diver/LLM-As-Chatbot
run:
  - workdir: /root/examples
    command: |
      pip install -r requirements.txt
      LLMCHAT_APP_MODE=GRADIO python entry_point.py
ports:
  - 6006
For both cost and security purposes, companies often want to try out model demos built on HuggingFace Spaces or publicly available Streamlit apps, but on their own infrastructure. We created a magic command you can use to bring your own GPU infrastructure and set up Kubernetes-backed private clouds or on-premise clusters in minutes.
vessl cluster create [CLUSTER_NAME]
After setting up the cluster, you can simply change the resources field.
# llm-chatbot.yaml
name: LLM-As-Chatbot
resources:
  cluster: [CLUSTER_NAME]
  preset: v100-1.mem-52
...
VESSL AI supports serverless inference with vLLM↗, making test-phase LLMs affordable even for teams with limited GPU resources.
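As a minimal sketch of what such a workload can look like, the run spec below launches vLLM's OpenAI-compatible server as an ordinary VESSL run; the model name, preset, and port are illustrative placeholders rather than the exact serverless configuration.
# vllm-inference.yaml (illustrative sketch; model, preset, and port are placeholders)
name: llama2-vllm-inference
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
run:
  - command: |
      pip install vllm
      python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-7b-hf \
        --port 8000
ports:
  - 8000
Once the server is up, any OpenAI-compatible client can query the exposed port.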
VESSL AI also handles the runtime configuration and containerization of the workload, saving developers from wasting hours on CUDA and Python dependencies.
Fine-tuning LLMs to production grade requires not only custom datasets but also highly scalable and fault-tolerant infrastructure. With VESSL AI, you can launch a full-scale LLM fine-tuning workload on any cloud, at any scale, without worrying about the underlying system backend. Take a look at an example of the YAML file for Vicuna on Llama 2↗.
vessl run create -f vicuna-llama2.yaml
# vicuna-llama2.yaml
name: vicuna-llama2
resources:
  requests:
    nvidia.com/gpu:
      device_type: A100-80GB
      quantity: "8"
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
imports:
  /datasets: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  /root/fastchat/: git://github.com/lm-sys/FastChat
run:
  - workdir: /root/fastchat/data/
    command: |
      pip install git+https://github.com/lm-sys/FastChat.git@cfc73bf3e13c22ded81e89675e0d7b228cf4b342
      pip install -U xformers
      python hardcoded_questions.py
      python -m fastchat.data.merge --in /datasets/sharegpt.json hardcoded.json --out /root/data.json
  - workdir: /root/fastchat/train/
    command: |
      python train_xformers.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --data_path /root/data.json \
        --num_train_epochs 3 \
        --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
        --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
        --gradient_accumulation_steps $((128 * 512 / $SEQ_LEN / $PER_DEVICE_BATCH_SIZE / $NUM_NODES / $SKYPILOT_NUM_GPUS_PER_NODE)) \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 600 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length ${SEQ_LEN} \
        --gradient_checkpointing True \
        --lazy_preprocess True
Here are the common toolkits that companies need as they progress toward production-grade LLMs.
Faster time to deployment
VESSL AI supports PyTorch's DistributedDataParallel (DDP) and simplifies multi-cluster, multi-node distributed training, autoscaling, and dataset mounting into the following snippet.
num_nodes: 8
autoscaling:
  min: 2
  max: 8
  metric: GPU
  target: 80
imports:
  /my-dataset1: vessl-dataset://myOrg/myDataset
  # /my-dataset2: s3://myOrg/myDataset
  # /my-dataset3: nfs://local/myDataset
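For comparison, a hand-rolled multi-node DDP launch typically means running something like the command below on every node and wiring up the rendezvous endpoint yourself; the script name and endpoint address here are hypothetical.
# Hypothetical manual launch that the num_nodes field replaces
torchrun --nnodes=8 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=10.0.0.1:29500 \
  train.py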
Cost saving up to 80%
You can bring your own GPU clusters with the vessl cluster create command and source GPUs automatically based on availability and price.
resources:
  cluster: aws-apne
  # cluster: gcp us-central
  # cluster: local-a100
  preset: v1.v100-1.mem-52
Reliability & fault-tolerance
VESSL AI saves .pt files to mounted volumes or the model registry, ensures seamless checkpointing of fine-tuning progress, and proactively allocates checkpoints across diverse clouds.
export:
  /output: vessl-model://myOrg/llama2
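As an illustrative sketch, pointing the fine-tuning step's save directory at the exported path is how checkpoints end up in the model registry, under the assumption that anything written to the exported path is captured; --output_dir is the usual Hugging Face Trainer argument that FastChat's training script accepts, and the remaining flags from vicuna-llama2.yaml are elided here.
# Sketch: write checkpoints under /output so the export block picks them up
run:
  - workdir: /root/fastchat/train/
    command: |
      # Same training flags as in vicuna-llama2.yaml, plus --output_dir pointing at /output
      python train_xformers.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --data_path /root/data.json \
        --output_dir /output \
        --num_train_epochs 3
export:
  /output: vessl-model://myOrg/llama2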
VESSL AI is committed to providing a highly scalable AI infrastructure for putting LLMs into production faster, at scale. With VESSL AI, this comes at up to 80% reduced cost with zero engineering overhead. VESSL AI offloads the software and Ops challenges in LLM infrastructure, helping teams like ScatterLab↗ and Wrtn Technologies focus on the models and services themselves. ScatterLab, for example, uses VESSL AI to train, fine-tune, and deploy↗ Asia’s most advanced custom LLMs with support for emotional conversation, multi-modality, and proactive conversations.
For those progressing toward production-grade LLMs who want to check off the critical components of Llama-scale cloud infrastructure, contact us at growth@vessl.ai to learn more about VESSL AI.