30 October 2023

AI Infrastructure for LLMs

Scalable infrastructure for putting LLMs in production faster at scale — without the massive cost

The release of milestone models like Llama 2 and Mistral 7B has largely commoditized the model stack. Frameworks like LlamaIndex and LangChain have made it easier for companies to plug in and experiment with these open-source models using proprietary datasets. However, building production-grade custom LLMs remains a massive challenge.

Companies often ask, 'We have the models and datasets. Now, how can we scale them into production?' Our interactions over the past few months with customers↗ exploring LLMOps, and with the new language model stack↗ more generally, have taught us that building and deploying production-grade LLMs is more a challenge of the compute and infrastructure stack than of the models themselves.

  • How can we get HPC-grade GPUs amidst the global shortage?
    • Cloud computing capacity is difficult to secure, especially for generative AI and LLM applications. A100 80GB GPUs are sold out not only on major clouds but also on alternatives like Oracle Cloud and CoreWeave, and the transition to H100 has been slow.
  • LLMs are expensive. How can we save GPU costs?
    • Companies often switch from OpenAI GPT APIs to open-source models to reduce costs, but training custom models from scratch still takes $500K to $1M and weeks of time. Without the right infrastructure, companies often lose visibility into GPU spending relative to model performance and waste hundreds of thousands of dollars.
  • How can we iterate and deploy faster at scale on private cloud or on-prem?
    • Deploying multi-billion-parameter LLMs on the cloud poses an entirely different set of Ops challenges from testing them on Hugging Face Spaces or personal laptops. Companies often lack the resources or people to handle this engineering overhead, which keeps their LLMs from moving beyond experimentation.

In this post, we share how VESSL AI solves these problems through a series of 3-minute guides on

  • Launching a simple playground for Llama2.c on your cloud with a single command
  • Fine-tuning Vicuna on Llama 2 at scale using custom datasets and a single YAML file

Along the way, we also discuss the critical components that make up a Llama-scale AI infrastructure.

Llama2.c playground in a single command

With VESSL AI, you can run pre-built open-source LLMs on any cloud with a single command. Take a look at the following example of LLM as Chatbot↗.

vessl run create -f llm-chatbot.yaml
# llm-chatbot.yaml
name: LLM-As-Chatbot
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
imports:
  /root/examples: git://github.com/deep-diver/LLM-As-Chatbot
run:
  - workdir: /root/examples
    command: |
      pip install -r requirements.txt
      LLMCHAT_APP_MODE=GRADIO python entry_point.py
ports:
  - 6006

For both cost and security reasons, companies often want to try out model demos like those built on Hugging Face Spaces or public Streamlit apps, but on their own infrastructure. We created a single command that brings in your own GPU infrastructure and sets up Kubernetes-backed private clouds or on-premise clusters in minutes.

vessl cluster create [CLUSTER_NAME]

After setting up the cluster, you only need to change the resources section of the YAML file to point to it.

# llm-chatbot.yaml
name: LLM-As-Chatbot
resources:
  cluster: [CLUSTER_NAME]
  preset: v100-1.mem-52

...

VESSL AI supports serverless inference and vLLM↗, making test-phase LLMs affordable even for teams with limited GPU resources.

  • Serverless inference: Reduce GPU costs automatically during idle periods.
  • vLLM: Improve throughput, measured in requests served per minute, by up to 24x.

VESSL AI also handles the runtime configuration and containerization of the workload, saving developers from wasting hours on CUDA and Python dependencies.

Fine-tuning Vicuna on Llama 2 at scale

Fine-tuning LLMs to production grade requires not only custom datasets but also a highly scalable and fault-tolerant infrastructure. With VESSL AI, you can launch a full-scale LLM fine-tuning workload on any cloud, at any scale, without worrying about the underlying system backends. Take a look at this example YAML file for Vicuna on Llama 2↗.

vessl run create -f vicuna-llama2.yaml
# vicuna-llama2.yaml
name: vicuna-llama2
resources:
  requests:
    nvidia.com/gpu:
      device_type: A100-80GB
      quantity: "8"
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
imports:
  /datasets: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  /root/fastchat/: git://github.com/lm-sys/FastChat
run:
  - workdir: /root/fastchat/data/
    command: |
      pip install git+https://github.com/lm-sys/FastChat.git@cfc73bf3e13c22ded81e89675e0d7b228cf4b342
      pip install -U xformers
      python hardcoded_questions.py
      python -m fastchat.data.merge --in /datasets/sharegpt.json hardcoded.json --out /root/data.json
  # The training step below reads PER_DEVICE_BATCH_SIZE, SEQ_LEN, NUM_NODES, and
  # SKYPILOT_NUM_GPUS_PER_NODE from the environment; they must be set beforehand.
  - workdir: /root/fastchat/train/
    command: |
      python train_xformers.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --data_path /root/data.json \
        --num_train_epochs 3 \
        --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
        --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
        --gradient_accumulation_steps $((128 * 512 / $SEQ_LEN / $PER_DEVICE_BATCH_SIZE / $NUM_NODES / $SKYPILOT_NUM_GPUS_PER_NODE)) \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 600 \
        --save_total_limit 10 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length ${SEQ_LEN} \
        --gradient_checkpointing True \
        --lazy_preprocess True
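
The --gradient_accumulation_steps expression in the training command is easier to follow with concrete numbers. The Python sketch below mirrors the shell arithmetic, which truncates at every division. Reading 128 * 512 as a fixed token budget per optimizer step is our interpretation, and the example values (sequence length 2048, per-device batch size 4, one node with 8 GPUs) are illustrative assumptions rather than values taken from the YAML file.

```python
def gradient_accumulation_steps(seq_len, per_device_batch_size, num_nodes, gpus_per_node):
    """Mirror $((128 * 512 / SEQ_LEN / PER_DEVICE_BATCH_SIZE / NUM_NODES / GPUS_PER_NODE)).

    Shell arithmetic is integer-only and truncates after each division,
    so floor division is applied step by step in the same order.
    """
    steps = 128 * 512            # assumed token budget per optimizer step
    steps //= seq_len            # sequences that fit in the budget
    steps //= per_device_batch_size
    steps //= num_nodes
    steps //= gpus_per_node
    return steps


if __name__ == "__main__":
    # Illustrative values only: 2048-token sequences, batch size 4, 1 node x 8 GPUs.
    print(gradient_accumulation_steps(2048, 4, 1, 8))
    # Doubling the node count drives the result to 0, which is not a usable setting.
    print(gradient_accumulation_steps(2048, 4, 2, 8))
```

A result of 0 means the GPUs already process more tokens per step than the budget allows, so the batch size or sequence length needs to come down before scaling out further.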

Here are the common toolkits that companies need as they progress toward production-grade LLMs.

Faster time to deployment

  • Distributed training: Companies often use at least 16 A100 or H100 GPUs to fine-tune an LLM. However, distributed training is not just a matter of plugging in the number of nodes. VESSL AI comes with native support for PyTorch DistributedDataParallel (DDP) and reduces the setup for multi-cluster, multi-node distributed training to the following snippet.
num_nodes: 8
  • Autoscaling: As GPUs are released from other tasks, you can dedicate them to fine-tuning workloads. To enable this on VESSL AI, add the following to your existing fine-tuning YAML.
autoscaling:
  min: 2
  max: 8
  metric: GPU
  target: 80
  • Dataset mount: Fine-tuning LLMs involves dynamically mounting different cloud and local storage. VESSL AI provides out-of-the-box integration with NFS and all major cloud storage services. For those who rely on local storage, we automatically cache datasets and Docker images, improving instance startup and setup time by up to 8x. You can also pull datasets from Hugging Face to quickly experiment with different datasets.
imports:
  /my-dataset1: vessl-dataset://myOrg/myDataset
# /my-dataset2: s3://myOrg/myDataset
# /my-dataset3: nfs://local/myDataset
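
To make the autoscaling settings above (min 2, max 8, GPU target 80) more concrete, here is a minimal sketch of a target-tracking rule. It assumes the same formula that the Kubernetes Horizontal Pod Autoscaler uses, scaling the current count by the ratio of observed to target utilization and clamping to the configured bounds. How VESSL AI actually computes the new size is not described in this post, so treat the function below as an illustration, not as the platform's implementation.

```python
import math


def desired_replicas(current, gpu_utilization, target=80, min_replicas=2, max_replicas=8):
    """Target-tracking rule in the style of the Kubernetes HPA (an assumption).

    current:          number of replicas running now
    gpu_utilization:  observed average GPU utilization in percent
    target:           utilization the autoscaler tries to hold, in percent
    """
    raw = math.ceil(current * gpu_utilization / target)
    # Never go below the configured minimum or above the maximum.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, 95))   # busy: scale out
    print(desired_replicas(4, 40))   # idle: scale in
    print(desired_replicas(8, 100))  # already at the ceiling
```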

Cost savings of up to 80%

  • Multi-cloud: GPU shortages and costs are forcing companies to explore multiple clouds, including relatively new GPU cloud providers such as Lambda Labs, CoreWeave, and RunPod, in addition to AWS, GCP, and Azure. You can integrate on-prem clusters and clouds from multiple regions in minutes with the vessl cluster create command and source GPUs automatically based on availability and price.
resources:
  cluster: aws-apne
# cluster: gcp us-central
# cluster: local-a100
  preset: v1.v100-1.mem-52
  • Spot instances: Spot instances can help you save up to 70% in cloud spending. Spot instances on VESSL AI work with model checkpointing and export volumes, safely saving and resuming the progress of interrupted workloads.
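
Sourcing GPUs "based on availability and price" can be pictured as a simple selection over the registered clusters. The sketch below is hypothetical: the ClusterOffer type, the pick_cluster function, and every price and GPU count are made up for illustration, and only the cluster names are borrowed from the snippet above. It filters out clusters that cannot satisfy the request and then takes the cheapest of the rest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ClusterOffer:
    """Hypothetical view of one cluster's current capacity and price."""
    cluster: str
    gpus_available: int
    usd_per_gpu_hour: float  # placeholder numbers, not real quotes


def pick_cluster(offers: List[ClusterOffer], gpus_needed: int) -> Optional[ClusterOffer]:
    """Return the cheapest cluster that has enough free GPUs, or None."""
    candidates = [o for o in offers if o.gpus_available >= gpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.usd_per_gpu_hour)


if __name__ == "__main__":
    offers = [
        ClusterOffer("aws-apne", gpus_available=8, usd_per_gpu_hour=4.0),
        ClusterOffer("gcp us-central", gpus_available=2, usd_per_gpu_hour=3.0),
        ClusterOffer("local-a100", gpus_available=4, usd_per_gpu_hour=1.0),
    ]
    print(pick_cluster(offers, gpus_needed=4))   # local-a100 is cheapest with 4 free
    print(pick_cluster(offers, gpus_needed=8))   # only aws-apne has 8 free
    print(pick_cluster(offers, gpus_needed=16))  # nothing fits
```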

Reliability & fault-tolerance

  • GPU failovers: Fine-tuning LLMs takes days, if not weeks. VESSL AI can autonomously detect GPU failures, attempt to recover (self-heal↗) failed containers, and automatically reassign workloads to other GPUs.
  • Model checkpointing: Coupled with GPU failovers and spot instances, VESSL AI stores .pt files to mounted volumes or the model registry so that fine-tuning progress is checkpointed seamlessly, and proactively distributes checkpoints across clouds.
export:
  /output: vessl-model://myOrg/llama2
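
Checkpointing only helps with failovers and spot interruptions if the training script can resume from the last saved state. The following is a minimal sketch of that pattern, not VESSL AI's implementation. It uses only the standard library and a JSON file so that it runs anywhere, while a real fine-tuning job would typically call torch.save and torch.load on .pt files written under the exported /output path. The train function, its interrupt_at parameter, and the running total that stands in for model state are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every, interrupt_at=None):
    """Toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if interrupt_at is not None and step == interrupt_at:
            raise InterruptedError(f"simulated preemption at step {step}")
        state["step"] = step
        state["total"] += step  # stand-in for a real parameter update
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        try:
            train(ckpt, total_steps=10, save_every=3, interrupt_at=8)
        except InterruptedError as err:
            print(err, "- last checkpoint:", load_checkpoint(ckpt))
        # Restarting picks up from the last saved step and redoes only the lost work.
        print(train(ckpt, total_steps=10, save_every=3))
```

The interrupted run loses only the step taken after the last save, and the restarted run ends in the same state as an uninterrupted one.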

MLOps for LLMs

VESSL AI is committed to providing highly scalable AI infrastructure for putting LLMs in production faster at scale. With VESSL AI, this comes at up to 80% lower cost with zero engineering overhead. VESSL AI offloads the software and Ops challenges of LLM infrastructure, helping teams like ScatterLab↗ and Wrtn Technologies focus on the models and services themselves. ScatterLab, for example, uses VESSL AI to train, fine-tune, and deploy↗ Asia’s most advanced custom LLMs with support for emotional conversation, multi-modality, and proactive conversations.

For those who are progressing toward production-grade LLMs and want to check off the critical components of Llama-scale cloud infrastructure, contact us at growth@vessl.ai to learn more about VESSL AI.

Yong Hee


Growth Manager


© 2025 VESSL AI, Inc. All rights reserved.