19 June 2023
VESSL Run makes fine-tuning and scaling the latest open-source models easier than ever
Today, we are releasing VESSL Run, the easiest way to train, fine-tune, and scale open-source AI/ML models. VESSL Run folds the compute backends and system details required to run off-the-shelf models into a unified YAML interface. This means developers can start training without getting bogged down in ML-specific peripherals like cloud infrastructure, CUDA configurations, and Python dependencies.
Together with VESSL Run, we are also releasing several custom Docker images for the latest generative AI models and highlighted papers from CVPR 2023, such as DreamBooth Stable Diffusion. Explore these models at bit.ly/cvpr2023 and run them right from the terminal.
pip install --upgrade vessl
vessl run hello
We’ve seen an explosion of open-source models that are on par with, if not better than, their latest closed-source counterparts: Stable Diffusion for DALL-E 2 and LLaMA for GPT-3. Every day, we see ML enthusiasts and professionals create their own versions and applications of these models and showcase them on GitHub. These models, despite their hundreds of forks and stars, have one major problem.
Most of them don’t work.
If you have actually tried running these models, whether by following the guides on model cards or by running Colab notebooks, you probably spent hours just configuring PyTorch and CUDA. Even if you get through this step, it’s a whole other story to fine-tune and scale the model on your own datasets and cloud. The value of these models is either lost among CUDA errors or stuck in toy projects that never reach production. It’s a common story for most of us in AI today.
Our approach to solving this problem is to provide a simple, unified YAML interface that abstracts the peripherals surrounding the models. This covers CUDA configurations and Python dependencies for your first run; custom data loaders and cloud infrastructure for fine-tuning and scaling; and endpoints and automatic scaling for serving and deployment. With VESSL Run, developers can experiment with the latest open-source models on their own datasets and GPUs without going through manual setup.
The following YAML snippet, for example, is all you need to run DreamBooth on Stable Diffusion with A100 GPUs. It covers three steps:
Mount a public GitHub repo and a dataset from an S3 bucket.
Set up a training environment with our custom Docker Image.
Run a training task on an on-premise DGX cluster using A100 GPUs.
name: dreamboothstablediffusion
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
volumes:
  /root/examples: git://github.com/vessl-ai/examples
  /output:
    artifact: true
run:
  - workdir: /root/examples/Dreambooth-Stable-Diffusion
    command: |
      conda env create -f environment.yaml
      source activate ldm
      pip install Omegaconf
      pip install pytorch-lightning
      mkdir data/
      wget https://github.com/prawnpdf/prawn/raw/master/data/fonts/DejaVuSans.ttf -P data/
      wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4-full-ema.ckpt
      python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "generate_pikachu" --no-test --gpus "0," --data_root ./dataset --reg_data_root ./reg --class_word "${class_word}"
      rm -rf ./logs/*.ipynb_checkpoints
      python scripts/stable_txt2img.py --ddim_eta 0.0 --n_samples 2 --n_iter 4 --scale 10.0 --ddim_steps 100 --ckpt ./logs/*/checkpoints/last.ckpt --prompt "${prompt}"
      cp -r ./outputs /output
env:
  class_word: "pikachu"
  prompt: "A photo of sks pikachu playing soccer."
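The env entries at the end of the snippet are what parameterize the two Python commands. As a rough illustration of that substitution, here is a minimal Python sketch. It assumes the env values are exposed to the command as shell-style variables, which the snippet implies but does not state. The shortened command strings and the render helper are hypothetical, made up for this example.

```python
from string import Template

# Values copied from the env entries of the YAML snippet.
env = {
    "class_word": "pikachu",
    "prompt": "A photo of sks pikachu playing soccer.",
}

# Hypothetical, shortened command lines; the real ones carry more flags.
train_cmd = Template('python main.py --class_word "${class_word}"')
sample_cmd = Template('python scripts/stable_txt2img.py --prompt "${prompt}"')


def render(template: Template, values: dict) -> str:
    """Fill ${name} placeholders; raises KeyError if a name is missing."""
    return template.substitute(values)


rendered_train = render(train_cmd, env)
rendered_sample = render(sample_cmd, env)
print(rendered_train)
print(rendered_sample)
```

Changing the class word or the prompt then means editing only the env entries, while the command lines stay untouched.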
You can see it in action by running the following command, which points to the YAML file.
vessl run -f dreambooth.yaml
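Before launching a run, it can help to sanity-check the definition locally. The sketch below is one possible pre-flight check in plain Python. The set of required keys and the TYPE:COUNT accelerator format are inferred from the single DreamBooth example in this post, not from a documented VESSL schema, and parse_accelerator and check_run are hypothetical helper names.

```python
def parse_accelerator(spec: str) -> tuple[str, int]:
    """Split a 'TYPE:COUNT' string such as 'A100:1' into its parts."""
    gpu_type, sep, count = spec.partition(":")
    if not sep or not gpu_type or not count.isdecimal() or int(count) < 1:
        raise ValueError(f"expected TYPE:COUNT, got {spec!r}")
    return gpu_type, int(count)


def check_run(spec: dict) -> list[str]:
    """Return the problems found; an empty list means nothing obvious."""
    problems = []
    # Keys assumed to be required, based only on the example in this post.
    for key in ("name", "image", "resources", "run"):
        if key not in spec:
            problems.append(f"missing top-level key: {key}")
    try:
        parse_accelerator(spec.get("resources", {}).get("accelerators", ""))
    except ValueError as exc:
        problems.append(str(exc))
    for step in spec.get("run", []):
        if "command" not in step:
            problems.append("run step without a command")
    return problems


# The DreamBooth definition, written as the dict a YAML parser would return.
example = {
    "name": "dreamboothstablediffusion",
    "image": "nvcr.io/nvidia/pytorch:22.10-py3",
    "resources": {"accelerators": "A100:1"},
    "run": [{"workdir": "/root/examples/Dreambooth-Stable-Diffusion",
             "command": "python main.py"}],
}
print(check_run(example))
```

A check like this only catches structural slips such as a missing key or a malformed accelerator string; it says nothing about whether the image, repo, or GPU type actually exists.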
In essence, with every vessl run, you are launching a Kubernetes pod that’s configured specifically for machine learning. Our custom Docker images are dockerized versions of each machine learning GitHub repo (/Dreambooth-Stable-Diffusion, /nanoGPT, /LangChain, and more) with the right CUDA and application dependencies.
This means that you can not only launch individual training jobs but also create persistent workspaces for GPU-enabled inference tasks with the same YAML definition. You can then use tools like Streamlit to build a Lensa-like app, for example, all without worrying about the peripherals.
We prepared ready-to-run versions of the latest generative AI models and highlighted papers from CVPR 2023 on our VESSL Hub gallery. These all come with our custom Docker images and can be launched with the same vessl run command.
We also made a few resources to help you get started:
Our latest work on VESSL Run extends our effort to offer the easiest way to train and deploy production-ready ML models at scale, alongside our ML task launcher and workflow manager.
The arrival of Stable Diffusion and LLaMA showed that lots of people want to build AI-enabled tools, whether for a side project or a full-scale AI product. By creating a simple, unified interface that abstracts away the minor yet time-consuming peripherals, we hope to make the latest open-source models accessible to all builders and enable more enthusiasts to rapidly experiment with the latest developments in machine learning.