22 June 2023
Run CVPR 2023 highlight models and papers with a single YAML file
If you have ever tried to run GitHub or Colab code from top AI/ML conferences like NeurIPS, CVPR, ICML, and ICCV, you quickly realize that much of it doesn't work out of the box. You have to spend hours just configuring CUDA and Python dependencies.
We created VESSL Run to help ML researchers and data scientists explore the latest models effortlessly through a unified YAML interface. With the release of VESSL Run, we are sharing YAML files for several highlight papers and models from CVPR 2023. These YAML files make models like DreamBooth by Google Research, ImageBind by Meta AI, and VisProg by Allen AI ready to run on your laptop and on any cloud.
You can run these models simply by using our vessl run command and referencing the YAML file. Explore more models from CVPR 2023 in our model gallery at https://vessl.ai/hub.↗
pip install --upgrade vessl
vessl run -f dreambooth.yaml
DreamBooth presents a method to personalize text-to-image diffusion models by fine-tuning them with a small set of subject images. By binding the subject to a unique identifier and leveraging the model's class-specific semantic prior, the fine-tuned model can generate highly realistic images of the subject in new contexts, supporting tasks like subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering while preserving the subject's key features.
name: dreamboothstablediffusion
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
volumes:
  /root/examples: git://github.com/vessl-ai/examples
  /output:
    artifact: true
run:
  - workdir: /root/examples/Dreambooth-Stable-Diffusion
    command: |
      conda env create -f environment.yaml
      source activate ldm
      pip install Omegaconf
      pip install pytorch-lightning
      mkdir data/
      wget https://github.com/prawnpdf/prawn/raw/master/data/fonts/DejaVuSans.ttf -P data/
      wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4-full-ema.ckpt
      python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "generate_pikachu" --no-test --gpus "0," --data_root ./dataset --reg_data_root ./reg --class_word "${class_word}"
      rm -rf ./logs/*.ipynb_checkpoints
      python scripts/stable_txt2img.py --ddim_eta 0.0 --n_samples 2 --n_iter 4 --scale 10.0 --ddim_steps 100 --ckpt ./logs/*/checkpoints/last.ckpt --prompt "${prompt}"
      cp -r ./outputs /output
env:
  class_word: "pikachu"
  prompt: "A photo of sks pikachu playing soccer."
The YAML snippet uses the Docker image “nvcr.io/nvidia/pytorch:22.10-py3↗” to configure the runtime and allocates one NVIDIA A100 GPU. It mounts the GitHub repository as a volume and marks /output as an artifact volume for the generated images. The run then executes a sequence of commands covering environment setup, package installation, data download, model fine-tuning, and output generation.
Under env, you can set an example class word and a prompt.
- class_word: Customizes the identifier for your subject. In this example, we are using “pikachu” as the class word.
- prompt: The text prompt used to generate sample images with the fine-tuned model. In this example, we’re using “A photo of sks pikachu playing soccer.”
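If you want to try a different subject, you can edit these values in the env block before launching the run. The values below are hypothetical placeholders; you would also need to supply matching subject images under the repository's data_root directory:

env:
  class_word: "corgi"                            # placeholder class word for your own subject
  prompt: "A photo of sks corgi on the beach."   # placeholder prompt for sampling the fine-tuned model

These values are substituted into the training and sampling commands above through ${class_word} and ${prompt}.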
Segment Anything (SA) introduces a task, model, and dataset for image segmentation, including over 1 billion masks on 11 million images. Its promptable model demonstrates impressive zero-shot performance, rivaling or surpassing prior fully supervised methods. Meta AI released the Segment Anything Model along with the dataset (SA-1B) to foster research in computer vision.
The YAML uses the Docker image “nvcr.io/nvidia/pytorch:21.05-py3↗” and allocates one NVIDIA V100 on AWS. It runs a setup script located in the “/root/segment-anything/” directory. The GitHub repository “git://github.com/vessl-ai/segment-anything↗” is mounted as a volume. For interactive usage, the run has a runtime of 24 hours and exposes port 8501.
name: segment-anything
resources:
  accelerators: V100:1
image: nvcr.io/nvidia/pytorch:21.05-py3
run:
  - workdir: /root/segment-anything/
    command: |
      bash ./setup.sh
volumes:
  /root/segment-anything: git://github.com/vessl-ai/segment-anything
interactive:
  runtime: 24h
  ports:
    - 8501
The paper introduces a new end-to-end unsupervised motion transfer framework to address the challenge of large pose gaps between source and driving images in image animation. The framework utilizes thin-plate spline motion estimation for flexible optical flow, incorporates multi-resolution occlusion masks to realistically restore missing regions, and employs additional auxiliary loss functions to ensure high-quality image generation. Experimental results demonstrate the superiority of this method over existing approaches, showing significant improvements in pose-related metrics across various objects such as talking faces, human bodies, and pixel animations.
The YAML uses the “nvcr.io/nvidia/pytorch:21.05-py3↗” image with a V100 accelerator. It runs a single script, mounting the code from a GitHub repo and the dataset from an S3 bucket.
name: Thin-Plate-Spline-Motion-Model
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt && python run.py --config config/vox-256.yaml --device_ids 0
volumes:
  /root/thin-plate-spline-motion-model: git://github.com/saeyoon17/Thin-Plate-Spline-Motion-Model
  /root/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/
Neural Radiance Fields (NeRFs) have impressive image synthesis capabilities for 3D scenes. This paper introduces a new NeRF representation based on textured polygons that can be rendered efficiently with standard graphics pipelines. A z-buffer pass assigns features to each pixel, and a view-dependent MLP running in a fragment shader converts these features into the final pixel colors. This approach lets NeRFs be rendered with the traditional polygon rasterization pipeline, achieving interactive frame rates on a wide range of compute platforms.
The YAML involves tasks like unzipping a dataset, cloning a GitHub repository, installing dependencies, and executing a Python script. The dataset is sourced from an S3 bucket.
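The snippet below is only a minimal sketch of what such a YAML looks like, following the same pattern as the other examples in this post; the repository URL, dataset path, script name, and image/accelerator choices are placeholders rather than the actual values:

name: nerf-textured-polygons        # placeholder name
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: V100:1
volumes:
  /root/dataset: s3://your-bucket/nerf-dataset/   # placeholder S3 dataset location
run:
  - workdir: /root
    command: |
      unzip /root/dataset/scenes.zip -d ./data                           # unzip the mounted dataset
      git clone https://github.com/your-org/nerf-textured-polygons.git   # placeholder repository
      cd nerf-textured-polygons
      pip install -r requirements.txt                                    # install dependencies
      python train.py --data_root ../data                                # placeholder training script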
ImageBind learns a joint embedding across diverse modalities such as images, text, audio, depth, thermal, and IMU data. By utilizing image-paired data, ImageBind effectively binds these modalities together and extends the zero-shot capabilities of large-scale vision-language models. It enables various applications, including cross-modal retrieval, arithmetic composition, detection, and generation, achieving state-of-the-art performance in emergent zero-shot and few-shot recognition tasks, while also serving as a valuable evaluation framework for vision models across visual and non-visual domains.
The YAML utilizes the “nvcr.io/nvidia/pytorch:22.10-py3↗” image with an A100 accelerator. It involves creating an environment, installing dependencies, and running a Streamlit demo. The code and resources are sourced from the “treasuraid/ImageBind↗” repository, and the project is set to run interactively for 24 hours on port 8501.
To run this YAML, you need an A100 GPU. You can bring your own GPU clusters using our vessl cluster create command. Refer to our documentation↗ to get started.
name: ImageBind
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
run:
  - command: |
      cd ImageBind
      conda create --name imagebind python=3.8 -y
      source activate imagebind
      pip install numpy
      pip install vtk==9.0.1
      pip install mayavi
      pip install -r requirements.txt
      conda install -c conda-forge cartopy -y
      streamlit run streamlit_demo.py
volumes:
  /root/ImageBind: git://github.com/treasuraid/ImageBind
interactive:
  runtime: 24h
  ports:
    - 8501
VisProg is an innovative neuro-symbolic approach that utilizes natural language instructions to tackle complex visual tasks. By generating modular programs and employing computer vision models and image processing routines, VisProg offers flexible solutions for tasks like visual question answering and language-guided image editing. This approach broadens the capabilities of AI systems, allowing them to cater to diverse user needs and effectively handle a wide range of complex tasks.
For this YAML, you need to enter your OpenAI API key under env.
name: visprog
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root
    command: |
      echo $OPENAI_API_KEY
      git clone https://github.com/treasuraid/visprog.git
      cd visprog
      conda env create -f environment.yaml
      source activate visprog
      pip install vessl opencv-python-headless
      cd script
      python image_editing.py
env:
  OPENAI_API_KEY: "your openai api key"
Input query: “Replace man in black henley (person) with brick wall” (top: original, bottom: after the query)
Current attention algorithms, such as self-attention, highlight all salient objects in an image regardless of the task at hand. In contrast, humans use task-guided top-down attention to focus on task-related objects. This paper introduces AbSViT, a top-down modulated ViT model that approximates analysis-by-synthesis (AbS) and enables controllable top-down attention. AbSViT improves performance on vision-language tasks and serves as a versatile backbone for classification, semantic segmentation, and model robustness.
The YAML utilizes the “nvcr.io/nvidia/pytorch:22.10-py3↗” image and runs with a V100 accelerator. It installs the Python requirements and a system library dependency (libmagickwand-dev). The project’s code and resources are fetched from the “bfshi/AbSViT↗” repository. During runtime, it operates interactively for 24 hours on port 8501.
name: AbSViT
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root/AbSViT
    command: |
      pip install -r requirements.txt
      apt-get install -y libmagickwand-dev
volumes:
  /root/AbSViT: git://github.com/bfshi/AbSViT.git
interactive:
  runtime: 24h
  ports:
    - 8501
VESSL AI will be at CVPR 2023 all week to host the official student social event, share our latest product updates, and showcase demos! Our team will also be at booth 📍1527, so stop by to see more of our latest work!