22 June 2023

Run CVPR 2023 highlights with VESSL Run

Run CVPR 2023 highlight models and papers with a single YAML file

If you have tried to run GitHub or Colab code from top AI/ML conferences like NeurIPS, CVPR, ICML, and ICCV, you have probably found that most of it doesn’t work out of the box. You can spend hours just configuring CUDA and Python dependencies.

We created VESSL Run to help ML researchers and data scientists explore the latest models through a unified YAML interface. With the release of VESSL Run, we are sharing YAML files for several highlight papers and models from CVPR 2023. These files let models like DreamBooth by Google Research, ImageBind by Meta AI, and VisProg by Allen AI run reliably on your laptop or on any cloud.

You can run these models with our vessl run command by referencing the YAML file. Explore more models from CVPR 2023 in our model gallery.

pip install --upgrade vessl
vessl run -f dreambooth.yaml

DreamBooth by Google Research

DreamBooth presents a method to personalize text-to-image diffusion models by fine-tuning them on a small set of subject images. By binding a unique identifier to the subject and leveraging the model’s semantic prior, the fine-tuned model can generate highly realistic images of the subject in different contexts. It improves on earlier approaches in tasks like subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering, while preserving the subject’s key features.

name: dreamboothstablediffusion
image:
resources:
  accelerators: A100:1
volumes:
  /root/examples: git://
  /output:
    artifact: true
run:
  - workdir: /root/examples/Dreambooth-Stable-Diffusion
    command: |
      conda env create -f environment.yaml
      source activate ldm
      pip install Omegaconf
      pip install pytorch-lightning
      mkdir data/
      wget -P data/
      wget
      python --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "generate_pikachu" --no-test --gpus "0," --data_root ./dataset --reg_data_root ./reg --class_word "{$class_word}"
      rm -rf ./logs/*.ipynb_checkpoints
      python scripts/ --ddim_eta 0.0 --n_samples 2 --n_iter 4 --scale 10.0 --ddim_steps 100 --ckpt ./logs/*/checkpoints/last.ckpt --prompt "{$prompt}"
      cp -r ./outputs /output
env:
  class_word: "pikachu"
  prompt: "A photo of sks pikachu playing soccer."

The YAML snippet uses the Docker image “” to configure the runtime and allocates one NVIDIA A100 GPU. It specifies volumes for the GitHub repository and artifact creation. The project runs a sequence of commands, including environment setup, package installation, data download, model training, and output generation.

Under env, you can enter an example class word and a prompt.

  • class_word: The class word for the subject you are teaching the model. In this example, we’re using “pikachu”.
  • prompt: The text prompt used to generate images from the fine-tuned model. In this example, we’re using “A photo of sks pikachu playing soccer.”, where “sks” is the identifier token placed before the class word.
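To make the role of these two values more concrete, here is a minimal Python sketch of how an identifier token and a class word could be combined into prompts. The “sks” token and the class word come from the example above; the function names and the “a photo of” template are illustrative assumptions, not code from the DreamBooth repository.

```python
def instance_prompt(identifier: str, class_word: str) -> str:
    """Prompt for the specific subject: identifier token + class word."""
    return f"a photo of {identifier} {class_word}"


def class_prompt(class_word: str) -> str:
    """Prompt for the generic class, without the identifier."""
    return f"a photo of {class_word}"


def generation_prompt(identifier: str, class_word: str, context: str) -> str:
    """Place the personalized subject in a new context at sampling time."""
    return f"A photo of {identifier} {class_word} {context}."


if __name__ == "__main__":
    print(instance_prompt("sks", "pikachu"))
    print(class_prompt("pikachu"))
    print(generation_prompt("sks", "pikachu", "playing soccer"))
```

Keeping the class word in every prompt is what lets the model lean on what it already knows about the class while the identifier carries the subject-specific details.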

Segment Anything by Meta AI

Segment Anything (SA) introduces a task, model, and dataset for image segmentation, including over 1 billion masks on 11 million images. Their promptable model demonstrates impressive zero-shot performance, rivaling or surpassing prior fully supervised methods. Meta AI released the Segment Anything Model along with the dataset (SA-1B) to foster research in computer vision.

The YAML uses the Docker image “” and allocates one NVIDIA V100 on AWS. It runs a setup script located in the “/root/segment-anything/” directory. The GitHub repository “git://” is mounted as a volume. For interactive use, the run stays up for 24 hours and exposes port 8501.

name: segment-anything
resources:
  accelerators: V100:1
image:
run:
  - workdir: /root/segment-anything/
    command: |
      bash ./
volumes:
  /root/segment-anything: git://
interactive:
  runtime: 24h
  ports:
    - 8501

Thin-Plate Spline Motion Model for Image Animation

The paper introduces a new end-to-end unsupervised motion transfer framework to address the challenge of large pose gaps between source and driving images in image animation. The framework utilizes thin-plate spline motion estimation for flexible optical flow, incorporates multi-resolution occlusion masks to realistically restore missing regions, and employs additional auxiliary loss functions to ensure high-quality image generation. Experimental results demonstrate the superiority of this method over existing approaches, showing significant improvements in pose-related metrics across various objects such as talking faces, human bodies, and pixel animations.
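As a rough illustration of what a thin-plate spline warp computes, the sketch below evaluates a 2D TPS mapping as an affine part plus a weighted sum of radial basis terms around control points. It uses the textbook kernel U(r) = r² log r and made-up control points and weights; it is an assumption-laden toy, not the paper’s implementation, which predicts these parameters with a network.

```python
import math


def tps_kernel(r: float) -> float:
    """Textbook 2D thin-plate spline radial basis U(r) = r^2 * log(r), U(0) = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def tps_warp(point, control_points, weights, affine):
    """Evaluate T(p) = A @ [x, y, 1] + sum_i w_i * U(||p - c_i||).

    point: (x, y); control_points: list of (x, y);
    weights: list of (wx, wy), one per control point;
    affine: two rows of three numbers, one row per output coordinate.
    """
    x, y = point
    out = [row[0] * x + row[1] * y + row[2] for row in affine]
    for (cx, cy), (wx, wy) in zip(control_points, weights):
        u = tps_kernel(math.hypot(x - cx, y - cy))
        out[0] += wx * u
        out[1] += wy * u
    return tuple(out)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    controls = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    # Zero weights: the warp reduces to its affine part.
    print(tps_warp((0.5, 0.5), controls, [(0.0, 0.0)] * 3, identity))
    # A non-zero weight bends the mapping around its control point.
    print(tps_warp((0.5, 0.5), controls, [(0.2, 0.0), (0.0, 0.0), (0.0, 0.0)], identity))
```

The non-linear term is what gives the warp more flexibility than a purely affine motion model when the pose gap is large.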

The YAML uses the “” image with a V100 accelerator. It runs a script and mounts the code from a GitHub repo and the dataset from an S3 bucket.

name: Thin-Plate-Spline-Motion-Model
image:
resources:
  accelerators: V100:1
run:
  - workdir: /root/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt && python --config config/vox-256.yaml --device_ids 0
volumes:
  /root/thin-plate-spline-motion-model: git://
  /root/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/

MobileNeRF by Google Research

Neural Radiance Fields (NeRFs) have impressive image synthesis capabilities for 3D scenes. This paper introduces a new NeRF representation based on textured polygons that can be rendered efficiently with standard rendering pipelines. The polygons are rasterized with a z-buffer, which yields a feature vector at each pixel, and a small view-dependent MLP running in a fragment shader turns those features into the final pixel colors. This lets NeRFs be rendered with the traditional polygon rasterization pipeline, achieving interactive frame rates on a wide range of compute platforms.
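The two-stage idea, a depth test that picks one feature per pixel followed by a per-pixel shading function, can be sketched in a few lines of plain Python. Everything here is a toy stand-in: fragments are hand-written tuples, and the `shade` function is a hypothetical replacement for the learned view-dependent MLP.

```python
import math


def rasterize(fragments, width, height):
    """Keep, for every pixel, the feature of the nearest fragment (z-buffer test).

    fragments: iterable of (x, y, depth, feature) tuples.
    Returns a height x width grid of features (None where nothing was drawn).
    """
    depth = [[math.inf] * width for _ in range(height)]
    features = [[None] * width for _ in range(height)]
    for x, y, z, feat in fragments:
        if z < depth[y][x]:  # nearer than what is stored so far
            depth[y][x] = z
            features[y][x] = feat
    return features


def shade(feature, view_dir):
    """Stand-in for the small view-dependent MLP: feature + view direction -> RGB."""
    if feature is None:
        return (0.0, 0.0, 0.0)  # background
    # Hypothetical toy rule: brightness falls off as the view tilts away from +z.
    facing = max(0.0, view_dir[2])
    return tuple(channel * facing for channel in feature)


def render(fragments, width, height, view_dir):
    """Deferred rendering: rasterize features first, then shade once per pixel."""
    grid = rasterize(fragments, width, height)
    return [[shade(feat, view_dir) for feat in row] for row in grid]


if __name__ == "__main__":
    frags = [
        (0, 0, 2.0, (1.0, 0.0, 0.0)),  # far red fragment
        (0, 0, 1.0, (0.0, 1.0, 0.0)),  # nearer green fragment wins the depth test
        (1, 0, 3.0, (0.0, 0.0, 1.0)),
    ]
    for row in render(frags, width=2, height=1, view_dir=(0.0, 0.0, 1.0)):
        print(row)
```

Because shading happens once per pixel rather than once per sample along a ray, the expensive part of the work stays proportional to the image size.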

The YAML involves tasks like unzipping a dataset, cloning a GitHub repository, installing dependencies, and executing a Python script. The dataset is sourced from an S3 bucket.

ImageBind by Meta AI

ImageBind learns a joint embedding across diverse modalities such as images, text, audio, depth, thermal, and IMU data. By using image-paired data, ImageBind binds these modalities together and extends the zero-shot capabilities of large-scale vision-language models to them. It enables applications including cross-modal retrieval, arithmetic composition of embeddings, detection, and generation, and achieves state-of-the-art performance on emergent zero-shot and few-shot recognition tasks. It also serves as an evaluation framework for vision models across visual and non-visual domains.
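Once every modality lands in the same embedding space, cross-modal retrieval and arithmetic composition reduce to vector operations. The sketch below shows this with made-up three-dimensional embeddings and cosine similarity; the vectors, names, and helper functions are illustrative assumptions and do not use the ImageBind library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))


def add(a, b):
    """Element-wise sum, used for embedding arithmetic across modalities."""
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    # Made-up embeddings; a real model would compute these from raw inputs.
    images = {
        "dog_photo": [0.9, 0.1, 0.0],
        "beach_photo": [0.0, 0.1, 0.9],
        "dog_on_beach_photo": [0.6, 0.1, 0.6],
    }
    bark_audio = [0.8, 0.2, 0.1]
    waves_audio = [0.1, 0.2, 0.8]

    # Audio-to-image retrieval.
    print(retrieve(bark_audio, images))
    # Composing two audio embeddings retrieves an image containing both concepts.
    print(retrieve(add(bark_audio, waves_audio), images))
```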

The YAML utilizes the “” image with an A100 accelerator. It involves creating an environment, installing dependencies, and running a Streamlit demo. The code and resources are sourced from the “treasuraid/ImageBind” repository, and the project is set to run interactively for 24 hours on port 8501.

To run this YAML, you need an A100 GPU. You can bring your own GPU clusters using our vessl cluster create command. Refer to our documentation to get started.

name: ImageBind
image:
resources:
  accelerators: A100:1
run:
  - command: |
      cd ImageBind
      conda create --name imagebind python=3.8 -y
      source activate imagebind
      pip install numpy
      pip install vtk==9.0.1
      pip install mayavi
      pip install -r requirements.txt
      conda install -c conda-forge cartopy -y
      streamlit run
volumes:
  /root/ImageBind: git://
interactive:
  runtime: 24h
  ports:
    - 8501

Visual Programming by Allen AI

VisProg is an innovative neuro-symbolic approach that utilizes natural language instructions to tackle complex visual tasks. By generating modular programs and employing computer vision models and image processing routines, VisProg offers flexible solutions for tasks like visual question answering and language-guided image editing. This approach broadens the capabilities of AI systems, allowing them to cater to diverse user needs and effectively handle a wide range of complex tasks.
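The core mechanism, a program made of named module calls that an interpreter executes step by step, can be sketched with a toy example. Here an “image” is just a list of object labels, and the `Seg`, `Select`, and `Replace` modules are simplified stand-ins written for illustration; the step format and function signatures are assumptions, not VisProg’s actual API.

```python
def seg(image):
    """Toy segmentation: every object in the toy 'image' is one segment."""
    return list(image)


def select(objects, query):
    """Toy selection: keep the segments whose label matches the query."""
    return [obj for obj in objects if obj == query]


def replace(image, targets, prompt):
    """Toy editing: swap every selected object for the prompted content."""
    return [prompt if obj in targets else obj for obj in image]


MODULES = {"Seg": seg, "Select": select, "Replace": replace}


def execute(program, inputs):
    """Run each step in order, storing its output under the step's variable name."""
    state = dict(inputs)
    for output, module, args in program:
        # String arguments that name an existing variable are looked up in the state.
        resolved = {
            key: state[value] if isinstance(value, str) and value in state else value
            for key, value in args.items()
        }
        state[output] = MODULES[module](**resolved)
    return state


if __name__ == "__main__":
    program = [
        ("OBJ0", "Seg", {"image": "IMAGE"}),
        ("OBJ1", "Select", {"objects": "OBJ0", "query": "man in black henley"}),
        ("RESULT", "Replace", {"image": "IMAGE", "targets": "OBJ1", "prompt": "brick wall"}),
    ]
    image = ["man in black henley", "woman in red dress", "street"]
    print(execute(program, {"IMAGE": image})["RESULT"])
```

In the real system a language model writes the program from the instruction and each module wraps a vision model or image-processing routine; the interpreter loop is the part this sketch tries to capture.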

For this YAML, you need to enter your OpenAI API key under env.

name: visprog
image:
resources:
  accelerators: V100:1
run:
  - workdir: /root
    command: |
      echo $OPENAI_API_KEY
      git clone
      cd visprog
      conda env create -f environment.yaml
      source activate visprog
      pip install vessl opencv-python-headless
      cd script
      python
env:
  OPENAI_API_KEY: "your openai api key"

Input query: “Replace man in black henley (person) with brick wall” (top: original, bottom: after the query)

Top-Down Visual Attention from Analysis by Synthesis

Current attention algorithms, such as self-attention, highlight all salient objects in an image regardless of the task at hand. In contrast, humans use task-guided top-down attention to focus on task-related objects. This paper introduces AbSViT, a top-down modulated ViT model that approximates Analysis by Synthesis (AbS) and enables controllable top-down attention. AbSViT improves performance on vision-language tasks and serves as a versatile backbone for classification, semantic segmentation, and model robustness.
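One simple way to picture task-guided attention is to reweight image tokens by how similar they are to a task prior before any further processing. The sketch below does exactly that with toy two-dimensional tokens; the clamped cosine weighting and all names are illustrative assumptions rather than the AbSViT implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_down_weights(tokens, prior):
    """Weight each token by its similarity to the task prior, clamped to [0, 1]."""
    return [min(1.0, max(0.0, cosine(tok, prior))) for tok in tokens]


def modulate(tokens, prior):
    """Scale token features so task-relevant tokens dominate later attention."""
    weights = top_down_weights(tokens, prior)
    return [[w * x for x in tok] for w, tok in zip(weights, tokens)]


if __name__ == "__main__":
    # Two equally salient toy tokens: a "cat" patch and a "dog" patch.
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    cat_prior = [1.0, 0.0]  # hypothetical task embedding, e.g. from the word "cat"
    print(top_down_weights(tokens, cat_prior))
    print(modulate(tokens, cat_prior))
```

Swapping the prior for a “dog” vector would flip the weights, which is the sense in which the attention is controllable.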

The YAML utilizes the “” image and runs with a V100 accelerator. It involves installing requirements and a library dependency. The project’s code and resources are fetched from the “bfshi/AbSViT” repository. During runtime, it operates interactively for 24 hours on port 8501.

name: AbSViT
image:
resources:
  accelerators: V100:1
run:
  - workdir: /root/AbSViT
    command: |
      pip install -r requirements.txt
      apt-get install -y libmagickwand-dev
volumes:
  /root/AbSViT: git://
interactive:
  runtime: 24h
  ports:
    - 8501


VESSL AI will be at CVPR 2023 all week to host the official student social event, share our latest product updates, and showcase demos! Our team will also be at booth 📍1527, so stop by to see more of our latest work!

SungHyun Moon, ML Lead

David Oh, ML Engineer Intern

Yong Hee Lee, Growth Manager
