Machine Learning

04 March 2024

3 Multi-modal models & papers from NeurIPS 2023 — Try them out now at VESSL Hub

Run LLaVA, MusicGen, and MotionGPT at VESSL Hub

With the release of GPT-4, multimodal AI has become one of the biggest trends in generative AI, and it was one of the highlights we chose from NeurIPS 2023. It is also an area that leading generative AI and LLM companies, including our customer Scatter Lab, are pursuing as they move beyond text-only processing to handle additional input types such as images and sound.

“The interfaces of the world are multimodal. We want our models to see what we see and hear what we hear, and we want them to also generate content that appeals to more than one of our senses.” - Mark Chen, Head of Frontiers Research, OpenAI

We imported the original code from the authors’ GitHub repos and created a simple playground for LLaVA, MusicGen, and MotionGPT. You can try them out with a single click at VESSL Hub.

Run multi-modal models & papers from NeurIPS 2023

  • LLaVA: Large Language and Vision Assistant
  • MusicGen from Meta AI: Simple & Controllable Music Generation
  • MotionGPT: LLM-powered text-to-motion model


MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. Unlike previous approaches that chain several models to handle multiple token streams, MusicGen uses a single-stage transformer LM with an efficient token interleaving pattern, so one model can generate all the parallel streams. MusicGen can condition on a text prompt alone or on text plus a melody, in which case the generated clip “follows” the given melody.
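To make the token interleaving idea more concrete, here is a minimal sketch of one such pattern, the "delay" pattern described in the MusicGen paper, where stream k is shifted right by k steps. This is a toy illustration in plain Python and not the AudioCraft implementation: the function names, the `PAD` value, and the token values are all assumptions made for the example.

```python
PAD = -1  # hypothetical id for an empty slot, not a real MusicGen token


def apply_delay(streams, pad=PAD):
    """Shift stream k right by k steps so every column holds one slot per stream."""
    num_streams = len(streams)
    return [
        [pad] * k + list(stream) + [pad] * (num_streams - 1 - k)
        for k, stream in enumerate(streams)
    ]


def undo_delay(delayed):
    """Invert apply_delay and recover the original, time-aligned streams."""
    num_streams = len(delayed)
    length = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]


def decoding_steps(delayed):
    """Columns of the delayed grid: the slots a single model fills at each step."""
    return [list(column) for column in zip(*delayed)]


streams = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # toy values: 3 streams, 3 time steps
delayed = apply_delay(streams)
for step in decoding_steps(delayed):
    print(step)
```

With 3 streams of 3 tokens each, the delayed grid needs 5 decoding steps, compared with 9 if every token were flattened into one long sequence. That gap is roughly why a single model can handle all streams at a reasonable cost.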


LLaVA is an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. It combines the embedding of a language instruction with image features extracted by CLIP, then processes both with an LLM such as Vicuna or Llama, giving the model visual reasoning and image-based chat capabilities. With this Run, you can deploy a Streamlit demo that serves LLaVA inference: upload a photo and ask questions about the image.
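The way image features and instruction embeddings are combined can be sketched as follows. This is a toy example in plain Python, assuming a single linear projection that maps image features into the LLM's embedding dimension (the LLaVA paper describes a trainable projection for this step). The function names, dimensions, and weights here are made up for illustration and are not taken from the LLaVA codebase.

```python
def project(features, weight):
    """Map each image feature (length d_v) to the LLM embedding size (length d_m).

    weight is a d_v x d_m matrix stored as a list of rows.
    """
    return [
        [sum(f * w for f, w in zip(feature, column)) for column in zip(*weight)]
        for feature in features
    ]


def build_llm_input(image_features, text_embeddings, weight):
    """Prepend projected image features to the instruction embeddings."""
    visual_tokens = project(image_features, weight)
    return visual_tokens + text_embeddings


# Toy values: 2 image patches with d_v = 2, projected to d_m = 3.
image_features = [[1.0, 2.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
text_embeddings = [[0.5, 0.5, 0.5]]  # one instruction token, already in d_m

sequence = build_llm_input(image_features, text_embeddings, weight)
print(len(sequence), "embeddings of size", len(sequence[0]))
```

After projection, the image patches look like ordinary token embeddings to the LLM, which is what lets a text-only model such as Vicuna answer questions about an image.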


MotionGPT is a unified and versatile motion-language model that combines motion data with a large language model. It uses a motion-specific vector quantized variational autoencoder (VQ-VAE) to construct a motion vocabulary: input motion features are encoded into discrete motion tokens by the VQ-VAE, mixed with text tokens, and fed to the LLM. Building on the LLM, MotionGPT achieved state-of-the-art performance on multiple motion tasks, including text-based motion generation, motion captioning, and motion prediction.
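The quantization step can be illustrated with a small sketch. It skips the learned encoder of a real VQ-VAE and only shows the codebook lookup that turns continuous motion features into discrete token ids, followed by mixing those tokens with text tokens. The codebook values, the `<motion_...>` token names, and the function names are hypothetical and chosen only for this example.

```python
def quantize(frames, codebook):
    """Return, for each motion feature vector, the index of the nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]


def motion_tokens(indices):
    """Turn codebook indices into vocabulary items (hypothetical naming scheme)."""
    return [f"<motion_{i}>" for i in indices]


def build_prompt(text_tokens, indices):
    """Mix motion tokens into the text token sequence fed to the LLM."""
    return text_tokens + ["<motion_start>"] + motion_tokens(indices) + ["<motion_end>"]


# Toy codebook with 3 entries and 3 motion frames of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2]]

indices = quantize(frames, codebook)
print(build_prompt(["Describe", "this", "motion", ":"], indices))
```

Once motion is expressed as tokens from a fixed vocabulary, tasks such as captioning or prediction become ordinary next-token problems over a mixed text and motion sequence.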

Floyd, Product Manager

Yong Hee, Growth Manager
