With the release of GPT-4, Multimodal AI has become the biggest trend in Generative AI, and it was one of the highlights we picked from NeurIPS 2023↗. It's also an area that leading Gen AI & LLM companies, including our customer Scatter Lab↗, are chasing as they move beyond single-mode text processing to multiple context-based input types such as images and sound.
“The interfaces of the world are multimodal. We want our models to see what we see and hear what we hear, and we want them to also generate content that appeals to more than one of our senses.” — Mark Chen, Head of Frontiers Research, OpenAI
We imported the original code from the authors’ GitHub repos and created a simple playground for InstructBLIP↗, LLaVA↗, and AudioCraft↗. You can try them out with a single click on VESSL Hub.
Run multi-modal models & papers from NeurIPS 2023
- LLaVA↗ — Large Language and Vision Assistant →
- MusicGen↗ from Meta AI — Simple & Controllable Music Generation →
- MotionGPT↗ — LLM-powered text-to-motion model →
MusicGen
MusicGen is a text-to-music model capable of generating high-quality music samples from text descriptions or audio prompts. Unlike previous works that chain several models to handle multiple token streams, MusicGen uses a single-stage transformer LM with an efficient token interleaving pattern, which lets it generate multiple parallel streams with just one model. MusicGen can generate music not only from a text prompt but also from text and a melody together, producing a sound clip that “follows” the given melody.
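As a rough sketch of what the Run does under the hood, here is how MusicGen can be driven through the AudioCraft Python API. The checkpoint name, generation parameters, and melody file below are assumptions, and the exact function names and signatures may vary across AudioCraft versions:

```python
# Minimal sketch of MusicGen inference with the AudioCraft library.
# Checkpoint name, parameters, and file paths are assumptions; check the
# AudioCraft repo for the exact interface of the version you install.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a melody-capable checkpoint (name assumed from the AudioCraft model zoo).
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=8)  # generate 8 seconds of audio

# Text-only conditioning.
wavs = model.generate(["lo-fi hip hop beat with warm piano chords"])

# Text + melody conditioning: the output "follows" the reference melody.
melody, sr = torchaudio.load("reference_melody.wav")  # hypothetical input file
wavs_melody = model.generate_with_chroma(
    ["80s synth-pop with punchy drums"], melody[None], sr
)

# Write the results to disk with loudness normalization.
for name, wav in [("text_only", wavs[0]), ("text_plus_melody", wavs_melody[0])]:
    audio_write(name, wav.cpu(), model.sample_rate, strategy="loudness")
```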
LLaVA
LLaVA is an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. It combines a language instruction embedding with image features extracted by CLIP, then processes them with an LLM such as Vicuna or Llama, giving the model visual reasoning and image-based chat capabilities. With this Run, you can deploy a Streamlit demo space that runs LLaVA inference. You can upload your photo and ask questions about the image.
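The Run itself wraps the authors’ original repository, but to make the pipeline concrete (CLIP image features plus an instruction prompt, handed to an LLM), here is a minimal sketch using the Hugging Face transformers port of LLaVA 1.5. The checkpoint name, prompt template, and image URL are assumptions:

```python
# Minimal sketch of LLaVA inference via the Hugging Face transformers port.
# Checkpoint name, prompt template, and image URL are assumptions; the Run
# itself uses the authors' original repository.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works here; this URL is just a placeholder.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)

# Vicuna-style chat prompt with an <image> placeholder for the CLIP features.
prompt = "USER: <image>\nWhat is unusual about this photo? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```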
MotionGPT
MotionGPT is a unified and versatile motion-language model that combines motion data with a large language model. MotionGPT uses a motion-specific vector-quantized variational autoencoder (VQ-VAE) to construct a motion vocabulary: input motion features are encoded into discrete motion tokens by the VQ-VAE, and the encoded motion tokens are mixed with text tokens and fed to the LLM. Thanks to the power of the LLM, MotionGPT achieved state-of-the-art performance on multiple motion tasks, including text-based motion generation, motion captioning, and motion prediction.
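To make the motion-vocabulary idea concrete, here is a small, self-contained toy. It is not MotionGPT’s actual code: every name, shape, and codebook below is made up purely to illustrate how continuous motion features become discrete tokens that can be mixed with text.

```python
# Toy illustration (not MotionGPT's actual code): how a VQ-VAE-style codebook
# turns continuous motion features into discrete tokens shared with text.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: 512 learned motion "words", each a 256-d vector.
codebook = rng.normal(size=(512, 256))

# A motion clip as encoded by a (hypothetical) VQ-VAE encoder: 30 frames x 256 dims.
motion_features = rng.normal(size=(30, 256))

# Quantize: each frame is replaced by the index of its nearest codebook entry.
dists = np.linalg.norm(motion_features[:, None, :] - codebook[None, :, :], axis=-1)
motion_ids = dists.argmin(axis=1)

# Map indices to special tokens so they live in the same vocabulary as text.
motion_tokens = [f"<motion_id_{i}>" for i in motion_ids]

# Mix motion tokens with text tokens and feed the sequence to the language model.
prompt = "Describe this motion: " + " ".join(motion_tokens)
print(prompt[:120], "...")
```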
—
Floyd, Product Manager
Yong Hee, Growth Manager