The increasing complexity of machine learning models and datasets is a challenge not just for industry but also for academia. In response, top universities are investing heavily in high-performance computing systems to scale their research infrastructure. Some are building GPU servers at enterprise scale: the Stanford Research Computing Center, for example, operates more than 2,200 HPC servers with 50,000 CPU cores and 2,040 GPUs.
While universities and departments keep adding GPU resources, machine learning researchers still face challenges that are also common in industry:
Inefficiencies in infrastructure management
- Clusters operate at far less than full capacity without proper cluster monitoring and resource provisioning.
- Researchers still rely on Slack, spreadsheets, and scrappy scripts to schedule GPU workloads.
- Labs lack the sophisticated tooling and standardized practices that would let researchers access and use clusters with ease.
Outdated research workflow
- Researchers rely on time-consuming, repetitive, manual tasks to run experiments and train models.
- Models are disorganized across multiple laptops and notebooks without a centralized model store, resulting in reproducibility errors.
- Models, datasets, and experiments are not shared among researchers or authors, limiting collaboration.
VESSL for Academics
To help academic research teams tackle these challenges, we are introducing a free academic plan dedicated to faculty members and graduate students. Just as we serve ML practitioners and enterprise customers, our goal with the new plan is to empower university research teams with a scalable, modern workflow and zero maintenance overhead. Specifically, we have three objectives in mind.
1) Improve access to shared clusters and optimize resource usage
- Instant setup: researchers can access a shared cluster and receive GPU resources at the click of a button.
- Resource provisioning: resources are allocated dynamically across on-premise and cloud clusters based on the team’s quota policy.
- Cluster management: teams can monitor the usage and status of clusters and of each node.
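To make the resource-provisioning idea concrete, here is a minimal Python sketch of quota-aware allocation that fills on-premise GPUs first and spills over to cloud capacity only when needed. It is an illustration under stated assumptions, not VESSL's implementation: the `Cluster` class, the `allocate` function, and the quota numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """A pool of GPUs (hypothetical model, not a VESSL API)."""
    name: str
    free_gpus: int


def allocate(requested: int, used: int, quota: int,
             clusters: list[Cluster]) -> dict[str, int]:
    """Grant up to `requested` GPUs without exceeding the team's quota.

    Clusters are tried in list order, so listing on-premise clusters
    first means cloud capacity is used only as overflow.
    """
    # Never grant more than the quota headroom the team has left.
    remaining = min(requested, max(quota - used, 0))
    grant: dict[str, int] = {}
    for cluster in clusters:
        if remaining == 0:
            break
        take = min(remaining, cluster.free_gpus)
        if take > 0:
            grant[cluster.name] = take
            cluster.free_gpus -= take
            remaining -= take
    return grant


clusters = [Cluster("on-prem", free_gpus=4), Cluster("cloud", free_gpus=16)]
# A team with a quota of 8 GPUs, 2 already in use, asks for 10.
print(allocate(requested=10, used=2, quota=8, clusters=clusters))
# prints {'on-prem': 4, 'cloud': 2}
```

In this sketch the request is capped at the 6 GPUs of quota headroom, the 4 free on-premise GPUs are used first, and the remaining 2 come from the cloud pool.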
2) Bring a systematic approach to managing models and experiments
- Experiment dashboard: track all experiments regardless of the run environment.
- Model registry: manage publication-ready models in a central repository with all metadata and pipeline history.
- Reproducible models: reproduce models at the click of a button and breeze through the reproducibility checklist.
3) Enhance research productivity by modernizing legacy workflows
- Advanced training: use automated model tuning and distributed training to minimize training time.
- Team collaboration: share experiment results and research assets in a secure unified workspace.
- Research notes: import experiment results, media files, and charts within the platform to write and share interactive reports.
During our closed beta, we met hundreds of master's and Ph.D. students from top university research labs here in Korea, and some of these labs have already integrated VESSL into their research workflows.
Though different in scale, our very first customer, the Graduate School of AI at KAIST, has been using VESSL since the summer of 2021 to manage 100 HPC servers and 500 GPUs.
- Today, more than 300 students from 20 research labs use VESSL every day to access the campus-wide cluster, all without having to manually check availability, request access, or write scripts to configure training jobs and access GPUs.
- Some of the labs are using VESSL as a central hub for their machine learning research by integrating their own AWS clusters, GitHub repositories, and NFS servers.
Several research teams at Seoul National University and Korea University approached VESSL to deploy hybrid clusters.
- These teams were looking to extend their on-premise clusters to AWS or GCP for resource-intensive computation tasks like last-minute model optimization.
- Instead of configuring instances from scratch and manually combining disparate environments, they decided to use VESSL’s managed cluster. Combined with resource provisioning and spot instances, this let the teams cut their cloud spending by up to 80%.
- They have now streamlined their research workflow on a single, unified platform regardless of the development environment, from a researcher’s personal laptop to on-premise clusters and cloud instances.
While most of our academic users are currently based in Korea, we are looking to expand our network globally.
What we offer
Our free academic plan includes the following features and benefits:
- All core features, including a collaborative dashboard, advanced training options, model registry, and dataset integration. Refer to our release notes for more information about our latest features.
- Up to 15 seats and support for up to 5 on-premise server integrations. Contact us at email@example.com to discuss additional seats and server integrations.
- Support from the VESSL engineering team through community Slack and forums, and a chance to showcase your research team and its work to the machine learning community through blog posts.
If you are a graduate student or faculty member eligible for our free academic plan, apply by filling in the application form. Our team will schedule an online product demo and help you get started. To learn more about us, watch our guides on YouTube or check out our docs.
Yong Hee, Growth at VESSL AI