Established in 2019, the Graduate School of AI at KAIST is the latest addition to the university’s efforts to advance research in AI. As part of its mission to foster top-tier research, the school invested heavily in high-performance computing systems from the very beginning. By 2021, it had accrued over 600 GPUs — a computing capacity that exceeds that of many campus-wide clusters worldwide and is on par with some of our largest enterprise customers.
Today, VESSL AI provides dynamic provisioning and scheduling of these GPU clusters across 20 research teams at KAIST AI. Over 300 graduate students use VESSL as an access point to launch GPU-accelerated Jupyter notebooks and execute parallel experiment runs. For the small team of technical administrators, VESSL Cluster serves as a real-time monitoring tool for node usage and status.
Bare-metal machines are only one part of setting up modern ML infrastructure. Allocating multiple GPUs across hundreds of users requires a highly scalable cluster management and job scheduling system. For teams with the engineering talent and resources, the go-to solutions have typically been Slurm Workload Manager↗ and Kubernetes Scheduler↗ — for example, the Stanford Research Computing Center team uses Slurm to manage its XStream↗ cluster.
However, these solutions come with steep learning curves, requiring months of complex setup and a dedicated systems engineering team. The challenge becomes even greater once you take into consideration the topics that are unique to machine learning↗ — such as mounting dataset volumes, saving training checkpoints, and caching gigabytes of features — especially for a small IT team that needs to operationalize the HPCs in just a few weeks.
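To give a sense of what that manual setup looks like, this is the kind of batch script a Slurm-managed cluster typically requires of every user — note that the job name, module names, and data paths below are placeholders, and the exact directives vary by site:

```shell
#!/bin/bash
#SBATCH --job-name=train-model       # placeholder job name
#SBATCH --gres=gpu:2                 # request 2 GPUs on one node
#SBATCH --cpus-per-task=8            # CPU cores for the task
#SBATCH --time=12:00:00              # hard wall-clock limit
#SBATCH --output=logs/%j.out         # %j expands to the Slurm job ID

# Users must also manage environments and data paths themselves;
# "module load" names are site-specific and differ per cluster.
module load cuda
python train.py --data /shared/datasets/example
```

Every researcher has to learn these directives — and the cluster's local conventions — before running a single experiment.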
Before VESSL, the team at KAIST AI briefly used Kubernetes to build a FIFO scheduler. Very quickly, however, they recognized the need for a smarter, more efficient cluster manager that is intuitive for both the admin team and the students.
With the existing primitive scheduler, many students had trouble just accessing the clusters due to the sheer complexity of Kubernetes, and those who could often held on to GPUs for weeks. Without proper quota policies and workload monitoring, these anti-patterns went unnoticed.
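To see why a plain FIFO scheduler invites these problems, here is a toy sketch of one (our own illustration, not KAIST's or VESSL's code): jobs receive GPUs strictly in arrival order and keep them until they finish, so one large job at the head of the queue blocks everything behind it, and nothing limits how long a job may hold its GPUs.

```python
from collections import deque

class FifoGpuScheduler:
    """Toy FIFO scheduler: jobs get GPUs in arrival order and keep
    them until they finish -- no quotas, no time limits, no preemption."""

    def __init__(self, total_gpus):
        self.free = total_gpus
        self.queue = deque()   # pending (job_id, gpus) in arrival order
        self.running = {}      # job_id -> GPUs currently held

    def submit(self, job_id, gpus):
        self.queue.append((job_id, gpus))
        self._dispatch()

    def finish(self, job_id):
        self.free += self.running.pop(job_id)
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO: if the head job does not fit, everything behind
        # it waits -- even small jobs that would fit right now.
        while self.queue and self.queue[0][1] <= self.free:
            job_id, gpus = self.queue.popleft()
            self.free -= gpus
            self.running[job_id] = gpus
```

With 4 GPUs, a running 2-GPU job, and a 4-GPU job at the head of the queue, a later 1-GPU job sits idle even though 2 GPUs are free — the head-of-line blocking and GPU hoarding the KAIST team observed.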
Beyond resource provisioning, working with VESSL gave KAIST AI greater flexibility and the ability to manage ML infrastructure as a whole, from mounting datasets to distributed training. This began with providing admin rights to the IT team through VESSL — helping them set up quota limits on monthly GPU hours and the number of concurrent jobs. The team also gained real-time visibility into cluster usage down to each GPU through VESSL Cluster — which has been crucial to their long-term GPU budgeting and plans to expand their HPC.
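To make the two quota dimensions concrete, an admission check in the spirit of these policies could be sketched as follows — a toy illustration with names of our own choosing, not VESSL's actual API:

```python
from dataclasses import dataclass

@dataclass
class Quota:
    monthly_gpu_hours: float   # per-user budget of GPU-hours per month
    max_concurrent_jobs: int   # per-user cap on simultaneous jobs

def can_launch(quota, used_gpu_hours, running_jobs,
               requested_gpus, est_hours):
    """Admit a job only if it stays within both quota limits."""
    if running_jobs >= quota.max_concurrent_jobs:
        return False
    projected = used_gpu_hours + requested_gpus * est_hours
    return projected <= quota.monthly_gpu_hours
```

A check like this rejects a job as soon as either the concurrency cap or the projected monthly GPU-hour budget would be exceeded, which is what keeps the week-long GPU hoarding from recurring.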
Students, on the other hand, are now able to focus on their actual research rather than resource management and environment configuration. Accessing the HPC is easier than ever with VESSL’s web interface and intuitive CLI. Every student is assigned enough GPUs to get started, which are not only automatically scheduled but also scale out elastically based on usage across the HPC.
VESSL Dataset eliminates the need to download large datasets every time students launch new workloads by mounting NFS volumes directly to their working environments. Finally, VESSL Experiment helps them work more resource-efficiently by guiding them to run their models as training jobs across multiple GPUs instead of running commands on persistent Jupyter notebooks.
As KAIST AI continues to add petaflops, our team at VESSL is working closely with the school to provide a more resilient and fault-tolerant solution. The school’s research teams are integrating VESSL more deeply into their workflows — logging everything from experiment parameters to node numbers for reproducible experiments, and connecting GitHub and cloud storage in addition to the school’s HPCs.
For VESSL AI, the success story of KAIST AI has helped us reach more research teams at leading institutions like MIT and Seoul National University, and we are grateful to be a small part of the cutting-edge academic machine learning research happening around the world.
Yong Hee, Growth at VESSL AI