09 January 2023
Set up a hybrid ML infrastructure with a single-line command with VESSL Clusters
In this tutorial, we will set up a hybrid ML infrastructure using VESSL Clusters. We will also share the common challenges that ML teams face as they make the transition to hybrid cloud and the motivation behind this recent trend.
By the end of this tutorial, you will be able to use VESSL as an access point for the hybrid cloud. Scroll down to Step-by-step guide if you want to get started right away.
Compute infrastructure is the key enabler for the latest developments in machine learning, and yet it’s also the most complex and expensive hidden technical debt in ML systems. High-performance cloud instances such as Amazon EC2 P3 and Google Cloud TPU have made HPC-grade GPUs more available, but not necessarily more accessible.
Cloud computing is here to stay, but the high cost of training is one of the largest challenges that ML teams face today. Even considering the diminishing returns of on-prem machines, cloud-only teams are 60% more expensive↗ than their counterparts. In extreme cases like GPT-3, which has 175 billion parameters, it would require $4,600,000 to train↗, even with the lowest-priced GPU cloud on the market.
We’ve seen multiple customers of various sizes whose cost of cloud is putting noticeable strains on their growth and revenues:
We’ve seen a similar trend↗ happening in general software development: the pressure that cloud puts on margins starts to outweigh the benefits as a company scales. The emergence of cost-sensitive GPU clouds built for deep learning such as Lambda↗ and Jarvis Labs↗, and ML libraries for improving training efficiency like MosaicML↗, demonstrates that the field is maturing as well.
Cloud providers have argued that the computing requirements for ML are far too expensive and cumbersome↗ to stand up on your own. However, some companies are shifting their machine learning data and models back to their own machines and investing again in hybrid infrastructure↗: on-prem infrastructure combined with the cloud. The result is that adopters are spending less money and getting better performance.
Bare-metal machines are only a small part of setting up a modern ML infrastructure. Many of the challenges teams face as they make the transition to hybrid cloud are in fact in the software stack. The complex compute backends and system details abstracted away by AWS, Google Cloud, and Azure have to be configured manually. The challenge becomes even greater once you take ML-specific topics into account:
Thankfully, the applications of GPU-accelerated Docker containers↗ and Kubernetes pods↗ in machine learning, along with tools inherited from the multi-cloud paradigm like MinIO↗, are enabling the transition to hybrid infrastructure. For teams looking to leverage hybrid cloud, Kubernetes has become the go-to tool for orchestration, deployment, scaling, and management of containerized ML workloads, for reasons beyond just cost:
Despite these benefits, Kubernetes infrastructure is hard. Everything that an application typically has to take into consideration, like security, logging, redundancy, and scaling, is built into the Kubernetes fabric, making it heavy and inherently difficult to learn, especially for data scientists without an engineering background. Even for organizations that have the luxury of a dedicated DevOps or MLOps team, it may take up to six months to implement and distribute the tools and workflows down to every MLE.
VESSL Clusters makes the integration, orchestration, and management of hybrid clouds simple and straightforward. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands.
Once you have set up a hybrid infrastructure with VESSL Clusters, you can easily provision GPU resources to launch any ML workload, including notebook workspaces, job scheduling, hyperparameter optimization, and model servers.
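For example, once a cluster is connected, spinning up a Jupyter workspace on it comes down to a single CLI call. The sketch below is illustrative rather than exact; the subcommand and flags are assumptions, so run vessl workspace --help to see the interface your CLI version supports.
vessl workspace create --cluster '[CLUSTER_NAME_HERE]'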
VESSL’s cluster integration is composed of four primitives:
To connect your on-premise GPU clusters to VESSL, you should first have Docker and Helm↗ installed on your machine. On a Mac, the easiest way is to install them with Homebrew. Check out the Docker↗ and Helm↗ docs for more detailed installation instructions. Keep in mind that you have to be signed in to Docker and keep it running while using VESSL Clusters.
brew install docker
brew install helm
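You can verify that both tools are installed by checking their versions:
docker --version
helm version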
Sign up for a free account on VESSL and install the VESSL SDK with pip on your Mac. Then, set up a VESSL environment on your machine using vessl configure. This will open a new browser window asking you to grant access; proceed by clicking Grant access.
pip install vessl --upgrade
vessl configure
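To double-check that the SDK was installed, you can inspect the package with pip:
pip show vessl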
You are now ready to connect your Mac to VESSL. The following single-line command connects your Mac, which will appear as macbook-pro-16 on VESSL. Note the --mode single flag specifying that you will be installing a single-node cluster.
vessl cluster create --name '[CLUSTER_NAME_HERE]' --mode single
The command will automatically check dependencies and ask you to install Kubernetes. Proceed by entering y. This process will take a few minutes.
Once Kubernetes is installed on your machine, the command will then ask you to install the VESSL agent on the Kubernetes cluster. Enter y and proceed.
Use our CLI command to confirm the integration, then try running a training job on your laptop. Your first run may take a few minutes while the Docker images are pulled onto your device.
vessl cluster list
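As a quick smoke test, you can launch a simple experiment against the new cluster. The flags below (--cluster, --image, --command) are illustrative assumptions rather than the definitive interface; check vessl experiment create --help for the options your CLI version supports.
vessl experiment create --cluster 'macbook-pro-16' --image 'python:3.8' --command 'python -c "print(1+1)"'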
Integrating more powerful, multi-node GPU clusters for your team is as easy as repeating the steps above. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server. Note that your server must be running Ubuntu 18.04, CentOS 7.9, or a later Linux OS.
Install all the dependencies using our magic curl command. Go ahead and run the following command on your controller node. Note that if you wish to dedicate the node solely to the control plane, meaning you run no ML workloads on it and use it only for admin and visibility purposes, add the --taint-controller flag at the end of the command.
curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role=controller
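If you opted for a dedicated control plane node as described above, the same command with the --taint-controller flag appended would look like this:
curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role=controller --taint-controller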
Upon installing all the dependencies, the command returns a follow-up command with a token that you can use to add worker nodes to the control plane. Copy, paste, and run the command on each machine you want to add. If you don’t want to add an additional worker node, skip the step below.
curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role worker --token '[TOKEN_HERE]'
You can confirm your control plane and worker node configuration using a k0s command.
sudo k0s kubectl get nodes
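With one controller and one worker joined, the output will look roughly like the following. Node names, ages, and versions will differ on your machines, and whether the controller itself appears depends on how k0s was configured to run workloads:
NAME         STATUS   ROLES           AGE   VERSION
controller   Ready    control-plane   10m   v1.25.2+k0s
worker-01    Ready    <none>          2m    v1.25.2+k0s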
Once you have completed the steps above, you can now integrate the Kubernetes cluster with VESSL. Note the --mode multi flag.
pip install vessl --upgrade
vessl configure
vessl cluster create --name='[CLUSTER_NAME_HERE]' --mode=multi
You can once again use our CLI command or visit Clusters page on VESSL to confirm your integration.
vessl cluster list
This post covered the recent trends in and motivation behind hybrid ML infrastructure and walked through connecting your GPU clusters to VESSL using our magic, single-line command.
If you haven’t already, make sure to sign up for a free VESSL account so you can follow along. Also, check out how our education and enterprise customers like KAIST and Hyundai are scaling their on-premise servers and hybrid cloud up to 1,000+ Kubernetes nodes.
If you need any support or have any additional questions while using VESSL Clusters, let us know at growth@vessl.ai or on our community Slack↗.