
06 March 2026

Cluster Storage: Your Team's Shared Hard Drive in the Cloud

Cluster Storage gives teams a shared volume that lives next to the GPUs, so workspaces mount the same data, avoid duplicate copies, and keep projects moving after shutdowns.


Stop copying datasets. Stop losing work. Start collaborating.

If you've ever trained a model on VESSL, you've probably had this moment: you spin up a workspace, download a 200GB dataset, train for hours, then terminate the workspace and realize the data is gone. Or maybe your teammate needs the same dataset, so they download it again into their own workspace. That's two copies of 200GB, burning through storage costs for no reason.

We built Cluster Storage to make this pain disappear.

What is Cluster Storage?

Think of it as a shared NAS drive that lives right next to your GPUs. It's a persistent, high-performance storage pool attached to a Kubernetes cluster that any workspace in your organization can mount simultaneously.

The keyword here is simultaneously. Unlike the old Workspace volume (which was locked to a single workspace), Cluster storage uses Read-Write-Many (RWX) semantics. Multiple workspaces can read from and write to the same storage at the same time, just like a shared network drive in an office.
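RWX sharing means multiple workspaces can write to the same files, so writers still need file-level coordination. Here's a minimal sketch of one way to do that with an advisory POSIX lock; the mount path `/mnt/cluster-storage` is a hypothetical example, not a fixed platform path.

```python
import fcntl
import os

def append_line(path: str, line: str) -> None:
    """Append a line to a file on a shared RWX mount, serializing
    concurrent writers with an advisory POSIX lock."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the lock
        try:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())        # push the write out of the page cache
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Hypothetical shared path; any workspace mounting the volume sees the same file.
# append_line("/mnt/cluster-storage/experiments/run.log", "epoch=3 loss=0.41")
```

Advisory locks only work if every writer uses them, but for append-style logs and manifests that convention is usually enough.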

Before: The Old Way


Pain points:

  • Workspace volume was Read-Write-Once (RWO) — only one workspace at a time could use it
  • Terminate a workspace? Your data vanishes
  • Shared volume (S3-backed) works across clusters, but it's slow for training workloads
  • Teams end up with duplicate copies of the same datasets

After: Cluster storage


What changed:

  1. Read-Write-Many (RWX) — multiple workspaces can mount the same volume at once
  2. Data persists even when all workspaces are terminated
  3. Fast throughput (~130 MB/s on EBS + CephFS), great for training workloads
  4. Team-shared storage scoped to organization, not individual workspace

Why "Cluster" Storage?

The name tells you exactly where it lives: on the cluster. Cluster storage is physically co-located with your compute nodes, which is why it's fast. This is a deliberate design choice.

When your workspace reads a dataset from Cluster storage, the data travels over the cluster's internal network — not across the internet. This gives you near-local-disk throughput (~150 MB/s) while still being shared and persistent.
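If you want to sanity-check the throughput you're actually getting from a mount, a rough measurement is easy to script. This is a generic sketch, not a platform tool; the path is whatever mount point you want to probe, and small sizes plus the page cache can inflate the number.

```python
import os
import time

def write_throughput_mb_s(path: str, size_mb: int = 256) -> float:
    """Write size_mb of zeros to path and return the observed MB/s.
    A rough check only; use a size well above RAM for honest numbers."""
    chunk = b"\0" * (1 << 20)          # 1 MiB buffer
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())           # force the data to actually hit storage
    elapsed = time.perf_counter() - start
    os.remove(path)                    # clean up the probe file
    return size_mb / elapsed
```

Run it once against the cluster mount and once against local scratch to see the difference in practice.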

The trade-off? It's bound to one cluster. If you need to share data across clusters in different regions, that's what Object storage (S3-backed) is for. Think of it as choosing between the fast local drive and the cloud backup — sometimes you need both.

Reliable by Design: Distributed Storage

Under the hood, Cluster storage runs on CephFS, a production-proven distributed filesystem. Your data isn't sitting on a single disk hoping nothing goes wrong:

  • Metadata: Replicated 3 times (replicas=3) across different nodes
  • Data: Protected by erasure coding (4 data chunks + 2 coding chunks), meaning the system can lose up to 2 storage nodes and still recover every byte
  • Metadata Server: Active-standby failover for continuous availability
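To see why 4+2 erasure coding is attractive, compare its storage overhead with plain replication. A quick sketch of the arithmetic, using the chunk counts from the layout above:

```python
def storage_overhead(data_chunks: int, coding_chunks: int) -> float:
    """Raw bytes stored per logical byte for an erasure-coded layout."""
    return (data_chunks + coding_chunks) / data_chunks

# CephFS data pool above: 4 data chunks + 2 coding chunks.
ec_overhead = storage_overhead(4, 2)        # 1.5x raw storage, survives 2 lost chunks
replica_overhead = storage_overhead(1, 2)   # 3-way replication: 3.0x for the same tolerance
```

Same failure tolerance, half the raw capacity, which is why erasure coding is the standard choice for bulk data while the (much smaller) metadata pool uses straight replication for speed.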

This is enterprise-grade storage — the same technology that powers some of the largest storage clusters in the world — packaged into a simple "create storage, mount it, done" experience.

Storage Tiering: Warm and Cold

Not all data is created equal. A training dataset you're actively iterating on has very different requirements from last month's checkpoint logs. That's why VESSL Cloud offers two tiers of persistent storage:

|  | Warm Tier | Cold Tier |
| --- | --- | --- |
| What | Cluster storage | Object storage (S3) |
| Backend | CephFS on NVMe | Object storage |
| Speed | ~200 MB/s (fast) | ~100 MB/s |
| Persistence | Survives workspace termination | Survives workspace termination |
| Scope | Within a cluster | Across all clusters |
| Best for | Active datasets, code, models, virtualenvs | Checkpoints, logs, archived artifacts |
| Cost | $0.20/GB/month | Lower (check in-app pricing) |

The rule of thumb: if you're actively training with it, put it in Cluster storage. If you're archiving it or sharing across regions, use Object storage.
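That rule of thumb can be sketched as a tiny routing helper. The tier names and the "actively training" heuristic (approximated here by file age) are illustrative assumptions, not platform behavior:

```python
import os
import time

WARM_AFTER_DAYS = 14  # assumption: data untouched for two weeks goes cold

def pick_tier(path, now=None):
    """Return 'warm' (Cluster storage) for recently used files,
    'cold' (Object storage) for everything else, based on mtime."""
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400
    return "warm" if age_days < WARM_AFTER_DAYS else "cold"
```

A nightly job walking a project directory with a helper like this is one simple way to keep the warm tier lean.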

What about Temporary storage?

Every workspace still gets ephemeral scratch space for caches, temp files, and intermediate results. This is blazing fast (local NVMe) but wiped when the workspace stops. Use it for things you can regenerate — not for things you'll cry about losing.
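One way to use scratch space safely is to cache only what you can regenerate. A minimal sketch, assuming a hypothetical `SCRATCH_DIR` environment variable pointing at the ephemeral volume:

```python
import os
import pickle

SCRATCH = os.environ.get("SCRATCH_DIR", "/tmp/scratch")  # assumed ephemeral local NVMe

def cached(name, build):
    """Load a regenerable artifact from scratch, rebuilding it if the
    workspace (and its temporary storage) was recycled."""
    os.makedirs(SCRATCH, exist_ok=True)
    path = os.path.join(SCRATCH, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    artifact = build()                 # cheap to recompute, safe to lose
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact
```

If the `build` function is deterministic, losing the scratch volume costs you nothing but recomputation time.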

What Changed from Workspace Volume?

|  | Legacy Workspace volume | Cluster storage |
| --- | --- | --- |
| Data on terminate | Lost | Preserved |
| Sharing | Single workspace only (RWO) | Multiple workspaces (RWX) |
| Mount path | Fixed to /root | User-configurable |
| Team collaboration | Not possible | Built-in |

If you have existing Workspace volume data, reach out to support@vessl.ai for migration assistance.

Future Work: Even Faster Storage

We're not stopping at CephFS. For workloads that demand extreme I/O — multi-node distributed training, large language model fine-tuning — we're working on bringing RDMA-level storage to the platform.

Technologies like AWS FSx for Lustre and WEKA can deliver throughput in the GB/s range (not MB/s), which is a game-changer for large-scale training. We already have the technical foundation for this, and plan to productize it in the near future.

Stay tuned.

Intae Ryoo
Product Manager

Wayne Kim
Product Marketer


© 2026 VESSL AI, Inc. All rights reserved.