The Easiest Way to Fine‑Tune OpenAI GPT‑OSS with LoRA on VESSL

This guide walks you through fine‑tuning OpenAI GPT‑OSS models (20B, 120B) with Low‑Rank Adaptation (LoRA) on the VESSL platform, then uploading the result to the Model Registry and serving it.

What is GPT‑OSS?

GPT‑OSS is an open-weight model family released by OpenAI on August 5, 2025. It adopts the Harmony response format and MXFP4 (4-bit) quantization, which lets the 120B model run on a single H100 80 GB GPU and the 20B model run within about 16 GB of memory. Unlike proprietary models, GPT‑OSS can be freely downloaded, used, and modified under the Apache‑2.0 license.

1) Mixture‑of‑Experts (MoE) architecture

120B: 4 of 128 experts are active per token
20B: 4 of 32 experts are active per token (no shared expert).
Although MoE weights account for about 90% of total parameters, only a few experts run for each token.
This sparsity is a key reason GPT‑OSS achieves fast inference relative to its size.
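
To make the sparsity concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic router, not GPT-OSS's actual routing code: the function name route_top_k, the random logits, and the choice to softmax only over the selected experts are assumptions made for illustration.

```python
import math
import random

def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
# Hypothetical router scores for one token over 32 experts (the 20B layout).
logits = [random.gauss(0.0, 1.0) for _ in range(32)]
selected = route_top_k(logits, k=4)
active_fraction = len(selected) / len(logits)

print(selected)
print(f"active experts per token: {active_fraction:.1%}")
```

Only the selected experts' feed-forward weights are evaluated for that token, which is why compute per token stays far below what the total parameter count suggests.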

2) Native Microscaling FP4 (MXFP4)

Based on the Open Compute Project (OCP) Microscaling Formats v1.0 FP4 spec with block‑wise scaling (size 32).
FP4 is paired with stochastic rounding and the Random Hadamard Transform (RHT), allowing 4-bit precision to be applied to large MoE weights while preserving accuracy. It can be used not only for inference but also directly during training. (Open Compute Project, Hugging Face)
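
The block-wise scaling idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not a production kernel: it uses round-to-nearest with simplified tie-breaking rather than stochastic rounding, it omits the Hadamard transform, and the function names are hypothetical. Each block of 32 values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 value.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # number of elements sharing one scale
E2M1_EMAX = 2     # exponent of the largest E2M1 binade (4.0 == 2**2)

def quantize_block(block):
    """Return (shared_exponent, quantized_elements) for one block of floats."""
    assert len(block) <= BLOCK_SIZE
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        scaled = min(abs(x) / scale, E2M1_VALUES[-1])   # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - scaled))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized

def dequantize_block(shared_exp, quantized):
    """Multiply the stored elements by the shared scale."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]

# Hypothetical weights for one block.
weights = [0.07 * (i - 16) for i in range(BLOCK_SIZE)]
shared_exp, q = quantize_block(weights)
restored = dequantize_block(shared_exp, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# 4 bits per element plus one 8-bit scale shared by 32 elements.
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE

print(f"shared exponent: {shared_exp}, max abs error: {max_error:.4f}")
print(f"storage cost: {BITS_PER_WEIGHT} bits per weight")
```

The storage cost of 4.25 bits per weight is where the memory savings over 16-bit formats come from.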

3) Harmony format (chat/reasoning/tool use)

GPT‑OSS is post‑trained with the Harmony format and should be used with the o200k_harmony tokenizer.
Multi‑channel output: supports CoT (Chain‑of‑Thought), tool calls, and standard responses, with a clear instruction hierarchy and namespace for tools.
- Roles: system / developer / user / assistant / tool; channels: analysis / commentary / final
Designed to produce efficient reasoning outputs and structured function calls (tool calls).
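
As a rough picture of how roles and channels appear in the rendered text, the sketch below builds a Harmony-style transcript by hand. It is illustrative only: render_message is a hypothetical helper, the example messages are made up, and the real format has additional details (such as tool namespaces and end-of-turn tokens) that this sketch leaves out. In practice the tokenizer's chat template or OpenAI's Harmony tooling should produce this text.

```python
def render_message(role, content, channel=None):
    """Render one message in a Harmony-style layout (illustrative only)."""
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"

conversation = [
    render_message("system", "You are a helpful assistant.\nReasoning: low"),
    render_message("user", "What is 2 + 2?"),
    # The assistant's reasoning and its user-facing answer use separate channels.
    render_message("assistant", "Simple arithmetic.", channel="analysis"),
    render_message("assistant", "4", channel="final"),
]
prompt = "".join(conversation)
print(prompt)
```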

Why fine‑tune GPT‑OSS on VESSL?

Ready-to-use training environment: VESSL provides container images with the PyTorch, CUDA, and Triton kernels needed for GPT‑OSS training, so you can start training immediately.
Optimized hardware: H100 80 GB supports both 20B and 120B. While 120B fits on a single GPU, Tensor Parallel (TP) is recommended for throughput.
Integrated ML/LLMOps: End‑to‑end workflow with real‑time training metrics, automatic checkpointing/model saving, a Model Registry, and one‑click deployment.

Step‑by‑Step Guide

1. Create a VESSL account & project

Start at vessl.ai, create an account, and create a new project from the dashboard. In this guide, we’ll use gpt-oss-finetuning.

2. Set up the VESSL CLI

Install and configure the VESSL CLI.

# Install VESSL CLI (skip if already installed)
pip install vessl

# Configure VESSL
vessl configure --organization YOUR_ORG_NAME --project gpt-oss-finetuning

3. Clone the example repository

VESSL’s examples repo includes code and recipes to fine‑tune GPT‑OSS. Clone it and move into the fine‑tuning directory.

git clone https://github.com/vessl-ai/examples.git
cd examples/runs/finetune-llms

4. Launch fine‑tuning

Inside finetune-llms you’ll find:

Training scripts: main.py, model.py, dataset.py, and so on, optimized for efficient fine‑tuning.
VESSL Run template: run_yamls/run_lora_gpt_oss.yaml — a ready‑to‑run configuration.

Open run_lora_gpt_oss.yaml and review the key settings.

Model & dataset:

env:
  MODEL_NAME: openai/gpt-oss-20b  # or openai/gpt-oss-120b
  DATASET_NAME: HuggingFaceH4/Multilingual-Thinking
  REPOSITORY_NAME: gpt-oss-20b-multilingual-reasoner

The Run uses the HuggingFaceH4/Multilingual‑Thinking dataset by default; feel free to swap in another dataset or your own.
Trained artifacts are saved as a VESSL Model named gpt-oss-20b-multilingual-reasoner in the Model Registry.
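
A training script would typically pick these values up from the environment. The sketch below shows one way to do that with defaults matching the template. How main.py actually reads them is not shown in this guide, so treat load_run_config and the DEFAULTS mapping as hypothetical.

```python
import os

# Defaults mirror the env block of the Run template.
DEFAULTS = {
    "MODEL_NAME": "openai/gpt-oss-20b",
    "DATASET_NAME": "HuggingFaceH4/Multilingual-Thinking",
    "REPOSITORY_NAME": "gpt-oss-20b-multilingual-reasoner",
}

def load_run_config(environ=None):
    """Read the Run's settings from the environment, falling back to defaults."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, default) for key, default in DEFAULTS.items()}

config = load_run_config()
print(config)
```

Passing a plain dict instead of os.environ makes the helper easy to check locally, for example load_run_config({"MODEL_NAME": "openai/gpt-oss-120b"}).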

Resources:

resources:
  cluster: vessl-eu-h100-80g-sxm
  preset: gpu-h100-80g-small
image: quay.io/vessl-ai/torch:2.8.0-cuda12.8

gpu-h100-80g-small uses 1× H100 80 GB. For large sequences or higher throughput on gpt‑oss‑120b, use multi‑GPU/TP.
The container image includes PyTorch 2.8.0 and CUDA 12.8 with GPT‑OSS support.

Training hyperparameters:
LoRA
- lora_r: 32 — LoRA rank for parameter efficiency
- lora_alpha: 64 — LoRA scaling factor
- lora_target_modules: all-linear — include all linear layers (MoE experts included)
Optimization
- lr_scheduler_type: cosine_with_min_lr — cosine schedule with a floor
- warmup_ratio: 0.03 — 3% warmup
Memory optimization
- load_in_4bit: True — memory‑efficient 4‑bit loading
- gradient_checkpointing: True — trade compute for memory
- per_device_train_batch_size: 4
- gradient_accumulation_steps: 4 — effective batch size 16
- bf16: True — uses bfloat16, required by GPT‑OSS
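
The arithmetic behind these settings can be checked with a short script. This is a sketch under stated assumptions: the layer dimensions, total step count, peak learning rate, and minimum learning rate are hypothetical values chosen for illustration, and the schedule function is a plain reimplementation of the idea rather than the trainer's own scheduler.

```python
import math

# Values from the Run template.
lora_r, lora_alpha = 32, 64
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
warmup_ratio = 0.03
num_gpus = 1  # gpu-h100-80g-small uses a single H100

lora_scaling = lora_alpha / lora_r  # factor applied to the low-rank update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps * num_gpus)

def lora_params(d_in, d_out, r=lora_r):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    return r * (d_in + d_out)

def cosine_with_min_lr(step, total_steps, peak_lr, min_lr, warmup=warmup_ratio):
    """Linear warmup, then cosine decay from peak_lr down to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"LoRA scaling: {lora_scaling}, effective batch size: {effective_batch_size}")
# Hypothetical 4096 x 4096 projection layer.
print(f"LoRA params for a 4096x4096 layer: {lora_params(4096, 4096):,}")
# Hypothetical schedule: 1000 steps, peak 2e-4, floor 2e-5.
for step in (0, 30, 500, 1000):
    print(step, f"{cosine_with_min_lr(step, 1000, 2e-4, 2e-5):.2e}")
```

With multiple GPUs under data parallelism, the effective batch size scales with the number of data-parallel workers, so num_gpus would change accordingly.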

Create a VESSL Run with the config:

vessl run create -f run_yamls/run_lora_gpt_oss.yaml

5. Monitor training

Once the Run is created, the console log will print a link to the dashboard where you can inspect details, logs, and metrics in real time.

Image pulls and model downloads can delay the start. Seeing Pulling image "..." in the log is expected.

The Run's plots track epoch, grad_norm, learning_rate, loss, mean_token_accuracy, and num_tokens.

OOM (Out‑of‑Memory) Troubleshooting

Try the following, one at a time:

1. Decrease per_device_train_batch_size to 2 or 1
2. Increase gradient_accumulation_steps accordingly to keep the effective batch size at 16
3. Reduce lora_r from 32 to 16
4. Lower max_length from 2048 to 1024
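
Steps 1 and 2 go together: halving the per-device batch while doubling accumulation keeps the effective batch size unchanged. A minimal sketch of that adjustment, with shrink_batch as a hypothetical helper:

```python
def shrink_batch(per_device_batch, grad_accum):
    """Halve the per-device batch and double accumulation, keeping their product fixed."""
    if per_device_batch <= 1:
        raise ValueError("batch size is already 1; reduce lora_r or max_length instead")
    if per_device_batch % 2:
        raise ValueError("per-device batch must be even to halve it exactly")
    return per_device_batch // 2, grad_accum * 2

settings = (4, 4)  # defaults from the Run template
while settings[0] > 1:
    settings = shrink_batch(*settings)
    print(f"per_device_train_batch_size={settings[0]}, "
          f"gradient_accumulation_steps={settings[1]}")
```

Smaller micro-batches lower peak activation memory, at the cost of more forward and backward passes per optimizer step.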

6. Verify the upload

When training completes, the LoRA adapter is automatically uploaded to your VESSL Model.

The LoRA adapter is uploaded to the VESSL Model.

You can inspect each version to see the actual adapter files.

Each model version includes:

LoRA adapter weights (adapter_model.safetensors)
Config (adapter_config.json)
README.md
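
If you download a model version locally, a quick check like the one below confirms the expected files are present and reads the LoRA settings back from the config. This is a sketch: the local directory name is hypothetical, and the r and lora_alpha keys assume the adapter config follows the usual PEFT layout.

```python
import json
from pathlib import Path

EXPECTED_FILES = ("adapter_model.safetensors", "adapter_config.json", "README.md")

def check_adapter_dir(path):
    """Return (missing_files, config) for a downloaded adapter directory."""
    root = Path(path)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    config = {}
    config_path = root / "adapter_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    return missing, config

# Hypothetical local path to a downloaded model version.
missing, config = check_adapter_dir("./gpt-oss-20b-multilingual-reasoner")
print("missing files:", missing)
print("rank:", config.get("r"), "alpha:", config.get("lora_alpha"))
```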

7. Serve your fine‑tuned adapter: Direct adapter serving vs. merge

As of August 2025, most inference frameworks (for example, vLLM) do not serve GPT‑OSS LoRA adapters directly. For inference, merge the adapter into the base model and serve the merged weights.
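
Merging means folding the low-rank update into the base weight, W' = W + (alpha / r) * B A, so the served model needs no adapter support at inference time. The sketch below shows that arithmetic on tiny nested-list matrices. It is a toy illustration with hypothetical helper names, not the code the merge Run executes.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) for nested-list matrices."""
    scaling = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, shape d_out x d_in
A = [[1.0, 2.0]]               # LoRA A, shape r x d_in with r = 1
B = [[0.5], [0.0]]             # LoRA B, shape d_out x r
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the adapter matrices are no longer needed, which is why the merged weights can be served like any ordinary checkpoint.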

Launch the merge and an inference server for the merged model:

# Modify {YOUR_ORGANIZATION} in the YAML to your actual organization name
vessl run create -f run_yamls/run_lora_gpt_oss_merge.yaml

When the server is up, open the Connect dropdown and click API to access the endpoint.

Use the Python snippet below to test streaming with your fine‑tuned API:

#!/usr/bin/env python3
"""
Simple streaming test script for GPT-OSS API
"""
import openai
from datetime import datetime

# Configure client for your GPT-OSS server
client = openai.OpenAI(
    base_url="https://{YOUR_API_ENDPOINT}/v1",
    api_key="dummy"  # Not needed for our server
)

# OpenAI Harmony format system prompt
current_date = datetime.now().strftime("%Y-%m-%d")
system_prompt = f"""
<|start|>system<|message|>You are VESSL-GPT, a large language model fine-tuned on VESSL.
Knowledge cutoff: 2024-06
Current date: {current_date}
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
"""

def test_streaming():
    print("🚀 Testing GPT-OSS Streaming...")
    print("=" * 50)

    try:
        stream = client.chat.completions.create(
            model="gpt-oss-20b",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Write a haiku about artificial intelligence"}
            ],
            max_tokens=1024,
            temperature=0.7,
            stream=True
        )

        print("🤖 GPT-OSS: ", end="", flush=True)

        for chunk in stream:
            # Some servers send chunks with no choices (e.g., a final usage chunk); skip them
            if not chunk.choices:
                continue
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)

        print("\n" + "=" * 50)
        print("✅ Streaming test completed!")

    except Exception as e:
        print(f"❌ Error: {e}")

if __name__ == "__main__":
    test_streaming()
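
If you want to keep the full reply rather than only print it, the streamed pieces can be joined as they arrive. The sketch below exercises that logic offline with stand-in chunk objects. collect_text and fake_chunk are hypothetical helpers, and the stand-ins only mimic the choices[0].delta.content shape used in the script above.

```python
from types import SimpleNamespace

def collect_text(stream):
    """Join the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:          # skip chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:                    # skip role-only or empty deltas
            parts.append(content)
    return "".join(parts)

def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (for local testing)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    fake_chunk("Hello"),
    SimpleNamespace(choices=[]),
    fake_chunk(None),
    fake_chunk(", world"),
]
reply = collect_text(fake_stream)
print(reply)
```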

References

OpenAI
OCP MX v1.0
- https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
Hugging Face
- https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking
vLLM
- https://docs.vllm.ai/en/v0.9.1/features/lora
- https://blog.vllm.ai/2025/08/05/gpt-oss
Miscellaneous references
- https://www.codecademy.com/article/gpt-oss-run-locally
- https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me
- https://arxiv.org/abs/2106.09685

VESSL is an integrated ML/LLMOps platform for operating GPT‑OSS workloads in enterprise environments. With the Model Registry, you can systematically manage fine‑tuned artifacts and deploy services quickly.

If you’re looking to run the latest open-source models inside a restricted or private network with the right security and infrastructure controls, contact us at sales@vessl.ai or visit the "Talk to Sales" page—we’re happy to help.