Fine-tune GPT-OSS with us.
This guide walks you through fine‑tuning OpenAI GPT‑OSS models (20B, 120B) with Low-Rank Adaptation (LoRA) on the VESSL platform, then taking the result all the way from Model Registry upload to serving.
GPT‑OSS is an open-weight model family released by OpenAI on August 5, 2025. It adopts the Harmony response format and MXFP4 (4-bit) quantization, allowing even large models to run on modest hardware—a single H100 80 GB for 120B and ~16 GB for 20B. Unlike proprietary models, GPT-OSS is freely downloadable, usable, and modifiable. It is licensed under Apache‑2.0.
Harmony structures conversations around five roles (system, developer, user, assistant, tool) and routes assistant output through channels such as analysis (reasoning) and final (the user-facing answer).
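As a rough illustration of the Harmony wire format (the special tokens below come from the published spec, but real prompts should be rendered by the model's chat template, so treat this as a sketch only):

```python
# Minimal sketch of Harmony's message framing. The helper function is
# illustrative, not part of any official SDK.
def harmony_message(role, content, channel=None):
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"

prompt = (
    harmony_message("system", "You are a helpful assistant.")
    + harmony_message("user", "Write a haiku.")
    + "<|start|>assistant"  # generation continues from here
)
print(prompt)
```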
Start at vessl.ai, create an account, and create a new project from the dashboard. In this guide, we’ll use gpt-oss-finetuning.
Install and configure the VESSL CLI.
# Install VESSL CLI (skip if already installed)
pip install vessl
# Configure VESSL
vessl configure --organization YOUR_ORG_NAME --project gpt-oss-finetuning
VESSL’s examples repo includes code and recipes to fine‑tune GPT‑OSS. Clone it and move into the fine‑tuning directory.
git clone https://github.com/vessl-ai/examples.git
cd examples/runs/finetune-llms
Inside finetune-llms you’ll find:
- main.py, model.py, dataset.py, and so on — code optimized for efficient fine‑tuning
- run_yamls/run_lora_gpt_oss.yaml — a ready‑to‑run configuration

Open run_lora_gpt_oss.yaml and review the key settings.
env:
MODEL_NAME: openai/gpt-oss-20b # or openai/gpt-oss-120b
DATASET_NAME: HuggingFaceH4/Multilingual-Thinking
REPOSITORY_NAME: gpt-oss-20b-multilingual-reasoner
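Inside the container, the env block surfaces these as ordinary environment variables; a training script might pick them up like this (variable names from the YAML above, defaults illustrative):

```python
import os

# Read the run's configuration from the env block (defaults are illustrative)
model_name = os.environ.get("MODEL_NAME", "openai/gpt-oss-20b")
dataset_name = os.environ.get("DATASET_NAME", "HuggingFaceH4/Multilingual-Thinking")
repository_name = os.environ.get("REPOSITORY_NAME", "gpt-oss-20b-multilingual-reasoner")
print(model_name, dataset_name, repository_name)
```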
The trained adapter will be registered as gpt-oss-20b-multilingual-reasoner in the Model Registry.

resources:
  cluster: vessl-eu-h100-80g-sxm
  preset: gpu-h100-80g-small
image: quay.io/vessl-ai/torch:2.8.0-cuda12.8

gpu-h100-80g-small uses 1× H100 80 GB. For large sequences or higher throughput on gpt‑oss‑120b, use multi‑GPU/TP.

Key hyperparameters:
- lora_r: 32 — LoRA rank for parameter efficiency
- lora_alpha: 64 — LoRA scaling factor
- lora_target_modules: all-linear — include all linear layers (MoE experts included)
- lr_scheduler_type: cosine_with_min_lr — cosine schedule with a minimum learning-rate floor
- warmup_ratio: 0.03 — 3% warmup
- load_in_4bit: True — memory‑efficient 4‑bit loading
- gradient_checkpointing: True — trade compute for memory
- per_device_train_batch_size: 4 and gradient_accumulation_steps: 4 — effective batch size 16
- bf16: True — bfloat16 precision, required by GPT‑OSS

Create a VESSL Run with the config:
vessl run create -f run_yamls/run_lora_gpt_oss.yaml
Once the Run is created, the console log will print a link to the dashboard where you can inspect details, logs, and metrics in real time.
Image pulls and model downloads can delay the start. Seeing Pulling image "..."
in the log is expected.
OOM (Out‑of‑Memory) Troubleshooting
Try the following, one at a time:
1. Decrease per_device_train_batch_size to 2 or 1
2. Increase gradient_accumulation_steps accordingly
3. Reduce lora_r from 32 to 16
4. Lower max_length from 2048 to 1024
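Steps 1 and 2 above work together: halving the per-device batch while doubling accumulation keeps the effective batch size constant. A quick arithmetic check:

```python
def effective_batch(per_device, grad_accum, num_gpus=1):
    # Effective batch size = per-device batch x accumulation steps x GPU count
    return per_device * grad_accum * num_gpus

print(effective_batch(4, 4))   # original config: 16
print(effective_batch(2, 8))   # OOM mitigation, same effective batch: 16
```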
When training completes, the LoRA adapter is automatically uploaded to your VESSL Model.
You can inspect each version to see the actual adapter files.
Each model version includes:
- adapter_model.safetensors — the LoRA adapter weights
- adapter_config.json — the adapter configuration
- README.md
As of August 2025, most inference frameworks (for example, vLLM) do not serve GPT‑OSS LoRA adapters directly. For inference, merge the adapter into the base model and serve the merged weights.
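The merge itself is simple linear algebra: the adapter's low-rank update, scaled by lora_alpha / lora_r, is added into the base weight, so inference needs no adapter at all. A pure-Python miniature with toy 2×2 matrices (not the real model code):

```python
def matmul(X, Y):
    # Naive matrix multiply for the toy example
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

lora_r, lora_alpha = 1, 2
W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2 toy)
B = [[1.0], [0.0]]             # adapter B: d_out x r
A = [[0.0, 1.0]]               # adapter A: r x d_in

# Merged weight: W + (lora_alpha / lora_r) * B @ A
scale = lora_alpha / lora_r
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
print(W_merged)  # [[1.0, 2.0], [0.0, 1.0]]
```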
Launch the merge and an inference server for the merged model:
# Modify {YOUR_ORGANIZATION} in the YAML to your actual organization name
vessl run create -f run_yamls/run_lora_gpt_oss_merge.yaml
When the server is up, open the Connect dropdown and click API to access the endpoint.
Use the Python snippet below to test streaming with your fine‑tuned API:
#!/usr/bin/env python3
"""
Simple streaming test script for GPT-OSS API
"""
import openai
from datetime import datetime

# Configure client for your GPT-OSS server
client = openai.OpenAI(
    base_url="https://{YOUR_API_ENDPOINT}/v1",
    api_key="dummy",  # Not needed for our server
)

# OpenAI Harmony format system prompt
current_date = datetime.now().strftime("%Y-%m-%d")
system_prompt = f"""
<|start|>system<|message|>You are VESSL-GPT, a large language model fine-tuned on VESSL.
Knowledge cutoff: 2024-06
Current date: {current_date}
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
"""

def test_streaming():
    print("🚀 Testing GPT-OSS Streaming...")
    print("=" * 50)
    try:
        stream = client.chat.completions.create(
            model="gpt-oss-20b",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Write a haiku about artificial intelligence"},
            ],
            max_tokens=1024,
            temperature=0.7,
            stream=True,
        )
        print("🤖 GPT-OSS: ", end="", flush=True)
        for chunk in stream:
            # Some chunks (e.g., the final usage chunk) carry no choices or content
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="", flush=True)
        print("\n" + "=" * 50)
        print("✅ Streaming test completed!")
    except Exception as e:
        print(f"❌ Error: {e}")

if __name__ == "__main__":
    test_streaming()
VESSL is an integrated ML/LLMOps platform for operating GPT‑OSS workloads in enterprise environments. With the Model Registry, you can systematically manage fine‑tuned artifacts and deploy services quickly.
If you’re looking to run the latest open-source models inside a restricted or private network with the right security and infrastructure controls, contact us at sales@vessl.ai or visit the "Talk to Sales" page—we’re happy to help.