https://huggingface.co/openai/gpt-oss-120b/discussions/73#689604892dc285a0804eb331
Nice Dockerfile hack:
I have hacked together a Dockerfile and instructions to build/run here: https://gist.github.com/Ithanil/fd7644bf3e44eec752d1263a8b8acb3a
You'll get almost 100 tokens/s for a single request on 1x A100, and >500 tokens/s aggregate throughput with 30 parallel requests.
So far it survives load testing fine.
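For a rough check of those parallel-throughput numbers, a loop like the one below (my own sketch, not from the gist) fires 30 concurrent requests at the OpenAI-compatible endpoint; it assumes the server is listening on the default localhost:8000 and serving the model under the name openai/gpt-oss-120b.
# Rough concurrency check (sketch, not from the gist): 30 parallel chat requests.
# Assumes the default port 8000 and served model name openai/gpt-oss-120b.
for i in $(seq 1 30); do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "max_tokens": 128}' \
    > /dev/null &
done
wait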
Alternative recipe:
python3.12 -m venv ./.venv
source ./.venv/bin/activate
# PyTorch nightly with CUDA 12.8 wheels
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128
uv pip install "transformers[torch]"
# Clone the vLLM fork with gpt-oss support and pin a known-good commit
git clone https://github.com/zyongye/vllm.git
cd vllm
git checkout 8260948cdc379d13bf4b80d3172a03d21a983e05
python use_existing_torch.py
uv pip install -r requirements/build.txt
CCACHE_NOHASHDIR="true" uv pip install --no-build-isolation -e . -v
uv pip uninstall triton
uv pip uninstall pytorch-triton
uv pip install triton==3.4.0
uv pip install openai_harmony
uv pip install mcp
# Install OpenAI's triton_kernels package from source
git clone https://github.com/openai/triton.git
pushd triton
uv pip install -e python/triton_kernels --no-deps
popd
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
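Once the server is up, a quick smoke test against the OpenAI-compatible API (assuming the default port 8000) might look like this:
# List the served models, then send one chat completion
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello."}]}'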
I'm trying it on an A100 too; here's my install order:
# Install latest transformers
pip install -U "transformers[torch]"
# Compile vLLM from source; remember to check out PR https://github.com/vllm-project/vllm/pull/22259
cd vllm
python use_existing_torch.py
pip install -r requirements/build.txt
pip install -U -e . --verbose --no-build-isolation
# Then start the server. The env var matters: without it, an FA3 (FlashAttention-3) error will be raised
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-120b
Everything works fine in my environment.
This is just for reference, since I skipped installing the torch nightly, triton, and flashinfer packages included in the original guide; I suspect Ampere can't benefit from the optimizations in those latest packages anyway.
My package versions:
torch 2.8.0
triton 3.4.0
transformers 4.55.0
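To confirm you ended up with the same versions, a quick one-liner works:
# Print the versions of the three relevant packages
python -c "import torch, triton, transformers; print(torch.__version__, triton.__version__, transformers.__version__)"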