Current setup/environment:
.vscode/launch.json, the VS Code launch configuration that runs debug_vllm.py under debugpy:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug vLLM (Eager + Subproc)",
      "type": "debugpy",
      "request": "launch",
      "program": "${workspaceFolder}/debug_vllm.py",
      "console": "integratedTerminal",
      "justMyCode": false,
      "subProcess": true,
      "env": {
        "CUDA_VISIBLE_DEVICES": "0",
        "VLLM_LOGGING_LEVEL": "DEBUG",
        "VLLM_CONFIGURE_LOGGING": "1"
      }
    }
  ]
}
debug_vllm.py itself uses almost the same setup (a rough sketch follows).
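A minimal sketch of what debug_vllm.py might look like, reconstructed from the non-default args reported in the startup log below; the prompt and sampling settings are invented for illustration:

# Sketch only -- reconstructed from the logged non-default args
# (model, gpu_memory_utilization, disable_log_stats); the prompt and
# SamplingParams values are made up.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.8,
    disable_log_stats=True,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)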
If we launch it, the output looks like this:
❯ cd /home/junu/Projects/oss/vllm ; /usr/bin/env /home/junu/Projects/oss/vllm/.venv/bin/python /home/junu/.vscode-server/extensions/ms-python.debugpy-2025.18.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 41597 -- /home/junu/Projects/oss/vllm/debug_vllm.py
DEBUG 01-04 15:55:54 [plugins/__init__.py:35] No plugins for group vllm.platform_plugins found.
DEBUG 01-04 15:55:54 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 01-04 15:55:54 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 01-04 15:55:54 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:153] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 01-04 15:55:54 [platforms/__init__.py:160] Checking if CPU platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 01-04 15:55:54 [platforms/__init__.py:225] Automatically detected platform cuda.
DEBUG 01-04 15:56:05 [plugins/__init__.py:43] Available plugins for group vllm.general_plugins:
DEBUG 01-04 15:56:05 [plugins/__init__.py:45] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 01-04 15:56:05 [plugins/__init__.py:48] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 01-04 15:56:05 [entrypoints/utils.py:253] non-default args: {'gpu_memory_utilization': 0.8, 'disable_log_stats': True, 'model': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'}
DEBUG 01-04 15:56:07 [model_executor/models/registry.py:633] Cached model info file for class vllm.model_executor.models.llama.LlamaForCausalLM not found
DEBUG 01-04 15:56:07 [model_executor/models/registry.py:693] Cache model info for class vllm.model_executor.models.llama.LlamaForCausalLM miss. Loading model instead.
The first place we visit is this fake-op registration:
@register_fake("_C::cutlass_w4a8_mm")
vLLM performs model profiling during initialization. Before starting the engine, it needs to know how much memory a single request will take, so it runs a "dummy" forward pass.
Running a real forward pass with real weights would be slow and could OOM before the engine has even started. Instead, we use meta devices (tensors that carry shapes and dtypes but no data), which is much faster.
This tells PyTorch: don't try to launch the CUDA code, just calculate what the output shape would be.
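For intuition, here is a toy version of the same mechanism: torch.library.register_fake attaches a shape-only ("fake") implementation to a custom op, so tracing with fake/meta tensors never launches a kernel. The op name mylib::toy_mm and the shapes below are made up for this sketch.

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Toy custom op; the real implementation would launch actual kernels.
@torch.library.custom_op("mylib::toy_mm", mutates_args=())
def toy_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a @ b

# Fake implementation: no kernel launch, only shape/dtype propagation.
@torch.library.register_fake("mylib::toy_mm")
def _(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a.new_empty((a.shape[0], b.shape[1]))

# Under FakeTensorMode, the fake implementation is all that runs.
with FakeTensorMode():
    x = torch.empty(4, 8)
    y = torch.empty(8, 16)
    out = torch.ops.mylib.toy_mm(x, y)
    print(out.shape)  # torch.Size([4, 16]) -- no real computation happened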
Next we arrive at ModelConfig, where all the model-level configuration is collected:
return ModelConfig(
    model=self.model,
    hf_config_path=self.hf_config_path,
    runner=self.runner,
    convert=self.convert,
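For reference, you can reproduce this step from a Python shell by building EngineArgs the same way the entrypoint does and inspecting the ModelConfig it returns; create_model_config() and the exact field names may differ between vLLM versions, so treat this as a sketch:

from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    gpu_memory_utilization=0.8,
)
model_config = engine_args.create_model_config()
print(model_config.model)          # TinyLlama/TinyLlama-1.1B-Chat-v1.0
print(model_config.dtype)          # resolved torch dtype for this model
print(model_config.max_model_len)  # resolved context length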