I was following https://zerohertz.github.io/vllm-openai-1/, and just running ❯ vllm serve Qwen/Qwen3-0.6B --max-model-len 8192 broke.

is it because I'm on a 3050 Ti? → yes, yes, yes

❯ uv pip install vllm
Using Python 3.12.11 environment at: /home/jpotw/Projects/oss/vllm/.venv
Audited 1 package in 154ms
❯ uv pip install vllm --torch-backend=auto
Using Python 3.12.11 environment at: /home/jpotw/Projects/oss/vllm/.venv
Audited 1 package in 19ms
❯ vllm serve Qwen/Qwen3-0.6B --max-model-len 8192
INFO 09-16 19:09:04 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=190128) INFO 09-16 19:09:07 [api_server.py:1805] vLLM API server version 0.1.dev8589+g3e6dd4001
(APIServer pid=190128) INFO 09-16 19:09:07 [utils.py:326] non-default args: {'model_tag': 'Qwen/Qwen3-0.6B', 'max_model_len': 8192}
(APIServer pid=190128) INFO 09-16 19:09:15 [__init__.py:702] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=190128) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=190128) INFO 09-16 19:09:15 [__init__.py:1740] Using max model len 8192
(APIServer pid=190128) INFO 09-16 19:09:16 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-16 19:09:21 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=190268) INFO 09-16 19:09:22 [core.py:620] Waiting for init message from front-end.
(EngineCore_0 pid=190268) INFO 09-16 19:09:22 [core.py:72] Initializing a V1 LLM engine (v0.1.dev8589+g3e6dd4001) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684] EngineCore failed to start.
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684] Traceback (most recent call last):
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 675, in run_engine_core
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 476, in __init__
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 78, in __init__
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     self._init_executor()
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     self.collective_rpc("init_device")
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/utils/__init__.py", line 2997, in run_method
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     return func(*args, **kwargs)
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/worker/worker_base.py", line 603, in init_device
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     self.worker.init_device()  # type: ignore
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]   File "/home/jpotw/Projects/oss/vllm/vllm/v1/worker/gpu_worker.py", line 179, in init_device
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684]     raise ValueError(
(EngineCore_0 pid=190268) ERROR 09-16 19:09:24 [core.py:684] ValueError: Free memory on device (3.2/4.0 GiB) on startup is less than desired GPU memory utilization (0.9, 3.6 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore_0 pid=190268) Process EngineCore_0:
(EngineCore_0 pid=190268) Traceback (most recent call last):
(EngineCore_0 pid=190268)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=190268)     self.run()
(EngineCore_0 pid=190268)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=190268)     self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 688, in run_engine_core
(EngineCore_0 pid=190268)     raise e
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 675, in run_engine_core
(EngineCore_0 pid=190268)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=190268)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 476, in __init__
(EngineCore_0 pid=190268)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core.py", line 78, in __init__
(EngineCore_0 pid=190268)     self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=190268)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=190268)     self._init_executor()
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_0 pid=190268)     self.collective_rpc("init_device")
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=190268)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=190268)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/utils/__init__.py", line 2997, in run_method
(EngineCore_0 pid=190268)     return func(*args, **kwargs)
(EngineCore_0 pid=190268)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/worker/worker_base.py", line 603, in init_device
(EngineCore_0 pid=190268)     self.worker.init_device()  # type: ignore
(EngineCore_0 pid=190268)     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=190268)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/worker/gpu_worker.py", line 179, in init_device
(EngineCore_0 pid=190268)     raise ValueError(
(EngineCore_0 pid=190268) ValueError: Free memory on device (3.2/4.0 GiB) on startup is less than desired GPU memory utilization (0.9, 3.6 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(APIServer pid=190128) Traceback (most recent call last):
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=190128)     sys.exit(main())
(APIServer pid=190128)              ^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=190128)     args.dispatch_function(args)
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=190128)     uvloop.run(run_server(args))
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=190128)     return __asyncio.run(
(APIServer pid=190128)            ^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=190128)     return runner.run(main)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=190128)     return self._loop.run_until_complete(task)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=190128)     return await main
(APIServer pid=190128)            ^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/openai/api_server.py", line 1850, in run_server
(APIServer pid=190128)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/openai/api_server.py", line 1870, in run_server_worker
(APIServer pid=190128)     async with build_async_engine_client(
(APIServer pid=190128)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=190128)     return await anext(self.gen)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client
(APIServer pid=190128)     async with build_async_engine_client_from_engine_args(
(APIServer pid=190128)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=190128)     return await anext(self.gen)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/entrypoints/openai/api_server.py", line 220, in build_async_engine_client_from_engine_args
(APIServer pid=190128)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=190128)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/utils/__init__.py", line 1551, in inner
(APIServer pid=190128)     return fn(*args, **kwargs)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=190128)     return cls(
(APIServer pid=190128)            ^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=190128)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=190128)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=190128)     return AsyncMPClient(*client_args)
(APIServer pid=190128)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core_client.py", line 758, in __init__
(APIServer pid=190128)     super().__init__(
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/core_client.py", line 446, in __init__
(APIServer pid=190128)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=190128)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=190128)   File "/home/jpotw/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=190128)     next(self.gen)
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/utils.py", line 706, in launch_core_engines
(APIServer pid=190128)     wait_for_engine_startup(
(APIServer pid=190128)   File "/home/jpotw/Projects/oss/vllm/vllm/v1/engine/utils.py", line 759, in wait_for_engine_startup
(APIServer pid=190128)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=190128) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

the main error was this:

(EngineCore_0 pid=1575) ValueError: Free memory on device (3.2/4.0 GiB) on startup is less than desired GPU memory utilization (0.9, 3.6 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

The RTX 3050 Ti laptop GPU only has 4.0 GiB of VRAM, and vLLM's default --gpu-memory-utilization is 0.9, so at startup it wants to claim 0.9 × 4.0 GiB = 3.6 GiB, which is more than the 3.2 GiB that was actually free on the device.
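For the curious, this is roughly what the startup check in gpu_worker.py's init_device (the last frame in the traceback) boils down to. A minimal sketch using PyTorch's memory query, not vLLM's actual code:

```python
# rough sketch of vLLM's startup check: free VRAM must cover
# gpu_memory_utilization * total VRAM (the default utilization is 0.9)
import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
GiB = 1024 ** 3
gpu_memory_utilization = 0.9             # vLLM's default

needed = gpu_memory_utilization * total
print(f"free {free / GiB:.1f} GiB / total {total / GiB:.1f} GiB, "
      f"wanted {needed / GiB:.1f} GiB at utilization {gpu_memory_utilization}")
if free < needed:
    print("-> would fail exactly like the ValueError above")
```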

so if I set the limits more conservatively, with a lower --gpu-memory-utilization and a smaller --max-model-len, it works fine.
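(As an aside, the same knobs exist on vLLM's offline Python API. Here is a minimal sketch with the values that worked for me, in case you don't need the HTTP server; the serve commands I actually ran are right below.)

```python
# minimal sketch: the same memory limits via vLLM's offline LLM API
# instead of `vllm serve` (values are just what fit on my 4 GiB card)
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    max_model_len=512,            # small context -> small KV cache
    gpu_memory_utilization=0.7,   # leave headroom for whatever else uses the GPU
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```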

❯ vllm serve Qwen/Qwen3-0.6B \
--max-model-len 512 \
--gpu-memory-utilization 0.7
❯ vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 1024 \
  --max-num-seqs 1

INFO 11-10 15:55:15 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1455) INFO 11-10 15:55:20 [api_server.py:1805] vLLM API server version 0.1.dev8589+g3e6dd4001
(APIServer pid=1455) INFO 11-10 15:55:20 [utils.py:326] non-default args: {'model_tag': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0', 'model': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0', 'max_model_len': 1024, 'gpu_memory_utilization': 0.6, 'max_num_seqs': 1}
config.json: 100%|████████████████████████████████████| 608/608 [00:00<00:00, 3.05MB/s]
(APIServer pid=1455) INFO 11-10 15:55:33 [__init__.py:702] Resolved architecture: LlamaForCausalLM
(APIServer pid=1455) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1455) INFO 11-10 15:55:33 [__init__.py:1740] Using max model len 1024
(APIServer pid=1455) INFO 11-10 15:55:34 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1455) WARNING 11-10 15:55:34 [scheduler.py:269] max_num_batched_tokens (2048) exceeds max_num_seqs * max_model_len (1024). This may lead to unexpected behavior.
(APIServer pid=1455) WARNING 11-10 15:55:34 [scheduler.py:269] max_num_batched_tokens (2048) exceeds max_num_seqs * max_model_len (1024). This may lead to unexpected behavior.
tokenizer_config.json: 1.29kB [00:00, 385kB/s]
tokenizer.model: 100%|███████████████████████████████| 500k/500k [00:01<00:00, 298kB/s]
tokenizer.json: 1.84MB [00:00, 11.7MB/s]
special_tokens_map.json: 100%|████████████████████████| 551/551 [00:00<00:00, 2.69MB/s]
generation_config.json: 100%|██████████████████████████| 124/124 [00:00<00:00, 713kB/s]
INFO 11-10 15:55:47 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1582) INFO 11-10 15:55:49 [core.py:620] Waiting for init message from front-end.
(EngineCore_0 pid=1582) INFO 11-10 15:55:50 [core.py:72] Initializing a V1 LLM engine (v0.1.dev8589+g3e6dd4001) with config: model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', speculative_config=None, tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v1.0', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=TinyLlama/TinyLlama-1.1B-Chat-v1.0, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":4,"local_cache_dir":null}
(EngineCore_0 pid=1582) INFO 11-10 15:55:53 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1582) WARNING 11-10 15:55:53 [interface.py:387] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore_0 pid=1582) WARNING 11-10 15:55:53 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=1582) INFO 11-10 15:55:53 [gpu_model_runner.py:1942] Starting to load model TinyLlama/TinyLlama-1.1B-Chat-v1.0...
(EngineCore_0 pid=1582) INFO 11-10 15:55:53 [gpu_model_runner.py:1974] Loading model from scratch...
(EngineCore_0 pid=1582) INFO 11-10 15:55:54 [cuda.py:325] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1582) INFO 11-10 15:55:55 [weight_utils.py:296] Using model weights format ['*.safetensors']
model.safetensors: 100%|██████████████████████████| 2.20G/2.20G [04:42<00:00, 7.78MB/s]
(EngineCore_0 pid=1582) INFO 11-10 16:00:40 [weight_utils.py:312] Time spent downloading weights for TinyLlama/TinyLlama-1.1B-Chat-v1.0: 271.318530 seconds
(EngineCore_0 pid=1582) INFO 11-10 16:00:40 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:18<00:00, 18.86s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:18<00:00, 18.86s/it]
(EngineCore_0 pid=1582)
(EngineCore_0 pid=1582) INFO 11-10 16:00:59 [default_loader.py:262] Loading weights took 17.15 seconds
(EngineCore_0 pid=1582) INFO 11-10 16:01:00 [gpu_model_runner.py:1996] Model loading took 2.0513 GiB and 290.556820 seconds
(EngineCore_0 pid=1582) INFO 11-10 16:03:04 [backends.py:530] Using cache directory: /home/jpotw/.cache/vllm/torch_compile_cache/c657951611/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1582) INFO 11-10 16:03:04 [backends.py:541] Dynamo bytecode transform time: 123.49 s
(EngineCore_0 pid=1582) [rank0]:W1110 16:03:07.202000 1582 torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
(EngineCore_0 pid=1582) INFO 11-10 16:03:14 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=1582) INFO 11-10 16:03:38 [backends.py:215] Compiling a graph for dynamic shape takes 33.83 s
(EngineCore_0 pid=1582) INFO 11-10 16:03:43 [monitor.py:34] torch.compile takes 157.32 s in total
(EngineCore_0 pid=1582) INFO 11-10 16:03:45 [gpu_worker.py:276] Available KV cache memory: 0.23 GiB
(EngineCore_0 pid=1582) INFO 11-10 16:03:45 [kv_cache_utils.py:829] GPU KV cache size: 10,864 tokens
(EngineCore_0 pid=1582) INFO 11-10 16:03:45 [kv_cache_utils.py:833] Maximum concurrency for 1,024 tokens per request: 10.61x
Capturing CUDA graph shapes: 100%|███████████████████████| 3/3 [00:00<00:00, 12.62it/s]
(EngineCore_0 pid=1582) INFO 11-10 16:03:46 [gpu_model_runner.py:2598] Graph capturing finished in 1 secs, took 0.07 GiB
(EngineCore_0 pid=1582) INFO 11-10 16:03:46 [core.py:199] init engine (profile, create kv cache, warmup model) took 165.98 seconds
(APIServer pid=1455) INFO 11-10 16:03:48 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 679
(APIServer pid=1455) INFO 11-10 16:03:49 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1455) INFO 11-10 16:03:50 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:29] Available routes are:
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /docs, Methods: GET, HEAD
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1455) INFO 11-10 16:03:50 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1455) INFO:     Started server process [1455]
(APIServer pid=1455) INFO:     Waiting for application startup.
(APIServer pid=1455) INFO:     Application startup complete.
(APIServer pid=1455) INFO:     127.0.0.1:47168 - "GET /health HTTP/1.1" 200 OK
export OPENAI_API_KEY=DUMMY
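That key is just a placeholder for the OpenAI client (vLLM doesn't validate it unless the server is started with an API key). With it set, hitting the local endpoint looks like this; a minimal sketch, with the model name matching the one served above:

```python
# minimal sketch: talk to the local vLLM server through the OpenAI Python client.
# the DUMMY key from the export above is enough; vLLM doesn't check it by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="DUMMY")

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # must match the served model name
    messages=[{"role": "user", "content": "Say hi in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```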