1. LLM Class

The LLM class provides the primary Python interface for offline inference, that is, interacting with a model without using a separate model inference server.

from vllm import LLM
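
For example, a minimal offline inference script constructs an LLM, defines SamplingParams, and calls generate on a batch of prompts. The sketch below assumes a recent vLLM release; the model name facebook/opt-125m is only an illustrative choice.

from vllm import LLM, SamplingParams

# A batch of input prompts to complete
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling parameters shared by all prompts
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Instantiate the engine with a model from the Hugging Face Hub
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single call
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)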

More API details can be found in the Offline Inference section of the API docs (https://docs.vllm.ai/en/latest/design/arch_overview.html#offline-inference-api).

The code for the LLM class can be found in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py.

2. OpenAI-Compatible API Server

The second primary interface to vLLM is via its OpenAI-compatible API server. This server can be started using the vllm serve command.

vllm serve <model>
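
Once running, the server exposes the standard OpenAI REST endpoints (for example /v1/completions and /v1/chat/completions), so any OpenAI-compatible client can talk to it. The sketch below is an assumption-laden example: it assumes the server is listening on the default http://localhost:8000, that the openai Python package is installed, and that facebook/opt-125m is the model that was passed to vllm serve.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key can be any
# placeholder string unless the server was started with an API key configured
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the model being served
    prompt="Hello, my name is",
    max_tokens=32,
)
print(completion.choices[0].text)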

The code for the vllm CLI can be found in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/main.py.

Sometimes you may see the API server entrypoint used directly instead of via the vllm CLI command. For example:

python -m vllm.entrypoints.openai.api_server \
    --model <model>

That code can be found in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py.

3. LLMEngine

The LLMEngine class is the core component of the vLLM engine. It is responsible for receiving requests from clients and generating outputs from the model. The LLMEngine includes input processing, model execution (possibly distributed across multiple hosts and/or GPUs), scheduling, and output processing.
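
The engine is normally driven for you by the higher-level entrypoints above, but it can also be used directly via its add_request/step loop. The following is a minimal sketch, assuming a recent vLLM release (exact argument names have varied across versions); the model name is only an illustrative choice.

from vllm import EngineArgs, LLMEngine, SamplingParams

# Build the engine from its configuration arguments
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

# Submit a request; the engine schedules it alongside any other pending requests
engine.add_request("request-0", "Hello, my name is", SamplingParams(max_tokens=32))

# Each call to step() performs one scheduling and model execution iteration
while engine.has_unfinished_requests():
    for request_output in engine.step():
        if request_output.finished:
            print(request_output.outputs[0].text)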