vLLM: A High-Performance LLM Inference Engine

Wed, 15 Apr 2026 00:00:00 +0000

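Project Overview

vLLM is a high-performance engine for LLM inference and serving. It uses PagedAttention to manage the attention key-value (KV) cache efficiently, which substantially improves throughput and GPU utilization.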
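To make the paging idea concrete, here is a toy sketch of a block table. It is not vLLM's actual implementation: the `BlockAllocator` class and its methods are hypothetical names chosen for illustration. It only shows the core idea that a sequence's KV cache lives in fixed-size blocks that need not be contiguous in memory.

```python
class BlockAllocator:
    """Toy pool of fixed-size KV cache blocks (hypothetical, not vLLM's real class)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of cached tokens

    def append_token(self, seq_id: str) -> int:
        """Reserve cache space for one more token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(16):  # fills exactly one block for seq-a
    allocator.append_token("seq-a")
for _ in range(5):   # seq-b takes the next free block
    allocator.append_token("seq-b")
for _ in range(4):   # seq-a grows again and gets a non-adjacent block
    allocator.append_token("seq-a")

print(allocator.block_tables)  # {'seq-a': [0, 2], 'seq-b': [1]}
```

Because blocks are handed out on demand, at most one partially filled block per sequence is wasted, and freed blocks can be reused by any other sequence.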

GitHub Stars: 70K+

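Key Features

PagedAttention - efficient memory management
Continuous batching - maximizes throughput
Broad model support - Llama, Mistral, Qwen, and others
OpenAI-compatible - drop-in replacement for the OpenAI API
Distributed inference - multi-GPU support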
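The throughput benefit of continuous batching can be illustrated with a small simulation. This is a simplified model, not vLLM's scheduler: it assumes every decode step yields one token per running request, ignores prefill cost, and uses hypothetical function names. It compares the number of decode steps needed when a batch must wait for its longest request against the case where a finished request frees its slot immediately.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot for the next waiting request."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens for each running request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: two long requests mixed with short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 9
```

In this toy workload the short requests no longer hold slots hostage until the long request in their batch completes, which is where the saved steps come from.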

Installation

pip

pip install vllm

Docker

docker pull vllm/vllm-openai:latest

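Quick Start

Offline Batch Inference

from vllm import LLM, SamplingParams

# Load the model. A 70B model in 16-bit precision usually does not fit on a
# single GPU, so set tensor_parallel_size or pick a smaller model as needed.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")

# Configure the sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Batch inference: all prompts are submitted in one call
prompts = [
    "What is Kubernetes?",
    "Explain SQL injection attacks",
    "How do you design a high-availability architecture?",
]

outputs = llm.generate(prompts, sampling_params)

# Each result carries the original prompt and its generated completions
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")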

Starting the API Server

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --port 8000

Using the API

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt": "What is zero trust architecture?",
        "max_tokens": 256,
        "temperature": 0.7
    }'
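The same request can be built from Python with only the standard library. The sketch below mirrors the curl call; `build_completion_request` and `complete` are hypothetical helper names, and the host, port, and model are assumed to match the server started earlier.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000/v1",
    model: str = "meta-llama/Llama-3.3-70B-Instruct",
    max_tokens: int = 256,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the first completion (needs a running server)."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("What is zero trust architecture?"))` should print the generated text.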

Advanced Configuration

GPU Memory Control

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
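To see why `--max-model-len` matters for memory, it helps to estimate how much KV cache a single token occupies. The sketch below is back-of-the-envelope arithmetic, not something vLLM exposes: the helper name is hypothetical, and the model shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption about Llama-3.3-70B that should be checked against the model's config.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3.3-70B shape; verify against the model's config.json.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
per_sequence = per_token * 4096  # one full-length sequence at --max-model-len 4096

print(per_token // 1024, "KiB per token")        # 320 KiB per token
print(per_sequence / 2**30, "GiB per sequence")  # 1.25 GiB per sequence
```

Under these assumptions, each concurrent full-length sequence costs about 1.25 GiB of cache across the GPUs, so lowering `--max-model-len` or raising `--gpu-memory-utilization` directly changes how many sequences fit.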

Multi-GPU Inference

# Tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4

# Pipeline parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --pipeline-parallel-size 2
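When the two flags are combined, the total number of GPUs is the product of both sizes, and the tensor-parallel size has to divide the model's attention head count evenly. The helpers below are hypothetical names for a quick sanity check before launching; the head count of 64 for Llama-3.3-70B is an assumption to verify against the model config.

```python
def gpus_required(tensor_parallel_size: int = 1, pipeline_parallel_size: int = 1) -> int:
    """Total GPUs: each pipeline stage is itself split across the tensor-parallel group."""
    return tensor_parallel_size * pipeline_parallel_size


def heads_split_evenly(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    """Tensor parallelism shards attention heads, so the split must be exact."""
    return num_attention_heads % tensor_parallel_size == 0


NUM_HEADS = 64  # assumed attention head count for Llama-3.3-70B

print(gpus_required(tensor_parallel_size=4, pipeline_parallel_size=2))  # 8
print(heads_split_evenly(NUM_HEADS, 4))  # True
print(heads_split_evenly(NUM_HEADS, 3))  # False
```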

Quantized Models

# AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq

# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-GPTQ \
    --quantization gptq
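A rough estimate shows why 4-bit quantization matters for a 70B model. The sketch below counts weight storage only, assuming a nominal 70 billion parameters; real AWQ and GPTQ checkpoints also store scales and zero points, so actual sizes are somewhat larger. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Weight storage only; ignores quantization metadata, activations, and KV cache."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 70e9  # nominal parameter count, an approximation

fp16_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=16)
int4_gib = weight_memory_gib(NUM_PARAMS, bits_per_weight=4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")  # 16-bit weights: 130.4 GiB
print(f"4-bit weights:  {int4_gib:.1f} GiB")  # 4-bit weights:  32.6 GiB
```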

Docker Deployment

Basic Deployment

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.3-70B-Instruct

Docker Compose

version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --gpu-memory-utilization 0.9
      --max-model-len 8192
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
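A large model can take several minutes to load after the container starts, so scripts that call the API right away tend to fail. The sketch below polls the server's `/health` route until it answers; `server_is_healthy` and `wait_until_ready` are hypothetical helper names, and the timeout and interval values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(base_url: str = "http://localhost:8000") -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or a non-2xx status


def wait_until_ready(
    check=server_is_healthy,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep=time.sleep,
) -> bool:
    """Poll `check` until it succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` after `docker compose up -d` blocks until the server responds or ten minutes pass. The `check` and `sleep` parameters exist only so the loop can be exercised without a real server.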

Python Integration

OpenAI-Compatible Client

from openai import OpenAI

# Point the official OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # not validated unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a security expert"},
        {"role": "user", "content": "What is an XSS attack?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

Streaming Responses

# Reuses the client created in the previous example
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain Docker"}],
    stream=True,
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
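When the full text is needed after streaming, the fragments can be collected into one string. The sketch below is a minimal helper under the assumption that chunks have the `choices[0].delta.content` shape used above; `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only stand in for a real stream so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text fragments of a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # the delta content can be None or empty
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for demonstration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Docker "),
    fake_chunk(None),
    fake_chunk("packages apps."),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))  # Docker packages apps.
```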

Performance Tuning

Batch Size

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256
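The two limits interact: whichever is reached first caps the batch. The sketch below is a deliberately simplified first-come, first-served admission loop, not vLLM's scheduler (it ignores chunked prefill, decode tokens, and preemption), and `admit_requests` is a hypothetical name. It only shows which limit binds for a given workload.

```python
def admit_requests(
    prompt_lens: list[int],
    max_num_batched_tokens: int = 32768,
    max_num_seqs: int = 256,
) -> list[int]:
    """Admit waiting prompts in order until either limit would be exceeded."""
    admitted: list[int] = []
    used_tokens = 0
    for num_tokens in prompt_lens:
        if len(admitted) >= max_num_seqs:
            break  # sequence cap reached
        if used_tokens + num_tokens > max_num_batched_tokens:
            break  # token budget reached
        admitted.append(num_tokens)
        used_tokens += num_tokens
    return admitted


# Many short prompts: the sequence cap binds first.
print(len(admit_requests([100] * 300)))  # 256
# A few long prompts: the token budget binds first.
print(len(admit_requests([4000] * 20)))  # 8
```

For workloads dominated by short prompts, raising `--max-num-seqs` helps more; for long prompts, `--max-num-batched-tokens` is the limit to watch.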

KV Cache Settings

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    block_size=16,  # tokens per KV cache block
    swap_space=4,   # CPU swap space per GPU, in GiB
)
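The effect of these two settings can be estimated with simple arithmetic. The sketch below assumes roughly 320 KiB of KV cache per token for Llama-3.3-70B with a 16-bit cache on an unsharded model; that figure and both helper names are assumptions for illustration, and with tensor parallelism each GPU holds only its share.

```python
import math


def blocks_for_sequence(num_tokens: int, block_size: int = 16) -> int:
    """Blocks needed to hold a sequence; the last block may be partly empty."""
    return math.ceil(num_tokens / block_size)


def swap_blocks(swap_space_gib: float, bytes_per_token: int, block_size: int = 16) -> int:
    """How many KV cache blocks fit into the CPU swap space."""
    block_bytes = bytes_per_token * block_size
    return int(swap_space_gib * 2**30) // block_bytes


BYTES_PER_TOKEN = 327_680  # assumed: 320 KiB per token, unsharded

print(blocks_for_sequence(1000))        # 63
print(swap_blocks(4, BYTES_PER_TOKEN))  # 819
```

Under these assumptions, 4 GiB of swap holds about 819 blocks, or roughly 13 sequences of 1,000 tokens each.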

Model Support

Model family	Examples
Llama	Llama-3.3-70B, Llama-3.1-8B
Mistral	Mistral-7B, Mixtral-8x7B
Qwen	Qwen2.5-72B
DeepSeek	DeepSeek-V3
Gemma	Gemma-2-27B
