inference on Astroicers Blog

llama.cpp: A High-Performance LLM Inference Engine

Thu, 23 Apr 2026 00:00:00 +0000

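Project Overview

llama.cpp is an LLM inference engine implemented in pure C/C++ that runs large language models efficiently even without a GPU. It supports several hardware acceleration backends and is a common choice for local LLM deployment.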

GitHub Stars: 94K+

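Key Features

Pure C/C++ - no Python dependency
CPU optimization - AVX, AVX2, and AVX512 acceleration
GPU support - CUDA, Metal, Vulkan
Quantization - 2- to 8-bit quantization to reduce memory use
GGUF format - an efficient model file format

Installation

Building from Source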

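git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Note: newer llama.cpp releases have dropped the Makefile build;
# if make fails, use the CMake steps in the next section.

# CPU build
make

# CUDA build
make GGML_CUDA=1

# Metal build (macOS)
make GGML_METAL=1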

Using CMake

mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release

Downloading Models

Downloading from Hugging Face

# Install huggingface-cli
pip install huggingface-hub

# Download a GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf

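Common Quantization Levels

Quantization level	Size	Quality	Speed
Q2_K	Smallest	Lower	Fastest
Q4_K_M	Medium	Good	Fast
Q5_K_M	Larger	Excellent	Moderate
Q8_0	Large	Near original	Slower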
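
To make the size and quality trade-off concrete, here is a minimal sketch of block-wise 8-bit quantization in the spirit of Q8_0: weights are grouped into blocks of 32, and each block keeps one scale plus 32 small integer codes. It is an illustration only, not llama.cpp's actual code. The function names are hypothetical, and the K-quants (Q4_K_M and friends) use a more elaborate layout.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_block(values):
    """Quantize one block of floats to int8 codes plus a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and the int8 codes."""
    return [scale * c for c in codes]


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
# Each block stores 32 one-byte codes plus a 2-byte (fp16) scale.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(f"max round-trip error: {max_error:.5f}")
print(f"bits per weight: {bits_per_weight}")
```

The round-trip error is bounded by half the block's scale, which is why 8-bit output stays close to the original while lower bit widths trade more quality for size.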

Basic Usage

Command-Line Inference

# Generate from a prompt (up to 128 tokens)
./llama-cli -m llama-2-7b.Q4_K_M.gguf -p "Hello, how are you?" -n 128

# Conversation mode
./llama-cli -m llama-2-7b.Q4_K_M.gguf --chat-template llama2 -cnv

Parameter Reference

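-m, --model # path to the model file
-p, --prompt # input prompt
-n, --predict # number of tokens to generate
-c, --ctx-size # context size in tokens
-t, --threads # number of CPU threads
-ngl, --gpu-layers # number of layers to offload to the GPU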
Server Mode

Starting the API Server

./llama-server -m llama-2-7b.Q4_K_M.gguf --host 0.0.0.0 --port 8080

OpenAI-Compatible API

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ]
  }'

Python Integration

from openai import OpenAI

# Point the OpenAI client at the local llama-server instance
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="llama-2-7b",
    messages=[
        {"role": "user", "content": "Explain what containerization is"}
    ],
)

print(response.choices[0].message.content)
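
If installing the `openai` package is not an option, the same endpoint can be reached with the standard library alone. This is a minimal sketch assuming the server from the previous section is listening on localhost:8080; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the response layout is assumed to follow the OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Create (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8080/v1", "llama-2-7b", "Explain what containerization is"
)
print(request.get_method(), request.full_url)
```

With a server running, `print(send_chat_request(request))` sends the request and prints the reply.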

GPU Acceleration

CUDA Settings

# Offload all layers to the GPU
./llama-cli -m model.gguf -ngl 99

# Offload only some layers (when GPU memory is limited)
./llama-cli -m model.gguf -ngl 20
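
Choosing a value for `-ngl` is mostly arithmetic. The sketch below gives a rough starting point under two simplifying assumptions: every layer is the same size, and a fixed reserve covers the KV cache and scratch buffers. `estimate_gpu_layers` and its `reserve_gb` default are hypothetical; treat the result as a first guess to refine by watching actual memory use.

```python
import math


def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=0.5):
    """Rough count of layers that fit in VRAM, assuming equal-sized layers.

    reserve_gb is a hypothetical safety margin for the KV cache and
    scratch buffers; tune it for your context size.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, free_vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Llama-2-7B has 32 transformer layers and takes roughly 4 GB at Q4_K_M.
print(estimate_gpu_layers(model_size_gb=4.0, n_layers=32, free_vram_gb=3.0))   # 20
print(estimate_gpu_layers(model_size_gb=4.0, n_layers=32, free_vram_gb=24.0))  # 32
```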

Memory Requirements

Model size	Q4_K_M	Q8_0
7B	~4 GB	~7 GB
13B	~8 GB	~14 GB
70B	~40 GB	~70 GB
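
The figures in the table follow from parameter count times bits per weight. Here is a small sketch of that arithmetic. The 8.5 bits for Q8_0 comes from its block layout, while the 4.85 bits for Q4_K_M is an assumed average, because that format mixes block types across tensors. The estimates cover weights only and land slightly above the rounded values in the table.

```python
# Approximate bits per weight. Q8_0 is 8.5 by construction (32 int8 values
# plus one fp16 scale per block); the Q4_K_M figure is an assumed average.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def estimate_weights_gb(n_params_billion, quant):
    """Estimate weight size in GB (10^9 bytes) for a parameter count."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8


for size in (7, 13, 70):
    q4 = estimate_weights_gb(size, "Q4_K_M")
    q8 = estimate_weights_gb(size, "Q8_0")
    print(f"{size}B  Q4_K_M ~{q4:.1f} GB  Q8_0 ~{q8:.1f} GB")
```

Actual usage is higher once the KV cache and runtime buffers are added, so leave headroom beyond these numbers.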

Performance Optimization

CPU Optimization

# Set the number of threads
./llama-cli -m model.gguf -t 8

# Memory-map the model file (mmap is on by default; --no-mmap turns it off)
./llama-cli -m model.gguf --mmap

Batch Processing

# Increase the batch size and serve 4 requests in parallel
./llama-server -m model.gguf -b 512 --parallel 4

Model Conversion

Converting to GGUF

# Convert from Hugging Face format (write to the file the next step reads)
python convert_hf_to_gguf.py ./model_dir --outtype f16 --outfile model-f16.gguf

# Quantize
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Docker Deployment

docker run -p 8080:8080 \
 -v /path/to/models:/models \
 ghcr.io/ggerganov/llama.cpp:server \
 -m /models/llama-2-7b.Q4_K_M.gguf \
 --host 0.0.0.0 --port 8080

vLLM: A High-Performance LLM Inference Engine

Wed, 15 Apr 2026 00:00:00 +0000

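Project Overview

vLLM is a high-performance engine for LLM inference and serving. Its PagedAttention technique manages the attention KV cache efficiently, which substantially improves throughput and GPU utilization.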

GitHub Stars: 70K+

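Key Features

PagedAttention - efficient memory management
Continuous batching - maximizes throughput
Broad model support - Llama, Mistral, Qwen, and others
OpenAI compatibility - a drop-in replacement for the API
Distributed inference - multi-GPU support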
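
The idea behind PagedAttention can be pictured as a page table for the KV cache: instead of reserving one contiguous region per sequence, memory is handed out in fixed-size blocks as tokens arrive. The following is a toy sketch of that bookkeeping only. `PagedKVCache` is a hypothetical class, not vLLM's implementation, and it tracks block ids rather than real tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lengths = {}    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block if needed."""
        length = self.seq_lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):          # 20 tokens need 2 blocks of 16
    cache.append_token("a")
for _ in range(5):           # 5 tokens need 1 block
    cache.append_token("b")
print(cache.block_tables, len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because a sequence wastes at most one partly filled block, more sequences fit in the same memory, which is where the throughput gain comes from.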

Installation

pip

pip install vllm

Docker

docker pull vllm/vllm-openai:latest

Quick Start

Offline Batch Inference

from vllm import LLM, SamplingParams

# Load the model
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Batch inference
prompts = [
    "What is Kubernetes?",
    "Explain SQL injection attacks",
    "How do you design a high-availability architecture?",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")

Starting the API Server

python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.3-70B-Instruct \
 --port 8000

Using the API

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "prompt": "What is zero trust architecture?",
    "max_tokens": 256,
    "temperature": 0.7
  }'

Advanced Configuration

GPU Memory Control

python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.3-70B-Instruct \
 --gpu-memory-utilization 0.9 \
 --max-model-len 4096
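
To see why `--max-model-len` matters for memory, it helps to work out how much KV cache a single token occupies. The sketch below does that arithmetic. The architecture values for Llama-3.3-70B-Instruct (80 layers, 8 KV heads, head dimension 128) and the 2-byte cache dtype are assumptions here, and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# Assumed architecture values for Llama-3.3-70B-Instruct:
# 80 layers, 8 KV heads (grouped-query attention), head dimension 128.
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_sequence = per_token * 4096          # one sequence at --max-model-len 4096

print(f"{per_token / 1024:.0f} KiB per token")                      # 320 KiB
print(f"{per_sequence / 1024**3:.2f} GiB per 4096-token sequence")  # 1.25 GiB
```

Whatever remains of the `--gpu-memory-utilization` budget after the weights are loaded is what the KV cache can use, so a shorter maximum length leaves room for more concurrent sequences.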

Multi-GPU Inference

# Tensor Parallelism
python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.3-70B-Instruct \
 --tensor-parallel-size 4

# Pipeline Parallelism
python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.3-70B-Instruct \
 --pipeline-parallel-size 2
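
The two flags can be combined, and the GPU count is their product. The sketch below is a sanity check for a planned layout. The divisibility rule reflects a common requirement of tensor parallelism and is an assumption here, as are the model shape and the hypothetical `plan_parallelism` helper. vLLM decides the actual layer partitioning itself.

```python
def plan_parallelism(tensor_parallel_size, pipeline_parallel_size,
                     n_attention_heads, n_layers):
    """Return the GPU count and per-GPU share for a TP x PP layout."""
    if n_attention_heads % tensor_parallel_size != 0:
        raise ValueError("attention heads must divide evenly across TP ranks")
    return {
        "gpus_needed": tensor_parallel_size * pipeline_parallel_size,
        "heads_per_gpu": n_attention_heads // tensor_parallel_size,
        # Ceiling division: an upper bound on layers held by one stage.
        "layers_per_stage": -(-n_layers // pipeline_parallel_size),
    }


# Assumed shape for a Llama-70B-class model: 64 attention heads, 80 layers.
print(plan_parallelism(4, 1, n_attention_heads=64, n_layers=80))
print(plan_parallelism(4, 2, n_attention_heads=64, n_layers=80))
```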

Quantized Models

# AWQ quantization
python -m vllm.entrypoints.openai.api_server \
 --model TheBloke/Llama-2-70B-AWQ \
 --quantization awq

# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
 --model TheBloke/Llama-2-70B-GPTQ \
 --quantization gptq

Docker Deployment

Basic Deployment

docker run --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 8000:8000 \
 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.3-70B-Instruct

Docker Compose

version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --gpu-memory-utilization 0.9
      --max-model-len 8192
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Python Integration

OpenAI-Compatible Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # not checked unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a security expert"},
        {"role": "user", "content": "What is an XSS attack?"},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

Streaming Responses

# Reuses the client created in the previous example
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain Docker"}],
    stream=True,
)

# Print each text fragment as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
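
When the full reply is needed after streaming, the deltas can be collected as they arrive. This is a minimal sketch that runs offline: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the attribute layout used in the loop above.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:            # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                        # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (test helper)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Docker "), fake_chunk(None), fake_chunk("packages apps.")]
print(collect_stream(fake_stream))
```

Passing the real `stream` object to `collect_stream` returns the whole reply as one string.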

Performance Tuning

Batch Size

python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.3-70B-Instruct \
 --max-num-batched-tokens 32768 \
 --max-num-seqs 256
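
The two limits act as separate budgets: one caps the tokens processed in a scheduling step, the other caps the number of sequences. The toy sketch below admits waiting requests in arrival order until either budget runs out. It is a simplification for intuition only. `admit_requests` is hypothetical, and vLLM's real scheduler is considerably more elaborate.

```python
def admit_requests(waiting, max_num_batched_tokens, max_num_seqs):
    """Pick requests, in arrival order, that fit in one scheduling step.

    `waiting` is a list of (request_id, prompt_tokens) pairs.
    """
    admitted, tokens_used = [], 0
    for request_id, prompt_tokens in waiting:
        if len(admitted) == max_num_seqs:
            break                                   # sequence cap reached
        if tokens_used + prompt_tokens > max_num_batched_tokens:
            break                                   # token budget exhausted
        admitted.append(request_id)
        tokens_used += prompt_tokens
    return admitted, tokens_used


queue = [("r1", 12000), ("r2", 15000), ("r3", 9000), ("r4", 300)]
print(admit_requests(queue, max_num_batched_tokens=32768, max_num_seqs=256))
print(admit_requests(queue, max_num_batched_tokens=32768, max_num_seqs=1))
```

Raising either limit admits more work per step at the cost of more GPU memory, so the two are usually tuned together with the KV cache budget.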

KV Cache Settings

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    block_size=16,   # tokens per KV cache block
    swap_space=4,    # CPU swap space in GB
)

Supported Models

Model family	Examples
Llama	Llama-3.3-70B, Llama-3.1-8B
Mistral	Mistral-7B, Mixtral-8x7B
Qwen	Qwen2.5-72B
DeepSeek	DeepSeek-V3
Gemma	Gemma-2-27B
