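In modern containerized applications, health checks are a key mechanism for keeping services running reliably. This article takes an in-depth look at implementing health checks in Docker and Docker Compose to help you build a more dependable container environment.
Health Check Concepts and Why They Matter
What is a health check?
A health check is a mechanism that periodically verifies whether the application inside a container is working. Docker runs the configured check command at a fixed interval and derives the container's health status from the command's exit code: 0 means the check passed, 1 means it failed.
Why do you need health checks?
- Early problem detection: catch anomalies before they affect users
- Automated recovery: let an orchestrator restart or replace failed containers
- Service dependency management: start downstream services only after their dependencies are actually ready
- Load balancer integration: let the load balancer route traffic only to healthy instances
- Monitoring and alerting: get real-time visibility into service health
Container health states
Docker defines three health states: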
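| State | Description |
|---|---|
| starting | The container is starting; no health check has passed yet |
| healthy | The most recent health check passed |
| unhealthy | The health check failed the configured number of consecutive times |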
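The transitions between these states can be sketched as a small simulation. This is a simplified model for illustration, not Docker's own code; the function name `healthTimeline`, the `at` and `ok` fields, and the sample values are all hypothetical.

```javascript
// Simplified model of how a container's health status evolves.
// Each check has `at` (seconds since container start) and `ok` (did it pass).
function healthTimeline(checks, { retries = 3, startPeriod = 0 } = {}) {
  let status = 'starting';
  let failingStreak = 0;
  const timeline = [];

  for (const { at, ok } of checks) {
    if (ok) {
      // A single passing check resets the streak and marks the container healthy.
      failingStreak = 0;
      status = 'healthy';
    } else if (status === 'starting' && at < startPeriod) {
      // Failures inside the start period are not counted.
    } else {
      failingStreak += 1;
      if (failingStreak >= retries) status = 'unhealthy';
    }
    timeline.push(status);
  }
  return timeline;
}

console.log(
  healthTimeline(
    [
      { at: 5, ok: false },
      { at: 35, ok: true },
      { at: 65, ok: false },
      { at: 95, ok: false },
      { at: 125, ok: false },
      { at: 155, ok: true },
    ],
    { retries: 3, startPeriod: 10 }
  )
);
// [ 'starting', 'healthy', 'healthy', 'healthy', 'unhealthy', 'healthy' ]
```

Note that a container only becomes unhealthy after the full run of consecutive failures, and that one passing check is enough to bring it back to healthy.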
Dockerfile HEALTHCHECK Instruction
Basic syntax
In a Dockerfile, use the HEALTHCHECK instruction to define a health check:

```dockerfile
HEALTHCHECK [OPTIONS] CMD command
```
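Available options
| Option | Default | Description |
|---|---|---|
| --interval | 30s | Time between checks |
| --timeout | 30s | Maximum time a single check may run before it counts as a failure |
| --start-period | 0s | Grace period after container start; failures during it do not count toward the retry limit |
| --start-interval | 5s | Time between checks during the start period (Docker 25.0+) |
| --retries | 3 | Consecutive failures required before the container is marked unhealthy |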
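These options together determine how long a broken container can keep reporting healthy. The sketch below estimates that window under a simplified model: each check is scheduled one interval after the previous one finishes, and the start period is ignored. The function name `detectionWindow` is hypothetical.

```javascript
// Rough bounds, in seconds, on how long a fault goes unnoticed.
function detectionWindow({ interval = 30, timeout = 30, retries = 3 } = {}) {
  return {
    // Best case: the fault appears just before a check and every failing
    // check returns immediately.
    fastest: (retries - 1) * interval,
    // Worst case: the fault appears just after a passing check and every
    // failing check hangs until the timeout.
    slowest: retries * (interval + timeout),
  };
}

console.log(detectionWindow());
// { fastest: 60, slowest: 180 }
console.log(detectionWindow({ interval: 30, timeout: 10, retries: 3 }));
// { fastest: 60, slowest: 120 }
```

With the defaults, a hung service may take up to about three minutes to be flagged, which is worth keeping in mind when choosing values for latency-sensitive services.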
Practical examples
Node.js application

```dockerfile
FROM node:20-alpine

WORKDIR /app

COPY package*.json ./
# --omit=dev replaces the deprecated --only=production
RUN npm ci --omit=dev

COPY . .

EXPOSE 3000

# Exit 0 only when /health answers 200; connection errors exit 1
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"

CMD ["node", "server.js"]
```
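An inline one-liner is hard to read and maintain, so the check can also live in its own script. The following is a minimal sketch, assuming Node.js 18 or later (for the global `fetch` and `AbortSignal.timeout`); the file name `healthcheck.js` and the function name `probe` are hypothetical. Unlike the inline version, it accepts any 2xx status rather than only 200.

```javascript
// healthcheck.js (hypothetical file name)
// Resolves to 0 when the endpoint answers with a 2xx status, 1 otherwise.
async function probe(url, timeoutMs = 5000) {
  try {
    // Abort the request if the server accepts the connection but never answers.
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok ? 0 : 1;
  } catch (err) {
    // Connection refused, DNS failure, or timeout.
    return 1;
  }
}

// In the real script the last line would be:
//   probe('http://localhost:3000/health').then((code) => process.exit(code));
```

The Dockerfile would then call it with something like CMD ["node", "healthcheck.js"] in the HEALTHCHECK instruction, provided the script is copied into the image.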
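Python Flask application

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# The slim image does not ship curl, so install it for the health check
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 5000

HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
  CMD curl --fail http://localhost:5000/health || exit 1

CMD ["python", "app.py"]
```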
Using wget as an alternative
If the container has no curl, you can use wget instead:

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
```
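Disabling the health check
If the base image already defines a health check and you want to turn it off, add HEALTHCHECK NONE to your Dockerfile.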
Docker Compose healthcheck Configuration
Basic structure
In docker-compose.yml, the health check is configured under the service definition:

```yaml
version: "3.9"

services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
```
Formats of the test command
The test field accepts several forms:

```yaml
# Form 1: string (run through the shell, equivalent to CMD-SHELL)
healthcheck:
  test: curl -f http://localhost/health || exit 1

# Form 2: list starting with CMD (executed directly without a shell; recommended)
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]

# Form 3: list starting with CMD-SHELL (run through the shell)
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost/health || exit 1"]

# Disable the health check
healthcheck:
  test: ["NONE"]
```
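How the forms differ can be sketched as a small classifier. This is an illustration of the rules described above, not Compose's actual parser; the function name `describeTest` and the returned field names are hypothetical.

```javascript
// Classifies a healthcheck `test` value the way the forms above describe.
function describeTest(test) {
  if (typeof test === 'string') {
    // A plain string is shorthand for ["CMD-SHELL", "<string>"].
    return { mode: 'shell', command: test };
  }
  if (!Array.isArray(test) || test.length === 0) {
    throw new TypeError('test must be a string or a non-empty list');
  }
  const [kind, ...args] = test;
  switch (kind) {
    case 'NONE':
      return { mode: 'disabled' };
    case 'CMD':
      // Executed directly: no shell, so `||`, pipes, and `$VAR` are not interpreted.
      return { mode: 'exec', argv: args };
    case 'CMD-SHELL':
      return { mode: 'shell', command: args[0] };
    default:
      throw new Error(`unknown test type: ${kind}`);
  }
}

console.log(describeTest(['CMD', 'curl', '-f', 'http://localhost/health']).mode); // exec
console.log(describeTest('curl -f http://localhost/health || exit 1').mode); // shell
```

The practical consequence is that anything relying on shell syntax, such as `|| exit 1` or a pipe into grep, needs the string form or CMD-SHELL.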
Complete configuration example

```yaml
version: "3.9"

services:
  api:
    build: ./api
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    environment:
      - NODE_ENV=production
```
Health Check Examples for Common Services
Database services
PostgreSQL

```yaml
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
```
MySQL / MariaDB

```yaml
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: rootpassword
      MYSQL_DATABASE: myapp
      MYSQL_USER: admin
      MYSQL_PASSWORD: secret
    healthcheck:
      # CMD-SHELL plus $$ makes the variable expand inside the container;
      # a single $ would be interpolated by Compose from the host environment
      test: ["CMD-SHELL", "mysqladmin ping -h localhost -u root -p\"$$MYSQL_ROOT_PASSWORD\""]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 60s
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  mysql_data:
```
MongoDB

```yaml
services:
  mongodb:
    image: mongo:7
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: secret
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    volumes:
      - mongo_data:/data/db

volumes:
  mongo_data:
```
Cache services
Redis

```yaml
services:
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 5s
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```
Memcached

```yaml
services:
  memcached:
    image: memcached:1.6-alpine
    healthcheck:
      test: ["CMD-SHELL", "echo stats | nc localhost 11211 | grep -q 'STAT pid'"]
      interval: 10s
      timeout: 5s
      retries: 3
```
Message queues
RabbitMQ

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management-alpine
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: secret
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s
    ports:
      - "5672:5672"
      - "15672:15672"
```
Apache Kafka

```yaml
services:
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
    healthcheck:
      test: ["CMD-SHELL", "kafka-broker-api-versions --bootstrap-server localhost:9092"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s
    depends_on:
      # Assumes a zookeeper service with its own healthcheck is defined elsewhere
      zookeeper:
        condition: service_healthy
```
Web servers
Nginx

```yaml
services:
  nginx:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
```
The matching Nginx configuration (a health endpoint has to be added):

```nginx
server {
    listen 80;

    location /health {
        access_log off;
        # default_type sets the Content-Type of the body returned below
        default_type text/plain;
        return 200 "healthy\n";
    }

    location / {
        # Other settings...
    }
}
```
Elasticsearch

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    healthcheck:
      # Passes when the cluster status is green or yellow
      test: ["CMD-SHELL", "curl -s http://localhost:9200/_cluster/health | grep -q '\"status\":\"green\"\\|\"status\":\"yellow\"'"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s
    volumes:
      - es_data:/usr/share/elasticsearch/data

volumes:
  es_data:
```
Combining depends_on with condition
Controlling service startup order
Current versions of Docker Compose, which implement the Compose Specification, accept a condition under depends_on to control startup order:

```yaml
version: "3.9"

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  cache:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 5

  api:
    build: ./api
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
    environment:
      DATABASE_URL: postgres://postgres:secret@db:5432/postgres
      REDIS_URL: redis://cache:6379
```
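Available condition values
| Condition | Description |
|---|---|
| service_started | The service has been started (default behavior) |
| service_healthy | The service's health check has passed |
| service_completed_successfully | The service ran to completion successfully (for one-off tasks) |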
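The ordering that depends_on implies can be sketched as a depth-first topological sort. This only illustrates the idea: Compose starts independent services concurrently and waits for each condition before starting a dependent service. The function name `startOrder` and the `dependsOn` shape are hypothetical.

```javascript
// Returns one valid start order for a map of services and their dependencies.
function startOrder(services) {
  const order = [];
  const state = new Map(); // name -> 'visiting' | 'done'

  function visit(name, path) {
    if (state.get(name) === 'done') return;
    if (state.get(name) === 'visiting') {
      throw new Error(`dependency cycle: ${[...path, name].join(' -> ')}`);
    }
    if (!Object.hasOwn(services, name)) {
      throw new Error(`unknown service: ${name}`);
    }
    state.set(name, 'visiting');
    // Every dependency has to be started before the service itself.
    for (const dep of Object.keys(services[name].dependsOn ?? {})) {
      visit(dep, [...path, name]);
    }
    state.set(name, 'done');
    order.push(name);
  }

  for (const name of Object.keys(services)) visit(name, []);
  return order;
}

const stack = {
  api: { dependsOn: { db: 'service_healthy', cache: 'service_healthy' } },
  db: {},
  cache: {},
};
console.log(startOrder(stack));
// [ 'db', 'cache', 'api' ]
```

A cycle, such as two services that each wait for the other to become healthy, has no valid order, which is why the sketch throws in that case.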
Example with complex dependencies

```yaml
version: "3.9"

services:
  # Database migration task
  db-migration:
    build: ./migration
    command: npm run migrate
    depends_on:
      db:
        condition: service_healthy
    restart: "no"

  # Main API service
  api:
    build: ./api
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
      db-migration:
        condition: service_completed_successfully
    ports:
      - "3000:3000"

  # Background worker
  # (service_healthy requires api to define a health check, in its image or here)
  worker:
    build: ./worker
    depends_on:
      api:
        condition: service_healthy
      queue:
        condition: service_healthy

  # Infrastructure services
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  cache:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 5

  queue:
    image: rabbitmq:3-alpine
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 30s
```
Monitoring Health Status
Viewing container health status

```bash
# List all running containers (the STATUS column includes the health state)
docker ps

# Example output:
# CONTAINER ID   IMAGE          STATUS                   NAMES
# abc123         myapp:latest   Up 5 minutes (healthy)   myapp_api_1
# def456         postgres:16    Up 5 minutes (healthy)   myapp_db_1
```
Inspecting detailed health information

```bash
# Show the detailed health check information for a container
docker inspect --format='{{json .State.Health}}' container_name | jq

# Example output:
# {
#   "Status": "healthy",
#   "FailingStreak": 0,
#   "Log": [
#     {
#       "Start": "2025-09-08T10:00:00.000000000Z",
#       "End": "2025-09-08T10:00:01.000000000Z",
#       "ExitCode": 0,
#       "Output": "OK"
#     }
#   ]
# }
```
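If you want to process this JSON in a script rather than read it by eye, a small helper can condense it into one line. The sketch below assumes the structure shown in the example output above; the function name `summarizeHealth` is hypothetical.

```javascript
// Condenses the .State.Health object from `docker inspect` into one line.
function summarizeHealth(health) {
  // Containers without a health check report null for .State.Health.
  if (!health) return 'no health check configured';

  const log = health.Log ?? [];
  const last = log[log.length - 1];
  const detail = last
    ? `last exit code ${last.ExitCode}, output: ${String(last.Output).trim()}`
    : 'no checks recorded yet';
  return `${health.Status} (failing streak ${health.FailingStreak}); ${detail}`;
}

const sample = {
  Status: 'healthy',
  FailingStreak: 0,
  Log: [
    {
      Start: '2025-09-08T10:00:00.000000000Z',
      End: '2025-09-08T10:00:01.000000000Z',
      ExitCode: 0,
      Output: 'OK',
    },
  ],
};
console.log(summarizeHealth(sample));
// healthy (failing streak 0); last exit code 0, output: OK
```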
Using docker compose commands

```bash
# Show service status
docker compose ps

# Follow the logs of a specific service
docker compose logs -f api

# Show the health check output
docker inspect $(docker compose ps -q api) --format='{{range .State.Health.Log}}{{.Output}}{{end}}'
```
Writing a monitoring script

```bash
#!/bin/bash
# health-monitor.sh - report the health status of all containers

echo "=== Container Health Status ==="
echo ""
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

echo ""
echo "=== Unhealthy Containers ==="
unhealthy=$(docker ps --filter "health=unhealthy" --format "{{.Names}}")

if [ -z "$unhealthy" ]; then
    # Containers may still be starting or have no health check at all
    echo "No unhealthy containers found."
else
    echo "$unhealthy"

    # Show the recorded health check log for each unhealthy container
    for container in $unhealthy; do
        echo ""
        echo "--- $container ---"
        docker inspect --format='{{range .State.Health.Log}}Exit: {{.ExitCode}} | Output: {{.Output}}{{"\n"}}{{end}}' "$container"
    done
fi
```
Listening to Docker events

```bash
# Listen for health status change events
docker events --filter event=health_status

# Example output:
# 2025-09-08T10:00:00.000000000Z container health_status: healthy abc123 (name=myapp_api_1)
# 2025-09-08T10:05:00.000000000Z container health_status: unhealthy def456 (name=myapp_worker_1)
```
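To feed these events into an alerting pipeline, each line has to be turned into structured data. The sketch below parses lines shaped like the example output above; real output can carry more attributes inside the parentheses, and asking docker events for JSON through its --format option is likely more robust. The names `parseHealthEvent` and `EVENT_PATTERN` are hypothetical.

```javascript
// Matches: <time> container health_status: <status> <id> (<attributes>)
const EVENT_PATTERN = /^(\S+) container health_status: (\w+) (\S+) \((.*)\)$/;

// Returns { time, status, id, name } or null when the line does not match.
function parseHealthEvent(line) {
  const match = EVENT_PATTERN.exec(line.trim());
  if (!match) return null;

  const [, time, status, id, attrs] = match;
  // Attributes are assumed to be a comma-separated list of key=value pairs.
  const name = /(?:^|, )name=([^,]+)/.exec(attrs)?.[1] ?? null;
  return { time, status, id, name };
}

const event = parseHealthEvent(
  '2025-09-08T10:05:00.000000000Z container health_status: unhealthy def456 (name=myapp_worker_1)'
);
console.log(event.status, event.name);
// unhealthy myapp_worker_1
```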
Automatic Restart and Recovery Strategies
Restart policies
Docker Compose offers several restart policies:

```yaml
services:
  api:
    image: myapp:latest
    restart: always  # or "on-failure", "unless-stopped", "no"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
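| Policy | Description |
|---|---|
| no | Never restart automatically (default) |
| always | Always restart, unless the container was stopped manually |
| on-failure | Restart only when the container exits with a non-zero code |
| unless-stopped | Restart unless the container was explicitly stopped |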
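The table can be read as a decision made when a container's main process exits. The sketch below models that decision in a simplified way; it ignores the difference between always and unless-stopped after a daemon restart, and the function name `shouldRestart` and its parameters are hypothetical. Health status is deliberately not an input: on a standalone Docker engine, a restart policy reacts to a container exiting, not to a running container becoming unhealthy.

```javascript
// Simplified restart decision taken when a container's main process exits.
function shouldRestart(policy, { exitCode, stoppedByUser = false }) {
  // A manually stopped container is left alone by every policy.
  if (stoppedByUser) return false;

  switch (policy) {
    case 'no':
      return false;
    case 'on-failure':
      return exitCode !== 0;
    case 'always':
    case 'unless-stopped':
      return true;
    default:
      throw new Error(`unknown restart policy: ${policy}`);
  }
}

console.log(shouldRestart('on-failure', { exitCode: 1 })); // true
console.log(shouldRestart('on-failure', { exitCode: 0 })); // false
console.log(shouldRestart('always', { exitCode: 0 })); // true
```

This is why an unhealthy but still running container needs either an orchestrator such as Swarm or an external tool to be restarted.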
Advanced strategies with Docker Swarm
In Swarm mode, where tasks whose containers become unhealthy are replaced, more advanced deployment settings are available:

```yaml
version: "3.9"

services:
  api:
    image: myapp:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      rollback_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
```
Using the autoheal container
autoheal is a tool that automatically restarts unhealthy containers:

```yaml
version: "3.9"

services:
  autoheal:
    image: willfarrell/autoheal:latest
    environment:
      AUTOHEAL_CONTAINER_LABEL: all
      AUTOHEAL_INTERVAL: 5
      AUTOHEAL_START_PERIOD: 60
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

  api:
    image: myapp:latest
    labels:
      autoheal: "true"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
Custom health check and restart script

```bash
#!/bin/bash
# auto-restart-unhealthy.sh

COMPOSE_FILE="docker-compose.yml"
MAX_RESTART_ATTEMPTS=3
RESTART_DELAY=30

declare -A restart_counts

while true; do
    # Read each service's health through docker inspect; the --filter flag of
    # docker compose ps documents only the status filter
    for service in $(docker compose -f "$COMPOSE_FILE" ps --services); do
        ids=$(docker compose -f "$COMPOSE_FILE" ps -q "$service")
        [ -z "$ids" ] && continue

        # One line per container; empty when no health check is defined
        health=$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{end}}' $ids)
        echo "$health" | grep -qx "unhealthy" || continue

        current_count=${restart_counts[$service]:-0}

        if [ "$current_count" -lt "$MAX_RESTART_ATTEMPTS" ]; then
            echo "$(date): Restarting unhealthy service: $service (attempt $((current_count + 1)))"
            docker compose -f "$COMPOSE_FILE" restart "$service"
            restart_counts[$service]=$((current_count + 1))
        else
            echo "$(date): Service $service exceeded max restart attempts, alerting..."
            # Add alerting logic here (for example, send a Slack notification)
        fi
    done

    sleep "$RESTART_DELAY"
done
```
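Troubleshooting and Best Practices
Troubleshooting common problems
1. The health check keeps failing

```bash
# Verify that the health check command itself works
docker exec -it container_name sh -c "curl -f http://localhost:3000/health"

# View the most recent health check log entries
docker inspect container_name --format='{{json .State.Health.Log}}' | jq '.[-5:]'

# Open a shell inside the container to debug
docker exec -it container_name sh
```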
2. start_period is set too low
If the service takes a long time to start, increase start_period accordingly:

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 120s  # allow enough time for startup
```
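3. Network problems cause the check to fail

```yaml
# The check runs inside the container itself, so target localhost
# rather than the service name
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]  # correct
  # test: ["CMD", "curl", "-f", "http://api:3000/health"]  # wrong
```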
4. The tool used by the check command is missing

```dockerfile
# Option 1: install curl
FROM node:20-alpine
RUN apk add --no-cache curl

# Option 2: use wget (available by default on Alpine)
HEALTHCHECK CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# Option 3: use the language runtime (a rejected request also exits 1)
HEALTHCHECK CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
```
Best practices
1. Design a dedicated health check endpoint

```javascript
// Node.js Express example
// Assumes `app` (an Express app), `db`, and `redis` clients are created elsewhere
app.get('/health', async (req, res) => {
  try {
    // Check the database connection
    await db.query('SELECT 1');

    // Check the cache connection
    await redis.ping();

    res.status(200).json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      checks: {
        database: 'ok',
        cache: 'ok'
      }
    });
  } catch (error) {
    res.status(503).json({
      status: 'unhealthy',
      error: error.message
    });
  }
});

// Lightweight liveness check
app.get('/health/live', (req, res) => {
  res.status(200).send('OK');
});

// Readiness check
app.get('/health/ready', async (req, res) => {
  // Check all dependent services...
});
```
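The readiness handler above is left as a stub. One way to fill it in is a helper that runs every dependency check with its own timeout and reports the results individually. This is a minimal sketch, not the original author's implementation; the function name `runChecks`, the result strings, and the 2000 ms default are hypothetical.

```javascript
// Runs named async checks in parallel, each with its own timeout.
async function runChecks(checks, timeoutMs = 2000) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
      });
      try {
        // Whichever settles first wins: the check itself or the timeout.
        await Promise.race([check(), timeout]);
        return [name, 'ok'];
      } catch (err) {
        return [name, `failed: ${err.message}`];
      } finally {
        clearTimeout(timer);
      }
    })
  );

  const ready = entries.every(([, result]) => result === 'ok');
  return { status: ready ? 'ready' : 'not ready', checks: Object.fromEntries(entries) };
}

// Possible use inside the readiness handler (db and redis as in the example above):
//   const report = await runChecks({
//     database: () => db.query('SELECT 1'),
//     cache: () => redis.ping(),
//   });
//   res.status(report.status === 'ready' ? 200 : 503).json(report);
```

Bounding each check separately keeps one slow dependency from pushing the whole endpoint past the health check's own timeout.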
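2. Choose sensible timing values

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
  interval: 30s      # not too frequent, to avoid extra load
  timeout: 10s       # allow enough time for a response
  retries: 3         # avoid marking unhealthy after a single failure
  start_period: 30s  # adjust to the application's startup time
```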
3. Separate liveness from readiness checks

```yaml
services:
  api:
    image: myapp:latest
    healthcheck:
      # Use the lightweight liveness check
      test: ["CMD", "curl", "-f", "http://localhost:3000/health/live"]
      interval: 10s
      timeout: 5s
      retries: 3
```
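4. Avoid side effects in health checks
A health check should be free of side effects: it should only read state and never modify it.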
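5. Log health check information

```yaml
services:
  api:
    image: myapp:latest
    healthcheck:
      # Fail when curl fails (a plain pipe into tee would report tee's exit code),
      # then copy the response to the container's stdout.
      # $$ stops Compose from interpolating the shell variable.
      test: ["CMD-SHELL", "out=$$(curl -sf http://localhost:3000/health) || exit 1; echo \"$$out\" | tee /proc/1/fd/1"]
```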
6. Complete production example

```yaml
version: "3.9"

services:
  # Reverse proxy
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    healthcheck:
      # Note: nginx -t only validates the configuration, not that requests are served
      test: ["CMD", "nginx", "-t"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      api:
        condition: service_healthy
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    restart: unless-stopped

  # API service
  api:
    build: ./api
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 10s
      retries: 3
      start_period: 30s
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      NODE_ENV: production
      DATABASE_URL: postgres://admin:secret@db:5432/myapp
      REDIS_URL: redis://redis:6379
    restart: unless-stopped

  # Background worker
  worker:
    build: ./worker
    healthcheck:
      # [n]ode keeps the pattern from matching the shell that runs this check
      test: ["CMD-SHELL", "pgrep -f '[n]ode worker.js' || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    depends_on:
      api:
        condition: service_healthy
    environment:
      NODE_ENV: production
    restart: unless-stopped

  # PostgreSQL
  db:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped

  # Redis
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    volumes:
      - redis_data:/data
    restart: unless-stopped

  # Automatically restart unhealthy containers
  autoheal:
    image: willfarrell/autoheal:latest
    environment:
      AUTOHEAL_CONTAINER_LABEL: all
      AUTOHEAL_INTERVAL: 5
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

volumes:
  postgres_data:
  redis_data:
```
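Summary
Docker Compose health checks are an important mechanism for building reliable containerized applications. With the right configuration, you can:
- Make sure services are truly ready: use depends_on together with condition: service_healthy
- Detect problems quickly: periodic health checks surface anomalies early
- Automate recovery: combine restart policies or the autoheal tool to handle failures automatically
- Improve observability: health status monitoring shows the overall state of the system
Remember that health checks are not a set-and-forget configuration; they need continuous tuning based on how the system actually behaves. Start with simple checks, widen their coverage step by step, and review their effectiveness regularly.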