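Kubernetes Liveness and Readiness Probes

Introduction

When you run applications in a Kubernetes cluster, keeping containers healthy is key to service reliability. Kubernetes provides three probe mechanisms for monitoring container health: the Liveness Probe, the Readiness Probe, and the Startup Probe. This article covers what each probe is for, how to configure it, and the best practices around it.

Probe Overview

A probe is a periodic diagnostic that the kubelet runs against a container. Kubernetes uses the result to judge the container's state and act on it, for example by restarting the container or removing it from Service endpoints.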

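Probe Type Comparison

Probe type	Purpose	Behavior on failure
Liveness Probe	Detects whether the container is still running properly	The container is restarted, subject to its restart policy
Readiness Probe	Detects whether the container is ready to receive traffic	The Pod is removed from Service endpoints
Startup Probe	Detects whether the application has finished starting	Other probes are held off until it succeeds; if it exhausts its failure threshold, the container is killed and restarted per its restart policy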

Liveness vs Readiness vs Startup

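Liveness Probe

The Liveness Probe determines whether a container is still running. If the probe fails, the kubelet kills the container, and the container's restart policy decides whether it is started again.

Typical scenarios:

The application is deadlocked
The application has exhausted its threads
The application cannot recover on its own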

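Readiness Probe

The Readiness Probe determines whether a container is ready to receive requests. The Pod is included in a Service's endpoint list only while the probe succeeds.

Typical scenarios:

The application needs to load a large amount of data
The application depends on external services
The application needs a warm-up phase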

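Startup Probe

The Startup Probe determines whether the application inside the container has started. Liveness and readiness probes are disabled until the Startup Probe succeeds.

Typical scenarios:

Legacy applications with long startup times
Applications with a complex initialization process
Cases where the Liveness Probe must not kill the container prematurely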

HTTP Probe Configuration

The HTTP probe is the most commonly used probe type. It checks health by sending an HTTP GET request to the container.

apiVersion: v1
kind: Pod
metadata:
  name: http-probe-demo
  labels:
    app: web
spec:
  containers:
  - name: web-app
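    # Stock nginx returns 404 for /healthz and /ready, which fails both probes;
    # this demo assumes an image configured to serve those paths.
    image: nginx:1.24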
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 3

The HTTP probe judges the result from the HTTP status code of the response:

200-399: the probe succeeds
Any other status code: the probe fails
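
The status-code rule above can be written as a one-line check. The following is only an illustrative Python sketch of that rule, not kubelet code; the function name `http_probe_succeeded` is made up for this example.

```python
def http_probe_succeeded(status_code: int) -> bool:
    """Apply the rule above: 200-399 counts as success, anything else fails."""
    return 200 <= status_code < 400


# Classify a few representative status codes.
for code in (200, 204, 301, 399, 400, 404, 503):
    verdict = "success" if http_probe_succeeded(code) else "failure"
    print(code, verdict)
```

Note that a redirect status such as 301 still falls inside the success range, while a 404 from a missing health endpoint counts as a failure.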

TCP Probe Configuration

The TCP probe checks container health by attempting to open a TCP connection, which suits services that expose no HTTP endpoint.

apiVersion: v1
kind: Pod
metadata:
  name: tcp-probe-demo
  labels:
    app: database
spec:
  containers:
  - name: mysql
    image: mysql:8.0
    ports:
    - containerPort: 3306
    env:
    # Demo only: load the password from a Secret in real deployments.
    - name: MYSQL_ROOT_PASSWORD
      value: "password123"
    livenessProbe:
      tcpSocket:
        port: 3306
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      tcpSocket:
        port: 3306
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3

TCP probes are suitable for:

Database services (MySQL, PostgreSQL, Redis)
Message queues (RabbitMQ, Kafka)
Other TCP-based services
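
To make the idea of a TCP check concrete, here is a minimal Python sketch of the same kind of test: try to open a connection and report whether it succeeded within a timeout. It only approximates what the kubelet does, and the function name `tcp_probe` and its parameters are assumptions made for this example.

```python
import socket


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

For example, `tcp_probe("127.0.0.1", 3306)` reports whether something accepts connections on the MySQL port. An open port shows only that the process is listening, not that it can serve queries, which is worth remembering when a TCP probe backs a readiness check.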

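Exec Probe Configuration

The exec probe runs a specified command inside the container and judges health from the command's exit code: 0 means success, and any other code means failure.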

apiVersion: v1
kind: Pod
metadata:
  name: exec-probe-demo
  labels:
    app: backend
spec:
  containers:
  - name: app
    image: busybox:1.36
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
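    readinessProbe:
      exec:
        # This demo never creates /tmp/ready, so the Pod stays NotReady until
        # you create the file yourself (for example with kubectl exec).
        command:
        - sh
        - -c
        - "test -f /tmp/ready"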
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 3

Exec probes are suitable for:

Running complex health-check logic
Checking whether a file or resource exists
Running a custom health-check script
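
The exit-code convention is easy to try outside a cluster. The sketch below runs a command and treats exit status 0 as healthy, which mirrors how an exec probe is judged; it is a local illustration only, and the name `exec_probe` is hypothetical.

```python
import subprocess


def exec_probe(command: list[str], timeout_seconds: float = 1.0) -> bool:
    """Run command and report success only when it exits with status 0."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A command that hangs past the timeout or cannot be started fails.
        return False
    return result.returncode == 0
```

Calling `exec_probe(["cat", "/tmp/healthy"])` on a machine where that file exists returns True, and it returns False once the file is removed, the same signal the liveness probe in the manifest above relies on.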

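Probe Parameters

All probe types support the following configuration parameters:

Parameter	Description	Default
initialDelaySeconds	Seconds to wait after the container starts before the first probe	0
periodSeconds	Interval between probes, in seconds	10
timeoutSeconds	Seconds after which a probe times out	1
successThreshold	Consecutive successes required, after a failure, for the probe to count as successful (must be 1 for liveness and startup probes)	1
failureThreshold	Consecutive failures required for the probe to count as failed	3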
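
The two thresholds are easier to reason about with a small simulation. The Python sketch below is a simplified model of the "consecutive results" rule from the table, not the kubelet's implementation; `probe_states` and its arguments are names invented for this example.

```python
def probe_states(results, success_threshold=1, failure_threshold=3, initial=True):
    """Yield the judged state (True = healthy) after each probe result.

    results is an iterable of booleans, True for a passing probe. The state
    flips only after enough consecutive results of the opposite kind.
    """
    state = initial
    streak = 0  # consecutive results that disagree with the current state
    for ok in results:
        if ok == state:
            streak = 0
        else:
            streak += 1
            threshold = success_threshold if ok else failure_threshold
            if streak >= threshold:
                state = ok
                streak = 0
        yield state


# Two failures, a success, then three failures in a row (failureThreshold=3).
print(list(probe_states([False, False, True, False, False, False])))
# [True, True, True, True, True, False]
```

The single success in the middle resets the failure count, so the state changes only after the final run of three failures.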

Complete Configuration Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  labels:
    app: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:1.0
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3

Common Configuration Mistakes

1. initialDelaySeconds is too short

# Bad example: the application needs 30 seconds to start, but the probe waits only 5 seconds
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5  # Too short!
  failureThreshold: 3

The container is then judged unhealthy and restarted before the application finishes starting, which produces a restart loop.
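
A rough timeline shows why the example above loops. The Python sketch below assumes every probe fails immediately, uses the default periodSeconds of 10, and ignores timeouts and kubelet scheduling jitter, so treat the number as an approximation; `restart_time_seconds` is a hypothetical helper.

```python
def restart_time_seconds(initial_delay, period=10, failure_threshold=3):
    """Approximate time of the failing probe that triggers a restart.

    Assumes the first probe runs at initial_delay and every probe fails.
    """
    return initial_delay + (failure_threshold - 1) * period


startup_needed = 30  # seconds the application needs, from the example above
restart_at = restart_time_seconds(initial_delay=5)
print(restart_at)                   # 25
print(restart_at < startup_needed)  # True
```

Under these assumptions the probes fail at roughly 5, 15, and 25 seconds, so the container is killed about 5 seconds before the application would have been ready, and the same thing happens again after every restart.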

2. timeoutSeconds is too short

# Bad example: the health endpoint takes 3 seconds to respond, but the timeout is 1 second
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 1  # Too short!

3. Liveness and Readiness use identical settings

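# Bad example: the two probes should not share the same configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 1  # Too aggressive: a single failed probe restarts the container
readinessProbe:
  httpGet:
    path: /health  # Should use a separate endpoint
    port: 8080
  periodSeconds: 5  # Should probe more often than liveness
  failureThreshold: 1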

4. Health checks that overlook external dependencies

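# Bad example: a Liveness Probe should not check external dependencies
livenessProbe:
  httpGet:
    path: /health  # This endpoint checks the database connection
    port: 8080

If the Liveness Probe checks an external dependency, an outage of that dependency fails the probe on every Pod at once and restarts them all, even though a restart cannot fix the dependency.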

Best Practices

1. Separate the health-check endpoints

# Suggested endpoint design
livenessProbe:
  httpGet:
    path: /livez    # Checks only that the application itself is alive
    port: 8080
readinessProbe:
  httpGet:
    path: /readyz   # Checks whether the application can handle requests
    port: 8080
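
On the application side, the split might look like the following Python sketch built on the standard library's http.server. It is one possible shape rather than a reference implementation: `check_database`, `health_status`, and `serve` are hypothetical names, and the dependency check is a stub you would replace with a real one.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical dependency check; replace with a real connectivity test."""
    return True


def health_status(path: str, dependency_ok: bool) -> int:
    """Map a probe path to an HTTP status code.

    /livez reports only that the process can answer. /readyz also requires
    the dependency, so an outage removes the Pod from endpoints without
    restarting it.
    """
    if path == "/livez":
        return 200
    if path == "/readyz":
        return 200 if dependency_ok else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(health_status(self.path, check_database()))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the health server; 8080 matches the port in the manifest above."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. Because `/livez` never touches the database, a database outage only turns `/readyz` into a 503, which takes the Pod out of rotation instead of restarting it.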

2. Use a Startup Probe for long startup times

# For applications that need a long time to start
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
# Allows up to 30 * 10 = 300 seconds for startup

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
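
The same arithmetic can be run in reverse to pick a failureThreshold for a known worst-case startup time. The helpers below are a small Python sketch with hypothetical names; they treat the startup window as initialDelaySeconds plus failureThreshold times periodSeconds, which is an approximation that ignores probe timeouts.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time the startup probe allows before the container is killed."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(max_startup_seconds, period_seconds):
    """Smallest failureThreshold whose window covers max_startup_seconds."""
    return math.ceil(max_startup_seconds / period_seconds)


print(startup_budget_seconds(30, 10))  # 300
print(failure_threshold_for(300, 10))  # 30
print(failure_threshold_for(95, 10))   # 10
```

Rounding up matters: an application that can take 95 seconds needs a threshold of 10 with a 10-second period, because 9 probes would cover only 90 seconds.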

3. Make the Readiness Probe more sensitive

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 2      # Probe more often
  failureThreshold: 1   # One failure removes the Pod from endpoints
  successThreshold: 2   # Two consecutive successes are needed before the Pod is added back

4. Pair probes with appropriate resource limits

containers:
- name: webapp
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    timeoutSeconds: 5  # Leave enough time to respond, even when the container is CPU-throttled

Debugging Tips

Inspect Pod events

# Show Pod details, including probe failure events
kubectl describe pod <pod-name>

# Sample output:
# Events:
#   Type     Reason     Age   From     Message
#   ----     ------     ----  ----     -------
#   Warning  Unhealthy  10s   kubelet  Liveness probe failed: HTTP probe failed...

Inspect Pod logs

# View the container logs
kubectl logs <pod-name>

# View logs from the previous container instance (if it restarted)
kubectl logs <pod-name> --previous

Test the probe endpoint manually

# Open a shell in the container to test the health endpoint
kubectl exec -it <pod-name> -- /bin/sh

# Run inside the container (assumes the image ships curl)
curl -v http://localhost:8080/healthz

Inspect the probe configuration

# Print the Pod configuration as YAML and show the livenessProbe section
kubectl get pod <pod-name> -o yaml | grep -A 20 livenessProbe

Monitor probe status

# Watch Pod status continuously
kubectl get pods -w

# Show the Pod's status conditions
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}'
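
The conditions come back as a JSON list, which is awkward to read by eye. As a rough sketch, the Python below pulls out a single condition from Pod JSON; the `sample` document is a hand-written, heavily truncated stand-in for real `kubectl get pod <pod-name> -o json` output, and `condition_status` is a hypothetical helper.

```python
import json


def condition_status(pod: dict, condition_type: str):
    """Return the status string of the named Pod condition, or None if absent."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status")
    return None


# Hand-written sample; real output carries many more fields.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "Initialized", "status": "True"},
  {"type": "Ready", "status": "False"},
  {"type": "ContainersReady", "status": "False"}
]}}
""")

print(condition_status(sample, "Ready"))  # False
```

A Ready condition of "False" alongside Initialized "True" is the pattern to expect while a readiness probe is failing. To use the helper on live data, load the kubectl JSON output with `json.load(sys.stdin)` instead of the sample.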

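Summary

Kubernetes probes are an important mechanism for keeping applications reliable:

Liveness Probe: keeps containers healthy by automatically restarting those that cannot recover on their own
Readiness Probe: controls traffic routing so that only Pods that are ready receive requests
Startup Probe: protects slow-starting applications from being killed prematurely

Correctly configured probes significantly improve service reliability and availability. Choose probe types and parameters that fit your application's characteristics, and follow the best practices above to avoid the common configuration mistakes.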