A Complete Guide to the Prometheus Monitoring System

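Project Overview

Prometheus is a CNCF graduated project: a monitoring and alerting system designed for cloud-native environments. It collects metrics with a pull model (the server scrapes its targets over HTTP) and provides a powerful query language, PromQL.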

GitHub Stars: 62K+

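Core Concepts

Metrics - time series data
Labels - key-value dimensions attached to each series
PromQL - the query language
Alertmanager - alert routing and notification
Service Discovery - automatic discovery of scrape targets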
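
To make the first two concepts concrete, here is a minimal sketch of the data model: a time series is identified by its metric name plus its full label set, and each series holds timestamped samples. The names `series_key`, `store`, and `append` are hypothetical helpers for illustration, not part of any Prometheus API.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, tuple(sorted(labels.items())))


# series key -> list of (timestamp, value) samples
store = defaultdict(list)


def append(name, labels, timestamp, value):
    store[series_key(name, labels)].append((timestamp, value))


append("http_requests_total", {"method": "GET"}, 1000, 10.0)
append("http_requests_total", {"method": "GET"}, 1015, 14.0)
append("http_requests_total", {"method": "POST"}, 1000, 3.0)

# Same metric name, but two label sets, so two distinct series.
print(len(store))
```

Changing any label value creates a new series, which is why labels with unbounded values (user IDs, request IDs) tend to be a poor fit.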

Quick Deployment

Docker

docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --name prometheus \
  prom/prometheus

Basic Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

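Metric Types

Counter

A counter only ever increases (it resets to zero when the process restarts), so it is used for cumulative values: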

# Total number of requests
http_requests_total

# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])
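
The idea behind `rate()` on a counter can be sketched in a few lines. This is a simplified illustration, not Prometheus's implementation: the real function also extrapolates to the edges of the window, which this sketch skips. `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase over a window of (timestamp, value) counter samples.

    A drop in value is treated as a counter reset (the process restarted),
    so the value seen after the reset counts as new increase.
    """
    if len(samples) < 2:
        return None  # not enough data in the window
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# The counter restarts between t=15 and t=30 (130 -> 5).
window = [(0, 100.0), (15, 130.0), (30, 5.0), (45, 35.0)]
print(simple_rate(window))  # (30 + 5 + 30) / 45 seconds
```

Because resets are handled this way, `rate()` should be applied to the raw counter before any aggregation such as `sum`.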

Gauge

A gauge can go up or down, so it is used for current values:

# Current temperature
temperature_celsius

# Memory currently available (not memory in use)
node_memory_MemAvailable_bytes

Histogram

A histogram counts observations in cumulative buckets and is used for distribution statistics:

# 95th-percentile (P95) latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
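
The quantile estimate from cumulative buckets works by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below illustrates that idea under simplifying assumptions (classic buckets, lowest bucket starting at 0, `0 < q <= 1`); `bucket_quantile` and the bucket data are hypothetical and this is not the server's code.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(upper_bound, count), ...].

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(bucket_quantile(0.95, buckets))  # rank 95 lies halfway through (0.5, 1.0]
```

The estimate is only as precise as the bucket layout, so bucket boundaries should be placed near the latencies you care about.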

Summary

A summary computes quantiles on the client side:

http_request_duration_seconds{quantile="0.95"}

PromQL Queries

Basic Queries

# Select a metric
http_requests_total

# Filter by label (exact match)
http_requests_total{method="POST"}

# Regular-expression match
http_requests_total{method=~"GET|POST"}

# Negative match (exclude a label value)
http_requests_total{method!="DELETE"}
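
The four matcher operators can be sketched as follows. Two details worth noting: regular-expression matchers are fully anchored, and a label that is absent behaves like an empty string. This is an illustration only; Prometheus uses RE2 syntax, whereas this sketch uses Python's `re` module, and `matches` is a hypothetical helper.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one label matcher against a series' label set."""
    actual = labels.get(name, "")  # a missing label behaves like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # fully anchored
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown matcher operator: {op}")


series = {"method": "GET", "instance": "web1:9100"}
print(matches(series, "method", "=~", "GET|POST"))
print(matches(series, "method", "!=", "DELETE"))
print(matches(series, "instance", "=~", "web1"))  # anchored, so no partial match
```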

Range Queries

# All samples from the last 5 minutes (a range vector)
http_requests_total[5m]

# The value as it was 1 hour ago
http_requests_total offset 1h

Aggregation Operators

# Sum across all series
sum(http_requests_total)

# Sum grouped by a label
sum by (method) (rate(http_requests_total[5m]))

# Sum over everything except the listed labels
sum without (instance) (rate(http_requests_total[5m]))

# Average
avg(node_cpu_seconds_total)

# Maximum / minimum
max(node_memory_MemTotal_bytes)
min(node_memory_MemAvailable_bytes)

# Count (here: the number of targets that are up)
count(up == 1)

# Top K
topk(5, rate(http_requests_total[5m]))
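
The difference between `by` and `without` is which labels define the output groups. A minimal sketch, assuming an instant vector represented as a list of `(labels, value)` pairs with the metric name already stripped; `sum_by` and `sum_without` are hypothetical helpers:

```python
from collections import defaultdict


def sum_by(series, keep):
    """Group on the listed labels only; every other label is dropped."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return dict(groups)


def sum_without(series, drop):
    """Group on every label except the listed ones."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


vector = [
    ({"method": "GET", "instance": "a"}, 1.0),
    ({"method": "GET", "instance": "b"}, 2.0),
    ({"method": "POST", "instance": "a"}, 4.0),
]
print(sum_by(vector, ["method"]))
print(sum_without(vector, ["instance"]))
```

With only two labels the two calls produce the same groups; `without` is the safer choice when series may gain extra labels later, because those labels are preserved rather than silently merged.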

Common Functions

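# Rate of change
rate(http_requests_total[5m])    # average per-second rate over the window
irate(http_requests_total[5m])   # instantaneous rate from the last two samples

# Increase over the window
increase(http_requests_total[1h])

# Histogram quantile, aggregated across series
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Time functions
time()
timestamp(up)

# Math functions
abs(delta(temperature_celsius[1h]))
ceil(temperature_celsius)
floor(temperature_celsius)
round(temperature_celsius, 0.1)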

Alerting Rules

Alert Definitions

# alerts/node.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
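
The `for: 5m` clause means the expression must stay true for that long before the alert fires; until then the alert is pending, and a single false evaluation resets the timer. The following sketch models that state machine under simplified assumptions (evenly spaced evaluations, one alert instance); `alert_states` is a hypothetical helper, not Prometheus code.

```python
def alert_states(condition_history, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for index, is_true in enumerate(condition_history):
        now = index * interval_seconds
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Evaluations every 15s with a 30s hold; the condition flaps once at index 2.
history = [True, True, False, True, True, True]
print(alert_states(history, for_seconds=30, interval_seconds=15))
```

A longer `for` suppresses short spikes at the cost of slower notification.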

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
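
Routing in the configuration above is a tree walk: child routes are tried in order, the first one whose labels match wins, and the root receiver is the fallback. The sketch below mirrors that structure as a Python dict. It is a simplified illustration that ignores the `continue` option, regex matching, and grouping; `pick_receiver` is a hypothetical helper.

```python
def pick_receiver(route, alert_labels, inherited=None):
    """Walk the routing tree and return the receiver for an alert."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in wanted.items()):
            # First matching child wins; descend into it.
            return pick_receiver(child, alert_labels, receiver)
    return receiver


root = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}

print(pick_receiver(root, {"alertname": "DiskSpaceLow", "severity": "critical"}))
print(pick_receiver(root, {"alertname": "HighCPUUsage", "severity": "warning"}))
print(pick_receiver(root, {"alertname": "Custom", "severity": "info"}))
```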

Service Discovery

Kubernetes

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
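
The two relabeling rules above do different jobs: `keep` drops every pod that lacks the `prometheus.io/scrape: "true"` annotation, and `replace` copies the path annotation into `__metrics_path__` when it is set. A minimal sketch of those two actions, assuming the documented defaults (source values joined with `;`, fully anchored regex). Note that Prometheus writes the replacement as `$1`, while this sketch uses Python's `\1`; `relabel` is a hypothetical helper.

```python
import re


def relabel(labels, rules):
    """Apply keep/replace rules; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None  # target is dropped
        elif action == "replace":
            if match is not None:
                replacement = rule.get("replacement", r"\1")
                labels[rule["target_label"]] = match.expand(replacement)
        else:
            raise ValueError(f"action not covered by this sketch: {action}")
    return labels


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
PATH = "__meta_kubernetes_pod_annotation_prometheus_io_path"
rules = [
    {"source_labels": [SCRAPE], "action": "keep", "regex": "true"},
    {"source_labels": [PATH], "action": "replace",
     "target_label": "__metrics_path__", "regex": "(.+)"},
]

annotated = relabel({SCRAPE: "true", PATH: "/custom/metrics"}, rules)
print(annotated["__metrics_path__"])
print(relabel({"app": "no-annotations"}, rules))
```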

Docker

scrape_configs:
  - job_name: 'docker'
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks

File SD

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m

Contents of targets/web.json (JSON has no comment syntax, so the file itself must start at the opening bracket):
[
  {
    "targets": ["web1:9100", "web2:9100"],
    "labels": {
      "env": "production",
      "team": "web"
    }
  }
]
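
Target files like this are usually generated by a script or a configuration management tool. The sketch below shows one way to write such a file atomically, so Prometheus never reads a half-written file. The function name `write_targets` and the temporary demo directory are assumptions for illustration; in practice the path would be one matched by the `files` glob in the scrape config.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Write a file_sd target list atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The ".tmp" suffix keeps the partial file outside a "*.json" glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # atomic on the same filesystem


groups = [
    {
        "targets": ["web1:9100", "web2:9100"],
        "labels": {"env": "production", "team": "web"},
    }
]

# Demo in a throwaway directory rather than /etc/prometheus/targets.
with tempfile.TemporaryDirectory() as demo_dir:
    target_file = os.path.join(demo_dir, "web.json")
    write_targets(target_file, groups)
    with open(target_file) as handle:
        print(handle.read())
```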

Exporter Ecosystem

Node Exporter

docker run -d \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host

Custom Metrics (Python)

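import time

from prometheus_client import start_http_server, Counter, Gauge, Histogram

# Define the metrics
requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
temperature = Gauge('temperature_celsius', 'Current temperature')
request_duration = Histogram('http_request_duration_seconds', 'Request duration')

# Start the HTTP server that exposes /metrics (it runs in a daemon thread)
start_http_server(8000)

# Record some values
requests_total.labels(method='GET', endpoint='/api').inc()
temperature.set(23.5)

with request_duration.time():
    # Handle the request
    pass

# Keep the process alive so Prometheus can scrape it;
# without this the script exits and the daemon thread stops with it
while True:
    time.sleep(1)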
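
What Prometheus actually scrapes from such a process is plain text: a `# HELP` line, a `# TYPE` line, and one line per series. The sketch below renders a counter in that shape using only the standard library, to show what the client library produces for you. It is a simplified illustration that skips label-value escaping and the extra series a real client emits; `render_counter` is a hypothetical helper.

```python
def render_counter(name, help_text, samples):
    """Render counter samples [(labels, value), ...] as exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/api"}, 1.0)],
)
print(text)
```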

High-Availability Deployment

Thanos

# Prometheus sidecar
prometheus:
  command:
    - --storage.tsdb.max-block-duration=2h
    - --storage.tsdb.min-block-duration=2h

thanos-sidecar:
  image: quay.io/thanos/thanos:latest
  args:
    - sidecar
    - --prometheus.url=http://prometheus:9090
    - --objstore.config-file=/etc/thanos/bucket.yml
