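A Complete Guide to the Prometheus Monitoring System

Build a monitoring system with Prometheus and learn PromQL queries, alerting rules, and service discovery.

Project Overview

Prometheus is a CNCF graduated project: a monitoring and alerting system designed for cloud-native environments. It collects metrics with a pull model and ships with a powerful query language, PromQL.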

GitHub Stars: 62K+

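Core Concepts

  • Metrics - time series data
  • Labels - dimensions attached to each series
  • PromQL - the query language
  • Alertmanager - alert routing and notification
  • Service Discovery - automatic discovery of scrape targets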

Quick Deployment

Docker

docker run -d \
  -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  --name prometheus \
  prom/prometheus
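
Once the container is up, the server listens on port 9090 and can be queried over its HTTP API. The sketch below assumes the `localhost:9090` address from the command above. It builds an instant-query URL for `/api/v1/query` and reads values out of a response body. The helper names `build_query_url` and `parse_vector` are illustrative, and the sample response is hand-written rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body):
    """Map each series' instance label to its value in an instant-vector response."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each result carries a label set and a [timestamp, "value"] pair.
    return {r["metric"].get("instance", ""): float(r["value"][1])
            for r in payload["data"]["result"]}


# Hand-written sample shaped like a response to the query `up` (an assumption).
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
         "value": [1700000000.0, "1"]},
    ]},
})

url = build_query_url("http://localhost:9090", "up")
values = parse_vector(SAMPLE)
print(url)
print(values)
```

To talk to a live server, the URL could be passed to `urllib.request.urlopen` and the response body handed to `parse_vector`.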

Basic Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

Metric Types

Counter

A counter only goes up (it restarts from zero when the process restarts), so it suits cumulative values:

# Total number of requests
http_requests_total

# Per-second request rate
rate(http_requests_total[5m])
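
To make the idea behind `rate()` concrete, here is a simplified sketch of a per-second rate over a window of counter samples, including the handling of a counter reset. It is not the server's implementation: the real function also extrapolates to the window boundaries. The function name `simple_rate` and the sample data are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) samples, tolerating counter resets.

    Simplified: the real rate() also extrapolates to the window boundaries.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples 15 s apart; the process restarts between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]
print(simple_rate(window))
```

Because resets are detected from a drop in value, this approach only makes sense for counters, which is why `rate()` should not be applied to gauges.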

Gauge

A gauge can go up or down, so it suits current values:

# Current temperature
temperature_celsius

# Available memory
node_memory_MemAvailable_bytes

Histogram

A histogram counts observations into buckets and is used for distribution statistics:

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average latency
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
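
The quantile in the first query is an estimate derived from cumulative bucket counts, not an exact value. The sketch below shows the basic idea: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplification that ignores edge cases the server handles (such as negative bucket bounds), and the name `estimate_quantile` and the bucket data are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for le="0.1", "0.5", "1", "+Inf".
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
print(p95)
```

The accuracy of the estimate depends on bucket layout: a quantile that falls inside a wide bucket can be off by up to the width of that bucket.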

Summary

A summary computes quantiles on the client side:

http_request_duration_seconds{quantile="0.95"}

PromQL Queries

Basic Queries

# Select a metric
http_requests_total

# Filter by label
http_requests_total{method="POST"}

# Regular-expression match
http_requests_total{method=~"GET|POST"}

# Exclude a label value
http_requests_total{method!="DELETE"}
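
The four matcher operators (`=`, `!=`, `=~`, `!~`) can be modelled as a filter over label sets. The sketch below is illustrative only: it uses Python's `re` module, whose syntax differs in places from the RE2 syntax Prometheus uses, and the function name `matches` and the sample series are assumptions. It does reflect two behaviours worth knowing: regex matchers are fully anchored, and a missing label behaves like an empty string.

```python
import re


def matches(labels, matchers):
    """Return True if a label set satisfies every (name, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty string
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# Hypothetical series for illustration.
series = [
    {"__name__": "http_requests_total", "method": "GET"},
    {"__name__": "http_requests_total", "method": "POST"},
    {"__name__": "http_requests_total", "method": "DELETE"},
]
selected = [s["method"] for s in series if matches(s, [("method", "=~", "GET|POST")])]
print(selected)
```

Because of the anchoring, `method=~"GET"` does not match a value such as `GETTER`; a pattern like `GET.*` is needed for prefix matching.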

Range Queries

# Samples from the last 5 minutes
http_requests_total[5m]

# Value as of 1 hour ago
http_requests_total offset 1h

Aggregation Operators

# Sum
sum(http_requests_total)

# Sum grouped by a label
sum by (method) (rate(http_requests_total[5m]))

# Sum over everything except the listed labels
sum without (instance) (rate(http_requests_total[5m]))

# Average
avg(node_cpu_seconds_total)

# Maximum / minimum
max(node_memory_MemTotal_bytes)
min(node_memory_MemAvailable_bytes)

# Count
count(up == 1)

# Top K
topk(5, rate(http_requests_total[5m]))
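
The difference between `by` and `without` is only in how the grouping key is built: `by` keeps the listed labels, `without` keeps every label except the listed ones. A small sketch of that grouping, with hypothetical helper names `sum_by` and `sum_without` and made-up sample rates:

```python
from collections import defaultdict


def sum_by(samples, by):
    """Group (labels, value) samples by the `by` labels and sum each group."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


def sum_without(samples, without):
    """Sum after dropping the `without` labels and the metric name."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k not in without and k != "__name__"))
        groups[key] += value
    return dict(groups)


# Hypothetical per-series request rates.
rates = [
    ({"method": "GET", "instance": "a"}, 1.5),
    ({"method": "GET", "instance": "b"}, 2.5),
    ({"method": "POST", "instance": "a"}, 0.5),
]
by_method = sum_by(rates, ["method"])
print(by_method)
```

With only `method` and `instance` present, `sum by (method)` and `sum without (instance)` give the same groups. They diverge as soon as a series carries another label, which `without` preserves and `by` discards.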

Common Functions

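# Rate of change
rate(http_requests_total[5m])   # average per-second rate over the window
irate(http_requests_total[5m])  # instantaneous rate from the last two samples

# Increase
increase(http_requests_total[1h])

# Histogram quantile
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Time functions
time()
timestamp(up)

# Math functions (value is a placeholder for any instant vector)
abs(value)
ceil(value)
floor(value)
round(value, 0.1)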

Alerting Rules

Alert Rule Configuration

# alerts/node.yml
groups:
  - name: node
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
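
The `for: 5m` clause means an alert does not fire the first time its expression is true. It stays pending until the condition has held for the full duration, and goes back to inactive as soon as the condition is false. The sketch below models that state machine in simplified form; the function name `alert_states` and the 60-second evaluation spacing are assumptions chosen to keep the example short.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each (timestamp, condition_true) evaluation."""
    states = []
    active_since = None  # when the condition first became true in the current run
    for ts, condition in evaluations:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Condition is true from t=0 to t=300, false at t=360, true again at t=420.
timeline = [(t, True) for t in range(0, 301, 60)] + [(360, False), (420, True)]
states = alert_states(timeline, for_seconds=300)
print(states)
```

A single false evaluation resets the timer, so a flapping condition may never reach the firing state with a long `for` duration.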

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
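
The `route` block forms a tree: an alert descends into the first child route whose labels match, and otherwise falls back to the receiver of the current node. The sketch below models only that selection step, under stated simplifications: equality matching only, every node names its own receiver, and grouping, regex matchers, and the `continue` option are ignored. `pick_receiver` and `ROUTE` are illustrative names.

```python
def pick_receiver(route, labels):
    """Return the receiver for an alert's labels by walking the routing tree."""
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(k) == v for k, v in wanted.items()):
            # Descend into the first matching child.
            return pick_receiver(child, labels)
    # No child matched, so this node handles the alert.
    return route["receiver"]


# Mirrors the routing section of the configuration above.
ROUTE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}

for severity in ("critical", "warning", "info"):
    print(severity, "->", pick_receiver(ROUTE, {"severity": severity}))
```

Since the first match wins, the order of entries under `routes` matters when more than one of them could match the same alert.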

Service Discovery

Kubernetes

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
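
The two relabel rules above read as: keep only pods annotated with `prometheus.io/scrape: "true"`, and if a pod sets `prometheus.io/path`, use it as the metrics path. The sketch below models the `keep` and `replace` actions to show that logic. It is a simplification: it uses Python regex syntax and `\1` where Prometheus uses RE2 and `$1`, and it covers only these two actions. The function name `relabel` and the sample pods are assumptions.

```python
import re


def relabel(labels, rules):
    """Apply keep/replace rules to a label dict; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        separator = rule.get("separator", ";")
        source = separator.join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), source)  # fully anchored
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None
        elif action == "replace":
            if match is not None:
                # The default replacement is the first capture group.
                labels[rule["target_label"]] = match.expand(rule.get("replacement", r"\1"))
        else:
            raise ValueError(f"unsupported action: {action}")
    return labels


RULES = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
     "action": "replace", "target_label": "__metrics_path__", "regex": "(.+)"},
]

# Hypothetical discovered pod with both annotations set.
annotated = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/custom",
    "__metrics_path__": "/metrics",
}
print(relabel(annotated, RULES)["__metrics_path__"])
```

The `(.+)` regex is what makes the second rule optional: an absent annotation yields an empty source string, the regex does not match, and the default `/metrics` path is left alone.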

Docker Swarm

scrape_configs:
  - job_name: 'docker'
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks

File SD

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m

Example target file, targets/web.json:
[
  {
    "targets": ["web1:9100", "web2:9100"],
    "labels": {
      "env": "production",
      "team": "web"
    }
  }
]
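
Target files like this are usually generated by a script or an inventory tool rather than edited by hand. The sketch below writes one target group with the standard library and replaces the file atomically so that Prometheus never reads a half-written file. The function name `write_targets` is hypothetical, and the demo writes into a temporary directory instead of `/etc/prometheus/targets`.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels):
    """Write one target group to `path`, replacing the file atomically."""
    groups = [{"targets": list(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename it into place.
    # The .tmp suffix keeps it out of a '*.json' glob while it is being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable only by its owner
    os.replace(tmp_path, path)
    return groups


# Demo in a throwaway directory; a real setup would target the directory
# matched by file_sd_configs.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "web.json")
    write_targets(demo_path, ["web1:9100", "web2:9100"],
                  {"env": "production", "team": "web"})
    with open(demo_path) as handle:
        written = json.load(handle)
print(written)
```

Prometheus watches the matched files for changes and also re-reads them every `refresh_interval`, so no reload is needed after the file is rewritten.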

Exporter Ecosystem

Node Exporter

docker run -d \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter \
  --path.rootfs=/host

Custom Metrics (Python)

import time

from prometheus_client import start_http_server, Counter, Gauge, Histogram

# Define the metrics
requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
temperature = Gauge('temperature_celsius', 'Current temperature')
request_duration = Histogram('http_request_duration_seconds', 'Request duration')

# Start the HTTP server that exposes the metrics on port 8000
start_http_server(8000)

# Record values
requests_total.labels(method='GET', endpoint='/api').inc()
temperature.set(23.5)

with request_duration.time():
    # Handle the request
    pass

# The server runs in a background thread, so keep the process alive
while True:
    time.sleep(1)
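
What Prometheus scrapes from such a process is plain text. The sketch below renders one metric family in the text exposition format using only the standard library, to show roughly what the scrape response looks like. It is an illustration, not the client library's code: the real client emits additional series and handles more cases, and the helper names `render_metric` and `escape_label_value` are made up.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, kind, help_text, samples):
    """Render one metric family; `samples` is a list of (labels, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric("http_requests_total", "counter", "Total HTTP requests",
                     [({"method": "GET", "endpoint": "/api"}, 1.0)])
print(text, end="")
```

Each line is one sample: the metric name, an optional label set in braces, and the value. That is the same shape a PromQL selector such as `http_requests_total{method="GET"}` refers to.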

High-Availability Deployment

Thanos

# Prometheus sidecar
prometheus:
  command:
    - --storage.tsdb.max-block-duration=2h
    - --storage.tsdb.min-block-duration=2h

thanos-sidecar:
  image: quay.io/thanos/thanos:latest
  args:
    - sidecar
    - --prometheus.url=http://prometheus:9090
    - --objstore.config-file=/etc/thanos/bucket.yml
