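Build a monitoring system with Prometheus and learn PromQL queries, alerting rules, and service discovery.
Project Overview
Prometheus is a CNCF graduated project: a monitoring and alerting system designed for cloud-native environments. It collects metrics with a pull model and provides a powerful query language, PromQL.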
GitHub Stars: 62K+
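Core Concepts
- Metrics - time series data
- Labels - dimensions attached to each series
- PromQL - the query language
- Alertmanager - alert routing and notification
- Service Discovery - automatic discovery of scrape targets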
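To make the relationship between metrics and labels concrete, here is a minimal sketch of how one line of the Prometheus text exposition format breaks down into a metric name, a label set, and a value. The function name `parse_sample` is hypothetical, and the parser is deliberately simplified: it ignores comment lines, timestamps, and escape sequences inside label values.

```python
import re

# One sample line looks like: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    # Labels are optional; a bare metric has an empty label set.
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


print(parse_sample('http_requests_total{method="POST",code="200"} 1027'))
```

Each distinct combination of metric name and label values is its own time series, which is why labels are described as dimensions.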
Quick Deployment
Docker

    # Use an absolute host path for the bind mount; older Docker versions reject relative paths
    docker run -d \
      -p 9090:9090 \
      -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
      --name prometheus \
      prom/prometheus

Basic Configuration

    # prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - "alerts/*.yml"

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'node'
        static_configs:
          - targets: ['node-exporter:9100']

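Metric Types
Counter
A counter only ever increases (it resets to zero when the process restarts), so it is used for cumulative values:

    # Total number of requests
    http_requests_total

    # Request rate (per second, averaged over 5 minutes)
    rate(http_requests_total[5m])
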
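The idea behind `rate()` on a counter can be sketched in a few lines of Python. This is a simplified illustration, not Prometheus's actual implementation: the hypothetical `simple_rate` helper handles counter resets but skips the extrapolation to the window boundaries that the real function performs.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example a restart),
        # so the whole new value counts as increase since the reset.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# The counter resets between t=20 and t=40.
window = [(0, 100), (20, 130), (40, 10), (60, 30)]
print(simple_rate(window))
```

Because of the reset handling, graphing `rate()` of a counter stays meaningful across restarts, whereas graphing the raw counter would show a sudden drop.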
Gauge
A gauge can go up or down, so it is used for current values:

    # Current temperature
    temperature_celsius

    # Available memory
    node_memory_MemAvailable_bytes

Histogram
A histogram records the distribution of observations in buckets:

    # P95 latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

    # Average latency
    rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

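How a quantile is estimated from cumulative buckets can be sketched as follows. This is a simplified Python illustration of the linear interpolation idea behind `histogram_quantile`, assuming classic `le` buckets with positive upper bounds; the bucket data is made up for the example, and edge cases in the real function (such as native histograms) are not covered.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper
    bound and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: the best
                # available answer is the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical buckets: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

The result is only as precise as the bucket layout, so bucket boundaries should be chosen close to the latencies that matter for alerting.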
Summary
A summary calculates quantiles on the client side:

    http_request_duration_seconds{quantile="0.95"}

PromQL Queries
Basic Queries

    # Select a metric
    http_requests_total

    # Filter by label
    http_requests_total{method="POST"}

    # Regular expression match
    http_requests_total{method=~"GET|POST"}

    # Exclude a label value
    http_requests_total{method!="DELETE"}

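The four matcher operators can be sketched in Python as below. The `matches` helper is hypothetical and only illustrates the semantics: regular expressions are fully anchored, and a label that is absent behaves like an empty string. Prometheus uses RE2 syntax, which differs from Python's `re` module in some details.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one label matcher against a series' label set."""
    actual = labels.get(name, "")  # a missing label acts like ""
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        # fullmatch: the pattern must cover the whole label value
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


series = {"method": "POST", "code": "200"}
print(matches(series, "method", "=~", "GET|POST"))
print(matches(series, "method", "!=", "DELETE"))
```

The anchoring matters in practice: `method=~"POST"` does not match a value such as `POSTS`, so substring matches need an explicit `.*`.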
Range Queries

    # Samples from the last 5 minutes
    http_requests_total[5m]

    # Value as of 1 hour ago
    http_requests_total offset 1h

Aggregation Operators

    # Sum
    sum(http_requests_total)

    # Sum grouped by label
    sum by (method) (rate(http_requests_total[5m]))

    # Sum over everything except the listed labels
    sum without (instance) (rate(http_requests_total[5m]))

    # Average
    avg(node_cpu_seconds_total)

    # Maximum / minimum
    max(node_memory_MemTotal_bytes)
    min(node_memory_MemAvailable_bytes)

    # Count
    count(up == 1)

    # Top K
    topk(5, rate(http_requests_total[5m]))

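The difference between `by` and `without` is easiest to see on a small data set. The following Python sketch uses hypothetical helper names (`sum_by`, `sum_without`) and made-up series; it only illustrates the grouping logic, not the PromQL engine.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Group by the listed labels only; all other labels are dropped."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


def sum_without(series, drop):
    """Group by every label except the listed ones."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


# Hypothetical per-instance request rates.
series = [
    ({"method": "GET", "instance": "a"}, 2.0),
    ({"method": "GET", "instance": "b"}, 3.0),
    ({"method": "POST", "instance": "a"}, 1.0),
]
print(sum_by(series, ["method"]))
print(sum_without(series, ["instance"]))
```

With only `method` and `instance` present, both calls yield the same groups; `without` becomes the safer choice once series carry additional labels that should be preserved.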
Common Functions

    # Rate of change
    rate(http_requests_total[5m])
    irate(http_requests_total[5m])

    # Increase
    increase(http_requests_total[1h])

    # Histogram quantile
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

    # Time functions
    time()
    timestamp(up)

    # Math functions
    abs(delta(temperature_celsius[1h]))
    ceil(temperature_celsius)
    floor(temperature_celsius)
    round(temperature_celsius, 0.1)

Alerting Rules
Alert Definitions

    # alerts/node.yml
    groups:
      - name: node
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value | printf \"%.2f\" }}%"

          - alert: HighMemoryUsage
            expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"

          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"

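The `for` clause in the rules above means that an alert first becomes pending and only fires once the expression has stayed true for the whole duration. A minimal Python sketch of that state machine, using a hypothetical `alert_states` helper and assuming evaluations happen at a fixed interval:

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (expr returned data).
    interval_s: evaluation interval in seconds.
    for_s: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None
    for index, is_true in enumerate(condition_results):
        now = index * interval_s
        if not is_true:
            # Any evaluation without a match resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states


# Evaluated every 60s with `for: 5m`: fires on the sixth evaluation.
print(alert_states([True] * 7, interval_s=60, for_s=300))
```

A single evaluation where the condition is false resets the timer, which is what makes `for` useful for suppressing short spikes.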
Alertmanager Configuration

    # alertmanager.yml
    global:
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'

    route:
      receiver: 'default'
      group_by: ['alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      # Note: `match` is deprecated since Alertmanager 0.22; newer setups can use `matchers`
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
        - match:
            severity: warning
          receiver: 'slack'

    receivers:
      - name: 'default'
        email_configs:
          - to: 'team@example.com'

      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx'
            channel: '#alerts'

      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'xxx'

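The routing tree above can be read as "walk the child routes in order, take the first one whose labels match, otherwise fall back to the parent receiver". The Python sketch below mirrors that configuration as a dictionary; `pick_receiver` is a hypothetical helper, and it ignores `continue`, regular expression matchers, and grouping.

```python
ROUTE = {
    "receiver": "default",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "slack"},
    ],
}


def pick_receiver(route, alert_labels, inherited=None):
    """Return the receiver name an alert would be routed to."""
    # A child route without its own receiver inherits the parent's.
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in wanted.items()):
            # First matching child wins; descend into its subtree.
            return pick_receiver(child, alert_labels, receiver)
    return receiver


print(pick_receiver(ROUTE, {"alertname": "DiskSpaceLow", "severity": "critical"}))
print(pick_receiver(ROUTE, {"alertname": "Custom", "severity": "info"}))
```

Alerts whose `severity` is neither `critical` nor `warning` end up at the `default` email receiver, so the top-level receiver acts as a catch-all.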
Service Discovery
Kubernetes

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Only scrape pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Use the prometheus.io/path annotation as the metrics path, if set
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)

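What the two relabeling rules do to a discovered pod can be sketched in Python. This is a simplified model covering only the `keep` and `replace` actions, with a hypothetical `relabel` function; note that Python writes the default replacement as `\1` where Prometheus uses `$1`, and that the regular expression is fully anchored in both.

```python
import re

RULES = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
     "action": "replace", "target_label": "__metrics_path__", "regex": "(.+)"},
]


def relabel(labels, rules):
    """Apply relabel rules in order; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ";" before matching.
        source = ";".join(labels.get(n, "") for n in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), source)
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None
        elif action == "replace":
            if match is not None:  # no match leaves the labels untouched
                replacement = rule.get("replacement", r"\1")
                labels[rule["target_label"]] = match.expand(replacement)
        else:
            raise ValueError(f"action not covered by this sketch: {action}")
    return labels


pod = {
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_path": "/custom/metrics",
}
print(relabel(pod, RULES)["__metrics_path__"])
```

A pod without the scrape annotation is dropped by the first rule, and a pod without the path annotation keeps the default metrics path because `(.+)` does not match an empty string.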
Docker Swarm

    scrape_configs:
      - job_name: 'docker'
        dockerswarm_sd_configs:
          - host: unix:///var/run/docker.sock
            role: tasks

File SD

    scrape_configs:
      - job_name: 'file-sd'
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/*.json'
            refresh_interval: 5m

An example target file, targets/web.json:

    [
      {
        "targets": ["web1:9100", "web2:9100"],
        "labels": {
          "env": "production",
          "team": "web"
        }
      }
    ]

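Target files for file-based discovery are often generated by a script. The sketch below shows one possible way to write such a file from Python; the `write_targets` helper is hypothetical. It writes to a temporary file and renames it into place so that Prometheus does not read a half-written file, and the `.tmp` suffix keeps the temporary file outside the `*.json` glob.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Atomically write a file_sd target file.

    groups: list of {"targets": [...], "labels": {...}} dictionaries.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # Renaming within one directory replaces the old file in one step.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Demo in a throwaway directory instead of /etc/prometheus/targets.
    with tempfile.TemporaryDirectory() as demo_dir:
        target_file = os.path.join(demo_dir, "web.json")
        write_targets(target_file, [
            {"targets": ["web1:9100", "web2:9100"],
             "labels": {"env": "production", "team": "web"}},
        ])
        with open(target_file) as handle:
            print(handle.read())
```

Prometheus picks up changes to matching files on its own, so no reload is needed after the script runs.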
Exporter Ecosystem
Node Exporter

    docker run -d \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      prom/node-exporter \
      --path.rootfs=/host

Custom Metrics (Python)

    import time

    from prometheus_client import start_http_server, Counter, Gauge, Histogram

    # Define the metrics
    requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
    temperature = Gauge('temperature_celsius', 'Current temperature')
    request_duration = Histogram('http_request_duration_seconds', 'Request duration')

    if __name__ == '__main__':
        # Start the HTTP server first so /metrics is served on port 8000
        start_http_server(8000)

        # The server runs in a background thread, so keep the process alive
        while True:
            requests_total.labels(method='GET', endpoint='/api').inc()
            temperature.set(23.5)
            with request_duration.time():
                time.sleep(0.1)  # stand-in for handling a request

High-Availability Deployment
Thanos

    # Prometheus with a Thanos sidecar
    prometheus:
      command:
        - --storage.tsdb.max-block-duration=2h
        - --storage.tsdb.min-block-duration=2h

    thanos-sidecar:
      image: quay.io/thanos/thanos:latest
      args:
        - sidecar
        - --prometheus.url=http://prometheus:9090
        - --objstore.config-file=/etc/thanos/bucket.yml
