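The Complete Guide to the Grafana Observability Platform

Build monitoring dashboards with Grafana and integrate Prometheus, Loki, and Tempo for full observability.

Overview

Grafana is one of the most popular open-source observability platforms, providing polished dashboards for visualizing monitoring data. It supports Prometheus, InfluxDB, Elasticsearch, and many other data sources.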

GitHub Stars: 72K+

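Key Features

  • Dashboards - a rich set of visualization components
  • Multiple data sources - integrates with 100+ data sources
  • Alerting - notifications through multiple channels
  • Log queries - Loki integration
  • Distributed tracing - Tempo integration

Quick Deployment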

Docker

docker run -d \
  -p 3000:3000 \
  --name grafana \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana-oss

Docker Compose (full stack)

version: '3.8'
services:
  grafana:
    image: grafana/grafana-oss
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

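  promtail:
    image: grafana/promtail
    volumes:
      - /var/log:/var/log
      # Container logs matched by the __path__ glob in promtail-config.yml
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml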

volumes:
  grafana-data:

Open http://localhost:3000 and log in with admin/admin.
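
Before opening the UI it can help to wait until the container has finished starting. The following is a minimal sketch, assuming Grafana is reachable on http://localhost:3000 as mapped above and that its `/api/health` endpoint reports `"database": "ok"` once it is ready. The function names `is_healthy` and `wait_for_grafana` are illustrative, not part of any Grafana client.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True when a /api/health response body reports database 'ok'."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def wait_for_grafana(base_url: str = "http://localhost:3000",
                     timeout_s: float = 60.0,
                     interval_s: float = 2.0) -> bool:
    """Poll GET /api/health until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
                if resp.status == 200 and is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Grafana is not accepting connections yet
        time.sleep(interval_s)
    return False
```

Calling `wait_for_grafana()` after `docker compose up -d` returns `True` once the health endpoint answers, or `False` if the timeout passes first.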

Data Source Configuration

Prometheus

  1. Configuration → Data Sources → Add (Connections → Data sources in recent Grafana versions)
  2. Select Prometheus
  3. URL: http://prometheus:9090
  4. Save & Test

Loki (logs)

  1. Configuration → Data Sources → Add
  2. Select Loki
  3. URL: http://loki:3100
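
The same data sources can also be registered without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request; it assumes basic authentication with the admin/admin credentials from the Compose file and the in-network URLs used above. The helper names `datasource_payload` and `build_request` are hypothetical.

```python
import base64
import json
import urllib.request


def datasource_payload(name: str, ds_type: str, url: str,
                       is_default: bool = False) -> dict:
    """Build the JSON body for POST /api/datasources."""
    return {
        "name": name,
        "type": ds_type,       # e.g. "prometheus" or "loki"
        "url": url,            # URL as seen from the Grafana container
        "access": "proxy",     # Grafana's backend performs the queries
        "isDefault": is_default,
    }


def build_request(grafana_url: str, payload: dict,
                  user: str, password: str) -> urllib.request.Request:
    """Create (but do not send) the authenticated POST request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Sending it is one more call, for example `urllib.request.urlopen(build_request("http://localhost:3000", datasource_payload("Loki", "loki", "http://loki:3100"), "admin", "admin"))`.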

MySQL

# datasources/mysql.yaml
apiVersion: 1
datasources:
  - name: MySQL
    type: mysql
    url: mysql:3306
    database: mydb
    user: grafana
    secureJsonData:
      password: secret

Building Dashboards

Panel Types

Type          Purpose
Time series   Time-series data
Stat          Single value
Gauge         Gauge display
Bar chart     Bar charts
Table         Tabular data
Heatmap       Heat maps
Logs          Log display

PromQL Examples

# CPU usage (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# HTTP request rate by status
sum(rate(http_requests_total[5m])) by (status)

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
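
The P95 query estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that arithmetic for classic histograms under simplifying assumptions: the lowest bucket is taken to start at 0, and the bucket values are made up for illustration. It is not Prometheus's own code.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if len(ordered) < 2 or ordered[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == math.inf:
                # The rank falls beyond the last finite bucket:
                # report the highest finite upper bound.
                return ordered[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan  # not reached for well-formed cumulative buckets


# Hypothetical latency buckets in seconds: 100 requests in total.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)  # rank 95 lies in the (0.5, 1.0] bucket
```

With these numbers the rank 95 sits 5 observations into a bucket holding 8, so the estimate is 0.5 + 0.5 × 5/8 = 0.8125 seconds. The accuracy of the result depends on how finely the bucket boundaries are chosen.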

LogQL Examples (Loki)

# Query the logs of a specific application
{app="nginx"} |= "error"

# Parse JSON and filter on an extracted field
{app="api"} | json | response_code >= 500

# Error rate by level
sum(rate({app="api"} |= "error" [5m])) by (level)
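
For log queries, `count_over_time` counts the entries inside the range window and `rate` divides that count by the window length in seconds. The following is a small sketch of that arithmetic over in-memory sample data; the entries and the `level` values are invented for illustration, and the window is treated as the half-open interval (now - window, now].

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_over_time(entries: Iterable[Tuple[float, str]],
                    now_s: float, window_s: float) -> Dict[str, int]:
    """Count log entries per level whose timestamp is in (now - window, now]."""
    counts: Counter = Counter()
    for timestamp_s, level in entries:
        if now_s - window_s < timestamp_s <= now_s:
            counts[level] += 1
    return dict(counts)


def rate(entries: Iterable[Tuple[float, str]],
         now_s: float, window_s: float) -> Dict[str, float]:
    """Entries per second per level over the same window."""
    return {level: n / window_s
            for level, n in count_over_time(entries, now_s, window_s).items()}


# Hypothetical (timestamp in seconds, level) pairs.
sample = [(100, "error"), (200, "error"), (250, "warn"), (1000, "error")]
per_level = count_over_time(sample, now_s=300, window_s=300)
```

Here `per_level` contains two `error` entries and one `warn` entry; the entry at t=1000 lies outside the 5-minute window ending at t=300.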

Alerting

Alert Rules

  1. Alerting → Alert Rules → Create
  2. Define the condition
# Example: high CPU usage alert (Prometheus-style rule definition)
alert: HighCPUUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
  severity: warning
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
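
The `for: 5m` clause means the expression must stay true for five continuous minutes before the alert fires; until then the alert is pending. The sketch below models that state machine under the assumption of one evaluation per sample. The function name `alert_states` and the sample values are illustrative only.

```python
from typing import List, Sequence, Tuple


def alert_states(samples: Sequence[Tuple[float, float]],
                 threshold: float, for_s: float) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each (timestamp_s, value)."""
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp_s, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp_s
            held_for = timestamp_s - active_since
            states.append("firing" if held_for >= for_s else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# CPU usage (%) evaluated once a minute, threshold 80, for: 5m (300 s).
cpu = [(0, 70), (60, 85), (120, 90), (180, 88),
       (240, 92), (300, 95), (360, 91), (420, 60)]
states = alert_states(cpu, threshold=80, for_s=300)
```

In this run the condition becomes true at t=60, so the alert is pending until t=360, fires there, and returns to inactive at t=420 when usage drops below the threshold.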

Notification Channels

Grafana can send notifications through several channels:

  • Email
  • Slack
  • PagerDuty
  • Webhook
  • Microsoft Teams
  • Discord

Slack Setup

  1. Alerting → Contact points → New
  2. Select Slack
  3. Enter the Webhook URL
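
It can be worth checking that the webhook itself works before wiring it into a contact point. The sketch below builds a test message for a Slack incoming webhook, which accepts a JSON body with a `text` field. The URL shown in the usage note is a placeholder, and `slack_request` is a hypothetical helper name.

```python
import json
import urllib.request


def slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Create (but do not send) a POST request carrying a plain-text message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen`, for example `urllib.request.urlopen(slack_request("https://hooks.slack.com/services/<your-path>", "Grafana webhook test"))`, substituting the real webhook URL.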

Dashboards as Code

JSON Export

Dashboard Settings → JSON Model → Copy

Provisioning

# provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: 'default'
    folder: ''
    type: file
    options:
      path: /var/lib/grafana/dashboards
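
The provider above loads every dashboard JSON file found under its `path`. As a rough sketch, a script could generate such a file instead of exporting it by hand. The field set below is a deliberately small subset of the dashboard JSON model, and the `schemaVersion` value is an assumption; copying the value from a dashboard exported by your own Grafana version is safer. The names `minimal_dashboard` and `write_dashboard` are hypothetical.

```python
import json
from pathlib import Path


def minimal_dashboard(title: str, uid: str, expr: str) -> dict:
    """Build a one-panel dashboard that plots a single PromQL expression."""
    return {
        "uid": uid,
        "title": title,
        "schemaVersion": 39,  # assumption: take the real value from an export
        "panels": [
            {
                "id": 1,
                "type": "timeseries",
                "title": title,
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                # No datasource is set, so the default data source is used.
                "targets": [{"refId": "A", "expr": expr}],
            }
        ],
    }


def write_dashboard(dashboard: dict, directory: str) -> Path:
    """Write the dashboard as <uid>.json into the provisioned directory."""
    path = Path(directory) / f"{dashboard['uid']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dashboard, indent=2), encoding="utf-8")
    return path
```

Writing the result into the directory that is mounted at /var/lib/grafana/dashboards lets the file provider pick it up.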

Terraform

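resource "grafana_dashboard" "metrics" {
  config_json = file("dashboard.json")
}

# Folder referenced by the rule group below
resource "grafana_folder" "alerts" {
  title = "Alerts"
}

# The Grafana provider manages alert rules as rule groups
resource "grafana_rule_group" "cpu" {
  name             = "cpu_alerts"
  folder_uid       = grafana_folder.alerts.uid
  interval_seconds = 60

  # Add at least one rule block (name, condition, data queries) here,
  # for example a rule named "High CPU"
}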

Grafana Loki

Promtail Configuration

# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/lib/docker/containers/*/*.log
    pipeline_stages:
      - json:
          expressions:
            log: log
            stream: stream
            time: time
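
The `json` stage above extracts three fields from each line written by Docker's json-file logging driver. As an illustration of what that extraction sees, the sketch below parses one such line in Python. The sample line is made up, and stripping the trailing newline is a convenience of this sketch rather than something the stage does.

```python
import json


def parse_docker_log_line(raw: str) -> dict:
    """Extract the log, stream and time fields from a json-file log line."""
    record = json.loads(raw)
    return {
        "log": record.get("log", "").rstrip("\n"),  # message written by the container
        "stream": record.get("stream"),             # "stdout" or "stderr"
        "time": record.get("time"),                 # RFC 3339 timestamp string
    }


# Hypothetical line as found in /var/lib/docker/containers/<id>/<id>-json.log
sample_line = (
    '{"log":"GET /health 200\\n","stream":"stdout",'
    '"time":"2024-01-01T12:00:00.000000000Z"}'
)
fields = parse_docker_log_line(sample_line)
```

After parsing, `fields["log"]` holds the original message and `fields["stream"]` tells whether it came from stdout or stderr.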

Querying Logs

# Recent errors (LogQL has no "| limit" stage; set the line limit, e.g. 100, in the query options)
{job="containerlogs"} |= "error"

# Count by level
sum by (level) (count_over_time({app="api"}[1h]))
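
Outside Grafana, the same queries can be sent to Loki's HTTP API, where the maximum number of returned lines is controlled by the `limit` query parameter of `/loki/api/v1/query_range`. The sketch below only builds the URL; it assumes Loki is reachable on the published port 3100 and that timestamps are given as Unix epoch nanoseconds. The helper name `loki_query_range_url` is hypothetical.

```python
import urllib.parse


def loki_query_range_url(base_url: str, query: str,
                         start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a query_range URL that returns at most `limit` log lines."""
    params = urllib.parse.urlencode({
        "query": query,
        "start": start_ns,        # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{base_url}/loki/api/v1/query_range?{params}"
```

For example, `loki_query_range_url("http://localhost:3100", '{job="containerlogs"} |= "error"', start_ns, end_ns)` yields a URL that can be fetched with `urllib.request.urlopen` or `curl`.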

Performance Tuning

Data Retention

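# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Retention is normally set with command-line flags rather than in prometheus.yml:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB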
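
To judge whether a 50GB size cap fits a 15-day retention window, the Prometheus documentation gives a rough formula: disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it; the figure of 100,000 active series is an assumption chosen only for the example.

```python
def needed_disk_bytes(active_series: int,
                      scrape_interval_s: float,
                      retention_days: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size estimate for a given retention period."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 100,000 active series scraped every 15 s, kept for 15 days.
estimate = needed_disk_bytes(100_000, scrape_interval_s=15, retention_days=15)
estimate_gb = estimate / 1e9
```

Under these assumptions the estimate comes to about 17.3 GB, which leaves headroom below a 50GB limit. Real usage also depends on label churn and on the write-ahead log, so this is a starting point rather than a guarantee.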

Cache Settings

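# grafana.ini
# Note: query caching is a Grafana Enterprise / Grafana Cloud feature,
# so this section is not expected to take effect on the grafana-oss image.
[caching]
enabled = true
ttl = 300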
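
The `ttl` setting expresses a time-to-live: a cached query result is reused until it is older than the TTL, after which the query runs again. The sketch below illustrates that idea with a tiny in-memory cache. It is a conceptual model only, not Grafana's implementation, and the class name `TTLCache` is invented for the example.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Reuse a computed value until it is older than ttl_s seconds."""

    def __init__(self, ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so the behavior can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]                  # still fresh: skip the query
        value = compute()                  # expired or missing: run the query
        self._entries[key] = (now, value)
        return value
```

With a TTL of 300 seconds, a dashboard that refreshes every 30 seconds would hit the data source about once in ten refreshes per distinct query, at the cost of showing data up to five minutes old.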

Security

Authentication

# grafana.ini
[auth]
disable_login_form = false

[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml

[auth.generic_oauth]
enabled = true
name = Keycloak
client_id = grafana
client_secret = secret
auth_url = https://keycloak/auth
token_url = https://keycloak/token
api_url = https://keycloak/userinfo

HTTPS

[server]
protocol = https
cert_file = /etc/grafana/cert.pem
cert_key = /etc/grafana/key.pem
