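Jaeger Distributed Tracing System

Use Jaeger to trace request flows across a microservice architecture and to pinpoint performance bottlenecks and root causes.

Project Overview

Jaeger is an open-source distributed tracing system originally developed at Uber and now a CNCF graduated project. It is used to monitor and troubleshoot microservice architectures, and it supports OpenTelemetry.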

GitHub Stars: 22K+

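Key Features

  • Distributed tracing - end-to-end request tracing
  • Root cause analysis - locate the source of performance problems
  • Service dependencies - automatically generated service graph
  • Performance optimization - identify latency bottlenecks
  • OpenTelemetry - native integration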

Architecture Components

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Agent     │───▶│  Collector  │───▶│   Storage   │
└─────────────┘    └─────────────┘    └─────────────┘
                   ┌─────────────┐
                   │    Query    │
                   └─────────────┘
                   ┌─────────────┐
                   │     UI      │
                   └─────────────┘

Installation

All-in-One (for development)

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  -p 14250:14250 \
  -p 14268:14268 \
  jaegertracing/all-in-one:latest

Open http://localhost:16686 in a browser to access the Jaeger UI.
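Besides opening the UI, one way to confirm that the container is serving requests is to ask it which services it has seen. The sketch below assumes the JSON endpoint `/api/services` that the UI itself calls on port 16686; that endpoint is internal rather than a stable public API, so treat the response shape as an assumption. The function names `parse_services` and `fetch_services` are illustrative.

```python
import json
import urllib.request

# Assumption: the Jaeger UI backend is reachable on the port published above.
JAEGER_QUERY_URL = "http://localhost:16686"


def parse_services(payload):
    """Extract the sorted service names from an /api/services response body."""
    body = json.loads(payload)
    # "data" can be null when no service has reported spans yet.
    return sorted(body.get("data") or [])


def fetch_services(base_url=JAEGER_QUERY_URL):
    """Ask a running Jaeger instance which services it knows about."""
    with urllib.request.urlopen(f"{base_url}/api/services", timeout=5) as resp:
        return parse_services(resp.read().decode("utf-8"))
```

Calling `fetch_services()` against a fresh all-in-one container is expected to return a short or empty list until an instrumented application has sent spans.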

Docker Compose

version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
      - "14268:14268"   # Thrift HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Kubernetes

kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.50.0/jaeger-operator.yaml -n observability

OpenTelemetry Integration

Python

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "my-service"})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Usage
with tracer.start_as_current_span("operation") as span:
    span.set_attribute("user.id", "12345")
    # Business logic goes here

Go

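package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    // The semconv version below is an example; use the one matching your SDK release.
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer installs a global TracerProvider that exports spans to Jaeger
// over OTLP/gRPC and returns a function that flushes and shuts it down.
func initTracer() func() {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL, // NewWithAttributes takes the schema URL first
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return func() { _ = tp.Shutdown(context.Background()) }
}

func main() {
    shutdown := initTracer()
    defer shutdown() // flush buffered spans before exit
}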

Node.js

const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-grpc");

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4317",
  }),
  serviceName: "my-node-service",
});

sdk.start();

Storage Backends

Elasticsearch

services:
  jaeger-collector:
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200

  elasticsearch:
    image: elasticsearch:8.x

Cassandra

services:
  jaeger-collector:
    environment:
      - SPAN_STORAGE_TYPE=cassandra
      - CASSANDRA_SERVERS=cassandra

  cassandra:
    image: cassandra:4.x

Kafka (buffering)

services:
  jaeger-collector:
    environment:
      - SPAN_STORAGE_TYPE=kafka
      - KAFKA_PRODUCER_BROKERS=kafka:9092

  jaeger-ingester:
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - KAFKA_CONSUMER_BROKERS=kafka:9092

Querying and Analysis

Searching Traces

# UI query
Service: my-service
Operation: /api/users
Tags: http.status_code=500
Min Duration: 1s
Limit: 20
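The same search can be expressed as an HTTP request, which is handy for scripting. The sketch below only builds the URL; it assumes the `/api/traces` endpoint used by the UI and its `service`, `operation`, `tags`, `minDuration`, and `limit` query parameters. That endpoint is internal and not a versioned public API, and `build_trace_search_url` is a hypothetical helper name.

```python
import json
from urllib.parse import urlencode


def build_trace_search_url(base_url, service, operation=None, tags=None,
                           min_duration=None, limit=20):
    """Build a trace search URL mirroring the fields of the UI search form."""
    params = {"service": service, "limit": str(limit)}
    if operation:
        params["operation"] = operation
    if tags:
        # Assumption: tags are passed as a compact JSON object of string values.
        params["tags"] = json.dumps(tags, separators=(",", ":"))
    if min_duration:
        # Duration strings such as "1s" or "250ms".
        params["minDuration"] = min_duration
    return f"{base_url}/api/traces?{urlencode(params)}"


url = build_trace_search_url(
    "http://localhost:16686",
    service="my-service",
    operation="/api/users",
    tags={"http.status_code": "500"},
    min_duration="1s",
    limit=20,
)
```

The resulting `url` can be fetched with any HTTP client; the response format is not covered here.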

Comparing Traces

  1. Select two or more traces
  2. Click Compare
  3. Analyze the differences
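The comparison steps above can also be approximated offline. The following is a minimal sketch, not how the Jaeger UI computes its diff: it assumes each trace is a list of span dictionaries with `operationName` and `duration` (in microseconds) fields, and the sample traces `fast` and `slow` are invented for illustration.

```python
from collections import defaultdict


def total_duration_by_operation(spans):
    """Sum span durations (microseconds) per operation name."""
    totals = defaultdict(int)
    for span in spans:
        totals[span["operationName"]] += span["duration"]
    return dict(totals)


def diff_traces(baseline, candidate):
    """Return {operation: candidate_total - baseline_total} for changed operations."""
    before = total_duration_by_operation(baseline)
    after = total_duration_by_operation(candidate)
    diff = {}
    for op in sorted(set(before) | set(after)):
        delta = after.get(op, 0) - before.get(op, 0)
        if delta != 0:
            diff[op] = delta
    return diff


# Hypothetical sample data: a fast and a slow run of the same request.
fast = [
    {"operationName": "GET /api/users", "duration": 12000},
    {"operationName": "db-query", "duration": 8000},
]
slow = [
    {"operationName": "GET /api/users", "duration": 950000},
    {"operationName": "db-query", "duration": 930000},
    {"operationName": "cache-miss", "duration": 500},
]

changes = diff_traces(fast, slow)
```

Here `changes` shows that almost all of the extra time in the slow trace sits in `db-query`, which is the kind of conclusion the Compare view is meant to support.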

Service Performance

  • P50/P95/P99 latency
  • Request rate
  • Error rate
  • Service dependency graph
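To make the latency, request-rate, and error-rate metrics above concrete, here is a small sketch that derives them from a batch of spans. It assumes a simplified span shape with `duration_ms` and `error` fields (hypothetical names) and uses the nearest-rank percentile definition; Jaeger's own aggregation may differ.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for integer pct in 1..100."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(spans, window_seconds):
    """Compute latency percentiles, request rate, and error rate for a time window."""
    durations = sorted(span["duration_ms"] for span in spans)
    errors = sum(1 for span in spans if span["error"])
    return {
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
        "p99": percentile(durations, 99),
        "request_rate": len(spans) / window_seconds,  # requests per second
        "error_rate": errors / len(spans),            # fraction of failed requests
    }


# Hypothetical data: 100 spans over 50 seconds, every tenth one failing.
sample = [{"duration_ms": d, "error": d % 10 == 0} for d in range(1, 101)]
stats = summarize(sample, window_seconds=50)
```

Tail percentiles such as P99 are usually more useful than the average for spotting the slow requests that tracing is meant to explain.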

Span Attributes

Setting Attributes

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("db-query") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT * FROM users")
    span.set_attribute("db.name", "mydb")

Standard Attributes

Attribute         Description
http.method       HTTP method
http.url          Request URL
http.status_code  HTTP status code
db.system         Database type
db.statement      SQL statement
rpc.system        RPC system

Sampling Strategies

Collector Configuration

# sampling.json
{
  "service_strategies": [
    {
      "service": "my-service",
      "type": "probabilistic",
      "param": 0.5
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  }
}
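The file above gives `my-service` its own rate and every other service the default. The sketch below illustrates that lookup with the standard `json` module; it is a simplified reading of the file, not the collector's implementation, and `strategy_for` is a hypothetical helper.

```python
import json

# Same content as sampling.json above, inlined so the example is self-contained.
SAMPLING_JSON = """
{
  "service_strategies": [
    {"service": "my-service", "type": "probabilistic", "param": 0.5}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.1}
}
"""


def strategy_for(config, service):
    """Return (type, param) for a service, falling back to default_strategy."""
    for entry in config.get("service_strategies", []):
        if entry["service"] == service:
            return entry["type"], entry["param"]
    default = config["default_strategy"]
    return default["type"], default["param"]


config = json.loads(SAMPLING_JSON)
```

With this configuration, `strategy_for(config, "my-service")` yields a 50% probabilistic strategy, while any unlisted service falls back to 10%.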

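Sampling Types

Type           Description
const          Constant decision: sample every trace (1) or none (0)
probabilistic  Samples a fixed fraction of traces
ratelimiting   Samples at most a set number of traces per second
remote         Strategy is fetched from the Jaeger backend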
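The first three sampler types can be sketched in a few lines each. The classes below are illustrative only and are not Jaeger's actual implementations: the probabilistic sampler assumes a decision derived from the low 63 bits of the trace ID so that it is repeatable for a given trace, and the rate limiter assumes a simple token bucket. The `remote` type is omitted because it only delegates to whichever strategy the backend returns.

```python
import time


class ConstSampler:
    """Always makes the same decision: sample everything or nothing."""

    def __init__(self, decision):
        self.decision = bool(decision)

    def should_sample(self, trace_id):
        return self.decision


class ProbabilisticSampler:
    """Samples roughly `rate` of all traces, deterministically per trace ID."""

    _MAX = (1 << 63) - 1

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self._threshold = int(rate * self._MAX)

    def should_sample(self, trace_id):
        return (trace_id & self._MAX) < self._threshold


class RateLimitingSampler:
    """Token bucket that allows about `traces_per_second` sampled traces."""

    def __init__(self, traces_per_second, clock=time.monotonic):
        self._rate = float(traces_per_second)
        self._capacity = max(self._rate, 1.0)
        self._clock = clock
        self._tokens = self._capacity
        self._last = clock()

    def should_sample(self, trace_id):
        now = self._clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The `clock` parameter exists only so that the rate limiter can be driven by a fake clock in tests; in normal use the default monotonic clock is sufficient.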

Jaeger Operator

Deployment

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: production
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  collector:
    replicas: 2
  query:
    replicas: 2

Automatic Injection

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    sidecar.jaegertracing.io/inject: "true"
