AWS CloudWatch Container Insights

AWS CloudWatch Container Insights for Container Monitoring

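Introduction

In modern cloud-native architectures, monitoring containerized applications is increasingly important. AWS CloudWatch Container Insights provides a complete solution for collecting, aggregating, and analyzing metrics and logs from containerized applications and microservices. This article explains how to set up and use Container Insights on an Amazon EKS cluster.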

1. Container Insights Overview

CloudWatch Container Insights is a feature of AWS CloudWatch designed for monitoring containerized environments. It automatically collects metrics and logs from Amazon EKS, Amazon ECS, Kubernetes, and Fargate.

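Key features

  • Automated metric collection: collects infrastructure metrics such as CPU, memory, disk, and network automatically
  • Container-level visibility: provides performance data at every level, from cluster and node down to Pod and container
  • Performance log analysis: collects and analyzes logs from containerized applications
  • Prebuilt dashboards: ready-made visualizations for a quick view of system state
  • Anomaly detection: integrates with CloudWatch anomaly detection to flag unusual patterns automatically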

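Supported platforms

Container Insights supports metric and log collection on the following platforms:

  • Amazon EKS
  • Amazon ECS
  • Amazon EKS Fargate
  • Self-managed Kubernetes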

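2. Enabling Container Insights on an EKS Cluster

Prerequisites

Before you begin, make sure you have:

  • A running Amazon EKS cluster
  • kubectl configured with access to the cluster
  • Appropriate IAM permissions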

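Method 1: Enable with the AWS CLI

The simplest approach is to install the Amazon CloudWatch Observability EKS add-on with the AWS CLI, which deploys Container Insights to the cluster. The update-cluster-config --logging command only turns on EKS control plane logging; it complements Container Insights but does not enable it on its own:

# Install the CloudWatch Observability add-on (enables Container Insights)
aws eks create-addon \
    --cluster-name <cluster-name> \
    --region <region> \
    --addon-name amazon-cloudwatch-observability
# Optional: also send EKS control plane logs to CloudWatch Logs
aws eks update-cluster-config \
    --name <cluster-name> \
    --region <region> \
    --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'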

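Method 2: Enable When Creating the Cluster

If you are creating a new EKS cluster, you can create it with eksctl and install the CloudWatch Observability add-on as soon as the cluster is ready:

eksctl create cluster \
    --name my-cluster \
    --region ap-northeast-1 \
    --with-oidc \
    --managed
# Install the add-on that enables Container Insights
eksctl create addon \
    --name amazon-cloudwatch-observability \
    --cluster my-cluster \
    --region ap-northeast-1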

Configuring IAM permissions

For the CloudWatch Agent to send metrics to CloudWatch, attach the appropriate policy to the node IAM role:

# Attach the CloudWatchAgentServerPolicy managed policy
aws iam attach-role-policy \
    --role-name <node-role-name> \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

Alternatively, you can create a custom IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ec2:DescribeVolumes",
                "ec2:DescribeTags",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
                "logs:DescribeLogGroups",
                "logs:CreateLogStream",
                "logs:CreateLogGroup"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter"
            ],
            "Resource": "arn:aws:ssm:*:*:parameter/AmazonCloudWatch-*"
        }
    ]
}
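
The policy document above can also be assembled and sanity-checked in Python before it is handed to IAM. This is a minimal sketch: the helper names `build_agent_policy` and `create_policy` and the policy name `ContainerInsightsAgentPolicy` are illustrative assumptions, and `create_policy` is only defined here (calling it needs `boto3` and valid credentials).

```python
import json


def build_agent_policy() -> dict:
    """Return the custom CloudWatch Agent policy shown above as a dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:PutMetricData",
                    "ec2:DescribeVolumes",
                    "ec2:DescribeTags",
                    "logs:PutLogEvents",
                    "logs:DescribeLogStreams",
                    "logs:DescribeLogGroups",
                    "logs:CreateLogStream",
                    "logs:CreateLogGroup",
                ],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["ssm:GetParameter"],
                "Resource": "arn:aws:ssm:*:*:parameter/AmazonCloudWatch-*",
            },
        ],
    }


def create_policy(policy_name: str = "ContainerInsightsAgentPolicy") -> str:
    """Create the managed policy and return its ARN (hypothetical policy name)."""
    import boto3  # imported lazily so the rest of the sketch runs without boto3

    iam = boto3.client("iam")
    response = iam.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_agent_policy()),
    )
    return response["Policy"]["Arn"]


if __name__ == "__main__":
    # Print the document so it can be reviewed or saved to a file
    print(json.dumps(build_agent_policy(), indent=4))
```

Keeping the document in code makes it easy to review changes to the allowed actions before attaching the policy to the node role.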

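3. Deploying the CloudWatch Agent and Fluent Bit

Container Insights uses the CloudWatch Agent to collect metrics and Fluent Bit to collect logs. AWS provides a quick start manifest that simplifies the deployment.

Using the quick start script

# Set environment variables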
ClusterName=<your-cluster-name>
RegionName=<your-region>
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'

# Download and apply the CloudWatch Agent and Fluent Bit configuration
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
sed "s/{{cluster_name}}/${ClusterName}/g;s/{{region_name}}/${RegionName}/g;s/{{http_server_toggle}}/${FluentBitHttpServer}/g;s/{{http_server_port}}/${FluentBitHttpPort}/g;s/{{read_from_head}}/${FluentBitReadFromHead}/g;s/{{read_from_tail}}/${FluentBitReadFromTail}/g" | \
kubectl apply -f -
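
The sed step above is plain placeholder substitution. The sketch below mirrors it in Python to make the logic explicit: the tail setting is the opposite of the head setting, and the HTTP server is switched on only when a port is set. The function names `quickstart_values` and `render_manifest` are illustrative assumptions, not part of the AWS tooling.

```python
import re


def quickstart_values(cluster_name, region_name, http_port="2020", read_from_head="Off"):
    """Derive the six placeholder values the same way the shell snippet does."""
    return {
        "cluster_name": cluster_name,
        "region_name": region_name,
        # The HTTP server is enabled only when a port is configured
        "http_server_toggle": "On" if http_port else "Off",
        "http_server_port": http_port,
        "read_from_head": read_from_head,
        # Reading from tail is always the opposite of reading from head
        "read_from_tail": "Off" if read_from_head == "On" else "On",
    }


def render_manifest(template: str, values: dict) -> str:
    """Replace every {{name}} placeholder; fail loudly on an unknown name."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return values[key]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


if __name__ == "__main__":
    sample = "cluster.name: {{cluster_name}}\nlogs.region: {{region_name}}"
    print(render_manifest(sample, quickstart_values("my-cluster", "ap-northeast-1")))
```

Unlike sed, this version raises an error when the manifest contains a placeholder that has no value, which catches template changes early.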

Deploying the CloudWatch Agent manually

If you need finer control, you can deploy each component manually.

Create the Namespace

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch
  labels:
    name: amazon-cloudwatch
EOF

Create the ServiceAccount

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
EOF

Create the ClusterRole and ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloudwatch-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cwagent-clusterleader"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cloudwatch-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: cloudwatch-agent
    namespace: amazon-cloudwatch
roleRef:
  kind: ClusterRole
  name: cloudwatch-agent-role
  apiGroup: rbac.authorization.k8s.io

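Create the ConfigMap

Replace the {{cluster_name}} placeholder with your cluster name before applying this manifest.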

apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "{{cluster_name}}",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }    

Deploy the DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      serviceAccountName: cloudwatch-agent
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247359.0b252275
          resources:
            limits:
              cpu: 400m
              memory: 400Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.11"
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60

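Deploying Fluent Bit

Fluent Bit collects container logs and forwards them to CloudWatch Logs. The manifest below references a fluent-bit ServiceAccount and a fluent-bit-config ConfigMap, so both must already exist in the amazon-cloudwatch namespace; set AWS_REGION and CLUSTER_NAME to match your cluster.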

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
          imagePullPolicy: Always
          env:
            - name: AWS_REGION
              value: "ap-northeast-1"
            - name: CLUSTER_NAME
              value: "my-cluster"
            - name: HTTP_SERVER
              value: "On"
            - name: HTTP_PORT
              value: "2020"
            - name: READ_FROM_HEAD
              value: "Off"
            - name: READ_FROM_TAIL
              value: "On"
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: fluentbitstate
              mountPath: /var/fluent-bit/state
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: runlogjournal
              mountPath: /run/log/journal
              readOnly: true
            - name: dmesg
              mountPath: /var/log/dmesg
              readOnly: true
      volumes:
        - name: fluentbitstate
          hostPath:
            path: /var/fluent-bit/state
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
        - name: runlogjournal
          hostPath:
            path: /run/log/journal
        - name: dmesg
          hostPath:
            path: /var/log/dmesg
      terminationGracePeriodSeconds: 10

Verifying the deployment

# Check Pod status
kubectl get pods -n amazon-cloudwatch

# Check the CloudWatch Agent logs
kubectl logs -n amazon-cloudwatch -l name=cloudwatch-agent

# Check the Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit
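
The Pod status check above can be automated by parsing the output of kubectl get pods -n amazon-cloudwatch -o json. The following is a minimal sketch; `pods_not_ready` is a hypothetical helper name, and the sample data only imitates the fields the function reads.

```python
import json


def pods_not_ready(pod_list: dict) -> list:
    """Return names of Pods that are not Running or have a container that is not ready."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        running = status.get("phase") == "Running"
        # A Pod with no reported container statuses is treated as not ready
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if not (running and all_ready):
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Illustrative stand-in for `kubectl get pods -n amazon-cloudwatch -o json`
    sample = json.loads(
        '{"items": [{"metadata": {"name": "cloudwatch-agent-abc"},'
        ' "status": {"phase": "Running", "containerStatuses": [{"ready": true}]}}]}'
    )
    print(pods_not_ready(sample) or "all Pods are ready")
```

In a deployment pipeline, a non-empty result would be a reasonable signal to stop and inspect the agent logs with the commands above.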

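4. Collected Metrics and Logs

Automatically collected metrics

Container Insights automatically collects the following categories of metrics:

Cluster-level metrics

Metric name                          Description
cluster_failed_node_count            Number of failed nodes
cluster_node_count                   Total number of nodes
namespace_number_of_running_pods     Number of running Pods per namespace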

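Node-level metrics

Metric name                          Description
node_cpu_utilization                 Node CPU utilization
node_memory_utilization              Node memory utilization
node_network_total_bytes             Total network traffic in bytes
node_filesystem_utilization          File system utilization
node_number_of_running_pods          Number of Pods running on the node
node_number_of_running_containers    Number of containers running on the node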

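Pod-level metrics

Metric name                              Description
pod_cpu_utilization                      Pod CPU utilization
pod_memory_utilization                   Pod memory utilization
pod_network_rx_bytes                     Bytes received by the Pod
pod_network_tx_bytes                     Bytes transmitted by the Pod
pod_cpu_utilization_over_pod_limit       CPU utilization as a percentage of the Pod limit
pod_memory_utilization_over_pod_limit    Memory utilization as a percentage of the Pod limit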

Container-level metrics

Metric name                     Description
container_cpu_utilization       Container CPU utilization
container_memory_utilization    Container memory utilization
container_filesystem_usage      Container file system usage

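Log types

Container Insights collects the following types of logs:

  1. Application logs: logs from container stdout and stderr
  2. Performance logs: structured JSON logs containing detailed performance data
  3. Host logs: system-level logs, including dataplane logs

Log group structure

Logs are stored in the following CloudWatch Logs log groups: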

/aws/containerinsights/<cluster-name>/application
/aws/containerinsights/<cluster-name>/dataplane
/aws/containerinsights/<cluster-name>/host
/aws/containerinsights/<cluster-name>/performance
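
Because the log group names follow a fixed pattern, scripts can derive them from the cluster name instead of hard-coding each one. A small sketch, assuming the four groups listed above; `log_group_names` is a hypothetical helper:

```python
LOG_TYPES = ("application", "dataplane", "host", "performance")


def log_group_names(cluster_name: str) -> dict:
    """Map each Container Insights log type to its log group name."""
    return {
        log_type: f"/aws/containerinsights/{cluster_name}/{log_type}"
        for log_type in LOG_TYPES
    }


if __name__ == "__main__":
    for log_type, group in log_group_names("my-cluster").items():
        print(f"{log_type:12} {group}")
```

The same mapping is handy later when applying retention policies or metric filters to every group of a cluster.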

Querying with CloudWatch Logs Insights

You can use CloudWatch Logs Insights to query and analyze the logs. The first three queries run against the application log group; the last one runs against the performance log group:

# Logs for a specific Pod
fields @timestamp, @message
| filter kubernetes.pod_name = "my-app-pod"
| sort @timestamp desc
| limit 100

# Error logs
fields @timestamp, @message, kubernetes.pod_name
| filter @message like /error|Error|ERROR/
| sort @timestamp desc
| limit 50

# Log count per namespace
stats count(*) as log_count by kubernetes.namespace_name
| sort log_count desc

# Pods with high CPU utilization (performance log group)
fields @timestamp, PodName, pod_cpu_utilization
| filter Type = "Pod"
| filter pod_cpu_utilization > 80
| sort @timestamp desc
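
Queries like the ones above can also be run from code. The sketch below uses the boto3 `start_query` and `get_query_results` calls and polls until the query finishes; `run_query` and `rows_to_dicts` are hypothetical helper names, and `run_query` needs `boto3` and AWS credentials when called.

```python
import time


def rows_to_dicts(results: list) -> list:
    """Flatten Logs Insights rows ([{'field': ..., 'value': ...}, ...]) into dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(log_group: str, query: str, minutes: int = 60, poll_seconds: float = 1.0) -> list:
    """Run a query over the last `minutes` minutes and return the rows as dicts."""
    import boto3  # imported lazily so rows_to_dicts works without boto3

    logs = boto3.client("logs")
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )["queryId"]

    # Poll until the query leaves the Scheduled/Running states
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

A call such as run_query("/aws/containerinsights/my-cluster/application", "fields @timestamp, @message | limit 20") would return a list of dictionaries keyed by field name.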

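5. Custom Metrics

Beyond the built-in metrics, you can configure custom metrics to monitor application-specific performance indicators.

Using Prometheus metrics

Container Insights supports collecting metrics from Prometheus endpoints. First, update the CloudWatch Agent configuration: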

apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "my-cluster",
            "metrics_collection_interval": 60
          },
          "prometheus": {
            "cluster_name": "my-cluster",
            "log_group_name": "/aws/containerinsights/my-cluster/prometheus",
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration_dedup": true,
              "metric_namespace": "ContainerInsights/Prometheus",
              "metric_unit": {
                "http_requests_total": "Count",
                "http_request_duration_seconds": "Seconds"
              },
              "metric_declaration": [
                {
                  "source_labels": ["job"],
                  "label_matcher": "^my-app$",
                  "dimensions": [["ClusterName","Namespace","Service"]],
                  "metric_selectors": [
                    "^http_requests_total$",
                    "^http_request_duration_seconds.*$"
                  ]
                }
              ]
            }
          }
        }
      }
    }    
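
To reason about which scraped metrics a metric_declaration selects, it can help to model the matching in a few lines of Python. This is a simplified sketch, not the agent's implementation: it assumes the values of source_labels are joined with ";" and that label_matcher and metric_selectors are applied as unanchored regular expression searches. `declaration_matches` is a hypothetical name.

```python
import re


def declaration_matches(declaration: dict, labels: dict, metric_name: str,
                        separator: str = ";") -> bool:
    """Return True if a scraped metric would be selected by one metric_declaration."""
    # Join the values of the configured source labels (missing labels count as empty)
    joined = separator.join(labels.get(name, "") for name in declaration["source_labels"])
    if not re.search(declaration["label_matcher"], joined):
        return False
    # The metric is kept when any selector matches its name
    return any(re.search(sel, metric_name) for sel in declaration["metric_selectors"])


if __name__ == "__main__":
    declaration = {
        "source_labels": ["job"],
        "label_matcher": "^my-app$",
        "metric_selectors": ["^http_requests_total$", "^http_request_duration_seconds.*$"],
    }
    print(declaration_matches(declaration, {"job": "my-app"}, "http_requests_total"))
```

With the declaration from the ConfigMap above, histogram series such as http_request_duration_seconds_bucket are selected, while unrelated process metrics are dropped.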

Create the Prometheus configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: amazon-cloudwatch
data:
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'my-app'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__    
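
The last relabel rule is the least obvious one: it rewrites the scrape address so that it uses the port from the prometheus.io/port annotation. The sketch below reproduces that rule in Python, relying on the Prometheus behavior that source label values are joined with ";" and the regex must match the whole string. `relabel_address` is a hypothetical helper name.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Apply the __address__ rule: join with ';', full-match, then rewrite to host:port."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # A replace rule whose regex does not match leaves the label unchanged
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(relabel_address("10.0.0.5:9090", "8080"))
```

The optional non-capturing group discards any port already present in the address, so the annotation always wins when it is set.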

Application integration

Add a Prometheus metrics endpoint to your application. The following is a Python Flask example:

from flask import Flask
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.route('/api/data')
def get_data():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').time():
        REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
        return {'data': 'example'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Adding Prometheus annotations to Pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # A Deployment requires a selector that matches the Pod template labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080
Sending custom metrics to CloudWatch

You can also send custom metrics directly with the AWS SDK:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch', region_name='ap-northeast-1')

def put_custom_metric(namespace, metric_name, value, dimensions):
    """Publish a single data point with the unit Count."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Dimensions': dimensions,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'Timestamp': datetime.now(timezone.utc),
                'Value': value,
                'Unit': 'Count'
            }
        ]
    )

# Usage example
put_custom_metric(
    namespace='MyApp/CustomMetrics',
    metric_name='OrdersProcessed',
    value=42,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Service', 'Value': 'order-service'}
    ]
)

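6. Dashboards and Visualization

Automatic dashboards

Once Container Insights is enabled, AWS provides default dashboards automatically in the CloudWatch console:

  1. Open the CloudWatch console
  2. Choose "Container Insights" in the left navigation pane
  3. Select your cluster to view its dashboards

Creating a custom dashboard

You can build a custom dashboard with CloudWatch Dashboards: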

aws cloudwatch put-dashboard \
    --dashboard-name "EKS-Custom-Dashboard" \
    --dashboard-body '{
        "widgets": [
            {
                "type": "metric",
                "x": 0,
                "y": 0,
                "width": 12,
                "height": 6,
                "properties": {
                    "metrics": [
                        ["ContainerInsights", "pod_cpu_utilization", "ClusterName", "my-cluster", "Namespace", "default", {"stat": "Average"}]
                    ],
                    "title": "Pod CPU Utilization",
                    "region": "ap-northeast-1",
                    "period": 300
                }
            },
            {
                "type": "metric",
                "x": 12,
                "y": 0,
                "width": 12,
                "height": 6,
                "properties": {
                    "metrics": [
                        ["ContainerInsights", "pod_memory_utilization", "ClusterName", "my-cluster", "Namespace", "default", {"stat": "Average"}]
                    ],
                    "title": "Pod Memory Utilization",
                    "region": "ap-northeast-1",
                    "period": 300
                }
            },
            {
                "type": "metric",
                "x": 0,
                "y": 6,
                "width": 24,
                "height": 6,
                "properties": {
                    "metrics": [
                        ["ContainerInsights", "node_cpu_utilization", "ClusterName", "my-cluster", {"stat": "Average"}],
                        [".", "node_memory_utilization", ".", ".", {"stat": "Average"}]
                    ],
                    "title": "Node Resource Utilization",
                    "region": "ap-northeast-1",
                    "period": 300
                }
            }
        ]
    }'
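
Hand-written dashboard JSON gets repetitive quickly, so it can be worth generating the body from a small helper. The sketch below builds the two Pod widgets from the command above; `metric_widget` and `pod_dashboard_body` are hypothetical names, and the default Region and period simply repeat the values used in this article.

```python
import json


def metric_widget(title, metrics, x, y, width=12, height=6,
                  region="ap-northeast-1", period=300):
    """Return one metric widget in the CloudWatch dashboard body format."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "properties": {
            "metrics": metrics,
            "title": title,
            "region": region,
            "period": period,
        },
    }


def pod_dashboard_body(cluster: str, namespace: str = "default") -> str:
    """Build a dashboard body with Pod CPU and memory widgets side by side."""

    def pod_metric(name):
        return [["ContainerInsights", name, "ClusterName", cluster,
                 "Namespace", namespace, {"stat": "Average"}]]

    widgets = [
        metric_widget("Pod CPU Utilization", pod_metric("pod_cpu_utilization"), x=0, y=0),
        metric_widget("Pod Memory Utilization", pod_metric("pod_memory_utilization"), x=12, y=0),
    ]
    return json.dumps({"widgets": widgets})


if __name__ == "__main__":
    print(pod_dashboard_body("my-cluster"))
```

The resulting string can be passed to aws cloudwatch put-dashboard as the --dashboard-body value, or to the boto3 `put_dashboard` call as `DashboardBody`.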

Creating a dashboard with CloudFormation

AWSTemplateFormatVersion: '2010-09-09'
Description: Container Insights Dashboard

Parameters:
  ClusterName:
    Type: String
    Description: EKS Cluster Name

Resources:
  ContainerInsightsDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${ClusterName}-container-insights"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0,
              "y": 0,
              "width": 8,
              "height": 6,
              "properties": {
                "metrics": [
                  ["ContainerInsights", "cluster_node_count", "ClusterName", "${ClusterName}"]
                ],
                "title": "Node Count",
                "region": "${AWS::Region}",
                "stat": "Average",
                "period": 60
              }
            },
            {
              "type": "metric",
              "x": 8,
              "y": 0,
              "width": 8,
              "height": 6,
              "properties": {
                "metrics": [
                  ["ContainerInsights", "namespace_number_of_running_pods", "ClusterName", "${ClusterName}", "Namespace", "default"]
                ],
                "title": "Running Pods (default namespace)",
                "region": "${AWS::Region}",
                "stat": "Average",
                "period": 60
              }
            },
            {
              "type": "log",
              "x": 0,
              "y": 6,
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/containerinsights/${ClusterName}/application' | fields @timestamp, @message | sort @timestamp desc | limit 100",
                "region": "${AWS::Region}",
                "title": "Application Logs"
              }
            }
          ]
        }

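Integrating with Grafana

You can also visualize Container Insights metrics in Grafana:

  1. Add a CloudWatch data source in Grafana
  2. Configure AWS credentials (an IAM role or access keys)
  3. Create a dashboard and select the ContainerInsights namespace

7. Alarm Rules

Creating CloudWatch alarms

Set up alarms to monitor key metrics and notify you when something goes wrong: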

# High CPU utilization alarm
aws cloudwatch put-metric-alarm \
    --alarm-name "EKS-High-CPU-Utilization" \
    --alarm-description "Alert when CPU utilization exceeds 80%" \
    --metric-name pod_cpu_utilization \
    --namespace ContainerInsights \
    --statistic Average \
    --period 300 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=ClusterName,Value=my-cluster \
    --evaluation-periods 3 \
    --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts \
    --ok-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts

# High memory utilization alarm
aws cloudwatch put-metric-alarm \
    --alarm-name "EKS-High-Memory-Utilization" \
    --alarm-description "Alert when memory utilization exceeds 85%" \
    --metric-name pod_memory_utilization \
    --namespace ContainerInsights \
    --statistic Average \
    --period 300 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=ClusterName,Value=my-cluster \
    --evaluation-periods 3 \
    --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts

# Pod restart alarm
# pod_number_of_container_restarts is published per Pod, so the alarm also
# needs the Namespace and PodName dimensions (my-app is a placeholder)
aws cloudwatch put-metric-alarm \
    --alarm-name "EKS-Pod-Restart-Alert" \
    --alarm-description "Alert when pods restart frequently" \
    --metric-name pod_number_of_container_restarts \
    --namespace ContainerInsights \
    --statistic Sum \
    --period 300 \
    --threshold 5 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=default Name=PodName,Value=my-app \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts
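
The interaction between --period, --threshold, and --evaluation-periods is easier to see with a small model. The sketch below is a deliberate simplification: it assumes one data point per period, the GreaterThanThreshold operator, and that every evaluated period must breach, and it ignores CloudWatch's handling of missing data. `alarm_state` is a hypothetical name.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return the state after the latest data point under the simplified model."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent evaluation_periods values are considered
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    # Five-minute averages of pod_cpu_utilization, oldest first (illustrative values)
    cpu = [70, 85, 90, 95]
    print(alarm_state(cpu, threshold=80, evaluation_periods=3))
```

With a 300-second period and three evaluation periods, the CPU alarm above therefore needs roughly 15 minutes of sustained load before it fires, which filters out short spikes.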

Creating alarms with CloudFormation

AWSTemplateFormatVersion: '2010-09-09'
Description: Container Insights Alarms

Parameters:
  ClusterName:
    Type: String
  SNSTopicArn:
    Type: String

Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-high-cpu"
      AlarmDescription: High CPU utilization detected
      MetricName: pod_cpu_utilization
      Namespace: ContainerInsights
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn
      OKActions:
        - !Ref SNSTopicArn

  HighMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-high-memory"
      AlarmDescription: High memory utilization detected
      MetricName: pod_memory_utilization
      Namespace: ContainerInsights
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn

  NodeCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-node-count"
      AlarmDescription: Node count dropped below threshold
      MetricName: cluster_node_count
      Namespace: ContainerInsights
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn

Composite alarms

Create a composite alarm to combine several conditions:

aws cloudwatch put-composite-alarm \
    --alarm-name "EKS-Critical-Resource-Alert" \
    --alarm-rule "ALARM(EKS-High-CPU-Utilization) AND ALARM(EKS-High-Memory-Utilization)" \
    --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:critical-alerts \
    --alarm-description "Critical alert when both CPU and memory are high"

Log-based alarms

Use a metric filter to turn log events into a metric, then alarm on it:

# Create the metric filter
aws logs put-metric-filter \
    --log-group-name "/aws/containerinsights/my-cluster/application" \
    --filter-name "ErrorLogFilter" \
    --filter-pattern "ERROR" \
    --metric-transformations \
        metricName=ErrorCount,metricNamespace=ContainerInsights/Logs,metricValue=1

# Create the alarm
aws cloudwatch put-metric-alarm \
    --alarm-name "EKS-Application-Errors" \
    --alarm-description "Alert when application errors increase" \
    --metric-name ErrorCount \
    --namespace ContainerInsights/Logs \
    --statistic Sum \
    --period 300 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts

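8. Cost Optimization

Container Insights incurs charges for CloudWatch metrics and logs. The following strategies help keep those costs down:

Log retention settings

Shortening the log retention period can significantly reduce storage costs:

# Set the application log group retention period to 7 days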
aws logs put-retention-policy \
    --log-group-name "/aws/containerinsights/my-cluster/application" \
    --retention-in-days 7

# Set the performance log retention period to 3 days
aws logs put-retention-policy \
    --log-group-name "/aws/containerinsights/my-cluster/performance" \
    --retention-in-days 3

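Using S3 for long-term storage

Streaming logs to S3 lowers long-term storage costs:

# Create a subscription filter that streams logs to Kinesis Data Firehose, which delivers them to S3.
# --role-arn must name an IAM role that CloudWatch Logs can assume to write to the
# delivery stream (<cwl-to-firehose-role> is a placeholder).
aws logs put-subscription-filter \
    --log-group-name "/aws/containerinsights/my-cluster/application" \
    --filter-name "S3Export" \
    --filter-pattern "" \
    --destination-arn arn:aws:firehose:ap-northeast-1:123456789012:deliverystream/log-to-s3 \
    --role-arn arn:aws:iam::123456789012:role/<cwl-to-firehose-role>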

Adjusting the metric collection interval

Collecting metrics less often reduces costs:

# Increase the collection interval from 60 seconds to 300 seconds
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "my-cluster",
            "metrics_collection_interval": 300
          }
        }
      }
    }    

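Filtering out unnecessary logs

Use Fluent Bit filters to drop logs you do not need. Nested fields such as the Kubernetes namespace must be referenced with record accessor syntax:

[FILTER]
    Name    grep
    Match   *
    Exclude log health check
    Exclude log kube-probe

[FILTER]
    Name    grep
    Match   *
    Exclude $kubernetes['namespace_name'] kube-system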
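
Before rolling out exclusion rules, it is worth checking them against a few sample records. The sketch below imitates the intent of the two filters above in Python: a record is dropped when its log line matches either pattern or when it comes from the kube-system namespace. It is an approximation for testing patterns, not Fluent Bit's own matching code, and `keep_record` is a hypothetical name.

```python
import re

# Patterns taken from the Exclude rules above
EXCLUDE_LOG_PATTERNS = [re.compile(r"health check"), re.compile(r"kube-probe")]
EXCLUDE_NAMESPACE = re.compile(r"kube-system")


def keep_record(record: dict) -> bool:
    """Return False for records that the exclusion rules are meant to drop."""
    log_line = record.get("log", "")
    if any(pattern.search(log_line) for pattern in EXCLUDE_LOG_PATTERNS):
        return False
    namespace = record.get("kubernetes", {}).get("namespace_name", "")
    return EXCLUDE_NAMESPACE.search(namespace) is None


if __name__ == "__main__":
    records = [
        {"log": "GET /orders 200", "kubernetes": {"namespace_name": "default"}},
        {"log": "GET /healthz kube-probe/1.29", "kubernetes": {"namespace_name": "default"}},
    ]
    print([r["log"] for r in records if keep_record(r)])
```

Running real log samples through a check like this shows how much volume a rule would remove and whether it accidentally drops lines you still need.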

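Cost estimation

The main cost drivers for Container Insights:

Item              Billing basis           Optimization
Custom metrics    Per metric per month    Reduce the number of custom metrics
Log ingestion     Per GB                  Use filters to reduce log volume
Log storage       Per GB per month        Shorten the retention period
Log queries       Per GB scanned          Narrow the query scope
API requests      Per request             Reduce API call frequency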
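
The first three rows of the table can be combined into a rough monthly estimate. The sketch below is only an arithmetic aid: the `Rates` values must come from the current CloudWatch pricing page for your Region, and the numbers in the example are placeholders rather than real prices. It also assumes a steady state in which about retention_days of logs are stored at any time, and it ignores query and API charges.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Unit prices in USD; supply the current values for your Region."""
    per_metric_month: float
    per_gb_ingested: float
    per_gb_stored_month: float


def estimate_monthly_cost(rates, metric_count, ingest_gb_per_day,
                          retention_days, days_in_month=30):
    """Break a monthly estimate into metrics, ingestion, and storage."""
    breakdown = {
        "metrics": metric_count * rates.per_metric_month,
        "ingestion": ingest_gb_per_day * days_in_month * rates.per_gb_ingested,
        # Steady state: roughly retention_days of logs are stored at any time
        "storage": ingest_gb_per_day * retention_days * rates.per_gb_stored_month,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    placeholder_rates = Rates(per_metric_month=0.25, per_gb_ingested=0.5,
                              per_gb_stored_month=0.125)  # not real prices
    print(estimate_monthly_cost(placeholder_rates, metric_count=100,
                                ingest_gb_per_day=2, retention_days=7))
```

Even with placeholder rates, the breakdown shows which lever matters most for a given cluster, for example whether ingestion volume or metric count dominates.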

Monitoring costs with AWS Cost Explorer

Review Container Insights costs regularly. The End date is exclusive, so the range below covers all of May 2025, and Cost Explorer reports CloudWatch under the SERVICE value AmazonCloudWatch:

aws ce get-cost-and-usage \
    --time-period Start=2025-05-01,End=2025-06-01 \
    --granularity MONTHLY \
    --metrics BlendedCost \
    --filter '{
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["AmazonCloudWatch"]
        }
    }' \
    --group-by Type=DIMENSION,Key=USAGE_TYPE
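
The JSON returned by this command groups costs by usage type, which is easier to read once it is summed and sorted. A minimal sketch that works on the parsed response; `cost_by_usage_type` is a hypothetical helper, and the usage type names in the example are illustrative rather than taken from a real bill.

```python
def cost_by_usage_type(response: dict, metric: str = "BlendedCost") -> dict:
    """Sum the cost per usage type across all periods, largest first."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            usage_type = group["Keys"][0]
            amount = float(group["Metrics"][metric]["Amount"])
            totals[usage_type] = totals.get(usage_type, 0.0) + amount
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Illustrative response fragment with made-up usage types and amounts
    sample = {
        "ResultsByTime": [
            {
                "Groups": [
                    {"Keys": ["ExampleLogIngestion"],
                     "Metrics": {"BlendedCost": {"Amount": "12.5", "Unit": "USD"}}},
                    {"Keys": ["ExampleMetricUsage"],
                     "Metrics": {"BlendedCost": {"Amount": "30.0", "Unit": "USD"}}},
                ]
            }
        ]
    }
    for usage_type, amount in cost_by_usage_type(sample).items():
        print(f"{usage_type:24} {amount:8.2f}")
```

Sorting by amount puts the largest contributor first, which is usually where the optimization strategies in this section pay off most.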

Enabling cost alerts

Set up a budget alert to avoid unexpected overspending:

aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{
        "BudgetName": "CloudWatch-Monthly",
        "BudgetLimit": {
            "Amount": "100",
            "Unit": "USD"
        },
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        "CostFilters": {
            "Service": ["AmazonCloudWatch"]
        }
    }' \
    --notifications-with-subscribers '[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80
            },
            "Subscribers": [
                {
                    "SubscriptionType": "EMAIL",
                    "Address": "admin@example.com"
                }
            ]
        }
    ]'

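Conclusion

AWS CloudWatch Container Insights provides a comprehensive container monitoring solution that gives you detailed insight into the performance and health of your EKS cluster. With the setup described in this article, you can:

  • Deploy the CloudWatch Agent and Fluent Bit quickly to collect metrics and logs
  • Use the prebuilt dashboards for a quick view of system state
  • Configure custom metrics to monitor application-specific performance
  • Create alarm rules to detect and respond to problems promptly
  • Apply cost optimization strategies to keep monitoring spend under control

Start with the basic setup, expand monitoring coverage as you learn more about the system, and review costs regularly to stay within budget.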
