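Preface
In modern cloud-native architectures, monitoring containerized applications matters more and more. Amazon CloudWatch Container Insights provides a complete solution for collecting, aggregating, and analyzing metrics and logs from containerized applications and microservices. This article walks through how to set up and use Container Insights on an Amazon EKS cluster.
1. Container Insights Overview
CloudWatch Container Insights is a feature of Amazon CloudWatch designed specifically for monitoring container environments. It automatically collects metrics and logs from Amazon EKS, Amazon ECS, self-managed Kubernetes, and AWS Fargate.
Key features
- Automated metric collection: collects infrastructure metrics such as CPU, memory, disk, and network automatically
- Container-level visibility: provides performance data at the cluster, node, Pod, and container levels
- Performance log analysis: collects and analyzes logs from containerized applications
- Prebuilt dashboards: ready-made visual dashboards for a quick view of system state
- Anomaly detection: integrates with CloudWatch anomaly detection to identify unusual patterns automatically
Supported platforms
| Platform | Metric collection | Log collection |
|---|---|---|
| Amazon EKS | ✓ | ✓ |
| Amazon ECS | ✓ | ✓ |
| Amazon EKS Fargate | ✓ | ✓ |
| Self-managed Kubernetes | ✓ | ✓ |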
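2. Enabling Container Insights on an EKS Cluster
Prerequisites
Before you begin, make sure you have the following:
- A running Amazon EKS cluster
- kubectl configured with access to the cluster
- Appropriate IAM permissions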
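Method 1: Enable with the AWS CLI
The simplest starting point is the AWS CLI. Note that the command below turns on EKS control plane logging (API server, audit, authenticator, controller manager, and scheduler logs) in CloudWatch Logs. Container Insights metrics additionally require the agent deployment described in section 3.

```bash
# Enable control plane logging to CloudWatch Logs
aws eks update-cluster-config \
  --name <cluster-name> \
  --region <region> \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
```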
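The JSON passed to `--logging` is easy to mistype by hand. As a minimal sketch, assuming you would rather generate it, the helper below builds the same payload and rejects unknown log types. `build_cluster_logging` is a hypothetical name used for illustration and is not part of any AWS tool.

```python
import json

# The five control plane log types used in the command above.
VALID_LOG_TYPES = ("api", "audit", "authenticator", "controllerManager", "scheduler")


def build_cluster_logging(types=VALID_LOG_TYPES, enabled=True):
    """Return the compact JSON string expected by the --logging option."""
    unknown = [t for t in types if t not in VALID_LOG_TYPES]
    if unknown:
        raise ValueError(f"unsupported log types: {unknown}")
    payload = {"clusterLogging": [{"types": list(types), "enabled": enabled}]}
    return json.dumps(payload, separators=(",", ":"))


print(build_cluster_logging(("api", "audit")))
```

The printed string can be pasted into the `--logging` argument as is.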
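Method 2: Enable when creating the cluster
If you are creating a new EKS cluster, you can enable it at creation time with eksctl. Flag support varies between eksctl releases, so confirm `--enable-container-insights` against `eksctl create cluster --help` for your version:

```bash
eksctl create cluster \
  --name my-cluster \
  --region ap-northeast-1 \
  --with-oidc \
  --managed \
  --enable-container-insights
```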
Configuring IAM permissions
For the CloudWatch Agent to send metrics to CloudWatch, attach the appropriate policy to the node IAM role:

```bash
# Attach the CloudWatchAgentServerPolicy managed policy
aws iam attach-role-policy \
  --role-name <node-role-name> \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```
Alternatively, you can create a custom IAM policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "ec2:DescribeVolumes",
        "ec2:DescribeTags",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:DescribeLogGroups",
        "logs:CreateLogStream",
        "logs:CreateLogGroup"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter"
      ],
      "Resource": "arn:aws:ssm:*:*:parameter/AmazonCloudWatch-*"
    }
  ]
}
```
3. Deploying the CloudWatch Agent and Fluent Bit
Container Insights uses the CloudWatch Agent to collect metrics and Fluent Bit to collect logs. AWS provides a quick start manifest that simplifies the deployment.
Using the quick start script

```bash
# Set environment variables
ClusterName=<your-cluster-name>
RegionName=<your-region>
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off' || FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'

# Download the CloudWatch Agent and Fluent Bit manifest, fill in the placeholders, and apply it
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
  sed "s/{{cluster_name}}/${ClusterName}/g;s/{{region_name}}/${RegionName}/g;s/{{http_server_toggle}}/${FluentBitHttpServer}/g;s/{{http_server_port}}/${FluentBitHttpPort}/g;s/{{read_from_head}}/${FluentBitReadFromHead}/g;s/{{read_from_tail}}/${FluentBitReadFromTail}/g" | \
  kubectl apply -f -
```
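The two `[[ ... ]]` lines in the script derive values rather than set them directly: tail reading is the inverse of head reading, and the HTTP server is switched on only when a port is given. The sketch below mirrors that logic and the `sed` substitution in Python. `render_quickstart` and the sample template string are illustrative assumptions, not part of the AWS manifest.

```python
def render_quickstart(template, cluster_name, region_name,
                      http_port="2020", read_from_head="Off"):
    """Fill the {{...}} placeholders the same way the sed command does."""
    # Tail reading is the inverse of head reading.
    read_from_tail = "Off" if read_from_head == "On" else "On"
    # The HTTP server is enabled only when a port is provided.
    http_server = "On" if http_port else "Off"
    values = {
        "cluster_name": cluster_name,
        "region_name": region_name,
        "http_server_toggle": http_server,
        "http_server_port": http_port,
        "read_from_head": read_from_head,
        "read_from_tail": read_from_tail,
    }
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template


# A made-up one-line template, just to show the substitution.
sample = "cluster={{cluster_name}} http={{http_server_toggle}} tail={{read_from_tail}}"
print(render_quickstart(sample, "my-cluster", "ap-northeast-1"))
```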
Deploying the CloudWatch Agent manually
If you need finer control, you can deploy each component by hand.
Create the Namespace

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch
  labels:
    name: amazon-cloudwatch
EOF
```
Create the ServiceAccount

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
EOF
```
Create the ClusterRole and ClusterRoleBinding

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloudwatch-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cwagent-clusterleader"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cloudwatch-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: cloudwatch-agent
    namespace: amazon-cloudwatch
roleRef:
  kind: ClusterRole
  name: cloudwatch-agent-role
  apiGroup: rbac.authorization.k8s.io
```
Create the ConfigMap
Replace `{{cluster_name}}` with your cluster name before applying.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "{{cluster_name}}",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
```
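If you template this ConfigMap from a script, the embedded JSON can be generated instead of edited by hand. The following is a minimal sketch that reproduces the structure shown above. `build_cwagent_config` is a hypothetical helper, and the placeholder check is an added safety assumption rather than agent behavior.

```python
import json


def build_cwagent_config(cluster_name, interval_seconds=60, force_flush_seconds=5):
    """Return the cwagentconfig.json content as a Python dict."""
    # Guard against applying the manifest with the placeholder still in it.
    if not cluster_name or "{{" in cluster_name:
        raise ValueError("cluster_name must be a concrete value, not a placeholder")
    return {
        "logs": {
            "metrics_collected": {
                "kubernetes": {
                    "cluster_name": cluster_name,
                    "metrics_collection_interval": interval_seconds,
                }
            },
            "force_flush_interval": force_flush_seconds,
        }
    }


print(json.dumps(build_cwagent_config("my-cluster"), indent=2))
```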
Deploy the DaemonSet

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      serviceAccountName: cloudwatch-agent
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247359.0b252275
          resources:
            limits:
              cpu: 400m
              memory: 400Mi
            requests:
              cpu: 200m
              memory: 200Mi
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.11"
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
```
Deploy Fluent Bit
Fluent Bit collects container logs and forwards them to CloudWatch Logs. The manifest below references a `fluent-bit` ServiceAccount and a `fluent-bit-config` ConfigMap, so both must already exist in the `amazon-cloudwatch` namespace.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
          imagePullPolicy: Always
          env:
            - name: AWS_REGION
              value: "ap-northeast-1"
            - name: CLUSTER_NAME
              value: "my-cluster"
            - name: HTTP_SERVER
              value: "On"
            - name: HTTP_PORT
              value: "2020"
            - name: READ_FROM_HEAD
              value: "Off"
            - name: READ_FROM_TAIL
              value: "On"
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: fluentbitstate
              mountPath: /var/fluent-bit/state
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: runlogjournal
              mountPath: /run/log/journal
              readOnly: true
            - name: dmesg
              mountPath: /var/log/dmesg
              readOnly: true
      volumes:
        - name: fluentbitstate
          hostPath:
            path: /var/fluent-bit/state
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
        - name: runlogjournal
          hostPath:
            path: /run/log/journal
        - name: dmesg
          hostPath:
            path: /var/log/dmesg
      terminationGracePeriodSeconds: 10
```
Verify the deployment

```bash
# Check Pod status
kubectl get pods -n amazon-cloudwatch

# Check the CloudWatch Agent logs
kubectl logs -n amazon-cloudwatch -l name=cloudwatch-agent

# Check the Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit
```
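4. Collected Metrics and Logs
Automatically collected metrics
Container Insights automatically collects the following categories of metrics:
Cluster-level metrics
| Metric | Description |
|---|---|
| cluster_failed_node_count | Number of failed nodes |
| cluster_node_count | Total number of nodes |
| namespace_number_of_running_pods | Number of running Pods |
Node-level metrics
| Metric | Description |
|---|---|
| node_cpu_utilization | Node CPU utilization |
| node_memory_utilization | Node memory utilization |
| node_network_total_bytes | Total network traffic in bytes |
| node_filesystem_utilization | File system utilization |
| node_number_of_running_pods | Number of Pods running on the node |
| node_number_of_running_containers | Number of containers running on the node |
Pod-level metrics
| Metric | Description |
|---|---|
| pod_cpu_utilization | Pod CPU utilization |
| pod_memory_utilization | Pod memory utilization |
| pod_network_rx_bytes | Bytes received by the Pod |
| pod_network_tx_bytes | Bytes transmitted by the Pod |
| pod_cpu_utilization_over_pod_limit | CPU usage as a percentage of the Pod limit |
| pod_memory_utilization_over_pod_limit | Memory usage as a percentage of the Pod limit |
Container-level metrics
| Metric | Description |
|---|---|
| container_cpu_utilization | Container CPU utilization |
| container_memory_utilization | Container memory utilization |
| container_filesystem_usage | Container file system usage |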
Log types
Container Insights collects the following types of logs:
- Application logs: logs from container stdout and stderr
- Performance logs: structured JSON logs containing detailed performance data
- Host logs: system-level logs, including dataplane logs
Log group structure
Logs are stored in the following CloudWatch Logs groups:

```text
/aws/containerinsights/<cluster-name>/application
/aws/containerinsights/<cluster-name>/dataplane
/aws/containerinsights/<cluster-name>/host
/aws/containerinsights/<cluster-name>/performance
```
Querying with CloudWatch Logs Insights
You can use CloudWatch Logs Insights to query and analyze the logs. Run each query below separately. The first three target the application log group, and the last one targets the performance log group.

```text
# Logs for a specific Pod
fields @timestamp, @message
| filter kubernetes.pod_name = "my-app-pod"
| sort @timestamp desc
| limit 100

# Error logs
fields @timestamp, @message, kubernetes.pod_name
| filter @message like /error|Error|ERROR/
| sort @timestamp desc
| limit 50

# Log count per namespace
stats count(*) as log_count by kubernetes.namespace_name
| sort log_count desc

# Pods with high CPU utilization
fields @timestamp, PodName, pod_cpu_utilization
| filter Type = "Pod"
| filter pod_cpu_utilization > 80
| sort @timestamp desc
```
5. Custom Metrics
Beyond the built-in metrics, you can configure custom metrics to monitor application-specific performance indicators.
Using Prometheus metrics
Container Insights can collect metrics from Prometheus endpoints. First, update the CloudWatch Agent configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "my-cluster",
            "metrics_collection_interval": 60
          },
          "prometheus": {
            "cluster_name": "my-cluster",
            "log_group_name": "/aws/containerinsights/my-cluster/prometheus",
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration_dedup": true,
              "metric_namespace": "ContainerInsights/Prometheus",
              "metric_unit": {
                "http_requests_total": "Count",
                "http_request_duration_seconds": "Seconds"
              },
              "metric_declaration": [
                {
                  "source_labels": ["job"],
                  "label_matcher": "^my-app$",
                  "dimensions": [["ClusterName", "Namespace", "Service"]],
                  "metric_selectors": [
                    "^http_requests_total$",
                    "^http_request_duration_seconds.*$"
                  ]
                }
              ]
            }
          }
        }
      }
    }
```
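The `metric_declaration` entry acts as a two-stage filter: a scraped series must first match `label_matcher` on its `source_labels`, and then its metric name must match one of the `metric_selectors`. The sketch below is a simplified model of that selection, assuming the label values are joined with `;` before matching. It is an illustration of the idea, not the agent's actual implementation, and `matches_declaration` is a hypothetical name.

```python
import re


def matches_declaration(declaration, labels, metric_name):
    """Return True if a scraped metric would be selected by the declaration."""
    # Assumption: values of source_labels are joined with ';' before matching.
    joined = ";".join(labels.get(name, "") for name in declaration["source_labels"])
    if not re.search(declaration["label_matcher"], joined):
        return False
    return any(re.search(sel, metric_name) for sel in declaration["metric_selectors"])


declaration = {
    "source_labels": ["job"],
    "label_matcher": "^my-app$",
    "metric_selectors": ["^http_requests_total$", "^http_request_duration_seconds.*$"],
}

print(matches_declaration(declaration, {"job": "my-app"}, "http_requests_total"))
print(matches_declaration(declaration, {"job": "other"}, "http_requests_total"))
```

With the configuration above, a histogram series such as `http_request_duration_seconds_bucket` is selected because of the trailing `.*`, while unrelated metrics from the same job are dropped.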
Create the Prometheus configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: amazon-cloudwatch
data:
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'my-app'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
```
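The last relabel rule is the least obvious one: it rewrites the scrape address so that the port comes from the `prometheus.io/port` annotation. Prometheus joins the two source label values with `;` and matches the regex against the whole string. The sketch below reproduces that rewrite in Python for illustration. `relabel_address` is a hypothetical helper, and the sample addresses are made up.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address, port_annotation):
    """Return the rewritten __address__, or the original when the rule does not match."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(relabel_address("10.0.1.5:3000", "8080"))
print(relabel_address("10.0.1.5", "8080"))
```

Both calls yield `10.0.1.5:8080`. When the annotation is missing, the rule does not match and the address is left unchanged.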
Application integration
Add a Prometheus metrics endpoint to your application. Here is a Python Flask example:

```python
from flask import Flask
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)


@app.route('/metrics')
def metrics():
    # Expose all registered metrics in the Prometheus text format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}


@app.route('/api/data')
def get_data():
    # Time the request and count it once it has been handled
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').time():
        REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
        return {'data': 'example'}


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```
Add Prometheus annotations to the Pod
The `selector` and the matching Pod label are required for an `apps/v1` Deployment. The `app: my-app` label is an example value.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080
```
Sending custom metrics to CloudWatch
You can also send custom metrics directly with the AWS SDK:

```python
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch', region_name='ap-northeast-1')


def put_custom_metric(namespace, metric_name, value, dimensions):
    """Publish a single Count datapoint to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                'MetricName': metric_name,
                'Dimensions': dimensions,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'Timestamp': datetime.now(timezone.utc),
                'Value': value,
                'Unit': 'Count'
            }
        ]
    )


# Usage example
put_custom_metric(
    namespace='MyApp/CustomMetrics',
    metric_name='OrdersProcessed',
    value=42,
    dimensions=[
        {'Name': 'Environment', 'Value': 'production'},
        {'Name': 'Service', 'Value': 'order-service'}
    ]
)
```
6. Dashboards and Visualization
Automatic dashboards
Once Container Insights is enabled, AWS creates default dashboards automatically. You can find them in the CloudWatch console:
- Open the CloudWatch console
- Choose "Container Insights" in the left-hand menu
- Select your cluster to view its dashboard
Creating a custom dashboard
You can build a custom dashboard with CloudWatch Dashboards:

```bash
aws cloudwatch put-dashboard \
  --dashboard-name "EKS-Custom-Dashboard" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["ContainerInsights", "pod_cpu_utilization", "ClusterName", "my-cluster", "Namespace", "default", {"stat": "Average"}]
          ],
          "title": "Pod CPU Utilization",
          "region": "ap-northeast-1",
          "period": 300
        }
      },
      {
        "type": "metric",
        "x": 12,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["ContainerInsights", "pod_memory_utilization", "ClusterName", "my-cluster", "Namespace", "default", {"stat": "Average"}]
          ],
          "title": "Pod Memory Utilization",
          "region": "ap-northeast-1",
          "period": 300
        }
      },
      {
        "type": "metric",
        "x": 0,
        "y": 6,
        "width": 24,
        "height": 6,
        "properties": {
          "metrics": [
            ["ContainerInsights", "node_cpu_utilization", "ClusterName", "my-cluster", {"stat": "Average"}],
            [".", "node_memory_utilization", ".", ".", {"stat": "Average"}]
          ],
          "title": "Node Resource Utilization",
          "region": "ap-northeast-1",
          "period": 300
        }
      }
    ]
  }'
```
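Hand-written dashboard JSON gets repetitive once there are more than a few widgets. As a sketch, assuming you generate the body from a script, the helpers below build the two Pod widgets from the command above and serialize them. `metric_widget` and `build_dashboard` are hypothetical names, and the defaults simply copy the example values.

```python
import json


def metric_widget(title, metrics, x, y, width=12, height=6,
                  region="ap-northeast-1", period=300):
    """Return one metric widget in the dashboard body format."""
    return {
        "type": "metric", "x": x, "y": y, "width": width, "height": height,
        "properties": {"metrics": metrics, "title": title,
                       "region": region, "period": period},
    }


def build_dashboard(cluster_name, namespace="default"):
    """Return a dashboard body (JSON string) with Pod CPU and memory widgets."""
    dims = ["ClusterName", cluster_name, "Namespace", namespace]
    stat = {"stat": "Average"}
    widgets = [
        metric_widget("Pod CPU Utilization",
                      [["ContainerInsights", "pod_cpu_utilization", *dims, stat]], 0, 0),
        metric_widget("Pod Memory Utilization",
                      [["ContainerInsights", "pod_memory_utilization", *dims, stat]], 12, 0),
    ]
    return json.dumps({"widgets": widgets})


print(build_dashboard("my-cluster"))
```

The resulting string can be passed as the `--dashboard-body` value.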
The same kind of dashboard can also be defined as a CloudFormation template:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Container Insights Dashboard
Parameters:
  ClusterName:
    Type: String
    Description: EKS Cluster Name
Resources:
  ContainerInsightsDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${ClusterName}-container-insights"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0,
              "y": 0,
              "width": 8,
              "height": 6,
              "properties": {
                "metrics": [
                  ["ContainerInsights", "cluster_node_count", "ClusterName", "${ClusterName}"]
                ],
                "title": "Node Count",
                "region": "${AWS::Region}",
                "stat": "Average",
                "period": 60
              }
            },
            {
              "type": "metric",
              "x": 8,
              "y": 0,
              "width": 8,
              "height": 6,
              "properties": {
                "metrics": [
                  ["ContainerInsights", "namespace_number_of_running_pods", "ClusterName", "${ClusterName}", "Namespace", "default"]
                ],
                "title": "Running Pods (default namespace)",
                "region": "${AWS::Region}",
                "stat": "Average",
                "period": 60
              }
            },
            {
              "type": "log",
              "x": 0,
              "y": 6,
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/containerinsights/${ClusterName}/application' | fields @timestamp, @message | sort @timestamp desc | limit 100",
                "region": "${AWS::Region}",
                "title": "Application Logs"
              }
            }
          ]
        }
```
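Integrating with Grafana
You can also bring Container Insights metrics into Grafana for visualization:
- Add a CloudWatch data source in Grafana
- Configure AWS credentials (an IAM role or access keys)
- Create a dashboard and select the ContainerInsights namespace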
7. Alarm Rules
Creating CloudWatch alarms
Set up alarms to watch key metrics and notify you when something goes wrong:

```bash
# High CPU utilization alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-High-CPU-Utilization" \
  --alarm-description "Alert when CPU utilization exceeds 80%" \
  --metric-name pod_cpu_utilization \
  --namespace ContainerInsights \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=ClusterName,Value=my-cluster \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts \
  --ok-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts

# High memory utilization alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-High-Memory-Utilization" \
  --alarm-description "Alert when memory utilization exceeds 85%" \
  --metric-name pod_memory_utilization \
  --namespace ContainerInsights \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=ClusterName,Value=my-cluster \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts

# Pod restart alarm
# pod_number_of_container_restarts is published per Pod, so the alarm also needs
# the Namespace and PodName dimensions (default and my-app are placeholder values)
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-Pod-Restart-Alert" \
  --alarm-description "Alert when pods restart frequently" \
  --metric-name pod_number_of_container_restarts \
  --namespace ContainerInsights \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=ClusterName,Value=my-cluster Name=Namespace,Value=default Name=PodName,Value=my-app \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts
```
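The interaction between `--period`, `--threshold`, and `--evaluation-periods` is easier to see with numbers. The sketch below is a deliberately simplified model of a `GreaterThanThreshold` alarm: it goes to ALARM only when the most recent N datapoints all breach the threshold. It ignores missing-data handling and the separate datapoints-to-alarm setting, and `alarm_state` is a hypothetical function, not a CloudWatch API.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return the state for a simplified GreaterThanThreshold alarm."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent evaluation_periods datapoints are considered.
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# One average per 300-second period, oldest first (made-up values).
cpu_averages = [70, 85, 90, 82]
print(alarm_state(cpu_averages, threshold=80, evaluation_periods=3))
```

With three evaluation periods of 300 seconds each, CPU has to stay above 80% for roughly 15 minutes before the first alarm in the example fires.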
The alarms can also be managed with a CloudFormation template:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Container Insights Alarms
Parameters:
  ClusterName:
    Type: String
  SNSTopicArn:
    Type: String
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-high-cpu"
      AlarmDescription: High CPU utilization detected
      MetricName: pod_cpu_utilization
      Namespace: ContainerInsights
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn
      OKActions:
        - !Ref SNSTopicArn
  HighMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-high-memory"
      AlarmDescription: High memory utilization detected
      MetricName: pod_memory_utilization
      Namespace: ContainerInsights
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn
  NodeCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${ClusterName}-node-count"
      AlarmDescription: Node count dropped below threshold
      MetricName: cluster_node_count
      Namespace: ContainerInsights
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      Dimensions:
        - Name: ClusterName
          Value: !Ref ClusterName
      AlarmActions:
        - !Ref SNSTopicArn
```
Composite alarms
Create a composite alarm to combine several conditions:

```bash
aws cloudwatch put-composite-alarm \
  --alarm-name "EKS-Critical-Resource-Alert" \
  --alarm-rule "ALARM(EKS-High-CPU-Utilization) AND ALARM(EKS-High-Memory-Utilization)" \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:critical-alerts \
  --alarm-description "Critical alert when both CPU and memory are high"
```
Log-based alarms
Use a metric filter to turn log events into a metric, then alarm on it:

```bash
# Create the metric filter
aws logs put-metric-filter \
  --log-group-name "/aws/containerinsights/my-cluster/application" \
  --filter-name "ErrorLogFilter" \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace=ContainerInsights/Logs,metricValue=1

# Create the alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-Application-Errors" \
  --alarm-description "Alert when application errors increase" \
  --metric-name ErrorCount \
  --namespace ContainerInsights/Logs \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:my-alerts
```
8. Cost Optimization
Container Insights incurs CloudWatch charges for metrics and logs. The following strategies help keep those costs down.
Log retention settings
Shortening the log retention period can reduce storage costs noticeably:

```bash
# Keep application logs for 7 days
aws logs put-retention-policy \
  --log-group-name "/aws/containerinsights/my-cluster/application" \
  --retention-in-days 7

# Keep performance logs for 3 days
aws logs put-retention-policy \
  --log-group-name "/aws/containerinsights/my-cluster/performance" \
  --retention-in-days 3
```
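Using S3 for long-term storage
Export logs to S3 to lower long-term storage costs. A Firehose destination also needs `--role-arn`, an IAM role that allows CloudWatch Logs to write to the delivery stream. The role name below is a placeholder.

```bash
# Create a subscription filter that streams logs to Kinesis Data Firehose, which delivers them to S3
aws logs put-subscription-filter \
  --log-group-name "/aws/containerinsights/my-cluster/application" \
  --filter-name "S3Export" \
  --filter-pattern "" \
  --destination-arn arn:aws:firehose:ap-northeast-1:123456789012:deliverystream/log-to-s3 \
  --role-arn arn:aws:iam::123456789012:role/<cwl-to-firehose-role>
```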
Adjusting the metric collection interval
Collecting metrics less often can lower costs:

```yaml
# Change the collection interval from 60 seconds to 300 seconds
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "my-cluster",
            "metrics_collection_interval": 300
          }
        }
      }
    }
```
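Filtering out unnecessary logs
Use Fluent Bit filters to drop logs you do not need. Nested fields such as the Kubernetes namespace are addressed with record accessor syntax:

```text
[FILTER]
    Name     grep
    Match    *
    Exclude  log health check
    Exclude  log kube-probe

[FILTER]
    Name     grep
    Match    *
    Exclude  $kubernetes['namespace_name'] kube-system
```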
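To check what such rules would drop before rolling them out, it can help to model them offline. The sketch below approximates the exclude behavior in Python: a record is dropped as soon as any rule's pattern matches the named field. It is an illustration only, not Fluent Bit code. The dotted key path is a Python convenience chosen here, and the sample records are invented.

```python
import re

# (field path, regex) pairs modeled on the exclude rules above.
EXCLUDE_RULES = [
    ("log", r"health check"),
    ("log", r"kube-probe"),
    ("kubernetes.namespace_name", r"kube-system"),
]


def lookup(record, dotted_key):
    """Walk a nested dict along a dotted path; return None if any part is missing."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def keep_record(record, rules=EXCLUDE_RULES):
    """Return False when any exclude rule matches the record."""
    for key, pattern in rules:
        value = lookup(record, key)
        if isinstance(value, str) and re.search(pattern, value):
            return False
    return True


records = [
    {"log": "GET /orders 200", "kubernetes": {"namespace_name": "default"}},
    {"log": "health check ok", "kubernetes": {"namespace_name": "default"}},
    {"log": "sync complete", "kubernetes": {"namespace_name": "kube-system"}},
]
print([r["log"] for r in records if keep_record(r)])
```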
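Cost estimation
The main cost drivers for Container Insights are:
| Item | Billing basis | Optimization |
|---|---|---|
| Custom metrics | Per metric per month | Reduce the number of custom metrics |
| Log ingestion | Per GB | Use filters to reduce log volume |
| Log storage | Per GB per month | Shorten the retention period |
| Log queries | Per GB scanned | Keep query scopes narrow |
| API requests | Per request | Reduce API call frequency |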
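A rough monthly estimate is just usage multiplied by unit price for each billed item. The sketch below shows the arithmetic only. The unit prices are made-up placeholders, not AWS rates, so substitute the current prices for your Region. `estimate_monthly_cost` is a hypothetical helper.

```python
def estimate_monthly_cost(usage, unit_prices):
    """Multiply each usage quantity by its unit price and add a total."""
    breakdown = {item: usage[item] * unit_prices[item] for item in usage}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder prices in USD. These are NOT real AWS rates.
unit_prices = {
    "custom_metrics": 0.5,      # per metric per month
    "log_ingestion_gb": 0.5,    # per GB ingested
    "log_storage_gb": 0.25,     # per GB stored per month
    "log_query_gb": 0.125,      # per GB scanned
}
# Example usage figures, also invented.
usage = {
    "custom_metrics": 200,
    "log_ingestion_gb": 50,
    "log_storage_gb": 40,
    "log_query_gb": 16,
}
print(estimate_monthly_cost(usage, unit_prices))
```

Running the same calculation before and after a change, such as a shorter retention period or a log filter, gives a quick sense of which optimization matters most for your workload.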
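Monitoring costs with AWS Cost Explorer
Review Container Insights related costs regularly. The `End` date is exclusive, so a full month ends on the first day of the next month. `AmazonCloudWatch` is the service name as Cost Explorer reports it. You can confirm the exact value with `aws ce get-dimension-values --dimension SERVICE`.

```bash
aws ce get-cost-and-usage \
  --time-period Start=2025-05-01,End=2025-06-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["AmazonCloudWatch"]
    }
  }' \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
```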
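Because Cost Explorer treats the `End` date of `--time-period` as exclusive, a full calendar month runs from its first day to the first day of the following month. A small helper avoids off-by-one mistakes when the query is scripted. `month_period` is a hypothetical name used for illustration.

```python
from datetime import date


def month_period(year, month):
    """Return a --time-period value that covers one full calendar month."""
    start = date(year, month, 1)
    # End is exclusive, so use the first day of the next month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"Start={start.isoformat()},End={end.isoformat()}"


print(month_period(2025, 5))
```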
Enabling cost alerts
Set up a budget alert to avoid unexpected overspending:

```bash
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "CloudWatch-Monthly",
    "BudgetLimit": {
      "Amount": "100",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "Service": ["AmazonCloudWatch"]
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "admin@example.com"
        }
      ]
    }
  ]'
```
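Conclusion
Amazon CloudWatch Container Insights offers a comprehensive container monitoring solution that gives you deep visibility into the performance and health of your EKS clusters. With the setup described in this article, you can:
- Deploy the CloudWatch Agent and Fluent Bit quickly to collect metrics and logs
- Use the prebuilt dashboards to get a fast view of system state
- Configure custom metrics to monitor application-specific performance
- Create alarm rules to detect and handle problems promptly
- Apply cost optimization strategies to control monitoring spend
Start with the basic setup, widen the monitoring scope as you learn more about your system, and review costs regularly to stay within budget.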