AWS ECS Service Connect Service Mesh

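Service Connect Overview

AWS ECS Service Connect is a native service mesh solution that Amazon ECS introduced in 2022 to simplify communication between microservices. It provides service discovery, load balancing, and observability without requiring additional infrastructure.

Core Features of Service Connect

  • Simplified service discovery: service registration and DNS resolution are handled automatically, with no manual configuration
  • Built-in load balancing: client-side load balancing, with no additional load balancer required
  • Unified observability: connection metrics are collected automatically and published to CloudWatch
  • Zero code changes: applications use the service mesh features without modification
  • Deep ECS integration: native support for ECS services, with simple and intuitive configuration

How It Works

Service Connect injects an Envoy Proxy sidecar container into each ECS task to handle all inbound and outbound traffic: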

┌─────────────────────────────────────────────────────────────┐
│                        ECS Task                             │
│  ┌──────────────────┐    ┌─────────────────────────────┐   │
│  │                  │    │                             │   │
│  │   Application    │◄───│      Envoy Proxy            │   │
│  │   Container      │    │      (Service Connect)      │   │
│  │                  │───►│                             │   │
│  └──────────────────┘    └─────────────────────────────┘   │
│                                      │                      │
└──────────────────────────────────────│──────────────────────┘
                          ┌────────────▼────────────┐
                          │   Cloud Map Namespace   │
                          │   (Service Discovery)   │
                          └─────────────────────────┘

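Comparison with App Mesh

Before Service Connect was introduced, AWS App Mesh was the main option for implementing a service mesh on AWS. The two compare as follows:

Feature Comparison Table

| Feature | Service Connect | App Mesh |
|---|---|---|
| Setup complexity | Low (native ECS integration) | High (requires additional resource configuration) |
| Service discovery | AWS Cloud Map | AWS Cloud Map |
| Proxy | Envoy (managed automatically) | Envoy (configured manually) |
| Traffic routing | Basic load balancing | Advanced routing rules |
| Retry policy | Built-in defaults | Fully customizable |
| Circuit breaker | Built-in | Fully customizable |
| mTLS | Not supported (at the time of writing) | Supported |
| Cross-account / cross-cluster | Not supported | Supported |
| Observability | CloudWatch integration | X-Ray, CloudWatch |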

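Recommendations

When to choose Service Connect:

  • You want to get service mesh features running quickly
  • Your main needs are service discovery and basic load balancing
  • All services run in the same ECS cluster
  • The team has little service mesh experience

When to choose App Mesh:

  • You need advanced traffic management (canary deployments, traffic splitting)
  • You need mTLS to encrypt traffic between services
  • Services communicate across clusters or accounts
  • You need to integrate with services running on EKS or EC2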

Migration Considerations

# List existing App Mesh resources
aws appmesh list-meshes

# List the virtual services in a specific mesh
aws appmesh list-virtual-services --mesh-name my-mesh

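When migrating from App Mesh to Service Connect, keep the following in mind:

  1. Service Connect does not currently support mTLS, so encryption requirements must be handled separately
  2. Advanced routing rules have to be implemented in the application layer
  3. You may need to maintain two sets of configuration during the migration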

Namespace Configuration

A Cloud Map namespace is the core component of Service Connect: every service registers in the namespace so that other services can discover it.

Creating a Cloud Map Namespace

# Create an HTTP namespace (recommended for Service Connect)
aws servicediscovery create-http-namespace \
  --name production \
  --description "Production services namespace"

# Show the namespace details
aws servicediscovery get-namespace \
  --id ns-xxxxxxxxxxxxxxxxx

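Namespace Type Comparison

| Type | Purpose | Service Connect support |
|---|---|---|
| HTTP namespace | Service discovery only | Fully supported |
| DNS private namespace | DNS resolution inside a VPC | Partially supported |
| DNS public namespace | Public DNS resolution | Not supported |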

Enabling Service Connect on an ECS Cluster

# Update the cluster to set a default Service Connect namespace
aws ecs update-cluster \
  --cluster my-cluster \
  --service-connect-defaults namespace=arn:aws:servicediscovery:ap-northeast-1:123456789012:namespace/ns-xxxxxxxxx

# Verify the cluster configuration
aws ecs describe-clusters \
  --clusters my-cluster \
  --include SETTINGS

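Creating a Namespace in the AWS Console

  1. Open the AWS Cloud Map console
  2. Click "Create namespace"
  3. Select "API calls" as the instance discovery method
  4. Enter a namespace name (for example: production)
  5. After the namespace is created, enable it in the ECS cluster settings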

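Server and Client Configuration

Service Connect distinguishes two roles: the server (which provides a service) and the client (which calls a service). A single service can play both roles at once.

Server Configuration (Server/Producer)

The server must define named portMappings in its task definition and supply a Service Connect configuration: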

{
  "family": "backend-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/backend-api:latest",
      "essential": true,
      "portMappings": [
        {
          "name": "api-port",
          "containerPort": 8080,
          "protocol": "tcp",
          "appProtocol": "http"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/backend-api",
          "awslogs-region": "ap-northeast-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Create the server-side ECS service:

aws ecs create-service \
  --cluster my-cluster \
  --service-name backend-api \
  --task-definition backend-api:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-private-1,subnet-private-2],
    securityGroups=[sg-ecs-tasks],
    assignPublicIp=DISABLED
  }" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "production",
    "services": [
      {
        "portName": "api-port",
        "discoveryName": "backend-api",
        "clientAliases": [
          {
            "port": 8080,
            "dnsName": "backend-api"
          }
        ]
      }
    ]
  }'
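
The JSON passed to --service-connect-configuration above can also be generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper named build_service_connect_config (it is not part of any AWS SDK); it only assembles the same structure shown above and falls back to the discovery name when no DNS alias is given.

```python
import json


def build_service_connect_config(namespace, port_name, discovery_name, port, dns_name=None):
    """Assemble a server-side serviceConnectConfiguration dict (hypothetical helper)."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {
        "enabled": True,
        "namespace": namespace,
        "services": [
            {
                # portName must match a named port mapping in the task definition
                "portName": port_name,
                "discoveryName": discovery_name,
                "clientAliases": [
                    {"port": port, "dnsName": dns_name or discovery_name}
                ],
            }
        ],
    }


def to_cli_argument(config):
    """Serialize the dict into a compact JSON string for the CLI flag."""
    return json.dumps(config, separators=(",", ":"))
```

Calling build_service_connect_config("production", "api-port", "backend-api", 8080) reproduces the configuration used in the command above.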

Client Configuration (Client/Consumer)

A client only needs to enable Service Connect to reach other services by name:

aws ecs create-service \
  --cluster my-cluster \
  --service-name frontend-web \
  --task-definition frontend-web:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-private-1,subnet-private-2],
    securityGroups=[sg-ecs-tasks],
    assignPublicIp=DISABLED
  }" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "production"
  }'

Service-to-Service Communication

With Service Connect enabled, a client can call other services directly by service name:

# Python example: calling backend-api from frontend-web
import requests

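# Use the Service Connect DNS name (the dnsName from clientAliases)
response = requests.get("http://backend-api:8080/api/users")

# A namespace-qualified name such as backend-api.production resolves only if
# a client alias with that dnsName is configured on the server-side service
response = requests.get("http://backend-api.production:8080/api/users")

// Node.js example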
const axios = require('axios');

async function fetchUsers() {
  // Service Connect resolves the service name automatically
  const response = await axios.get('http://backend-api:8080/api/users');
  return response.data;
}
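
Hard-coding the endpoint works, but reading it from the environment keeps the same image usable in other environments. The sketch below uses only the standard library and assumes a BACKEND_URL environment variable (the name used in the Terraform example later in this article); the function names are illustrative.

```python
import os
import urllib.request
from urllib.parse import urljoin


def backend_url(path, default_base="http://backend-api:8080"):
    """Build a URL for the backend from BACKEND_URL, falling back to the Service Connect alias."""
    base = os.environ.get("BACKEND_URL", default_base)
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))


def fetch_users(timeout_seconds=5.0):
    """Fetch the user list as raw bytes; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(backend_url("/api/users"), timeout=timeout_seconds) as resp:
        return resp.read()
```

The explicit timeout keeps a slow backend from blocking the caller indefinitely.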

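A Service as Both Client and Server

Many microservices both provide a service and call other services. Declaring a services entry publishes the service, while enabling Service Connect in the namespace lets it call the others: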

aws ecs create-service \
  --cluster my-cluster \
  --service-name order-service \
  --task-definition order-service:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-private-1,subnet-private-2],
    securityGroups=[sg-ecs-tasks]
  }" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "production",
    "services": [
      {
        "portName": "order-api",
        "discoveryName": "order-service",
        "clientAliases": [
          {
            "port": 8080,
            "dnsName": "order-service"
          }
        ]
      }
    ]
  }'

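Traffic Management and Load Balancing

Service Connect provides built-in client-side load balancing, using Envoy Proxy to distribute traffic.

Load Balancing Strategy

By default, Service Connect uses a round-robin strategy that spreads requests evenly across all healthy backend instances: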

                                    ┌──────────────┐
                                ┌──►│  Task 1      │
                                │   │  10.0.1.10   │
┌──────────────┐   ┌─────────┐  │   └──────────────┘
│   Client     │──►│  Envoy  │──┤
│   Service    │   │  Proxy  │  │   ┌──────────────┐
└──────────────┘   └─────────┘  ├──►│  Task 2      │
                                │   │  10.0.1.11   │
                                │   └──────────────┘
                                │   ┌──────────────┐
                                └──►│  Task 3      │
                                    │  10.0.1.12   │
                                    └──────────────┘
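
To make the diagram concrete, here is a small simulation of round-robin selection over healthy endpoints. It is only an illustration of the idea, not Envoy's implementation; the IP addresses come from the diagram above and the function name is hypothetical.

```python
from itertools import cycle


def round_robin(endpoints, healthy):
    """Return an endless iterator that rotates through the healthy endpoints in order."""
    candidates = [endpoint for endpoint in endpoints if endpoint in healthy]
    if not candidates:
        raise RuntimeError("no healthy endpoints available")
    return cycle(candidates)


tasks = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

# Task 2 is treated as unhealthy, so requests alternate between Task 1 and Task 3
picker = round_robin(tasks, healthy={"10.0.1.10", "10.0.1.12"})
first_four = [next(picker) for _ in range(4)]
```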

Health Checks and Automatic Failover

Service Connect automatically shifts traffic away from unhealthy instances:

{
  "containerDefinitions": [
    {
      "name": "api",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 10,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 30
      }
    }
  ]
}
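
The health check above expects the application to answer on /health. A minimal sketch of such an endpoint using only the Python standard library is shown below; the handler, the serve function, and the dependencies_ok flag are illustrative assumptions, and a real service would plug in its own readiness logic.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(dependencies_ok):
    """Map readiness to a status and body; curl -f treats status codes >= 400 as failure."""
    return (200, b"ok") if dependencies_ok else (503, b"unavailable")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # Placeholder: a real service would check its own dependencies here
        status, body = health_status(dependencies_ok=True)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the health endpoint until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the endpoint cheap matters because it is called every interval seconds on every task.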

Timeout Settings

Connection timeouts are set in the Service Connect configuration of the ECS service (not in the task definition):

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "production",
    "services": [
      {
        "portName": "api-port",
        "discoveryName": "backend-api",
        "timeout": {
          "idleTimeoutSeconds": 300,
          "perRequestTimeoutSeconds": 30
        },
        "clientAliases": [
          {
            "port": 8080,
            "dnsName": "backend-api"
          }
        ]
      }
    ]
  }
}
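
One way to use these values on the calling side is to set the client's own timeout slightly below perRequestTimeoutSeconds, so the application fails with its own, easier to handle, timeout error before the proxy cuts the request. This is a design suggestion rather than something Service Connect requires, and the helper below is hypothetical.

```python
def client_timeout(per_request_timeout_seconds, margin_seconds=1.0):
    """Return a client-side timeout that expires shortly before the proxy's per-request timeout."""
    if per_request_timeout_seconds <= margin_seconds:
        raise ValueError("per-request timeout must be larger than the safety margin")
    return per_request_timeout_seconds - margin_seconds


# With the 30-second proxy timeout configured above, the client would use 29 seconds
timeout_for_backend = client_timeout(30)
```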

Retry Mechanism

Service Connect has a built-in retry mechanism for transient failures:

# Inspect the Service Connect configuration
aws ecs describe-services \
  --cluster my-cluster \
  --services backend-api \
  --query 'services[0].deployments[0].serviceConnectConfiguration'
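
The proxy's retries are not configurable here, so callers that need more control sometimes add their own retry at the application level. The sketch below shows one common pattern, exponential backoff with jitter; it is independent of Service Connect, and the function name and defaults are assumptions. It should wrap idempotent calls only.

```python
import random
import time


def call_with_retry(func, attempts=3, base_delay=0.2,
                    retriable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying retriable errors with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads retries out so callers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
```

The sleep parameter is injectable so the function can be exercised in tests without real delays.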

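Connection Pool Management

Envoy Proxy manages connection pools automatically to optimize service-to-service performance:

  • HTTP/1.1: keeps connections persistent to reduce connection setup overhead
  • HTTP/2: supports multiplexing, so a single connection carries multiple requests
  • Automatic reconnection: re-establishes connections when they drop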

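Monitoring and Observability

Service Connect automatically collects detailed metrics and publishes them to Amazon CloudWatch.

Automatically Collected Metrics

Service Connect generates the following CloudWatch metrics automatically:

| Metric name | Description | Dimensions |
|---|---|---|
| RequestCount | Total number of requests | ServiceName, TargetService |
| RequestCountPerTarget | Requests per target | ServiceName, TargetService, TargetIP |
| ActiveConnectionCount | Active connections | ServiceName |
| NewConnectionCount | Newly established connections | ServiceName |
| ProcessedBytes | Bytes processed | ServiceName |
| TargetResponseTime | Target response time | ServiceName, TargetService |
| HTTPCode_Target_2XX_Count | Number of 2XX responses | ServiceName |
| HTTPCode_Target_4XX_Count | Number of 4XX responses | ServiceName |
| HTTPCode_Target_5XX_Count | Number of 5XX responses | ServiceName |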
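
The raw counters above are usually combined into a ratio before they are useful. As a rough sketch, the 5XX error rate for one period can be derived from the three response-code counters; the function name is hypothetical, and the calculation ignores any responses outside the 2XX, 4XX, and 5XX classes.

```python
def error_rate(count_2xx, count_4xx, count_5xx):
    """Return the share of 5XX responses among all counted responses (0.0 when idle)."""
    total = count_2xx + count_4xx + count_5xx
    if total == 0:
        return 0.0
    return count_5xx / total


# Example period: 90 successful, 5 client errors, 5 server errors
rate = error_rate(90, 5, 5)
```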

Creating a CloudWatch Dashboard

# Create a dashboard that shows Service Connect metrics
aws cloudwatch put-dashboard \
  --dashboard-name "ECS-Service-Connect-Dashboard" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "title": "Request Count by Service",
          "metrics": [
            ["AWS/ECS", "RequestCount", "ClusterName", "my-cluster", "ServiceName", "backend-api"],
            ["...", "frontend-web"],
            ["...", "order-service"]
          ],
          "period": 60,
          "stat": "Sum"
        }
      },
      {
        "type": "metric",
        "x": 12,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "title": "Response Time",
          "metrics": [
            ["AWS/ECS", "TargetResponseTime", "ClusterName", "my-cluster", "ServiceName", "backend-api", {"stat": "p99"}],
            ["...", {"stat": "p50"}]
          ],
          "period": 60
        }
      },
      {
        "type": "metric",
        "x": 0,
        "y": 6,
        "width": 12,
        "height": 6,
        "properties": {
          "title": "Error Rate",
          "metrics": [
            ["AWS/ECS", "HTTPCode_Target_5XX_Count", "ClusterName", "my-cluster", "ServiceName", "backend-api"],
            [".", "HTTPCode_Target_4XX_Count", ".", ".", ".", "."]
          ],
          "period": 60,
          "stat": "Sum"
        }
      }
    ]
  }'

Configuring CloudWatch Alarms

# Create a high error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "ServiceConnect-HighErrorRate" \
  --alarm-description "Service Connect 5XX error rate exceeded threshold" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ECS \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=backend-api \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:alerts

# Create a latency alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "ServiceConnect-HighLatency" \
  --alarm-description "Service Connect response time exceeded threshold" \
  --metric-name TargetResponseTime \
  --namespace AWS/ECS \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=backend-api \
  --extended-statistic p99 \
  --period 60 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:alerts
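
Both alarms combine a threshold with --evaluation-periods 3, meaning the metric has to breach the threshold in three consecutive periods. The following sketch mimics that rule on a list of datapoints to show when the alarm would fire. It is a simplification that ignores CloudWatch's missing-data handling, and the function name is illustrative.

```python
def alarm_fires(datapoints, threshold, evaluation_periods):
    """Return True when the most recent evaluation_periods datapoints all exceed the threshold."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    # GreaterThanThreshold: a value equal to the threshold does not breach
    return all(value > threshold for value in recent)


# 5XX counts per minute: the last three periods are all above 10
breaching = alarm_fires([3, 12, 15, 11], threshold=10, evaluation_periods=3)
```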

Integrating AWS X-Ray

Service Connect does not support X-Ray directly, but you can add the X-Ray SDK to the application to get distributed tracing. Containers in the same awsvpc task share a network namespace, so the application reaches the daemon sidecar on 127.0.0.1:

{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "my-app:latest",
      "environment": [
        {
          "name": "AWS_XRAY_DAEMON_ADDRESS",
          "value": "127.0.0.1:2000"
        }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
      "portMappings": [
        {
          "containerPort": 2000,
          "protocol": "udp"
        }
      ]
    }
  ]
}

Log Analysis

# Query Service Connect related logs with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /ecs/backend-api \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message
    | filter @message like /connection|upstream|downstream/
    | sort @timestamp desc
    | limit 100
  '

# Fetch the query results
aws logs get-query-results --query-id <query-id>
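
The same filter can be applied locally to log lines that have already been downloaded, which is handy when iterating on the pattern. This sketch reuses the regular expression from the query above; the sample log lines are invented for illustration.

```python
import re

# Same alternation as the Logs Insights filter above (case-sensitive substring match)
PATTERN = re.compile(r"connection|upstream|downstream")


def matching_lines(lines):
    """Return the log lines that mention connection, upstream, or downstream."""
    return [line for line in lines if PATTERN.search(line)]


sample = [
    "upstream connect error or disconnect/reset before headers",  # invented sample line
    "GET /api/users 200",                                          # invented sample line
    "downstream request timed out",                                # invented sample line
]
hits = matching_lines(sample)
```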

Terraform Deployment Example

The following example deploys a complete Service Connect architecture with Terraform.

Project Structure

terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── vpc.tf
├── ecs.tf
├── service-connect.tf
└── services/
    ├── backend-api.tf
    └── frontend-web.tf

Infrastructure Configuration (main.tf)

terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# Variable definitions
variable "aws_region" {
  description = "AWS Region"
  default     = "ap-northeast-1"
}

variable "environment" {
  description = "Environment name"
  default     = "production"
}

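variable "project_name" {
  description = "Project name"
  default     = "myapp"
}

# Referenced by the task definitions as var.ecr_repository_url; it has no default and must be supplied
variable "ecr_repository_url" {
  description = "ECR registry URL prefix for the service images"
  type        = string
}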

VPC and Network Configuration (vpc.tf)

# VPC
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-vpc"
  }
}

# Private subnets
resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "${var.project_name}-private-${count.index + 1}"
  }
}

# Public subnets
resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index + 101}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-public-${count.index + 1}"
  }
}

# Internet Gateway
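resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}-igw"
  }
}

# Public route table: routes the public subnets (and the NAT Gateway in them) through the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}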

# NAT Gateway
resource "aws_eip" "nat" {
  domain = "vpc"

  tags = {
    Name = "${var.project_name}-nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = {
    Name = "${var.project_name}-nat"
  }
}

# Private route table
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }

  tags = {
    Name = "${var.project_name}-private-rt"
  }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

data "aws_availability_zones" "available" {
  state = "available"
}

Cloud Map Namespace and ECS Cluster (ecs.tf)

# Cloud Map Namespace
resource "aws_service_discovery_http_namespace" "main" {
  name        = var.environment
  description = "${var.environment} services namespace for Service Connect"

  tags = {
    Environment = var.environment
  }
}

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  # Set the default Service Connect namespace
  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.main.arn
  }

  tags = {
    Environment = var.environment
  }
}

# ECS cluster capacity providers
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    base              = 1
    weight            = 100
    capacity_provider = "FARGATE"
  }
}

# ECS task execution role
resource "aws_iam_role" "ecs_task_execution" {
  name = "${var.project_name}-ecs-task-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# ECS task role
resource "aws_iam_role" "ecs_task" {
  name = "${var.project_name}-ecs-task"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

# Security group
resource "aws_security_group" "ecs_tasks" {
  name        = "${var.project_name}-ecs-tasks-sg"
  description = "Security group for ECS tasks"
  vpc_id      = aws_vpc.main.id

  # Allow traffic between tasks
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.project_name}-ecs-tasks-sg"
  }
}

Backend API Service (services/backend-api.tf)

# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "backend_api" {
  name              = "/ecs/${var.project_name}/backend-api"
  retention_in_days = 30

  tags = {
    Service = "backend-api"
  }
}

# Task Definition
resource "aws_ecs_task_definition" "backend_api" {
  family                   = "${var.project_name}-backend-api"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name      = "api"
      image     = "${var.ecr_repository_url}/backend-api:latest"
      essential = true

      portMappings = [
        {
          name          = "api-port"
          containerPort = 8080
          protocol      = "tcp"
          appProtocol   = "http"
        }
      ]

      environment = [
        {
          name  = "NODE_ENV"
          value = var.environment
        },
        {
          name  = "PORT"
          value = "8080"
        }
      ]

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.backend_api.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])

  tags = {
    Service = "backend-api"
  }
}

# ECS Service with Service Connect
resource "aws_ecs_service" "backend_api" {
  name            = "backend-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.backend_api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  # Service Connect configuration - server role
  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.main.arn

    service {
      port_name      = "api-port"
      discovery_name = "backend-api"

      client_alias {
        port     = 8080
        dns_name = "backend-api"
      }

      timeout {
        idle_timeout_seconds        = 300
        per_request_timeout_seconds = 30
      }
    }

    log_configuration {
      log_driver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.backend_api.name
        "awslogs-region"        = var.aws_region
        "awslogs-stream-prefix" = "service-connect"
      }
    }
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  tags = {
    Service = "backend-api"
  }
}

Frontend Web Service (services/frontend-web.tf)

# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "frontend_web" {
  name              = "/ecs/${var.project_name}/frontend-web"
  retention_in_days = 30

  tags = {
    Service = "frontend-web"
  }
}

# Task Definition
resource "aws_ecs_task_definition" "frontend_web" {
  family                   = "${var.project_name}-frontend-web"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name      = "web"
      image     = "${var.ecr_repository_url}/frontend-web:latest"
      essential = true

      portMappings = [
        {
          name          = "web-port"
          containerPort = 3000
          protocol      = "tcp"
          appProtocol   = "http"
        }
      ]

      environment = [
        {
          name  = "BACKEND_URL"
          value = "http://backend-api:8080"  # Service Connect DNS name
        }
      ]

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 30
      }

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.frontend_web.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])

  tags = {
    Service = "frontend-web"
  }
}

# ECS Service with Service Connect - client only
resource "aws_ecs_service" "frontend_web" {
  name            = "frontend-web"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.frontend_web.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  # Service Connect configuration - client role
  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.main.arn
    # No service block is needed because this service acts only as a client

    log_configuration {
      log_driver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.frontend_web.name
        "awslogs-region"        = var.aws_region
        "awslogs-stream-prefix" = "service-connect"
      }
    }
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Optional ALB integration (aws_lb_target_group.frontend must be defined elsewhere)
  load_balancer {
    target_group_arn = aws_lb_target_group.frontend.arn
    container_name   = "web"
    container_port   = 3000
  }

  tags = {
    Service = "frontend-web"
  }

  depends_on = [aws_ecs_service.backend_api]
}

Output Configuration (outputs.tf)

output "cluster_name" {
  description = "ECS Cluster name"
  value       = aws_ecs_cluster.main.name
}

output "namespace_arn" {
  description = "Cloud Map namespace ARN"
  value       = aws_service_discovery_http_namespace.main.arn
}

output "namespace_name" {
  description = "Cloud Map namespace name"
  value       = aws_service_discovery_http_namespace.main.name
}

output "backend_api_service_name" {
  description = "Backend API service name"
  value       = aws_ecs_service.backend_api.name
}

output "frontend_web_service_name" {
  description = "Frontend Web service name"
  value       = aws_ecs_service.frontend_web.name
}

Deployment Commands

# Initialize Terraform
terraform init

# Review the execution plan
terraform plan -out=tfplan

# Apply the changes
terraform apply tfplan

# Verify the deployment
aws ecs describe-services \
  --cluster myapp-cluster \
  --services backend-api frontend-web \
  --query 'services[*].{Name:serviceName,Status:status,Running:runningCount}'
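
For use in a pipeline, the verification step can be turned into a pass/fail check. The sketch below parses the unfiltered JSON from aws ecs describe-services (run without --query, with --output json) and lists services that are not ACTIVE or have fewer running tasks than desired. The function name is hypothetical, and the field names assumed are serviceName, status, runningCount, and desiredCount.

```python
import json


def unstable_services(describe_output):
    """Return names of services that are not ACTIVE or are running fewer tasks than desired."""
    data = json.loads(describe_output)
    return [
        service["serviceName"]
        for service in data.get("services", [])
        if service["status"] != "ACTIVE"
        or service["runningCount"] < service["desiredCount"]
    ]


# Trimmed, invented sample in the shape of the describe-services response
sample_output = json.dumps({
    "services": [
        {"serviceName": "backend-api", "status": "ACTIVE", "runningCount": 3, "desiredCount": 3},
        {"serviceName": "frontend-web", "status": "ACTIVE", "runningCount": 1, "desiredCount": 2},
    ]
})
problems = unstable_services(sample_output)
```

An empty list means every service has reached its desired count.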

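Troubleshooting and Best Practices

Common Problems and Solutions

1. Services Cannot Connect to Each Other

Symptom: a client service cannot reach other services by service name

Diagnostic steps

# Check that the Service Connect configuration is correct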
aws ecs describe-services \
  --cluster my-cluster \
  --services backend-api \
  --query 'services[0].deployments[0].serviceConnectConfiguration'

# List the services registered in the namespace
aws servicediscovery list-services \
  --filters Name=NAMESPACE_ID,Values=ns-xxxxxxxxx

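# Check the state of the Envoy Proxy container in the task
# (the Service Connect container name is assumed to start with "ecs-service-connect")
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-id> \
  --query "tasks[0].containers[?starts_with(name, 'ecs-service-connect')]"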

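Solutions

  • Confirm both services are in the same namespace
  • Verify that the name attribute in portMappings matches the portName in the Service Connect configuration
  • Check that the security groups allow traffic between the services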
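
The portName mismatch in the list above is easy to catch before deploying. The sketch below compares a task definition with a Service Connect configuration, both as plain dictionaries, and reports any portName that has no matching named port mapping; the function name is hypothetical.

```python
def unmatched_port_names(task_definition, service_connect_config):
    """Return portName values that do not match any named port mapping in the task definition."""
    declared = {
        mapping["name"]
        for container in task_definition.get("containerDefinitions", [])
        for mapping in container.get("portMappings", [])
        if "name" in mapping
    }
    return [
        service["portName"]
        for service in service_connect_config.get("services", [])
        if service["portName"] not in declared
    ]


task_definition = {
    "containerDefinitions": [
        {"name": "api", "portMappings": [{"name": "api-port", "containerPort": 8080}]}
    ]
}
config = {"services": [{"portName": "order-api", "discoveryName": "order-service"}]}

# "order-api" is not declared in the task definition, so it is reported
missing = unmatched_port_names(task_definition, config)
```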

2. Envoy Proxy Sidecar Fails to Start

Symptom: the task starts but fails quickly, and the container logs show Envoy-related errors

Diagnostic steps

# Check why the task stopped
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-id> \
  --query 'tasks[0].stoppedReason'

# View the Envoy Proxy logs
aws logs filter-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name-prefix "service-connect" \
  --start-time $(date -d '1 hour ago' +%s)000

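Solutions

  • Confirm the portMappings in the task definition use the correct appProtocol (http or grpc)
  • Check that the ECS task execution role has sufficient permissions
  • Confirm the container's health check endpoint is reachable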

3. High Latency

Symptom: service-to-service latency is noticeably higher than expected

Diagnostic steps

# Query the CloudWatch latency metric
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name TargetResponseTime \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=backend-api \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --extended-statistics p99

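Solutions

  • Check the target service's resource usage (CPU, memory)
  • Consider increasing the number of tasks for the target service
  • Look for performance problems in the application itself
  • Adjust timeout settings to match actual needs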

4. Slow Service Discovery Updates

Symptom: newly deployed tasks take a long time to start receiving traffic

Solutions

{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 10,
    "timeout": 5,
    "retries": 2,
    "startPeriod": 30
  }
}

  • Shorten the health check interval
  • Reduce startPeriod so that health is confirmed sooner
  • Make sure the health check endpoint responds quickly

Best Practices

1. Naming Conventions

# Use consistent naming conventions
# Namespace: use the environment name
production
staging
development

# Service names: use kebab-case
user-service
order-service
payment-gateway

# DNS aliases: keep them short and meaningful
users
orders
payments
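
A convention is easier to keep when it is checked automatically. The sketch below validates the kebab-case rule with a regular expression; the exact pattern (lowercase letters and digits separated by single hyphens) is one reasonable reading of the convention, not a rule imposed by ECS or Cloud Map.

```python
import re

# Lowercase alphanumeric words joined by single hyphens, e.g. "order-service"
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(name):
    """Return True when the whole name follows the kebab-case convention."""
    return KEBAB_CASE.fullmatch(name) is not None


names = ["user-service", "order-service", "payment-gateway", "User_Service"]
invalid = [name for name in names if not is_kebab_case(name)]
```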

2. Resource Allocation Recommendations

# Reserve extra resources for Envoy Proxy
resource "aws_ecs_task_definition" "api" {
  cpu    = 512  # roughly 64-128 of these CPU units go to Envoy
  memory = 1024 # roughly 128-256 MiB of this goes to Envoy

  # Use an explicit appProtocol
  container_definitions = jsonencode([
    {
      portMappings = [
        {
          name          = "api-http"
          containerPort = 8080
          appProtocol   = "http"  # or "grpc"
        }
      ]
    }
  ])
}
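
The reservation in the comments above translates into a simple budget calculation. This sketch subtracts the upper end of the quoted Envoy ranges (128 CPU units, 256 MiB) from the task size to estimate what remains for the application; those figures come from the comments above and are estimates, and the function name is hypothetical.

```python
def app_budget(task_cpu, task_memory, proxy_cpu=128, proxy_memory=256):
    """Return (cpu_units, memory_mib) left for the application after the proxy's share."""
    cpu_left = task_cpu - proxy_cpu
    memory_left = task_memory - proxy_memory
    if cpu_left <= 0 or memory_left <= 0:
        raise ValueError("task size is too small to host both the proxy and the application")
    return cpu_left, memory_left


# For the 512 CPU unit / 1024 MiB task above, about 384 units and 768 MiB remain
cpu_for_app, memory_for_app = app_budget(512, 1024)
```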

3. Security Configuration

# Security group following the principle of least privilege
resource "aws_security_group" "ecs_tasks" {
  # Allow only the ports needed for service-to-service traffic
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.allowed_services.id]
  }

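  # Restrict outbound traffic (client tasks also need egress on the ports of the services they call, such as 8080)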
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS for AWS services"
  }
}

4. Observability Configuration

# Enable detailed logging for Service Connect
service_connect_configuration {
  enabled = true

  log_configuration {
    log_driver = "awslogs"
    options = {
      "awslogs-group"         = "/ecs/service-connect"
      "awslogs-region"        = var.aws_region
      "awslogs-stream-prefix" = "envoy"
    }
  }
}

5. Deployment Strategy

# Rolling deployment with the deployment circuit breaker and automatic rollback
deployment_maximum_percent         = 200
deployment_minimum_healthy_percent = 100

deployment_circuit_breaker {
  enable   = true
  rollback = true
}

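Performance Tuning Recommendations

  1. Connection pooling: the Service Connect Envoy Proxy manages its connection pool automatically, but the application should also be configured to reuse HTTP connections

  2. Health check interval: tune the frequency to the service's characteristics so that overly frequent checks do not waste resources

  3. Timeouts: set timeout values based on the service's actual response times

  4. Resource monitoring: review CloudWatch metrics regularly to identify bottlenecks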

