AWS EC2 Spot Instance Cost Optimization Strategies

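Introduction

In cloud computing, cost control is a perennial concern for businesses. AWS EC2 Spot Instances let you use EC2 compute capacity at discounts of up to 90% off On-Demand prices. This article looks at how Spot Instances work, the best practices for using them, and how to integrate them effectively into your cloud architecture.

Used correctly, Spot Instances can cut compute costs substantially while preserving the availability and performance of your workloads.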


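Spot Instance Concepts and How They Work

What Is a Spot Instance?

A Spot Instance is an EC2 purchasing option that lets you use spare compute capacity in the AWS cloud. Spot Instances run on the same hardware as On-Demand Instances, but the price can be as low as 10% of the On-Demand price.

How It Works

Spot Instances operate on a supply-and-demand market mechanism:

  1. Capacity pool: AWS groups the unused capacity of the same instance type in each Availability Zone into a capacity pool
  2. Spot price: The price adjusts gradually according to long-term supply and demand trends, and no longer fluctuates sharply as it did in the early days
  3. Capacity interruption: When AWS needs to reclaim capacity, it issues a two-minute interruption notice

Spot Instance Lifecycle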

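Request a Spot Instance
        ↓
  Capacity available? ──No──→ Request stays pending
        ↓ Yes
  Instance launches
        ↓
  Running normally
        ↓
  AWS needs to reclaim capacity? ──No──→ Keeps running
        ↓ Yes
  Two-minute interruption notice is sent
        ↓
  Instance is terminated / stopped / hibernated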

Querying Spot Price History

# Query the Spot price history for a specific instance type (the -d option requires GNU date)
aws ec2 describe-spot-price-history \
    --instance-types m5.large \
    --product-descriptions "Linux/UNIX" \
    --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" -d "7 days ago") \
    --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
    --region ap-northeast-1

# Query the current Spot price for several instance types
aws ec2 describe-spot-price-history \
    --instance-types m5.large m5.xlarge c5.large c5.xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
    --region ap-northeast-1 \
    --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice]' \
    --output table

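Comparison with On-Demand and Reserved Instances

Comparison of the Three Pricing Models

| Feature | On-Demand Instance | Reserved Instance | Spot Instance |
|---|---|---|---|
| Discount | None (baseline price) | Up to 72% | Up to 90% |
| Commitment term | None | 1 or 3 years | None |
| Availability | Not interrupted | Not interrupted | Can be interrupted |
| Best suited for | Short-term, irregular workloads | Steady, predictable workloads | Fault-tolerant, flexible workloads |
| Payment | Per-second billing | All upfront / partial upfront / no upfront | Per-second billing |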

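Cost Comparison Example

Taking an m5.xlarge instance in the ap-northeast-1 region as an example (prices are illustrative):

| Pricing model | Hourly price (USD) | Monthly cost (730 hours) | Relative savings |
|---|---|---|---|
| On-Demand | $0.248 | $181.04 | - |
| Reserved (1-year, no upfront) | $0.156 | $113.88 | 37% |
| Reserved (3-year, all upfront) | $0.099 | $72.27 | 60% |
| Spot | $0.050-0.080 | $36.50-58.40 | 68-80% |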
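
The monthly cost and savings columns follow directly from the hourly price. The sketch below reproduces that arithmetic; the function names are hypothetical, and the prices are the illustrative figures from the table rather than current AWS prices.

```python
HOURS_PER_MONTH = 730  # the month length used in the table


def monthly_cost(hourly_price: float, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD for an instance that runs the whole month."""
    return round(hourly_price * hours, 2)


def savings_percent(hourly_price: float, on_demand_price: float) -> int:
    """Savings relative to the On-Demand price, as a whole percentage."""
    return round((1 - hourly_price / on_demand_price) * 100)


if __name__ == "__main__":
    on_demand = 0.248
    for label, price in [
        ("Reserved (1-year, no upfront)", 0.156),
        ("Reserved (3-year, all upfront)", 0.099),
        ("Spot (low end)", 0.050),
        ("Spot (high end)", 0.080),
    ]:
        print(f"{label}: ${monthly_cost(price):.2f}/month, "
              f"{savings_percent(price, on_demand)}% savings")
```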

Selection Guidance

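Workload assessment
        ├── Can it tolerate interruption?
        │       │
        │      No ──→ Use On-Demand or Reserved Instances
        │       │
        │      Yes
        │       ↓
        ├── Is the workload steady and continuous?
        │       │
        │      Yes ──→ Mix Reserved + Spot
        │       │
        │      No ──→ Use Spot Instances
        └── Must be available immediately and cannot be interrupted? ──→ Use On-Demand Instances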
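
The decision tree can also be written as a small function, which is handy when tagging many workloads at once. This is a minimal sketch: the function name and the returned labels are hypothetical, and it encodes only the two questions in the tree.

```python
def recommend_purchase_option(tolerates_interruption: bool,
                              steady_and_continuous: bool) -> str:
    """Map the two questions in the decision tree to a purchasing option."""
    if not tolerates_interruption:
        # Interruption is unacceptable: stay on guaranteed capacity
        return "On-Demand or Reserved"
    if steady_and_continuous:
        # A steady baseline on Reserved, with Spot on top
        return "Reserved + Spot"
    return "Spot"


if __name__ == "__main__":
    print(recommend_purchase_option(False, True))
    print(recommend_purchase_option(True, True))
    print(recommend_purchase_option(True, False))
```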

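Spot Fleet Configuration and Management

What Is Spot Fleet?

A Spot Fleet is a collection of instances that automatically maintains the Spot Instance capacity you specify. It supports multiple instance types and Availability Zones, which improves your chances of obtaining capacity.

Creating a Spot Fleet Request

Creating a Spot Fleet with the AWS CLI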

# Create the Spot Fleet configuration file
cat > spot-fleet-config.json << 'EOF'
{
    "SpotPrice": "0.10",
    "TargetCapacity": 10,
    "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
    "LaunchSpecifications": [
        {
            "ImageId": "ami-0abcdef1234567890",
            "InstanceType": "m5.large",
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroups": [
                {
                    "GroupId": "sg-0123456789abcdef0"
                }
            ],
            "KeyName": "my-key-pair"
        },
        {
            "ImageId": "ami-0abcdef1234567890",
            "InstanceType": "m5.xlarge",
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroups": [
                {
                    "GroupId": "sg-0123456789abcdef0"
                }
            ],
            "KeyName": "my-key-pair"
        },
        {
            "ImageId": "ami-0abcdef1234567890",
            "InstanceType": "c5.large",
            "SubnetId": "subnet-abcdef0123456789a",
            "SecurityGroups": [
                {
                    "GroupId": "sg-0123456789abcdef0"
                }
            ],
            "KeyName": "my-key-pair"
        }
    ],
    "AllocationStrategy": "capacityOptimized",
    "Type": "maintain",
    "TerminateInstancesWithExpiration": true
}
EOF

# Create the Spot Fleet
aws ec2 request-spot-fleet \
    --spot-fleet-request-config file://spot-fleet-config.json \
    --region ap-northeast-1

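Allocation Strategy

| Strategy | Description | Best suited for |
|---|---|---|
| lowestPrice | Prefers the lowest-priced capacity pools | Cost-sensitive workloads that can accept a higher interruption rate |
| capacityOptimized | Prefers the pools with the most available capacity | Workloads that need lower interruption risk |
| capacityOptimizedPrioritized | Combines capacity optimization with instance-type priorities | Workloads with preferred instance types |
| diversified | Spreads instances across all capacity pools | Workloads that need high availability |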
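
To make the difference between the strategies concrete, here is a toy model of how each one might place instances across pools. It is only an illustration under simplifying assumptions: the `Pool` class, the `spare_capacity` score, and the `allocate` function are all hypothetical, and the real Spot Fleet placement logic is not public and is certainly more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str            # instance type plus Availability Zone
    price: float         # hypothetical hourly Spot price
    spare_capacity: int  # hypothetical score for available capacity


def allocate(pools, target, strategy):
    """Return {pool name: instance count} for a toy version of each strategy."""
    if strategy == "lowestPrice":
        best = min(pools, key=lambda p: p.price)
        return {best.name: target}
    if strategy == "capacityOptimized":
        best = max(pools, key=lambda p: p.spare_capacity)
        return {best.name: target}
    if strategy == "diversified":
        # Spread as evenly as possible; earlier pools absorb the remainder
        base, extra = divmod(target, len(pools))
        return {p.name: base + (1 if i < extra else 0)
                for i, p in enumerate(pools)}
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")


if __name__ == "__main__":
    pools = [
        Pool("m5.large/1a", 0.030, 40),
        Pool("m5.xlarge/1a", 0.061, 120),
        Pool("c5.large/1c", 0.028, 15),
    ]
    for name in ("lowestPrice", "capacityOptimized", "diversified"):
        print(name, allocate(pools, 10, name))
```

In this toy data the cheapest pool is also the one with the least spare capacity, which is the trade-off the table describes: lowestPrice saves the most but concentrates instances where an interruption is more likely.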

Creating a Spot Fleet with a Launch Template

# First, create the Launch Template
aws ec2 create-launch-template \
    --launch-template-name my-spot-template \
    --version-description "Spot Instance Template v1" \
    --launch-template-data '{
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m5.large",
        "KeyName": "my-key-pair",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {"Key": "Environment", "Value": "Production"},
                    {"Key": "Type", "Value": "Spot"}
                ]
            }
        ]
    }'

# Create the Spot Fleet from the Launch Template
cat > spot-fleet-template-config.json << 'EOF'
{
    "TargetCapacity": 10,
    "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
    "LaunchTemplateConfigs": [
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-spot-template",
                "Version": "1"
            },
            "Overrides": [
                {"InstanceType": "m5.large", "AvailabilityZone": "ap-northeast-1a"},
                {"InstanceType": "m5.large", "AvailabilityZone": "ap-northeast-1c"},
                {"InstanceType": "m5.xlarge", "AvailabilityZone": "ap-northeast-1a"},
                {"InstanceType": "c5.large", "AvailabilityZone": "ap-northeast-1a"},
                {"InstanceType": "c5.large", "AvailabilityZone": "ap-northeast-1c"}
            ]
        }
    ],
    "AllocationStrategy": "capacityOptimized",
    "Type": "maintain"
}
EOF

aws ec2 request-spot-fleet \
    --spot-fleet-request-config file://spot-fleet-template-config.json

Managing Spot Fleet

# List Spot Fleet requests
aws ec2 describe-spot-fleet-requests \
    --query 'SpotFleetRequestConfigs[*].[SpotFleetRequestId,SpotFleetRequestState,TargetCapacity]' \
    --output table

# Change the target capacity
aws ec2 modify-spot-fleet-request \
    --spot-fleet-request-id sfr-12345678-1234-1234-1234-123456789012 \
    --target-capacity 20

# Cancel the Spot Fleet request (keep the running instances)
aws ec2 cancel-spot-fleet-requests \
    --spot-fleet-request-ids sfr-12345678-1234-1234-1234-123456789012 \
    --no-terminate-instances

# Cancel the Spot Fleet request (terminate all instances)
aws ec2 cancel-spot-fleet-requests \
    --spot-fleet-request-ids sfr-12345678-1234-1234-1234-123456789012 \
    --terminate-instances

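Interruption Handling and Graceful Shutdown

Understanding Spot Instance Interruptions

When AWS needs to reclaim Spot Instance capacity, it sends an interruption notice two minutes before terminating the instance. You can detect an interruption in the following ways:

  1. Instance Metadata Service: Poll the instance metadata
  2. CloudWatch Events / EventBridge: Receive event notifications
  3. EC2 Rebalance Recommendation: Receive an earlier signal that the instance is at elevated risk of interruption

Detecting Interruptions with Instance Metadata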

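#!/bin/bash
# spot-interruption-handler.sh
# Checks for an interruption notice every 5 seconds

METADATA_URL="http://169.254.169.254/latest"

while true; do
    # Request a fresh IMDSv2 token on each pass so it never expires mid-loop
    TOKEN=$(curl -s -X PUT "$METADATA_URL/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

    # The endpoint returns 404 until a notice is issued; -f makes curl print
    # nothing in that case, so jq only ever sees valid JSON
    INTERRUPTION_TIME=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
        "$METADATA_URL/meta-data/spot/instance-action" \
        2>/dev/null | jq -r '.time // empty' 2>/dev/null)

    if [ -n "$INTERRUPTION_TIME" ]; then
        echo "Spot Instance interruption notice received; termination time: $INTERRUPTION_TIME"

        # Run the graceful shutdown procedure
        /opt/scripts/graceful-shutdown.sh

        break
    fi

    sleep 5
done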
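
If the handler is written in Python instead of shell, the same metadata response can be parsed to find out how much time is left. The sketch below assumes the documented body shape of `spot/instance-action` (a JSON object with `action` and `time` fields); the function name is hypothetical, and fetching the body from the metadata endpoint is left out so the example runs anywhere.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Return the seconds left before the scheduled action, or None if no notice."""
    if not body.strip():
        return None  # an empty body means no notice has been issued yet
    notice = json.loads(body)
    action_time = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    action_time = action_time.replace(tzinfo=timezone.utc)
    return (action_time - now).total_seconds()


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
    print(seconds_until_interruption(sample, now))
```

Passing `now` as a parameter keeps the function easy to test; a real handler would pass `datetime.now(timezone.utc)`.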

Graceful Shutdown Script Example

#!/bin/bash
# graceful-shutdown.sh
# Graceful shutdown procedure for a Spot Instance interruption.
# All steps together must finish within the two-minute notice window.

set -e

LOG_FILE="/var/log/spot-shutdown.log"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

log "Starting graceful shutdown - Instance: $INSTANCE_ID"

# 1. Remove the instance from the load balancer
log "Deregistering from the load balancer..."
aws elbv2 deregister-targets \
    --target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:targetgroup/my-target-group/1234567890123456 \
    --targets "Id=$INSTANCE_ID" \
    2>/dev/null || true

# Wait for connection draining
log "Waiting for connections to drain (30 seconds)..."
sleep 30

# 2. Stop accepting new tasks
log "Stopping intake of new tasks..."
# If using SQS, stop polling for new messages
pkill -SIGTERM -f "sqs-worker" 2>/dev/null || true

# 3. Finish in-progress tasks
log "Waiting for in-progress tasks to finish..."
# Maximum time to wait, in seconds
MAX_WAIT=60
WAITED=0
while [ -f /tmp/task-in-progress ] && [ $WAITED -lt $MAX_WAIT ]; do
    sleep 5
    WAITED=$((WAITED + 5))
    log "Waiting for tasks to finish... ($WAITED/$MAX_WAIT seconds)"
done

# 4. Save state to S3 or DynamoDB
log "Saving application state..."
aws s3 cp /var/app/state.json \
    "s3://my-bucket/instance-states/$INSTANCE_ID/state.json" \
    2>/dev/null || true

# 5. Send a shutdown-complete notification
log "Sending shutdown notification to SNS..."
aws sns publish \
    --topic-arn arn:aws:sns:ap-northeast-1:123456789012:spot-instance-events \
    --message "{\"event\": \"graceful-shutdown\", \"instance\": \"$INSTANCE_ID\", \"timestamp\": \"$(date -Iseconds)\"}" \
    2>/dev/null || true

log "Graceful shutdown complete"
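
Because the whole procedure has to fit inside the two-minute notice, it is worth checking the time budget before deploying a script like this. The sketch below is a back-of-the-envelope check with hypothetical names; it assumes the worst case where the 5-second poller detects the notice just before its next poll, and it counts only the fixed waits (30 seconds of draining and up to 60 seconds for in-progress tasks).

```python
NOTICE_SECONDS = 120  # length of the Spot interruption notice


def remaining_budget(poll_interval: int, step_seconds: list) -> int:
    """Seconds left for the remaining steps after detection delay and fixed waits."""
    return NOTICE_SECONDS - poll_interval - sum(step_seconds)


if __name__ == "__main__":
    left = remaining_budget(poll_interval=5, step_seconds=[30, 60])
    print(f"Seconds left for the S3 upload and SNS publish: {left}")
    if left <= 0:
        print("The shutdown procedure cannot finish within the notice window")
```

With the values from the scripts above, the S3 upload and SNS publish together have roughly 25 seconds, so a large state file or a slow network could push the procedure past the deadline.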

Monitoring Interruptions with EventBridge

# Create an EventBridge rule that watches for Spot interruptions
aws events put-rule \
    --name "spot-instance-interruption" \
    --event-pattern '{
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"]
    }' \
    --state ENABLED

# Add a target (send to SNS)
aws events put-targets \
    --rule "spot-instance-interruption" \
    --targets '[
        {
            "Id": "1",
            "Arn": "arn:aws:sns:ap-northeast-1:123456789012:spot-interruption-topic"
        }
    ]'

# Create a rule for Rebalance Recommendations
aws events put-rule \
    --name "spot-rebalance-recommendation" \
    --event-pattern '{
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance Rebalance Recommendation"]
    }' \
    --state ENABLED
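
To see why these two patterns select different events, the sketch below implements a deliberately tiny subset of EventBridge matching: top-level fields only, where each pattern field lists its allowed values. The `matches` function is hypothetical and is not the EventBridge algorithm, which also supports nested fields, prefixes, and other operators. The sample event follows the documented shape of the interruption warning, in which `instance-action` is a plain string.

```python
def matches(pattern: dict, event: dict) -> bool:
    """True if every pattern field lists the event's value for that field."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}
REBALANCE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance Rebalance Recommendation"],
}

# Sample event with placeholder identifiers
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "2017-09-18T08:20:00Z",
    "region": "ap-northeast-1",
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "terminate",
    },
}

if __name__ == "__main__":
    print(matches(INTERRUPTION_PATTERN, SAMPLE_EVENT))
    print(matches(REBALANCE_PATTERN, SAMPLE_EVENT))
```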

Configuring Interruption Handling with Terraform

# eventbridge.tf

resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-instance-interruption"
  description = "Capture Spot Instance interruption warnings"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_rule" "spot_rebalance" {
  name        = "spot-rebalance-recommendation"
  description = "Capture EC2 Instance Rebalance Recommendations"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Instance Rebalance Recommendation"]
  })
}

resource "aws_cloudwatch_event_target" "interruption_to_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "SpotInterruptionHandler"
  arn       = aws_lambda_function.spot_handler.arn
}

resource "aws_cloudwatch_event_target" "rebalance_to_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_rebalance.name
  target_id = "SpotRebalanceHandler"
  arn       = aws_lambda_function.spot_handler.arn
}

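# Each rule that targets the function needs its own permission;
# otherwise invocations from the rebalance rule are rejected
resource "aws_lambda_permission" "allow_eventbridge_interruption" {
  statement_id  = "AllowExecutionFromEventBridgeInterruption"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spot_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.spot_interruption.arn
}

resource "aws_lambda_permission" "allow_eventbridge_rebalance" {
  statement_id  = "AllowExecutionFromEventBridgeRebalance"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spot_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.spot_rebalance.arn
}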

Lambda Interruption Handler Function

# lambda_function.py
import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client('ec2')
asg = boto3.client('autoscaling')
sns = boto3.client('sns')

def lambda_handler(event, context):
    logger.info(f"Received event: {json.dumps(event)}")

    detail_type = event.get('detail-type')
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id')

    if detail_type == 'EC2 Spot Instance Interruption Warning':
        handle_interruption(instance_id, detail, event.get('time'))
    elif detail_type == 'EC2 Instance Rebalance Recommendation':
        handle_rebalance(instance_id, detail)

    return {'statusCode': 200}

def handle_interruption(instance_id, detail, notice_time):
    """Handle a Spot interruption warning."""
    logger.info(f"Handling Spot interruption: {instance_id}")

    # Look up the instance
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response['Reservations'][0]['Instances'][0]

    # Check whether the instance belongs to an Auto Scaling Group
    for tag in instance.get('Tags', []):
        if tag['Key'] == 'aws:autoscaling:groupName':
            asg_name = tag['Value']
            # Mark the instance unhealthy to trigger a replacement
            asg.set_instance_health(
                InstanceId=instance_id,
                HealthStatus='Unhealthy',
                ShouldRespectGracePeriod=False
            )
            logger.info(f"Marked {instance_id} as unhealthy (ASG: {asg_name})")

    # Send a notification
    sns.publish(
        TopicArn='arn:aws:sns:ap-northeast-1:123456789012:spot-alerts',
        Subject=f'Spot Instance interruption warning: {instance_id}',
        Message=json.dumps({
            'instance_id': instance_id,
            # In this event "instance-action" is a plain string such as
            # "terminate"; the event's top-level "time" field records when
            # the warning was issued
            'notice_time': notice_time,
            'action': detail.get('instance-action')
        }, indent=2)
    )

def handle_rebalance(instance_id, detail):
    """Handle a rebalance recommendation."""
    logger.info(f"Received rebalance recommendation: {instance_id}")

    # A rebalance recommendation means the instance is at elevated risk of interruption.
    # You can replace it proactively or wait.
    sns.publish(
        TopicArn='arn:aws:sns:ap-northeast-1:123456789012:spot-alerts',
        Subject=f'Spot Instance rebalance recommendation: {instance_id}',
        Message=f'Instance {instance_id} received a rebalance recommendation; consider replacing it proactively'
    )

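Using the Spot Instance Advisor

What Is the Spot Instance Advisor?

The Spot Instance Advisor is an AWS tool that shows the interruption frequency of each instance type and its savings relative to the On-Demand price, helping you choose the most suitable instance types.

Accessing the Spot Instance Advisor

  1. Go to the EC2 Spot Instance Advisor
  2. Select the operating system and region
  3. Review the interruption frequency and savings for each instance type

Interruption Frequency Levels

| Level | Interruption frequency | Recommendation |
|---|---|---|
| <5% | Low | Suitable for most workloads |
| 5-10% | Medium-low | Suitable for workloads with fault-tolerance mechanisms |
| 10-15% | Medium | Requires solid interruption handling |
| 15-20% | Medium-high | Suitable only for short-lived tasks |
| >20% | High | Consider choosing a different instance type |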

Querying Spot Information with the AWS CLI

# Gather input for choosing Spot instance types,
# starting from the EC2 instance type details
aws ec2 describe-instance-types \
    --instance-types m5.large m5.xlarge c5.large c5.xlarge r5.large \
    --query 'InstanceTypes[*].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
    --output table

# List the Spot capacity pools (instance type and Availability Zone) and their current prices in a region
aws ec2 describe-spot-price-history \
    --instance-types m5.large m5.xlarge c5.large c5.xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
    --query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice]' \
    --output table

Choosing the Best Instance Mix

Using the information from the Spot Instance Advisor, build a diversified set of instance types:

# spot_instance_selector.py
from datetime import datetime, timezone

import boto3


def get_optimal_instance_types(region, vcpu_min, memory_min_gb, max_types=5):
    """
    Select a cost-effective mix of Spot instance types for the given requirements.
    """
    ec2 = boto3.client('ec2', region_name=region)

    # Collect the instance types that meet the requirements
    paginator = ec2.get_paginator('describe_instance_types')
    suitable_types = []

    for page in paginator.paginate():
        for instance_type in page['InstanceTypes']:
            # Skip types that cannot be launched as Spot Instances
            if 'spot' not in instance_type.get('SupportedUsageClasses', []):
                continue

            vcpus = instance_type['VCpuInfo']['DefaultVCpus']
            memory_mb = instance_type['MemoryInfo']['SizeInMiB']
            memory_gb = memory_mb / 1024

            if vcpus >= vcpu_min and memory_gb >= memory_min_gb:
                # Exclude special types (such as bare-metal instances)
                if 'metal' not in instance_type['InstanceType']:
                    suitable_types.append({
                        'type': instance_type['InstanceType'],
                        'vcpus': vcpus,
                        'memory_gb': memory_gb
                    })

    # Look up Spot prices
    spot_prices = get_spot_prices(ec2, [t['type'] for t in suitable_types])

    # Compute the price per vCPU, dropping types with no current Spot price
    priced_types = []
    for t in suitable_types:
        if t['type'] in spot_prices:
            t['spot_price'] = spot_prices[t['type']]
            t['price_per_vcpu'] = t['spot_price'] / t['vcpus']
            priced_types.append(t)

    # Sort by price per vCPU, cheapest first
    priced_types.sort(key=lambda x: x['price_per_vcpu'])

    return priced_types[:max_types]


def get_spot_prices(ec2, instance_types):
    """Return the lowest current Spot price per instance type across Availability Zones."""
    prices = {}
    paginator = ec2.get_paginator('describe_spot_price_history')

    # Query in batches (at most 100 instance types per call)
    for i in range(0, len(instance_types), 100):
        batch = instance_types[i:i+100]

        pages = paginator.paginate(
            InstanceTypes=batch,
            ProductDescriptions=['Linux/UNIX'],
            StartTime=datetime.now(timezone.utc)
        )

        for page in pages:
            for price_info in page['SpotPriceHistory']:
                instance_type = price_info['InstanceType']
                price = float(price_info['SpotPrice'])

                # Keep the lowest price
                if instance_type not in prices or price < prices[instance_type]:
                    prices[instance_type] = price

    return prices


if __name__ == '__main__':
    # Find instance types with at least 4 vCPUs and 8 GB of memory
    optimal_types = get_optimal_instance_types(
        region='ap-northeast-1',
        vcpu_min=4,
        memory_min_gb=8,
        max_types=10
    )

    print("Recommended Spot instance types:")
    print("-" * 60)
    for t in optimal_types:
        print(f"{t['type']:15} | {t['vcpus']:3} vCPU | {t['memory_gb']:6.1f} GB | "
              f"${t['spot_price']:.4f}/hr | ${t['price_per_vcpu']:.4f}/vCPU/hr")

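Integrating Auto Scaling with Spot

Mixed Instances Configuration for an Auto Scaling Group

An Auto Scaling Group can run On-Demand Instances and Spot Instances side by side, balancing cost against availability.

Creating a Mixed Auto Scaling Group with the AWS CLI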

# Create the Launch Template
aws ec2 create-launch-template \
    --launch-template-name mixed-instance-template \
    --version-description "v1" \
    --launch-template-data '{
        "ImageId": "ami-0abcdef1234567890",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "KeyName": "my-key-pair",
        "UserData": "'$(base64 -w0 <<< '#!/bin/bash
echo "Hello from Mixed Instance ASG"
')'",
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {"Key": "Name", "Value": "mixed-asg-instance"}
                ]
            }
        ]
    }'

# Create the mixed-instances Auto Scaling Group
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name mixed-instance-asg \
    --mixed-instances-policy '{
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "mixed-instance-template",
                "Version": "$Latest"
            },
            "Overrides": [
                {"InstanceType": "m5.large"},
                {"InstanceType": "m5.xlarge"},
                {"InstanceType": "m4.large"},
                {"InstanceType": "m4.xlarge"},
                {"InstanceType": "c5.large"},
                {"InstanceType": "c5.xlarge"}
            ]
        },
        "InstancesDistribution": {
            "OnDemandAllocationStrategy": "prioritized",
            "OnDemandBaseCapacity": 2,
            "OnDemandPercentageAboveBaseCapacity": 20,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }' \
    --min-size 2 \
    --max-size 20 \
    --desired-capacity 10 \
    --vpc-zone-identifier "subnet-0123456789abcdef0,subnet-abcdef0123456789a"

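Instance Distribution Explained

The settings above mean:

  • OnDemandBaseCapacity: 2: The first 2 instances are On-Demand
  • OnDemandPercentageAboveBaseCapacity: 20: 20% of the capacity above the base is On-Demand
  • For 10 instances: 2 + (10-2) * 0.2 = 3.6 ≈ 4 On-Demand and 6 Spot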
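
The split above can be computed for any group size. This is a minimal sketch with a hypothetical function name; it rounds the On-Demand share up, which matches the 3.6 ≈ 4 rounding in the example but is an assumption about how the service rounds, so verify it against a real group before relying on exact counts.

```python
def split_capacity(desired: int, base: int, pct_above_base: int):
    """Return (on_demand, spot) instance counts for a mixed-instances group."""
    above_base = max(desired - base, 0)
    # Integer ceiling of above_base * pct / 100, avoiding float rounding
    on_demand_above = (above_base * pct_above_base + 99) // 100
    on_demand = min(base, desired) + on_demand_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired in (2, 10, 20):
        on_demand, spot = split_capacity(desired, base=2, pct_above_base=20)
        print(f"desired={desired}: {on_demand} On-Demand, {spot} Spot")
```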

Configuring a Mixed Auto Scaling Group with Terraform

# asg.tf

resource "aws_launch_template" "mixed" {
  name_prefix   = "mixed-instance-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "m5.large"
  key_name      = "my-key-pair"

  vpc_security_group_ids = [aws_security_group.app.id]

  user_data = base64encode(<<-EOF
              #!/bin/bash
              yum update -y
              yum install -y httpd
              systemctl start httpd
              systemctl enable httpd
              EOF
  )

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "mixed-asg-instance"
      Environment = "production"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "mixed" {
  name                = "mixed-instance-asg"
  vpc_zone_identifier = var.subnet_ids
  min_size            = 2
  max_size            = 20
  desired_capacity    = 10

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.mixed.id
        version            = "$Latest"
      }

      override {
        instance_type     = "m5.large"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "2"
      }

      override {
        instance_type     = "m4.large"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "c5.large"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "c5.xlarge"
        weighted_capacity = "2"
      }
    }

    instances_distribution {
      on_demand_allocation_strategy            = "prioritized"
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }

  # Capacity rebalancing
  capacity_rebalance = true

  # Health checks
  health_check_type         = "ELB"
  health_check_grace_period = 300

  # Instance refresh strategy
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }

  tag {
    key                 = "Name"
    value               = "mixed-asg"
    propagate_at_launch = true
  }
}

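Capacity Rebalancing

Auto Scaling Groups support capacity rebalancing: when a Spot Instance receives a rebalance recommendation, the group proactively launches a replacement instance: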

# Enable capacity rebalancing
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name mixed-instance-asg \
    --capacity-rebalance

Scaling Policy Configuration

# scaling_policy.tf

# Target tracking scaling policy
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.mixed.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# ALB request count scaling policy
resource "aws_autoscaling_policy" "request_count" {
  name                   = "request-count-target"
  autoscaling_group_name = aws_autoscaling_group.mixed.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }
    target_value = 1000.0
  }
}

Mixed-Use Strategies

Strategy 1: Guaranteed Base Capacity

Use On-Demand Instances for base capacity and let Spot Instances handle the additional load:

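Total capacity requirement
    ├── Base capacity (On-Demand): handles the minimum required load
    ├── Reserved capacity (Reserved Instances): handles steady additional load
    └── Elastic capacity (Spot): handles peak load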

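Strategy 2: Workload Classification

Choose the purchasing option that fits the characteristics of each workload:

| Workload type | Recommended configuration | Reason |
|---|---|---|
| Web front end | Spot + ASG | Stateless, quick to replace |
| Application servers | Mixed (20% On-Demand + 80% Spot) | Needs a degree of stability |
| Databases | On-Demand or Reserved | Stateful workload, not suited to interruption |
| Batch processing | 100% Spot | Retryable, highly tolerant of interruption |
| CI/CD | Spot | Build jobs can be rerun |
| Dev/test environments | Spot | Cost-sensitive, interruptions acceptable |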

Strategy 3: Diversifying Across Capacity Pools

# Diversified Spot Fleet configuration
resource "aws_spot_fleet_request" "diversified" {
  iam_fleet_role = aws_iam_role.spot_fleet.arn

  target_capacity                     = 20
  allocation_strategy                 = "diversified"
  terminate_instances_with_expiration = true

  # Span multiple Availability Zones and instance types
  launch_specification {
    instance_type = "m5.large"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_a_id
  }

  launch_specification {
    instance_type = "m5.large"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_b_id
  }

  launch_specification {
    instance_type = "m5.xlarge"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_a_id
  }

  launch_specification {
    instance_type = "c5.large"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_a_id
  }

  launch_specification {
    instance_type = "c5.large"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_b_id
  }

  launch_specification {
    instance_type = "r5.large"
    ami           = data.aws_ami.amazon_linux.id
    subnet_id     = var.subnet_a_id
  }
}

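Practical Recommendations

  1. Choose multiple instance types: Select at least 10 instance types to improve your chances of obtaining capacity

  2. Span multiple Availability Zones: Request capacity in every Availability Zone

  3. Use the capacity-optimized strategy: Prefer the capacity pools least likely to be interrupted

  4. Set an appropriate Spot maximum price: This is usually set to the On-Demand price to avoid unexpectedly high charges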

# Look up the On-Demand price to use as a reference for the Spot maximum price
aws pricing get-products \
    --service-code AmazonEC2 \
    --filters "Type=TERM_MATCH,Field=instanceType,Value=m5.large" \
              "Type=TERM_MATCH,Field=location,Value=Asia Pacific (Tokyo)" \
              "Type=TERM_MATCH,Field=operatingSystem,Value=Linux" \
              "Type=TERM_MATCH,Field=tenancy,Value=Shared" \
              "Type=TERM_MATCH,Field=preInstalledSw,Value=NA" \
              "Type=TERM_MATCH,Field=capacitystatus,Value=Used" \
    --region us-east-1
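
Recommendations 1 and 2 multiply together: every instance type in every Availability Zone is a separate capacity pool. The sketch below generates the `Overrides` list used in the LaunchTemplateConfigs example earlier from two short lists; the function name is hypothetical, and it assumes every type is offered in every zone, which is not always the case and should be checked for your region.

```python
import json
from itertools import product


def build_overrides(instance_types, availability_zones):
    """One override per (instance type, Availability Zone) capacity pool."""
    return [
        {"InstanceType": instance_type, "AvailabilityZone": zone}
        for instance_type, zone in product(instance_types, availability_zones)
    ]


if __name__ == "__main__":
    overrides = build_overrides(
        ["m5.large", "m5.xlarge", "c5.large"],
        ["ap-northeast-1a", "ap-northeast-1c"],
    )
    print(f"{len(overrides)} capacity pools")
    print(json.dumps(overrides, indent=2))
```

Ten instance types across three zones already yields thirty pools, which is why adding types is usually the cheapest way to lower interruption risk.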

Cost Analysis and Best Practices

Setting Up Cost Monitoring

# Create a budget alert with AWS Budgets
aws budgets create-budget \
    --account-id 123456789012 \
    --budget '{
        "BudgetName": "EC2-Spot-Monthly",
        "BudgetLimit": {
            "Amount": "1000",
            "Unit": "USD"
        },
        "BudgetType": "COST",
        "CostFilters": {
            "Service": ["Amazon Elastic Compute Cloud - Compute"],
            "PurchaseType": ["Spot"]
        },
        "TimeUnit": "MONTHLY"
    }' \
    --notifications-with-subscribers '[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,
                "ThresholdType": "PERCENTAGE"
            },
            "Subscribers": [
                {
                    "SubscriptionType": "EMAIL",
                    "Address": "admin@example.com"
                }
            ]
        }
    ]'

Analyzing Spot Savings with Cost Explorer

# analyze_spot_savings.py
import boto3
from datetime import datetime, timedelta

def analyze_spot_savings(days=30):
    """Analyze the cost savings from Spot Instances."""
    ce = boto3.client('ce')

    end_date = datetime.now().strftime('%Y-%m-%d')
    start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

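    # Get the Spot cost (the exact PURCHASE_TYPE values available in your
    # account can be listed with ce.get_dimension_values)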
    spot_response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost', 'UsageQuantity'],
        Filter={
            'And': [
                {
                    'Dimensions': {
                        'Key': 'SERVICE',
                        'Values': ['Amazon Elastic Compute Cloud - Compute']
                    }
                },
                {
                    'Dimensions': {
                        'Key': 'PURCHASE_TYPE',
                        'Values': ['Spot']
                    }
                }
            ]
        }
    )

    # Get the On-Demand cost
    ondemand_response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost', 'UsageQuantity'],
        Filter={
            'And': [
                {
                    'Dimensions': {
                        'Key': 'SERVICE',
                        'Values': ['Amazon Elastic Compute Cloud - Compute']
                    }
                },
                {
                    'Dimensions': {
                        'Key': 'PURCHASE_TYPE',
                        'Values': ['On Demand Instances']
                    }
                }
            ]
        }
    )

    print(f"\nEC2 cost analysis for the past {days} days")
    print("=" * 50)

    total_spot = 0
    total_ondemand = 0

    for result in spot_response['ResultsByTime']:
        period = result['TimePeriod']
        cost = float(result['Total']['UnblendedCost']['Amount'])
        total_spot += cost
        print(f"Spot ({period['Start']} ~ {period['End']}): ${cost:.2f}")

    for result in ondemand_response['ResultsByTime']:
        period = result['TimePeriod']
        cost = float(result['Total']['UnblendedCost']['Amount'])
        total_ondemand += cost
        print(f"On-Demand ({period['Start']} ~ {period['End']}): ${cost:.2f}")

    print("-" * 50)
    print(f"Total Spot cost: ${total_spot:.2f}")
    print(f"Total On-Demand cost: ${total_ondemand:.2f}")

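    if total_ondemand > 0:
        # This compares Spot spend with On-Demand spend for the period;
        # it reflects the spending mix, not the discount on identical usage
        savings_rate = (1 - total_spot / total_ondemand) * 100 if total_spot < total_ondemand else 0
        print(f"Relative savings rate: {savings_rate:.1f}%")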

if __name__ == '__main__':
    analyze_spot_savings(30)
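
The script above compares what was spent on Spot with what was spent on On-Demand, which are usually different workloads. To estimate what Spot actually saved, compare the Spot bill with what the same instance hours would have cost at the On-Demand rate. The sketch below is a simplified calculation with hypothetical names; it assumes a single instance type, so real usage would need to be grouped by type first.

```python
def estimate_spot_savings(spot_cost: float, spot_hours: float,
                          on_demand_hourly: float):
    """Return (amount saved, savings rate in percent) for the given Spot usage."""
    equivalent_cost = spot_hours * on_demand_hourly
    if equivalent_cost <= 0:
        return 0.0, 0.0
    saved = equivalent_cost - spot_cost
    return saved, saved / equivalent_cost * 100


if __name__ == "__main__":
    # One m5.xlarge for a 730-hour month, using the illustrative prices
    # from the cost comparison table
    saved, rate = estimate_spot_savings(
        spot_cost=36.50, spot_hours=730, on_demand_hourly=0.248
    )
    print(f"Saved ${saved:.2f} ({rate:.1f}%)")
```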

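Summary of Cost Optimization Best Practices

  1. Diversify the instance configuration

    • Choose at least 10 different instance types
    • Spread across all Availability Zones
    • Use the capacity-optimized allocation strategy
  2. Handle interruptions properly

    • Implement handling for the two-minute interruption notice
    • Monitor interruption events with EventBridge
    • Design a stateless architecture to simplify recovery
  3. Mix purchasing options

    • Use On-Demand or Reserved for critical workloads
    • Use Spot for workloads that tolerate interruption
    • Use a mixed-instances Auto Scaling Group
  4. Monitor and alert

    • Set up cost budget alerts
    • Monitor Spot interruption frequency
    • Track the amount saved
  5. Design the workload for Spot

    • Design applications to be stateless
    • Implement checkpointing and resumable processing
    • Keep important data in external storage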
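
Checkpointing is the practice that most directly limits how much work an interruption can destroy. The sketch below shows the idea for a simple list of work items: progress is recorded after every item, so a replacement instance resumes where the interrupted one stopped. All names are hypothetical, and the checkpoint is a local file only to keep the example self-contained; on Spot it would live in external storage such as S3 or DynamoDB.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_items(items, path, handle, stop_after=None) -> bool:
    """Process items from the last checkpoint; stop_after simulates an interruption."""
    processed = 0
    for index in range(load_checkpoint(path), len(items)):
        if stop_after is not None and processed >= stop_after:
            return False  # interrupted before finishing
        handle(items[index])
        save_checkpoint(path, index + 1)
        processed += 1
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        checkpoint = os.path.join(workdir, "checkpoint.json")
        seen = []
        jobs = ["a", "b", "c", "d", "e"]
        process_items(jobs, checkpoint, seen.append, stop_after=2)  # interrupted
        process_items(jobs, checkpoint, seen.append)                # resumed
        print(seen)
```

If an interruption lands between `handle` and `save_checkpoint`, that one item is processed again after the restart, so the handler should be idempotent.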

Checklist

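## Spot Instance Adoption Checklist

### Initial Assessment
- [ ] Can the workload tolerate interruption?
- [ ] Have suitable instance types been identified?
- [ ] Has capacity in each Availability Zone been assessed?

### Technical Preparation
- [ ] Has a Launch Template been created?
- [ ] Is an interruption handling mechanism in place?
- [ ] Have the EventBridge rules been configured?
- [ ] Has a graceful shutdown script been implemented?

### Auto Scaling Configuration
- [ ] Has a mixed instances policy been configured?
- [ ] Has capacity rebalancing been enabled?
- [ ] Have appropriate scaling policies been configured?

### Monitoring and Alerts
- [ ] Has a CloudWatch dashboard been set up?
- [ ] Have cost budget alerts been set up?
- [ ] Have interruption notifications been set up?

### Testing and Validation
- [ ] Has the interruption handling flow been tested?
- [ ] Has the automatic replacement mechanism been verified?
- [ ] Has normal application recovery been confirmed?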

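Summary

AWS EC2 Spot Instances are a powerful cost optimization tool that, used correctly, can save up to 90% on compute costs. This article covered:

  • Spot Instance basics: how they work and how pricing behaves
  • Comparison: how they differ from On-Demand and Reserved Instances
  • Spot Fleet: a more advanced way to manage multiple Spot Instances
  • Interruption handling: making sure workloads degrade gracefully when interrupted
  • Spot Instance Advisor: choosing the best instance types
  • Auto Scaling integration: automating capacity management
  • Mixed strategies: balancing cost and availability
  • Cost analysis: monitoring and optimizing spend

The keys to using Spot Instances successfully are a fault-tolerant architecture, a diversified instance configuration, and a thorough interruption handling mechanism. With the guidance in this article, you can start adopting Spot Instances for suitable workloads and significantly reduce your cloud compute costs.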
