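Introduction
Cost control is a perennial concern in cloud computing. AWS EC2 Spot Instances let you use EC2 compute capacity at discounts of up to 90% off the On-Demand price. This article covers how Spot Instances work, best practices for using them, and how to integrate them into your cloud architecture.
Used correctly, Spot Instances can cut compute costs substantially while keeping workloads available and performant.
Spot Instance Concepts and How They Work
What Is a Spot Instance?
A Spot Instance is an EC2 purchasing option that runs on spare compute capacity in the AWS cloud. It uses the same hardware as an On-Demand Instance, but the price can be as low as 10% of the On-Demand price.
How It Works
Spot Instances are priced and allocated according to supply and demand:
- Capacity pool: AWS groups the spare capacity of each instance type in each Availability Zone into a capacity pool
- Spot price: the price adjusts gradually based on long-term supply and demand trends and no longer swings sharply as it did in the early days
- Capacity interruption: when AWS needs the capacity back, it issues an interruption notice 2 minutes in advance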
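Because a capacity pool is defined per instance type and per Availability Zone, the number of pools a request can draw from grows with both lists. The sketch below only illustrates that counting; the function name and the chosen types and zones are hypothetical.

```python
from itertools import product


def capacity_pools(instance_types, availability_zones):
    """Return one (instance type, Availability Zone) pair per capacity pool."""
    return list(product(instance_types, availability_zones))


# Hypothetical selection: three instance types across two zones.
pools = capacity_pools(
    ["m5.large", "m5.xlarge", "c5.large"],
    ["ap-northeast-1a", "ap-northeast-1c"],
)
print(f"{len(pools)} capacity pools")
for instance_type, zone in pools:
    print(f"{instance_type} @ {zone}")
```

Adding one more instance type here adds two pools, one per zone, which is why diversifying across both dimensions is recommended later in this article.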
Spot Instance Lifecycle
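| Request a Spot Instance
↓
Capacity available? ──No──→ Request stays pending
│
Yes
↓
Instance launches
↓
Running normally
↓
AWS needs to reclaim capacity? ──No──→ Keeps running
│
Yes
↓
2-minute interruption notice sent
↓
Instance terminated / stopped / hibernated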
|
Querying Spot Price History
| # Query the Spot price history of one instance type for the past 7 days (the -d option requires GNU date)
aws ec2 describe-spot-price-history \
--instance-types m5.large \
--product-descriptions "Linux/UNIX" \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ" -d "7 days ago") \
--end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--region ap-northeast-1
# Query the current Spot price of several instance types
aws ec2 describe-spot-price-history \
--instance-types m5.large m5.xlarge c5.large c5.xlarge \
--product-descriptions "Linux/UNIX" \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--region ap-northeast-1 \
--query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice]' \
--output table
|
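Comparison with On-Demand and Reserved Instances
Comparing the Three Pricing Models
| Characteristic | On-Demand Instance | Reserved Instance | Spot Instance |
|---|---|---|---|
| Discount | None (baseline price) | Up to 72% | Up to 90% |
| Commitment | None | 1 or 3 years | None |
| Availability | High | High | Can be interrupted |
| Best suited for | Short-term, irregular workloads | Steady, predictable workloads | Fault-tolerant, flexible workloads |
| Payment | Billed per second | All upfront / partial upfront / no upfront | Billed per second |
Cost Comparison Example
Using an m5.xlarge instance in the ap-northeast-1 Region as an example (prices are illustrative):
| Pricing model | Hourly price (USD) | Monthly cost (730 hours) | Savings vs. On-Demand |
|---|---|---|---|
| On-Demand | $0.248 | $181.04 | - |
| Reserved (1 year, no upfront) | $0.156 | $113.88 | 37% |
| Reserved (3 years, all upfront) | $0.099 | $72.27 | 60% |
| Spot | $0.050-0.080 | $36.50-58.40 | 68-80% |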
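The figures in the table follow from two small formulas: monthly cost is the hourly price times 730 hours, and savings are measured against the On-Demand hourly price. A minimal sketch that reproduces them, using the illustrative prices above (the function names are made up for this example):

```python
HOURS_PER_MONTH = 730


def monthly_cost(hourly_price):
    """Monthly cost in USD for an instance that runs the whole month."""
    return round(hourly_price * HOURS_PER_MONTH, 2)


def savings_percent(on_demand_hourly, other_hourly):
    """Savings relative to the On-Demand price, rounded to a whole percent."""
    return round((1 - other_hourly / on_demand_hourly) * 100)


ON_DEMAND = 0.248  # illustrative m5.xlarge price from the table
for label, price in [
    ("Reserved (1 year, no upfront)", 0.156),
    ("Reserved (3 years, all upfront)", 0.099),
    ("Spot (low end)", 0.050),
    ("Spot (high end)", 0.080),
]:
    print(f"{label}: ${monthly_cost(price)} per month, "
          f"{savings_percent(ON_DEMAND, price)}% saved")
```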
Recommendations for Choosing
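| Workload assessment
│
├── Can it tolerate interruption?
│ │
│ No ──→ Use On-Demand or Reserved Instances
│ │
│ Yes
│ ↓
├── Is the workload steady and continuous?
│ │
│ Yes ──→ Mix Reserved Instances + Spot
│ │
│ No ──→ Use Spot Instances
│
└── Must it be available immediately and never interrupted? ──→ Use On-Demand Instances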
|
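The decision tree above can also be written as a small function. This is only a restatement of the tree under the same two questions; the function name and the returned labels are hypothetical.

```python
def recommend_pricing_model(tolerates_interruption, steady_workload):
    """Map the two questions in the decision tree to a purchasing option."""
    if not tolerates_interruption:
        # Interruption is not acceptable: stay off Spot entirely.
        return "On-Demand or Reserved Instances"
    if steady_workload:
        # Steady and interruption-tolerant: cover the base with Reserved, add Spot.
        return "Reserved Instances + Spot"
    return "Spot Instances"


print(recommend_pricing_model(tolerates_interruption=True, steady_workload=False))
```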
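Configuring and Managing Spot Fleet
What Is Spot Fleet?
A Spot Fleet is a collection of instances that automatically maintains the target capacity of Spot Instances you specify. It can draw on multiple instance types and Availability Zones, which improves the chance of obtaining capacity.
Creating a Spot Fleet Request
Creating a Spot Fleet with the AWS CLI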
| # Write the Spot Fleet configuration file
cat > spot-fleet-config.json << 'EOF'
{
"SpotPrice": "0.10",
"TargetCapacity": 10,
"IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
"LaunchSpecifications": [
{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "m5.large",
"SubnetId": "subnet-0123456789abcdef0",
"SecurityGroups": [
{
"GroupId": "sg-0123456789abcdef0"
}
],
"KeyName": "my-key-pair"
},
{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "m5.xlarge",
"SubnetId": "subnet-0123456789abcdef0",
"SecurityGroups": [
{
"GroupId": "sg-0123456789abcdef0"
}
],
"KeyName": "my-key-pair"
},
{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "c5.large",
"SubnetId": "subnet-abcdef0123456789a",
"SecurityGroups": [
{
"GroupId": "sg-0123456789abcdef0"
}
],
"KeyName": "my-key-pair"
}
],
"AllocationStrategy": "capacityOptimized",
"Type": "maintain",
"TerminateInstancesWithExpiration": true
}
EOF
# Create the Spot Fleet
aws ec2 request-spot-fleet \
--spot-fleet-request-config file://spot-fleet-config.json \
--region ap-northeast-1
|
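Allocation Strategy
| Strategy | Description | Best suited for |
|---|---|---|
| lowestPrice | Launches from the lowest-priced capacity pools first | Cost-sensitive workloads that can accept a higher interruption rate |
| capacityOptimized | Launches from the pools with the most available capacity | Workloads that need a lower interruption risk |
| capacityOptimizedPrioritized | Combines capacity optimization with instance type priorities | Workloads with preferred instance types |
| diversified | Spreads instances across all capacity pools | Workloads that need high availability |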
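To make the differences between the strategies concrete, here is a toy model of three of them. It is a deliberate simplification: the `Pool` class, the prices, and the `spare_capacity` scores are invented, AWS does not publish pool depth, and the real strategies may split a request across several pools rather than pick a single one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    instance_type: str
    zone: str
    spot_price: float    # USD per hour, made-up value
    spare_capacity: int  # made-up depth score, not an AWS-provided number


def lowest_price(pools, target):
    """Toy lowestPrice: place the whole target in the cheapest pool."""
    return {min(pools, key=lambda p: p.spot_price): target}


def capacity_optimized(pools, target):
    """Toy capacityOptimized: place the whole target in the deepest pool."""
    return {max(pools, key=lambda p: p.spare_capacity): target}


def diversified(pools, target):
    """Toy diversified: spread the target as evenly as possible over all pools."""
    base, extra = divmod(target, len(pools))
    return {p: base + (1 if i < extra else 0) for i, p in enumerate(pools)}


POOLS = [
    Pool("m5.large", "ap-northeast-1a", 0.035, 40),
    Pool("m5.large", "ap-northeast-1c", 0.033, 15),
    Pool("c5.large", "ap-northeast-1a", 0.031, 5),
]

for strategy in (lowest_price, capacity_optimized, diversified):
    placement = strategy(POOLS, 10)
    summary = {f"{p.instance_type}/{p.zone}": n for p, n in placement.items()}
    print(strategy.__name__, summary)
```

In this model the cheapest pool is also the shallowest one, which is the situation where lowestPrice tends to trade a lower price for a higher interruption risk.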
Creating a Spot Fleet with a Launch Template
| # First, create the Launch Template
aws ec2 create-launch-template \
--launch-template-name my-spot-template \
--version-description "Spot Instance Template v1" \
--launch-template-data '{
"ImageId": "ami-0abcdef1234567890",
"InstanceType": "m5.large",
"KeyName": "my-key-pair",
"SecurityGroupIds": ["sg-0123456789abcdef0"],
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{"Key": "Environment", "Value": "Production"},
{"Key": "Type", "Value": "Spot"}
]
}
]
}'
# Create the Spot Fleet from the Launch Template
cat > spot-fleet-template-config.json << 'EOF'
{
"TargetCapacity": 10,
"IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
"LaunchTemplateConfigs": [
{
"LaunchTemplateSpecification": {
"LaunchTemplateName": "my-spot-template",
"Version": "1"
},
"Overrides": [
{"InstanceType": "m5.large", "AvailabilityZone": "ap-northeast-1a"},
{"InstanceType": "m5.large", "AvailabilityZone": "ap-northeast-1c"},
{"InstanceType": "m5.xlarge", "AvailabilityZone": "ap-northeast-1a"},
{"InstanceType": "c5.large", "AvailabilityZone": "ap-northeast-1a"},
{"InstanceType": "c5.large", "AvailabilityZone": "ap-northeast-1c"}
]
}
],
"AllocationStrategy": "capacityOptimized",
"Type": "maintain"
}
EOF
aws ec2 request-spot-fleet \
--spot-fleet-request-config file://spot-fleet-template-config.json
|
Managing a Spot Fleet
| # List Spot Fleet requests
aws ec2 describe-spot-fleet-requests \
--query 'SpotFleetRequestConfigs[*].[SpotFleetRequestId,SpotFleetRequestState,TargetCapacity]' \
--output table
# Change the target capacity
aws ec2 modify-spot-fleet-request \
--spot-fleet-request-id sfr-12345678-1234-1234-1234-123456789012 \
--target-capacity 20
# Cancel the Spot Fleet request (keep running instances)
aws ec2 cancel-spot-fleet-requests \
--spot-fleet-request-ids sfr-12345678-1234-1234-1234-123456789012 \
--no-terminate-instances
# Cancel the Spot Fleet request (terminate all instances)
aws ec2 cancel-spot-fleet-requests \
--spot-fleet-request-ids sfr-12345678-1234-1234-1234-123456789012 \
--terminate-instances
|
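Interruption Handling and Graceful Shutdown
Understanding Spot Instance Interruptions
When AWS needs to reclaim Spot capacity, it sends an interruption notice 2 minutes before the instance is terminated, stopped, or hibernated. You can detect interruptions in the following ways:
- Instance Metadata Service: poll the instance metadata from inside the instance
- CloudWatch Events / EventBridge: receive the interruption warning as an event
- EC2 Rebalance Recommendation: receive an earlier signal that the instance is at elevated risk of interruption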
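| #!/bin/bash
# spot-interruption-handler.sh
# Poll the instance metadata service every 5 seconds for an interruption notice.
# Request an IMDSv2 session token (valid for 6 hours).
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
while true; do
  # The endpoint returns 404 until a notice is issued; -f makes curl print nothing in that case.
  INTERRUPTION_TIME=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/spot/instance-action \
    | jq -r '.time // empty' 2>/dev/null)
  if [ -n "$INTERRUPTION_TIME" ]; then
    echo "Spot Instance interruption notice received; action time: $INTERRUPTION_TIME"
    # Run the graceful shutdown procedure
    /opt/scripts/graceful-shutdown.sh
    break
  fi
  sleep 5
done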
|
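The document returned by `spot/instance-action` is a small JSON object with an `action` and a UTC `time`. The sketch below shows how a handler could work out how much time is left; the sample document uses a placeholder timestamp, and `seconds_until_action` is a hypothetical helper, not part of any AWS SDK.

```python
import json
from datetime import datetime, timezone


def seconds_until_action(document, now=None):
    """Return (action, seconds remaining) for a spot/instance-action document."""
    data = json.loads(document)
    # The "time" field is UTC, formatted like 2017-09-18T08:22:00Z.
    deadline = datetime.strptime(data["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return data["action"], (deadline - now).total_seconds()


# Placeholder document and clock, so the example is reproducible.
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
clock = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
action, remaining = seconds_until_action(sample, clock)
print(f"{action} in {remaining:.0f} seconds")
```

Passing the current time as a parameter keeps the function testable without access to the metadata service.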
Example Graceful Shutdown Script
| #!/bin/bash
# graceful-shutdown.sh
# Graceful shutdown procedure for a Spot Instance interruption
set -e
LOG_FILE="/var/log/spot-shutdown.log"
# Fetch an IMDSv2 token, then the instance ID
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)
log() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}
log "Starting graceful shutdown - Instance: $INSTANCE_ID"
# 1. Deregister from the load balancer
log "Deregistering from the load balancer..."
aws elbv2 deregister-targets \
  --target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:targetgroup/my-target-group/1234567890123456 \
  --targets Id="$INSTANCE_ID" \
  2>/dev/null || true
# Wait for connection draining
log "Waiting for connections to drain (30 seconds)..."
sleep 30
# 2. Stop accepting new tasks
log "Stopping intake of new tasks..."
# If the worker polls SQS, stop it from fetching new messages
pkill -SIGTERM -f "sqs-worker" 2>/dev/null || true
# 3. Let in-flight tasks finish
log "Waiting for in-flight tasks to finish..."
# Upper bound on the wait, so the whole procedure fits in the 2-minute notice
MAX_WAIT=60
WAITED=0
while [ -f /tmp/task-in-progress ] && [ $WAITED -lt $MAX_WAIT ]; do
  sleep 5
  WAITED=$((WAITED + 5))
  log "Waiting for tasks to finish... ($WAITED/$MAX_WAIT seconds)"
done
# 4. Save state to S3 or DynamoDB
log "Saving application state..."
aws s3 cp /var/app/state.json \
  "s3://my-bucket/instance-states/$INSTANCE_ID/state.json" \
  2>/dev/null || true
# 5. Send a shutdown-complete notification
log "Publishing shutdown notification to SNS..."
aws sns publish \
  --topic-arn arn:aws:sns:ap-northeast-1:123456789012:spot-instance-events \
  --message "{\"event\": \"graceful-shutdown\", \"instance\": \"$INSTANCE_ID\", \"timestamp\": \"$(date -Iseconds)\"}" \
  2>/dev/null || true
log "Graceful shutdown complete"
|
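The whole procedure has to fit inside the 2-minute notice, so it is worth adding up the worst-case duration of each step. The sketch below does that for the waits used in the scripts above (5-second polling interval, 30-second drain, 60-second task wait); the 15-second allowance for the S3 upload and SNS publish is an assumption, not a measured value.

```python
NOTICE_SECONDS = 120  # length of the Spot interruption notice


def shutdown_budget(steps, notice_seconds=NOTICE_SECONDS):
    """Return (worst-case total seconds, seconds of slack) for (name, max_seconds) steps."""
    total = sum(seconds for _, seconds in steps)
    return total, notice_seconds - total


steps = [
    ("detection delay (polling interval)", 5),
    ("connection draining", 30),
    ("wait for in-flight tasks", 60),
    ("S3 upload and SNS publish", 15),  # assumed allowance
]
total, slack = shutdown_budget(steps)
print(f"worst case {total}s, slack {slack}s")
if slack < 0:
    print("The shutdown procedure does not fit in the interruption notice")
```

If the slack turns negative after you tune the waits, shorten the drain or task wait rather than relying on the instance staying up longer.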
Monitoring Interruptions with EventBridge
| # Create an EventBridge rule that matches Spot interruption warnings
aws events put-rule \
--name "spot-instance-interruption" \
--event-pattern '{
"source": ["aws.ec2"],
"detail-type": ["EC2 Spot Instance Interruption Warning"]
}' \
--state ENABLED
# Add a target (deliver to SNS)
aws events put-targets \
--rule "spot-instance-interruption" \
--targets '[
{
"Id": "1",
"Arn": "arn:aws:sns:ap-northeast-1:123456789012:spot-interruption-topic"
}
]'
# Create a rule for rebalance recommendations
aws events put-rule \
--name "spot-rebalance-recommendation" \
--event-pattern '{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance Rebalance Recommendation"]
}' \
--state ENABLED
|
| # eventbridge.tf
resource "aws_cloudwatch_event_rule" "spot_interruption" {
name = "spot-instance-interruption"
description = "Capture Spot Instance interruption warnings"
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["EC2 Spot Instance Interruption Warning"]
})
}
resource "aws_cloudwatch_event_rule" "spot_rebalance" {
name = "spot-rebalance-recommendation"
description = "Capture EC2 Instance Rebalance Recommendations"
event_pattern = jsonencode({
source = ["aws.ec2"]
detail-type = ["EC2 Instance Rebalance Recommendation"]
})
}
resource "aws_cloudwatch_event_target" "interruption_to_lambda" {
rule = aws_cloudwatch_event_rule.spot_interruption.name
target_id = "SpotInterruptionHandler"
arn = aws_lambda_function.spot_handler.arn
}
resource "aws_cloudwatch_event_target" "rebalance_to_lambda" {
rule = aws_cloudwatch_event_rule.spot_rebalance.name
target_id = "SpotRebalanceHandler"
arn = aws_lambda_function.spot_handler.arn
}
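resource "aws_lambda_permission" "allow_eventbridge" {
statement_id = "AllowExecutionFromEventBridge"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.spot_handler.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.spot_interruption.arn
}
# The rebalance rule also targets the Lambda function, so it needs its own permission
resource "aws_lambda_permission" "allow_eventbridge_rebalance" {
statement_id = "AllowExecutionFromEventBridgeRebalance"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.spot_handler.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.spot_rebalance.arn
}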
|
Lambda Interruption Handler
| # lambda_function.py
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

ec2 = boto3.client('ec2')
asg = boto3.client('autoscaling')
sns = boto3.client('sns')


def lambda_handler(event, context):
    logger.info(f"Received event: {json.dumps(event)}")
    detail_type = event.get('detail-type')
    detail = event.get('detail', {})
    instance_id = detail.get('instance-id')
    if detail_type == 'EC2 Spot Instance Interruption Warning':
        handle_interruption(instance_id, detail)
    elif detail_type == 'EC2 Instance Rebalance Recommendation':
        handle_rebalance(instance_id, detail)
    return {'statusCode': 200}


def handle_interruption(instance_id, detail):
    """Handle a Spot interruption warning."""
    logger.info(f"Handling Spot interruption: {instance_id}")
    # Look up the instance
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response['Reservations'][0]['Instances'][0]
    # Check whether the instance belongs to an Auto Scaling group
    for tag in instance.get('Tags', []):
        if tag['Key'] == 'aws:autoscaling:groupName':
            asg_name = tag['Value']
            # Mark the instance unhealthy so the group replaces it
            asg.set_instance_health(
                InstanceId=instance_id,
                HealthStatus='Unhealthy',
                ShouldRespectGracePeriod=False
            )
            logger.info(f"Marked {instance_id} as unhealthy (ASG: {asg_name})")
    # Send a notification
    sns.publish(
        TopicArn='arn:aws:sns:ap-northeast-1:123456789012:spot-alerts',
        Subject=f'Spot Instance interruption warning: {instance_id}',
        Message=json.dumps({
            'instance_id': instance_id,
            # In the EventBridge event, "instance-action" is a plain string
            # (terminate, stop, or hibernate); the event carries no action time.
            'action': detail.get('instance-action')
        }, indent=2)
    )


def handle_rebalance(instance_id, detail):
    """Handle a rebalance recommendation."""
    logger.info(f"Received rebalance recommendation: {instance_id}")
    # A rebalance recommendation means the instance is at elevated risk of interruption.
    # You can replace it proactively or wait.
    sns.publish(
        TopicArn='arn:aws:sns:ap-northeast-1:123456789012:spot-alerts',
        Subject=f'Spot Instance rebalance recommendation: {instance_id}',
        Message=f'Instance {instance_id} received a rebalance recommendation; consider replacing it proactively'
    )
|
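When testing a handler like this locally, it helps to have the event shape at hand. The sample below follows the structure AWS documents for the interruption warning, where `detail` holds `instance-id` and a string-valued `instance-action`; the account ID, timestamp, and instance ID are placeholders, and `summarize_spot_event` is a hypothetical helper.

```python
def summarize_spot_event(event):
    """Extract the fields a Spot event handler typically needs."""
    detail = event.get("detail", {})
    return {
        "kind": event.get("detail-type"),
        "instance_id": detail.get("instance-id"),
        # Present on interruption warnings; None for rebalance recommendations.
        "action": detail.get("instance-action"),
        # Time the event was issued, not the time the instance will be reclaimed.
        "issued_at": event.get("time"),
    }


SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",       # placeholder
    "time": "2024-01-01T00:00:00Z",  # placeholder
    "region": "ap-northeast-1",
    "resources": ["arn:aws:ec2:ap-northeast-1a:instance/i-1234567890abcdef0"],
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "terminate",
    },
}

summary = summarize_spot_event(SAMPLE_EVENT)
print(summary)
```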
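Using Spot Instance Advisor
What Is Spot Instance Advisor?
Spot Instance Advisor is an AWS tool that shows, for each instance type, how often it is interrupted and how much it saves relative to the On-Demand price, which helps you choose suitable instance types.
Accessing Spot Instance Advisor
- Go to the EC2 Spot Instance Advisor
- Select the operating system and Region
- Review the interruption frequency and savings for each instance type
Interruption Frequency Levels
| Interruption frequency | Level | Recommendation |
|---|---|---|
| <5% | Low | Suitable for most workloads |
| 5-10% | Medium-low | Suitable for workloads with fault tolerance built in |
| 10-15% | Medium | Requires solid interruption handling |
| 15-20% | Medium-high | Suitable only for short-lived tasks |
| >20% | High | Consider a different instance type |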
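If you record interruption frequencies for the instance types you shortlist, the table can be applied mechanically. The sketch below is one way to do that; the function name is hypothetical, and assigning each boundary value (5, 10, 15, 20) to the higher bucket is an assumption, since the table does not say which side the boundaries belong to.

```python
def interruption_level(frequency_percent):
    """Map an interruption frequency (in percent) to the level used in the table."""
    if frequency_percent < 5:
        return "low"
    if frequency_percent < 10:
        return "medium-low"
    if frequency_percent < 15:
        return "medium"
    if frequency_percent < 20:
        return "medium-high"
    return "high"


# Hypothetical readings for three candidate instance types.
for instance_type, frequency in [("m5.large", 3), ("c5.large", 12), ("r5.large", 25)]:
    print(f"{instance_type}: {frequency}% -> {interruption_level(frequency)}")
```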
Querying Spot Information with the AWS CLI
| # Shortlist candidate instance types
# Look up the vCPU count and memory size of each instance type
aws ec2 describe-instance-types \
--instance-types m5.large m5.xlarge c5.large c5.xlarge r5.large \
--query 'InstanceTypes[*].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
--output table
# List the current Spot price of each capacity pool (instance type and Availability Zone) in the Region
aws ec2 describe-spot-price-history \
--instance-types m5.large m5.xlarge c5.large c5.xlarge \
--product-descriptions "Linux/UNIX" \
--start-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--query 'SpotPriceHistory[*].[InstanceType,AvailabilityZone,SpotPrice]' \
--output table
|
Choosing the Best Instance Mix
Use the information from Spot Instance Advisor to build a diversified mix of instance types:
| # spot_instance_selector.py
from datetime import datetime, timezone

import boto3


def get_optimal_instance_types(region, vcpu_min, memory_min_gb, max_types=5):
    """Pick the Spot instance types with the lowest price per vCPU that meet the requirements."""
    ec2 = boto3.client('ec2', region_name=region)
    # Collect the instance types that satisfy the vCPU and memory requirements
    paginator = ec2.get_paginator('describe_instance_types')
    suitable_types = []
    for page in paginator.paginate():
        for instance_type in page['InstanceTypes']:
            vcpus = instance_type['VCpuInfo']['DefaultVCpus']
            memory_gb = instance_type['MemoryInfo']['SizeInMiB'] / 1024
            if vcpus >= vcpu_min and memory_gb >= memory_min_gb:
                # Exclude special types such as bare-metal instances
                if 'metal' not in instance_type['InstanceType']:
                    suitable_types.append({
                        'type': instance_type['InstanceType'],
                        'vcpus': vcpus,
                        'memory_gb': memory_gb
                    })
    # Look up Spot prices
    spot_prices = get_spot_prices(ec2, [t['type'] for t in suitable_types])
    # Compute the price per vCPU; types without a Spot price sort last
    for t in suitable_types:
        t['spot_price'] = spot_prices.get(t['type'], float('inf'))
        t['price_per_vcpu'] = t['spot_price'] / t['vcpus']
    # Sort by price per vCPU
    suitable_types.sort(key=lambda x: x['price_per_vcpu'])
    return suitable_types[:max_types]


def get_spot_prices(ec2, instance_types):
    """Return the current Spot price of each instance type (lowest across Availability Zones)."""
    prices = {}
    # Query in batches of at most 100 instance types
    for i in range(0, len(instance_types), 100):
        batch = instance_types[i:i + 100]
        response = ec2.describe_spot_price_history(
            InstanceTypes=batch,
            ProductDescriptions=['Linux/UNIX'],
            StartTime=datetime.now(timezone.utc)
        )
        for price_info in response['SpotPriceHistory']:
            instance_type = price_info['InstanceType']
            price = float(price_info['SpotPrice'])
            # Keep the lowest price seen
            if instance_type not in prices or price < prices[instance_type]:
                prices[instance_type] = price
    return prices


if __name__ == '__main__':
    # Find instance types with at least 4 vCPUs and 8 GB of memory
    optimal_types = get_optimal_instance_types(
        region='ap-northeast-1',
        vcpu_min=4,
        memory_min_gb=8,
        max_types=10
    )
    print("Recommended Spot instance types:")
    print("-" * 60)
    for t in optimal_types:
        print(f"{t['type']:15} | {t['vcpus']:3} vCPU | {t['memory_gb']:6.1f} GB | "
              f"${t['spot_price']:.4f}/hr | ${t['price_per_vcpu']:.4f}/vCPU/hr")
|
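Integrating Auto Scaling with Spot
Mixed Instances Configuration for Auto Scaling Groups
An Auto Scaling group can run On-Demand Instances and Spot Instances side by side, which balances cost against availability.
Creating a Mixed Auto Scaling Group with the AWS CLI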
| # Create the Launch Template
aws ec2 create-launch-template \
--launch-template-name mixed-instance-template \
--version-description "v1" \
--launch-template-data '{
"ImageId": "ami-0abcdef1234567890",
"SecurityGroupIds": ["sg-0123456789abcdef0"],
"KeyName": "my-key-pair",
"UserData": "'$(base64 -w0 <<< '#!/bin/bash
echo "Hello from Mixed Instance ASG"
')'",
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{"Key": "Name", "Value": "mixed-asg-instance"}
]
}
]
}'
# Create the mixed-instances Auto Scaling group
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name mixed-instance-asg \
--mixed-instances-policy '{
"LaunchTemplate": {
"LaunchTemplateSpecification": {
"LaunchTemplateName": "mixed-instance-template",
"Version": "$Latest"
},
"Overrides": [
{"InstanceType": "m5.large"},
{"InstanceType": "m5.xlarge"},
{"InstanceType": "m4.large"},
{"InstanceType": "m4.xlarge"},
{"InstanceType": "c5.large"},
{"InstanceType": "c5.xlarge"}
]
},
"InstancesDistribution": {
"OnDemandAllocationStrategy": "prioritized",
"OnDemandBaseCapacity": 2,
"OnDemandPercentageAboveBaseCapacity": 20,
"SpotAllocationStrategy": "capacity-optimized"
}
}' \
--min-size 2 \
--max-size 20 \
--desired-capacity 10 \
--vpc-zone-identifier "subnet-0123456789abcdef0,subnet-abcdef0123456789a"
|
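How Instances Are Allocated
The settings above mean:
- OnDemandBaseCapacity: 2: the first 2 instances are On-Demand
- OnDemandPercentageAboveBaseCapacity: 20: 20% of the capacity above the base is On-Demand
- With 10 instances: 2 + (10 - 2) × 0.2 = 3.6, rounded to 4 On-Demand, leaving 6 Spot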
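The same arithmetic can be checked for other group sizes with a short function. This sketch follows the worked example above and rounds the On-Demand share up; that rounding rule and the function name are assumptions for illustration, and instance weights are ignored.

```python
def split_capacity(desired, on_demand_base, on_demand_percent_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Ceiling division: round the On-Demand share above the base up (1.6 becomes 2).
    extra_on_demand = -(-above_base * on_demand_percent_above_base // 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


for desired in (2, 10, 20):
    on_demand, spot = split_capacity(desired, on_demand_base=2,
                                     on_demand_percent_above_base=20)
    print(f"desired={desired}: {on_demand} On-Demand, {spot} Spot")
```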
| # asg.tf
resource "aws_launch_template" "mixed" {
name_prefix = "mixed-instance-"
image_id = data.aws_ami.amazon_linux.id
instance_type = "m5.large"
key_name = "my-key-pair"
vpc_security_group_ids = [aws_security_group.app.id]
user_data = base64encode(<<-EOF
#!/bin/bash
yum update -y
yum install -y httpd
systemctl start httpd
systemctl enable httpd
EOF
)
tag_specifications {
resource_type = "instance"
tags = {
Name = "mixed-asg-instance"
Environment = "production"
}
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_autoscaling_group" "mixed" {
name = "mixed-instance-asg"
vpc_zone_identifier = var.subnet_ids
min_size = 2
max_size = 20
desired_capacity = 10
mixed_instances_policy {
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.mixed.id
version = "$Latest"
}
override {
instance_type = "m5.large"
weighted_capacity = "1"
}
override {
instance_type = "m5.xlarge"
weighted_capacity = "2"
}
override {
instance_type = "m4.large"
weighted_capacity = "1"
}
override {
instance_type = "c5.large"
weighted_capacity = "1"
}
override {
instance_type = "c5.xlarge"
weighted_capacity = "2"
}
}
instances_distribution {
on_demand_allocation_strategy = "prioritized"
on_demand_base_capacity = 2
on_demand_percentage_above_base_capacity = 20
spot_allocation_strategy = "capacity-optimized"
}
}
# Capacity rebalancing
capacity_rebalance = true
# Health checks
health_check_type = "ELB"
health_check_grace_period = 300
# Instance refresh strategy
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 90
}
}
tag {
key = "Name"
value = "mixed-asg"
propagate_at_launch = true
}
}
|
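Capacity Rebalancing
Auto Scaling groups support Capacity Rebalancing: when a Spot Instance receives a rebalance recommendation, the group proactively launches a replacement instance: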
| # Enable Capacity Rebalancing
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name mixed-instance-asg \
--capacity-rebalance
|
Scaling Policy Configuration
| # scaling_policy.tf
# Target tracking scaling policy
resource "aws_autoscaling_policy" "cpu_target" {
name = "cpu-target-tracking"
autoscaling_group_name = aws_autoscaling_group.mixed.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 70.0
}
}
# ALB request count scaling policy
resource "aws_autoscaling_policy" "request_count" {
name = "request-count-target"
autoscaling_group_name = aws_autoscaling_group.mixed.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
}
target_value = 1000.0
}
}
|
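Mixed-Use Strategies
Strategy 1: Guaranteed Baseline Capacity
Use On-Demand Instances for baseline capacity and let Spot Instances absorb the additional load: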
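| Total capacity requirement
│
├── Baseline capacity (On-Demand): handles the minimum required load
│
├── Reserved capacity (Reserved Instances): handles the steady additional load
│
└── Elastic capacity (Spot): handles peak load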
|
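Strategy 2: Workload Classification
Choose the purchasing option that matches the characteristics of each workload:
| Workload type | Recommended configuration | Reason |
|---|---|---|
| Web front end | Spot + ASG | Stateless and quick to replace |
| Application servers | Mixed (20% On-Demand + 80% Spot) | Needs a degree of stability |
| Databases | On-Demand or Reserved | Stateful workload that does not tolerate interruption |
| Batch processing | 100% Spot | Retryable and highly tolerant of interruption |
| CI/CD | Spot | Build jobs can be rerun |
| Development/test environments | Spot | Cost-sensitive and can accept interruption |
Strategy 3: Diversify Across Capacity Pools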
| # Diversified Spot Fleet configuration
resource "aws_spot_fleet_request" "diversified" {
iam_fleet_role = aws_iam_role.spot_fleet.arn
target_capacity = 20
allocation_strategy = "diversified"
terminate_instances_with_expiration = true
# Span multiple Availability Zones and instance types
launch_specification {
instance_type = "m5.large"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_a_id
}
launch_specification {
instance_type = "m5.large"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_b_id
}
launch_specification {
instance_type = "m5.xlarge"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_a_id
}
launch_specification {
instance_type = "c5.large"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_a_id
}
launch_specification {
instance_type = "c5.large"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_b_id
}
launch_specification {
instance_type = "r5.large"
ami = data.aws_ami.amazon_linux.id
subnet_id = var.subnet_a_id
}
}
|
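Practical Recommendations
- Choose many instance types: select at least 10 instance types to increase the chance of obtaining capacity
- Span multiple Availability Zones: request capacity in every Availability Zone
- Use the capacity-optimized strategy: it favors the capacity pools least likely to be interrupted
- Set a sensible maximum Spot price: typically the On-Demand price, which is also the default when no maximum is specified, so that you never pay an unexpectedly high price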
| # Look up the On-Demand price as a reference for the maximum Spot price
aws pricing get-products \
--service-code AmazonEC2 \
--filters "Type=TERM_MATCH,Field=instanceType,Value=m5.large" \
"Type=TERM_MATCH,Field=location,Value=Asia Pacific (Tokyo)" \
"Type=TERM_MATCH,Field=operatingSystem,Value=Linux" \
"Type=TERM_MATCH,Field=tenancy,Value=Shared" \
"Type=TERM_MATCH,Field=preInstalledSw,Value=NA" \
"Type=TERM_MATCH,Field=capacitystatus,Value=Used" \
--region us-east-1
|
Cost Analysis and Best Practices
Setting Up Cost Monitoring
| # Create an AWS Budgets alert for monthly Spot spend
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "EC2-Spot-Monthly",
"BudgetLimit": {
"Amount": "1000",
"Unit": "USD"
},
"BudgetType": "COST",
"CostFilters": {
"Service": ["Amazon Elastic Compute Cloud - Compute"],
"PurchaseType": ["Spot"]
},
"TimeUnit": "MONTHLY"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "admin@example.com"
}
]
}
]'
|
Analyzing Spot Savings with Cost Explorer
| # analyze_spot_savings.py
from datetime import datetime, timedelta

import boto3

EC2_COMPUTE = 'Amazon Elastic Compute Cloud - Compute'


def get_ec2_costs(ce, start_date, end_date, purchase_type):
    """Return Cost Explorer results for EC2 compute, filtered by purchase type."""
    # Check the PURCHASE_TYPE values available in your account with
    # ce.get_dimension_values before relying on the strings passed in here.
    return ce.get_cost_and_usage(
        TimePeriod={'Start': start_date, 'End': end_date},
        Granularity='MONTHLY',
        Metrics=['UnblendedCost', 'UsageQuantity'],
        Filter={
            'And': [
                {'Dimensions': {'Key': 'SERVICE', 'Values': [EC2_COMPUTE]}},
                {'Dimensions': {'Key': 'PURCHASE_TYPE', 'Values': [purchase_type]}}
            ]
        }
    )


def print_costs(label, response):
    """Print the cost of each period and return the total."""
    total = 0
    for result in response['ResultsByTime']:
        period = result['TimePeriod']
        cost = float(result['Total']['UnblendedCost']['Amount'])
        total += cost
        print(f"{label} ({period['Start']} ~ {period['End']}): ${cost:.2f}")
    return total


def analyze_spot_savings(days=30):
    """Compare Spot and On-Demand spend over the past `days` days."""
    ce = boto3.client('ce')
    end_date = datetime.now().strftime('%Y-%m-%d')
    start_date = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')
    # Fetch Spot and On-Demand costs
    spot_response = get_ec2_costs(ce, start_date, end_date, 'Spot')
    ondemand_response = get_ec2_costs(ce, start_date, end_date, 'On Demand Instances')
    print(f"\nEC2 cost analysis for the past {days} days")
    print("=" * 50)
    total_spot = print_costs("Spot", spot_response)
    total_ondemand = print_costs("On-Demand", ondemand_response)
    print("-" * 50)
    print(f"Total Spot cost: ${total_spot:.2f}")
    print(f"Total On-Demand cost: ${total_ondemand:.2f}")
    if total_ondemand > 0:
        # This compares the two spend totals; it is not the per-instance Spot discount.
        savings_rate = (1 - total_spot / total_ondemand) * 100 if total_spot < total_ondemand else 0
        print(f"Relative savings rate: {savings_rate:.1f}%")


if __name__ == '__main__':
    analyze_spot_savings(30)
|
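Summary of Cost Optimization Best Practices
Diversify the instance configuration
- Choose at least 10 different instance types
- Spread across all Availability Zones
- Use the capacity-optimized allocation strategy
Handle interruptions properly
- Implement handling for the 2-minute interruption notice
- Monitor interruption events with EventBridge
- Design stateless architectures to simplify recovery
Mix purchasing options
- Run critical workloads on On-Demand or Reserved Instances
- Run interruption-tolerant workloads on Spot
- Use the mixed instances configuration of Auto Scaling groups
Monitor and alert
- Set up cost budget alerts
- Monitor Spot interruption frequency
- Track the amount saved
Design the workload for Spot
- Design applications to be stateless
- Implement checkpointing and resumable processing
- Keep important data in external storage
Checklist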
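| ## Spot Instance Adoption Checklist
### Initial assessment
- [ ] Can the workload tolerate interruption?
- [ ] Have suitable instance types been identified?
- [ ] Has capacity in each Availability Zone been assessed?
### Technical preparation
- [ ] Has a Launch Template been created?
- [ ] Is an interruption handling mechanism in place?
- [ ] Are EventBridge rules configured?
- [ ] Is a graceful shutdown script implemented?
### Auto Scaling configuration
- [ ] Is a mixed instances policy configured?
- [ ] Is Capacity Rebalancing enabled?
- [ ] Are appropriate scaling policies configured?
### Monitoring and alerting
- [ ] Is a CloudWatch dashboard set up?
- [ ] Are cost budget alerts set up?
- [ ] Are interruption notifications set up?
### Testing and validation
- [ ] Has the interruption handling flow been tested?
- [ ] Has automatic replacement been verified?
- [ ] Has normal application recovery been confirmed?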
|
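Conclusion
AWS EC2 Spot Instances are a powerful cost optimization tool that can save up to 90% of compute costs when used correctly. This article covered:
- Spot Instance basics: how they work and how they are priced
- Comparison: how they differ from On-Demand and Reserved Instances
- Spot Fleet: a more advanced way to manage many Spot Instances
- Interruption handling: making sure workloads degrade gracefully when interrupted
- Spot Instance Advisor: choosing suitable instance types
- Auto Scaling integration: automating capacity management
- Mixed strategies: balancing cost and availability
- Cost analysis: monitoring and optimizing spend
The keys to using Spot Instances successfully are a fault-tolerant architecture, a diversified instance configuration, and thorough interruption handling. With the guidance in this article, you can start introducing Spot Instances for suitable workloads and reduce your cloud compute costs significantly.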