DevOps Day 100: Create and Configure Alarm Using CloudWatch Using Terraform¶

This document outlines the solution for DevOps Day 100. The objective was to enhance operational monitoring by creating an EC2 instance and a corresponding CloudWatch alarm. This alarm monitors CPU utilization and triggers an alert via SNS if the threshold is breached.

Table of Contents¶

Task Overview
Step-by-Step Solution
Deep Dive: Terraform Concepts Used
- SNS Topic Resource
- CloudWatch Metric Alarm
Troubleshooting

Task Overview¶

Objective: Provision an EC2 instance and a CloudWatch CPU utilization alarm.

Requirements: 1. EC2 Instance: xfusion-ec2 (AMI: ami-0c02fb55956c7d316, Type: t2.micro). 2. SNS Topic: xfusion-sns-topic (Already exists or needs creation for the alarm action). 3. CloudWatch Alarm: xfusion-alarm. * Metric: CPUUtilization >= 90%. * Period: 5 minutes (300 seconds). * Action: Notify the SNS topic. 4. Outputs: Export the instance name and alarm name.

Step-by-Step Solution¶

1. Create Infrastructure (`main.tf`)¶

The configuration creates the SNS topic (to ensure we have a valid ARN for the alarm), the EC2 instance, and the CloudWatch alarm linked to that specific instance.

Command:

cd /home/bob/terraform
vi main.tf

Content:

# 1. Create SNS Topic for Notifications
resource "aws_sns_topic" "sns_topic" {
  name = "xfusion-sns-topic"
}

# 2. Launch EC2 Instance
resource "aws_instance" "nautilus_node" {
  ami           = "ami-0c02fb55956c7d316"
  instance_type = "t2.micro"

  tags = {
    Name = "xfusion-ec2"
  }
}

# 3. Create CloudWatch Alarm
resource "aws_cloudwatch_metric_alarm" "cpu_alert" {
  alarm_name          = "xfusion-alarm"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "300"
  statistic           = "Average"
  threshold           = "90"
  alarm_description   = "Alarm when CPU exceeds 90%"

  # Actions to take when the alarm state changes to ALARM
  alarm_actions       = [aws_sns_topic.sns_topic.arn]

  # Dimensions map the metric to a specific resource (our EC2 instance)
  dimensions = {
    InstanceId = aws_instance.nautilus_node.id
  }
}

2. Define Outputs (`outputs.tf`)¶

We define outputs to confirm the resource creation names.

Command:

vi outputs.tf

Content:

output "KKE_instance_name" {
  value = aws_instance.nautilus_node.tags.Name
}

output "KKE_alarm_name" {
  value = aws_cloudwatch_metric_alarm.cpu_alert.alarm_name
}

3. Initialize and Apply¶

Run the Terraform workflow to deploy the monitoring stack.

terraform init
terraform plan
terraform apply -auto-approve

Verification:

terraform state list
# Expected:
# aws_cloudwatch_metric_alarm.cpu_alert
# aws_instance.nautilus_node
# aws_sns_topic.sns_topic

Deep Dive: Terraform Concepts Used¶

* aws_sns_topic: Simple Notification Service. It acts as a pub/sub messaging channel. In this context, CloudWatch "publishes" an alarm message to this topic, and any subscribers (email, SMS, Lambda) would receive it.

CloudWatch Metric Alarm¶

* aws_cloudwatch_metric_alarm: Defines the rule for monitoring. * metric_name & namespace: These define what to watch. AWS/EC2 and CPUUtilization are standard metrics provided by the AWS hypervisor. * dimensions: This is critical. Without dimensions, CloudWatch looks at the aggregate CPU of all instances. By specifying InstanceId, we target only the specific instance we just created. * alarm_actions: Links the alarm to the SNS topic ARN.

Troubleshooting¶

Issue: Invalid AMI * Cause: The AMI ID ami-0c02fb55956c7d316 might not exist in the configured region if it's not us-east-1 (or if the AMI is deprecated). * Fix: Ensure the provider region matches the region where the AMI exists.

Issue: Alarm stuck in "Insufficient Data" * Cause: This is normal immediately after creation. It takes at least one period (300 seconds/5 minutes) for CloudWatch to gather enough data points to evaluate the state.