DevOps & Cloud — Worked Examples
The goal here is to be able to “drop in” a minimal but realistic setup during interviews or when spinning up a demo.
1) Terraform mini-project (AWS VPC + EC2 + S3, remote state)
Key Infrastructure Concepts
VPC (Virtual Private Cloud)
A VPC is a logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define. Think of it as your own private data center in the cloud.
What it provides:
- Network isolation from other AWS customers
- Custom IP address ranges (CIDR blocks like 10.0.0.0/16)
- Subnet configuration for organizing resources
- Route tables for controlling traffic flow
- Security groups and Network ACLs for access control
- Internet connectivity control (public vs private subnets)
Why it matters:
- Security: Isolates your resources from other AWS customers
- Compliance: Required for HIPAA, SOC2, and other security standards
- Cost control: Prevents unauthorized resource creation
- Network design: Allows you to design your network architecture
Subnets
Subnets are subdivisions of your VPC that allow you to group resources and control network access. They’re like different floors or sections in a building.
Types of Subnets:
- Public Subnets:
- Resources can have public IP addresses
- Direct internet access through Internet Gateway
- Used for load balancers, bastion hosts
- Security risk: More exposed to internet threats
- Private Subnets:
- No public IP addresses assigned
- Internet access through NAT Gateway (controlled)
- Used for application servers, databases
- Security benefit: Protected from direct internet access
Subnet Design Best Practices:
- Availability Zones: Distribute subnets across multiple AZs for high availability
- CIDR Planning: Use non-overlapping IP ranges (e.g., 10.0.1.0/24, 10.0.2.0/24)
- Resource Grouping: Group similar resources in the same subnet
- Security: Use private subnets for sensitive resources
Spot Instances
Spot Instances are AWS EC2 instances that you can bid on and use for up to 90% off the On-Demand price. AWS sells unused capacity at a discount.
How Spot Instances Work:
- Bidding: You set a maximum price you’re willing to pay
- Availability: AWS fills your request if spot price ≤ your bid
- Interruption: AWS can terminate your instance with 2-minute notice if:
- Spot price exceeds your bid
- AWS needs the capacity back
- Spot capacity is no longer available
Use Cases:
- Batch processing: Data analysis, video encoding, scientific computing
- Testing/Development: Non-critical workloads
- Cost optimization: Up to 90% savings vs On-Demand
- Fault-tolerant applications: Can handle interruptions
Spot Instance Strategies:
- Diversification: Use multiple instance types and AZs
- Bid strategy: Set bid at On-Demand price for better availability
- Interruption handling: Implement graceful shutdown and recovery
- Fallback: Use On-Demand instances as backup
Incident Severity Levels (SEV 1-4)
SEV-1 (Critical) - “All Hands on Deck”
- Definition: Service completely down, data loss, security breach
- Response Time: Immediate (within 5 minutes)
- Communication: All stakeholders, status page updates, executive notification
- Resolution Target: 1 hour
- Examples:
- Database corruption
- Complete service outage
- Customer data breach
- Payment system failure
SEV-2 (High) - “Urgent Response Required”
- Definition: Major feature broken, significant performance degradation
- Response Time: Within 15 minutes
- Communication: Engineering team, product managers, customer support
- Resolution Target: 4 hours
- Examples:
- Core feature unavailable
- 50%+ performance degradation
- Multiple customers affected
- Revenue-impacting issues
SEV-3 (Medium) - “Normal Priority”
- Definition: Minor feature broken, slight performance impact
- Response Time: Within 1 hour
- Communication: Engineering team, internal stakeholders
- Resolution Target: 24 hours
- Examples:
- Non-critical feature broken
- Minor performance issues
- Limited customer impact
- Cosmetic bugs
SEV-4 (Low) - “Business Hours”
- Definition: Cosmetic issues, minor bugs, enhancement requests
- Response Time: Within 4 hours
- Communication: Engineering team
- Resolution Target: 1 week
- Examples:
- UI text typos
- Minor styling issues
- Enhancement requests
- Documentation updates
Multi-Environment Setup
This example shows how to structure Terraform for multiple environments (dev, staging, prod) with shared modules.
Layout
terraform/
main.tf
vpc.tf
ec2.tf
s3.tf
variables.tf
outputs.tf
backend.hcl
backend.hcl (remote state configuration)
bucket = "my-tf-state-bucket" # S3 bucket to store Terraform state
key = "envs/dev/terraform.tfstate" # Path within bucket for this environment
region = "us-west-2" # AWS region for the backend
dynamodb_table = "my-tf-locks" # DynamoDB table for state locking (prevents concurrent modifications)
encrypt = true # Encrypt state files at rest
main.tf (provider and version configuration)
terraform {
required_version = ">= 1.6.0" # Minimum Terraform version required
backend "s3" {} # Use S3 backend for remote state (configured via backend.hcl at init-time)
required_providers {
aws = { source = "hashicorp/aws", version = "~> 5.0" } # AWS provider with version constraint
}
}
provider "aws" {
region = var.region # AWS region for all resources
}
vpc.tf (networking infrastructure)
# Virtual Private Cloud - isolated network environment
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16" # Private IP range: 10.0.0.0 to 10.0.255.255
tags = { Name = "demo-vpc" } # Resource tagging for cost tracking and organization
}
# Public subnet in availability zone 'a' - accessible from internet
resource "aws_subnet" "public_a" {
vpc_id = aws_vpc.main.id # Associate with our VPC
cidr_block = "10.0.1.0/24" # Subnet range: 10.0.1.0 to 10.0.1.255
map_public_ip_on_launch = true # Auto-assign public IPs to instances
availability_zone = "${var.region}a" # Place in first AZ (e.g., us-west-2a)
}
# Internet Gateway - allows VPC to communicate with internet
resource "aws_internet_gateway" "igw" {
vpc_id = aws_vpc.main.id # Attach to our VPC
}
ec2.tf (compute and security)
# Security Group - firewall rules for EC2 instances
resource "aws_security_group" "web" {
name = "web-sg" # Security group name
vpc_id = aws_vpc.main.id # Associate with our VPC
# Allow SSH access from anywhere (0.0.0.0/0 = all IPs)
ingress {
from_port = 22 # SSH port
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # WARNING: In production, restrict to specific IPs
}
# Allow HTTP access from anywhere (for web traffic)
ingress {
from_port = 80 # HTTP port
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # WARNING: In production, restrict to specific IPs
}
# Allow all outbound traffic (instances can reach internet)
egress {
from_port = 0 # All ports
to_port = 0
protocol = "-1" # All protocols
cidr_blocks = ["0.0.0.0/0"] # All destinations
}
}
# Data source to get latest Amazon Linux AMI
data "aws_ami" "amazon_linux" {
most_recent = true # Get the newest available
owners = ["amazon"] # Official Amazon AMIs
filter {
name = "name" # Filter by AMI name
values = ["al2023-ami-*-x86_64"] # Amazon Linux 2023, 64-bit
}
}
resource “aws_instance” “web” { ami = data.aws_ami.amazon_linux.id # Use the AMI we found earlier instance_type = “t3.micro” # Small instance type (1 vCPU, 1 GB RAM) subnet_id = aws_subnet.public_a.id # Place in our public subnet vpc_security_group_ids = [aws_security_group.web.id] # Apply our security group rules
# User data script runs when instance first boots user_data = «-EOF #!/bin/bash yum install -y httpd # Install Apache web server systemctl enable httpd # Start Apache on boot systemctl start httpd # Start Apache now echo “hello from terraform” > /var/www/html/index.html # Create simple webpage EOF
tags = { Name = “web” } # Resource tagging }
**s3.tf** (object storage)
```bash
# S3 bucket for storing static assets (images, CSS, JS files)
resource "aws_s3_bucket" "assets" {
bucket = var.bucket_name # Bucket name from variable
tags = { Name = "assets" } # Resource tagging
}
# Enable versioning to keep multiple versions of objects
resource "aws_s3_bucket_versioning" "assets" {
bucket = aws_s3_bucket.assets.id # Reference to our bucket
versioning_configuration {
status = "Enabled" # Turn on versioning
}
}
variables.tf (input parameters)
variable "region" { type = string, default = "us-west-2" } # AWS region with default
variable "bucket_name" { type = string } # Required: S3 bucket name
outputs.tf (return values)
output "ec2_public_ip" { value = aws_instance.web.public_ip } # Public IP of web server
output "s3_bucket" { value = aws_s3_bucket.assets.bucket } # Name of S3 bucket
CLI Commands (deployment workflow)
# Initialize Terraform and configure backend
terraform init -backend-config=backend.hcl
# Preview changes before applying
terraform plan -var="bucket_name=my-artifacts-bucket"
# Apply changes and create infrastructure
terraform apply -auto-approve -var="bucket_name=my-artifacts-bucket"
Advanced Terraform Patterns
Workspace-based Environment Management
# Create and switch to dev workspace (isolates state for different environments)
terraform workspace new dev
terraform workspace select dev
# Apply with environment-specific variables
terraform apply -var-file="dev.tfvars"
# Switch to prod workspace (different state, different environment)
terraform workspace select prod
terraform apply -var-file="prod.tfvars"
Module-based Architecture
# modules/vpc/main.tf - Reusable VPC module
module "vpc" {
source = "../../modules/vpc" # Path to module source
environment = var.environment # Pass environment name
vpc_cidr = var.vpc_cidr # Pass VPC CIDR block
azs = var.azs # Pass availability zones
}
# modules/vpc/variables.tf - Module input variables
variable "environment" {
description = "Environment name (dev, staging, prod)" # Variable documentation
type = string # Variable type
}
variable "vpc_cidr" {
description = "VPC CIDR block"
type = string
}
variable "azs" {
description = "Availability zones"
type = list(string)
}
Security Best Practices
# Enable VPC Flow Logs
resource "aws_flow_log" "vpc_flow_log" {
iam_role_arn = aws_iam_role.vpc_flow_log_role.arn
log_destination = aws_cloudwatch_log_group.vpc_flow_log_group.arn
traffic_type = "ALL"
vpc_id = aws_vpc.main.id
}
# Enable VPC Endpoints for private subnets
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = [aws_route_table.private.id]
}
2) Jenkinsfile (build → test → Docker → push → deploy to K8s)
pipeline {
agent any
environment {
APP_NAME = 'example-svc'
AWS_REGION = 'us-west-2'
ECR_REPO = "123456789012.dkr.ecr.${AWS_REGION}.amazonaws.com/${APP_NAME}"
}
stages {
stage('Checkout') {
steps { checkout scm }
}
stage('Test') {
steps {
sh 'python -m pip install -r requirements.txt'
sh 'pytest -q'
}
}
stage('Build Image') {
steps {
sh 'docker build -t ${APP_NAME}:${BUILD_NUMBER} .'
}
}
stage('Login ECR & Push') {
steps {
sh 'aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin 123456789012.dkr.ecr.${AWS_REGION}.amazonaws.com'
sh 'docker tag ${APP_NAME}:${BUILD_NUMBER} ${ECR_REPO}:${BUILD_NUMBER}'
sh 'docker push ${ECR_REPO}:${BUILD_NUMBER}'
}
}
stage('Deploy to K8s') {
steps {
sh 'kubectl set image deployment/${APP_NAME} ${APP_NAME}=${ECR_REPO}:${BUILD_NUMBER} --record'
}
}
}
}
3) Kubernetes Manifests (Deployment + Service + Ingress)
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-svc
labels:
app: example-svc
spec:
replicas: 2
selector:
matchLabels:
app: example-svc
template:
metadata:
labels:
app: example-svc
spec:
containers:
- name: example-svc
image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/example-svc:latest
ports:
- containerPort: 8080
readinessProbe:
httpGet: { path: /healthz, port: 8080 }
initialDelaySeconds: 3
livenessProbe:
httpGet: { path: /livez, port: 8080 }
initialDelaySeconds: 5
resources:
requests: { cpu: "100m", memory: "128Mi" }
limits: { cpu: "500m", memory: "256Mi" }
service.yaml
apiVersion: v1
kind: Service
metadata:
name: example-svc
spec:
selector:
app: example-svc
ports:
- port: 80
targetPort: 8080
type: ClusterIP
ingress.yaml (requires ingress controller)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: example-svc
spec:
rules:
- host: example.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: example-svc
port:
number: 80
Apply
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
4) Prometheus Alerting Rules (examples)
alert-rules.yaml
groups:
- name: app.rules
rules:
- alert: HighCPU
expr: avg(rate(container_cpu_usage_seconds_total{container!="",pod=~"example-svc.*"}[5m])) > 0.8
for: 5m
labels: { severity: warning }
annotations:
summary: "High CPU usage on example-svc"
description: "CPU > 80% for 5m"
- alert: HighErrorRate
expr: rate(http_requests_total{job="example-svc",code=~"5.."}[5m]) / rate(http_requests_total{job="example-svc"}[5m]) > 0.05
for: 10m
labels: { severity: critical }
annotations:
summary: "High 5xx error rate"
description: ">5% 5xx over 10m"
- alert: CrashLooping
expr: increase(kube_pod_container_status_restarts_total{pod=~"example-svc.*"}[10m]) > 3
for: 10m
labels: { severity: warning }
annotations:
summary: "Pod restarting frequently"
description: "More than 3 restarts in 10 minutes"
Wire this into Prometheus via
rule_files
and configure Alertmanager receivers (email/Slack/PagerDuty).