Tax Practice AI - Cloud Deployment Plan

Version: 1.0 Created: 2024-12-27 Status: Planning


Table of Contents

  1. Executive Summary
  2. Local Development Guarantee
  3. Current State Assessment
  4. Target Architecture
  5. Backlog Items for Deployment
  6. Infrastructure as Code Strategy
  7. Phase 1: Foundation Infrastructure
  8. Phase 2: Database and Storage
  9. Phase 3: Compute and Networking
  10. Phase 4: Frontend Deployment
  11. Phase 5: Orchestration and Background Jobs
  12. Phase 6: External Integrations
  13. Testing Requirements
  14. Best Practices Checklist
  15. Security Considerations
  16. Cloud Service Security Configurations
  17. Monitoring and Observability
  18. Disaster Recovery
  19. Cost Estimates
  20. Rollback Strategy

1. Executive Summary

This document outlines the plan to deploy Tax Practice AI to AWS cloud infrastructure. The deployment will use Infrastructure as Code (IaC) via Terraform to ensure reproducibility, version control, and automated provisioning.

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| IaC Tool | Terraform | Multi-cloud capable, mature ecosystem, state management |
| Container Orchestration | ECS Fargate | Serverless containers, no EC2 management, cost-effective for our scale |
| Database | Aurora PostgreSQL Serverless v2 | Auto-scaling, PostgreSQL-compatible, cost-effective for variable loads |
| Frontend Hosting | CloudFront + S3 | Global CDN, low latency, cost-effective for static assets |
| Orchestration | Self-hosted Airflow on EC2 | Full control, low cost (~$23/mo), Python DAGs |
| Secrets | AWS Secrets Manager | Native integration, automatic rotation |
| Monitoring | CloudWatch + Sentry | AWS-native metrics, application error tracking |

Deployment Environments

| Environment | Purpose | Database | Domain |
| --- | --- | --- | --- |
| Development | Feature development | Local PostgreSQL (Docker) | localhost |
| Staging | Pre-production testing | Aurora Serverless v2 (separate cluster) | staging.taxpractice.ai |
| Production | Live system | Aurora Serverless v2 (dedicated cluster) | app.taxpractice.ai |

2. Local Development Guarantee

Core Principle

Local development MUST remain fully functional. Cloud deployment is additive: it does NOT replace or remove local development capabilities.

What Stays in Place

| Component | Local Tool | Cloud Equivalent | Guarantee |
| --- | --- | --- | --- |
| Database | PostgreSQL 15 (Docker) | Aurora PostgreSQL | docker-compose.yml preserved and maintained |
| Object Storage | LocalStack S3 | AWS S3 | LocalStack container continues to work |
| Secrets | .env file | Secrets Manager | .env.example always current |
| AI/LLM | Anthropic API (direct) | AWS Bedrock | ANTHROPIC_API_KEY continues to work |
| Frontend | Vite dev server | CloudFront + S3 | pnpm dev works offline |
| API | uvicorn (local) | ECS Fargate | python -m uvicorn works |

Environment Switching

The existing config.yaml already supports environment-based switching:

# config.yaml - Works for BOTH local and cloud

database:
  # host: localhost for local, Aurora endpoint for cloud
  host: ${DB_HOST:-localhost}

  # port: 5433 for local Docker, 5432 for Aurora
  port: ${DB_PORT:-5433}

aws:
  s3:
    # endpoint_url: LocalStack for local, empty for real AWS
    endpoint_url: ${S3_ENDPOINT_URL:-}
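The `${VAR:-default}` syntax resolves to the environment variable when it is set, and to the default otherwise. A minimal Python sketch of that substitution (illustrative only; the project's actual config loader may differ):

```python
import os
import re

# Matches ${VAR} and ${VAR:-default} placeholders.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def substitute(value: str) -> str:
    """Replace ${VAR:-default} with the env var if set, else the default."""
    def repl(m: re.Match) -> str:
        name, default = m.group(1), m.group(2) or ""
        return os.environ.get(name, default)
    return _PATTERN.sub(repl, value)

# Local dev: DB_HOST unset, so the default applies.
os.environ.pop("DB_HOST", None)
print(substitute("${DB_HOST:-localhost}"))  # localhost

# Cloud: ECS injects DB_HOST, which wins over the default.
os.environ["DB_HOST"] = "aurora.cluster-xyz.us-east-1.rds.amazonaws.com"
print(substitute("${DB_HOST:-localhost}"))
```

The same string-typed config therefore works unchanged in both environments; only the injected environment variables differ.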

Local Development Commands (Unchanged)

# Start local services (PostgreSQL + LocalStack)
docker compose up -d

# Run API locally
python -m uvicorn src.api.main:app --reload --port 8000

# Run frontend locally
cd frontend && pnpm dev

# Run tests locally
pytest tests/

# Run E2E tests with LocalStack
S3_ENDPOINT_URL=http://localhost:4566 pytest tests/e2e/

Verification Checklist

Before any cloud deployment PR is merged, verify:

  • docker compose up -d starts PostgreSQL and LocalStack
  • pytest tests/unit/ passes without cloud credentials
  • pytest tests/integration/ passes with local PostgreSQL
  • pytest tests/e2e/ passes with LocalStack
  • Frontend runs with pnpm dev (no cloud required)
  • API starts with uvicorn locally
  • .env.example documents all required local variables
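The last item can even be checked mechanically. A hedged sketch of such a check (the variable names and sample file body are illustrative, taken from the table above):

```python
# Sketch: verify a .env.example body documents the variables local dev needs.
REQUIRED = {"DB_HOST", "DB_PORT", "S3_ENDPOINT_URL", "ANTHROPIC_API_KEY"}

def documented_vars(env_example_text):
    """Return the set of variable names defined in a .env-style file body."""
    names = set()
    for line in env_example_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            names.add(line.split("=", 1)[0])
    return names

sample = """\
# Local development defaults
DB_HOST=localhost
DB_PORT=5433
S3_ENDPOINT_URL=http://localhost:4566
ANTHROPIC_API_KEY=replace-me
"""
missing = REQUIRED - documented_vars(sample)
print("OK" if not missing else f"missing: {sorted(missing)}")  # OK
```

In CI this could read the real .env.example and fail the PR when a newly required variable is undocumented.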

Why This Matters

  1. Developer productivity: No cloud credentials needed to write code
  2. Cost control: No cloud charges during development
  3. Offline capability: Can develop without internet
  4. Fast iteration: No deployment delays for testing changes
  5. CI/CD reliability: Tests run against local services (fast, deterministic)

3. Current State Assessment

What Exists

| Component | Status | Notes |
| --- | --- | --- |
| Python Backend (FastAPI) | Complete | 1,522 tests passing |
| React Frontend (2 apps) | Complete | Staff App + Client Portal |
| Local Development | Complete | docker-compose.yml with PostgreSQL + LocalStack |
| CI/CD Pipeline | Complete | GitHub Actions (lint, unit, integration, E2E) |
| Configuration | Complete | config.yaml with env var substitution |

What's Missing for Cloud Deployment

| Component | Status | Priority |
| --- | --- | --- |
| Terraform IaC | Not Started | P0 |
| Production Dockerfiles | Not Started | P0 |
| ECS Task Definitions | Not Started | P0 |
| VPC and Networking | Not Started | P0 |
| Aurora RDS Setup | Not Started | P0 |
| S3 Buckets (with policies) | Not Started | P0 |
| CloudFront Distributions | Not Started | P0 |
| WAF Configuration | Not Started | P1 |
| Secrets Manager Setup | Not Started | P0 |
| CloudWatch Dashboards | Not Started | P1 |
| Route 53 DNS | Not Started | P0 |
| ACM Certificates | Not Started | P0 |
| IAM Roles/Policies | Not Started | P0 |
| Database Migration Scripts | Not Started | P0 |
| CD Pipeline (Deploy) | Not Started | P0 |

4. Target Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                           TAX PRACTICE AI - AWS ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   INTERNET                                                                       │
│      │                                                                           │
│      ▼                                                                           │
│   ┌──────────────────────────────────────────────────────────────────────────┐  │
│   │                           ROUTE 53 (DNS)                                  │  │
│   │   portal.taxpractice.ai  │  app.taxpractice.ai  │  api.taxpractice.ai    │  │
│   └─────────────┬────────────────────┬────────────────────┬──────────────────┘  │
│                 │                    │                    │                      │
│                 ▼                    ▼                    ▼                      │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                              AWS WAF                                     │   │
│   │          (Rate limiting, SQL injection, XSS protection)                  │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
│                 │                    │                    │                      │
│                 ▼                    ▼                    ▼                      │
│   ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐    │
│   │   CLOUDFRONT (CDN)  │  │   CLOUDFRONT (CDN)  │  │   APPLICATION LB    │    │
│   │   Client Portal     │  │   Staff App         │  │   (HTTPS/443)       │    │
│   └──────────┬──────────┘  └──────────┬──────────┘  └──────────┬──────────┘    │
│              │                        │                        │                 │
│              ▼                        ▼                        ▼                 │
│   ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐    │
│   │   S3 BUCKET         │  │   S3 BUCKET         │  │   ECS FARGATE       │    │
│   │   (Static Assets)   │  │   (Static Assets)   │  │   (FastAPI)         │    │
│   │   client-portal/*   │  │   staff-app/*       │  │   2-4 tasks         │    │
│   └─────────────────────┘  └─────────────────────┘  └─────────┬───────────┘    │
│                                                                │                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                              VPC (10.0.0.0/16)                            │   │
│   │  ┌──────────────────────────────────────────────────────────────────┐   │   │
│   │  │                        PRIVATE SUBNETS                            │   │   │
│   │  │                                                                   │   │   │
│   │  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │   │   │
│   │  │  │ ECS Fargate │  │ EC2 Airflow │  │ Lambda      │              │   │   │
│   │  │  │ (API)       │  │ (t3.medium) │  │ Functions   │              │   │   │
│   │  │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │   │   │
│   │  │         │                │                │                       │   │   │
│   │  │         └────────────────┼────────────────┘                       │   │   │
│   │  │                          │                                        │   │   │
│   │  │                          ▼                                        │   │   │
│   │  │  ┌─────────────────────────────────────────────────────────┐     │   │   │
│   │  │  │                   DATA LAYER                             │     │   │   │
│   │  │  │                                                          │     │   │   │
│   │  │  │  ┌─────────────────┐  ┌─────────────────┐               │     │   │   │
│   │  │  │  │ Aurora          │  │ S3 Documents    │               │     │   │   │
│   │  │  │  │ PostgreSQL      │  │ (KMS Encrypted) │               │     │   │   │
│   │  │  │  │ Serverless v2   │  │                 │               │     │   │   │
│   │  │  │  └─────────────────┘  └─────────────────┘               │     │   │   │
│   │  │  │                                                          │     │   │   │
│   │  │  │  ┌─────────────────┐  ┌─────────────────┐               │     │   │   │
│   │  │  │  │ Secrets Manager │  │ ElastiCache     │               │     │   │   │
│   │  │  │  │ (Credentials)   │  │ (Redis - opt)   │               │     │   │   │
│   │  │  │  └─────────────────┘  └─────────────────┘               │     │   │   │
│   │  │  │                                                          │     │   │   │
│   │  │  └─────────────────────────────────────────────────────────┘     │   │   │
│   │  │                                                                   │   │   │
│   │  └──────────────────────────────────────────────────────────────────┘   │   │
│   │                                                                          │   │
│   └──────────────────────────────────────────────────────────────────────────┘   │
│                                                                                  │
│   EXTERNAL SERVICES                                                              │
│   ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐   │
│   │ Anthropic  │ │ Stripe     │ │ Persona    │ │ SmartVault │ │ SurePrep   │   │
│   │ (Bedrock)  │ │ (Payments) │ │ (KYC)      │ │ (Portal)   │ │ (OCR)      │   │
│   └────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘   │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘

5. Backlog Items for Deployment

P0: Deployment Blockers

These must be completed before cloud deployment:

| ID | Item | Status | Description |
| --- | --- | --- | --- |
| DEPLOY-001 | Terraform Foundation | Not Started | VPC, subnets, security groups |
| DEPLOY-002 | Aurora RDS Module | Not Started | Database cluster with encryption |
| DEPLOY-003 | S3 Module | Not Started | Document buckets with policies |
| DEPLOY-004 | ECS Fargate Module | Not Started | Container orchestration |
| DEPLOY-005 | CloudFront Module | Not Started | CDN for frontends |
| DEPLOY-006 | Backend Dockerfile | Not Started | Production container image |
| DEPLOY-007 | Frontend Build Pipeline | Not Started | Build and deploy to S3 |
| DEPLOY-008 | Database Migration | Not Started | Schema deployment strategy |
| DEPLOY-009 | Secrets Manager Setup | Not Started | All credentials in Secrets Manager |
| DEPLOY-010 | CD Pipeline | Not Started | GitHub Actions deploy workflow |

P1: Production Readiness (from backlog.md)

These production blockers from TD-006 must be addressed:

| Phase | Services | Status | Notes |
| --- | --- | --- | --- |
| Phase 1 | EmailService + SMSService | Not Started | Requires API credentials |
| Phase 2 | PersonaService | Not Started | Requires API credentials |
| Phase 3 | SmartVaultService | Not Started | Requires API credentials |
| Phase 4 | SurePrepService | Not Started | Requires API credentials |
| Phase 5 | GoogleService | Not Started | Requires API credentials |
| Phase 6 | Webhook Security | Not Started | HMAC verification |

P2: Operational Readiness

| ID | Item | Status | Description |
| --- | --- | --- | --- |
| TD-004 | UAT Scripts | Not Started | User acceptance testing |
| TD-001 | Java Build Config | Not Started | Maven/Gradle for Java components |
| OPS-001 | CloudWatch Dashboards | Not Started | Monitoring dashboards |
| OPS-002 | Alerting Rules | Not Started | PagerDuty/SNS integration |
| OPS-003 | Log Aggregation | Not Started | CloudWatch Logs Insights |
| OPS-004 | Backup Verification | Not Started | Automated backup testing |

6. Infrastructure as Code Strategy

Tool Selection: Terraform

Why Terraform over alternatives:

| Factor | Terraform | AWS CDK | CloudFormation |
| --- | --- | --- | --- |
| Multi-cloud | Yes | No | No |
| State Management | Built-in | Via CFN | Via CFN |
| Language | HCL (declarative) | TypeScript/Python | YAML/JSON |
| Community Modules | Extensive | Growing | Limited |
| Learning Curve | Medium | Higher | Lower |
| Drift Detection | Yes | Limited | Limited |

Decision: Terraform with AWS provider for:

  • Declarative infrastructure definition
  • Version-controlled state (S3 + DynamoDB locking)
  • Modular, reusable components
  • Community modules for common patterns

Repository Structure

infrastructure/
├── terraform/
│   ├── environments/
│   │   ├── staging/
│   │   │   ├── main.tf                 # Environment entry point
│   │   │   ├── variables.tf            # Environment-specific vars
│   │   │   ├── terraform.tfvars        # Variable values
│   │   │   └── backend.tf              # S3 backend config
│   │   │
│   │   └── production/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       ├── terraform.tfvars
│   │       └── backend.tf
│   │
│   ├── modules/
│   │   ├── vpc/                        # VPC, subnets, NAT
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── aurora/                     # Aurora PostgreSQL
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── ecs/                        # ECS Fargate cluster
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── s3/                         # S3 buckets
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── cloudfront/                 # CDN distributions
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── alb/                        # Application Load Balancer
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── secrets/                    # Secrets Manager
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── waf/                        # WAF rules
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   └── README.md
│   │   │
│   │   ├── airflow/                    # Airflow EC2 instance
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   ├── outputs.tf
│   │   │   ├── user_data.sh            # Bootstrap script
│   │   │   └── README.md
│   │   │
│   │   └── monitoring/                 # CloudWatch dashboards/alarms
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       ├── outputs.tf
│   │       └── README.md
│   │
│   └── global/                         # Shared resources (S3 backend, IAM)
│       ├── backend/
│       │   └── main.tf                 # S3 bucket + DynamoDB for state
│       └── iam/
│           └── main.tf                 # Service roles
├── docker/
│   ├── api/
│   │   ├── Dockerfile                  # FastAPI production image
│   │   └── .dockerignore
│   │
│   └── airflow/
│       ├── Dockerfile                  # Airflow image
│       └── requirements.txt
└── scripts/
    ├── deploy.sh                       # Deployment helper
    ├── init-backend.sh                 # Initialize Terraform backend
    └── rotate-secrets.sh               # Secret rotation helper

State Management

# terraform/environments/production/backend.tf

# Terraform state stored in S3 with DynamoDB locking
# State bucket created via terraform/global/backend/

terraform {
  backend "s3" {
    # bucket: S3 bucket for state storage
    bucket = "tax-practice-terraform-state"

    # key: Path within bucket for this environment's state
    key = "production/terraform.tfstate"

    # region: AWS region for state bucket
    region = "us-east-1"

    # encrypt: Enable server-side encryption
    encrypt = true

    # dynamodb_table: Table for state locking (prevents concurrent modifications)
    dynamodb_table = "tax-practice-terraform-locks"
  }
}
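The dynamodb_table setting is what stops two concurrent terraform apply runs from corrupting the shared state: each run performs a conditional write of a LockID item and aborts if the item already exists. A toy Python model of that semantic (illustrative; not the real DynamoDB API):

```python
# Toy model of Terraform's DynamoDB state lock: a conditional put that
# fails when the LockID item already exists, so only one run proceeds.
class LockTable:
    def __init__(self):
        self.items = {}

    def acquire(self, lock_id: str, owner: str) -> bool:
        if lock_id in self.items:
            return False  # another run holds the lock
        self.items[lock_id] = owner
        return True

    def release(self, lock_id: str) -> None:
        self.items.pop(lock_id, None)

table = LockTable()
state_key = "tax-practice-terraform-state/production/terraform.tfstate"
print(table.acquire(state_key, "ci-deploy-run"))   # True
print(table.acquire(state_key, "dev-laptop"))      # False: blocked until release
table.release(state_key)
print(table.acquire(state_key, "dev-laptop"))      # True
```

DynamoDB's conditional writes are atomic, which is why a plain table suffices as a distributed mutex here.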

7. Phase 1: Foundation Infrastructure

7.1 VPC Module

# terraform/modules/vpc/main.tf

# VPC for Tax Practice AI
# Creates isolated network with public/private subnets across 3 AZs

resource "aws_vpc" "main" {
  # cidr_block: IP address range for the VPC
  # 10.0.0.0/16 provides 65,536 IP addresses
  cidr_block = var.vpc_cidr

  # enable_dns_hostnames: Required for RDS and other AWS services
  enable_dns_hostnames = true

  # enable_dns_support: Required for VPC DNS resolution
  enable_dns_support = true

  tags = {
    Name        = "${var.project}-${var.environment}-vpc"
    Environment = var.environment
    Project     = var.project
    ManagedBy   = "terraform"
  }
}

# Public subnets for ALB, NAT Gateway
# One per availability zone for high availability
resource "aws_subnet" "public" {
  count = length(var.availability_zones)

  vpc_id     = aws_vpc.main.id
  cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index)

  # availability_zone: Distribute across AZs for fault tolerance
  availability_zone = var.availability_zones[count.index]

  # map_public_ip_on_launch: Public subnets get public IPs
  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.project}-${var.environment}-public-${count.index + 1}"
    Environment = var.environment
    Type        = "public"
  }
}

# Private subnets for ECS, RDS, Lambda
# No direct internet access - uses NAT Gateway
resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id     = aws_vpc.main.id
  cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.availability_zones))

  availability_zone = var.availability_zones[count.index]

  # map_public_ip_on_launch: Private subnets do NOT get public IPs
  map_public_ip_on_launch = false

  tags = {
    Name        = "${var.project}-${var.environment}-private-${count.index + 1}"
    Environment = var.environment
    Type        = "private"
  }
}

# Internet Gateway for public subnet internet access
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name        = "${var.project}-${var.environment}-igw"
    Environment = var.environment
  }
}

# Elastic IP for NAT Gateway (static IP for outbound traffic)
resource "aws_eip" "nat" {
  domain = "vpc"

  tags = {
    Name        = "${var.project}-${var.environment}-nat-eip"
    Environment = var.environment
  }
}

# NAT Gateway for private subnet outbound internet access
# Placed in public subnet, routes private subnet traffic to internet
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = {
    Name        = "${var.project}-${var.environment}-nat"
    Environment = var.environment
  }

  depends_on = [aws_internet_gateway.main]
}

# Route table for public subnets
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  # route: Direct internet access via Internet Gateway
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name        = "${var.project}-${var.environment}-public-rt"
    Environment = var.environment
  }
}

# Route table for private subnets
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  # route: Internet access via NAT Gateway (outbound only)
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }

  tags = {
    Name        = "${var.project}-${var.environment}-private-rt"
    Environment = var.environment
  }
}

# Associate public subnets with public route table
resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# Associate private subnets with private route table
resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
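As a sanity check on the subnet math, the cidrsubnet() calls above can be reproduced with Python's stdlib ipaddress module (assuming the default 10.0.0.0/16 CIDR and three availability zones):

```python
import ipaddress

# cidrsubnet(prefix, newbits, netnum): adding 4 bits to a /16 yields /20
# subnets of 4,096 addresses each; netnum selects which one.
def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    net = ipaddress.ip_network(prefix)
    return str(list(net.subnets(prefixlen_diff=newbits))[netnum])

azs = 3
public  = [cidrsubnet("10.0.0.0/16", 4, i) for i in range(azs)]
private = [cidrsubnet("10.0.0.0/16", 4, i + azs) for i in range(azs)]
print(public)   # ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
print(private)  # ['10.0.48.0/20', '10.0.64.0/20', '10.0.80.0/20']
```

Offsetting the private subnets by the AZ count is what keeps the two ranges disjoint.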

7.2 Security Groups

# terraform/modules/vpc/security_groups.tf

# ALB Security Group - accepts HTTPS from internet
resource "aws_security_group" "alb" {
  name        = "${var.project}-${var.environment}-alb-sg"
  description = "Security group for Application Load Balancer"
  vpc_id      = aws_vpc.main.id

  # ingress: Allow HTTPS from anywhere
  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ingress: Allow HTTP for redirect to HTTPS
  ingress {
    description = "HTTP for redirect"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # egress: Allow all outbound (to ECS tasks)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.project}-${var.environment}-alb-sg"
    Environment = var.environment
  }
}

# ECS Security Group - accepts traffic from ALB only
resource "aws_security_group" "ecs" {
  name        = "${var.project}-${var.environment}-ecs-sg"
  description = "Security group for ECS Fargate tasks"
  vpc_id      = aws_vpc.main.id

  # ingress: Only allow traffic from ALB
  ingress {
    description     = "HTTP from ALB"
    from_port       = 8000
    to_port         = 8000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # egress: Allow all outbound (to RDS, S3, external APIs)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.project}-${var.environment}-ecs-sg"
    Environment = var.environment
  }
}

# RDS Security Group - accepts traffic from ECS and Airflow only
resource "aws_security_group" "rds" {
  name        = "${var.project}-${var.environment}-rds-sg"
  description = "Security group for Aurora PostgreSQL"
  vpc_id      = aws_vpc.main.id

  # ingress: PostgreSQL from ECS
  ingress {
    description     = "PostgreSQL from ECS"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs.id]
  }

  # ingress: PostgreSQL from Airflow
  ingress {
    description     = "PostgreSQL from Airflow"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.airflow.id]
  }

  # egress: Allow all outbound (Aurora initiates little traffic of its own)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.project}-${var.environment}-rds-sg"
    Environment = var.environment
  }
}

# Airflow EC2 Security Group
resource "aws_security_group" "airflow" {
  name        = "${var.project}-${var.environment}-airflow-sg"
  description = "Security group for Airflow EC2 instance"
  vpc_id      = aws_vpc.main.id

  # ingress: Airflow UI from VPN/office IPs only
  ingress {
    description = "Airflow UI"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = var.admin_cidr_blocks
  }

  # ingress: SSH from bastion/VPN only
  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.admin_cidr_blocks
  }

  # egress: Allow all outbound
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name        = "${var.project}-${var.environment}-airflow-sg"
    Environment = var.environment
  }
}
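The intended reachability of these four security groups can be summarized as a toy Python model (purely illustrative; this is not how AWS evaluates rules, and the names simply mirror the Terraform resources above):

```python
# Who may open a connection to whom, per the ingress rules above.
ALLOWED_INGRESS = {
    "alb":     {("internet", 443), ("internet", 80)},
    "ecs":     {("alb", 8000)},
    "rds":     {("ecs", 5432), ("airflow", 5432)},
    "airflow": {("admin_cidr", 8080), ("admin_cidr", 22)},
}

def can_connect(src: str, dst: str, port: int) -> bool:
    return (src, port) in ALLOWED_INGRESS.get(dst, set())

assert can_connect("internet", "alb", 443)       # public HTTPS entry point
assert not can_connect("internet", "ecs", 8000)  # API reachable only via ALB
assert not can_connect("internet", "rds", 5432)  # DB never internet-facing
assert can_connect("airflow", "rds", 5432)       # DAGs reach the database
print("security-group model OK")
```

Writing the intent down this way makes the layered design explicit: every hop toward the data layer narrows who can talk to it.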

8. Phase 2: Database and Storage

8.1 Aurora PostgreSQL Module

# terraform/modules/aurora/main.tf

# Aurora PostgreSQL Serverless v2 cluster
# Auto-scales based on load, pay-per-use

resource "aws_rds_cluster" "main" {
  # cluster_identifier: Unique name for the cluster
  cluster_identifier = "${var.project}-${var.environment}"

  # engine: Aurora PostgreSQL compatible
  engine         = "aurora-postgresql"
  engine_mode    = "provisioned"
  engine_version = "15.4"

  # database_name: Default database created on launch
  database_name = var.database_name

  # master_username: Admin user (stored in Secrets Manager)
  master_username = var.master_username

  # master_password: Retrieved from Secrets Manager
  master_password = var.master_password

  # db_subnet_group_name: Place in private subnets
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [var.security_group_id]

  # storage_encrypted: Encrypt data at rest with KMS
  storage_encrypted = true
  kms_key_id        = var.kms_key_arn

  # backup_retention_period: Keep 7 days of automated backups
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"

  # deletion_protection: Prevent accidental deletion
  deletion_protection = var.environment == "production"

  # skip_final_snapshot: Skip the final snapshot outside production;
  # production takes a final snapshot before deletion
  skip_final_snapshot       = var.environment != "production"
  final_snapshot_identifier = var.environment == "production" ? "${var.project}-${var.environment}-final" : null

  # enabled_cloudwatch_logs_exports: Export PostgreSQL logs
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # serverlessv2_scaling_configuration: Auto-scaling capacity
  serverlessv2_scaling_configuration {
    # min_capacity: Minimum ACUs (0.5 ACU = ~1GB RAM)
    min_capacity = var.min_capacity

    # max_capacity: Maximum ACUs (scales up during peak)
    max_capacity = var.max_capacity
  }

  tags = {
    Name        = "${var.project}-${var.environment}-aurora"
    Environment = var.environment
    Project     = var.project
    ManagedBy   = "terraform"
  }
}

# Aurora cluster instance (Serverless v2)
resource "aws_rds_cluster_instance" "main" {
  count = var.instance_count

  identifier         = "${var.project}-${var.environment}-${count.index + 1}"
  cluster_identifier = aws_rds_cluster.main.id

  # instance_class: Serverless v2 instance type
  instance_class = "db.serverless"

  engine         = aws_rds_cluster.main.engine
  engine_version = aws_rds_cluster.main.engine_version

  # publicly_accessible: Never expose RDS to internet
  publicly_accessible = false

  # performance_insights_enabled: Enable for query analysis
  performance_insights_enabled = true

  tags = {
    Name        = "${var.project}-${var.environment}-instance-${count.index + 1}"
    Environment = var.environment
  }
}

# DB subnet group for multi-AZ deployment
resource "aws_db_subnet_group" "main" {
  name        = "${var.project}-${var.environment}-db-subnet"
  description = "Subnet group for Aurora PostgreSQL"
  subnet_ids  = var.private_subnet_ids

  tags = {
    Name        = "${var.project}-${var.environment}-db-subnet"
    Environment = var.environment
  }
}
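As a rough sizing aid, Serverless v2 capacity units map to memory at about 2 GiB per ACU (consistent with the 0.5 ACU ≈ 1 GB note in the scaling block above). A hypothetical helper for reasoning about min_capacity/max_capacity choices:

```python
# Assumption: 1 ACU ≈ 2 GiB RAM (per the 0.5-ACU ≈ 1 GB note above).
def acu_to_memory_gib(acu: float) -> float:
    return acu * 2.0

def capacity_range(min_capacity: float, max_capacity: float) -> str:
    return (f"{acu_to_memory_gib(min_capacity):g}-"
            f"{acu_to_memory_gib(max_capacity):g} GiB RAM")

# e.g. a cluster scaling between 0.5 and 4 ACUs:
print(capacity_range(0.5, 4))  # 1-8 GiB RAM
```

A low floor keeps idle-hour costs down; the ceiling caps spend during tax-season peaks.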

8.2 S3 Module

# terraform/modules/s3/main.tf

# Document storage bucket
# Stores tax documents, generated PDFs, signed forms

resource "aws_s3_bucket" "documents" {
  bucket = "${var.project}-${var.environment}-documents"

  tags = {
    Name        = "${var.project}-${var.environment}-documents"
    Environment = var.environment
    Purpose     = "Tax document storage"
    ManagedBy   = "terraform"
  }
}

# Enable versioning for document recovery
resource "aws_s3_bucket_versioning" "documents" {
  bucket = aws_s3_bucket.documents.id

  versioning_configuration {
    # status: Enable versioning for all objects
    status = "Enabled"
  }
}

# Server-side encryption with KMS
resource "aws_s3_bucket_server_side_encryption_configuration" "documents" {
  bucket = aws_s3_bucket.documents.id

  rule {
    apply_server_side_encryption_by_default {
      # sse_algorithm: Use AWS KMS for encryption
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    # bucket_key_enabled: Reduce KMS request costs
    bucket_key_enabled = true
  }
}

# Block all public access
resource "aws_s3_bucket_public_access_block" "documents" {
  bucket = aws_s3_bucket.documents.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "documents" {
  bucket = aws_s3_bucket.documents.id

  # Rule 1: Move old documents to Glacier after 3 years
  rule {
    id     = "archive-old-documents"
    status = "Enabled"

    filter {
      prefix = "clients/"
    }

    transition {
      days          = 1095  # 3 years
      storage_class = "GLACIER"
    }
  }

  # Rule 2: Delete old versions after 90 days
  rule {
    id     = "delete-old-versions"
    status = "Enabled"

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

# CORS configuration for presigned URL uploads
resource "aws_s3_bucket_cors_configuration" "documents" {
  bucket = aws_s3_bucket.documents.id

  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["GET", "PUT", "POST"]
    allowed_origins = var.allowed_origins
    expose_headers  = ["ETag"]
    max_age_seconds = 3600
  }
}

# Frontend static hosting bucket (Client Portal)
resource "aws_s3_bucket" "frontend_portal" {
  bucket = "${var.project}-${var.environment}-portal"

  tags = {
    Name        = "${var.project}-${var.environment}-portal"
    Environment = var.environment
    Purpose     = "Client Portal static assets"
  }
}

# Frontend static hosting bucket (Staff App)
resource "aws_s3_bucket" "frontend_staff" {
  bucket = "${var.project}-${var.environment}-staff"

  tags = {
    Name        = "${var.project}-${var.environment}-staff"
    Environment = var.environment
    Purpose     = "Staff App static assets"
  }
}

# Block public access for frontend buckets (CloudFront only)
resource "aws_s3_bucket_public_access_block" "frontend_portal" {
  bucket = aws_s3_bucket.frontend_portal.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_public_access_block" "frontend_staff" {
  bucket = aws_s3_bucket.frontend_staff.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
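The CORS rule on the documents bucket exists so browsers can upload directly via presigned URLs. In practice the API would generate those with boto3's generate_presigned_url; purely for illustration, this stdlib-only sketch shows what such a URL contains under SigV4 query authentication (bucket name, key, and credentials are made up):

```python
import datetime
import hashlib
import hmac
import urllib.parse

def presign_put(bucket, key, region, access_key, secret_key, expires=3600):
    """Build a SigV4 presigned PUT URL (educational sketch, not boto3)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date, datestamp = now.strftime("%Y%m%dT%H%M%SZ"), now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(f"{k}={urllib.parse.quote(v, safe='')}"
                     for k, v in sorted(params.items()))
    canonical = "\n".join(["PUT", f"/{key}", query,
                           f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"])
    to_sign = "\n".join(["AWS4-HMAC-SHA256", amz_date, scope,
                         hashlib.sha256(canonical.encode()).hexdigest()])
    k = b"AWS4" + secret_key.encode()
    for part in (datestamp, region, "s3", "aws4_request"):
        k = hmac.new(k, part.encode(), hashlib.sha256).digest()
    sig = hmac.new(k, to_sign.encode(), hashlib.sha256).hexdigest()
    return f"https://{host}/{key}?{query}&X-Amz-Signature={sig}"

url = presign_put("tax-practice-staging-documents", "clients/123/w2.pdf",
                  "us-east-1", "AKIAEXAMPLE", "secret-example")
print(url.split("?")[0])
```

The signature binds the method, key, and expiry, so the browser can PUT the object without ever holding AWS credentials; the CORS allowed_origins list is what lets that cross-origin PUT through.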

9. Phase 3: Compute and Networking

9.1 ECS Fargate Module

# terraform/modules/ecs/main.tf

# ECS Cluster for FastAPI backend

resource "aws_ecs_cluster" "main" {
  name = "${var.project}-${var.environment}"

  # setting: Enable Container Insights for monitoring
  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = {
    Name        = "${var.project}-${var.environment}-cluster"
    Environment = var.environment
  }
}

# ECS Task Definition
resource "aws_ecs_task_definition" "api" {
  family                   = "${var.project}-${var.environment}-api"

  # requires_compatibilities: Fargate (serverless containers)
  requires_compatibilities = ["FARGATE"]

  # network_mode: awsvpc required for Fargate
  network_mode = "awsvpc"

  # cpu: 512 = 0.5 vCPU (scale up for production)
  cpu = var.cpu

  # memory: 1024 = 1GB RAM (scale up for production)
  memory = var.memory

  # execution_role_arn: Role for ECS to pull images, write logs
  execution_role_arn = aws_iam_role.ecs_execution.arn

  # task_role_arn: Role for the container to access AWS services
  task_role_arn = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "api"
      image = var.container_image

      # portMappings: Expose FastAPI port
      portMappings = [
        {
          containerPort = 8000
          hostPort      = 8000
          protocol      = "tcp"
        }
      ]

      # environment: Non-sensitive configuration
      environment = [
        { name = "ENVIRONMENT", value = var.environment },
        { name = "DB_HOST", value = var.db_host },
        { name = "DB_PORT", value = "5432" },
        { name = "DB_NAME", value = var.db_name },
        { name = "S3_BUCKET_DOCUMENTS", value = var.s3_bucket_documents },
        { name = "AWS_REGION", value = var.aws_region },
        { name = "LOG_LEVEL", value = var.environment == "production" ? "INFO" : "DEBUG" }
      ]

      # secrets: Sensitive values from Secrets Manager
      secrets = [
        {
          name      = "DB_USER"
          valueFrom = "${var.db_secret_arn}:username::"
        },
        {
          name      = "DB_PASSWORD"
          valueFrom = "${var.db_secret_arn}:password::"
        },
        {
          name      = "JWT_SECRET"
          valueFrom = var.jwt_secret_arn
        }
      ]

      # logConfiguration: Send logs to CloudWatch
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.api.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "api"
        }
      }

      # healthCheck: Container health check
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 60
      }
    }
  ])

  tags = {
    Name        = "${var.project}-${var.environment}-api-task"
    Environment = var.environment
  }
}

# ECS Service
resource "aws_ecs_service" "api" {
  name            = "${var.project}-${var.environment}-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn

  # desired_count: Number of running tasks
  desired_count = var.desired_count

  # launch_type: Fargate for serverless
  launch_type = "FARGATE"

  # platform_version: Use latest Fargate platform
  platform_version = "LATEST"

  # Rolling updates: allow up to double capacity during a deploy while
  # never dropping below the desired count. (The AWS provider exposes
  # these as top-level arguments, not a deployment_configuration block.)
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  # network_configuration: VPC networking
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [var.ecs_security_group_id]
    assign_public_ip = false
  }

  # load_balancer: Connect to ALB target group
  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = "api"
    container_port   = 8000
  }

  # enable_execute_command: Enable ECS Exec for debugging
  enable_execute_command = var.environment != "production"

  tags = {
    Name        = "${var.project}-${var.environment}-api-service"
    Environment = var.environment
  }
}

# Auto-scaling for ECS service
resource "aws_appautoscaling_target" "api" {
  max_capacity       = var.max_capacity
  min_capacity       = var.min_capacity
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Scale based on CPU utilization
resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "${var.project}-${var.environment}-api-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  service_namespace  = aws_appautoscaling_target.api.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
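
Target tracking keeps average CPU near the 70% target by adjusting DesiredCount roughly in proportion to how far the metric has drifted. As a rough sketch of that arithmetic (a hypothetical helper; the real CloudWatch algorithm also applies alarm evaluation periods and the cooldowns configured above):

```python
import math

def desired_task_count(current_count: int, avg_cpu: float,
                       target_cpu: float = 70.0,
                       min_capacity: int = 1,
                       max_capacity: int = 10) -> int:
    """Approximate target tracking: scale DesiredCount in proportion
    to how far average CPU sits from the target, clamped to the
    service's registered min/max capacity."""
    raw = current_count * (avg_cpu / target_cpu)
    return max(min_capacity, min(max_capacity, math.ceil(raw)))
```

At 105% average CPU a 2-task service grows to 3 tasks; at 35% a 4-task service shrinks to 2. The asymmetric cooldowns above deliberately make scale-out (60s) react faster than scale-in (300s).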

# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${var.project}-${var.environment}-api"
  retention_in_days = var.environment == "production" ? 90 : 30

  tags = {
    Name        = "${var.project}-${var.environment}-api-logs"
    Environment = var.environment
  }
}

8.2 Application Load Balancer Module

# terraform/modules/alb/main.tf

# Application Load Balancer for API traffic

resource "aws_lb" "main" {
  name               = "${var.project}-${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"

  security_groups = [var.alb_security_group_id]
  subnets         = var.public_subnet_ids

  # enable_deletion_protection: Prevent accidental deletion
  enable_deletion_protection = var.environment == "production"

  # access_logs: Store access logs in S3
  access_logs {
    bucket  = var.log_bucket
    prefix  = "alb"
    enabled = true
  }

  tags = {
    Name        = "${var.project}-${var.environment}-alb"
    Environment = var.environment
  }
}

# HTTPS Listener (port 443)
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"

  # ssl_policy: Use secure TLS policy
  ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"

  # certificate_arn: ACM certificate for HTTPS
  certificate_arn = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}

# HTTP Listener (redirect to HTTPS)
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# Target Group for ECS tasks
resource "aws_lb_target_group" "api" {
  name        = "${var.project}-${var.environment}-api-tg"
  port        = 8000
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  # health_check: Verify ECS tasks are healthy
  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }

  tags = {
    Name        = "${var.project}-${var.environment}-api-tg"
    Environment = var.environment
  }
}
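
Both the container health check and this target group probe `GET /health` and accept only a 200. A minimal sketch of the response logic behind that endpoint (the FastAPI route wiring and the actual database/S3 probes are assumed, not shown):

```python
def health_response(db_ok: bool, s3_ok: bool) -> tuple[int, dict]:
    """Return (HTTP status, body) for /health. The ALB matcher only
    accepts 200, so any failed dependency takes the task out of
    rotation and, after three failures, triggers replacement."""
    checks = {"database": "ok" if db_ok else "fail",
              "s3": "ok" if s3_ok else "fail"}
    if db_ok and s3_ok:
        return 200, {"status": "healthy", "checks": checks}
    return 503, {"status": "unhealthy", "checks": checks}
```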

8.3 Backend Dockerfile

# infrastructure/docker/api/Dockerfile

# =============================================================================
# Tax Practice AI - FastAPI Production Image
# =============================================================================
# Multi-stage build for minimal production image

# -----------------------------------------------------------------------------
# Stage 1: Builder
# -----------------------------------------------------------------------------
FROM python:3.12-slim AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip \
    && pip install --no-cache-dir -r requirements.txt

# -----------------------------------------------------------------------------
# Stage 2: Production Image
# -----------------------------------------------------------------------------
FROM python:3.12-slim AS production

# Security: Run as non-root user
RUN groupadd -r appgroup && useradd -r -g appgroup appuser

# Install runtime dependencies only
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY --chown=appuser:appgroup src/ ./src/
COPY --chown=appuser:appgroup config.yaml ./

# Environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONPATH=/app

# Switch to non-root user
USER appuser

# Expose API port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run FastAPI with uvicorn
CMD ["uvicorn", "src.api.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

9. Phase 4: Frontend Deployment

9.1 CloudFront Module

# terraform/modules/cloudfront/main.tf

# CloudFront distribution for Client Portal

resource "aws_cloudfront_distribution" "portal" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"
  price_class         = "PriceClass_100"  # US, Canada, Europe

  # aliases: Custom domain names
  aliases = var.portal_domains

  # origin: S3 bucket for static assets
  origin {
    domain_name              = var.portal_bucket_regional_domain
    origin_access_control_id = aws_cloudfront_origin_access_control.portal.id
    origin_id                = "S3-Portal"
  }

  # default_cache_behavior: Serve static assets
  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-Portal"

    # forwarded_values: Cache based on headers
    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    # viewer_protocol_policy: Redirect HTTP to HTTPS
    viewer_protocol_policy = "redirect-to-https"

    # TTL settings
    min_ttl     = 0
    default_ttl = 86400    # 1 day
    max_ttl     = 31536000 # 1 year

    compress = true
  }

  # custom_error_response: SPA routing (return index.html for 404s)
  custom_error_response {
    error_code         = 404
    response_code      = 200
    response_page_path = "/index.html"
  }

  custom_error_response {
    error_code         = 403
    response_code      = 200
    response_page_path = "/index.html"
  }

  # restrictions: No geo restrictions
  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  # viewer_certificate: Use ACM certificate
  viewer_certificate {
    acm_certificate_arn      = var.certificate_arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  # web_acl_id: Attach WAF
  web_acl_id = var.waf_acl_arn

  tags = {
    Name        = "${var.project}-${var.environment}-portal-cdn"
    Environment = var.environment
  }
}

# Origin Access Control for S3
resource "aws_cloudfront_origin_access_control" "portal" {
  name                              = "${var.project}-${var.environment}-portal-oac"
  description                       = "OAC for Client Portal S3 bucket"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}

# Similar distribution for Staff App
resource "aws_cloudfront_distribution" "staff" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = "index.html"
  price_class         = "PriceClass_100"

  aliases = var.staff_domains

  origin {
    domain_name              = var.staff_bucket_regional_domain
    origin_access_control_id = aws_cloudfront_origin_access_control.staff.id
    origin_id                = "S3-Staff"
  }

  default_cache_behavior {
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-Staff"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 0
    default_ttl            = 86400
    max_ttl                = 31536000
    compress               = true
  }

  custom_error_response {
    error_code         = 404
    response_code      = 200
    response_page_path = "/index.html"
  }

  custom_error_response {
    error_code         = 403
    response_code      = 200
    response_page_path = "/index.html"
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn      = var.certificate_arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  web_acl_id = var.waf_acl_arn

  tags = {
    Name        = "${var.project}-${var.environment}-staff-cdn"
    Environment = var.environment
  }
}

resource "aws_cloudfront_origin_access_control" "staff" {
  name                              = "${var.project}-${var.environment}-staff-oac"
  description                       = "OAC for Staff App S3 bucket"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}
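
With public access fully blocked, each frontend bucket still needs a bucket policy that lets CloudFront read objects through the OAC. A sketch of that policy document (bucket name, account ID, and distribution ID are placeholders to fill in from the resources above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "cloudfront.amazonaws.com" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::PORTAL_BUCKET/*",
      "Condition": {
        "StringEquals": {
          "AWS:SourceArn": "arn:aws:cloudfront::ACCOUNT_ID:distribution/DISTRIBUTION_ID"
        }
      }
    }
  ]
}
```

The `SourceArn` condition scopes access to the one distribution, so another account's CloudFront distribution cannot front the bucket.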

9.2 Frontend Build and Deploy (GitHub Actions)

# .github/workflows/deploy-frontend.yml

# Frontend deployment workflow
# Builds React apps and deploys to S3/CloudFront

name: Deploy Frontend

on:
  push:
    branches:
      - main
    paths:
      - 'frontend/**'
  workflow_dispatch:
    inputs:
      environment:
        description: 'Deployment environment'
        required: true
        default: 'staging'
        type: choice
        options:
          - staging
          - production

env:
  # AWS_REGION: Region for all AWS operations
  AWS_REGION: us-east-1

  # NODE_VERSION: Node.js version for builds
  NODE_VERSION: '20'

jobs:
  # ===========================================================================
  # Build Frontend Applications
  # ===========================================================================
  build:
    name: Build Frontend
    runs-on: ubuntu-latest

    strategy:
      matrix:
        app: [client-portal, staff-app]

    steps:
      # Checkout repository
      - name: Checkout code
        uses: actions/checkout@v4

      # Install pnpm first: setup-node's pnpm cache lookup fails if the
      # pnpm binary is not already on the PATH
      - name: Install pnpm
        run: npm install -g pnpm

      # Setup Node.js with pnpm store caching
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'pnpm'
          cache-dependency-path: frontend/pnpm-lock.yaml

      # Install dependencies
      - name: Install dependencies
        working-directory: frontend
        run: pnpm install --frozen-lockfile

      # Build application
      - name: Build ${{ matrix.app }}
        working-directory: frontend
        env:
          VITE_API_URL: ${{ vars.API_URL }}
          VITE_APP_NAME: ${{ matrix.app == 'client-portal' && 'Tax Practice Portal' || 'Tax Practice Staff' }}
        run: pnpm --filter ${{ matrix.app }} build

      # Upload build artifact
      - name: Upload build artifact
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.app }}-build
          path: frontend/apps/${{ matrix.app }}/dist
          retention-days: 1

  # ===========================================================================
  # Deploy to S3 and Invalidate CloudFront
  # ===========================================================================
  deploy:
    name: Deploy to AWS
    runs-on: ubuntu-latest
    needs: build
    environment: ${{ github.event.inputs.environment || 'staging' }}

    strategy:
      matrix:
        app: [client-portal, staff-app]

    steps:
      # Download build artifact
      - name: Download build artifact
        uses: actions/download-artifact@v4
        with:
          name: ${{ matrix.app }}-build
          path: dist

      # Configure AWS credentials
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      # Sync to S3
      - name: Deploy to S3
        run: |
          aws s3 sync dist/ s3://${{ vars.S3_BUCKET_PREFIX }}-${{ matrix.app }}/ \
            --delete \
            --cache-control "public, max-age=31536000, immutable" \
            --exclude "index.html" \
            --exclude "*.json"

          # Upload index.html with no-cache for SPA routing
          aws s3 cp dist/index.html s3://${{ vars.S3_BUCKET_PREFIX }}-${{ matrix.app }}/index.html \
            --cache-control "no-cache, no-store, must-revalidate"

      # Invalidate CloudFront cache
      - name: Invalidate CloudFront
        run: |
          aws cloudfront create-invalidation \
            --distribution-id ${{ matrix.app == 'client-portal' && vars.CLOUDFRONT_PORTAL_ID || vars.CLOUDFRONT_STAFF_ID }} \
            --paths "/*"
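
The sync step above encodes a two-tier caching policy: fingerprinted build assets are cached forever as immutable, while `index.html` and JSON manifests must always revalidate so a new release is picked up immediately after the CloudFront invalidation. The rule, as a hedged sketch:

```python
def cache_control_for(path: str) -> str:
    """Mirror the deploy step's caching split: entry points revalidate
    on every request; content-hashed assets never change in place."""
    if path.endswith("index.html") or path.endswith(".json"):
        return "no-cache, no-store, must-revalidate"
    return "public, max-age=31536000, immutable"
```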

10. Phase 5: Orchestration and Background Jobs

10.1 Airflow EC2 Module

# terraform/modules/airflow/main.tf

# EC2 instance for self-hosted Apache Airflow
# Handles workflow orchestration, scheduled tasks

resource "aws_instance" "airflow" {
  # ami: Amazon Linux 2023 AMI
  ami           = var.ami_id

  # instance_type: t3.medium provides 2 vCPU, 4GB RAM
  instance_type = var.instance_type

  # subnet_id: Deploy in private subnet
  subnet_id = var.private_subnet_id

  # vpc_security_group_ids: Airflow security group
  vpc_security_group_ids = [var.security_group_id]

  # iam_instance_profile: Role for AWS service access
  iam_instance_profile = aws_iam_instance_profile.airflow.name

  # key_name: SSH key for access (optional, use Session Manager instead)
  key_name = var.key_name

  # root_block_device: 50GB gp3 storage
  root_block_device {
    volume_size           = 50
    volume_type           = "gp3"
    encrypted             = true
    delete_on_termination = true
  }

  # user_data: Bootstrap script
  user_data = templatefile("${path.module}/user_data.sh", {
    environment          = var.environment
    db_host              = var.db_host
    db_name              = var.airflow_db_name
    db_secret_arn        = var.db_secret_arn
    aws_region           = var.aws_region
    s3_bucket_dags       = var.s3_bucket_dags
  })

  tags = {
    Name        = "${var.project}-${var.environment}-airflow"
    Environment = var.environment
    Service     = "airflow"
  }
}

# IAM role for Airflow EC2 instance
resource "aws_iam_role" "airflow" {
  name = "${var.project}-${var.environment}-airflow-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name        = "${var.project}-${var.environment}-airflow-role"
    Environment = var.environment
  }
}

# IAM policy for Airflow to access AWS services
resource "aws_iam_role_policy" "airflow" {
  name = "${var.project}-${var.environment}-airflow-policy"
  role = aws_iam_role.airflow.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # S3 access for DAGs and document processing
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          "arn:aws:s3:::${var.s3_bucket_dags}",
          "arn:aws:s3:::${var.s3_bucket_dags}/*",
          "arn:aws:s3:::${var.s3_bucket_documents}",
          "arn:aws:s3:::${var.s3_bucket_documents}/*"
        ]
      },
      {
        # Secrets Manager for credentials
        Effect = "Allow"
        Action = [
          "secretsmanager:GetSecretValue"
        ]
        Resource = var.secret_arns
      },
      {
        # Lambda invocation for task execution
        Effect = "Allow"
        Action = [
          "lambda:InvokeFunction"
        ]
        Resource = "arn:aws:lambda:${var.aws_region}:${var.account_id}:function:${var.project}-${var.environment}-*"
      },
      {
        # CloudWatch Logs
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "*"
      },
      {
        # SSM Session Manager (for debugging)
        Effect = "Allow"
        Action = [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_instance_profile" "airflow" {
  name = "${var.project}-${var.environment}-airflow-profile"
  role = aws_iam_role.airflow.name
}

10.2 Airflow Bootstrap Script

#!/bin/bash
# terraform/modules/airflow/user_data.sh
# Bootstrap script for Airflow EC2 instance

set -e

# =============================================================================
# Environment variables (injected by Terraform)
# =============================================================================
ENVIRONMENT="${environment}"
DB_HOST="${db_host}"
DB_NAME="${db_name}"
DB_SECRET_ARN="${db_secret_arn}"
AWS_REGION="${aws_region}"
S3_BUCKET_DAGS="${s3_bucket_dags}"

# =============================================================================
# System Updates
# =============================================================================
echo "Updating system packages..."
yum update -y
yum install -y python3-pip postgresql15 git jq

# =============================================================================
# Install Airflow
# =============================================================================
echo "Installing Apache Airflow..."
# Pin against the official constraints file so transitive dependency
# versions match this Airflow release and the instance's Python version
PYTHON_VERSION=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
pip3 install "apache-airflow[postgres,amazon]==2.8.0" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.0/constraints-$PYTHON_VERSION.txt"

# =============================================================================
# Get Database Credentials from Secrets Manager
# =============================================================================
echo "Retrieving database credentials..."
DB_CREDS=$(aws secretsmanager get-secret-value \
  --secret-id "$DB_SECRET_ARN" \
  --region "$AWS_REGION" \
  --query SecretString \
  --output text)

DB_USER=$(echo "$DB_CREDS" | jq -r '.username')
DB_PASSWORD=$(echo "$DB_CREDS" | jq -r '.password')

# =============================================================================
# Configure Airflow
# =============================================================================
echo "Configuring Airflow..."
export AIRFLOW_HOME=/opt/airflow
mkdir -p $AIRFLOW_HOME/dags

# Write configuration before initializing the metadata database, so
# Airflow targets PostgreSQL instead of its SQLite default
cat > $AIRFLOW_HOME/airflow.cfg << EOF
[core]
executor = LocalExecutor
dags_folder = $AIRFLOW_HOME/dags
parallelism = 8
load_examples = False

[database]
sql_alchemy_conn = postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:5432/$DB_NAME

[webserver]
web_server_port = 8080
# Airflow 2.x handles UI authentication through Flask AppBuilder
# (webserver_config.py); the 1.x authenticate/auth_backend keys no
# longer exist

[scheduler]
dag_dir_list_interval = 300

[logging]
remote_logging = True
remote_log_conn_id = aws_default
remote_base_log_folder = s3://$S3_BUCKET_DAGS/logs
encrypt_s3_logs = True
EOF

# Initialize the Airflow metadata database in PostgreSQL
airflow db init

# NOTE: create an initial admin login before first use, e.g.
#   airflow users create --role Admin --username admin ...

# =============================================================================
# Sync DAGs from S3
# =============================================================================
echo "Syncing DAGs from S3..."
aws s3 sync s3://$S3_BUCKET_DAGS/dags/ $AIRFLOW_HOME/dags/

# =============================================================================
# Create systemd services
# =============================================================================
echo "Creating systemd services..."

# Airflow Webserver
cat > /etc/systemd/system/airflow-webserver.service << EOF
[Unit]
Description=Airflow Webserver
After=network.target

[Service]
Environment=AIRFLOW_HOME=$AIRFLOW_HOME
User=ec2-user
Group=ec2-user
Type=simple
ExecStart=/usr/local/bin/airflow webserver
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# Airflow Scheduler
cat > /etc/systemd/system/airflow-scheduler.service << EOF
[Unit]
Description=Airflow Scheduler
After=network.target

[Service]
Environment=AIRFLOW_HOME=$AIRFLOW_HOME
User=ec2-user
Group=ec2-user
Type=simple
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

# =============================================================================
# Start Services
# =============================================================================
echo "Starting Airflow services..."
chown -R ec2-user:ec2-user $AIRFLOW_HOME
systemctl daemon-reload
systemctl enable airflow-webserver airflow-scheduler
systemctl start airflow-webserver airflow-scheduler

echo "Airflow installation complete!"

11. Phase 6: External Integrations

11.1 Secrets Manager Configuration

# terraform/modules/secrets/main.tf

# =============================================================================
# Database Credentials
# =============================================================================
resource "aws_secretsmanager_secret" "db_credentials" {
  name        = "${var.project}/${var.environment}/db-credentials"
  description = "Aurora PostgreSQL master credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-db-credentials"
    Environment = var.environment
  }
}
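
The ECS task definition reads the `username` and `password` JSON keys from this secret (`valueFrom = "...:username::"`), so the secret value must be stored as a JSON object of roughly this shape (values illustrative):

```json
{
  "username": "app_user",
  "password": "<generated-by-rotation>"
}
```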

# =============================================================================
# JWT Secret
# =============================================================================
resource "aws_secretsmanager_secret" "jwt_secret" {
  name        = "${var.project}/${var.environment}/jwt-secret"
  description = "JWT signing secret for authentication"

  tags = {
    Name        = "${var.project}-${var.environment}-jwt-secret"
    Environment = var.environment
  }
}

# =============================================================================
# External Service Credentials
# =============================================================================

# Stripe API Keys
resource "aws_secretsmanager_secret" "stripe" {
  name        = "${var.project}/${var.environment}/stripe"
  description = "Stripe API credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-stripe"
    Environment = var.environment
  }
}

# Persona API Key
resource "aws_secretsmanager_secret" "persona" {
  name        = "${var.project}/${var.environment}/persona"
  description = "Persona identity verification API credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-persona"
    Environment = var.environment
  }
}

# SmartVault OAuth Credentials
resource "aws_secretsmanager_secret" "smartvault" {
  name        = "${var.project}/${var.environment}/smartvault"
  description = "SmartVault OAuth credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-smartvault"
    Environment = var.environment
  }
}

# SurePrep API Credentials
resource "aws_secretsmanager_secret" "sureprep" {
  name        = "${var.project}/${var.environment}/sureprep"
  description = "SurePrep API credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-sureprep"
    Environment = var.environment
  }
}

# Google OAuth Credentials
resource "aws_secretsmanager_secret" "google" {
  name        = "${var.project}/${var.environment}/google"
  description = "Google Workspace OAuth credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-google"
    Environment = var.environment
  }
}

# Twilio Credentials
resource "aws_secretsmanager_secret" "twilio" {
  name        = "${var.project}/${var.environment}/twilio"
  description = "Twilio SMS/Voice credentials"

  tags = {
    Name        = "${var.project}-${var.environment}-twilio"
    Environment = var.environment
  }
}

# SendGrid API Key
resource "aws_secretsmanager_secret" "sendgrid" {
  name        = "${var.project}/${var.environment}/sendgrid"
  description = "SendGrid email API key"

  tags = {
    Name        = "${var.project}-${var.environment}-sendgrid"
    Environment = var.environment
  }
}

12. Testing Requirements

12.1 New Tests for Cloud Deployment

| Category | Test | Priority | Status |
| --- | --- | --- | --- |
| Infrastructure | Terraform plan validates | P0 | Not Started |
| Infrastructure | VPC connectivity test | P0 | Not Started |
| Infrastructure | Security group rules verified | P0 | Not Started |
| Infrastructure | RDS connectivity from ECS | P0 | Not Started |
| Infrastructure | S3 access from ECS | P0 | Not Started |
| Container | Dockerfile builds successfully | P0 | Not Started |
| Container | Container health check passes | P0 | Not Started |
| Container | Container starts in < 60s | P0 | Not Started |
| Container | Container handles graceful shutdown | P1 | Not Started |
| API | Health endpoint returns 200 | P0 | Not Started |
| API | API responds under load (100 RPS) | P1 | Not Started |
| API | Database migrations complete | P0 | Not Started |
| API | Secrets retrieval works | P0 | Not Started |
| Frontend | S3 deployment succeeds | P0 | Not Started |
| Frontend | CloudFront serves index.html | P0 | Not Started |
| Frontend | SPA routing works (404 → index.html) | P0 | Not Started |
| Frontend | API calls work through ALB | P0 | Not Started |
| Integration | End-to-end user flow | P0 | Not Started |
| Integration | Document upload to S3 | P0 | Not Started |
| Integration | AI analysis via Bedrock | P1 | Not Started |
| Integration | Webhook delivery | P1 | Not Started |
| Security | WAF blocks SQL injection | P0 | Not Started |
| Security | WAF blocks XSS | P0 | Not Started |
| Security | Rate limiting works | P1 | Not Started |
| Security | SSL certificate valid | P0 | Not Started |

12.2 Load Testing Plan

# tests/load/config.yaml

# Load test configuration for Tax Practice AI
# Uses k6 or Artillery for load testing

scenarios:
  # Scenario 1: Normal tax season load
  - name: "Normal Load"
    description: "Typical tax season traffic pattern"
    duration: "10m"
    vus: 50  # Virtual users
    targets:
      - endpoint: "/health"
        method: "GET"
        weight: 10
      - endpoint: "/v1/clients"
        method: "GET"
        weight: 20
      - endpoint: "/v1/documents"
        method: "GET"
        weight: 30
      - endpoint: "/v1/documents/upload-url"
        method: "POST"
        weight: 20
      - endpoint: "/v1/returns/{id}/ai/ask"
        method: "POST"
        weight: 20

    thresholds:
      # p95 latency under 500ms
      http_req_duration: ["p(95)<500"]
      # Error rate under 1%
      http_req_failed: ["rate<0.01"]

  # Scenario 2: Peak load (deadline day)
  - name: "Peak Load"
    description: "April 15th deadline traffic"
    duration: "5m"
    vus: 200
    ramp_up: "1m"

    thresholds:
      # p95 latency under 1s during peak
      http_req_duration: ["p(95)<1000"]
      # Error rate under 5%
      http_req_failed: ["rate<0.05"]

  # Scenario 3: Spike test
  - name: "Spike Test"
    description: "Sudden traffic spike"
    stages:
      - duration: "1m"
        target: 50
      - duration: "30s"
        target: 500  # Sudden spike
      - duration: "1m"
        target: 500
      - duration: "30s"
        target: 50   # Return to normal

    thresholds:
      # System should recover within 30s
      http_req_duration: ["p(95)<2000"]
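
The `p(95)<500` thresholds above gate on the 95th-percentile request duration: 95% of requests must finish within the budget. As a quick sketch of what that statistic means (k6 computes it internally; this nearest-rank version is just the arithmetic):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the latency (ms) below which pct
    percent of the observed requests fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]
```

Unlike the mean, p95 surfaces tail latency, which is what users on deadline day actually feel.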

13. Best Practices Checklist

13.1 Infrastructure Best Practices

  • Multi-AZ deployment for high availability
  • Private subnets for compute and database
  • NAT Gateway for outbound internet (not public IPs)
  • Security groups with least-privilege rules
  • VPC Flow Logs enabled for network monitoring
  • Encryption at rest for all data stores (RDS, S3, EBS)
  • Encryption in transit (TLS 1.2+ everywhere)
  • No hardcoded credentials - use Secrets Manager
  • Resource tagging for cost allocation and management
  • Terraform state in S3 with DynamoDB locking

13.2 Security Best Practices

  • WAF enabled with OWASP rules
  • Rate limiting on API endpoints
  • CORS configured for allowed origins only
  • Content Security Policy headers
  • No public S3 buckets
  • CloudTrail enabled for API audit logging
  • GuardDuty enabled for threat detection
  • IAM roles with least-privilege (no root/admin)
  • MFA required for AWS console access
  • Secrets rotation configured

13.3 Operational Best Practices

  • CloudWatch dashboards for key metrics
  • Alarms configured for critical thresholds
  • Log retention policies set
  • Automated backups verified
  • Disaster recovery runbook documented
  • Incident response plan documented
  • Runbooks for common operations
  • On-call rotation established

13.4 Application Best Practices

  • Health check endpoints implemented
  • Graceful shutdown handling
  • Connection pooling for database
  • Retry logic with exponential backoff
  • Circuit breaker for external services
  • Structured logging with request IDs
  • Error tracking (Sentry or similar)
  • Feature flags for gradual rollout
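
Of the items above, retry with exponential backoff is easy to get subtly wrong (no jitter, no cap, unbounded retries). A minimal stdlib sketch, assuming transient errors surface as exceptions; all names and defaults are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying transient failures with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))

# Demo: a call that fails twice, then succeeds (sleep stubbed out).
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda s: None))  # ok
```

The jitter matters under load: without it, many clients that failed together retry together, which is exactly the thundering-herd pattern the circuit breaker item is also guarding against.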

13.5 CI/CD Best Practices

  • Automated tests run on every PR
  • Infrastructure changes require approval
  • Database migrations are reversible
  • Blue-green deployments for zero-downtime
  • Rollback capability tested
  • Environment parity (staging mirrors production)
  • Secrets never in code - injected at runtime

14. Security Considerations

14.1 Data Protection (Tax Compliance)

| Requirement        | Implementation                                    |
|--------------------|---------------------------------------------------|
| SSN/EIN encryption | Field-level encryption in Aurora, KMS keys        |
| Data at rest       | RDS encryption, S3 SSE-KMS, EBS encryption        |
| Data in transit    | TLS 1.2+ everywhere, no HTTP                      |
| Access logging     | CloudTrail, VPC Flow Logs, application audit logs |
| 7-year retention   | S3 lifecycle policies, Aurora backups             |
| PII masking        | Application-level masking in logs                 |
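
Application-level PII masking can be implemented as a logging filter so no code path can accidentally log a raw SSN or EIN. A minimal sketch using the stdlib logging module; the regexes cover only the standard dashed formats and are illustrative, not exhaustive:

```python
import logging
import re

# SSN (123-45-6789) and EIN (12-3456789) shaped strings; illustrative, not exhaustive.
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
_EIN = re.compile(r"\b\d{2}-\d{7}\b")

class PiiMaskingFilter(logging.Filter):
    """Redact SSN/EIN-shaped substrings from every formatted log message."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()          # Format %-style args before masking.
        msg = _SSN.sub("***-**-****", msg)
        msg = _EIN.sub("**-*******", msg)
        record.msg, record.args = msg, None
        return True                        # Keep the (now masked) record.

logger = logging.getLogger("app")
logger.propagate = False                   # Avoid a second, unfiltered emit via root.
handler = logging.StreamHandler()
handler.addFilter(PiiMaskingFilter())
logger.addHandler(handler)
logger.warning("client SSN 123-45-6789 and EIN 12-3456789 received")
# emits: client SSN ***-**-**** and EIN **-******* received
```

Attaching the filter to every handler (or to the root logger's handlers) makes the masking a default rather than something each module has to remember.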

14.2 Network Security

┌─────────────────────────────────────────────────────────────────┐
│                      NETWORK SECURITY LAYERS                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Layer 1: Edge Protection                                       │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  AWS WAF + Shield                                        │   │
│   │  • SQL injection protection                              │   │
│   │  • XSS protection                                        │   │
│   │  • Rate limiting (1000 req/5min per IP)                  │   │
│   │  • Geo-blocking (US-only for now)                        │   │
│   │  • Bot detection                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Layer 2: Load Balancer                                         │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Application Load Balancer                               │   │
│   │  • TLS termination (ACM certificate)                     │   │
│   │  • HTTP → HTTPS redirect                                 │   │
│   │  • Health checks                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Layer 3: Application                                           │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  ECS Fargate (Private Subnet)                            │   │
│   │  • Security group: ALB only                              │   │
│   │  • JWT authentication                                    │   │
│   │  • RBAC authorization                                    │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
│   Layer 4: Data                                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │  Aurora PostgreSQL (Private Subnet)                      │   │
│   │  • Security group: ECS + Airflow only                    │   │
│   │  • No public access                                      │   │
│   │  • Encrypted connections (SSL required)                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

14.3 WAF Rules

# terraform/modules/waf/main.tf

resource "aws_wafv2_web_acl" "main" {
  name        = "${var.project}-${var.environment}-waf"
  description = "WAF rules for Tax Practice AI"
  scope       = "REGIONAL"

  default_action {
    allow {}
  }

  # Rule 1: Rate limiting
  rule {
    name     = "RateLimitRule"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 1000
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "RateLimitRule"
      sampled_requests_enabled   = true
    }
  }

  # Rule 2: AWS Managed Rules - Common
  rule {
    name     = "AWSManagedRulesCommonRuleSet"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled   = true
    }
  }

  # Rule 3: AWS Managed Rules - SQL Injection
  rule {
    name     = "AWSManagedRulesSQLiRuleSet"
    priority = 3

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesSQLiRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesSQLiRuleSet"
      sampled_requests_enabled   = true
    }
  }

  # Rule 4: AWS Managed Rules - Known Bad Inputs
  rule {
    name     = "AWSManagedRulesKnownBadInputsRuleSet"
    priority = 4

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesKnownBadInputsRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesKnownBadInputsRuleSet"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "${var.project}-${var.environment}-waf"
    sampled_requests_enabled   = true
  }

  tags = {
    Name        = "${var.project}-${var.environment}-waf"
    Environment = var.environment
  }
}

15. Cloud Service Security Configurations

This section provides specific security configurations for each AWS service used in Tax Practice AI. These are recommended settings for a tax/financial application handling sensitive data.

15.1 VPC Security Configuration

# =============================================================================
# VPC SECURITY SETTINGS
# =============================================================================

# CIDR Block: 10.0.0.0/16
# - Provides 65,536 IP addresses
# - Private enough to not conflict with common networks
# - Large enough for future growth

vpc_cidr = "10.0.0.0/16"

# Subnet Layout:
# Public Subnets (for ALB, NAT Gateway):
#   - 10.0.0.0/20 (AZ-a) - 4,096 IPs
#   - 10.0.16.0/20 (AZ-b) - 4,096 IPs
#   - 10.0.32.0/20 (AZ-c) - 4,096 IPs
#
# Private Subnets (for ECS, RDS, Lambda):
#   - 10.0.48.0/20 (AZ-a) - 4,096 IPs
#   - 10.0.64.0/20 (AZ-b) - 4,096 IPs
#   - 10.0.80.0/20 (AZ-c) - 4,096 IPs

# VPC Flow Logs: ENABLED
# - Captures all traffic (ACCEPT and REJECT)
# - Retention: 90 days in CloudWatch Logs
# - Used for security analysis and troubleshooting
flow_logs_enabled = true
flow_logs_retention_days = 90

# DNS Settings
enable_dns_hostnames = true  # Required for RDS, ECS
enable_dns_support = true    # Required for VPC DNS resolution

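The subnet layout above can be sanity-checked with Python's stdlib ipaddress module (CIDR values copied from the comments; the check itself is a convenience, not part of the Terraform code):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [
    "10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20",   # public (ALB, NAT Gateway)
    "10.0.48.0/20", "10.0.64.0/20", "10.0.80.0/20",  # private (ECS, RDS, Lambda)
]
nets = [ipaddress.ip_network(s) for s in subnets]

# Every /20 fits inside the VPC and no two overlap.
assert all(n.subnet_of(vpc) for n in nets)
assert all(not a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
print(nets[0].num_addresses)  # 4096
```

(AWS reserves 5 addresses per subnet, so usable capacity is 4,091 per /20, which is still ample headroom.)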
15.2 Security Group Rules (Explicit)

# =============================================================================
# SECURITY GROUP: ALB (Application Load Balancer)
# =============================================================================
# Purpose: Accept traffic from internet, forward to ECS

alb_security_group_rules = {
  ingress = [
    {
      description = "HTTPS from internet"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]  # Allow from anywhere (WAF filters first)
    },
    {
      description = "HTTP for redirect only"
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]  # Redirects to HTTPS
    }
  ]
  egress = [
    {
      description     = "To ECS tasks only"
      from_port       = 8000
      to_port         = 8000
      protocol        = "tcp"
      security_groups = ["ecs_security_group"]  # Reference, not CIDR
    }
  ]
}

# =============================================================================
# SECURITY GROUP: ECS (API Containers)
# =============================================================================
# Purpose: Run FastAPI containers, access RDS and S3

ecs_security_group_rules = {
  ingress = [
    {
      description     = "From ALB only"
      from_port       = 8000
      to_port         = 8000
      protocol        = "tcp"
      security_groups = ["alb_security_group"]  # Only ALB can reach ECS
    }
  ]
  egress = [
    {
      description = "HTTPS to AWS services (S3, Secrets Manager, etc.)"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]  # AWS services via NAT Gateway
    },
    {
      description     = "PostgreSQL to RDS"
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = ["rds_security_group"]
    }
  ]
}

# =============================================================================
# SECURITY GROUP: RDS (Aurora PostgreSQL)
# =============================================================================
# Purpose: Database - most restrictive

rds_security_group_rules = {
  ingress = [
    {
      description     = "PostgreSQL from ECS"
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = ["ecs_security_group"]
    },
    {
      description     = "PostgreSQL from Airflow"
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = ["airflow_security_group"]
    },
    {
      description     = "PostgreSQL from Lambda"
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = ["lambda_security_group"]
    }
  ]
  egress = []  # RDS does not need outbound access
}

# =============================================================================
# SECURITY GROUP: Airflow (EC2)
# =============================================================================
# Purpose: Workflow orchestration - restricted access

airflow_security_group_rules = {
  ingress = [
    {
      description = "Airflow UI - Admin IPs only"
      from_port   = 8080
      to_port     = 8080
      protocol    = "tcp"
      cidr_blocks = ["YOUR_OFFICE_IP/32", "YOUR_VPN_IP/32"]  # REPLACE with actual IPs
    },
    {
      description = "SSH - Admin IPs only (or use Session Manager instead)"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      cidr_blocks = ["YOUR_OFFICE_IP/32"]  # REPLACE or remove if using SSM
    }
  ]
  egress = [
    {
      description = "HTTPS to AWS services"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    },
    {
      description     = "PostgreSQL to RDS"
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = ["rds_security_group"]
    }
  ]
}

15.3 Aurora PostgreSQL Security Configuration

# =============================================================================
# AURORA POSTGRESQL - SECURITY SETTINGS
# =============================================================================

aurora_security_config = {
  # Encryption at rest: REQUIRED
  # Uses AWS KMS for encryption
  storage_encrypted = true
  kms_key_id        = "alias/tax-practice-rds"  # Customer managed key

  # Encryption in transit: REQUIRED
  # Force SSL connections
  # Set via parameter group: rds.force_ssl = 1

  # Network isolation
  publicly_accessible = false  # NEVER expose to internet
  db_subnet_group     = "private-subnets-only"

  # Authentication
  iam_database_authentication_enabled = true  # Allow IAM auth for ECS

  # Backup settings (automated backups give 7 days of point-in-time recovery;
  # the 7-year IRS retention is met via snapshots and S3 lifecycle policies)
  backup_retention_period = 7  # Days for automated backups
  preferred_backup_window = "03:00-04:00"  # UTC, during low traffic

  # Deletion protection: ENABLED for production
  deletion_protection = true  # Prevent accidental deletion

  # Enhanced monitoring
  monitoring_interval = 60  # Seconds (0 to disable)
  monitoring_role_arn = "arn:aws:iam::ACCOUNT:role/rds-monitoring-role"

  # Performance Insights: ENABLED
  performance_insights_enabled          = true
  performance_insights_retention_period = 7  # Days

  # CloudWatch Logs export
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # Auto minor version upgrade
  auto_minor_version_upgrade = true

  # Maintenance window
  preferred_maintenance_window = "sun:04:00-sun:05:00"  # UTC
}

# Parameter Group Settings
aurora_parameter_group = {
  family = "aurora-postgresql15"

  parameters = [
    {
      name  = "rds.force_ssl"
      value = "1"  # REQUIRE SSL connections
    },
    {
      name  = "log_statement"
      value = "ddl"  # Log DDL statements for audit
    },
    {
      name  = "log_connections"
      value = "1"  # Log connection attempts
    },
    {
      name  = "log_disconnections"
      value = "1"  # Log disconnections
    },
    {
      name  = "password_encryption"
      value = "scram-sha-256"  # Strong password hashing
    }
  ]
}

15.4 S3 Security Configuration

# =============================================================================
# S3 BUCKET - DOCUMENT STORAGE SECURITY
# =============================================================================

s3_security_config = {
  # Block ALL public access: REQUIRED
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true

  # Versioning: ENABLED
  # Allows recovery of accidentally deleted/modified documents
  versioning_enabled = true

  # Server-side encryption: AWS KMS
  sse_algorithm     = "aws:kms"
  kms_master_key_id = "alias/tax-practice-s3"  # Customer managed key
  bucket_key_enabled = true  # Reduce KMS costs

  # Object lock: OPTIONAL (for compliance hold)
  # Enable if you need WORM (Write Once Read Many)
  object_lock_enabled = false

  # Access logging: ENABLED
  logging_enabled = true
  logging_bucket  = "tax-practice-access-logs"
  logging_prefix  = "s3-documents/"

  # Lifecycle rules
  lifecycle_rules = [
    {
      id      = "archive-after-3-years"
      enabled = true
      prefix  = "clients/"
      transitions = [
        {
          days          = 1095  # 3 years
          storage_class = "GLACIER"
        }
      ]
    },
    {
      id      = "delete-old-versions"
      enabled = true
      noncurrent_version_expiration = {
        days = 90
      }
    },
    {
      id      = "abort-incomplete-uploads"
      enabled = true
      abort_incomplete_multipart_upload = {
        days_after_initiation = 7
      }
    }
  ]
}

# Bucket Policy: Restrict access to specific roles
s3_bucket_policy = {
  Statement = [
    {
      Sid       = "DenyUnencryptedUploads"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:PutObject"
      Resource  = "arn:aws:s3:::tax-practice-documents/*"
      Condition = {
        StringNotEquals = {
          "s3:x-amz-server-side-encryption" = "aws:kms"
        }
      }
    },
    {
      Sid       = "DenyInsecureConnections"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource  = [
        "arn:aws:s3:::tax-practice-documents",
        "arn:aws:s3:::tax-practice-documents/*"
      ]
      Condition = {
        Bool = {
          "aws:SecureTransport" = "false"
        }
      }
    },
    {
      Sid       = "AllowECSTaskRole"
      Effect    = "Allow"
      Principal = {
        AWS = "arn:aws:iam::ACCOUNT:role/tax-practice-ecs-task-role"
      }
      Action = [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ]
      Resource = "arn:aws:s3:::tax-practice-documents/*"
    }
  ]
}

# CORS: Restrict to known origins
s3_cors_rules = [
  {
    allowed_headers = ["*"]
    allowed_methods = ["GET", "PUT", "POST"]
    allowed_origins = [
      "https://portal.taxpractice.ai",
      "https://app.taxpractice.ai",
      "https://staging.taxpractice.ai"
    ]
    expose_headers  = ["ETag"]
    max_age_seconds = 3600
  }
]

15.5 ECS Fargate Security Configuration

# =============================================================================
# ECS FARGATE - CONTAINER SECURITY
# =============================================================================

ecs_security_config = {
  # Network mode: awsvpc (required for Fargate)
  # Gives each task its own ENI and security group
  network_mode = "awsvpc"

  # Task networking
  assign_public_ip = false  # NEVER assign public IP to tasks

  # Container configuration
  container_config = {
    # Run as non-root user
    user = "1000:1000"  # appuser:appgroup

    # Read-only root filesystem
    readonly_root_filesystem = true

    # No privileged mode
    privileged = false

    # Resource limits (prevent runaway containers)
    cpu    = 512   # 0.5 vCPU
    memory = 1024  # 1 GB

    # Health check
    health_check = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      start_period = 60
    }

    # Logging
    log_configuration = {
      log_driver = "awslogs"
      options = {
        awslogs_group         = "/ecs/tax-practice-api"
        awslogs_region        = "us-east-1"
        awslogs_stream_prefix = "api"
      }
    }
  }

  # Task execution role (for ECS to pull images, write logs)
  execution_role_permissions = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "secretsmanager:GetSecretValue"  # For injecting secrets
  ]

  # Task role (for the container to access AWS services)
  task_role_permissions = [
    {
      effect    = "Allow"
      actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
      resources = ["arn:aws:s3:::tax-practice-documents/*"]
    },
    {
      effect    = "Allow"
      actions   = ["bedrock:InvokeModel"]
      resources = ["arn:aws:bedrock:us-east-1::foundation-model/anthropic.*"]
    },
    {
      effect    = "Allow"
      actions   = ["secretsmanager:GetSecretValue"]
      resources = ["arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:tax-practice/*"]
    }
  ]

  # ECS Exec for debugging (STAGING ONLY)
  enable_execute_command = false  # Set to true for staging
}

15.6 ALB Security Configuration

# =============================================================================
# APPLICATION LOAD BALANCER - SECURITY
# =============================================================================

alb_security_config = {
  # Internal: NO (internet-facing)
  internal = false

  # Idle timeout
  idle_timeout = 60  # Seconds

  # Deletion protection: ENABLED for production
  enable_deletion_protection = true

  # HTTP/2: ENABLED
  enable_http2 = true

  # Access logs: ENABLED
  access_logs = {
    enabled = true
    bucket  = "tax-practice-access-logs"
    prefix  = "alb/"
  }

  # TLS Policy: Use most secure available
  ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"

  # Listener rules
  listeners = {
    https = {
      port            = 443
      protocol        = "HTTPS"
      certificate_arn = "arn:aws:acm:us-east-1:ACCOUNT:certificate/xxx"

      default_action = {
        type             = "forward"
        target_group_arn = "api-target-group"
      }
    }

    http = {
      port     = 80
      protocol = "HTTP"

      # ALWAYS redirect HTTP to HTTPS
      default_action = {
        type = "redirect"
        redirect = {
          port        = "443"
          protocol    = "HTTPS"
          status_code = "HTTP_301"
        }
      }
    }
  }

  # Target group health check
  health_check = {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }
}

15.7 CloudFront Security Configuration

# =============================================================================
# CLOUDFRONT - CDN SECURITY
# =============================================================================

cloudfront_security_config = {
  # Price class: US, Canada, Europe (reduce attack surface)
  price_class = "PriceClass_100"

  # HTTP versions
  http_version = "http2and3"

  # TLS: Require TLS 1.2+
  viewer_certificate = {
    acm_certificate_arn      = "arn:aws:acm:us-east-1:ACCOUNT:certificate/xxx"
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  # Viewer protocol policy: HTTPS only
  viewer_protocol_policy = "redirect-to-https"

  # Origin protocol policy: HTTPS to S3 (OAC)
  origin_access_control = {
    origin_access_control_origin_type = "s3"
    signing_behavior                  = "always"
    signing_protocol                  = "sigv4"
  }

  # Response headers policy (security headers)
  response_headers_policy = {
    security_headers_config = {
      # Strict-Transport-Security
      strict_transport_security = {
        access_control_max_age_sec = 31536000  # 1 year
        include_subdomains         = true
        preload                    = true
        override                   = true
      }

      # Content-Type-Options
      content_type_options = {
        override = true
      }

      # Frame-Options
      frame_options = {
        frame_option = "DENY"
        override     = true
      }

      # XSS-Protection
      xss_protection = {
        mode_block = true
        protection = true
        override   = true
      }

      # Content-Security-Policy
      content_security_policy = {
        content_security_policy = "default-src 'self'; script-src 'self' 'unsafe-inline'; style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; font-src 'self'; connect-src 'self' https://api.taxpractice.ai"
        override                = true
      }

      # Referrer-Policy
      referrer_policy = {
        referrer_policy = "strict-origin-when-cross-origin"
        override        = true
      }
    }
  }

  # Geo restriction: US only (adjust as needed)
  geo_restriction = {
    restriction_type = "whitelist"
    locations        = ["US"]  # Add more if needed
  }

  # WAF: ATTACHED (CloudFront requires a CLOUDFRONT-scope web ACL, created in us-east-1;
  # a REGIONAL web ACL can only attach to the ALB)
  web_acl_id = "arn:aws:wafv2:us-east-1:ACCOUNT:global/webacl/tax-practice-waf/xxx"
}
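
The Content-Security-Policy value above packs several directives into one line. A tiny stdlib parser makes it easier to review in code review (the string is copied from the config; the parser is illustrative):

```python
# Split a CSP string into {directive: [sources]} for easier review.
csp = ("default-src 'self'; script-src 'self' 'unsafe-inline'; "
       "style-src 'self' 'unsafe-inline'; img-src 'self' data: https:; "
       "font-src 'self'; connect-src 'self' https://api.taxpractice.ai")

directives = {}
for clause in csp.split(";"):
    name, *sources = clause.split()  # First token is the directive name.
    directives[name] = sources

print(directives["connect-src"])  # ["'self'", 'https://api.taxpractice.ai']
```

Reviewed this way, it is easy to spot that 'unsafe-inline' is still allowed for scripts and styles; tightening that (e.g. with nonces or hashes) is a reasonable follow-up hardening item.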

15.8 WAF Security Configuration

# =============================================================================
# AWS WAF - WEB APPLICATION FIREWALL
# =============================================================================

waf_security_config = {
  # Scope: REGIONAL (for ALB) or CLOUDFRONT
  scope = "REGIONAL"

  # Default action: ALLOW (rules block specific threats)
  default_action = "allow"

  # Rate limiting
  rate_limit_rule = {
    name     = "RateLimitPerIP"
    priority = 1
    limit    = 1000  # Requests per 5-minute window per IP
    action   = "block"
  }

  # AWS Managed Rules (recommended set)
  managed_rules = [
    {
      name            = "AWSManagedRulesCommonRuleSet"
      vendor_name     = "AWS"
      priority        = 10
      override_action = "none"  # Use rule actions as-is
    },
    {
      name            = "AWSManagedRulesSQLiRuleSet"
      vendor_name     = "AWS"
      priority        = 20
      override_action = "none"
    },
    {
      name            = "AWSManagedRulesKnownBadInputsRuleSet"
      vendor_name     = "AWS"
      priority        = 30
      override_action = "none"
    },
    {
      name            = "AWSManagedRulesLinuxRuleSet"
      vendor_name     = "AWS"
      priority        = 40
      override_action = "none"
    },
    {
      name            = "AWSManagedRulesAmazonIpReputationList"
      vendor_name     = "AWS"
      priority        = 50
      override_action = "none"
    }
  ]

  # Custom rules
  custom_rules = [
    {
      name     = "BlockBadBots"
      priority = 60
      action   = "block"
      statement = {
        byte_match_statement = {
          field_to_match = {
            single_header = {
              name = "user-agent"
            }
          }
          positional_constraint = "CONTAINS"
          search_string         = "bad-bot"  # Example, add real bad bot signatures
          text_transformations = [
            {
              priority = 0
              type     = "LOWERCASE"
            }
          ]
        }
      }
    }
  ]

  # Logging
  logging_configuration = {
    log_destination_configs = ["arn:aws:logs:us-east-1:ACCOUNT:log-group:aws-waf-logs"]
    redacted_fields = [
      {
        single_header = {
          name = "authorization"  # Don't log auth tokens
        }
      }
    ]
  }
}

15.9 Secrets Manager Security Configuration

# =============================================================================
# AWS SECRETS MANAGER - CREDENTIALS MANAGEMENT
# =============================================================================

secrets_manager_config = {
  # KMS encryption: Customer managed key
  kms_key_id = "alias/tax-practice-secrets"

  # Secret rotation: ENABLED for database credentials
  rotation_rules = {
    automatically_after_days = 30
  }

  # Recovery window: 7 days (can recover deleted secrets)
  recovery_window_in_days = 7

  # Secrets to create
  secrets = {
    "tax-practice/production/db-credentials" = {
      description = "Aurora PostgreSQL master credentials"
      secret_string = {
        username = "app_user"
        password = "GENERATED_BY_TERRAFORM"  # Use random_password
      }
    }

    "tax-practice/production/jwt-secret" = {
      description = "JWT signing key"
      # Generate with: openssl rand -base64 64
    }

    "tax-practice/production/stripe" = {
      description = "Stripe API keys"
      secret_string = {
        secret_key      = "sk_live_xxx"
        publishable_key = "pk_live_xxx"
        webhook_secret  = "whsec_xxx"
      }
    }

    "tax-practice/production/persona" = {
      description = "Persona identity verification"
      secret_string = {
        api_key        = "xxx"
        template_id    = "xxx"
        webhook_secret = "xxx"
      }
    }

    # Add other secrets as needed...
  }

  # Resource policy: Restrict access to specific roles
  resource_policy = {
    Statement = [
      {
        Sid       = "AllowECSTaskRole"
        Effect    = "Allow"
        Principal = {
          AWS = "arn:aws:iam::ACCOUNT:role/tax-practice-ecs-task-role"
        }
        Action   = "secretsmanager:GetSecretValue"
        Resource = "*"
        Condition = {
          StringEquals = {
            "secretsmanager:ResourceTag/Environment" = "production"
          }
        }
      }
    ]
  }
}
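
Inside the application, these secrets are fetched at runtime rather than baked into config. A minimal caching wrapper, with the actual Secrets Manager call injected so the sketch has no AWS dependency (in production the fetcher would wrap boto3's Secrets Manager GetSecretValue; all names here are illustrative):

```python
import json
from functools import lru_cache
from typing import Callable

def make_secret_loader(fetch: Callable[[str], str]):
    """Return a cached loader; fetch(secret_id) must return the raw secret string.

    In production, fetch would call AWS Secrets Manager (e.g. via boto3);
    here it is injected so the sketch stays self-contained and testable.
    """
    @lru_cache(maxsize=None)
    def load(secret_id: str) -> dict:
        return json.loads(fetch(secret_id))  # Secrets are stored as JSON strings.
    return load

# Stand-in fetcher simulating a stored JSON secret.
fake_store = {
    "tax-practice/production/db-credentials":
        '{"username": "app_user", "password": "example-only"}'
}
load = make_secret_loader(fake_store.__getitem__)
creds = load("tax-practice/production/db-credentials")
print(creds["username"])  # app_user
```

Caching keeps per-request latency down, but it also means a rotation only takes effect after the cache is invalidated or the task is recycled, which is worth keeping in mind with the 30-day rotation rule above.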

15.10 KMS Key Configuration

# =============================================================================
# AWS KMS - ENCRYPTION KEYS
# =============================================================================

kms_keys = {
  # Key for RDS encryption
  "tax-practice-rds" = {
    description             = "KMS key for Aurora PostgreSQL encryption"
    deletion_window_in_days = 30
    enable_key_rotation     = true  # Automatic annual rotation

    policy = {
      Statement = [
        {
          Sid    = "AllowRDSAccess"
          Effect = "Allow"
          Principal = {
            Service = "rds.amazonaws.com"
          }
          Action = [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:GenerateDataKey*"
          ]
          Resource = "*"
        }
      ]
    }
  }

  # Key for S3 encryption
  "tax-practice-s3" = {
    description             = "KMS key for S3 document encryption"
    deletion_window_in_days = 30
    enable_key_rotation     = true

    policy = {
      Statement = [
        {
          Sid    = "AllowS3Access"
          Effect = "Allow"
          Principal = {
            Service = "s3.amazonaws.com"
          }
          Action = [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:GenerateDataKey*"
          ]
          Resource = "*"
        }
      ]
    }
  }

  # Key for Secrets Manager
  "tax-practice-secrets" = {
    description             = "KMS key for Secrets Manager"
    deletion_window_in_days = 30
    enable_key_rotation     = true
  }
}

15.11 IAM Roles and Policies

# =============================================================================
# IAM ROLES - LEAST PRIVILEGE
# =============================================================================

iam_roles = {
  # ECS Task Execution Role (for ECS to pull images, write logs)
  "tax-practice-ecs-execution-role" = {
    assume_role_policy = {
      Service = "ecs-tasks.amazonaws.com"
    }
    managed_policies = [
      "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
    ]
    inline_policies = [
      {
        name = "SecretsAccess"
        policy = {
          Statement = [
            {
              Effect   = "Allow"
              Action   = "secretsmanager:GetSecretValue"
              Resource = "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:tax-practice/*"
            }
          ]
        }
      }
    ]
  }

  # ECS Task Role (for container to access AWS services)
  "tax-practice-ecs-task-role" = {
    assume_role_policy = {
      Service = "ecs-tasks.amazonaws.com"
    }
    inline_policies = [
      {
        name = "S3Access"
        policy = {
          Statement = [
            {
              Effect = "Allow"
              Action = [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
              ]
              Resource = "arn:aws:s3:::tax-practice-documents/*"
            },
            {
              Effect   = "Allow"
              Action   = "s3:ListBucket"
              Resource = "arn:aws:s3:::tax-practice-documents"
            }
          ]
        }
      },
      {
        name = "BedrockAccess"
        policy = {
          Statement = [
            {
              Effect   = "Allow"
              Action   = "bedrock:InvokeModel"
              Resource = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.*"
            }
          ]
        }
      },
      {
        name = "SecretsAccess"
        policy = {
          Statement = [
            {
              Effect   = "Allow"
              Action   = "secretsmanager:GetSecretValue"
              Resource = "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:tax-practice/*"
            }
          ]
        }
      }
    ]
  }

  # Airflow EC2 Role
  "tax-practice-airflow-role" = {
    assume_role_policy = {
      Service = "ec2.amazonaws.com"
    }
    managed_policies = [
      "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"  # For Session Manager
    ]
    inline_policies = [
      {
        name = "AirflowPermissions"
        policy = {
          Statement = [
            {
              Effect   = "Allow"
              Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
              Resource = [
                "arn:aws:s3:::tax-practice-dags",
                "arn:aws:s3:::tax-practice-dags/*"
              ]
            },
            {
              Effect   = "Allow"
              Action   = "secretsmanager:GetSecretValue"
              Resource = "arn:aws:secretsmanager:us-east-1:ACCOUNT:secret:tax-practice/*"
            },
            {
              Effect   = "Allow"
              Action   = "lambda:InvokeFunction"
              Resource = "arn:aws:lambda:us-east-1:ACCOUNT:function:tax-practice-*"
            }
          ]
        }
      }
    ]
  }
}
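
The role map above is declarative; one way it could be realized in Terraform is sketched below for the ECS task role (resource names mirror the map, but this is an illustrative translation, not the deployed module):

```hcl
# Sketch: the "tax-practice-ecs-task-role" entry from the map above,
# expressed as concrete Terraform resources.
resource "aws_iam_role" "ecs_task" {
  name = "tax-practice-ecs-task-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "s3_access" {
  name = "S3Access"
  role = aws_iam_role.ecs_task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::tax-practice-documents/*"
      },
      {
        Effect   = "Allow"
        Action   = "s3:ListBucket"
        Resource = "arn:aws:s3:::tax-practice-documents"
      }
    ]
  })
}
```

The Bedrock and Secrets Manager statements from the map would become additional `aws_iam_role_policy` resources (or a `for_each` over the inline policy list) following the same pattern.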

15.12 CloudTrail Configuration

# =============================================================================
# AWS CLOUDTRAIL - AUDIT LOGGING
# =============================================================================

cloudtrail_config = {
  name                          = "tax-practice-audit-trail"
  s3_bucket_name                = "tax-practice-cloudtrail-logs"
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_logging                = true

  # Log file validation (detect tampering)
  enable_log_file_validation = true

  # KMS encryption for logs
  kms_key_id = "alias/tax-practice-cloudtrail"

  # CloudWatch Logs integration
  cloud_watch_logs_group_arn = "arn:aws:logs:us-east-1:ACCOUNT:log-group:cloudtrail:*"  # CloudTrail requires the trailing ":*"
  cloud_watch_logs_role_arn  = "arn:aws:iam::ACCOUNT:role/cloudtrail-to-cloudwatch"

  # Event selectors (what to log)
  event_selectors = [
    {
      read_write_type           = "All"
      include_management_events = true

      data_resources = [
        {
          type   = "AWS::S3::Object"
          values = ["arn:aws:s3:::tax-practice-documents/"]
        }
      ]
    }
  ]

  # Insights (anomaly detection)
  insight_selectors = [
    {
      insight_type = "ApiCallRateInsight"
    },
    {
      insight_type = "ApiErrorRateInsight"
    }
  ]
}
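
The configuration map above corresponds closely to the AWS provider's `aws_cloudtrail` resource. A hedged sketch (referenced bucket and KMS resources are assumed to exist elsewhere in the module):

```hcl
# Sketch: the CloudTrail config above as an aws_cloudtrail resource.
# aws_s3_bucket.cloudtrail_logs and aws_kms_key.cloudtrail are assumed
# to be defined in the same module.
resource "aws_cloudtrail" "audit" {
  name                          = "tax-practice-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  kms_key_id                    = aws_kms_key.cloudtrail.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::tax-practice-documents/"]  # trailing slash = all objects
    }
  }

  insight_selector {
    insight_type = "ApiCallRateInsight"
  }

  insight_selector {
    insight_type = "ApiErrorRateInsight"
  }
}
```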

15.13 Security Summary Table

| Service | Encryption at Rest | Encryption in Transit | Public Access | Logging |
|---|---|---|---|---|
| Aurora PostgreSQL | KMS (customer key) | TLS 1.2+ required | NO | CloudWatch Logs |
| S3 Documents | SSE-KMS (customer key) | HTTPS required | NO (blocked) | Access Logs |
| ECS Fargate | N/A (stateless) | TLS to ALB | NO (private subnet) | CloudWatch Logs |
| Secrets Manager | KMS (customer key) | TLS | N/A | CloudTrail |
| ALB | N/A | TLS 1.2+ (ACM cert) | YES (internet-facing) | Access Logs |
| CloudFront | N/A | TLS 1.2+ | YES (CDN) | Access Logs |
| Airflow EC2 | EBS encryption | TLS for AWS APIs | NO (private subnet) | CloudWatch Logs |

16. Monitoring and Observability

16.1 CloudWatch Dashboard

# terraform/modules/monitoring/dashboard.tf

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "${var.project}-${var.environment}"

  dashboard_body = jsonencode({
    widgets = [
      # Row 1: ECS Metrics
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 8
        height = 6
        properties = {
          title  = "ECS CPU Utilization"
          region = var.aws_region
          metrics = [
            ["AWS/ECS", "CPUUtilization", "ServiceName", "${var.project}-${var.environment}-api", "ClusterName", "${var.project}-${var.environment}"]
          ]
          period = 300
          stat   = "Average"
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 0
        width  = 8
        height = 6
        properties = {
          title  = "ECS Memory Utilization"
          region = var.aws_region
          metrics = [
            ["AWS/ECS", "MemoryUtilization", "ServiceName", "${var.project}-${var.environment}-api", "ClusterName", "${var.project}-${var.environment}"]
          ]
          period = 300
          stat   = "Average"
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 0
        width  = 8
        height = 6
        properties = {
          title  = "Running Task Count"
          region = var.aws_region
          metrics = [
            ["ECS/ContainerInsights", "RunningTaskCount", "ServiceName", "${var.project}-${var.environment}-api", "ClusterName", "${var.project}-${var.environment}"]
          ]
          period = 60
          stat   = "Average"
        }
      },

      # Row 2: ALB Metrics
      {
        type   = "metric"
        x      = 0
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Request Count"
          region = var.aws_region
          metrics = [
            ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", var.alb_arn_suffix]
          ]
          period = 60
          stat   = "Sum"
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Response Time (p95)"
          region = var.aws_region
          metrics = [
            ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", var.alb_arn_suffix]
          ]
          period = 60
          stat   = "p95"
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "HTTP 5xx Errors"
          region = var.aws_region
          metrics = [
            ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", var.alb_arn_suffix]
          ]
          period = 60
          stat   = "Sum"
        }
      },

      # Row 3: Database Metrics
      {
        type   = "metric"
        x      = 0
        y      = 12
        width  = 8
        height = 6
        properties = {
          title  = "Aurora CPU Utilization"
          region = var.aws_region
          metrics = [
            ["AWS/RDS", "CPUUtilization", "DBClusterIdentifier", "${var.project}-${var.environment}"]
          ]
          period = 300
          stat   = "Average"
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 12
        width  = 8
        height = 6
        properties = {
          title  = "Aurora Connections"
          region = var.aws_region
          metrics = [
            ["AWS/RDS", "DatabaseConnections", "DBClusterIdentifier", "${var.project}-${var.environment}"]
          ]
          period = 60
          stat   = "Average"
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 12
        width  = 8
        height = 6
        properties = {
          title  = "Aurora Serverless ACU"
          region = var.aws_region
          metrics = [
            ["AWS/RDS", "ServerlessDatabaseCapacity", "DBClusterIdentifier", "${var.project}-${var.environment}"]
          ]
          period = 60
          stat   = "Average"
        }
      }
    ]
  })
}

16.2 Alarms

# terraform/modules/monitoring/alarms.tf

# High CPU Alarm
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "${var.project}-${var.environment}-ecs-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "ECS CPU utilization above 80%"

  dimensions = {
    ClusterName = "${var.project}-${var.environment}"
    ServiceName = "${var.project}-${var.environment}-api"
  }

  alarm_actions = [var.sns_topic_arn]
  ok_actions    = [var.sns_topic_arn]

  tags = {
    Name        = "${var.project}-${var.environment}-ecs-cpu-high"
    Environment = var.environment
  }
}

# High Error Rate Alarm
resource "aws_cloudwatch_metric_alarm" "alb_5xx_high" {
  alarm_name          = "${var.project}-${var.environment}-alb-5xx-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High 5xx error rate from API"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [var.sns_topic_arn]
  ok_actions    = [var.sns_topic_arn]

  tags = {
    Name        = "${var.project}-${var.environment}-alb-5xx-high"
    Environment = var.environment
  }
}

# Database Connection Alarm
resource "aws_cloudwatch_metric_alarm" "rds_connections_high" {
  alarm_name          = "${var.project}-${var.environment}-rds-connections-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DatabaseConnections"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 200  # 80% of typical max
  alarm_description   = "High database connection count"

  dimensions = {
    DBClusterIdentifier = "${var.project}-${var.environment}"
  }

  alarm_actions = [var.sns_topic_arn]
  ok_actions    = [var.sns_topic_arn]

  tags = {
    Name        = "${var.project}-${var.environment}-rds-connections-high"
    Environment = var.environment
  }
}
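
The alarms above all publish to `var.sns_topic_arn`. A minimal sketch of the topic behind that variable (topic name follows the project's naming convention; the email endpoint is a placeholder):

```hcl
# Sketch: SNS topic and subscription backing var.sns_topic_arn.
resource "aws_sns_topic" "alerts" {
  name              = "${var.project}-${var.environment}-alerts"
  kms_master_key_id = "alias/aws/sns"  # encrypt messages at rest

  tags = {
    Environment = var.environment
  }
}

resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com"  # placeholder; confirm via the emailed link
}
```

Email subscriptions require manual confirmation, so creating them ahead of the first alarm test avoids silent alert loss.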

17. Disaster Recovery

17.1 Backup Strategy

| Component | Backup Method | Frequency | Retention | RTO | RPO |
|---|---|---|---|---|---|
| Aurora | Automated snapshots | Daily | 7 days | 1 hour | 5 min |
| Aurora | Point-in-time recovery | Continuous | 7 days | 15 min | 5 min |
| S3 Documents | Versioning + cross-region replication | Real-time | 7 years | 1 hour | 0 |
| Terraform State | S3 versioning | Real-time | 90 days | 5 min | 0 |
| Application Logs | CloudWatch Logs | Real-time | 90 days | N/A | N/A |
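
The Aurora and S3 rows above map to a few Terraform attributes; a sketch (retention values mirror the table, other required cluster arguments are elided):

```hcl
# Sketch: backup settings for the Aurora cluster and document bucket.
resource "aws_rds_cluster" "main" {
  cluster_identifier      = "${var.project}-${var.environment}"
  engine                  = "aurora-postgresql"
  backup_retention_period = 7               # enables 7-day point-in-time recovery
  preferred_backup_window = "03:00-04:00"   # daily snapshot window (UTC)
  # engine mode, credentials, and networking omitted for brevity
}

resource "aws_s3_bucket_versioning" "documents" {
  bucket = "tax-practice-documents"

  versioning_configuration {
    status = "Enabled"
  }
}
```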

17.2 Recovery Procedures

| Scenario | Recovery Steps | Expected Time |
|---|---|---|
| ECS Task Failure | Auto-replaced by service | < 2 min |
| Availability Zone Failure | Traffic shifts to healthy AZ | < 5 min |
| Database Corruption | Point-in-time recovery | 15-30 min |
| Complete Region Failure | Manual failover to DR region | 1-4 hours |
| Accidental Data Deletion | S3 versioning / DB restore | 15-60 min |

18. Cost Estimates

18.1 Monthly Cost Breakdown

| Service | Configuration | Est. Monthly Cost |
|---|---|---|
| ECS Fargate | 2 tasks × 0.5 vCPU × 1 GB | $30-50 |
| Aurora Serverless v2 | 0.5-4 ACU, 50 GB storage | $80-150 |
| Application Load Balancer | 1 ALB + data transfer | $20-30 |
| CloudFront | 2 distributions, ~100 GB/mo | $10-20 |
| S3 | 100 GB documents | $5-10 |
| NAT Gateway | 1 gateway + data transfer | $35-50 |
| Route 53 | 2 hosted zones + queries | $2-5 |
| Secrets Manager | 10 secrets | $5 |
| CloudWatch | Logs, metrics, alarms | $10-20 |
| WAF | Web ACL + rules | $10-15 |
| EC2 (Airflow) | t3.medium reserved | $23 |
| ACM Certificates | Free | $0 |
| KMS | Keys + requests | $5-10 |
| Total (Staging) | | ~$150-200 |
| Total (Production) | Higher capacity | ~$250-400 |

18.2 Cost Optimization Strategies

  1. Reserved Instances: 1-year commitment for EC2 (Airflow) saves ~30%
  2. Aurora Serverless: Auto-scales down during off-hours
  3. S3 Intelligent-Tiering: Auto-moves cold data to cheaper storage
  4. CloudFront caching: Reduces origin requests
  5. Right-sizing: Monitor and adjust ECS task sizes
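
Strategy 3 is a one-time Terraform change; a sketch (the bucket name matches earlier sections, the rule id is illustrative):

```hcl
# Sketch: move all document objects into S3 Intelligent-Tiering so cold
# data automatically migrates to cheaper access tiers.
resource "aws_s3_bucket_lifecycle_configuration" "documents" {
  bucket = "tax-practice-documents"

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"

    filter {}  # empty filter = apply to every object in the bucket

    transition {
      days          = 0  # transition immediately on upload
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}
```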

19. Rollback Strategy

19.1 Application Rollback

# Rollback procedure for ECS deployment

# Option 1: ECS Console/CLI rollback to previous task definition
# List recent revisions to find the value for PREVIOUS_VERSION:
aws ecs list-task-definitions --family-prefix tax-practice-production-api --sort DESC --max-items 2

aws ecs update-service \
  --cluster tax-practice-production \
  --service tax-practice-production-api \
  --task-definition tax-practice-production-api:PREVIOUS_VERSION \
  --force-new-deployment

# Option 2: Revert code and redeploy
git revert HEAD
git push origin main  # Triggers CI/CD

19.2 Infrastructure Rollback

# Rollback Terraform changes

# Option 1: Revert to previous state
cd infrastructure/terraform/environments/production
terraform workspace select production  # must run inside the initialized config directory

# Show what will change
terraform plan -target=module.affected_module

# Apply previous configuration from git
git checkout HEAD~1 -- .
terraform apply

# Option 2: Import and fix manually
terraform import aws_resource.name resource_id

19.3 Database Rollback

# Option 1: Revert migration
# Each migration has a corresponding down migration

# Option 2: Point-in-time recovery
# Use the AWS Console or CLI to restore to a specific time
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier tax-practice-production \
  --db-cluster-identifier tax-practice-production-restored \
  --restore-to-time 2024-12-27T10:00:00Z

Next Steps

  1. Review this plan with stakeholders
  2. Create Terraform backend (S3 bucket + DynamoDB table)
  3. Implement Phase 1 (VPC, security groups)
  4. Set up staging environment first
  5. Production deployment after staging validation
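
Step 2 above prepares remote state and locking; the matching backend block would look roughly like this (bucket and table names are placeholders to be finalized when the backend is created, before the first `terraform init`):

```hcl
# Sketch: S3 remote state with DynamoDB locking for the production workspace.
terraform {
  backend "s3" {
    bucket         = "tax-practice-terraform-state"   # versioned, encrypted bucket
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tax-practice-terraform-locks"   # state locking table
  }
}
```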

Document History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-12-27 | Don McCarty | Initial cloud deployment plan |