Phenom Infrastructure

Terraform infrastructure as code for deploying the Phenom application stack on AWS ECS

This section contains infrastructure documentation for the Phenom application stack. Access is restricted to the infrastructure team.

Overview

Phenom Infrastructure provides Terraform infrastructure as code for deploying the complete Phenom application stack on AWS ECS. This repository contains modular Terraform configurations that create a production-ready cloud environment with security, scalability, and monitoring best practices.

Current security posture (2026-05-28). The chat data plane (chat.thephenom.app) is Cloudflare-proxied with SSL Full and an mTLS-locked ALB origin (mutual_authentication=verify): the raw-origin bypass is closed, and the chat root 302-redirects non-client traffic to try.thephenom.app (the Synapse-Admin UI is no longer public). The WAFv2 WebACL phenom-prod-chat-protect (4 rules) runs in COUNT mode (matching requests are counted, not blocked; defense-in-depth). Auth is AWS Cognito (prod pool us-east-1_knEL7cqS3); registration is disabled.

IaC debt (pending): the WAF, the waf-alb-association IAM grant, the ALB trust store + listener mTLS, the per-hostname Cloudflare AOP, the SSL config rule, and the rate-limit rule were applied via the AWS/Cloudflare APIs and are not yet codified in Terraform (tracked in phenom-infra feature/94, not merged). Full current chat state: Chat Services; NEST service map: NEST Infrastructure.

Staging refreshed as a prod carbon copy (2026-06-20). Staging (phenom-dev-postgres, Cognito pool us-east-1_n8gO6SbP6, media bucket phenom-staging-media) was rebuilt as an exact copy of production data: the RDS instance was restored from a prod snapshot (phenom-prod-postgres-clone-20260620, via snapshot_identifier in environments/development, PR phenom-infra#150), and prod S3 media was synced into the staging bucket. The staging Cognito pool was then wiped and repopulated with production’s 31 users, and the database users.id values were remapped to the new staging Cognito subs (FK cascade across all user-referencing tables).

Staging test credentials: all imported staging users have the password Password12345! (internal/testing only — staging is not production data-sensitive after this refresh).

IaC follow-ups (pending): the dev module.rds config still hardcodes database_username = "phenomhabu" while the prod-snapshot master is phenomprod, so terraform plan will want to re-replace the instance until a follow-up PR sets database_username = "phenomprod" and adds lifecycle { ignore_changes = [snapshot_identifier] }. The synapse_staging chat database was dropped by the instance replace and needs the chat-synapse provisioner re-run.

Repository

GitHub Repository: Phenom-earth/phenom-infra

Architecture

The infrastructure deploys a comprehensive AWS environment including:

Core Infrastructure

  • VPC: Virtual Private Cloud with public/private/database subnets across multiple availability zones
  • ECS Fargate: Containerized application cluster with auto-scaling capabilities
  • Application Load Balancer: Traffic routing and SSL termination
  • RDS PostgreSQL: Managed database service (PostgreSQL 17.4) with automated backups
  • AWS Secrets Manager: Secure credential and configuration storage
  • AWS Cognito: User authentication and authorization with Hasura integration
  • S3 Storage: Multiple buckets for general storage and video/image uploads
  • Lambda Functions: Serverless compute for authentication hooks and file validation
  • API Gateway: REST API for secure upload workflows

Service Stack

Note (2026-05-28): the list below reflects the original Nhost-style design. In current production, auth is AWS Cognito (not a Hasura Auth service) and file storage is AWS S3 via the nest-api Worker (not a Hasura Storage service); “Nhost Functions” is legacy. The phenom-prod-cluster also runs Synapse (Matrix chat) + Hasura + the chat MCP. The authoritative current service map is NEST Infrastructure and Chat Services; this list is pending reconciliation with 007.

The ECS cluster runs the following containerized services:

  1. GraphQL Service (Hasura GraphQL Engine)

    • Port: 8080
    • Provides GraphQL API and database migrations
    • Integrated with Cognito for JWT authentication
  2. Auth Service (Hasura Auth)

    • Port: 4000
    • Handles authentication and JWT token management
    • Enhanced with Cognito integration
  3. Storage Service (Hasura Storage)

    • Port: 5000
    • Manages file uploads and storage operations
    • Utilizes S3 backend
  4. Functions Service (Nhost Functions)

    • Port: 3000
    • Executes serverless functions

Video/Image Upload System

Serverless file upload pipeline with validation and security:

  • API Gateway: REST API for pre-signed URL generation
  • Lambda: Pre-signed URL Generator: Password-protected URL generation with 1-hour expiry
  • S3 Staging Bucket: Temporary storage with 24-hour auto-cleanup
  • Lambda: File Validator: Automatic validation using magic bytes, optional virus scanning
  • S3 Final Bucket: Permanent storage for validated media organized by type
  • Client Hosting: S3-hosted upload interface

Authentication Integration

AWS Cognito integrated with Hasura GraphQL:

  • Cognito User Pool: Email-based authentication with MFA support
  • Lambda: Token Enhancement: Adds Hasura JWT claims to Cognito tokens
  • Lambda: User Sync: Automatically syncs authenticated users to Hasura database
  • OAuth 2.0 Flow: Implicit grant with callback support

Prerequisites

Before deploying the infrastructure, ensure you have:

  • Terraform >= 1.0
  • AWS CLI configured with appropriate credentials
  • AWS Account with sufficient permissions to create resources

Quick Start

1. Configure AWS Credentials

# Option 1: AWS CLI configuration
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Option 3: AWS Profile
export AWS_PROFILE="your-profile-name"

2. Choose Environment

cd environments/<desired-env>

# Examples:
cd environments/development
# or
cd environments/production

3. Deploy Infrastructure

# Initialize Terraform
terraform init

# Review planned changes
terraform plan

# Deploy infrastructure
terraform apply

Environment Structure

environments/
├── development/
│   ├── main.tf          # Main configuration
│   ├── locals.tf        # Environment-specific variables
│   ├── versions.tf      # Terraform and provider versions
│   ├── backend.tf       # Remote state configuration
│   └── outputs.tf       # Output values
└── production/
    └── ... (same structure)

Prod and staging are fully separated environments, each with its own Terraform state. Production lives in environments/production/ and staging lives in environments/development/ (the development directory is the staging environment — phenom-dev-postgres, Cognito pool us-east-1_n8gO6SbP6). Each environment has its own backend.tf remote state — the two state files are completely isolated, and no state is shared or cross-referenced between them.

This separation is not ideal — it means resources, modules, and changes have to be authored twice (once per environment), which invites duplication and config drift between prod and staging. But it is the model we run today, so all infrastructure must be built out with this separation of concerns in mind: add production resources to the production environment directory and staging resources to the staging environment directory, keep each environment’s state isolated, never reach across into the other environment’s state, and treat prod and staging as independent stacks that happen to share a module layout. Until/unless this is consolidated, assume nothing is shared across the prod ↔ staging boundary.

Infrastructure Modules

Networking Module (modules/networking/)

  • VPC: 10.0.0.0/16 CIDR with Internet Gateway
  • 3-Tier Subnet Architecture:
    • Public Subnets (10.0.0.0/24, 10.0.1.0/24) - For ALB
    • Private Subnets (10.0.10.0/24, 10.0.11.0/24) - For ECS tasks
    • Database Subnets (10.0.20.0/24, 10.0.21.0/24) - For RDS
  • NAT Gateways for private subnet egress (optional)
  • Security groups for ALB, ECS tasks, and RDS with least-privilege rules
  • Outputs: VPC ID, subnet IDs, security group IDs
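
The subnet plan above can be sanity-checked offline. The following is a minimal sketch, not part of the repository: it uses Python's standard ipaddress module to confirm that every listed subnet sits inside the VPC CIDR and that no two subnets overlap. The names VPC_CIDR, TIERS and check_layout are hypothetical and exist only for this illustration.

```python
import ipaddress

# CIDRs copied from the networking module summary above.
VPC_CIDR = "10.0.0.0/16"
TIERS = {
    "public": ["10.0.0.0/24", "10.0.1.0/24"],
    "private": ["10.0.10.0/24", "10.0.11.0/24"],
    "database": ["10.0.20.0/24", "10.0.21.0/24"],
}


def check_layout(vpc_cidr, tiers):
    """Return the subnet count; raise ValueError if the layout is inconsistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for cidrs in tiers.values() for c in cidrs]
    # Every subnet must be carved out of the VPC range.
    for net in subnets:
        if not net.subnet_of(vpc):
            raise ValueError(f"{net} is outside {vpc}")
    # No two subnets may share addresses.
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                raise ValueError(f"{first} overlaps {second}")
    return len(subnets)
```

Calling check_layout(VPC_CIDR, TIERS) returns the number of subnets when the plan is consistent, which makes it easy to re-run after editing a CIDR.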
graph TB
    IGW["Internet Gateway"]
    subgraph "Public Tier"
        ALB["Application Load Balancer<br/>Port 80/443<br/>Path-based routing<br/>Health checks /healthz"]
        NAT["NAT Gateway<br/>Private subnet egress"]
    end
    subgraph "Private Tier - ECS"
        TG1["Target Group<br/>GraphQL:8080"]
        TG2["Target Group<br/>Auth:4000"]
        TG3["Target Group<br/>Storage:5000"]
        TG4["Target Group<br/>Functions:3000"]
        ECS1["ECS Task<br/>Hasura GraphQL"]
        ECS2["ECS Task<br/>Hasura Auth"]
        ECS3["ECS Task<br/>Hasura Storage"]
        ECS4["ECS Task<br/>Nhost Functions"]
    end
    subgraph "Database Tier"
        RDS["RDS PostgreSQL<br/>db.m5.large<br/>20GB → 100GB<br/>Private subnet"]
    end
    subgraph "Storage & Secrets"
        S3["S3 Buckets<br/>General, Staging,<br/>Final, Hosting"]
        Secrets["AWS Secrets Manager<br/>DB credentials<br/>API keys<br/>Passwords"]
    end
    IGW -->|Port 80/443| ALB
    ALB -->|Route /api/graphql| TG1
    ALB -->|Route /api/auth| TG2
    ALB -->|Route /api/storage| TG3
    ALB -->|Route /api/functions| TG4
    TG1 --> ECS1
    TG2 --> ECS2
    TG3 --> ECS3
    TG4 --> ECS4
    ECS1 -->|Query/Update| RDS
    ECS2 -->|Query/Update| RDS
    ECS3 -->|Query/Update| RDS
    ECS4 -->|Query/Update| RDS
    ECS3 -->|Upload/Download| S3
    ECS1 -.->|Read| Secrets
    ECS2 -.->|Read| Secrets
    ECS3 -.->|Read| Secrets
    NAT -->|Egress| IGW
    style ALB fill:#d73429,color:#fff,rx:30
    style RDS fill:#1a1a1a,color:#fff,rx:30
    style S3 fill:#121010,color:#a5e3e8,rx:30
    style Secrets fill:#151515,color:#e0e0e0,rx:30

Application Load Balancer Module (modules/alb/)

  • ALB: Public-facing load balancer in public subnets
  • 4 Target Groups with health checks (/healthz every 30s):
    • GraphQL (port 8080)
    • Auth (port 4000)
    • Storage (port 5000)
    • Functions (port 3000)
  • HTTP listener on port 80 with path-based routing
  • Outputs: ALB DNS name, target group ARNs
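
As a rough illustration of path-based routing, the sketch below maps a request path to a target group and port. The prefixes (/api/graphql, /api/auth, /api/storage, /api/functions) are taken from the route labels in the networking diagram; the exact listener rule patterns in modules/alb/ are not shown in this document, so the prefix-matching logic and the names ROUTES and route are assumptions made for the example.

```python
# (prefix, target group, container port); prefixes come from the diagram labels,
# ports from the target group list above.
ROUTES = [
    ("/api/graphql", "GraphQL", 8080),
    ("/api/auth", "Auth", 4000),
    ("/api/storage", "Storage", 5000),
    ("/api/functions", "Functions", 3000),
]


def route(path):
    """Return (target_group, port) for the first matching prefix, else None."""
    for prefix, name, port in ROUTES:
        # Match the prefix itself or anything beneath it, but not look-alikes
        # such as /api/authz.
        if path == prefix or path.startswith(prefix + "/"):
            return name, port
    return None
```

For example, route("/api/storage/files/1") resolves to the Storage target group on port 5000, while a path with no matching prefix returns None and would fall through to the listener's default action.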

ECS Module (modules/ecs/)

  • ECS Fargate Cluster with Container Insights enabled
  • 4 Task Definitions:
    • Hasura GraphQL Engine (8080)
    • Hasura Auth Service (4000)
    • Hasura Storage Service (5000)
    • Nhost Functions (3000)
  • IAM Roles: Task execution role and task role with necessary permissions
  • CloudWatch Logs: /ecs/phenom-dev log group
  • Secrets Integration: Environment variables from AWS Secrets Manager
  • Outputs: Cluster ARN, service ARNs, task definition ARNs
graph TB
    subgraph Cluster["ECS Fargate Cluster<br/>(Container Insights enabled)"]
        GraphQL["GraphQL Service<br/>Hasura Engine<br/>Port 8080<br/>2 tasks × 0.25vCPU, 0.5GB"]
        Auth["Auth Service<br/>Hasura Auth<br/>Port 4000<br/>2 tasks × 0.25vCPU, 0.5GB"]
        Storage["Storage Service<br/>Hasura Storage<br/>Port 5000<br/>2 tasks × 0.25vCPU, 0.5GB"]
        Functions["Functions Service<br/>Nhost Functions<br/>Port 3000<br/>2 tasks × 0.25vCPU, 0.5GB"]
    end
    ECR["Container Images<br/>ECR Registry"]
    Secrets["AWS Secrets Manager<br/>Environment variables<br/>Database credentials"]
    Logs["CloudWatch Logs<br/>/ecs/phenom-dev"]
    IAM["IAM Roles<br/>Execution & Task roles"]
    Alarms["CloudWatch Alarms<br/>CPU/Memory monitoring"]
    ECR -->|Pull images| Cluster
    Secrets -->|Inject config| Cluster
    Cluster -->|Stream logs| Logs
    Cluster -.->|Assume roles| IAM
    Logs -->|Trigger| Alarms
    style GraphQL fill:#1a1a1a,color:#fff,rx:30
    style Auth fill:#151515,color:#fff,rx:30
    style Storage fill:#121010,color:#a5e3e8,rx:30
    style Functions fill:#1a1a1a,color:#e0e0e0,rx:30
    style Cluster fill:#0f0f0f,color:#e0e0e0

Note: Each service runs 2 tasks for high availability with auto-scaling capabilities.
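
To put the per-task sizing in context, here is a small worked calculation of the baseline footprint before any auto-scaling. It assumes the figures shown for this module (four services, 2 tasks each, 0.25 vCPU and 0.5 GB per task); the constant and function names are hypothetical.

```python
SERVICES = ["graphql", "auth", "storage", "functions"]
TASKS_PER_SERVICE = 2
VCPU_PER_TASK = 0.25
MEMORY_GB_PER_TASK = 0.5


def baseline_capacity():
    """Total tasks, vCPU and memory at the minimum (non-scaled) task count."""
    tasks = len(SERVICES) * TASKS_PER_SERVICE
    return {
        "tasks": tasks,
        "vcpu": tasks * VCPU_PER_TASK,
        "memory_gb": tasks * MEMORY_GB_PER_TASK,
    }
```

Under these assumptions the cluster runs 8 tasks totalling 2 vCPU and 4 GB of memory at baseline; auto-scaling adds to that.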

RDS Module (modules/rds/)

  • PostgreSQL 17.4 on db.m5.large instance
  • Storage: 20GB initial with auto-scaling to 100GB
  • Backup: 7-day retention, daily 03:00-04:00 UTC
  • Maintenance: Sunday 04:00-05:00 UTC
  • Security: Private (not publicly accessible), encrypted at rest
  • Outputs: Endpoint, port, database name, username ARN

S3 Module (modules/s3/)

  • General Storage Bucket: Replaces MinIO for backend storage
  • Features:
    • Versioning support
    • AES256 encryption
    • CORS configuration for API access
    • Public access blocked
    • Lifecycle rules for cleanup (incomplete multipart uploads after 7 days)
  • IAM User: phenom-storage-user with programmatic access
  • Outputs: Bucket name, bucket ARN, access key ID

Video Upload Module (modules/video-upload/) - NEW

Complete serverless file upload system with security and validation:

Components:

  1. API Gateway: REST API /upload/generate-url endpoint

    • Usage plan: 10,000 requests/day, 10 req/sec rate limit
    • CORS enabled for browser uploads
  2. Lambda: presigned-url-generator

    • Runtime: Node.js 18.x, 512 MB, 30s timeout
    • Validates password from Secrets Manager (5-min cache)
    • Validates MIME type and file size (500MB default)
    • Generates unique pre-signed URLs (1-hour expiry)
  3. Lambda: file-validator

    • Runtime: Node.js 18.x, 3008 MB, 300s timeout
    • Triggered by S3 events on staging bucket
    • Magic byte validation (prevents extension spoofing)
    • Optional ClamAV virus scanning
    • Moves valid files to final bucket, deletes invalid
  4. S3 Staging Bucket: Temporary 24-hour storage

  5. S3 Final Bucket: Organized by type (/images/, /videos/)

  6. S3 Client Hosting Bucket: Hosts upload UI

Supported File Types:

  • Videos: MP4, MPEG, QuickTime, AVI, WMV, WebM
  • Images: JPEG, PNG, GIF, WebP, SVG, TIFF, BMP
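
The request checks performed before a pre-signed URL is issued (password, MIME type, file size) can be sketched as below. This is an illustration only, not the Lambda's source, and the deployed function runs on Node.js rather than Python. The MIME strings are the standard ones for the listed formats, but whether the module uses exactly these strings is an assumption, as is interpreting "500MB" as 500 × 1024 × 1024 bytes. The function name validate_request is hypothetical.

```python
import hmac

# Assumption: 500MB is interpreted as 500 MiB.
MAX_SIZE_BYTES = 500 * 1024 * 1024

# Standard MIME types for the supported formats listed above (assumed mapping).
ALLOWED_MIME_TYPES = {
    "video/mp4", "video/mpeg", "video/quicktime",
    "video/x-msvideo", "video/x-ms-wmv", "video/webm",
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "image/svg+xml", "image/tiff", "image/bmp",
}


def validate_request(password, mime_type, size_bytes, valid_passwords):
    """Return a list of rejection reasons; an empty list means accepted."""
    errors = []
    supplied = password.encode("utf-8")
    # Constant-time comparison against every configured password.
    matches = [
        hmac.compare_digest(supplied, candidate.encode("utf-8"))
        for candidate in valid_passwords
    ]
    if not any(matches):
        errors.append("invalid password")
    if mime_type not in ALLOWED_MIME_TYPES:
        errors.append("unsupported MIME type")
    if size_bytes <= 0 or size_bytes > MAX_SIZE_BYTES:
        errors.append("file too large")
    return errors
```

The valid_passwords argument mirrors the list stored under the "passwords" key of the upload secret; fetching and caching that secret is left out of the sketch.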

Security:

  • Password authentication via Secrets Manager
  • Time-limited pre-signed URLs
  • File type validation using magic bytes
  • Optional virus scanning
  • All buckets encrypted (AES256)
  • Rate limiting and quotas
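
Magic-byte validation means inspecting the first bytes of the uploaded object instead of trusting its extension or declared type. The sketch below shows the idea for a subset of the supported formats, using well-known file signatures. It is not the file-validator Lambda's code; the function names are hypothetical, and formats without a fixed binary signature (such as SVG) would need a different check.

```python
def sniff_type(head):
    """Guess a media type from a file's leading bytes, or return None."""
    if head.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # RIFF containers carry their subtype at offset 8.
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    if head[:4] == b"RIFF" and head[8:12] == b"AVI ":
        return "video/x-msvideo"
    # ISO base media files have "ftyp" at offset 4 followed by a brand.
    if head[4:8] == b"ftyp":
        return "video/quicktime" if head[8:12] == b"qt  " else "video/mp4"
    # EBML header, shared by WebM and Matroska.
    if head.startswith(b"\x1a\x45\xdf\xa3"):
        return "video/webm"
    return None


def matches_declared(head, declared_mime):
    """True when the sniffed type agrees with the type the client declared."""
    return sniff_type(head) == declared_mime
```

An executable renamed to photo.png, for instance, starts with the bytes "MZ" rather than the PNG signature, so matches_declared returns False and the file would be rejected.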

Outputs: API endpoint, bucket names, Lambda ARNs, client website URL

Cognito Integration

AWS Cognito User Pool with Hasura integration:

Configuration:

  • User Pool: phenom-dev with email-based authentication
  • Password Policy: 8+ chars, lowercase, uppercase, numbers, symbols
  • MFA: Configurable (currently OFF in dev)
  • OAuth 2.0: Implicit grant flow
  • Callback URLs: localhost:3000 for development

Lambda Triggers:

  1. hasura-cognito-trigger (Pre-Token Generation)

    • Adds Hasura JWT claims to Cognito tokens
    • Claims namespace: https://hasura.io/jwt/claims
    • Includes: user ID, default role, allowed roles
  2. hasura-cognito-sync-users (Post-Authentication)

    • Syncs authenticated users to Hasura database
    • GraphQL mutation: upserts user to users table
    • Retrieves GraphQL endpoint and admin secret from Secrets Manager
    • 5-minute secret caching for performance
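
The Pre-Token Generation step can be illustrated with a short handler. This is a sketch under stated assumptions, not the source of hasura-cognito-trigger: the role name "user" is invented for the example, and the claims are serialized as a JSON string because Cognito's claimsToAddOrOverride map accepts string values only (which in turn assumes Hasura is configured to read stringified JSON claims).

```python
import json

HASURA_NAMESPACE = "https://hasura.io/jwt/claims"


def handler(event, context=None):
    """Pre-Token Generation trigger: attach Hasura claims to the token."""
    user_id = event["request"]["userAttributes"]["sub"]
    hasura_claims = {
        "x-hasura-user-id": user_id,
        "x-hasura-default-role": "user",     # assumed role name
        "x-hasura-allowed-roles": ["user"],  # assumed role list
    }
    # Cognito expects string values here, so the claims object is JSON-encoded.
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {HASURA_NAMESPACE: json.dumps(hasura_claims)}
    }
    return event
```

Cognito passes the event in, and the returned event carries the extra claims into the issued token; the three claim keys match those listed for the trigger above.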

Post-Deployment Configuration

After successful deployment:

  1. Update Database Credentials: Modify database password in AWS Secrets Manager
  2. Configure DNS: Point your domain to the ALB DNS name (provided in Terraform outputs)
  3. Monitor Services: Verify all ECS services are running healthy in AWS Console
  4. Set Video Upload Password (if using video upload module):
    aws secretsmanager update-secret \
      --secret-id "phenom-dev-video-upload-passwords" \
      --secret-string '{"passwords":["your-secure-password"]}'
    
  5. Configure Cognito OAuth (if using Cognito):
    • Update callback URLs in Cognito console for production domains
    • Configure user pool domain for hosted UI (optional)
  6. Test Upload System: Visit the video upload client URL from Terraform outputs

Using the Video Upload System

For Users

  1. Navigate to the upload client URL (from video_client_website_url output)
  2. Enter the upload password (configured in Secrets Manager)
  3. Select file(s) to upload (videos or images)
  4. Click “Upload” - files are validated and processed automatically
  5. Check S3 final bucket for validated files (organized in /images/ or /videos/)

Upload Workflow

flowchart TD
    A["User Browser"] -->|Password + File metadata| B["Upload Client UI"]
    B -->|Request pre-signed URL| C["API Gateway<br/>/upload/generate-url"]
    C -->|Validates password,<br/>MIME type, size| D["Lambda:<br/>presigned-url-generator"]
    D -->|Returns pre-signed URL| E["User Browser"]
    E -->|S3 Direct Upload<br/>via pre-signed URL| F["S3 Staging Bucket"]
    F -->|S3 Event Notification| G["Lambda:<br/>file-validator"]
    G -->|Magic byte validation| H{File Valid?}
    H -->|Yes| I["Move to final bucket"]
    H -->|No| J["Delete file"]
    I --> K["S3 Final Bucket<br/>/images/ or /videos/"]
    J --> K
    K -->|Organized media| L["Ready for Use"]
    style A fill:#1a1a1a,color:#fff,rx:30
    style K fill:#121010,color:#a5e3e8,rx:30
    style L fill:#121010,color:#a5e3e8,rx:30

Security Features

  • Password Authentication: Only users with valid password can generate upload URLs
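  • Pre-signed URLs: Time-limited (1 hour), generated uniquely per request, and used to upload directly to S3. An S3 pre-signed URL stays valid until it expires, so it is not strictly one-time use.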
  • Magic Byte Validation: Prevents extension spoofing attacks
  • File Size Limits: Configurable maximum (default 500MB)
  • Virus Scanning: Optional ClamAV integration for enhanced security
  • Auto-cleanup: Staging files deleted after 24 hours
  • Rate Limiting: API Gateway quotas prevent abuse
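
The two usage plan figures quoted for the upload API (10,000 requests/day and 10 requests/second) interact in a way that is easy to overlook: a client running at the full rate limit uses up the daily quota quickly. The worked arithmetic below shows this; the function names are hypothetical.

```python
DAILY_QUOTA = 10_000      # requests per day, from the usage plan
RATE_LIMIT = 10           # requests per second, from the usage plan
SECONDS_PER_DAY = 86_400


def seconds_to_exhaust_quota(daily_quota, rate_limit):
    """How long a client at the full rate limit takes to spend the quota."""
    return daily_quota / rate_limit


def sustainable_rate(daily_quota):
    """Average requests per second that the quota allows over a whole day."""
    return daily_quota / SECONDS_PER_DAY
```

With these numbers the quota is spent after 1,000 seconds (under 17 minutes) at the full rate, and the rate that can be sustained all day is roughly 0.12 requests per second.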

Terraform Outputs

The infrastructure provides these key outputs:

Core Infrastructure

  • alb_dns_name: Application Load Balancer DNS name
  • service_endpoints: Direct URLs for each deployed service (GraphQL, Auth, Storage, Functions)
  • database_endpoint: RDS PostgreSQL connection endpoint

Video Upload Module

  • video_upload_api_endpoint: API Gateway base URL
  • video_upload_generate_url_endpoint: Full endpoint for pre-signed URL generation
  • video_staging_bucket: S3 staging bucket name
  • video_final_bucket: S3 final storage bucket name
  • video_client_hosting_bucket: S3 bucket hosting upload UI
  • video_client_website_url: Public URL for hosted upload client
  • presigned_url_lambda_arn: URL generator Lambda ARN
  • file_validator_lambda_arn: File validator Lambda ARN

Cognito Authentication

  • cognito_user_pool_id: User pool ID
  • cognito_user_pool_arn: User pool ARN
  • cognito_app_client_id: Application client ID for OAuth flow

S3 Storage

  • s3_bucket_name: General storage bucket name
  • s3_access_key_id: IAM user access key for S3 operations

Security Best Practices

Credential Management

  • Never commit .tfstate files or .tfvars files to version control
  • Use AWS Secrets Manager for all sensitive configuration values
  • Implement least-privilege IAM permissions

Network Security

  • Private subnets for application and database tiers
  • Security groups with minimal required access
  • VPC Flow Logs for network monitoring

Operations and Monitoring

Viewing Service Logs

# Tail ECS service logs
aws logs tail /ecs/phenom-dev --follow

# Check service health status
aws ecs describe-services --cluster phenom-dev-cluster --services phenom-dev-graphql

Common Troubleshooting

  • Permission Issues: Verify AWS credentials have sufficient IAM permissions
  • Resource Conflicts: Check for existing resources created outside Terraform
  • Service Health: Review CloudWatch logs and database connectivity

Destroying Infrastructure

⚠️ Warning: This permanently deletes all resources and data

terraform destroy

Ensure you have backed up any critical data before proceeding.

AWS Services Provisioned

The infrastructure creates the following AWS resources:

| Service | Count | Purpose |
| --- | --- | --- |
| VPC | 1 | Network isolation |
| Subnets | 6 | Public (2), Private (2), Database (2) |
| Internet Gateway | 1 | External connectivity |
| NAT Gateway | 2 | Private subnet egress (optional) |
| Application Load Balancer | 1 | Traffic routing and SSL termination |
| Target Groups | 4 | Service routing (GraphQL, Auth, Storage, Functions) |
| ECS Cluster | 1 | Container orchestration |
| ECS Services | 4 | Containerized applications |
| RDS PostgreSQL Instance | 1 | Database (db.m5.large) |
| S3 Buckets | 5 | Storage (general), Staging, Final, Client hosting |
| Lambda Functions | 4 | 2 for video upload, 2 for Cognito |
| API Gateway | 1 | REST API for uploads |
| Secrets Manager Secrets | 2 | App secrets, Upload passwords |
| Cognito User Pool | 1 | Authentication |
| CloudWatch Log Groups | 5+ | Logging for all services |
| IAM Roles & Policies | 8+ | Access control |

Cost Optimization

Estimated Monthly Costs (Development)

  • ECS Fargate: ~$40-60 (4 services, 0.25 vCPU, 0.5GB each)
  • RDS db.m5.large: ~$140 (20GB storage)
  • Application Load Balancer: ~$20
  • NAT Gateway: ~$30 (if enabled)
  • S3 Storage: ~$0.50-2 per GB/month (final bucket only)
  • Lambda: ~$0.20 per million invocations
  • API Gateway: ~$3.50 per million requests
  • Data Transfer: Variable (first 1GB free)

Total Estimated: $230-260/month for development environment
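
As a quick check on the total, the fixed monthly line items above can be summed directly. The sketch below uses only the figures from this list and leaves out the usage-based items (S3, Lambda, API Gateway, data transfer); the names FIXED_ITEMS and monthly_range are hypothetical.

```python
# (low, high) monthly estimates in USD, copied from the list above.
FIXED_ITEMS = {
    "ecs_fargate": (40, 60),
    "rds_db_m5_large": (140, 140),
    "alb": (20, 20),
    "nat_gateway": (30, 30),
}


def monthly_range(items, include_nat=True):
    """Sum the low and high estimates, optionally leaving out the NAT Gateway."""
    low = high = 0
    for name, (item_low, item_high) in items.items():
        if name == "nat_gateway" and not include_nat:
            continue
        low += item_low
        high += item_high
    return low, high
```

The fixed items come to $230-250 with the NAT Gateway and $200-220 without it, so the quoted total appears to leave roughly $10 of headroom for usage-based charges.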

Cost Reduction Tips

  1. Disable NAT Gateways in development (use VPC endpoints instead)
  2. Use Fargate Spot for non-critical services (up to 70% discount)
  3. Enable S3 Intelligent-Tiering for infrequent access storage
  4. Set CloudWatch Log Retention to 7 days for development
  5. Use RDS Reserved Instances for production (40-60% discount)
  6. Enable staging bucket lifecycle (auto-delete after 24h - already configured)

Cognito Authentication Flow

flowchart TD
    A["User Login Request"] --> B["Cognito User Pool<br/>Email + Password"]
    B --> C["Pre-Token Generation Trigger"]
    C --> D["Lambda:<br/>hasura-cognito-trigger"]
    D -->|Add JWT claims namespace| E["Claims Processing"]
    E -->|x-hasura-user-id<br/>x-hasura-default-role<br/>x-hasura-allowed-roles| F["Cognito Returns<br/>JWT Token"]
    F -->|Token with Hasura claims| G["Post-Authentication Trigger"]
    G --> H["Lambda:<br/>hasura-cognito-sync-users"]
    H -->|Retrieve endpoint<br/>from Secrets Manager| I["Execute GraphQL Mutation"]
    I -->|Upsert user to<br/>Hasura database| J["User Authenticated<br/>+ Synced to Database"]
    style A fill:#1a1a1a,color:#e0e0e0,rx:30
    style J fill:#121010,color:#a5e3e8,rx:30
    style F fill:#1a1a1a,color:#fff,rx:30

Module-Specific Documentation

  • Video Upload: See modules/video-upload/README.md and ARCHITECTURE.md in repository
  • Cognito Integration: Lambda function source in environments/development/lambda-functions/

For complete implementation details, configuration examples, and troubleshooting, refer to the GitHub repository.