Phenom Infrastructure
This section contains infrastructure documentation for the Phenom application stack. Access is restricted to the infrastructure team.
Overview
Phenom Infrastructure provides Terraform infrastructure as code for deploying the complete Phenom application stack on AWS ECS. This repository contains modular Terraform configurations that create a production-ready cloud environment with security, scalability, and monitoring best practices.
Current security posture (2026-05-28). The chat data plane (chat.thephenom.app) is
Cloudflare-proxied with SSL Full and an mTLS-locked ALB origin (mutual_authentication=verify):
the raw-origin bypass is closed, and the chat root 302-redirects non-client traffic to
try.thephenom.app (the Synapse-Admin UI is no longer public). The WAFv2 WebACL
phenom-prod-chat-protect (4 rules) runs in COUNT (defense-in-depth). Auth is AWS Cognito
(prod pool us-east-1_knEL7cqS3); registration is disabled.
IaC debt (pending): the WAF, the waf-alb-association IAM grant, the ALB trust store + listener
mTLS, the per-hostname Cloudflare AOP, the SSL config rule, and the rate-limit rule were applied via
the AWS/Cloudflare APIs and are not yet codified in Terraform (tracked in phenom-infra
feature/94, not merged). Full current chat state: Chat Services; NEST
service map: NEST Infrastructure.
Staging refreshed as a prod carbon copy (2026-06-20). Staging (phenom-dev-postgres,
Cognito pool us-east-1_n8gO6SbP6, media bucket phenom-staging-media) was rebuilt as an
exact copy of production data: the RDS instance was restored from a prod snapshot
(phenom-prod-postgres-clone-20260620, via snapshot_identifier in environments/development,
PR phenom-infra#150), and prod S3 media was synced into the staging bucket. The staging Cognito
pool was then wiped and repopulated with production’s 31 users, and the database users.id
values were remapped to the new staging Cognito subs (FK cascade across all user-referencing
tables).
Staging test credentials: all imported staging users have the password Password12345!
(internal/testing only — staging is not production data-sensitive after this refresh).
IaC follow-ups (pending): the dev module.rds config still hardcodes
database_username = "phenomhabu" while the prod-snapshot master is phenomprod, so
terraform plan will want to re-replace the instance until a follow-up PR sets
database_username = "phenomprod" and adds lifecycle { ignore_changes = [snapshot_identifier] }.
The synapse_staging chat database was dropped by the instance replace and needs the
chat-synapse provisioner re-run.
Repository
GitHub Repository: Phenom-earth/phenom-infra
Architecture
The infrastructure deploys a comprehensive AWS environment including:
Core Infrastructure
- VPC: Virtual Private Cloud with public/private/database subnets across multiple availability zones
- ECS Fargate: Containerized application cluster with auto-scaling capabilities
- Application Load Balancer: Traffic routing and SSL termination
- RDS PostgreSQL: Managed database service (PostgreSQL 17.4) with automated backups
- AWS Secrets Manager: Secure credential and configuration storage
- AWS Cognito: User authentication and authorization with Hasura integration
- S3 Storage: Multiple buckets for general storage and video/image uploads
- Lambda Functions: Serverless compute for authentication hooks and file validation
- API Gateway: REST API for secure upload workflows
Service Stack
Note (2026-05-28): the list below reflects the original Nhost-style design. In current production, auth is AWS Cognito (not a Hasura Auth service) and file storage is AWS S3 via the nest-api Worker (not a Hasura Storage service); “Nhost Functions” is legacy. The phenom-prod-cluster also runs Synapse (Matrix chat) + Hasura + the chat MCP. The authoritative current service map is NEST Infrastructure and Chat Services; this list is pending reconciliation with 007.
The ECS cluster runs the following containerized services:
- GraphQL Service (Hasura GraphQL Engine)
  - Port: 8080
  - Provides GraphQL API and database migrations
  - Integrated with Cognito for JWT authentication
- Auth Service (Hasura Auth)
  - Port: 4000
  - Handles authentication and JWT token management
  - Enhanced with Cognito integration
- Storage Service (Hasura Storage)
  - Port: 5000
  - Manages file uploads and storage operations
  - Utilizes S3 backend
- Functions Service (Nhost Functions)
  - Port: 3000
  - Executes serverless functions
Video/Image Upload System
Serverless file upload pipeline with validation and security:
- API Gateway: REST API for pre-signed URL generation
- Lambda: Pre-signed URL Generator: Password-protected URL generation with 1-hour expiry
- S3 Staging Bucket: Temporary storage with 24-hour auto-cleanup
- Lambda: File Validator: Automatic validation using magic bytes, optional virus scanning
- S3 Final Bucket: Permanent storage for validated media organized by type
- Client Hosting: S3-hosted upload interface
Authentication Integration
AWS Cognito integrated with Hasura GraphQL:
- Cognito User Pool: Email-based authentication with MFA support
- Lambda: Token Enhancement: Adds Hasura JWT claims to Cognito tokens
- Lambda: User Sync: Automatically syncs authenticated users to Hasura database
- OAuth 2.0 Flow: Implicit grant with callback support
Prerequisites
Before deploying the infrastructure, ensure you have:
- Terraform >= 1.0
- AWS CLI configured with appropriate credentials
- AWS Account with sufficient permissions to create resources
Quick Start
1. Configure AWS Credentials
# Option 1: AWS CLI configuration
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
# Option 3: AWS Profile
export AWS_PROFILE="your-profile-name"
2. Choose Environment
cd environments/<desired-env>
# Examples:
cd environments/development
# or
cd environments/production
3. Deploy Infrastructure
# Initialize Terraform
terraform init
# Review planned changes
terraform plan
# Deploy infrastructure
terraform apply
Environment Structure
environments/
├── development/
│   ├── main.tf       # Main configuration
│   ├── locals.tf     # Environment-specific variables
│   ├── versions.tf   # Terraform and provider versions
│   ├── backend.tf    # Remote state configuration
│   └── outputs.tf    # Output values
└── production/
    └── ... (same structure)
Prod and staging are fully separated environments, each with its own Terraform state.
Production lives in environments/production/ and staging lives in environments/development/
(the development directory is the staging environment — phenom-dev-postgres, Cognito pool
us-east-1_n8gO6SbP6). Each environment has its own backend.tf remote state — the two
state files are completely isolated, and no state is shared or cross-referenced between them.
This separation is not ideal — it means resources, modules, and changes have to be authored twice (once per environment), which invites duplication and config drift between prod and staging. But it is the model we run today, so all infrastructure must be built out with this separation of concerns in mind: add production resources to the production environment directory and staging resources to the staging environment directory, keep each environment’s state isolated, never reach across into the other environment’s state, and treat prod and staging as independent stacks that happen to share a module layout. Until/unless this is consolidated, assume nothing is shared across the prod ↔ staging boundary.
Infrastructure Modules
Networking Module (modules/networking/)
- VPC: 10.0.0.0/16 CIDR with Internet Gateway
- 3-Tier Subnet Architecture:
  - Public Subnets (10.0.0.0/24, 10.0.1.0/24) - For ALB
  - Private Subnets (10.0.10.0/24, 10.0.11.0/24) - For ECS tasks
  - Database Subnets (10.0.20.0/24, 10.0.21.0/24) - For RDS
- NAT Gateways for private subnet egress (optional)
- Security groups for ALB, ECS tasks, and RDS with least-privilege rules
- Outputs: VPC ID, subnet IDs, security group IDs
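The subnet plan above can be sanity-checked offline. The following is a minimal sketch, not part of the repository: it uses only Python's standard ipaddress module, and the names VPC_CIDR, SUBNETS, and check_subnet_layout are hypothetical. It confirms that every listed subnet sits inside the VPC CIDR and that no two subnets overlap.

```python
import ipaddress

# CIDR values copied from the Networking Module list above.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public": ["10.0.0.0/24", "10.0.1.0/24"],
    "private": ["10.0.10.0/24", "10.0.11.0/24"],
    "database": ["10.0.20.0/24", "10.0.21.0/24"],
}


def check_subnet_layout(vpc, tiers):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    nets = [
        (tier, ipaddress.ip_network(cidr))
        for tier, cidrs in tiers.items()
        for cidr in cidrs
    ]
    # Every subnet must be contained in the VPC range.
    for tier, net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{tier} subnet {net} is outside {vpc}")
    # No pair of subnets may share addresses.
    for i, (tier_a, a) in enumerate(nets):
        for tier_b, b in nets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{tier_a} {a} overlaps {tier_b} {b}")
    return problems


problems = check_subnet_layout(VPC_CIDR, SUBNETS)
print(problems or "subnet layout is consistent")
```

A check like this could run before terraform plan whenever a new subnet is proposed, so an overlapping range is caught before AWS rejects it.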
Network architecture diagram (summary): the Internet Gateway forwards ports 80/443 to the ALB, which applies path-based routing with /healthz health checks; a NAT Gateway provides private subnet egress. The ALB routes /api/graphql, /api/auth, /api/storage, and /api/functions to the target groups GraphQL:8080, Auth:4000, Storage:5000, and Functions:3000, each backed by an ECS task in the private tier (Hasura GraphQL, Hasura Auth, Hasura Storage, Nhost Functions). All four tasks query and update RDS PostgreSQL (db.m5.large, 20GB → 100GB, private subnet) in the database tier. The Storage task uploads to and downloads from the S3 buckets (General, Staging, Final, Hosting), and the GraphQL, Auth, and Storage tasks read DB credentials, API keys, and passwords from AWS Secrets Manager.
Application Load Balancer Module (modules/alb/)
- ALB: Public-facing load balancer in public subnets
- 4 Target Groups with health checks (/healthz every 30s):
  - GraphQL (port 8080)
  - Auth (port 4000)
  - Storage (port 5000)
  - Functions (port 3000)
- HTTP listener on port 80 with path-based routing
- Outputs: ALB DNS name, target group ARNs
ECS Module (modules/ecs/)
- ECS Fargate Cluster with Container Insights enabled
- 4 Task Definitions:
  - Hasura GraphQL Engine (8080)
  - Hasura Auth Service (4000)
  - Hasura Storage Service (5000)
  - Nhost Functions (3000)
- IAM Roles: Task execution role and task role with necessary permissions
- CloudWatch Logs: /ecs/phenom-dev log group
- Secrets Integration: Environment variables from AWS Secrets Manager
- Outputs: Cluster ARN, service ARNs, task definition ARNs
ECS cluster diagram (summary): the cluster (Container Insights enabled) runs four services, each as 2 tasks × 0.25vCPU, 0.5GB: GraphQL Service (Hasura Engine, port 8080), Auth Service (Hasura Auth, port 4000), Storage Service (Hasura Storage, port 5000), and Functions Service (Nhost Functions, port 3000). The cluster pulls container images from the ECR registry, receives environment variables and database credentials from AWS Secrets Manager, streams logs to the CloudWatch Logs group /ecs/phenom-dev, and assumes the IAM execution and task roles. The logs trigger CloudWatch Alarms for CPU/memory monitoring.
Note: Each service runs 2 tasks for high availability with auto-scaling capabilities.
RDS Module (modules/rds/)
- PostgreSQL 17.4 on db.m5.large instance
- Storage: 20GB initial with auto-scaling to 100GB
- Backup: 7-day retention, daily 03:00-04:00 UTC
- Maintenance: Sunday 04:00-05:00 UTC
- Security: Private (not publicly accessible), encrypted at rest
- Outputs: Endpoint, port, database name, username ARN
S3 Module (modules/s3/)
- General Storage Bucket: Replaces MinIO for backend storage
- Features:
  - Versioning support
  - AES256 encryption
  - CORS configuration for API access
  - Public access blocked
  - Lifecycle rules for cleanup (incomplete multipart uploads after 7 days)
- IAM User: phenom-storage-user with programmatic access
- Outputs: Bucket name, bucket ARN, access key ID
Video Upload Module (modules/video-upload/) - NEW
Complete serverless file upload system with security and validation:
Components:
- API Gateway: REST API /upload/generate-url endpoint
  - Usage plan: 10,000 requests/day, 10 req/sec rate limit
  - CORS enabled for browser uploads
- Lambda: presigned-url-generator
  - Runtime: Node.js 18.x, 512 MB, 30s timeout
  - Validates password from Secrets Manager (5-min cache)
  - Validates MIME type and file size (500MB default)
  - Generates unique pre-signed URLs (1-hour expiry)
- Lambda: file-validator
  - Runtime: Node.js 18.x, 3008 MB, 300s timeout
  - Triggered by S3 events on staging bucket
  - Magic byte validation (prevents extension spoofing)
  - Optional ClamAV virus scanning
  - Moves valid files to final bucket, deletes invalid
- S3 Staging Bucket: Temporary 24-hour storage
- S3 Final Bucket: Organized by type (/images/, /videos/)
- S3 Client Hosting Bucket: Hosts upload UI
Supported File Types:
- Videos: MP4, MPEG, QuickTime, AVI, WMV, WebM
- Images: JPEG, PNG, GIF, WebP, SVG, TIFF, BMP
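The request-side checks performed before a pre-signed URL is issued (MIME type and file size) can be sketched as follows. This is an illustration only, written in Python rather than the Lambda's Node.js runtime. The allowlist is an assumption built from the MIME types commonly registered for the formats listed above, the 500MB limit is interpreted as 500 MiB, and validate_upload_request is a hypothetical name, not the function in the repository.

```python
# Hypothetical allowlist: MIME types commonly used for the formats listed above.
ALLOWED_MIME_TYPES = {
    "video/mp4", "video/mpeg", "video/quicktime", "video/x-msvideo",
    "video/x-ms-wmv", "video/webm",
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "image/svg+xml", "image/tiff", "image/bmp",
}
# Assumption: the documented "500MB default" is treated as 500 MiB here.
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024


def validate_upload_request(mime_type, size_bytes):
    """Return (ok, reason) for a requested upload."""
    if mime_type.lower() not in ALLOWED_MIME_TYPES:
        return False, f"unsupported MIME type: {mime_type}"
    if size_bytes <= 0:
        return False, "file size must be positive"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False, "file exceeds the size limit"
    return True, "ok"


print(validate_upload_request("video/mp4", 10 * 1024 * 1024))
print(validate_upload_request("application/pdf", 1024))
```

The declared MIME type comes from the client and cannot be trusted on its own, which is why the pipeline also inspects the file content after upload.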
Security:
- Password authentication via Secrets Manager
- Time-limited pre-signed URLs
- File type validation using magic bytes
- Optional virus scanning
- All buckets encrypted (AES256)
- Rate limiting and quotas
Outputs: API endpoint, bucket names, Lambda ARNs, client website URL
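Magic byte validation compares the first bytes of an uploaded file against known format signatures instead of trusting the file name or declared type. The sketch below shows the idea for a subset of the supported formats. It is an illustration in Python, not the file-validator Lambda's code; sniff_media_type and matches_declared_type are hypothetical names, and a fuller version would also inspect the ftyp brand to tell MP4 from QuickTime and add signatures for the remaining formats.

```python
def sniff_media_type(header: bytes):
    """Guess a media type from the leading bytes of a file, or return None."""
    if header.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # RIFF containers carry their format tag at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "image/webp"
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "video/x-msvideo"
    # ISO base media files (MP4 and relatives) have "ftyp" at byte offset 4.
    if header[4:8] == b"ftyp":
        return "video/mp4"
    # EBML header, used by WebM (and Matroska).
    if header.startswith(b"\x1a\x45\xdf\xa3"):
        return "video/webm"
    return None


def matches_declared_type(header: bytes, declared_mime: str) -> bool:
    """True only when the sniffed type equals the type the client declared."""
    sniffed = sniff_media_type(header)
    return sniffed is not None and sniffed == declared_mime


# A Windows executable renamed to .png is rejected: its bytes start with "MZ".
print(matches_declared_type(b"MZ\x90\x00\x03\x00\x00\x00", "image/png"))
```

Only the first dozen or so bytes are needed for these checks, so a validator can read a small byte range from the staging object rather than the whole file.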
Cognito Integration
AWS Cognito User Pool with Hasura integration:
Configuration:
- User Pool: phenom-dev with email-based authentication
- Password Policy: 8+ chars, lowercase, uppercase, numbers, symbols
- MFA: Configurable (currently OFF in dev)
- OAuth 2.0: Implicit grant flow
- Callback URLs: localhost:3000 for development
Lambda Triggers:
- hasura-cognito-trigger (Pre-Token Generation)
  - Adds Hasura JWT claims to Cognito tokens
  - Claims namespace: https://hasura.io/jwt/claims
  - Includes: user ID, default role, allowed roles
- hasura-cognito-sync-users (Post-Authentication)
  - Syncs authenticated users to Hasura database
  - GraphQL mutation: upserts user to users table
  - Retrieves GraphQL endpoint and admin secret from Secrets Manager
  - 5-minute secret caching for performance
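What the Pre-Token Generation trigger adds to a token can be sketched as below. This is an illustration in Python, not the repository's Lambda source. It assumes the Cognito pre token generation event shape (request.userAttributes.sub and response.claimsOverrideDetails.claimsToAddOrOverride) and assumes the claims are JSON-encoded into a single string, since Cognito claim values are strings. The "user" role and the function name add_hasura_claims are hypothetical.

```python
import json

HASURA_CLAIMS_NAMESPACE = "https://hasura.io/jwt/claims"


def add_hasura_claims(event, default_role="user", allowed_roles=("user",)):
    """Attach Hasura claims to a Cognito Pre Token Generation event."""
    user_id = event["request"]["userAttributes"]["sub"]
    claims = {
        "x-hasura-user-id": user_id,
        "x-hasura-default-role": default_role,
        "x-hasura-allowed-roles": list(allowed_roles),
    }
    # Assumption: the claims object is stored as a JSON string under the
    # namespace key, because Cognito claim overrides are string-valued.
    event["response"] = {
        "claimsOverrideDetails": {
            "claimsToAddOrOverride": {
                HASURA_CLAIMS_NAMESPACE: json.dumps(claims),
            }
        }
    }
    return event


sample = {"request": {"userAttributes": {"sub": "abc-123"}}, "response": {}}
result = add_hasura_claims(sample)
print(result["response"]["claimsOverrideDetails"]["claimsToAddOrOverride"])
```

If the claims are string-encoded this way, Hasura's JWT configuration has to be told to parse them as stringified JSON; whether the deployed stack does so should be confirmed against the Lambda source in the repository.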
Post-Deployment Configuration
After successful deployment:
- Update Database Credentials: Modify database password in AWS Secrets Manager
- Configure DNS: Point your domain to the ALB DNS name (provided in Terraform outputs)
- Monitor Services: Verify all ECS services are running healthy in AWS Console
- Set Video Upload Password (if using video upload module):
aws secretsmanager update-secret \
  --secret-id "phenom-dev-video-upload-passwords" \
  --secret-string '{"passwords":["your-secure-password"]}'
- Configure Cognito OAuth (if using Cognito):
  - Update callback URLs in Cognito console for production domains
  - Configure user pool domain for hosted UI (optional)
- Test Upload System: Visit the video upload client URL from Terraform outputs
Using the Video Upload System
For Users
- Navigate to the upload client URL (from video_client_website_url output)
- Enter the upload password (configured in Secrets Manager)
- Select file(s) to upload (videos or images)
- Click “Upload” - files are validated and processed automatically
- Check S3 final bucket for validated files (organized in /images/ or /videos/)
Upload Workflow
Upload workflow diagram (summary): a request to /upload/generate-url is handled by the presigned-url-generator Lambda, which validates the password, MIME type, and size and returns a pre-signed URL to the user's browser. The browser uploads directly to the S3 Staging Bucket via the pre-signed URL. An S3 event notification triggers the file-validator Lambda, which runs magic byte validation: valid files are moved to the S3 Final Bucket under /images/ or /videos/, where the organized media is ready for use, and invalid files are deleted.
Security Features
- Password Authentication: Only users with valid password can generate upload URLs
- Pre-signed URLs: Time-limited (1 hour), generated uniquely per upload, direct to S3 (S3 does not enforce single use; a URL stays valid until it expires)
- Magic Byte Validation: Prevents extension spoofing attacks
- File Size Limits: Configurable maximum (default 500MB)
- Virus Scanning: Optional ClamAV integration for enhanced security
- Auto-cleanup: Staging files deleted after 24 hours
- Rate Limiting: API Gateway quotas prevent abuse
Terraform Outputs
The infrastructure provides these key outputs:
Core Infrastructure
- alb_dns_name: Application Load Balancer DNS name
- service_endpoints: Direct URLs for each deployed service (GraphQL, Auth, Storage, Functions)
- database_endpoint: RDS PostgreSQL connection endpoint
Video Upload Module
- video_upload_api_endpoint: API Gateway base URL
- video_upload_generate_url_endpoint: Full endpoint for pre-signed URL generation
- video_staging_bucket: S3 staging bucket name
- video_final_bucket: S3 final storage bucket name
- video_client_hosting_bucket: S3 bucket hosting upload UI
- video_client_website_url: Public URL for hosted upload client
- presigned_url_lambda_arn: URL generator Lambda ARN
- file_validator_lambda_arn: File validator Lambda ARN
Cognito Authentication
- cognito_user_pool_id: User pool ID
- cognito_user_pool_arn: User pool ARN
- cognito_app_client_id: Application client ID for OAuth flow
S3 Storage
- s3_bucket_name: General storage bucket name
- s3_access_key_id: IAM user access key for S3 operations
Security Best Practices
Credential Management
- Never commit .tfstate files or .tfvars files to version control
- Use AWS Secrets Manager for all sensitive configuration values
- Implement least-privilege IAM permissions
Network Security
- Private subnets for application and database tiers
- Security groups with minimal required access
- VPC Flow Logs for network monitoring
Operations and Monitoring
Viewing Service Logs
# Tail ECS service logs
aws logs tail /ecs/phenom-dev --follow
# Check service health status
aws ecs describe-services --cluster phenom-dev-cluster --services phenom-dev-graphql
Common Troubleshooting
Permission Issues: Verify AWS credentials have sufficient IAM permissions
Resource Conflicts: Check for existing resources created outside Terraform
Service Health: Review CloudWatch logs and database connectivity
Destroying Infrastructure
⚠️ Warning: This permanently deletes all resources and data
terraform destroy
Ensure you have backed up any critical data before proceeding.
AWS Services Provisioned
The infrastructure creates the following AWS resources:
| Service | Count | Purpose |
|---|---|---|
| VPC | 1 | Network isolation |
| Subnets | 6 | Public (2), Private (2), Database (2) |
| Internet Gateway | 1 | External connectivity |
| NAT Gateway | 2 | Private subnet egress (optional) |
| Application Load Balancer | 1 | Traffic routing and SSL termination |
| Target Groups | 4 | Service routing (GraphQL, Auth, Storage, Functions) |
| ECS Cluster | 1 | Container orchestration |
| ECS Services | 4 | Containerized applications |
| RDS PostgreSQL Instance | 1 | Database (db.m5.large) |
| S3 Buckets | 5 | Storage (general), Staging, Final, Client hosting |
| Lambda Functions | 4 | 2 for video upload, 2 for Cognito |
| API Gateway | 1 | REST API for uploads |
| Secrets Manager Secrets | 2 | App secrets, Upload passwords |
| Cognito User Pool | 1 | Authentication |
| CloudWatch Log Groups | 5+ | Logging for all services |
| IAM Roles & Policies | 8+ | Access control |
Cost Optimization
Estimated Monthly Costs (Development)
- ECS Fargate: ~$40-60 (4 services, 0.25 vCPU, 0.5GB each)
- RDS db.m5.large: ~$140 (20GB storage)
- Application Load Balancer: ~$20
- NAT Gateway: ~$30 (if enabled)
- S3 Storage: ~$0.50-2 per GB/month (final bucket only)
- Lambda: ~$0.20 per million invocations
- API Gateway: ~$3.50 per million requests
- Data Transfer: Variable (first 1GB free)
Total Estimated: $230-260/month for development environment
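The total can be cross-checked against the fixed line items. The sketch below sums the low and high ends of the four items that carry a flat monthly figure (names and structure are illustrative only). They come to $230-250, so the remaining headroom up to $260 is presumably the usage-based items (S3, Lambda, API Gateway, data transfer); that reading is an assumption, not something the estimate states.

```python
# Fixed monthly line items (USD) from the estimate above, as (low, high).
FIXED_ITEMS = {
    "ECS Fargate": (40, 60),
    "RDS db.m5.large": (140, 140),
    "Application Load Balancer": (20, 20),
    "NAT Gateway": (30, 30),
}


def monthly_range(items, include_nat=True):
    """Sum the low and high ends, optionally leaving out the NAT Gateway."""
    selected = {
        name: cost
        for name, cost in items.items()
        if include_nat or name != "NAT Gateway"
    }
    low = sum(lo for lo, _ in selected.values())
    high = sum(hi for _, hi in selected.values())
    return low, high


print(monthly_range(FIXED_ITEMS))                     # (230, 250)
print(monthly_range(FIXED_ITEMS, include_nat=False))  # (200, 220)
```

The second call shows the effect of the first cost reduction tip: with the NAT Gateway disabled, the fixed portion drops by $30 per month.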
Cost Reduction Tips
- Disable NAT Gateways in development (use VPC endpoints instead)
- Use Fargate Spot for non-critical services (up to 70% discount)
- Enable S3 Intelligent-Tiering for infrequent access storage
- Set CloudWatch Log Retention to 7 days for development
- Use RDS Reserved Instances for production (40-60% discount)
- Enable staging bucket lifecycle (auto-delete after 24h - already configured)
Cognito Authentication Flow
Cognito authentication flow diagram (summary): the user signs in with email and password. The Pre-Token Generation trigger invokes the hasura-cognito-trigger Lambda, which adds the JWT claims namespace with x-hasura-user-id, x-hasura-default-role, and x-hasura-allowed-roles, and Cognito returns the JWT token with Hasura claims. The Post-Authentication trigger then invokes the hasura-cognito-sync-users Lambda, which retrieves the endpoint from Secrets Manager and executes a GraphQL mutation to upsert the user into the Hasura database, leaving the user authenticated and synced to the database.
Related Documentation
Official AWS Documentation
- AWS ECS Best Practices
- Terraform AWS Provider
- AWS Lambda Best Practices
- Amazon S3 Security Best Practices
- AWS Cognito Developer Guide
Phenom Documentation
Module-Specific Documentation
- Video Upload: See modules/video-upload/README.md and ARCHITECTURE.md in repository
- Cognito Integration: Lambda function source in environments/development/lambda-functions/
For complete implementation details, configuration examples, and troubleshooting, refer to the GitHub repository.