RDS Prod Runbook

Runbook for the AWS RDS PostgreSQL production database instance, the primary data store for all Phenom production services.
Audit stamp: Verified — 2026-06-19 — Phenom AI Agent
Source: asset-registry.yaml; host: phenom-prod-postgres (RDS type); access via phenom-oneoff-sql ECS Fargate task
C2PA signed · SanMarcSoft AI content credential

What it is

phenom-prod-postgres is the production AWS RDS PostgreSQL instance and the single source of truth for all Phenom platform data. It backs api.thephenom.app, chat.thephenom.app (Synapse), and all production services. It must never be stopped, resized without a maintenance window, or accessed directly – all SQL access uses the phenom-oneoff-sql ECS Fargate task. This is a P0 asset; any change requires explicit approval.

Deployment chain

Layer | Value
Identifier | phenom-prod-postgres
Engine | AWS RDS PostgreSQL
Region | us-east-1
AWS profile | phenom
VPC | Phenom VPC (private subnets – no public endpoint)
Access method | ECS Fargate task phenom-oneoff-sql only
Multi-AZ | Enabled (verify in RDS console)
Automated backups | Enabled, 7-day retention (verify in RDS console)
Consumers | api.thephenom.app, chat.thephenom.app
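
The "verify in RDS console" items can also be checked from the CLI. The following is a minimal sketch, not an established Phenom script: the function names check_setting and verify_deployment are hypothetical, and the expected values (Multi-AZ enabled, 7-day retention, no public endpoint) are taken from the table above. It assumes the AWS CLI renders booleans as True/False in text output.

```shell
# Compare one observed setting with the value this runbook expects.
check_setting() {
  local name="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $name=$actual"
  else
    echo "MISMATCH: $name expected=$expected actual=$actual"
    return 1
  fi
}

# Read MultiAZ, BackupRetentionPeriod and PubliclyAccessible in one call
# and compare each against the deployment table. Read-only.
verify_deployment() {
  local settings rc=0
  settings="$(aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].[MultiAZ,BackupRetentionPeriod,PubliclyAccessible]' \
    --output text)" || return 2
  # Text output is a single whitespace-separated line; split it into $1 $2 $3.
  set -- $settings
  check_setting MultiAZ True "$1" || rc=1
  check_setting BackupRetentionPeriod 7 "$2" || rc=1
  check_setting PubliclyAccessible False "$3" || rc=1
  return "$rc"
}
```

Running verify_deployment prints one line per setting and returns non-zero if any value differs from the table, which makes it usable as a pre-change gate.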

Common operations

Check RDS instance status

aws rds describe-db-instances \
  --db-instance-identifier phenom-prod-postgres \
  --profile phenom \
  --region us-east-1 \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Engine:EngineVersion,MultiAZ:MultiAZ,Storage:AllocatedStorage}'

Connect to the database via the phenom-oneoff-sql task

# Run a one-off ECS Fargate task inside the VPC:
aws ecs run-task \
  --cluster phenom-prod-cluster \
  --task-definition phenom-oneoff-sql \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[<PRIVATE_SUBNET_ID>],securityGroups=[<DB_SG_ID>],assignPublicIp=DISABLED}" \
  --profile phenom \
  --region us-east-1

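# For interactive psql via ECS Exec. The task must have been started with
# --enable-execute-command, and the command is wrapped in sh -c because ECS Exec
# does not run it through a shell, so $DATABASE_URL would otherwise not expand:
TASK_ARN=$(aws ecs run-task ... --query 'tasks[0].taskArn' --output text)
aws ecs execute-command \
  --cluster phenom-prod-cluster \
  --task "$TASK_ARN" \
  --container psql \
  --interactive \
  --command "sh -c 'psql \"\$DATABASE_URL\"'" \
  --profile phenom \
  --region us-east-1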
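
ECS Exec only works once the task has reached RUNNING, and only if the task was launched with ECS Exec enabled. The sketch below wraps the two steps; start_sql_task is a hypothetical helper name, and the subnet and security group IDs remain placeholders that the caller must supply. It also assumes the Session Manager plugin is installed locally and the task role permits ECS Exec, neither of which this runbook confirms.

```shell
# Start phenom-oneoff-sql with ECS Exec enabled and print its task ARN
# once the task is running.
# $1 = private subnet ID, $2 = security group ID (caller-supplied).
start_sql_task() {
  local subnet="$1" sg="$2" task_arn
  if [ -z "$subnet" ] || [ -z "$sg" ]; then
    echo "usage: start_sql_task <PRIVATE_SUBNET_ID> <DB_SG_ID>" >&2
    return 2
  fi
  task_arn="$(aws ecs run-task \
    --cluster phenom-prod-cluster \
    --task-definition phenom-oneoff-sql \
    --launch-type FARGATE \
    --enable-execute-command \
    --network-configuration "awsvpcConfiguration={subnets=[$subnet],securityGroups=[$sg],assignPublicIp=DISABLED}" \
    --profile phenom \
    --region us-east-1 \
    --query 'tasks[0].taskArn' \
    --output text)" || return 1
  # With --output text a missing task renders as "None"
  # (run-task reports placement problems in its "failures" list).
  case "$task_arn" in
    ""|None) echo "run-task did not return a task ARN" >&2; return 1 ;;
  esac
  # Block until the task reaches RUNNING so execute-command can attach.
  aws ecs wait tasks-running \
    --cluster phenom-prod-cluster \
    --tasks "$task_arn" \
    --profile phenom \
    --region us-east-1 || return 1
  echo "$task_arn"
}
```

A typical call would be TASK_ARN="$(start_sql_task <PRIVATE_SUBNET_ID> <DB_SG_ID>)", after which the execute-command step above can use "$TASK_ARN" directly.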

Take a manual snapshot before any risky change

# Capture the identifier once so the create and wait commands refer to the same snapshot:
SNAPSHOT_ID="phenom-prod-manual-$(date +%Y%m%d-%H%M%S)"

aws rds create-db-snapshot \
  --db-instance-identifier phenom-prod-postgres \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --profile phenom \
  --region us-east-1

# Wait for the snapshot to complete:
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --profile phenom \
  --region us-east-1

Initiate a failover (Multi-AZ)

# Forces a failover to the Multi-AZ standby. Expect a brief outage (roughly 30-60s; longer if crash recovery is slow):
aws rds reboot-db-instance \
  --db-instance-identifier phenom-prod-postgres \
  --force-failover \
  --profile phenom \
  --region us-east-1
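
One way to confirm that the failover actually happened is to compare the Availability Zone hosting the primary before and after the reboot. This is a sketch under assumptions: current_az and confirm_failover are hypothetical names, and the waiter may return immediately if the instance has not yet left the "available" state, so treat a "still in" result as a prompt to check RDS events rather than as proof of failure.

```shell
# Print the Availability Zone currently hosting the primary.
current_az() {
  aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].AvailabilityZone' \
    --output text
}

# $1 = AZ recorded before the failover was triggered.
confirm_failover() {
  local before="$1" after
  # Wait for the instance to report "available" again.
  aws rds wait db-instance-available \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 || return 1
  after="$(current_az)" || return 1
  if [ "$after" != "$before" ]; then
    echo "failover complete: primary moved $before -> $after"
  else
    echo "primary still in $before; check RDS events" >&2
    return 1
  fi
}
```

Record BEFORE_AZ="$(current_az)" first, run the reboot with --force-failover, then call confirm_failover "$BEFORE_AZ".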

Resize instance (schedule maintenance window)

# Deferred resize (applies during next maintenance window):
aws rds modify-db-instance \
  --db-instance-identifier phenom-prod-postgres \
  --db-instance-class db.t3.large \
  --no-apply-immediately \
  --profile phenom \
  --region us-east-1
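
After a deferred resize is submitted, the change sits in the instance's pending modifications until the maintenance window. The sketch below reads both values back; pending_resize is a hypothetical helper name, and it assumes the CLI prints "None" in text output when no instance-class change is queued.

```shell
# Report any instance-class change queued for the next maintenance window.
pending_resize() {
  local out window pending
  out="$(aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].[PreferredMaintenanceWindow,PendingModifiedValues.DBInstanceClass]' \
    --output text)" || return 1
  # Output is "<window> <pending class or None>"; split into $1 and $2.
  set -- $out
  window="$1"
  pending="$2"
  if [ -z "$pending" ] || [ "$pending" = "None" ]; then
    echo "no pending instance-class change (maintenance window: $window)"
  else
    echo "pending resize to $pending, applies in window $window"
  fi
}
```

Calling pending_resize right after modify-db-instance is a quick check that the change was queued rather than applied immediately.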

List recent automated backups

aws rds describe-db-snapshots \
  --db-instance-identifier phenom-prod-postgres \
  --snapshot-type automated \
  --profile phenom \
  --region us-east-1 \
  --query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-5:].{ID:DBSnapshotIdentifier,Created:SnapshotCreateTime,Status:Status}'

Verify it is working

aws rds describe-db-instances \
  --db-instance-identifier phenom-prod-postgres \
  --profile phenom \
  --region us-east-1 \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text
# Expected: "available"

# Functional check: verify API (which depends on this RDS) responds correctly
curl -sf https://api.thephenom.app/healthz
# Expected: exit code 0 (HTTP 200) and an "OK" body; with -f, curl exits non-zero on HTTP 4xx/5xx
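
After a failover or resize the API may need a short time to reconnect, so a single curl can fail spuriously. A minimal polling sketch follows; wait_for_healthy is a hypothetical helper, and the defaults of 10 attempts at 6-second intervals are illustrative rather than values from this runbook.

```shell
# Poll a health URL until it answers successfully or attempts run out.
# $1 = URL, $2 = max attempts (default 10), $3 = seconds between attempts (default 6).
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-6}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx; --max-time bounds each try.
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

For example, wait_for_healthy https://api.thephenom.app/healthz polls the same endpoint as the functional check above.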

Common failure modes

Symptom | Likely cause | Remediation
API returns 500 / DB errors | RDS unhealthy or connection pool exhausted | Check RDS status; check service logs; restart API services
Instance in “modifying” state | A modification is being applied during the maintenance window | Wait for completion; monitor RDS events
Storage near 100% | Autoscaling not triggered fast enough | Enable storage autoscaling; archive old data; run VACUUM
Slow response across all services | RDS CPU spike or long-running query | Check Performance Insights; kill long-running queries; add a read replica
Connection refused | Security group rule removed | Restore the inbound rule that allows traffic from the phenom-api and phenom-synapse security groups
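
For the "long-running query" row, the offending sessions can be listed from PostgreSQL's pg_stat_activity view. The sketch below only generates the SQL text; long_running_sql is a hypothetical helper and the 5-minute default threshold is an illustrative assumption. The SQL is meant to be run inside a psql session opened through the phenom-oneoff-sql task, never against the instance directly.

```shell
# Print SQL that lists non-idle sessions whose current query has been
# running longer than N minutes (default 5).
long_running_sql() {
  local minutes="${1:-5}"
  # Accept digits only, so the value is safe to embed in the SQL text.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 2 ;;
  esac
  cat <<SQL
SELECT pid, usename, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY runtime DESC;
SQL
}
```

Paste the output of long_running_sql 10 into the psql session to see candidates. A specific pid can then be cancelled with SELECT pg_cancel_backend(pid), or terminated with SELECT pg_terminate_backend(pid) if cancelling is not enough; on this P0 asset either action needs approval first.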