RDS Prod Runbook
The AWS RDS PostgreSQL production database instance, the primary data store for all Phenom production services.
Source:
asset-registry.yaml; host: phenom-prod-postgres (RDS type); access via phenom-oneoff-sql ECS Fargate task
C2PA signed · SanMarcSoft AI content credential
What it is
phenom-prod-postgres is the production AWS RDS PostgreSQL instance and the single source of truth for all Phenom platform data. It backs api.thephenom.app, chat.thephenom.app (Synapse), and all production services. It must never be stopped, resized without a maintenance window, or accessed directly – all SQL access uses the phenom-oneoff-sql ECS Fargate task. This is a P0 asset; any change requires explicit approval.
Deployment chain
| Layer | Value |
|---|---|
| Identifier | phenom-prod-postgres |
| Engine | AWS RDS PostgreSQL |
| Region | us-east-1 |
| AWS profile | phenom |
| VPC | Phenom VPC (private subnets – no public endpoint) |
| Access method | ECS Fargate task phenom-oneoff-sql only |
| Multi-AZ | Enabled (verify in RDS console) |
| Automated backups | Enabled, 7-day retention (verify in RDS console) |
| Consumers | api.thephenom.app, chat.thephenom.app |
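The table marks Multi-AZ and backup retention as "verify in RDS console". The same values can probably be read from the CLI instead. The following is a minimal sketch, not an established team script: the function names `rds_deploy_facts` and `check_deploy_facts` are hypothetical, and it assumes the AWS CLI renders booleans as `True`/`False` in `--output text` mode.

```shell
# Hypothetical helper: print MultiAZ, BackupRetentionPeriod, and
# PubliclyAccessible for the instance as one tab-separated line.
rds_deploy_facts() {
  aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].[MultiAZ,BackupRetentionPeriod,PubliclyAccessible]' \
    --output text
}

# Hypothetical helper: compare the live values with the table above
# (Multi-AZ enabled, 7-day retention, no public endpoint).
check_deploy_facts() {
  # Word-split the tab-separated output into $1 $2 $3.
  set -- $(rds_deploy_facts)
  [ "$1" = "True" ] || { echo "Multi-AZ is not enabled"; return 1; }
  [ "$2" = "7" ] || { echo "Backup retention is $2 days, expected 7"; return 1; }
  [ "$3" = "False" ] || { echo "Instance is publicly accessible"; return 1; }
  echo "Deployment chain matches the table"
}
```

Running `check_deploy_facts` after sourcing the snippet prints a one-line verdict and returns non-zero on the first mismatch, so it can be used as a pre-change gate.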
Common operations
Check RDS instance status
aws rds describe-db-instances \
--db-instance-identifier phenom-prod-postgres \
--profile phenom \
--region us-east-1 \
--query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Engine:EngineVersion,MultiAZ:MultiAZ,Storage:AllocatedStorage}'
Connect to the database via phenom-oneoff-sql task
# Run a one-off ECS Fargate task inside the VPC:
aws ecs run-task \
--cluster phenom-prod-cluster \
--task-definition phenom-oneoff-sql \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[<PRIVATE_SUBNET_ID>],securityGroups=[<DB_SG_ID>],assignPublicIp=DISABLED}" \
--profile phenom \
--region us-east-1
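# For interactive psql via ECS Exec (the run-task call must also pass
# --enable-execute-command, otherwise execute-command is rejected):
TASK_ARN=$(aws ecs run-task ... --query 'tasks[0].taskArn' --output text)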
aws ecs execute-command \
--cluster phenom-prod-cluster \
--task "$TASK_ARN" \
--container psql \
--interactive \
--command "psql \$DATABASE_URL" \
--profile phenom \
--region us-east-1
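ECS Exec only works against a task that has reached the RUNNING state, and a one-off task left behind keeps a path into the database open. The sketch below wraps both steps. The function names are hypothetical, and it assumes the task stays alive long enough for an interactive session, which the runbook does not state.

```shell
# Hypothetical helper: block until the one-off task is RUNNING, so that
# "aws ecs execute-command" is not attempted against a task still starting.
wait_for_oneoff_task() {
  task_arn=$1
  aws ecs wait tasks-running \
    --cluster phenom-prod-cluster \
    --tasks "$task_arn" \
    --profile phenom \
    --region us-east-1
}

# Hypothetical helper: stop the task once the SQL session is finished and
# print the status ECS reports for it.
stop_oneoff_task() {
  task_arn=$1
  aws ecs stop-task \
    --cluster phenom-prod-cluster \
    --task "$task_arn" \
    --reason "one-off SQL session finished" \
    --profile phenom \
    --region us-east-1 \
    --query 'task.lastStatus' \
    --output text
}
```

A typical sequence would be `wait_for_oneoff_task "$TASK_ARN"`, then the `execute-command` call, then `stop_oneoff_task "$TASK_ARN"`.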
Take a manual snapshot before any risky change
# Build the identifier once so that create and wait refer to the same snapshot
# (calling date twice would produce two different identifiers):
SNAPSHOT_ID="phenom-prod-manual-$(date +%Y%m%d-%H%M%S)"
aws rds create-db-snapshot \
--db-instance-identifier phenom-prod-postgres \
--db-snapshot-identifier "$SNAPSHOT_ID" \
--profile phenom \
--region us-east-1
# Wait for snapshot to complete:
aws rds wait db-snapshot-completed \
--db-snapshot-identifier "$SNAPSHOT_ID" \
--profile phenom \
--region us-east-1
Initiate a failover (Multi-AZ)
# Forces a failover to the standby instance (brief downtime; AWS documents typical Multi-AZ failover times of 60-120s):
aws rds reboot-db-instance \
--db-instance-identifier phenom-prod-postgres \
--force-failover \
--profile phenom \
--region us-east-1
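The reboot command returns before the failover finishes, so it can help to record which Availability Zone hosts the primary beforehand and compare afterwards. This is a rough sketch with hypothetical function names; it assumes a Multi-AZ instance deployment, where `AvailabilityZone` reports the primary's zone.

```shell
# Hypothetical helper: print the Availability Zone currently hosting the primary.
primary_az() {
  aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].AvailabilityZone' \
    --output text
}

# Hypothetical helper: compare the AZ recorded before the failover with the
# current one. A changed AZ suggests the standby was promoted.
failover_happened() {
  before=$1
  after=$(primary_az)
  if [ -n "$after" ] && [ "$after" != "$before" ]; then
    echo "primary moved from $before to $after"
  else
    echo "primary still in $before"
    return 1
  fi
}

# Usage sketch:
#   BEFORE=$(primary_az)
#   ...run the reboot-db-instance --force-failover command...
#   aws rds wait db-instance-available \
#     --db-instance-identifier phenom-prod-postgres \
#     --profile phenom --region us-east-1
#   failover_happened "$BEFORE"
```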
Resize instance (schedule maintenance window)
# Deferred resize (applies during next maintenance window):
aws rds modify-db-instance \
--db-instance-identifier phenom-prod-postgres \
--db-instance-class db.t3.large \
--no-apply-immediately \
--profile phenom \
--region us-east-1
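With `--no-apply-immediately`, the new class is queued rather than applied, so the instance keeps reporting its old class until the maintenance window. One way to confirm the change was queued is to read `PendingModifiedValues`. This is a sketch with hypothetical function names; it assumes the CLI prints `None` for a missing value in `--output text` mode.

```shell
# Hypothetical helper: print the instance class waiting to be applied,
# or "None" when no class change is queued.
pending_instance_class() {
  aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].PendingModifiedValues.DBInstanceClass' \
    --output text
}

# Hypothetical helper: succeed only when a class change is actually queued.
resize_is_queued() {
  pending=$(pending_instance_class)
  case "$pending" in
    ""|None)
      echo "no instance class change pending"
      return 1
      ;;
    *)
      echo "pending instance class: $pending"
      ;;
  esac
}
```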
List recent automated backups
aws rds describe-db-snapshots \
--db-instance-identifier phenom-prod-postgres \
--snapshot-type automated \
--profile phenom \
--region us-east-1 \
--query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-5:].{ID:DBSnapshotIdentifier,Created:SnapshotCreateTime,Status:Status}'
Verify it is working
aws rds describe-db-instances \
--db-instance-identifier phenom-prod-postgres \
--profile phenom \
--region us-east-1 \
--query 'DBInstances[0].DBInstanceStatus' \
--output text
# Expected: available (printed without quotes because of --output text)
# Functional check: verify API (which depends on this RDS) responds correctly
curl -sf https://api.thephenom.app/healthz
# Expected: HTTP 200 / "OK"
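Right after a failover or resize, a single health check can fail while connections are re-established. A retry loop gives a steadier signal. The sketch below is illustrative only: `wait_for_healthy` and its default attempt count and delay are assumptions, not values taken from this runbook.

```shell
# Hypothetical helper: poll a health endpoint until it answers with a
# success status, or give up after a fixed number of attempts.
#   $1 = URL (defaults to the API health check used above)
#   $2 = maximum attempts (assumed default: 5)
#   $3 = seconds to sleep between attempts (assumed default: 10)
wait_for_healthy() {
  url=${1:-https://api.thephenom.app/healthz}
  attempts=${2:-5}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (4xx/5xx).
    if curl -sf --max-time 5 "$url" >/dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "still unhealthy after $attempts attempt(s)"
  return 1
}
```

Calling `wait_for_healthy` with no arguments checks the production endpoint. The non-zero return on failure makes it usable as a gate in a change script.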
Common failure modes
| Symptom | Likely cause | Remediation |
|---|---|---|
| API returns 500 / DB errors | RDS unhealthy or connection pool exhausted | Check RDS status; check service logs; restart API services |
| Instance in “modifying” state | Maintenance window change in progress | Wait for completion; monitor RDS events |
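| Storage near 100% | Storage autoscaling disabled, capped at its maximum, or not triggered fast enough | Enable storage autoscaling or raise its maximum threshold; archive old data; run VACUUM (plain VACUUM frees space for reuse only; VACUUM FULL returns it to the OS but takes an exclusive lock) |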
| Slow response across all services | RDS CPU spike or long-running query | Check Performance Insights; kill long-running queries; add read replica |
| Connection refused | Security group rule removed | Restore the inbound rule on the RDS security group that allows traffic from the phenom-api and phenom-synapse security groups |
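For the "long-running query" row, the offending sessions can be listed from `pg_stat_activity` before anything is killed. The helper below is a sketch meant to run inside the phenom-oneoff-sql task. The function name and the 5-minute default are assumptions, and it relies on `DATABASE_URL` being set as in the psql command earlier.

```shell
# Hypothetical helper: list non-idle sessions whose current query has been
# running longer than the given number of minutes (assumed default: 5).
long_running_queries() {
  minutes=${1:-5}
  # Reject anything that is not a whole number before it reaches the SQL.
  case "$minutes" in
    ""|*[!0-9]*)
      echo "minutes must be a whole number" >&2
      return 2
      ;;
  esac
  sql="SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
       FROM pg_stat_activity
       WHERE state <> 'idle'
         AND now() - query_start > interval '${minutes} minutes'
       ORDER BY runtime DESC;"
  # -X skips any psqlrc so the output format is predictable.
  psql -X -c "$sql" "$DATABASE_URL"
}
```

Once a pid is identified, `SELECT pg_cancel_backend(pid);` cancels only the running query, while `SELECT pg_terminate_backend(pid);` closes the whole session. Trying the cancel first is usually the gentler option.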