RDS Prod Runbook

The AWS RDS PostgreSQL production database instance, the primary data store for all Phenom production services.

Audit stamp: **Verified** · 2026-06-19 · Phenom AI Agent
Source: `asset-registry.yaml; host: phenom-prod-postgres (RDS type); access via phenom-oneoff-sql ECS Fargate task`
*C2PA signed · SanMarcSoft AI content credential*

What it is

`phenom-prod-postgres` is the production AWS RDS PostgreSQL instance and the single source of truth for all Phenom platform data. It backs api.thephenom.app, chat.thephenom.app (Synapse), and all other production services. It must never be stopped, resized outside a maintenance window, or accessed directly: all SQL access goes through the `phenom-oneoff-sql` ECS Fargate task. This is a P0 asset; any change requires explicit approval.

Deployment chain

Layer	Value
Identifier	`phenom-prod-postgres`
Engine	AWS RDS PostgreSQL
Region	`us-east-1`
AWS profile	`phenom`
VPC	Phenom VPC (private subnets – no public endpoint)
Access method	ECS Fargate task `phenom-oneoff-sql` only
Multi-AZ	Enabled (verify in RDS console)
Automated backups	Enabled, 7-day retention (verify in RDS console)
Consumers	`api.thephenom.app`, `chat.thephenom.app`

Common operations

Check RDS instance status

aws rds describe-db-instances \
  --db-instance-identifier phenom-prod-postgres \
  --profile phenom \
  --region us-east-1 \
  --query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,Engine:EngineVersion,MultiAZ:MultiAZ,Storage:AllocatedStorage}'

Connect to the database via phenom-oneoff-sql task

# Run a one-off ECS Fargate task inside the VPC:
aws ecs run-task \
  --cluster phenom-prod-cluster \
  --task-definition phenom-oneoff-sql \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[<PRIVATE_SUBNET_ID>],securityGroups=[<DB_SG_ID>],assignPublicIp=DISABLED}" \
  --profile phenom \
  --region us-east-1

# For interactive psql via ECS Exec (the run-task call must include --enable-execute-command):
TASK_ARN=$(aws ecs run-task ... --query 'tasks[0].taskArn' --output text)
aws ecs execute-command \
  --cluster phenom-prod-cluster \
  --task "$TASK_ARN" \
  --container psql \
  --interactive \
  --command "psql \$DATABASE_URL" \
  --profile phenom \
  --region us-east-1
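
ECS Exec can only attach once the task has reached RUNNING, so it helps to wait between `run-task` and `execute-command`. The following is a minimal sketch, not part of the original runbook: `wait_for_task` is a hypothetical helper name, and it assumes `TASK_ARN` was captured as shown above.

```shell
# Hypothetical helper: block until the one-off task is RUNNING, then print its status.
# Argument 1: the task ARN returned by `aws ecs run-task`.
wait_for_task() {
  local task_arn="$1"

  # Polls until the task is RUNNING; returns non-zero if the waiter gives up.
  aws ecs wait tasks-running \
    --cluster phenom-prod-cluster \
    --tasks "$task_arn" \
    --profile phenom \
    --region us-east-1 || return 1

  # Print the last known status as a confirmation.
  aws ecs describe-tasks \
    --cluster phenom-prod-cluster \
    --tasks "$task_arn" \
    --profile phenom \
    --region us-east-1 \
    --query 'tasks[0].lastStatus' \
    --output text
}
```

Call it as `wait_for_task "$TASK_ARN"` before `aws ecs execute-command`; a non-zero return means the task never reached RUNNING and its stopped reason should be inspected.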

Take a manual snapshot before any risky change

# Capture the identifier once so the create and wait calls reference the same snapshot:
SNAPSHOT_ID="phenom-prod-manual-$(date +%Y%m%d-%H%M%S)"

aws rds create-db-snapshot \
  --db-instance-identifier phenom-prod-postgres \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --profile phenom \
  --region us-east-1

# Wait for snapshot to complete:
aws rds wait db-snapshot-completed \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --profile phenom \
  --region us-east-1

Initiate a failover (Multi-AZ)

# Forces a failover to the Multi-AZ standby instance (brief downtime, roughly 30-60s):
aws rds reboot-db-instance \
  --db-instance-identifier phenom-prod-postgres \
  --force-failover \
  --profile phenom \
  --region us-east-1
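
After a forced failover it is worth confirming that the instance has returned to `available` and reviewing what RDS logged. The following is a sketch under assumptions: `post_failover_check` is a hypothetical helper name and the 30-minute event window is an arbitrary choice.

```shell
# Hypothetical helper: wait for the instance to recover, then list recent RDS events.
post_failover_check() {
  # Block until the instance reports "available" again.
  aws rds wait db-instance-available \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 || return 1

  # Show events from the last 30 minutes (assumed window) as "date<TAB>message" rows.
  aws rds describe-events \
    --source-identifier phenom-prod-postgres \
    --source-type db-instance \
    --duration 30 \
    --profile phenom \
    --region us-east-1 \
    --query 'Events[].[Date,Message]' \
    --output text
}
```

Run `post_failover_check` immediately after the reboot command; the event list should include entries for the failover starting and completing.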

Resize instance (schedule maintenance window)

# Deferred resize (applies during next maintenance window):
aws rds modify-db-instance \
  --db-instance-identifier phenom-prod-postgres \
  --db-instance-class db.t3.large \
  --no-apply-immediately \
  --profile phenom \
  --region us-east-1
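
A deferred change stays queued until the maintenance window, so it can be useful to confirm what is pending and when it will apply. The following is a minimal sketch; `pending_modifications` is a hypothetical helper name, while `PendingModifiedValues` and `PreferredMaintenanceWindow` are fields returned by `describe-db-instances`.

```shell
# Hypothetical helper: show queued modifications and the window in which they will apply.
pending_modifications() {
  aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].{Pending:PendingModifiedValues,Window:PreferredMaintenanceWindow}' \
    --output json
}
```

An empty `Pending` object means nothing is queued; after the resize command above, it should list the new instance class.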

List recent automated backups

aws rds describe-db-snapshots \
  --db-instance-identifier phenom-prod-postgres \
  --snapshot-type automated \
  --profile phenom \
  --region us-east-1 \
  --query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-5:].{ID:DBSnapshotIdentifier,Created:SnapshotCreateTime,Status:Status}'

Verify it is working

aws rds describe-db-instances \
  --db-instance-identifier phenom-prod-postgres \
  --profile phenom \
  --region us-east-1 \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text
# Expected: "available"

# Functional check: verify API (which depends on this RDS) responds correctly
curl -sf https://api.thephenom.app/healthz
# Expected: HTTP 200 / "OK"
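
The two checks above can be combined into one pass/fail command for use in scripts. The following is a sketch, not an established tool: `verify_rds` is a hypothetical function name, and it assumes that `available` plus a successful `/healthz` response is a sufficient definition of healthy.

```shell
# Hypothetical helper: succeed only if RDS is "available" and the API health check passes.
verify_rds() {
  local status
  status=$(aws rds describe-db-instances \
    --db-instance-identifier phenom-prod-postgres \
    --profile phenom \
    --region us-east-1 \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text) || return 1

  if [ "$status" != "available" ]; then
    echo "RDS status is '$status', expected 'available'" >&2
    return 1
  fi

  # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
  if ! curl -sf https://api.thephenom.app/healthz >/dev/null; then
    echo "API health check failed" >&2
    return 1
  fi

  echo "ok"
}
```

`verify_rds && echo "healthy"` prints `ok` and `healthy` only when both checks pass; any failure writes the reason to stderr and returns non-zero.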

Common failure modes

Symptom	Likely cause	Remediation
API returns 500 / DB errors	RDS unhealthy or connection pool exhausted	Check RDS status; check service logs; restart API services
Instance in “modifying” state	Maintenance window change in progress	Wait for completion; monitor RDS events
Storage near 100%	Storage autoscaling disabled or not triggered fast enough	Enable storage autoscaling; archive old data; run `VACUUM` so space held by dead rows can be reused
Slow response across all services	RDS CPU spike or long-running query	Check Performance Insights; kill long-running queries; add read replica
Connection refused	Security group rule removed	Restore the inbound rules on the database security group that allow the phenom-api and phenom-synapse security groups
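
For the slow-response case, long-running queries can be listed from inside the `phenom-oneoff-sql` task before deciding what to stop. The following is a sketch under assumptions: `list_long_queries` is a hypothetical helper name, the 60-second default threshold is arbitrary, and `DATABASE_URL` is assumed to be set in the task container as in the ECS Exec command above.

```shell
# Hypothetical helper, intended to run inside the phenom-oneoff-sql task.
# Argument 1 (optional): minimum runtime in seconds, default 60.
list_long_queries() {
  local min_seconds="${1:-60}"

  # The quoted heredoc keeps the shell from expanding anything;
  # psql substitutes :min_seconds from the -v option.
  psql "$DATABASE_URL" -v min_seconds="$min_seconds" <<'SQL'
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > make_interval(secs => :min_seconds)
ORDER BY runtime DESC;
SQL
}
```

With a `pid` from this output, `SELECT pg_cancel_backend(<pid>);` cancels the running query, and `SELECT pg_terminate_backend(<pid>);` closes the whole connection if cancelling is not enough.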


Last modified June 19, 2026: fix(ci): use committed c2pa-manifests.json; verify signer_cn resolved (#365) (#367) (58fe9f5)