Chat Server Security Hardening (chat.thephenom.app)

Phase 2 origin lock LANDED via mTLS (2026-05-27): chat.thephenom.app is Cloudflare-proxied with SSL Full and a locked origin (ALB mutual TLS), closing the raw-ALB bypass on 443. This page holds the live security status, architecture, and runbook for the public chat server (Synapse + Hasura), and is the source of truth for any agent continuing the work.

Current state lives on the canonical page. For the live chat services state (stack, auth, rooms, roles, provisioning), see Chat Services. This page is the chat security runbook and the decisions / history archive.

TL;DR for the next engineer

Phase 2 origin lock LANDED on 2026-05-27 via mTLS. chat.thephenom.app is Cloudflare-proxied (orange-cloud) with SSL Full (encrypted CF to origin), login/register rate-limiting, and a locked origin: the ALB 443 listener runs mutual TLS (mutual_authentication=verify), so a direct raw-ALB request with no client cert is rejected (verified: curl to the raw ALB on 443 reports HTTP code 000, meaning no HTTP response was received). The edge-auth Worker was dropped for chat because a Worker could not reliably present a client cert on the Pro plan, so chat is now a plain Cloudflare proxy to the ALB and the origin enforces Cognito (Synapse jwt_audiences + Hasura JWT). The Phase-0 WAF stays in COUNT. Residuals: the ALB :80 listener is still open (HTTP bypass), SSL is Full rather than Full-Strict, and the mTLS/AOP stack is CLI-applied, not yet in Terraform.

Phase 2: origin lock LANDED via mTLS (2026-05-27)

The raw-ALB bypass is closed on 443. chat and api serve normally through Cloudflare. This followed two failed lock attempts (IP-allowlisting, then a Worker mTLS binding); see History.

Verified live

Probe | Result | Meaning
chat.thephenom.app/_matrix/client/versions (via CF) | 200, cf-ray, server: cloudflare | proxied and serving
raw ALB :443 direct (no client cert) | connection rejected (curl exit, HTTP 000) | mTLS lock closes the bypass
chat + api through Cloudflare | normal | SSL Full (encrypted origin)

Stability: chat succeeds on ~19/20 requests (about 5% transient 525, Cloudflare's SSL-handshake-failed error, likely AOP/mTLS edge propagation still settling; Matrix clients retry). api 5/5. Monitor; expected to settle.
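The probe results and the stability figure above can be turned into a small, repeatable check. The following is a minimal sketch, not the tooling actually used: the `Probe` type, the outcome labels, and the function names are hypothetical. It assumes each probe has already been reduced to curl's `%{http_code}` value (000 becomes the integer 0) plus whether a cf-ray header was present.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Probe:
    """One probe outcome (hypothetical shape, for illustration only)."""
    http_code: int        # curl's %{http_code}; 0 means no HTTP response (printed as 000)
    via_cloudflare: bool  # True when the response carried a cf-ray header


def classify(probe: Probe) -> str:
    """Map a probe onto the outcomes described in the 'Verified live' table."""
    if probe.http_code == 0:
        # No HTTP response at all: the expected result for the raw ALB without a client cert.
        return "rejected"
    if probe.http_code == 525:
        # Cloudflare could not complete the TLS handshake with the origin.
        return "edge-handshake-failure"
    if probe.via_cloudflare and probe.http_code == 200:
        return "proxied-ok"
    return "unexpected"


def success_ratio(outcomes: Iterable[str]) -> float:
    """Fraction of outcomes that are 'proxied-ok'."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to summarise")
    return sum(1 for o in outcomes if o == "proxied-ok") / len(outcomes)


if __name__ == "__main__":
    # 19 good responses and one transient 525, as in the stability note.
    sample = [Probe(200, True)] * 19 + [Probe(525, True)]
    print(success_ratio(classify(p) for p in sample))
```

Anything classified as "unexpected" (for example a 200 from the raw ALB without a cf-ray header) would indicate that the bypass has reopened and deserves a closer look.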

Note: with the Worker removed, /v1/graphql with no bearer no longer returns a 401 at the edge (it returns 200 at the HTTP layer). Data is still protected: Hasura enforces the Cognito JWT in the GraphQL layer at the origin. The tradeoff of dropping the Worker is the loss of edge-rejection of invalid traffic, not the loss of authorization.

Architecture (post-Worker)

flowchart LR
  subgraph Clients
    MOB["Mobile app<br/>Matrix JWT login"]
    SPA["nest / dev-nest SPA<br/>Cognito bearer"]
  end
  MOB --> CF
  SPA --> CF
  CF["Cloudflare edge (orange-cloud)<br/>DDoS, WAF, rate-limit<br/>SSL Full"] -->|"AOP: presents client cert<br/>(clientAuth EKU leaf)"| ALB
  ALB[("phenom-prod-alb :443<br/>mutual_authentication = verify<br/>trust store b7481113")] -->|"mTLS verified"| ORI["Synapse + Hasura + MCP<br/>origin enforces Cognito<br/>(Synapse jwt_audiences, Hasura JWT)"]
  RAW(["raw ALB direct<br/>no client cert"]) -.->|"rejected (000)"| ALB

What made the lock work (after two failed attempts)

  1. SSL was Flexible (CF to origin over HTTP/80): the root cause of earlier 301 loops, and it meant the 443 mTLS did not even cover proxy paths. Set SSL=Full via a per-hostname Configuration Rule (http_config_settings) scoped to chat.thephenom.app + api.thephenom.app. Now CF to origin is HTTPS/443 (origin pulls encrypted).
  2. Dropped the edge-auth Worker for chat (M decision): removed the chat.thephenom.app/* worker route, so chat is a plain CF proxy to the ALB. The origin still enforces Cognito (Synapse jwt_audiences + Hasura JWT), so “only our clients get data” holds. (The Worker mTLS binding could not reliably present a client cert, so that path was abandoned.)
  3. Per-hostname Authenticated Origin Pulls (AOP) on chat + api presenting a leaf cert. Critical fix: the original leaf had no extendedKeyUsage; AWS ALB mTLS requires the clientAuth EKU, so the leaf was regenerated with extendedKeyUsage=clientAuth (same CA). Validated on a temporary ALB :8443 test listener (curl --cert) before flipping prod 443.
  4. ALB trust store phenom-prod-chat-origin-ts (b7481113) = the CA. Flipped 443 mutual_authentication=verify via aws elbv2 modify-listener (CLI, for instant rollback).
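Step 3 above hinges on the leaf cert carrying the clientAuth EKU. The sketch below shows one way the re-signing and the test-listener check could be scripted; it only builds the command lines and does not run them. All file names, the validity period, and the helper names are hypothetical, and the exact commands used for the real leaf are not recorded on this page.

```python
from typing import List

# Contents of an OpenSSL extension file that adds the clientAuth EKU.
# With -extfile and no -extensions, OpenSSL reads the unnamed default section.
CLIENT_AUTH_EXTFILE = "extendedKeyUsage = clientAuth\n"


def sign_leaf_cmd(csr: str, ca_cert: str, ca_key: str, out: str,
                  extfile: str, days: int = 365) -> List[str]:
    """Build an `openssl x509 -req` command that signs a CSR with the existing CA."""
    return [
        "openssl", "x509", "-req",
        "-in", csr,
        "-CA", ca_cert, "-CAkey", ca_key, "-CAcreateserial",
        "-out", out,
        "-days", str(days),
        "-extfile", extfile,
    ]


def inspect_cert_cmd(cert: str) -> List[str]:
    """Build a command that prints the cert so the EKU can be checked by eye."""
    return ["openssl", "x509", "-in", cert, "-noout", "-text"]


def probe_with_cert_cmd(host: str, port: int, cert: str, key: str) -> List[str]:
    """Build a curl command that presents the leaf to a test listener."""
    return ["curl", "--cert", cert, "--key", key, f"https://{host}:{port}/"]


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of executing them.
    print(" ".join(sign_leaf_cmd("leaf.csr", "ca.pem", "ca.key", "leaf.pem", "eku.cnf")))
    print(" ".join(inspect_cert_cmd("leaf.pem")))
    print(" ".join(probe_with_cert_cmd("alb.example.invalid", 8443, "leaf.pem", "leaf.key")))
```

In the printed certificate text, the "X509v3 Extended Key Usage" section should list "TLS Web Client Authentication"; its absence corresponds to the no-EKU leaf that the ALB rejected.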

Status

Item | State
Cloudflare proxy on chat.thephenom.app | LIVE
SSL Full (encrypted CF to origin) on chat + api | LIVE (was Flexible)
Login/register rate-limiting (15 req / 10s per IP) | LIVE
Locked origin (ALB 443 mTLS; raw-ALB bypass closed) | LIVE
Origin enforces Cognito (Synapse jwt_audiences + Hasura JWT) | LIVE
WAF (phenom-prod-chat-protect) | COUNT (defense-in-depth, unchanged)
Edge-auth Worker on chat path | REMOVED (tradeoff for the lock on the Pro plan)
ALB :80 listener (HTTP bypass) | OPEN (residual; close it)
SSL Full to Full-Strict | PENDING
Private admin plane + CI reroute | PENDING (Phase 2e)
HasuraAdminPaths COUNT to BLOCK | PENDING (after CI reroute)
Synapse-native hardening | PENDING (Phase 3)
Mobile real-client validation | PENDING Q (PhenomApp#89)

Residual gaps (follow-ups)

  • ALB :80 listener still open: raw HTTP to the ALB bypasses the 443 mTLS lock (chat:80 returns a 301 redirect; api:80 reaches Hasura, but Hasura enforces the JWT, so no unauthenticated data is served). Close or restrict the :80 listener.
  • SSL Full, not Full-Strict: Cloudflare does not yet validate the origin cert identity. Upgrade to Full-Strict.
  • Key hygiene: rotate and secure the CA + leaf keys (currently in /tmp on the ops box; move to pass / secrets management).
  • Cleanup: delete chat-origin.thephenom.ai + the old no-EKU worker cert 4ec053a9 + the now-routeless worker, or repurpose.
  • Terraform codification: the trust store, listener mTLS, and AOP (via the Cloudflare provider) are CLI/API-applied; codify them.
  • CI reroute: the prod Hasura deploy is separately broken on a missing HASURA_ENDPOINT GH var.

Live artifacts

  • AWS: ALB trust store phenom-prod-chat-origin-ts (b7481113), CA stored in s3://phenom-prod-alb-mtls-657033058608. AOP leaf cert ce5140c2 (with clientAuth EKU) on chat + api. ALB 443 listener mutual_authentication=verify.
  • Cloudflare: per-hostname Configuration Rule ssl=full (http_config_settings) on chat + api; AOP enabled on chat + api; login/register rate-limit rule on chat.thephenom.app.
  • Unused / cleanup: chat-edge-auth-canary Worker now has no routes (dropped); old no-EKU mTLS cert 4ec053a9 and the chat-origin.thephenom.ai grey record are unused.

Rollback runbook

  • mTLS lock (instant): run aws elbv2 modify-listener to set mutual_authentication mode=off on the ALB 443 listener.
  • CF cutover (instant): set chat.thephenom.app DNS proxied=false (back to grey-cloud, direct to the ALB; TTL is 60s) and remove any worker route.
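As a sketch of the mTLS rollback (and the matching re-lock), the helper below assembles the `aws elbv2 modify-listener` command line without executing it. The ARNs shown are placeholders, not the real prod listener or trust store ARNs, and the helper name is hypothetical.

```python
from typing import List, Optional

VALID_MODES = {"off", "passthrough", "verify"}


def modify_listener_mtls_cmd(listener_arn: str, mode: str,
                             trust_store_arn: Optional[str] = None) -> List[str]:
    """Build the AWS CLI command that sets a listener's mutual-authentication mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    spec = f"Mode={mode}"
    if mode == "verify":
        # verify mode needs the trust store that holds the CA.
        if not trust_store_arn:
            raise ValueError("verify mode requires a trust store ARN")
        spec += f",TrustStoreArn={trust_store_arn}"
    return [
        "aws", "elbv2", "modify-listener",
        "--listener-arn", listener_arn,
        "--mutual-authentication", spec,
    ]


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    listener = "arn:aws:elasticloadbalancing:us-east-1:111122223333:listener/app/example/aaaa/bbbb"
    trust_store = "arn:aws:elasticloadbalancing:us-east-1:111122223333:truststore/example/cccc"
    print(" ".join(modify_listener_mtls_cmd(listener, "off")))                   # rollback
    print(" ".join(modify_listener_mtls_cmd(listener, "verify", trust_store)))  # re-lock
```

Building the command as a list keeps the rollback reviewable before it is run, for example by passing the list to `subprocess.run` only after a human has confirmed the listener ARN.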

Cutover gotcha: Cloudflare caches GET 404s aggressively. After any route or origin change, purge the Cloudflare cache (purge by URL) or you will see stale 404s on /_matrix/client/versions and similar.
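A purge by URL can be issued through the Cloudflare API's zone purge_cache endpoint. The sketch below only constructs the request; the zone ID and API token are placeholders that must be supplied by the operator, and the helper name is hypothetical. Sending it (for example with `urllib.request.urlopen`) is left out on purpose.

```python
import json
import urllib.request
from typing import Iterable

CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def build_purge_request(zone_id: str, api_token: str,
                        urls: Iterable[str]) -> urllib.request.Request:
    """Build (but do not send) a purge-by-URL request for the given zone."""
    body = json.dumps({"files": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{CLOUDFLARE_API}/zones/{zone_id}/purge_cache",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # ZONE_ID and API_TOKEN are placeholders, not real values.
    req = build_purge_request(
        "ZONE_ID", "API_TOKEN",
        ["https://chat.thephenom.app/_matrix/client/versions"],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```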


History and background

The sections below document the incident trail and the original exposure / Phase-0 work that preceded the live state above.

2026-05-28: public Synapse-Admin UI exposure closed

The ALB catch-all was serving the Synapse-Admin login UI publicly at the chat.thephenom.app root. Closed 2026-05-28: the chat-host catch-all now returns HTTP 302 to https://try.thephenom.app (random visitors are sent to the app download), and /_synapse/admin returns 404. Client paths (/_matrix, /_synapse client routes, /v1, /mcp) are unaffected. The admin plane is now internal-only; a CF-Access-gated admin path is the follow-up. Registration is disabled; only valid prod-Cognito clients (NEST web + Phenom mobile) authenticate. Verified live 2026-05-28 (root 302; /_synapse/admin 404; client paths 200).

Superseded chat-architecture decisions

  • Dual-implementation A/B (Implementation A Matrix/Synapse vs Implementation B Hasura Lite): retired. Chat is Matrix/Synapse only; Implementation B (Hasura Lite chat) is excised.
  • Three-room model (Internal / Partners / Community): superseded by four rooms (The Red Room, Staff, Analysts, Experiencers). The aliases #internal / #partners / #community are legacy and do not match the room names. See Chat Services.
  • Edge-auth Worker: dropped (the origin enforces Cognito).

Incident: origin lock, two failed attempts then mTLS landed (2026-05-27)

The origin lock took three attempts:

  1. IP-allowlisting the ALB SG to Cloudflare ranges (rolled back). Applied enable_cloudflare_origin_lock=true and it broke api.thephenom.app (~2 to 3 minute outage): Cloudflare’s proxy origin-pull egress is not reliably within the published CF IP ranges. The IPv4 and IPv6 allowlists exactly matched the official ranges and the ALB is IPv4-only, yet api still broke. Rolled back via aws elbv2 set-security-groups; orphaned alb_cf_lock SG + bridge destroyed. IP-allowlisting is ruled out (the lock must be IP-independent).
  2. Worker mTLS client-cert binding (abandoned). The Worker could not reliably present a client cert on the Pro plan.
  3. mTLS / AOP (landed). Root causes of the earlier failures were found: SSL was Flexible (so 443 mTLS did not cover proxy paths and CF to origin was plaintext) and the leaf cert was missing the clientAuth EKU that AWS ALB mTLS requires. Fixed both, validated on a temporary ALB :8443 listener with curl --cert before flipping prod 443, and dropped the chat Worker. See What made the lock work.

The exposure (verified 2026-05-26)

  • chat.thephenom.app was a Cloudflare DNS-only record (grey cloud), pointing straight at an internet-facing AWS ALB phenom-prod-alb (id c6898169cbae5dad, us-east-1) fronting Synapse 1.105 + Hasura + the chat MCP. No Cloudflare proxy, no WAF, no rate-limiting.
  • The ALB’s default 443 listener action forwarded to the Hasura target group (phenom-prod-graphql-tg). The raw ALB DNS and any unmatched Host reached Hasura, including the admin endpoints /v2/query and /v1/metadata.

The out-of-zone origin (Worker era, superseded)

While the edge-auth Worker was in use, it ran on a zone hostname, and on the Pro plan a Worker forwards subrequests with the inbound Host header (resolveOverride is Enterprise-only). Forwarding to an in-zone origin therefore misrouted at the ALB. The workaround was an out-of-zone origin (chat-origin.thephenom.ai, a separate Cloudflare zone). The Worker was later dropped for chat, so this hop is unused (pending cleanup).

Council consensus (the original approach)

  1. Admin plane off the public internet is priority #1 (outranks DDoS).
  2. Cloudflare-front is theatre until the raw-ALB bypass is closed (now closed via mTLS).
  3. “Only our clients” is best-effort: a public Matrix server + a native mobile app cannot be hard-restricted; the origin enforces Cognito and Synapse enforces the Matrix token.
  4. Phased, COUNT-first, no big-bang, reversible.

Phase 0 WAF (deployed, still COUNT)

  • WAFv2 WebACL phenom-prod-chat-protect (REGIONAL, us-east-1), associated with the prod ALB. ARN ends /regional/webacl/phenom-prod-chat-protect/7b41a5d5-a2b1-4855-be1a-1aa7bc9627e4.
  • All rules in COUNT (observe-only): AWSManagedRulesCommonRuleSet, AWSManagedRulesKnownBadInputsRuleSet, RateLimitPerIP (2000 req / 5 min per IP), HasuraAdminPaths (/v2/query, /v1/metadata, /v1/query, /console).
  • The Hasura admin secret is in Secrets Manager phenom-prod-app-secrets key graphql_admin_secret.
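To illustrate what COUNT means for a custom rule and what the later COUNT to BLOCK promotion changes, here is a minimal sketch of a WAFv2 rate-based rule as a Python dict, shaped like the rule objects the WAFv2 API accepts. The priority, metric name, and helper names are assumptions; the deployed rule definitions are not reproduced on this page. The promotion helper applies only to custom rules with an Action; managed rule groups use OverrideAction instead.

```python
import copy
from typing import Any, Dict

Rule = Dict[str, Any]


def rate_limit_rule(name: str, limit: int, window_sec: int, priority: int) -> Rule:
    """Build a per-IP rate-based rule in COUNT (observe-only) mode."""
    return {
        "Name": name,
        "Priority": priority,  # assumed value; the real priority is not documented here
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit,
                "EvaluationWindowSec": window_sec,
                "AggregateKeyType": "IP",
            }
        },
        "Action": {"Count": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,  # assumed to match the rule name
        },
    }


def promote_to_block(rule: Rule) -> Rule:
    """Return a copy of a COUNT rule with its action switched to BLOCK."""
    if "Count" not in rule.get("Action", {}):
        raise ValueError("rule is not in COUNT mode")
    promoted = copy.deepcopy(rule)
    promoted["Action"] = {"Block": {}}
    return promoted


if __name__ == "__main__":
    # 2000 requests per 5 minutes per IP, as listed for RateLimitPerIP.
    rule = rate_limit_rule("RateLimitPerIP", limit=2000, window_sec=300, priority=3)
    print(rule["Action"], promote_to_block(rule)["Action"])
```

Returning a copy rather than mutating in place keeps the COUNT definition available for a quick revert, in line with the phased, reversible approach.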

Phase 3 (planned, Synapse-native)

Disable/lock registration (rotate the registration shared secret), tune login/media rate limits, restrict URL-preview SSRF.

  • dev-nest + nest share the prod data layer (chat.thephenom.app Hasura + prod Cognito pool us-east-1_knEL7cqS3). Chat is Matrix/Synapse only; lists/teams/sharing remain on Hasura via nest-api.
  • Role-based access uses the team_members table (staff / analysts) with team-scoped list sharing enforced in Hasura.