Chat Server Security Hardening (chat.thephenom.app)
Current state lives on the canonical page. For the live chat services state (stack, auth, rooms, roles, provisioning), see Chat Services. This page is the chat security runbook and the decisions / history archive.
TL;DR for the next engineer
Phase 2 origin lock LANDED on 2026-05-27 via mTLS. chat.thephenom.app is Cloudflare-proxied
(orange-cloud) with SSL Full (encrypted CF to origin), login/register rate-limiting, and a
locked origin: the ALB 443 listener runs mutual TLS (mutual_authentication=verify), so a
direct raw-ALB request with no client cert is rejected (verified: curl to the raw ALB on 443
returns 000). The edge-auth Worker was dropped for chat (the Pro plan could not present a
client cert from a Worker), so chat is now a plain Cloudflare proxy to the ALB and the
origin enforces Cognito (Synapse jwt_audiences + Hasura JWT). The Phase-0 WAF stays in
COUNT. Residuals: the ALB :80 listener is still open (HTTP bypass), SSL is Full not
Full-Strict, and the mTLS/AOP stack is CLI-applied, not yet in Terraform.
Phase 2: origin lock LANDED via mTLS (2026-05-27)
The raw-ALB bypass is closed on 443. chat and api serve normally through Cloudflare. This
followed two failed lock attempts (IP-allowlisting, then a Worker mTLS binding); see
History.
Verified live
| Probe | Result | Meaning |
|---|---|---|
| `chat.thephenom.app/_matrix/client/versions` (via CF) | `200`, `cf-ray`, `server: cloudflare` | proxied and serving |
| raw ALB :443 direct (no client cert) | connection rejected (curl exit, HTTP 000) | mTLS lock closes the bypass |
| chat + api through Cloudflare | normal | SSL Full (encrypted origin) |
Stability: chat ~19/20 (about 5% transient 525, likely AOP/mTLS edge propagation still
settling; Matrix clients retry). api 5/5. Monitor; expected to settle.
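The probe results and the stability figure can be read mechanically. The following is a minimal Python sketch, assuming each probe is collected with `curl -s -o /dev/null -w '%{http_code}'`, which prints `000` when no HTTP response was received. The function names and the category labels are illustrative and not part of any existing tooling for this page.

```python
def classify_probe(http_code: str, via_cloudflare: bool) -> str:
    """Interpret one curl '%{http_code}' value for the probes listed above."""
    if http_code == "000":
        # curl prints 000 when it never received an HTTP response,
        # which is what a rejected raw-ALB connection looks like.
        return "edge unreachable" if via_cloudflare else "blocked"
    if http_code == "525":
        # Cloudflare's "SSL handshake failed" between edge and origin.
        return "transient origin handshake failure"
    if http_code.startswith("2"):
        # A 2xx straight from the raw ALB would mean the bypass is open again.
        return "serving" if via_cloudflare else "BYPASS OPEN"
    return "unexpected"


def success_rate(http_codes: list) -> float:
    """Fraction of probes that returned a 2xx status."""
    ok = sum(1 for code in http_codes if code.startswith("2"))
    return ok / len(http_codes)
```

With 19 `200` responses and one `525`, `success_rate` gives 0.95, which matches the roughly 5% transient failure rate noted for chat.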
Note: with the Worker removed, `/v1/graphql` with no bearer no longer returns a `401` at the edge (it returns `200` at the HTTP layer). Data is still protected: Hasura enforces the Cognito JWT in the GraphQL layer at the origin. The tradeoff of dropping the Worker is the loss of edge-rejection of invalid traffic, not the loss of authorization.
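Because the HTTP status is `200` either way, a probe of `/v1/graphql` has to inspect the response body rather than the status code. The sketch below assumes the usual GraphQL-over-HTTP convention that a refused request carries a top-level `errors` array. The exact error payload Hasura returns is not recorded on this page, so the function name and the sample bodies are illustrative.

```python
import json


def graphql_request_rejected(status: int, body: str) -> bool:
    """True when the request was refused at either the HTTP or the GraphQL layer."""
    if status >= 400:
        return True  # refused at the HTTP layer (the old edge behaviour)
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # not a GraphQL response at all; treat as not served
    if not isinstance(payload, dict):
        return True
    # A 200 can still carry a top-level "errors" array instead of data.
    return bool(payload.get("errors"))
```

A monitoring check for the no-bearer case would then assert that this returns `True` even though the status is `200`.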
Architecture (post-Worker)
flowchart LR
subgraph Clients
MOB["Mobile app<br/>Matrix JWT login"]
SPA["nest / dev-nest SPA<br/>Cognito bearer"]
end
MOB --> CF
SPA --> CF
CF["Cloudflare edge (orange-cloud)<br/>DDoS, WAF, rate-limit<br/>SSL Full"] -->|"AOP: presents client cert<br/>(clientAuth EKU leaf)"| ALB
ALB[("phenom-prod-alb :443<br/>mutual_authentication = verify<br/>trust store b7481113")] -->|"mTLS verified"| ORI["Synapse + Hasura + MCP<br/>origin enforces Cognito<br/>(Synapse jwt_audiences, Hasura JWT)"]
RAW(["raw ALB direct<br/>no client cert"]) -.->|"rejected (000)"| ALB
What made the lock work (after two failed attempts)
- SSL was Flexible (CF to origin over HTTP/80): the root cause of earlier `301` loops, and it meant the 443 mTLS did not even cover proxy paths. Set SSL=Full via a per-hostname Configuration Rule (`http_config_settings`) scoped to `chat.thephenom.app` + `api.thephenom.app`. Now CF to origin is HTTPS/443 (origin pulls encrypted).
- Dropped the edge-auth Worker for chat (M decision): removed the `chat.thephenom.app/*` worker route, so chat is a plain CF proxy to the ALB. The origin still enforces Cognito (Synapse `jwt_audiences` + Hasura JWT), so “only our clients get data” holds. (The Worker mTLS binding could not reliably present a client cert, so that path was abandoned.)
- Per-hostname Authenticated Origin Pulls (AOP) on chat + api presenting a leaf cert. Critical fix: the original leaf had no `extendedKeyUsage`; AWS ALB mTLS requires the `clientAuth` EKU, so the leaf was regenerated with `extendedKeyUsage=clientAuth` (same CA). Validated on a temporary ALB :8443 test listener (`curl --cert`) before flipping prod 443.
- ALB trust store `phenom-prod-chat-origin-ts` (b7481113) = the CA. Flipped 443 `mutual_authentication=verify` via `aws elbv2 modify-listener` (CLI, for instant rollback).
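The EKU fix and the test-listener validation can be sketched as follows. This assumes the leaf is signed with `openssl x509 -req` from an existing CSR. The page does not record the commands actually used, so every file name, the 365-day validity, and the extra `basicConstraints` / `keyUsage` lines are assumptions for illustration.

```python
def client_leaf_extfile() -> str:
    """OpenSSL extension file that marks the leaf as a TLS client certificate."""
    return "\n".join([
        "basicConstraints = CA:FALSE",     # assumption: leaf is not a CA
        "keyUsage = digitalSignature",     # assumption: typical for client auth
        "extendedKeyUsage = clientAuth",   # the EKU the original leaf lacked
    ]) + "\n"


def sign_leaf_argv(csr: str, ca_cert: str, ca_key: str,
                   ext_file: str, out: str, days: int = 365) -> list:
    """Build the openssl call that signs the CSR with the existing CA."""
    return ["openssl", "x509", "-req", "-in", csr,
            "-CA", ca_cert, "-CAkey", ca_key, "-CAcreateserial",
            "-extfile", ext_file, "-days", str(days), "-out", out]


def curl_mtls_probe_argv(url: str, cert: str, key: str) -> list:
    """Build a curl call that presents the leaf and prints only the status code."""
    return ["curl", "-sS", "-o", "/dev/null", "-w", "%{http_code}",
            "--cert", cert, "--key", key, url]
```

Write `client_leaf_extfile()` to the path passed as `ext_file`, run the signing argv with `subprocess.run(argv, check=True)`, then point the curl probe at the temporary test listener before touching the production listener.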
Status
| Item | State |
|---|---|
| Cloudflare proxy on `chat.thephenom.app` | LIVE |
| SSL Full (encrypted CF to origin) on chat + api | LIVE (was Flexible) |
| Login/register rate-limiting (15 req / 10s per IP) | LIVE |
| Locked origin (ALB 443 mTLS; raw-ALB bypass closed) | LIVE |
| Origin enforces Cognito (Synapse `jwt_audiences` + Hasura JWT) | LIVE |
| WAF (`phenom-prod-chat-protect`) | COUNT (defense-in-depth, unchanged) |
| Edge-auth Worker on chat path | REMOVED (tradeoff for the lock on the Pro plan) |
| ALB `:80` listener (HTTP bypass) | OPEN (residual; close it) |
| SSL Full to Full-Strict | PENDING |
| Private admin plane + CI reroute | PENDING (Phase 2e) |
| `HasuraAdminPaths` COUNT to BLOCK | PENDING (after CI reroute) |
| Synapse-native hardening | PENDING (Phase 3) |
| Mobile real-client validation | PENDING Q (PhenomApp#89) |
Residual gaps (follow-ups)
- ALB `:80` listener still open: raw HTTP to the ALB bypasses the 443 mTLS lock (chat on `:80` gets a `301` redirect; api on `:80` reaches Hasura, but Hasura enforces the JWT so no unauthenticated data is served). Close or restrict the `:80` listener.
- SSL Full, not Full-Strict: Cloudflare does not yet validate the origin cert identity. Upgrade to Full-Strict.
- Key hygiene: rotate and secure the CA + leaf keys (currently in `/tmp` on the ops box; move to `pass` / secrets management).
- Cleanup: delete `chat-origin.thephenom.ai` + the old no-EKU worker cert `4ec053a9` + the now-routeless worker, or repurpose.
- Terraform codification: the trust store, listener mTLS, and AOP (via the Cloudflare provider) are CLI/API-applied; codify them.
- CI reroute: the prod Hasura deploy is separately broken on a missing `HASURA_ENDPOINT` GH var.
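A gap like the open `:80` listener can be detected from the output of `aws elbv2 describe-listeners`. The sketch below assumes the JSON shape the AWS CLI prints (`Listeners[].Port` and `Listeners[].MutualAuthentication.Mode`). The function name and the sample data are illustrative, not taken from the live ALB.

```python
def unlocked_listener_ports(describe_output: dict) -> list:
    """Ports whose listener does not enforce mTLS in verify mode."""
    ports = []
    for listener in describe_output.get("Listeners", []):
        # Listeners without the block (e.g. plain HTTP) count as "off".
        mode = listener.get("MutualAuthentication", {}).get("Mode", "off")
        if mode != "verify":
            ports.append(listener["Port"])
    return sorted(ports)


# Made-up sample mirroring the state described above: 443 locked, 80 open.
sample = {
    "Listeners": [
        {"Port": 443, "Protocol": "HTTPS",
         "MutualAuthentication": {"Mode": "verify"}},
        {"Port": 80, "Protocol": "HTTP"},
    ]
}
```

For the sample, `unlocked_listener_ports(sample)` returns `[80]`. An empty list would indicate that every listener on the ALB requires a client cert.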
Live artifacts
- AWS: ALB trust store `phenom-prod-chat-origin-ts` (b7481113), CA stored in `s3://phenom-prod-alb-mtls-657033058608`. AOP leaf cert `ce5140c2` (with `clientAuth` EKU) on chat + api. ALB 443 listener `mutual_authentication=verify`.
- Cloudflare: per-hostname Configuration Rule `ssl=full` (`http_config_settings`) on chat + api; AOP enabled on chat + api; login/register rate-limit rule on `chat.thephenom.app`.
- Unused / cleanup: `chat-edge-auth-canary` Worker now has no routes (dropped); old no-EKU mTLS cert `4ec053a9` and the `chat-origin.thephenom.ai` grey record are unused.
Rollback runbook
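- mTLS lock (instant): use `aws elbv2 modify-listener` to set the ALB 443 `mutual_authentication` to `mode=off`.
- CF cutover (instant): set `chat.thephenom.app` DNS `proxied=false` (back to grey-cloud, direct to the ALB; TTL is `60s`) and remove any worker route. Roll back the mTLS lock first: with `verify` still on, clients that reach the ALB directly present no client cert and are rejected.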
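The two rollback steps can be prepared ahead of time. This is a minimal sketch that only builds the calls, assuming the AWS CLI shorthand `--mutual-authentication Mode=...` and the Cloudflare v4 DNS-record endpoint. The listener ARN, zone ID, record ID, and API token are placeholders that do not appear on this page.

```python
import json
import urllib.request
from typing import Optional

CF_API = "https://api.cloudflare.com/client/v4"


def mtls_mode_argv(listener_arn: str, mode: str,
                   trust_store_arn: Optional[str] = None) -> list:
    """Build the aws CLI call that sets the listener's mutual-authentication mode."""
    if mode not in ("off", "passthrough", "verify"):
        raise ValueError(f"unknown mTLS mode: {mode}")
    setting = f"Mode={mode}"
    if mode == "verify":
        if not trust_store_arn:
            raise ValueError("verify mode needs a trust store ARN")
        setting += f",TrustStoreArn={trust_store_arn}"
    return ["aws", "elbv2", "modify-listener",
            "--listener-arn", listener_arn,
            "--mutual-authentication", setting]


def grey_cloud_request(zone_id: str, record_id: str,
                       api_token: str) -> urllib.request.Request:
    """Build (but do not send) the Cloudflare call that sets proxied=false."""
    return urllib.request.Request(
        f"{CF_API}/zones/{zone_id}/dns_records/{record_id}",
        data=json.dumps({"proxied": False}).encode(),
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
        method="PATCH",
    )
```

Run the argv with `subprocess.run(argv, check=True)` and send the request with `urllib.request.urlopen(req)`. Keeping the builders free of side effects lets the rollback be reviewed before anything is executed.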
Cutover gotcha: Cloudflare caches GET 404s aggressively. After any route or origin change,
purge the Cloudflare cache (purge by URL) or you will see stale 404s on
/_matrix/client/versions and similar.
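Purging by URL can be scripted against the Cloudflare v4 `purge_cache` endpoint, which accepts a `files` list. The sketch below only builds the request. The zone ID and API token are placeholders, and the URL list is an example based on the path named above.

```python
import json
import urllib.request

CF_API = "https://api.cloudflare.com/client/v4"


def purge_urls_request(zone_id: str, api_token: str,
                       urls: list) -> urllib.request.Request:
    """Build (but do not send) a purge-by-URL request for the given zone."""
    if not urls:
        raise ValueError("purge-by-URL needs at least one URL")
    return urllib.request.Request(
        f"{CF_API}/zones/{zone_id}/purge_cache",
        data=json.dumps({"files": list(urls)}).encode(),
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


# Example target: the client path that tends to show stale 404s.
STALE_404_CANDIDATES = [
    "https://chat.thephenom.app/_matrix/client/versions",
]
```

Send the request with `urllib.request.urlopen(req)` after each route or origin change, then re-probe the same URLs to confirm the stale responses are gone.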
History and background
The sections below document the incident trail and the original exposure / Phase-0 work that preceded the live state above.
2026-05-28: public Synapse-Admin UI exposure closed
The ALB catch-all was serving the Synapse-Admin login UI publicly at the chat.thephenom.app
root. Closed 2026-05-28: the chat-host catch-all now returns HTTP 302 to
https://try.thephenom.app (random visitors are sent to the app download), and
/_synapse/admin returns 404. Client paths (/_matrix, /_synapse client routes, /v1,
/mcp) are unaffected. The admin plane is now internal-only; a CF-Access-gated admin path is the
follow-up. Registration is disabled; only valid prod-Cognito clients (NEST web + Phenom mobile)
authenticate. Verified live 2026-05-28 (root 302; /_synapse/admin 404; client paths 200).
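The verified state can be turned into a small regression check. The sketch below assumes the status codes are gathered without following redirects (so the root returns `302` and not the status of the redirect target). The dictionary and function names are illustrative.

```python
# Expected status per path, taken from the verification noted above.
EXPECTED_STATUS = {
    "/": 302,                         # catch-all redirects to the app download
    "/_synapse/admin": 404,           # admin plane no longer public
    "/_matrix/client/versions": 200,  # client paths unaffected
}


def exposure_regressions(observed: dict) -> list:
    """Paths whose observed status differs from the verified state."""
    return sorted(
        path for path, want in EXPECTED_STATUS.items()
        if observed.get(path) != want
    )
```

An empty result means the closure still holds. A path missing from `observed` is reported as a regression, so an incomplete probe run cannot pass silently.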
Superseded chat-architecture decisions
- Dual-implementation A/B (Implementation A Matrix/Synapse vs Implementation B Hasura Lite): retired. Chat is Matrix/Synapse only; Implementation B (Hasura Lite chat) is excised.
- Three-room model (Internal / Partners / Community): superseded by four rooms (The Red Room, Staff, Analysts, Experiencers). The aliases `#internal` / `#partners` / `#community` are legacy and do not match the room names. See Chat Services.
- Edge-auth Worker: dropped (the origin enforces Cognito).
Incident: origin lock, two failed attempts then mTLS landed (2026-05-27)
The origin lock took three attempts:
- IP-allowlisting the ALB SG to Cloudflare ranges (rolled back). Applied `enable_cloudflare_origin_lock=true` and it broke `api.thephenom.app` (~2 to 3 minute outage): Cloudflare’s proxy origin-pull egress is not reliably within the published CF IP ranges. The IPv4 and IPv6 allowlists exactly matched the official ranges and the ALB is IPv4-only, yet `api` still broke. Rolled back via `aws elbv2 set-security-groups`; orphaned `alb_cf_lock` SG + bridge destroyed. IP-allowlisting is ruled out (the lock must be IP-independent).
- Worker mTLS client-cert binding (abandoned). The Worker could not reliably present a client cert on the Pro plan.
- mTLS / AOP (landed). Root causes of the earlier failures were found: SSL was Flexible (so 443 mTLS did not cover proxy paths and CF to origin was plaintext) and the leaf cert was missing the `clientAuth` EKU that AWS ALB mTLS requires. Fixed both, validated on a temporary ALB `:8443` listener with `curl --cert` before flipping prod 443, and dropped the chat Worker. See What made the lock work.
The exposure (verified 2026-05-26)
- `chat.thephenom.app` was a Cloudflare DNS-only record (grey cloud), pointing straight at an internet-facing AWS ALB `phenom-prod-alb` (id `c6898169cbae5dad`, us-east-1) fronting Synapse 1.105 + Hasura + the chat MCP. No Cloudflare proxy, no WAF, no rate-limiting.
- The ALB’s default 443 listener action forwarded to the Hasura target group (`phenom-prod-graphql-tg`). The raw ALB DNS and any unmatched Host reached Hasura, including the admin endpoints `/v2/query` and `/v1/metadata`.
The out-of-zone origin (Worker era, superseded)
While the edge-auth Worker was in use, a Cloudflare Worker on a zone hostname forwarded subrequests
with the inbound Host header on the Pro plan (resolveOverride is Enterprise-only), so forwarding
to an in-zone origin misrouted at the ALB. The workaround was an out-of-zone origin
(chat-origin.thephenom.ai, a separate Cloudflare zone). The Worker was later dropped for chat, so
this hop is unused (pending cleanup).
Council consensus (the original approach)
- Admin plane off the public internet is priority #1 (outranks DDoS).
- Cloudflare-front is theatre until the raw-ALB bypass is closed (now closed via mTLS).
- “Only our clients” is best-effort: a public Matrix server + a native mobile app cannot be hard-restricted; the origin enforces Cognito and Synapse enforces the Matrix token.
- Phased, COUNT-first, no big-bang, reversible.
Phase 0 WAF (deployed, still COUNT)
- WAFv2 WebACL `phenom-prod-chat-protect` (REGIONAL, us-east-1), associated with the prod ALB. ARN ends `/regional/webacl/phenom-prod-chat-protect/7b41a5d5-a2b1-4855-be1a-1aa7bc9627e4`.
- All rules in COUNT (observe-only): `AWSManagedRulesCommonRuleSet`, `AWSManagedRulesKnownBadInputsRuleSet`, `RateLimitPerIP` (2000 req / 5 min per IP), `HasuraAdminPaths` (`/v2/query`, `/v1/metadata`, `/v1/query`, `/console`).
- The Hasura admin secret is in Secrets Manager `phenom-prod-app-secrets` key `graphql_admin_secret`.
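Before `HasuraAdminPaths` moves from COUNT to BLOCK, it helps to preview which request paths the rule would catch. The sketch below assumes the rule matches each listed path exactly or as a path prefix. The real match type of the WAF rule is not stated on this page, so treat the matching logic as an assumption and compare the results against the COUNT logs.

```python
# Paths listed for the HasuraAdminPaths rule above.
ADMIN_PATHS = ("/v2/query", "/v1/metadata", "/v1/query", "/console")


def would_block(request_path: str) -> bool:
    """True if the path falls under one of the admin paths (assumed prefix match)."""
    path = request_path.split("?", 1)[0]  # ignore any query string
    return any(
        path == admin or path.startswith(admin + "/")
        for admin in ADMIN_PATHS
    )
```

Under this assumption the client-facing `/v1/graphql` and `/_matrix` paths stay unaffected, while `/v1/metadata` and anything under `/console/` would be blocked.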
Phase 3 (planned, Synapse-native)
Disable/lock registration (rotate the registration shared secret), tune login/media rate limits, restrict URL-preview SSRF.
Related context
- dev-nest + nest share the prod data layer (`chat.thephenom.app` Hasura + prod Cognito pool `us-east-1_knEL7cqS3`). Chat is Matrix/Synapse only; lists/teams/sharing remain on Hasura via nest-api.
- Role-based access uses the `team_members` table (staff / analysts) with team-scoped list sharing enforced in Hasura.