Alert Lookup Index
This is the searchable runbook for the entire alerting estate. When an alert lands, search this page for its title text. Each entry explains the trigger condition, the severity, and the response procedure. Last full audit: 2026-06-08 (KPI alert delivery + multi-tenant topic routing).
How alerts reach you
Ghost Mode alerter ─┐
Asset monitor      ─┤
KPI adapter        ─┴──► alerts.sanmarcsoft.com (self-hosted ntfy)
                           │
                           ├─► universal-exports (operator topic: EVERY asset + EVERY KPI)
                           ├─► ghostmode-alerts (Phenom client topic: Phenom assets + Phenom KPIs)
                           ├─► kpi-alerts (scoped KPI feed: all KPI rules)
                           └─► phenom-* (per-asset team topics, mirrored to ghostmode-alerts)
                           │
                           upstream-base-url → ntfy.sh APNS → iOS (instant, audible)
Topic model (M directive, 2026-06-08; multi-tenant split):
- universal-exports is the operator topic: every alert from every source and every KPI rule lands here. One subscription covers the whole estate.
- ghostmode-alerts is the Phenom client topic: Phenom asset alerts and Phenom-scoped KPI alerts only. SanMarcSoft company KPIs (MRR, churn, signups, Sightengine cost, CI pass rate, bridge agents, exporters) are operator-only and never appear here. Routing is by the alert’s org label (phenom mirrors here; sanmarcsoft does not).
- kpi-alerts is the scoped KPI feed: all Prometheus KPI rules, regardless of owner.
- Per-asset phenom-* topics (phenom-www, phenom-nest, phenom-db-prod, …) fire for the team; all are mirrored to ghostmode-alerts.
- iOS delivery requires upstream-base-url: "https://ntfy.sh" in the ntfy server config (fixed 2026-06-04). Only a topic hash and message id transit ntfy.sh; content stays on our server.
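The topic model above can be read as a small routing function. The sketch below is illustrative only: the function name `topics_for` and its parameters are hypothetical and are not taken from the adapter or alerter code. Only the topic names and the org-label rule come from this page.

```python
from typing import List, Optional


def topics_for(org: str, is_kpi: bool, asset_topic: Optional[str] = None) -> List[str]:
    """Return the ntfy topics one alert would be published to (illustrative)."""
    topics = ["universal-exports"]         # operator topic: receives everything
    if is_kpi:
        topics.append("kpi-alerts")        # scoped KPI feed: every KPI rule
    if org == "phenom":
        topics.append("ghostmode-alerts")  # Phenom client topic
        if asset_topic:
            topics.append(asset_topic)     # per-asset team topic, e.g. phenom-www
    return topics
```

Under these assumptions, a SanMarcSoft KPI resolves to universal-exports and kpi-alerts only, while a Phenom asset alert also reaches ghostmode-alerts and its own phenom-* topic.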
Priority tiers
| Tier | ntfy priority | Tag | Meaning |
|---|---|---|---|
| P5 urgent | 5 (max, bypasses quiet hours) | 🚨 rotating_light | Active targeted attack or production asset down |
| P4 high | 4 (high) | ⚠️ warning | Hostile activity aggregated per source, worth a look |
| P3 digest | 3 (default) | 🔍 mag | Rollups, recoveries, volume caps - informational |
| P2 ops | 2 (low, badge-only) | 💓 heartbeat | Pipeline health, “all clear” heartbeat |
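For manual testing it can help to see how a tier maps onto an ntfy publish call. The sketch below builds (but does not send) an HTTP request using ntfy's Title, Priority, Tags and Click headers. The helper name `build_publish_request` is hypothetical, and authentication is deliberately left out; whether and how the server requires a token is not covered by this sketch.

```python
import urllib.request

# Tier table from this page: tier -> (ntfy priority, emoji tag).
TIERS = {
    "P5": (5, "rotating_light"),
    "P4": (4, "warning"),
    "P3": (3, "mag"),
    "P2": (2, "heartbeat"),
}

NTFY_BASE = "https://alerts.sanmarcsoft.com"


def build_publish_request(topic, tier, title, message, click=None):
    """Build a POST request for one alert; the caller decides whether to send it."""
    priority, tag = TIERS[tier]
    headers = {"Title": title, "Priority": str(priority), "Tags": tag}
    if click:
        headers["Click"] = click  # URL opened when the notification is tapped
    return urllib.request.Request(
        f"{NTFY_BASE}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would publish it; keep test messages clearly labelled so they match the `Q: ... test / [test]` pattern in the lookup table.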
Master lookup table
Search this table for the title text on the notification.
| Alert title (pattern) | Source | Tier | One-line meaning |
|---|---|---|---|
| <host>: CANARY CREDENTIAL from <ip> | Ghost Mode | P5 | Someone typed credentials into a honeypot |
| MULTI-DOMAIN recon: <ip> (N sites) | Ghost Mode | P5 | One actor probing 2+ of our domains - targeted recon |
| <Asset>: DOWN | Asset monitor | P5 | Production asset failing 2+ consecutive probes |
| <Asset>: RECOVERED | Asset monitor | P3 | Asset back up - closes the matching DOWN |
| <host>: N recon/blocked hits from <ip> | Ghost Mode | P4 | Burst of high-severity recon/WAF blocks from one IP |
| <host>: RECON <path> | Ghost Mode | P4 | Single recon probe (legacy single-event path) |
| <host>: BLOCKED | Ghost Mode | P4 | Single WAF block (legacy single-event path) |
| Alert volume capped | Ghost Mode | P3 | Flood suppression engaged - more sources were flagged than shown |
| Ghost Mode: all clear | Ghost Mode | P2 | Heartbeat. Its ABSENCE is the alarm |
| Q: ... test / [test] | Manual | any | An operator (usually Q) testing the pipeline |
Prometheus rules (visible in Grafana and, since 2026-06-08, pushed to ntfy; see “These rules now push”): EndpointDown, SlowResponse, HttpStatusError, SslCertExpiringSoon, SslCertExpiryCritical, SslCertExpired, BlackboxExporterDown, SanMarcSoftExporterDown, MrrDropped, HighChurnRate, CriticalChurnRate, NoSignups48h, SightengineQuotaWarning, SightengineQuotaCritical, AuthBruteForce, LowCIPassRate, CommsContentStale, CommsGenerationFailed, BridgeAgentDown, BridgeAgentHeartbeatFailing, BridgeHealthEndpointUnreachable.
Ghost Mode security alerts (topic: ghostmode-alerts)
Source: osint_surveillance_detector_repo/ghostmode/alerter.py. Tapping any alert opens the gated ops dashboard at https://nest-ops.thephenom.app/ops/ (the Click target is pinned there deliberately - never to an attacker-controlled host).
<host>: CANARY CREDENTIAL from <ip> - P5 urgent
Trigger: an actor submitted a username/password to a honeypot (OpenCanary) service.
Meaning: this is past scanning - someone is actively attempting access and believes the canary is real. Treat as an active, targeted attack.
Respond:
- Open the ops dashboard (tap the alert) and identify the source IP, country, ASN, and which canary service was hit.
- Block the IP (and ASN if it recurs) at Cloudflare WAF.
- Check whether the submitted credentials resemble any real ones; if an actual credential pattern appears, rotate it everywhere immediately.
- Review surrounding events from the same IP for lateral probing of real services.
- Preserve the OpenCanary logs for the incident record.
MULTI-DOMAIN recon: <ip> (N sites) - P5 urgent
Trigger: one source IP generated high-severity events on two or more owned domains (thephenom.app, sanmarcsoft.com, verifieddit.com, trusteddit.com) in the correlation window.
Meaning: not random internet noise - this actor knows the portfolio and is mapping it deliberately.
Respond:
- Open the ops dashboard; note IP, ASN, country, event count, and which domains.
- Block the IP at Cloudflare across all zones (the alert means per-domain blocking is insufficient).
- Check the canaries for interaction from the same source (recon often precedes credential attempts).
- If the ASN is a hosting provider and probing recurs from neighbours, consider an ASN-level WAF rule.
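The multi-domain trigger above amounts to grouping events by source IP and counting distinct owned domains. The sketch below shows one way to express that. It is not the alerter's actual code: the function names are hypothetical, and it assumes events have already been filtered to high severity and to the correlation window (whose length this page does not state).

```python
from collections import defaultdict

# Owned domains as listed in the trigger description.
OWNED_DOMAINS = ("thephenom.app", "sanmarcsoft.com", "verifieddit.com", "trusteddit.com")


def owned_domain(host):
    """Map a hostname to the owned domain it belongs to, or None."""
    for domain in OWNED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def multi_domain_sources(events, min_domains=2):
    """events: iterable of (ip, host) pairs.

    Returns {ip: sorted list of owned domains} for every IP seen on
    min_domains or more distinct owned domains.
    """
    seen = defaultdict(set)
    for ip, host in events:
        domain = owned_domain(host)
        if domain:
            seen[ip].add(domain)
    return {ip: sorted(domains) for ip, domains in seen.items() if len(domains) >= min_domains}
```

Note that two subdomains of the same owned domain count once, so an IP walking www and nest under thephenom.app does not trip this rule on its own.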
<host>: N recon/blocked hits from <ip> - P4 high
Trigger: a burst of high-severity recon or WAF-block events from a single IP, aggregated into one alert (per-IP aggregation defeats path-walk flooding).
Meaning: scanner or attacker probing one property. The WAF is already doing its job for blocked actions.
Respond:
- Usually nothing - the aggregation exists so you can glance and move on.
- If the same IP re-alerts after the 5-minute cooldown, block it at Cloudflare.
- If paths listed include anything that actually exists (real admin endpoints), check the access logs for hits that were NOT blocked.
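Per-IP aggregation with a cooldown, as described in the trigger above, can be sketched as follows. The class name, the event tuple shape, and keying the cooldown on the (host, ip) pair are assumptions made for illustration; only the 5-minute cooldown and the title format come from this page.

```python
from collections import Counter

COOLDOWN_SECONDS = 5 * 60  # the 5-minute cooldown mentioned above


class PerIpAggregator:
    def __init__(self):
        self._last_alert = {}  # (host, ip) -> timestamp of the last alert sent

    def aggregate(self, events, now):
        """events: iterable of (host, ip, path) tuples from one scan.

        Returns one alert title per (host, ip) pair that is not in cooldown,
        however many paths that IP walked.
        """
        counts = Counter((host, ip) for host, ip, _path in events)
        titles = []
        for (host, ip), n in sorted(counts.items()):
            last = self._last_alert.get((host, ip))
            if last is not None and now - last < COOLDOWN_SECONDS:
                continue  # still cooling down: stay quiet
            self._last_alert[(host, ip)] = now
            titles.append(f"{host}: {n} recon/blocked hits from {ip}")
        return titles
```

A path-walking scanner therefore produces one notification per cooldown window instead of one per path, which is why a repeat alert from the same IP is a meaningful signal.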
<host>: RECON <path> / <host>: BLOCKED - P4 high
Legacy single-event variants of the above (emitted via the MCP tool path). Same response as the aggregated form.
Alert volume capped - P3 digest
Trigger: more than 20 alert-worthy sources in one scan, or the 30-alerts-per-minute global ceiling was hit.
Meaning: alert flood suppression engaged. You are NOT seeing the full picture on the phone.
Respond: open the ops dashboard for the complete event list. A capped scan during quiet hours often means a distributed scan or an attack wave - check the cross-domain correlation panel.
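The two caps in the trigger above (20 sources per scan, 30 alerts per minute) can be modelled with a per-scan slice and a sliding one-minute window. This is a sketch under assumptions: the class name is hypothetical, and treating the ceiling as "capped when a scan would exceed 30" is one interpretation, not confirmed behaviour.

```python
from collections import deque

MAX_SOURCES_PER_SCAN = 20
MAX_ALERTS_PER_MINUTE = 30


class VolumeCap:
    def __init__(self):
        self._sent = deque()  # timestamps of alerts sent within the last 60 seconds

    def admit(self, sources, now):
        """sources: list of alert-worthy sources from one scan.

        Returns (sources_to_alert, capped). When capped is True the caller
        would also send the P3 "Alert volume capped" notice.
        """
        # Drop timestamps that have aged out of the one-minute window.
        while self._sent and now - self._sent[0] >= 60:
            self._sent.popleft()
        capped = len(sources) > MAX_SOURCES_PER_SCAN
        allowed = list(sources[:MAX_SOURCES_PER_SCAN])
        room = MAX_ALERTS_PER_MINUTE - len(self._sent)
        if len(allowed) > room:
            allowed = allowed[:max(room, 0)]
            capped = True
        self._sent.extend([now] * len(allowed))
        return allowed, capped
```

Either cap hides sources from the phone, which is why the response is always to open the dashboard for the complete list.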
Ghost Mode: all clear - P2 ops heartbeat
Trigger: periodic; confirms the alerting pipeline itself is alive.
Meaning: silence is trustworthy only while these arrive. A missing heartbeat is itself the alert.
Respond (only when it STOPS arriving):
- Check the ghostmode container: ssh matt@a1.matthewstevens.org "docker ps | grep ghostmode" and its logs.
- Check ntfy server health: curl https://alerts.sanmarcsoft.com/v1/health.
- If ntfy is healthy but quiet, the detector stack is down; restart the ghostmode container.
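Because a missing heartbeat is the alarm, a separate watcher has to notice the absence. The sketch below checks a list of messages for the most recent heartbeat. It assumes messages shaped like ntfy's JSON subscription output (with `event`, `title` and a Unix `time` field); the function names are hypothetical, and the heartbeat interval is a parameter because this page does not state it.

```python
HEARTBEAT_TITLE = "Ghost Mode: all clear"


def last_heartbeat(messages):
    """Return the Unix time of the newest heartbeat message, or None."""
    times = [
        m["time"]
        for m in messages
        if m.get("event") == "message" and m.get("title") == HEARTBEAT_TITLE
    ]
    return max(times) if times else None


def heartbeat_overdue(messages, now, expected_interval, grace=2.0):
    """True when no heartbeat arrived within grace * expected_interval seconds."""
    last = last_heartbeat(messages)
    return last is None or (now - last) > grace * expected_interval
```

Such a watcher should run somewhere other than the ghostmode container, otherwise it dies together with the thing it is watching.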
Asset monitor alerts (topics: phenom-*, mirrored to ghostmode-alerts)
Source: ghostmode/asset_monitor.py. Probes every 60s; pages after 2 consecutive failures (debounce); re-pages at most every 30 minutes while still down. Tapping the alert opens the affected service (or the AWS console for RDS/SES).
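The debounce and re-page behaviour described above can be sketched as a small per-asset state machine. This is an illustration, not the contents of asset_monitor.py: the class and method names are hypothetical, and only the thresholds (2 consecutive failures, 30-minute re-page) and the alert titles come from this page.

```python
FAILURES_TO_PAGE = 2       # consecutive failed probes before paging
REPAGE_SECONDS = 30 * 60   # minimum gap between repeat DOWN pages


class AssetState:
    def __init__(self, name):
        self.name = name
        self.failures = 0      # consecutive failed probes so far
        self.down = False      # True once a DOWN page has been sent
        self.last_page = None  # timestamp of the most recent DOWN page

    def observe(self, healthy, now):
        """Feed one probe result; return an alert title or None."""
        if healthy:
            was_down = self.down
            self.failures, self.down, self.last_page = 0, False, None
            return f"{self.name}: RECOVERED" if was_down else None
        self.failures += 1
        if self.failures < FAILURES_TO_PAGE:
            return None  # single blip: debounce, do not page yet
        if not self.down or now - self.last_page >= REPAGE_SECONDS:
            self.down = True
            self.last_page = now
            return f"{self.name}: DOWN"
        return None
```

With 60-second probes, the first DOWN page arrives on the second failed probe, and a RECOVERED is only emitted if a DOWN was actually paged.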
<Asset>: DOWN - P5 urgent
Trigger: the asset failed its health definition for 2+ consecutive probes (~120s).
Monitored assets and their topics:
| Asset | Topic | Healthy when |
|---|---|---|
| Website (www.thephenom.app) | phenom-www | HTTP 200 |
| NEST | phenom-nest | HTTP 200 |
| Dev NEST | phenom-dev-nest | HTTP 200 |
| Drop | phenom-drop | HTTP 200 |
| Chat (Synapse) | phenom-chat | HTTP 200 |
| API staging (/healthz) | phenom-api-staging | HTTP 200 |
| API public | phenom-api-public | HTTP 401 (auth wall = alive) |
| Analytics | phenom-analytics | HTTP 200 |
| Webmail | phenom-webmail | HTTP 200 |
| Cloudflare edge (cdn-cgi/trace) | phenom-cf-edge | HTTP 200 |
| ADSB cache (archive API) | phenom-adsb | HTTP 401 |
| Ops dashboard | phenom-ops | reachable |
| DB dev (RDS Postgres) | phenom-db-dev | RDS status in healthy set* |
| DB prod (RDS Postgres) | phenom-db-prod | RDS status in healthy set* |
| Mail SES (us-east-1) | phenom-ses | account status SENDING |
* RDS healthy set: AVAILABLE, BACKING-UP, MAINTENANCE, MODIFYING, CONFIGURING-ENHANCED-MONITORING, STORAGE-OPTIMIZATION, UPGRADING. Anything else (stopped, failed, storage-full, …) pages as DOWN.
Respond:
- Tap the alert - it opens the affected service or console directly.
- HTTP assets: check the container on the NAS (docker ps, docker logs <name>) and the Cloudflare tunnel for that hostname.
- RDS: the console link shows the instance state; storage-full and failed states need immediate action, MODIFYING/BACKING-UP never page.
- SES: check the account-level sending status and any AWS notifications about the account.
- Expect a re-page every 30 minutes until resolved; <Asset>: RECOVERED (P3) closes the incident.
<Asset>: RECOVERED - P3 digest
The matching DOWN condition cleared. No action; confirm the recovery was yours and not coincidence if you were mid-fix.
Prometheus rules (ops-monitoring)
Source: ops-monitoring/alerting/rules.yml, evaluated by ops-prometheus, visible in Grafana (ops-grafana).
These rules now push (2026-06-08)
Alertmanager is wired to ops-kpi-ntfy-adapter, which renders each rule in the house style and posts to ntfy. The earlier “console-only” caveat is resolved. Routing is by org label: rules tagged org: phenom (and Phenom asset probes) mirror to ghostmode-alerts; rules tagged org: sanmarcsoft go to universal-exports + kpi-alerts only and stay off the Phenom topic. The split below records which rules you (Phenom) actually receive.
What reaches the Phenom topic
| Reaches ghostmode-alerts (you see it) | Operator-only (universal-exports, you do not) |
|---|---|
| EndpointDown, SlowResponse, HttpStatusError, SslCertExpiringSoon, SslCertExpiryCritical, SslCertExpired for Phenom assets (classified by the probe’s instance host) | The same probe rules for SanMarcSoft assets (verifieddit, trusteddit, sanmarcsoft.com, ddit.wtf) |
| CommsContentStale when client=phenom | CommsContentStale for other clients; CommsGenerationFailed (the SanMarcSoft pipeline) |
| (none of the company KPIs) | MrrDropped, HighChurnRate, CriticalChurnRate, NoSignups48h, SightengineQuotaWarning, SightengineQuotaCritical, AuthBruteForce, LowCIPassRate, BlackboxExporterDown, SanMarcSoftExporterDown, BridgeAgentDown, BridgeAgentHeartbeatFailing, BridgeHealthEndpointUnreachable |
The company KPIs above are SanMarcSoft-internal (revenue, churn, signups, cost, CI, monitoring self-health, voice-bridge infrastructure). They are pinned org: sanmarcsoft in rules.yml, so they never reach the Phenom topic. The endpoint and SSL rules are shared across both tenants and self-classify per probe target, so a Phenom asset outage still reaches you while a SanMarcSoft asset outage does not.
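The split above combines two inputs: an explicit org label where the rule has one, and the probe's instance host for the shared endpoint and SSL rules. The sketch below shows one possible resolution order. It is not the adapter's code: the function names are hypothetical, the Phenom host suffix is inferred from the asset table on this page, and defaulting unknown hosts to sanmarcsoft (operator-only) is an assumption chosen so that a misclassified alert stays off the client topic.

```python
from urllib.parse import urlparse

PHENOM_SUFFIXES = ("thephenom.app",)  # assumption: Phenom assets live under this domain


def org_for_probe(instance):
    """Classify a probe target (URL or host[:port]) by its hostname."""
    if "://" in instance:
        host = urlparse(instance).hostname or ""
    else:
        host = instance.split("/")[0].split(":")[0]
    host = host.lower()
    if any(host == s or host.endswith("." + s) for s in PHENOM_SUFFIXES):
        return "phenom"
    return "sanmarcsoft"  # unknown hosts stay operator-only in this sketch


def resolve_org(labels):
    """Prefer the explicit org label; otherwise classify by instance host."""
    return labels.get("org") or org_for_probe(labels.get("instance", ""))


def reaches_phenom_topic(labels):
    return resolve_org(labels) == "phenom"
```

Pinning an org label on every company rule keeps resolution on the first branch, where no hostname guessing is involved.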
Endpoint group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| EndpointDown | critical | A probed site/API unreachable for 5m | Same playbook as Asset DOWN: container, tunnel, DNS. Check which instance label fired |
| SlowResponse | warning | Response time > 5s for 5m | Check NAS load, container CPU, upstream API latency. Often a symptom preceding EndpointDown |
| HttpStatusError | critical | Non-2xx for 5m (tsa 405 and cf-access 302 are exempt by design) | Open the URL; an error page usually names the layer (Cloudflare 52x = origin down, 4xx = app misconfig) |
SSL group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| SslCertExpiringSoon | warning | Cert expires < 14 days | Renew now: Cloudflare-managed certs should auto-renew - investigate why this one is not |
| SslCertExpiryCritical | critical | Cert expires < 7 days | Renewal escalation: manual issue/deploy today |
| SslCertExpired | critical | Cert already expired | Service outage in progress for strict clients. Issue and deploy immediately, then post-mortem why two earlier alerts were missed |
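The three SSL thresholds above form an escalation ladder. The sketch below maps a certificate's expiry time to the most severe matching rule name. It is an illustration with a hypothetical function name; in Prometheus itself each rule is evaluated independently, so more than one may fire at once unless inhibited.

```python
from datetime import datetime, timezone


def ssl_alert_for(not_after, now):
    """Return the most severe SSL rule name for a cert, or None if healthy.

    not_after and now are timezone-aware datetimes.
    """
    days_left = (not_after - now).total_seconds() / 86400
    if days_left <= 0:
        return "SslCertExpired"
    if days_left < 7:
        return "SslCertExpiryCritical"
    if days_left < 14:
        return "SslCertExpiringSoon"
    return None
```

For example, with `now = datetime(2026, 6, 8, tzinfo=timezone.utc)`, a certificate expiring on 2026-06-20 has 12 days left and maps to SslCertExpiringSoon.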
Monitoring self-health
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| BlackboxExporterDown | critical | Probe exporter unreachable 5m | docker restart the blackbox container; all endpoint alerts are blind until it returns |
| SanMarcSoftExporterDown | critical | Custom exporter unreachable 5m | Restart sanmarcsoft-exporter; business + bridge metrics are blind |
Business group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| MrrDropped | critical | MRR < 80% of 30 days ago for 1h | Check Stripe for cancellations/refunds; verify it is real churn and not a webhook/metric failure first |
| HighChurnRate | warning | Monthly churn > 10% | Review recent cancellations for a common cause (price change, breakage, competitor) |
| CriticalChurnRate | critical | Monthly churn > 20% | As above, but same-day: something is actively driving users out |
| NoSignups48h | warning | Zero signups two consecutive days | First suspect the funnel, not the market: test signup end-to-end (Clerk auth, emails). Then check traffic |
Cost group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| SightengineQuotaWarning | warning | API usage > 80% of monthly quota | Decide: throttle AI detection or budget an upgrade before the month ends |
| SightengineQuotaCritical | critical | Usage > 95% | AI detection degrades imminently; throttle now, upgrade if the feature matters this month |
Security group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| AuthBruteForce | critical | > 200 failed auth attempts in 24h | Review Clerk logs: one IP = block it; distributed = enable bot protection / rate limits. Check for any success amid failures |
Deployment group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| LowCIPassRate | warning | Repo CI pass rate < 80% over 7 days | Open the repo’s Actions history: flaky test (quarantine + ticket) vs genuine breakage (fix forward) |
Comms content group
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| CommsContentStale | warning | No new AI news content for a client in 24h | Check the generation workflow run history for that client |
| CommsGenerationFailed | critical | Pipeline’s last run unsuccessful | Open the GitHub Actions log for sanmarcsoft-comms; apply the four-point CI verification (conclusion, steps, annotations, live probe) |
Bridge agent group (claude-peers voice infrastructure)
| Alert | Severity | Trigger | Respond |
|---|---|---|---|
| BridgeAgentDown | warning | Agent unregistered from bridge > 5m | docker logs codetalker-bridge on the NAS; check the agent’s adapter at 10.0.0.112 |
| BridgeAgentHeartbeatFailing | warning | > 3 heartbeat failures for 10m | Bridge cannot reach the remote agent endpoint; check the agent container and LAN path |
| BridgeHealthEndpointUnreachable | critical | Exporter cannot reach bridge /health for 5m | Bridge container down or crash-looping. Known boot dependency: the bridge exits if the broker tunnel (dev container → ai :7899) is down - restart bridge-supervisor in the dev container first, then the bridge container |
Quick reference commands
# ntfy server health
curl https://alerts.sanmarcsoft.com/v1/health
# Voice/bridge health (includes TTS queue + agent registry)
curl http://10.0.0.96:7900/health
# What is running on the NAS
ssh matt@a1.matthewstevens.org "docker ps"
# Ghost Mode ops dashboard (every security alert clicks through to here)
open https://nest-ops.thephenom.app/ops/
# Grafana (Prometheus rule state is visible here; the rules also push to ntfy since 2026-06-08)
open https://ops.sanmarcsoft.com
Known gaps (as of 2026-06-08)
- Prometheus rules do not push: resolved 2026-06-08. Alertmanager → ops-kpi-ntfy-adapter → ntfy now delivers all 21 rules, routed by org label (see “These rules now push” above).
- Heartbeat dependence - if Ghost Mode: all clear stops, nothing else will tell you the security pipeline is dead. Check it when you have not heard it in a while.
- Manual test alerts use tk_ tokens minted ad hoc (ntfy token add --expires=1h); revoke after use.
- Company-KPI classification is label-driven. A SanMarcSoft KPI rule that ships without an org label falls back to a heuristic and could, in principle, mis-route. All current company rules are pinned org: sanmarcsoft; any new one must be too.
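The last gap lends itself to a pre-merge check. The sketch below walks an already-parsed rules file (the standard Prometheus `groups` / `rules` / `labels` layout) and reports alerts that have no org label. The function name is hypothetical, and the exemption list is an assumption based on this page: the shared endpoint and SSL rules classify per probe target, and CommsContentStale is split by client, so they are treated here as legitimately unpinned.

```python
# Assumption: these rules are classified per target or per client, not pinned by org.
SELF_CLASSIFYING = frozenset({
    "EndpointDown", "SlowResponse", "HttpStatusError",
    "SslCertExpiringSoon", "SslCertExpiryCritical", "SslCertExpired",
    "CommsContentStale",
})


def rules_missing_org(rule_file, exempt=SELF_CLASSIFYING):
    """Return names of alerting rules that lack an org label.

    rule_file: dict parsed from a Prometheus rules file, e.g.
    {"groups": [{"name": ..., "rules": [{"alert": ..., "labels": {...}}]}]}
    """
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None or name in exempt:
                continue  # recording rule, or a rule classified elsewhere
            if not (rule.get("labels") or {}).get("org"):
                missing.append(name)
    return missing
```

Loading rules.yml into a dict needs a YAML parser, which is left out of the sketch; a non-empty result would be a reason to block the change.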