Alert Lookup Index

Every alert that can reach your phone or the dashboards - what it means, how urgent it is, and exactly what to do when it arrives.

This is the searchable runbook for the entire alerting estate. When an alert lands, search this page for its title text. Each entry explains the trigger condition, the severity, and the response procedure. Last full audit: 2026-06-08 (KPI alert delivery + multi-tenant topic routing).

How alerts reach you

Ghost Mode alerter ─┐
Asset monitor      ─┤
KPI adapter        ─┴──► alerts.sanmarcsoft.com (self-hosted ntfy)
                          │
                          ├─► universal-exports  (operator topic: EVERY asset + EVERY KPI)
                          ├─► ghostmode-alerts   (Phenom client topic: Phenom assets + Phenom KPIs)
                          ├─► kpi-alerts         (scoped KPI feed: all KPI rules)
                          └─► phenom-*           (per-asset team topics, mirrored to ghostmode-alerts)
                          │
                   upstream-base-url → ntfy.sh APNS → iOS (instant, audible)

Topic model (M directive, 2026-06-08; multi-tenant split):

  • universal-exports is the operator topic: every alert from every source and every KPI rule lands here. One subscription covers the whole estate.
  • ghostmode-alerts is the Phenom client topic: Phenom asset alerts and Phenom-scoped KPI alerts only. SanMarcSoft company KPIs (MRR, churn, signups, Sightengine cost, CI pass rate, bridge agents, exporters) are operator-only and never appear here. Routing is by the alert’s org label (phenom mirrors here; sanmarcsoft does not).
  • kpi-alerts is the scoped KPI feed: all Prometheus KPI rules, regardless of owner.
  • Per-asset phenom-* topics (phenom-www, phenom-nest, phenom-db-prod, …) fire for the team; all are mirrored to ghostmode-alerts.
  • iOS delivery requires upstream-base-url: "https://ntfy.sh" in the ntfy server config (fixed 2026-06-04). Only a topic hash and message id transit ntfy.sh; content stays on our server.
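
The routing rules in the list above can be sketched as one small function. This is an illustration under stated assumptions, not the adapter's actual code: the Alert fields (org, is_kpi, asset_topic) and the name topics_for are hypothetical; only the topic names and the routing rules come from this page.

```python
from dataclasses import dataclass
from typing import Optional

OPERATOR_TOPIC = "universal-exports"
CLIENT_TOPIC = "ghostmode-alerts"
KPI_TOPIC = "kpi-alerts"


@dataclass
class Alert:
    title: str
    org: str                           # assumed label values: "phenom" or "sanmarcsoft"
    is_kpi: bool = False               # True for Prometheus KPI rules (assumed flag)
    asset_topic: Optional[str] = None  # e.g. "phenom-www" for asset-monitor alerts


def topics_for(alert: Alert) -> list[str]:
    """Return every ntfy topic this alert would be published to."""
    topics = [OPERATOR_TOPIC]              # the operator topic receives everything
    if alert.is_kpi:
        topics.append(KPI_TOPIC)           # scoped KPI feed, regardless of owner
    if alert.asset_topic:
        topics.append(alert.asset_topic)   # per-asset team topic
    if alert.org == "phenom":
        topics.append(CLIENT_TOPIC)        # only Phenom alerts mirror to the client topic
    return topics
```

With this sketch a company KPI such as MrrDropped (org sanmarcsoft) yields universal-exports and kpi-alerts only, while a Phenom asset outage also lands on its phenom-* topic and on ghostmode-alerts.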

Priority tiers

Tier | ntfy priority | Tag | Meaning
P5 urgent | 5 (max, bypasses quiet hours) | 🚨 rotating_light | Active targeted attack or production asset down
P4 high | 4 (high) | ⚠️ warning | Hostile activity aggregated per source, worth a look
P3 digest | 3 (default) | 🔍 mag | Rollups, recoveries, volume caps - informational
P2 ops | 2 (low, badge-only) | 💓 heartbeat | Pipeline health, “all clear” heartbeat
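
As a rough sketch, the tier table maps onto ntfy publish headers as follows. Title, Priority, and Tags are standard ntfy headers; the TIERS dict and the helper name ntfy_headers are assumptions made for illustration and may not match how alerter.py builds its requests.

```python
# Tier -> ntfy priority and emoji tag, copied from the table above.
TIERS = {
    "P5": {"priority": 5, "tag": "rotating_light"},
    "P4": {"priority": 4, "tag": "warning"},
    "P3": {"priority": 3, "tag": "mag"},
    "P2": {"priority": 2, "tag": "heartbeat"},
}


def ntfy_headers(tier: str, title: str) -> dict[str, str]:
    """Build the ntfy publish headers for an alert of the given tier."""
    spec = TIERS[tier]  # raises KeyError for an unknown tier
    return {
        "Title": title,
        "Priority": str(spec["priority"]),
        "Tags": spec["tag"],
    }
```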

Master lookup table

Search this table for the title text on the notification.

Alert title (pattern) | Source | Tier | One-line meaning
<host>: CANARY CREDENTIAL from <ip> | Ghost Mode | P5 | Someone typed credentials into a honeypot
MULTI-DOMAIN recon: <ip> (N sites) | Ghost Mode | P5 | One actor probing 2+ of our domains - targeted recon
<Asset>: DOWN | Asset monitor | P5 | Production asset failing 2+ consecutive probes
<Asset>: RECOVERED | Asset monitor | P3 | Asset back up - closes the matching DOWN
<host>: N recon/blocked hits from <ip> | Ghost Mode | P4 | Burst of high-severity recon/WAF blocks from one IP
<host>: RECON <path> | Ghost Mode | P4 | Single recon probe (legacy single-event path)
<host>: BLOCKED | Ghost Mode | P4 | Single WAF block (legacy single-event path)
Alert volume capped | Ghost Mode | P3 | Flood suppression engaged - more sources were flagged than shown
Ghost Mode: all clear | Ghost Mode | P2 | Heartbeat. Its ABSENCE is the alarm
Q: ... test / [test] | Manual | any | An operator (usually Q) testing the pipeline
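
If you want to look titles up programmatically rather than by eye, the table can be approximated with regular expressions. This is a sketch: the patterns are inferred from the title shapes shown above and the exact strings emitted by the sources may differ; classify_title and TITLE_PATTERNS are hypothetical names.

```python
import re

# (regex, source, tier), in the same order as the lookup table above.
TITLE_PATTERNS = [
    (r": CANARY CREDENTIAL from ", "Ghost Mode", "P5"),
    (r"^MULTI-DOMAIN recon: ", "Ghost Mode", "P5"),
    (r": DOWN$", "Asset monitor", "P5"),
    (r": RECOVERED$", "Asset monitor", "P3"),
    (r": \d+ recon/blocked hits from ", "Ghost Mode", "P4"),
    (r": RECON ", "Ghost Mode", "P4"),
    (r": BLOCKED$", "Ghost Mode", "P4"),
    (r"^Alert volume capped", "Ghost Mode", "P3"),
    (r"^Ghost Mode: all clear", "Ghost Mode", "P2"),
]


def classify_title(title: str):
    """Return (source, tier) for a notification title, or None if unknown."""
    for pattern, source, tier in TITLE_PATTERNS:
        if re.search(pattern, title):
            return source, tier
    return None
```

Manual test alerts are left out on purpose: their tier is "any", so they cannot be classified from the title alone.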

Prometheus rules (visible in Grafana and, since 2026-06-08, pushed to ntfy via Alertmanager; details in the Prometheus rules section): EndpointDown, SlowResponse, HttpStatusError, SslCertExpiringSoon, SslCertExpiryCritical, SslCertExpired, BlackboxExporterDown, SanMarcSoftExporterDown, MrrDropped, HighChurnRate, CriticalChurnRate, NoSignups48h, SightengineQuotaWarning, SightengineQuotaCritical, AuthBruteForce, LowCIPassRate, CommsContentStale, CommsGenerationFailed, BridgeAgentDown, BridgeAgentHeartbeatFailing, BridgeHealthEndpointUnreachable.


Ghost Mode security alerts (topic: ghostmode-alerts)

Source: osint_surveillance_detector_repo/ghostmode/alerter.py. Tapping any alert opens the gated ops dashboard at https://nest-ops.thephenom.app/ops/ (the Click target is pinned there deliberately - never to an attacker-controlled host).

<host>: CANARY CREDENTIAL from <ip> - P5 urgent

Trigger: an actor submitted a username/password to a honeypot (OpenCanary) service.

Meaning: this is past scanning - someone is actively attempting access and believes the canary is real. Treat as an active, targeted attack.

Respond:

  1. Open the ops dashboard (tap the alert) and identify the source IP, country, ASN, and which canary service was hit.
  2. Block the IP (and ASN if it recurs) at Cloudflare WAF.
  3. Check whether the submitted credentials resemble any real ones; if an actual credential pattern appears, rotate it everywhere immediately.
  4. Review surrounding events from the same IP for lateral probing of real services.
  5. Preserve the OpenCanary logs for the incident record.

MULTI-DOMAIN recon: <ip> (N sites) - P5 urgent

Trigger: one source IP generated high-severity events on two or more owned domains (thephenom.app, sanmarcsoft.com, verifieddit.com, trusteddit.com) in the correlation window.

Meaning: not random internet noise - this actor knows the portfolio and is mapping it deliberately.

Respond:

  1. Open the ops dashboard; note IP, ASN, country, event count, and which domains.
  2. Block the IP at Cloudflare across all zones (the alert means per-domain blocking is insufficient).
  3. Check the canaries for interaction from the same source (recon often precedes credential attempts).
  4. If the ASN is a hosting provider and probing recurs from neighbours, consider an ASN-level WAF rule.
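
The MULTI-DOMAIN trigger amounts to counting distinct owned domains per source IP. A minimal sketch, assuming the events are already filtered to high severity and to the correlation window (whose length this page does not state); the function names are hypothetical.

```python
from collections import defaultdict

OWNED_DOMAINS = ("thephenom.app", "sanmarcsoft.com", "verifieddit.com", "trusteddit.com")


def owned_domain(host: str):
    """Map a hostname to the owned domain it belongs to, or None."""
    host = host.lower().rstrip(".")
    for domain in OWNED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def multi_domain_sources(events, min_domains: int = 2):
    """events: iterable of (source_ip, host) pairs.

    Returns {ip: sorted list of owned domains} for every IP that touched
    at least min_domains distinct owned domains.
    """
    seen = defaultdict(set)
    for ip, host in events:
        domain = owned_domain(host)
        if domain:
            seen[ip].add(domain)
    return {ip: sorted(domains) for ip, domains in seen.items() if len(domains) >= min_domains}
```

Two subdomains of the same owned domain count once, so an IP that only walks www and nest under thephenom.app does not trigger.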

<host>: N recon/blocked hits from <ip> - P4 high

Trigger: a burst of high-severity recon or WAF-block events from a single IP, aggregated into one alert (per-IP aggregation defeats path-walk flooding).

Meaning: scanner or attacker probing one property. The WAF is already doing its job for blocked actions.

Respond:

  1. Usually nothing - the aggregation exists so you can glance and move on.
  2. If the same IP re-alerts after the 5-minute cooldown, block it at Cloudflare.
  3. If paths listed include anything that actually exists (real admin endpoints), check the access logs for hits that were NOT blocked.
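
Per-IP aggregation with the 5-minute cooldown can be sketched like this. It is an assumption-laden illustration: the event tuple shape, the state dict, and the function name are hypothetical, and the real alerter may apply a minimum burst size that this page does not specify.

```python
from collections import defaultdict

COOLDOWN_SECONDS = 5 * 60  # the 5-minute cooldown referenced in step 2


def aggregate_alerts(events, last_alerted, now):
    """Collapse events into at most one alert title per (host, ip).

    events: iterable of (host, ip, path) tuples from one scan.
    last_alerted: dict {(host, ip): timestamp}; updated in place.
    now: current time in seconds.
    """
    hits = defaultdict(list)
    for host, ip, path in events:
        hits[(host, ip)].append(path)

    titles = []
    for (host, ip), paths in hits.items():
        previous = last_alerted.get((host, ip))
        if previous is not None and now - previous < COOLDOWN_SECONDS:
            continue  # same source alerted recently; stay quiet
        last_alerted[(host, ip)] = now
        titles.append(f"{host}: {len(paths)} recon/blocked hits from {ip}")
    return titles
```

However many paths one IP walks, the phone sees one line per host and source, which is why path-walk flooding does not translate into notification flooding.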

<host>: RECON <path> / <host>: BLOCKED - P4 high

Legacy single-event variants of the above (emitted via the MCP tool path). Same response as the aggregated form.

Alert volume capped - P3 digest

Trigger: more than 20 alert-worthy sources in one scan, or the 30-alerts-per-minute global ceiling was hit.

Meaning: alert flood suppression engaged. You are NOT seeing the full picture on the phone.

Respond: open the ops dashboard for the complete event list. A capped scan during quiet hours often means a distributed scan or an attack wave - check the cross-domain correlation panel.
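
The two caps in the trigger can be expressed as a single budget calculation. A hedged sketch: the limits come from this page, but how the alerter orders candidates and tracks the per-minute count is assumed, and apply_caps is a hypothetical name.

```python
MAX_SOURCES_PER_SCAN = 20    # more than this in one scan triggers the cap
MAX_ALERTS_PER_MINUTE = 30   # global ceiling across all scans


def apply_caps(candidates, sent_last_minute):
    """Decide which alerts from one scan are sent.

    candidates: alert titles, most important first (assumed ordering).
    sent_last_minute: alerts already sent in the current minute.
    Returns (to_send, capped); when capped is True the caller would also
    emit the P3 "Alert volume capped" notice.
    """
    budget = max(0, MAX_ALERTS_PER_MINUTE - sent_last_minute)
    limit = min(MAX_SOURCES_PER_SCAN, budget)
    to_send = list(candidates[:limit])
    capped = len(candidates) > len(to_send)
    return to_send, capped
```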

Ghost Mode: all clear - P2 ops heartbeat

Trigger: periodic; confirms the alerting pipeline itself is alive.

Meaning: silence is trustworthy only while these arrive. A missing heartbeat is itself the alert.

Respond (only when it STOPS arriving):

  1. Check the ghostmode container: ssh matt@a1.matthewstevens.org "docker ps | grep ghostmode", then read its logs with docker logs <name>.
  2. Check ntfy server health: curl https://alerts.sanmarcsoft.com/v1/health.
  3. If ntfy is healthy but quiet, the detector stack is down; restart the ghostmode container.
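
Because the missing heartbeat is the alarm, it helps to check for it mechanically. The sketch below parses newline-delimited JSON as returned by ntfy's polling endpoint (for example /ghostmode-alerts/json?poll=1) and decides whether the heartbeat is overdue. The heartbeat interval is not stated on this page, so it is a parameter; the 5-minute grace period and both function names are assumptions.

```python
import json


def last_heartbeat_time(ndjson_text, title="Ghost Mode: all clear"):
    """Return the newest Unix time of a heartbeat message, or None."""
    latest = None
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        msg = json.loads(line)
        if msg.get("event") == "message" and msg.get("title") == title:
            ts = msg.get("time", 0)
            if latest is None or ts > latest:
                latest = ts
    return latest


def heartbeat_overdue(last_seen, now, expected_interval_s, grace_s=300):
    """True when no heartbeat was ever seen or the last one is too old."""
    if last_seen is None:
        return True
    return now - last_seen > expected_interval_s + grace_s
```

Run this from a machine that does not depend on the ghostmode container, otherwise the watchdog dies with the thing it watches.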

Asset monitor alerts (topics: phenom-*, mirrored to ghostmode-alerts)

Source: ghostmode/asset_monitor.py. Probes every 60s; pages after 2 consecutive failures (debounce); re-pages at most every 30 minutes while still down. Tapping the alert opens the affected service (or the AWS console for RDS/SES).
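
The debounce and re-page behaviour described above is a small state machine. A minimal sketch, assuming one state object per asset and a probe result fed in every 60s; AssetState and observe are hypothetical names, not the contents of asset_monitor.py.

```python
from dataclasses import dataclass
from typing import Optional

FAILURES_BEFORE_PAGE = 2   # debounce: page after 2 consecutive failures
REPAGE_SECONDS = 30 * 60   # re-page at most every 30 minutes while down


@dataclass
class AssetState:
    consecutive_failures: int = 0
    down: bool = False
    last_paged: Optional[float] = None


def observe(state: AssetState, healthy: bool, now: float):
    """Feed one probe result; return "DOWN", "RECOVERED", or None."""
    if healthy:
        was_down = state.down
        state.consecutive_failures = 0
        state.down = False
        state.last_paged = None
        return "RECOVERED" if was_down else None

    state.consecutive_failures += 1
    if state.consecutive_failures < FAILURES_BEFORE_PAGE:
        return None  # a single failed probe never pages
    if not state.down or now - state.last_paged >= REPAGE_SECONDS:
        state.down = True
        state.last_paged = now
        return "DOWN"
    return None      # still down, but inside the re-page interval
```

A single blip therefore produces nothing, and a RECOVERED is only ever emitted for an asset that actually paged DOWN.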

<Asset>: DOWN - P5 urgent

Trigger: the asset failed its health definition for 2+ consecutive probes (~120s).

Monitored assets and their topics:

Asset | Topic | Healthy when
Website (www.thephenom.app) | phenom-www | HTTP 200
NEST | phenom-nest | HTTP 200
Dev NEST | phenom-dev-nest | HTTP 200
Drop | phenom-drop | HTTP 200
Chat (Synapse) | phenom-chat | HTTP 200
API staging (/healthz) | phenom-api-staging | HTTP 200
API public | phenom-api-public | HTTP 401 (auth wall = alive)
Analytics | phenom-analytics | HTTP 200
Webmail | phenom-webmail | HTTP 200
Cloudflare edge (cdn-cgi/trace) | phenom-cf-edge | HTTP 200
ADSB cache (archive API) | phenom-adsb | HTTP 401
Ops dashboard | phenom-ops | reachable
DB dev (RDS Postgres) | phenom-db-dev | RDS status in healthy set*
DB prod (RDS Postgres) | phenom-db-prod | RDS status in healthy set*
Mail SES (us-east-1) | phenom-ses | account status SENDING

* RDS healthy set: AVAILABLE, BACKING-UP, MAINTENANCE, MODIFYING, CONFIGURING-ENHANCED-MONITORING, STORAGE-OPTIMIZATION, UPGRADING. Anything else (stopped, failed, storage-full, …) pages as DOWN.
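
The health definitions in the table can be sketched as two predicates. The expected status codes and the RDS healthy set are taken from this page; the dict and function names are hypothetical, and the comparison is case-insensitive on the assumption that the status may arrive in either case.

```python
# Expected HTTP status per topic, from the asset table above.
EXPECTED_HTTP_STATUS = {
    "phenom-www": 200,
    "phenom-nest": 200,
    "phenom-dev-nest": 200,
    "phenom-drop": 200,
    "phenom-chat": 200,
    "phenom-api-staging": 200,
    "phenom-api-public": 401,   # the auth wall answering means the API is alive
    "phenom-analytics": 200,
    "phenom-webmail": 200,
    "phenom-cf-edge": 200,
    "phenom-adsb": 401,
}

RDS_HEALTHY = frozenset({
    "available", "backing-up", "maintenance", "modifying",
    "configuring-enhanced-monitoring", "storage-optimization", "upgrading",
})


def http_healthy(topic: str, status_code: int) -> bool:
    """True when the probe returned exactly the status expected for this asset."""
    return status_code == EXPECTED_HTTP_STATUS[topic]


def rds_healthy(instance_status: str) -> bool:
    """True when the RDS instance status is in the healthy set."""
    return instance_status.strip().lower() in RDS_HEALTHY
```

Note the inversion for phenom-api-public and phenom-adsb: a 200 there would mean the auth wall is gone, so it counts as unhealthy.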

Respond:

  1. Tap the alert - it opens the affected service or console directly.
  2. HTTP assets: check the container on the NAS (docker ps, docker logs <name>) and the Cloudflare tunnel for that hostname.
  3. RDS: the console link shows the instance state. storage-full and failed states need immediate action; MODIFYING and BACKING-UP never page.
  4. SES: check the account-level sending status and any AWS notifications about the account.
  5. Expect a re-page every 30 minutes until resolved; <Asset>: RECOVERED (P3) closes the incident.

<Asset>: RECOVERED - P3 digest

The matching DOWN condition cleared. No action needed. If you were mid-fix, confirm the recovery came from your change and not from coincidence.


Prometheus rules (ops-monitoring)

Source: ops-monitoring/alerting/rules.yml, evaluated by ops-prometheus, visible in Grafana (ops-grafana).

What reaches the Phenom topic

Reaches ghostmode-alerts (you see it) | Operator-only (universal-exports, you do not)
EndpointDown, SlowResponse, HttpStatusError, SslCertExpiringSoon, SslCertExpiryCritical, SslCertExpired for Phenom assets (classified by the probe’s instance host) | The same probe rules for SanMarcSoft assets (verifieddit, trusteddit, sanmarcsoft.com, ddit.wtf)
CommsContentStale when client=phenom | CommsContentStale for other clients; CommsGenerationFailed (the SanMarcSoft pipeline)
(none of the company KPIs) | MrrDropped, HighChurnRate, CriticalChurnRate, NoSignups48h, SightengineQuotaWarning, SightengineQuotaCritical, AuthBruteForce, LowCIPassRate, BlackboxExporterDown, SanMarcSoftExporterDown, BridgeAgentDown, BridgeAgentHeartbeatFailing, BridgeHealthEndpointUnreachable

The company KPIs above are SanMarcSoft-internal (revenue, churn, signups, cost, CI, monitoring self-health, voice-bridge infrastructure). They are pinned to org: sanmarcsoft in rules.yml, so they never reach the Phenom topic. The endpoint and SSL rules are shared across both tenants and self-classify per probe target, so a Phenom asset outage still reaches you while a SanMarcSoft asset outage does not.
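
The classification just described, an explicit org label first and the probe's instance host second, might look like the sketch below. This is not the adapter's code: the label keys, the default for unrecognised hosts (operator-only), and the name classify_org are all assumptions.

```python
from urllib.parse import urlparse

PHENOM_DOMAINS = ("thephenom.app",)


def classify_org(labels: dict) -> str:
    """Return "phenom" or "sanmarcsoft" for an alert's label set."""
    org = labels.get("org")
    if org:
        return org  # a pinned label always wins

    # Shared probe rules carry no org label; classify by the probed host.
    instance = labels.get("instance", "")
    target = instance if "//" in instance else "//" + instance
    host = (urlparse(target).hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in PHENOM_DOMAINS):
        return "phenom"

    # Assumed safe default: unknown targets stay on the operator side.
    return "sanmarcsoft"
```

Defaulting to the operator side means a mislabelled rule fails closed: the worst case is that the operator sees an alert the client should also have seen, never the reverse.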

Endpoint group

Alert | Severity | Trigger | Respond
EndpointDown | critical | A probed site/API unreachable for 5m | Same playbook as Asset DOWN: container, tunnel, DNS. Check which instance label fired
SlowResponse | warning | Response time > 5s for 5m | Check NAS load, container CPU, upstream API latency. Often a symptom preceding EndpointDown
HttpStatusError | critical | Non-2xx for 5m (tsa 405 and cf-access 302 are exempt by design) | Open the URL; an error page usually names the layer (Cloudflare 52x = origin down, 4xx = app misconfig)

SSL group

Alert | Severity | Trigger | Respond
SslCertExpiringSoon | warning | Cert expires < 14 days | Renew now: Cloudflare-managed certs should auto-renew - investigate why this one is not
SslCertExpiryCritical | critical | Cert expires < 7 days | Renewal escalation: manual issue/deploy today
SslCertExpired | critical | Cert already expired | Service outage in progress for strict clients. Issue and deploy immediately, then post-mortem why two earlier alerts were missed
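
The three SSL thresholds form an escalation ladder. The sketch below returns only the most severe matching rule for a given number of days until expiry; in Prometheus the rules are evaluated independently and may overlap, so treat this as a reading aid, with ssl_alert being a hypothetical helper.

```python
def ssl_alert(days_until_expiry: float):
    """Return the most severe SSL rule name that applies, or None."""
    if days_until_expiry <= 0:
        return "SslCertExpired"         # critical: already expired
    if days_until_expiry < 7:
        return "SslCertExpiryCritical"  # critical: under 7 days
    if days_until_expiry < 14:
        return "SslCertExpiringSoon"    # warning: under 14 days
    return None
```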

Monitoring self-health

Alert | Severity | Trigger | Respond
BlackboxExporterDown | critical | Probe exporter unreachable 5m | docker restart the blackbox container; all endpoint alerts are blind until it returns
SanMarcSoftExporterDown | critical | Custom exporter unreachable 5m | Restart sanmarcsoft-exporter; business + bridge metrics are blind

Business group

Alert | Severity | Trigger | Respond
MrrDropped | critical | MRR < 80% of 30 days ago for 1h | Check Stripe for cancellations/refunds; verify it is real churn and not a webhook/metric failure first
HighChurnRate | warning | Monthly churn > 10% | Review recent cancellations for a common cause (price change, breakage, competitor)
CriticalChurnRate | critical | Monthly churn > 20% | As above, but same-day: something is actively driving users out
NoSignups48h | warning | Zero signups two consecutive days | First suspect the funnel, not the market: test signup end-to-end (Clerk auth, emails). Then check traffic

Cost group

Alert | Severity | Trigger | Respond
SightengineQuotaWarning | warning | API usage > 80% of monthly quota | Decide: throttle AI detection or budget an upgrade before the month ends
SightengineQuotaCritical | critical | Usage > 95% | AI detection degrades imminently; throttle now, upgrade if the feature matters this month

Security group

Alert | Severity | Trigger | Respond
AuthBruteForce | critical | > 200 failed auth attempts in 24h | Review Clerk logs: one IP = block it; distributed = enable bot protection / rate limits. Check for any success amid failures

Deployment group

Alert | Severity | Trigger | Respond
LowCIPassRate | warning | Repo CI pass rate < 80% over 7 days | Open the repo’s Actions history: flaky test (quarantine + ticket) vs genuine breakage (fix forward)

Comms content group

Alert | Severity | Trigger | Respond
CommsContentStale | warning | No new AI news content for a client in 24h | Check the generation workflow run history for that client
CommsGenerationFailed | critical | Pipeline’s last run unsuccessful | Open the GitHub Actions log for sanmarcsoft-comms; apply the four-point CI verification (conclusion, steps, annotations, live probe)

Bridge agent group (claude-peers voice infrastructure)

Alert | Severity | Trigger | Respond
BridgeAgentDown | warning | Agent unregistered from bridge > 5m | docker logs codetalker-bridge on the NAS; check the agent’s adapter at 10.0.0.112
BridgeAgentHeartbeatFailing | warning | > 3 heartbeat failures for 10m | Bridge cannot reach the remote agent endpoint; check the agent container and LAN path
BridgeHealthEndpointUnreachable | critical | Exporter cannot reach bridge /health for 5m | Bridge container down or crash-looping. Known boot dependency: the bridge exits if the broker tunnel (dev container → ai :7899) is down - restart bridge-supervisor in the dev container first, then the bridge container

Quick reference commands

# ntfy server health
curl https://alerts.sanmarcsoft.com/v1/health

# Voice/bridge health (includes TTS queue + agent registry)
curl http://10.0.0.96:7900/health

# What is running on the NAS
ssh matt@a1.matthewstevens.org "docker ps"

# Ghost Mode ops dashboard (every security alert clicks through to here)
open https://nest-ops.thephenom.app/ops/

# Grafana (Prometheus alert state is visible here; since 2026-06-08 the rules also push via Alertmanager)
open https://ops.sanmarcsoft.com

Known gaps (as of 2026-06-08)

  1. Prometheus rules do not push: resolved 2026-06-08. Alertmanager → ops-kpi-ntfy-adapter → ntfy now delivers all 21 rules, routed by org label (see “What reaches the Phenom topic” above).
  2. Heartbeat dependence - if Ghost Mode: all clear stops, nothing else will tell you the security pipeline is dead. Check it when you have not heard it in a while.
  3. Manual test alerts use tk_-prefixed access tokens minted ad hoc (ntfy token add --expires=1h); revoke each token after use.
  4. Company-KPI classification is label-driven. A SanMarcSoft KPI rule that ships without an org label falls back to a heuristic and could, in principle, mis-route. All current company rules are pinned org: sanmarcsoft; any new one must be too.
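
For gap 3, a manual test alert is a single authenticated POST to the ntfy server. The sketch below only builds the request so it can be inspected before sending; the title text, the low priority, and the name build_test_alert are assumptions, and the token value shown is a placeholder for one minted with ntfy token add.

```python
import urllib.request

NTFY_BASE = "https://alerts.sanmarcsoft.com"


def build_test_alert(topic: str, token: str, message: str = "pipeline test"):
    """Build (but do not send) a POST that publishes a test alert to ntfy."""
    return urllib.request.Request(
        f"{NTFY_BASE}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # short-lived tk_ token
            "Title": "[test] pipeline check",    # matches the [test] title pattern
            "Priority": "2",                     # low, badge-only: avoids paging anyone
        },
    )
```

To actually send it, pass the result to urllib.request.urlopen, for example urlopen(build_test_alert("universal-exports", "tk_...")), then revoke the token.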