Alert Lookup Index

Every alert that can reach your phone or the dashboards - what it means, how urgent it is, and exactly what to do when it arrives.

This is the searchable runbook for the entire alerting estate. When an alert lands, search this page for its title text. Each entry explains the trigger condition, the severity, and the response procedure. Last full audit: 2026-06-08 (KPI alert delivery + multi-tenant topic routing).

How alerts reach you

Ghost Mode alerter ─┐
Asset monitor      ─┤
KPI adapter        ─┴──► alerts.sanmarcsoft.com (self-hosted ntfy)
                          │
                          ├─► universal-exports  (operator topic: EVERY asset + EVERY KPI)
                          ├─► ghostmode-alerts   (Phenom client topic: Phenom assets + Phenom KPIs)
                          ├─► kpi-alerts         (scoped KPI feed: all KPI rules)
                          └─► phenom-*           (per-asset team topics, mirrored to ghostmode-alerts)
                          │
                   upstream-base-url → ntfy.sh APNS → iOS (instant, audible)

Topic model (M directive, 2026-06-08; multi-tenant split):

universal-exports is the operator topic: every alert from every source and every KPI rule lands here. One subscription covers the whole estate.
ghostmode-alerts is the Phenom client topic: Phenom asset alerts and Phenom-scoped KPI alerts only. SanMarcSoft company KPIs (MRR, churn, signups, Sightengine cost, CI pass rate, bridge agents, exporters) are operator-only and never appear here. Routing is by the alert’s org label (phenom mirrors here; sanmarcsoft does not).
kpi-alerts is the scoped KPI feed: all Prometheus KPI rules, regardless of owner.
Per-asset phenom-* topics (phenom-www, phenom-nest, phenom-db-prod, …) fire for the team; all are mirrored to ghostmode-alerts.
iOS delivery requires upstream-base-url: "https://ntfy.sh" in the ntfy server config (fixed 2026-06-04). Only a topic hash and message id transit ntfy.sh; content stays on our server.
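The topic model above can be read as a small routing function. The following is a minimal sketch, not the adapter's actual code: `topics_for` and its parameters are hypothetical names chosen for illustration, and only the topic names and the org-label rule come from this page.

```python
from typing import List, Optional


def topics_for(org: str, is_kpi: bool, asset_topic: Optional[str] = None) -> List[str]:
    """Return the ntfy topics an alert would be published to (illustrative only)."""
    topics = ["universal-exports"]         # operator topic: every alert lands here
    if is_kpi:
        topics.append("kpi-alerts")        # scoped KPI feed: all KPI rules
    if asset_topic:
        topics.append(asset_topic)         # per-asset team topic, e.g. phenom-www
    if org == "phenom":
        topics.append("ghostmode-alerts")  # Phenom client topic; sanmarcsoft never mirrors here
    return topics
```

Under these assumptions, a SanMarcSoft KPI such as MrrDropped yields only the operator and KPI topics, while a Phenom asset alert also reaches its team topic and ghostmode-alerts.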

Priority tiers

Tier	ntfy priority	Tag	Meaning
P5 urgent	5 (max, bypasses quiet hours)	🚨 rotating_light	Active targeted attack or production asset down
P4 high	4 (high)	⚠️ warning	Hostile activity aggregated per source, worth a look
P3 digest	3 (default)	🔍 mag	Rollups, recoveries, volume caps - informational
P2 ops	2 (low, badge-only)	💓 heartbeat	Pipeline health, “all clear” heartbeat
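To make the tier table concrete, here is a hedged sketch of how a tier could be turned into ntfy publish headers. `Title`, `Priority`, `Tags`, and `Click` are standard ntfy headers; the `TIERS` mapping and the `ntfy_headers` helper are hypothetical and are not taken from the alerter source.

```python
from typing import Dict, Optional

# Tier -> ntfy priority and emoji tag, copied from the table above.
TIERS = {
    "P5": {"priority": 5, "tag": "rotating_light"},
    "P4": {"priority": 4, "tag": "warning"},
    "P3": {"priority": 3, "tag": "mag"},
    "P2": {"priority": 2, "tag": "heartbeat"},
}


def ntfy_headers(tier: str, title: str, click: Optional[str] = None) -> Dict[str, str]:
    """Build the HTTP headers for one ntfy publish request (nothing is sent here)."""
    spec = TIERS[tier]
    headers = {
        "Title": title,
        "Priority": str(spec["priority"]),
        "Tags": spec["tag"],
    }
    if click:
        headers["Click"] = click  # URL opened when the notification is tapped
    return headers
```

For example, `ntfy_headers("P5", "NEST: DOWN")` produces a priority of `"5"` with the `rotating_light` tag.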

Master lookup table

Search this table for the title text on the notification.

Alert title (pattern)	Source	Tier	One-line meaning
`<host>: CANARY CREDENTIAL from <ip>`	Ghost Mode	P5	Someone typed credentials into a honeypot
`MULTI-DOMAIN recon: <ip> (N sites)`	Ghost Mode	P5	One actor probing 2+ of our domains - targeted recon
`<Asset>: DOWN`	Asset monitor	P5	Production asset failing 2+ consecutive probes
`<Asset>: RECOVERED`	Asset monitor	P3	Asset back up - closes the matching DOWN
`<host>: N recon/blocked hits from <ip>`	Ghost Mode	P4	Burst of high-severity recon/WAF blocks from one IP
`<host>: RECON <path>`	Ghost Mode	P4	Single recon probe (legacy single-event path)
`<host>: BLOCKED`	Ghost Mode	P4	Single WAF block (legacy single-event path)
`Alert volume capped`	Ghost Mode	P3	Flood suppression engaged - more sources were flagged than shown
`Ghost Mode: all clear`	Ghost Mode	P2	Heartbeat. Its ABSENCE is the alarm
`Q: ... test` / `[test]`	Manual	any	An operator (usually Q) testing the pipeline

Prometheus rules (Grafana-visible, and pushed to ntfy since 2026-06-08): EndpointDown, SlowResponse, HttpStatusError, SslCertExpiringSoon, SslCertExpiryCritical, SslCertExpired, BlackboxExporterDown, SanMarcSoftExporterDown, MrrDropped, HighChurnRate, CriticalChurnRate, NoSignups48h, SightengineQuotaWarning, SightengineQuotaCritical, AuthBruteForce, LowCIPassRate, CommsContentStale, CommsGenerationFailed, BridgeAgentDown, BridgeAgentHeartbeatFailing, BridgeHealthEndpointUnreachable.

Ghost Mode security alerts (topic: ghostmode-alerts)

Source: osint_surveillance_detector_repo/ghostmode/alerter.py. Tapping any alert opens the gated ops dashboard at https://nest-ops.thephenom.app/ops/ (the Click target is pinned there deliberately - never to an attacker-controlled host).

`<host>: CANARY CREDENTIAL from <ip>` - P5 urgent

Trigger: an actor submitted a username/password to a honeypot (OpenCanary) service.

Meaning: this is past scanning - someone is actively attempting access and believes the canary is real. Treat as an active, targeted attack.

Respond:

Open the ops dashboard (tap the alert) and identify the source IP, country, ASN, and which canary service was hit.
Block the IP (and ASN if it recurs) at Cloudflare WAF.
Check whether the submitted credentials resemble any real ones; if an actual credential pattern appears, rotate it everywhere immediately.
Review surrounding events from the same IP for lateral probing of real services.
Preserve the OpenCanary logs for the incident record.

`MULTI-DOMAIN recon: <ip> (N sites)` - P5 urgent

Trigger: one source IP generated high-severity events on two or more owned domains (thephenom.app, sanmarcsoft.com, verifieddit.com, trusteddit.com) in the correlation window.

Meaning: not random internet noise - this actor knows the portfolio and is mapping it deliberately.

Respond:

Open the ops dashboard; note IP, ASN, country, event count, and which domains.
Block the IP at Cloudflare across all zones (the alert means per-domain blocking is insufficient).
Check the canaries for interaction from the same source (recon often precedes credential attempts).
If the ASN is a hosting provider and probing recurs from neighbours, consider an ASN-level WAF rule.
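The trigger for this alert amounts to grouping high-severity events by source IP and counting distinct owned domains. The sketch below illustrates that idea under stated assumptions: the function names and the `(ip, host)` event shape are hypothetical, and the real correlation window and severity filter are not modelled.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

OWNED_DOMAINS = ("thephenom.app", "sanmarcsoft.com", "verifieddit.com", "trusteddit.com")


def owned_domain(host: str) -> Optional[str]:
    """Map a hostname to the owned domain it belongs to, or None if it is not ours."""
    host = host.lower()
    for domain in OWNED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def multi_domain_sources(events: Iterable[Tuple[str, str]], min_domains: int = 2) -> Dict[str, List[str]]:
    """events: (ip, host) pairs already filtered to high severity within one window."""
    seen = defaultdict(set)
    for ip, host in events:
        domain = owned_domain(host)
        if domain:
            seen[ip].add(domain)
    # Keep only sources that touched at least `min_domains` distinct owned domains.
    return {ip: sorted(domains) for ip, domains in seen.items() if len(domains) >= min_domains}
```

Two hits on different subdomains of thephenom.app count as one domain here, so they would not trigger the alert on their own.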

`<host>: N recon/blocked hits from <ip>` - P4 high

Trigger: a burst of high-severity recon or WAF-block events from a single IP, aggregated into one alert (per-IP aggregation defeats path-walk flooding).

Meaning: scanner or attacker probing one property. The WAF is already doing its job for blocked actions.

Respond:

Usually nothing - the aggregation exists so you can glance and move on.
If the same IP re-alerts after the 5-minute cooldown, block it at Cloudflare.
If paths listed include anything that actually exists (real admin endpoints), check the access logs for hits that were NOT blocked.
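Per-IP aggregation with a cooldown can be sketched as follows. This is an illustration only: the `IpAggregator` class is hypothetical, and keying the cooldown on the `(host, ip)` pair is an assumption, since this page only states that a 5-minute cooldown exists.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

COOLDOWN_SECONDS = 5 * 60


class IpAggregator:
    """Collapse many hits from one IP into a single alert, then stay quiet for a cooldown."""

    def __init__(self, cooldown: float = COOLDOWN_SECONDS) -> None:
        self.cooldown = cooldown
        self.last_alert: Dict[Tuple[str, str], float] = {}  # (host, ip) -> time of last alert

    def alerts_for_scan(self, events: Iterable[Tuple[str, str]], now: float) -> List[str]:
        """events: (host, ip) pairs for high-severity hits seen in one scan."""
        titles = []
        for (host, ip), hits in sorted(Counter(events).items()):
            last = self.last_alert.get((host, ip))
            if last is not None and now - last < self.cooldown:
                continue  # still cooling down: suppress the repeat
            self.last_alert[(host, ip)] = now
            titles.append(f"{host}: {hits} recon/blocked hits from {ip}")
        return titles
```

A scanner walking hundreds of paths therefore produces one notification per scan at most, and a second notification for the same source only after the cooldown has elapsed.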

`<host>: RECON <path>` / `<host>: BLOCKED` - P4 high

Legacy single-event variants of the above (emitted via the MCP tool path). Same response as the aggregated form.

`Alert volume capped` - P3 digest

Trigger: more than 20 alert-worthy sources in one scan, or the 30-alerts-per-minute global ceiling was hit.

Meaning: alert flood suppression engaged. You are NOT seeing the full picture on the phone.

Respond: open the ops dashboard for the complete event list. A capped scan during quiet hours often means a distributed scan or an attack wave - check the cross-domain correlation panel.
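The two limits named in the trigger can be pictured as a per-scan cap plus a sliding one-minute ceiling. The sketch below is hypothetical throughout (`cap_sources` and `MinuteCeiling` are illustrative names); only the numbers 20 and 30 come from this page, and the order in which sources are kept is an assumption.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

MAX_SOURCES_PER_SCAN = 20
MAX_ALERTS_PER_MINUTE = 30


def cap_sources(sources: Sequence[str], limit: int = MAX_SOURCES_PER_SCAN) -> Tuple[List[str], int]:
    """Return (sources that get an alert, number of sources suppressed)."""
    shown = list(sources[:limit])
    return shown, len(sources) - len(shown)


class MinuteCeiling:
    """Allow at most `limit` alerts in any rolling `window` seconds."""

    def __init__(self, limit: int = MAX_ALERTS_PER_MINUTE, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self.sent: Deque[float] = deque()  # timestamps of alerts already sent

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            return False
        self.sent.append(now)
        return True
```

Whenever `cap_sources` reports a non-zero suppressed count or `allow` returns False, the phone view is incomplete and the dashboard is the source of truth.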

`Ghost Mode: all clear` - P2 ops heartbeat

Trigger: periodic; confirms the alerting pipeline itself is alive.

Meaning: silence is trustworthy only while these arrive. A missing heartbeat is itself the alert.

Respond (only when it STOPS arriving):

Check the ghostmode container: ssh matt@a1.matthewstevens.org "docker ps | grep ghostmode" and its logs.
Check ntfy server health: curl https://alerts.sanmarcsoft.com/v1/health.
If ntfy is healthy but quiet, the detector stack is down; restart the ghostmode container.
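Because the absence of the heartbeat is the alarm, the check reduces to comparing the time since the last heartbeat against the expected interval. This page does not state the heartbeat period, so in the sketch below `expected_interval` is a caller-supplied assumption, and `heartbeat_overdue` is a hypothetical helper.

```python
def heartbeat_overdue(last_seen: float, now: float, expected_interval: float,
                      missed_allowed: int = 2) -> bool:
    """True when more than `missed_allowed` heartbeat periods have passed in silence.

    last_seen and now are Unix timestamps in seconds; expected_interval is the
    heartbeat period in seconds (an assumption, not documented on this page).
    """
    return now - last_seen > expected_interval * missed_allowed
```

Tolerating a couple of missed periods avoids false alarms from a single delayed notification.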

Asset monitor alerts (topics: phenom-*, mirrored to ghostmode-alerts)

Source: ghostmode/asset_monitor.py. Probes every 60s; pages after 2 consecutive failures (debounce); re-pages at most every 30 minutes while still down. Tapping the alert opens the affected service (or the AWS console for RDS/SES).
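The debounce and re-page behaviour described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in asset_monitor.py: the `AssetState` class and its `observe` method are hypothetical names.

```python
from typing import Optional


class AssetState:
    """Track one asset: page after N consecutive failures, re-page at a fixed interval."""

    def __init__(self, name: str, fail_threshold: int = 2, repage_seconds: float = 30 * 60) -> None:
        self.name = name
        self.fail_threshold = fail_threshold
        self.repage_seconds = repage_seconds
        self.failures = 0        # consecutive failed probes
        self.down = False        # True once a DOWN page has been sent
        self.last_page = 0.0     # time of the most recent DOWN page

    def observe(self, healthy: bool, now: float) -> Optional[str]:
        """Feed one probe result; return an alert title, or None if nothing should be sent."""
        if healthy:
            self.failures = 0
            if self.down:
                self.down = False
                return f"{self.name}: RECOVERED"
            return None
        self.failures += 1
        if self.failures < self.fail_threshold:
            return None  # debounce: a single failed probe never pages
        if not self.down or now - self.last_page >= self.repage_seconds:
            self.down = True
            self.last_page = now
            return f"{self.name}: DOWN"
        return None
```

With 60-second probes, the first DOWN arrives on the second failed probe, repeats 30 minutes later if nothing changes, and RECOVERED is emitted once on the first healthy probe.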

`<Asset>: DOWN` - P5 urgent

Trigger: the asset failed its health definition for 2+ consecutive probes (~120s).

Monitored assets and their topics:

Asset	Topic	Healthy when
Website (www.thephenom.app)	phenom-www	HTTP 200
NEST	phenom-nest	HTTP 200
Dev NEST	phenom-dev-nest	HTTP 200
Drop	phenom-drop	HTTP 200
Chat (Synapse)	phenom-chat	HTTP 200
API staging (/healthz)	phenom-api-staging	HTTP 200
API public	phenom-api-public	HTTP 401 (auth wall = alive)
Analytics	phenom-analytics	HTTP 200
Webmail	phenom-webmail	HTTP 200
Cloudflare edge (cdn-cgi/trace)	phenom-cf-edge	HTTP 200
ADSB cache (archive API)	phenom-adsb	HTTP 401
Ops dashboard	phenom-ops	reachable
DB dev (RDS Postgres)	phenom-db-dev	RDS status in healthy set*
DB prod (RDS Postgres)	phenom-db-prod	RDS status in healthy set*
Mail SES (us-east-1)	phenom-ses	account status SENDING

* RDS healthy set: AVAILABLE, BACKING-UP, MAINTENANCE, MODIFYING, CONFIGURING-ENHANCED-MONITORING, STORAGE-OPTIMIZATION, UPGRADING. Anything else (stopped, failed, storage-full, …) pages as DOWN.
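The "Healthy when" column can be expressed as two small predicates. This is a minimal sketch with hypothetical function names; the status strings are those listed in the footnote, compared case-insensitively because this page shows them in upper case while listing unhealthy states in lower case.

```python
# RDS instance states that do NOT page, from the footnote above.
RDS_HEALTHY = {
    "available",
    "backing-up",
    "maintenance",
    "modifying",
    "configuring-enhanced-monitoring",
    "storage-optimization",
    "upgrading",
}


def rds_is_healthy(status: str) -> bool:
    """True when the RDS instance status is in the healthy set."""
    return status.strip().lower() in RDS_HEALTHY


def http_is_healthy(status_code: int, expected: int = 200) -> bool:
    """True when the probe returned the status this asset is expected to return."""
    return status_code == expected
```

Note that for API public and the ADSB cache the expected code is 401, so `http_is_healthy(401, expected=401)` is the healthy case and a 200 there would page.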

Respond:

Tap the alert - it opens the affected service or console directly.
HTTP assets: check the container on the NAS (docker ps, docker logs <name>) and the Cloudflare tunnel for that hostname.
RDS: the console link shows the instance state. Storage-full and failed states need immediate action; MODIFYING and BACKING-UP are in the healthy set and never page.
SES: check the account-level sending status and any AWS notifications about the account.
Expect a re-page every 30 minutes until resolved; <Asset>: RECOVERED (P3) closes the incident.

`<Asset>: RECOVERED` - P3 digest

The matching DOWN condition cleared. No action needed. If you were mid-fix, confirm that your change caused the recovery and that it was not a coincidence.

Prometheus rules (ops-monitoring)

Source: ops-monitoring/alerting/rules.yml, evaluated by ops-prometheus, visible in Grafana (ops-grafana).

These rules now push (2026-06-08)

Alertmanager is wired to ops-kpi-ntfy-adapter, which renders each rule house-style and posts to ntfy. The earlier “console-only” caveat is resolved. Routing is by org label: rules tagged org: phenom (and Phenom asset probes) mirror to ghostmode-alerts; rules tagged org: sanmarcsoft go to universal-exports + kpi-alerts only and stay off the Phenom topic. The split below records which rules you (Phenom) actually receive.

What reaches the Phenom topic

Reaches `ghostmode-alerts` (you see it)	Operator-only (`universal-exports`; you do not see it)
`EndpointDown`, `SlowResponse`, `HttpStatusError`, `SslCertExpiringSoon`, `SslCertExpiryCritical`, `SslCertExpired` for Phenom assets (classified by the probe’s `instance` host)	The same probe rules for SanMarcSoft assets (verifieddit, trusteddit, sanmarcsoft.com, ddit.wtf)
`CommsContentStale` when `client=phenom`	`CommsContentStale` for other clients; `CommsGenerationFailed` (the SanMarcSoft pipeline)
(none of the company KPIs)	`MrrDropped`, `HighChurnRate`, `CriticalChurnRate`, `NoSignups48h`, `SightengineQuotaWarning`, `SightengineQuotaCritical`, `AuthBruteForce`, `LowCIPassRate`, `BlackboxExporterDown`, `SanMarcSoftExporterDown`, `BridgeAgentDown`, `BridgeAgentHeartbeatFailing`, `BridgeHealthEndpointUnreachable`

The company KPIs above are SanMarcSoft-internal (revenue, churn, signups, cost, CI, monitoring self-health, voice-bridge infrastructure). They are pinned org: sanmarcsoft in rules.yml, so they never reach the Phenom topic. The endpoint and SSL rules are shared across both tenants and self-classify per probe target, so a Phenom asset outage still reaches you while a SanMarcSoft asset outage does not.
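For the shared endpoint and SSL rules, classification by probe target could look like the sketch below. It is an assumption-laden illustration: `org_for_instance` is a hypothetical name, the domain lists are inferred from the hosts named on this page (with verifieddit and trusteddit taken as their .com domains), and an unmatched host is simply reported as "unknown" instead of guessing at the adapter's fallback heuristic.

```python
from urllib.parse import urlparse

PHENOM_DOMAINS = ("thephenom.app",)
SANMARCSOFT_DOMAINS = ("verifieddit.com", "trusteddit.com", "sanmarcsoft.com", "ddit.wtf")


def _host(instance: str) -> str:
    """Extract a lowercase hostname from a URL or a bare host[:port][/path] string."""
    if "://" in instance:
        return (urlparse(instance).hostname or "").lower()
    return instance.split("/")[0].split(":")[0].lower()


def _in_domains(host: str, domains) -> bool:
    return any(host == d or host.endswith("." + d) for d in domains)


def org_for_instance(instance: str) -> str:
    """Classify a probe's instance label as 'phenom', 'sanmarcsoft', or 'unknown'."""
    host = _host(instance)
    if _in_domains(host, PHENOM_DOMAINS):
        return "phenom"
    if _in_domains(host, SANMARCSOFT_DOMAINS):
        return "sanmarcsoft"
    return "unknown"
```

Matching on a full domain suffix, not a substring, keeps a look-alike host such as `thephenom.app.example.org` from being classified as Phenom.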

Endpoint group

Alert	Severity	Trigger	Respond
`EndpointDown`	critical	A probed site/API unreachable for 5m	Same playbook as Asset DOWN: container, tunnel, DNS. Check which `instance` label fired
`SlowResponse`	warning	Response time > 5s for 5m	Check NAS load, container CPU, upstream API latency. Often a symptom preceding EndpointDown
`HttpStatusError`	critical	Non-2xx for 5m (tsa 405 and cf-access 302 are exempt by design)	Open the URL; an error page usually names the layer (Cloudflare 52x = origin down, 4xx = app misconfig)

SSL group

Alert	Severity	Trigger	Respond
`SslCertExpiringSoon`	warning	Cert expires < 14 days	Renew now: Cloudflare-managed certs should auto-renew - investigate why this one is not
`SslCertExpiryCritical`	critical	Cert expires < 7 days	Renewal escalation: manual issue/deploy today
`SslCertExpired`	critical	Cert already expired	Service outage in progress for strict clients. Issue and deploy immediately, then post-mortem why two earlier alerts were missed
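The three SSL rules form a ladder on days until expiry. The sketch below returns only the most severe matching rule; that choice is an assumption made for readability, since in Prometheus the rules are evaluated independently and more than one may fire unless inhibited. `ssl_alert_for` is a hypothetical helper.

```python
from typing import Optional


def ssl_alert_for(days_left: float) -> Optional[str]:
    """Return the most severe SSL rule matching the remaining certificate lifetime."""
    if days_left <= 0:
        return "SslCertExpired"         # critical: already expired
    if days_left < 7:
        return "SslCertExpiryCritical"  # critical: under 7 days
    if days_left < 14:
        return "SslCertExpiringSoon"    # warning: under 14 days
    return None
```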

Monitoring self-health

Alert	Severity	Trigger	Respond
`BlackboxExporterDown`	critical	Probe exporter unreachable 5m	`docker restart` the blackbox container; all endpoint alerts are blind until it returns
`SanMarcSoftExporterDown`	critical	Custom exporter unreachable 5m	Restart `sanmarcsoft-exporter`; business + bridge metrics are blind

Business group

Alert	Severity	Trigger	Respond
`MrrDropped`	critical	MRR < 80% of 30 days ago for 1h	Check Stripe for cancellations/refunds; verify it is real churn and not a webhook/metric failure first
`HighChurnRate`	warning	Monthly churn > 10%	Review recent cancellations for a common cause (price change, breakage, competitor)
`CriticalChurnRate`	critical	Monthly churn > 20%	As above, but same-day: something is actively driving users out
`NoSignups48h`	warning	Zero signups two consecutive days	First suspect the funnel, not the market: test signup end-to-end (Clerk auth, emails). Then check traffic
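As a worked illustration of the MRR and churn thresholds, the comparisons can be written as below. The helper names are hypothetical, the "for 1h" hold period on MrrDropped is not modelled, and churn is passed as a percentage to match the table.

```python
from typing import Optional


def mrr_dropped(mrr_now: float, mrr_30d_ago: float) -> bool:
    """True when current MRR is below 80% of the value 30 days ago."""
    # Written as a cross-multiplication so whole-number inputs compare exactly.
    return mrr_now * 100 < 80 * mrr_30d_ago


def churn_alert(monthly_churn_pct: float) -> Optional[str]:
    """Return the churn rule matching a monthly churn percentage, if any."""
    if monthly_churn_pct > 20:
        return "CriticalChurnRate"
    if monthly_churn_pct > 10:
        return "HighChurnRate"
    return None
```

For example, an MRR of 7,900 against 10,000 thirty days earlier is below the 8,000 threshold and would match, while exactly 8,000 would not.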

Cost group

Alert	Severity	Trigger	Respond
`SightengineQuotaWarning`	warning	API usage > 80% of monthly quota	Decide: throttle AI detection or budget an upgrade before the month ends
`SightengineQuotaCritical`	critical	Usage > 95%	AI detection degrades imminently; throttle now, upgrade if the feature matters this month

Security group

Alert	Severity	Trigger	Respond
`AuthBruteForce`	critical	> 200 failed auth attempts in 24h	Review Clerk logs: one IP = block it; distributed = enable bot protection / rate limits. Check for any success amid failures

Deployment group

Alert	Severity	Trigger	Respond
`LowCIPassRate`	warning	Repo CI pass rate < 80% over 7 days	Open the repo’s Actions history: flaky test (quarantine + ticket) vs genuine breakage (fix forward)

Comms content group

Alert	Severity	Trigger	Respond
`CommsContentStale`	warning	No new AI news content for a client in 24h	Check the generation workflow run history for that client
`CommsGenerationFailed`	critical	Pipeline’s last run unsuccessful	Open the GitHub Actions log for `sanmarcsoft-comms`; apply the four-point CI verification (conclusion, steps, annotations, live probe)

Bridge agent group (claude-peers voice infrastructure)

Alert	Severity	Trigger	Respond
`BridgeAgentDown`	warning	Agent unregistered from bridge > 5m	`docker logs codetalker-bridge` on the NAS; check the agent’s adapter at 10.0.0.112
`BridgeAgentHeartbeatFailing`	warning	> 3 heartbeat failures for 10m	Bridge cannot reach the remote agent endpoint; check the agent container and LAN path
`BridgeHealthEndpointUnreachable`	critical	Exporter cannot reach bridge /health for 5m	Bridge container down or crash-looping. Known boot dependency: the bridge exits if the broker tunnel (dev container → ai :7899) is down - restart `bridge-supervisor` in the dev container first, then the bridge container

Quick reference commands

# ntfy server health
curl https://alerts.sanmarcsoft.com/v1/health

# Voice/bridge health (includes TTS queue + agent registry)
curl http://10.0.0.96:7900/health

# What is running on the NAS
ssh matt@a1.matthewstevens.org "docker ps"

# Ghost Mode ops dashboard (every security alert clicks through to here)
open https://nest-ops.thephenom.app/ops/

# Grafana (Prometheus rule state and history; since 2026-06-08 the rules also push to ntfy via Alertmanager)
open https://ops.sanmarcsoft.com

Known gaps (as of 2026-06-08)

~~Prometheus rules do not push~~ Resolved 2026-06-08. Alertmanager → ops-kpi-ntfy-adapter → ntfy now delivers all 21 rules, routed by org label (see “These rules now push” above).
Heartbeat dependence - if Ghost Mode: all clear stops, nothing else will tell you the security pipeline is dead. Check it when you have not heard it in a while.
Manual test alerts use tk_ tokens minted ad hoc (ntfy token add --expires=1h); revoke each token after use.
Company-KPI classification is label-driven. A SanMarcSoft KPI rule that ships without an org label falls back to a heuristic and could, in principle, mis-route. All current company rules are pinned org: sanmarcsoft; any new one must be too.

Last modified June 22, 2026: docs(#383): GlitchTip runbook + register in nest-ops asset index (#384) (fbe78ef)
