Observability & alerts
An opt-in metrics sampler feeds the admin panel’s charts, plus a one-shot crash-loop alert — no inbound surface.
Two small, optional pieces let you see the phone’s health over time and get told when something breaks: a metrics sampler that feeds the admin panel’s charts, and a one-shot crash-loop alert. Both are off by default and add no inbound surface.
Metrics sampler → admin charts
Set ENABLE_METRICS=true (in .env or via ./setup.sh) and re-run the installer.
scripts/ops/metrics-sampler.py is then
supervised like any other service. Once a minute it reads a few cheap, host-side
numbers and appends one compact JSON line to a capped ring file:
| Metric | Source |
|---|---|
| CPU busy % | /proc/stat deltas |
| memory / swap used % | /proc/meminfo |
| load (1m) | /proc/loadavg |
| disk used % | statvfs(DATA_DIR) |
| temperature | hottest /sys/class/thermal/thermal_zone* |
| battery % | termux-battery-status (best-effort; needs Termux:API) |
| DEGRADED count | number of *.degraded markers (the 24h health strip) |
The web admin panel reads it on the /metrics page: inline-SVG
sparklines per metric (current + 24h min/avg/max) and a 24h health strip (green
when every service was healthy, red when one was crash-looping, grey for gaps).
Tuning (all optional, in .env):
POCKET_METRICS_POLL_S=60 # sample interval (min 5)
POCKET_METRICS_RING=5760 # samples kept (5760 ≈ 4 days @ 60s)
POCKET_METRICS_BATTERY=true # poll the battery (self-disables if Termux:API absent)
The “Problems” view + doctor button
Independent of the sampler, the panel surfaces trouble loudly:
- a Problems banner on the dashboard and a
problems (N)nav tab appear the moment any service is crash-looping (DEGRADED) or a core service is down; - the
/problemspage lists each unhealthy service with its crash-loop marker, a restart button, and a link to its log; - a run doctor button runs
ops/doctor.sh(the read-only preflight) right from the dashboard / problems page; - the log viewer (
/logs/<name>) takes a case-insensitivefilterand a line count, so you can grep a service’s log without a shell.
Storage tier & resource notes
- The ring file lives on ext4 (
~/.pocket/metrics/metrics.jsonl), not the exFAT SD card — the sampler trims it with an atomicrename, which exFAT/FUSE refuses, and this keeps churn off the removable card. This mirrors how the honeypot DB is pinned to ext4. Override withPOCKET_METRICS_LOGif needed (point the panel at the same path). - Cost is negligible: a handful of file reads once a minute and a < 1 MB ring. It is recommended to leave on. It makes no network call and opens no port.
Crash-loop alerts (POCKET_ALERT_CMD)
Every service is kept alive by the supervisor; after POCKET_CRASHLOOP_FAILS rapid
failures it flags the service DEGRADED and fires one optional alert command
(see RESILIENCE.md). ./setup.sh wires it for you — pick a channel:
| Channel | What setup writes (POCKET_ALERT_CMD) |
|---|---|
| none | empty (no alert) |
| ntfy | curl … -d "service $POCKET_ALERT_SERVICE crash-looping (rc=$POCKET_ALERT_RC …)" <topic> |
| healthchecks | curl … "<your-ping-url>/fail" |
| Matrix | bash "…/scripts/ops/alert-matrix.sh" (a tiny helper, below) |
The command runs once via sh -c with $POCKET_ALERT_SERVICE, $POCKET_ALERT_RC,
$POCKET_ALERT_FAILS in the environment (never on argv). It must be short and
silent. You can also just set POCKET_ALERT_CMD in .env by hand to any command.
Matrix alerts (ops/alert-matrix.sh)
So the Matrix access token never lands in .env, the Matrix channel reads a
0600 file (the same pattern the honeypot uses). Create it after the stack is up:
# ${DATA_DIR}/secrets/alert-matrix.env (chmod 600)
ALERT_MATRIX_HS=http://127.0.0.1:8448 # loopback is fine
ALERT_MATRIX_TOKEN=<access token of a bot/admin account already in the room>
ALERT_MATRIX_ROOM=!internalRoomId:your.domain # internal id, NOT an alias
It posts a one-line m.text message and is best-effort (short timeout, never blocks
the supervisor). The token flows only in the Authorization header.
Resource & Risk
- Alerts are fire-and-forget: stdout/stderr go to
/dev/nulland a stuck command is backgrounded, so a broken alert can never wedge the supervisor — but it also means you won’t see why an alert failed. Test your command once by hand. - A ping URL or ntfy topic is itself a capability — it lives in
.env(0600), the same trust level asBACKUP_DAEMON_HC_URL. The Matrix token lives in a separate 0600 file, never.env.