WIP: Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun #5

Closed
bryan wants to merge 4 commits from fix/outage-followups into main
Owner

Summary

Fallout from a 65-hour Open-WebUI outage (Apr 21 19:05 → Apr 24 12:11) visible only via Cloudflare's "server not responding" — because the monitoring stack was structurally broken in four different ways that each hid the next.

What went wrong (root-cause chain)

  1. Silent alerting — Prometheus had ServiceDown rules firing for 65h, but no Alertmanager was wired. Alerts evaluated into the void.
  2. Pip-user install rot — Open-WebUI's lifespan hook installs tool/function deps at startup via pip --user, which stomped the shared user site-packages and dropped greenlet. The service crash-looped 40+ times against the DB session.
  3. Activation scripts silently no-op'd — nix-darwin only composes a fixed set of named phases (preActivation, extraActivation, postActivation, launchd, homebrew, etc.) into the final activate script. Every custom-named system.activationScripts.{foo}.text in this repo (ollama firewall/restart, monitoring-setup, syncthing-firewall, etc.) was defined, built, and never executed.
  4. Ollama stale binary — auto-restart hook from 4c9b35b was one of those silently-skipped scripts, so ollama stayed on 0.20.4 for 16 days while brew symlinked through 0.21.0 → 0.21.2.

Fixes (in this PR)

1. Alerting pipeline: Prometheus → Alertmanager → smtp2go → email

Initial wiring tried to use Grafana's embedded Alertmanager (fewer moving parts, reuses existing smtp2go config). That worked for Grafana's own alerting flow but hit a Grafana 12.4 bug: its /api/v2/alerts endpoint expects a non-spec wrapped payload ({alerts: [...]}) while Prometheus sends the canonical bare-array AM v2 format, returning 400 on every real alert.
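
To make the payload mismatch concrete, here is a minimal sketch of the two request bodies. The helper names and the label values are hypothetical and only for illustration; the field names (`labels`, `annotations`, `startsAt`) are the ones the Alertmanager v2 alert schema defines.

```python
import json
from datetime import datetime, timezone


def make_alert(name: str, instance: str) -> dict:
    """Build one alert object using Alertmanager v2 field names."""
    return {
        # Label values here are illustrative, not taken from the real rules.
        "labels": {"alertname": name, "instance": instance, "severity": "test"},
        "annotations": {"summary": f"{name} smoke test for {instance}"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def bare_array_payload(alerts: list[dict]) -> str:
    """Canonical AM v2 body: a top-level JSON array of alerts."""
    return json.dumps(alerts)


def wrapped_payload(alerts: list[dict]) -> str:
    """The wrapped shape described above: an object with an `alerts` key."""
    return json.dumps({"alerts": alerts})
```

An endpoint that only parses the wrapped form will reject the bare array, which matches the 400 responses described above.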

Pivoted to real prometheus-alertmanager from nixpkgs (AM isn't in Homebrew). Runs as a dedicated launchd agent on :9093 with SMTP config reusing ~/.secrets/grafana-smtp-password. Grafana now has Alertmanager added as a datasource so alert state is still viewable in its UI.

Smoke-tested end-to-end: test alert POSTed to AM → alertmanager_notifications_total{email}=1, 0 failures, email delivered.

2. Open-WebUI: uv tool install + import probe

  • Migrated from pip install --user to uv tool install --python 3.11 open-webui — isolated venv with a uv-managed lockfile.
  • Updater runs an import probe (python -c "import open_webui.main, greenlet, sqlalchemy, ...") against the tool venv before kickstart -k. A broken dep resolution now aborts the restart instead of crash-looping the service behind the tunnel. The existing process keeps running.
  • Runner handles first-install idempotently; updater yields to runner on fresh installs (avoids a uv tool install race between the two agents at load).
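
The probe-then-restart gate from the updater bullet can be sketched roughly as follows. This is not the updater's actual code: `import_probe` and `restart_if_healthy` are hypothetical names, and the interpreter path and the restart action (for example a wrapper around `launchctl kickstart -k`) are supplied by the caller.

```python
import subprocess
import sys


def import_probe(python: str, modules: list[str], timeout: float = 60.0) -> bool:
    """Return True only if `python` can import every module in `modules`."""
    code = "import " + ", ".join(modules)
    try:
        result = subprocess.run(
            [python, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing interpreter or a hung import counts as a failed probe.
        return False
    if result.returncode != 0:
        # Surface the traceback so the aborted restart is explainable.
        print(result.stderr.strip(), file=sys.stderr)
    return result.returncode == 0


def restart_if_healthy(python: str, modules: list[str], restart) -> bool:
    """Call `restart` only when the probe passes; otherwise do nothing."""
    if not import_probe(python, modules):
        return False  # the existing process keeps running untouched
    restart()
    return True
```

Running the probe in a child process against the tool venv's interpreter, rather than importing in the updater itself, means a broken dependency shows up as a non-zero exit code instead of crashing the updater.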

3. Activation-scripts footgun — fix + document

Folded every custom-named activation script into system.activationScripts.extraActivation.text = lib.mkAfter … across ollama, open-webui, syncthing, monitoring, smb-mount, icloud-backup.

Updated modules/services/AGENTS.md with the full fixed list of nix-darwin phases and a loud warning. Verified via grep on /run/current-system/activate — 23 real socketfilterfw / kickstart calls embedded in activate, up from zero before.
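
A rough equivalent of that grep check, written as a small function so it could be reused in a preflight, might look like this. The function name is hypothetical, and it assumes that counting non-comment lines mentioning either command is a good enough proxy for "the block was folded in".

```python
import re

# Assumption: these two command names are what the folded blocks invoke.
CALL_PATTERN = re.compile(r"\b(socketfilterfw|kickstart)\b")


def count_activation_calls(script_text: str) -> int:
    """Count non-comment lines that mention socketfilterfw or kickstart."""
    count = 0
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blanks and shell comments
        if CALL_PATTERN.search(stripped):
            count += 1
    return count
```

It would be fed the contents of `/run/current-system/activate`; a result of zero after a rebuild would suggest the scripts are being dropped again.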

Side-effect win: ollama was auto-restarted on this rebuild for the first time since the hook was added; API now reports 0.21.2 (was 0.20.4 for 16 days).

4. Observability cleanup

  • Alloy: added loki.process regex stage that extracts service_name from the log filename basename. {service_name="open-webui"} now works in LogQL instead of filtering on filename.
  • Dashboard: replaced 3 hardcoded per-target probe queries with a single probe_success{job="blackbox"} + {{instance}} legend — auto-scales to whatever's in blackbox.targets.
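
The Alloy label extraction can be illustrated with a plain regex. The pattern below is an assumption, not the one in monitoring.nix: it treats everything in the basename up to the first dot as the service name, and the example paths are hypothetical. The `(?P<name>...)` named-group syntax is also accepted by the RE2 engine that Alloy's regex stage uses.

```python
import re
from typing import Optional

# Assumed layout: <dir>/<service_name>.<suffix...>, e.g. open-webui.log
FILENAME_RE = re.compile(r"^(?:.*/)?(?P<service_name>[^/.]+)[^/]*$")


def service_name_from_filename(filename: str) -> Optional[str]:
    """Extract the service name from a log path, or None if it does not fit."""
    match = FILENAME_RE.match(filename)
    return match.group("service_name") if match else None
```

With a label derived this way, a LogQL selector such as `{service_name="open-webui"}` no longer depends on the directory the log file lives in.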

Files changed

flake.lock                                         |  30 +--
modules/hosts/studio.nix                           |   9 +-
modules/services/AGENTS.md                         |  95 +++++-
modules/services/grafana/dashboards/service-health.json | 48 +---
modules/services/icloud-backup.nix                 |  18 +-
modules/services/monitoring.nix                    | 227 +++++++++++++++---
modules/services/ollama.nix                        |  25 +--
modules/services/open-webui.nix                    | 188 ++++++++++----
modules/services/smb-mount.nix                     |  29 +--
modules/services/syncthing.nix                     |  17 +-
10 files changed, 485 insertions(+), 201 deletions(-)

Testing / verification

Activated on studio and smoke-tested:

  • nix flake check / darwin-rebuild build .#studio clean
  • All 23 folded activation blocks present in /run/current-system/activate (was 0)
  • uv tool list shows open-webui v0.9.2; /health returns {"status":true} behind Cloudflare
  • Ollama auto-restart fired; running pid matches rebuild time; /api/version now 0.21.2
  • Loki streams carry service_name labels for all 8 running services
  • Prometheus: 1 active AM discovered, 0 errors, 0 dropped
  • Alertmanager: smoke alert accepted, routed, alertmanager_notifications_total{email}=1 with 0 failures
  • Real test email delivered to inbox via smtp2go

Follow-ups

  • ~/.secrets/grafana-admin-password and ~/.secrets/grafana-prometheus-token were left over from the Grafana-AM detour and are unreferenced. They were harmless to leave, and both have since been deleted.
  • Grafana service account prometheus-alerter still in DB; can be deleted via UI or API.
  • Any other repo that defines custom-named activation scripts in its services/*.nix should get the same audit.

Commits

  • 81bd870 add monitoring alert emails
  • 0df1fb1 use alertmanager
bryan changed title from Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun to WIP: Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun 2026-04-24 21:08:10 +00:00
- Bind Alertmanager to 127.0.0.1 (was 0.0.0.0). Its web UI and
  /api/v2/silences are unauthenticated, so a 0.0.0.0 bind let anyone on
  the LAN suppress every alert. All consumers are local (Prometheus, the
  activation preflight, Grafana's datasource — all localhost:9093), so
  loopback-only closes the hole with no functional loss. Drop the now-dead
  app-firewall --add/--unblock for AM (loopback isn't filtered).
- Correct comments that still described the superseded "Grafana embedded
  Alertmanager" design and a nonexistent grafana-admin-password secret:
  studio.nix host note, prometheus.yml note, grafana secrets recipe,
  grafana agent note, and the AGENTS.md services table.

The studio's own node_exporter is the only node_exporter scrape target
(unraid is HTTP-probe + syslog only), so the node_* alert rules must use
darwin/Mach metric names, not Linux /proc names. Verified the available
metrics against a running macOS node_exporter (1.11.x) and validated the
rewritten rules with `promtool check rules` (5/5 OK).

- HighMemoryUsage: node_memory_MemAvailable_bytes/MemTotal_bytes don't
  exist on darwin, so the rule never fired. Replaced with HighMemoryPressure
  alerting on node_memory_swap_used_bytes > 1 GiB — the unambiguous "RAM
  overcommitted" signal on macOS (naive used% is meaningless; the kernel
  keeps free near zero by design).
- HighDiskUsage: darwin fstypes are apfs/autofs/nullfs, so the Linux
  tmpfs|devtmpfs|overlay exclusion matched nothing. Pinned to the real data
  volume (fstype=apfs, mountpoint=/System/Volumes/Data); the sealed "/"
  snapshot shares the APFS container and would double-fire.
- HighCpuUsage: already darwin-valid (node_cpu_seconds_total{mode=idle}
  exists) — added a note so it isn't "fixed" back.
- Documented the platform constraint inline to prevent regression.
Author
Owner

Squash-merged to main as 1dadac2 and pushed directly (signed-commit workflow — server-side merge would re-sign/invalidate signatures).

Landed the original branch plus two review follow-ups:

  • Alertmanager bound to 127.0.0.1 (unauthenticated silence API) + corrected stale 'Grafana embedded AM / grafana-admin-password' docs.
  • Alert rules rewritten for macOS/Mach node_exporter metrics — swap-based memory pressure and apfs /System/Volumes/Data disk (the old Linux MemAvailable/fstype rules never fired). promtool-validated 5/5.
  • syncthing.nix conflict resolved keeping config.homebrew.prefix (main) + the extraActivation footgun fix (branch).

Verified: nix build .#darwinConfigurations.studio.system exit 0.

bryan closed this pull request 2026-06-19 00:04:27 +00:00

Reference
bryan/nix-configs!5