WIP: Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun #5

Draft
bryan wants to merge 2 commits from fix/outage-followups into main
Owner

Summary

Fallout from a 65-hour Open-WebUI outage (Apr 21 19:05 → Apr 24 12:11) visible only via Cloudflare's "server not responding" — because the monitoring stack was structurally broken in four different ways that each hid the next.

What went wrong (root-cause chain)

  1. Silent alerting — Prometheus had ServiceDown rules firing for 65h, but no Alertmanager was wired. Alerts evaluated into the void.
  2. Pip-user install rot — Open-WebUI's lifespan hook installs tool/function deps at startup via pip --user, which stomped the shared user site-packages and dropped greenlet. The service crash-looped 40+ times against the DB session.
  3. Activation scripts silently no-op'd: nix-darwin only composes a fixed set of named phases (preActivation, extraActivation, postActivation, launchd, homebrew, etc.) into the final activate script. Every custom-named system.activationScripts.{foo}.text in this repo (ollama firewall/restart, monitoring-setup, syncthing-firewall, etc.) was defined, built, and never executed.
  4. Ollama stale binary — auto-restart hook from 4c9b35b was one of those silently-skipped scripts, so ollama stayed on 0.20.4 for weeks while brew symlinked through 0.21.0 → 0.21.2.

Fixes (in this PR)

1. Alerting pipeline: Prometheus → Alertmanager → smtp2go → email

Initial wiring tried to use Grafana's embedded Alertmanager (fewer moving parts, reuses existing smtp2go config). That worked for Grafana's own alerting flow but hit a Grafana 12.4 bug: its /api/v2/alerts endpoint expects a non-spec wrapped payload ({alerts: [...]}) while Prometheus sends the canonical bare-array AM v2 format, so every real alert was rejected with a 400.
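To make the payload mismatch concrete, here is a minimal sketch that builds both body shapes for the same alert and classifies them. The helper names and the alert labels are hypothetical, and the wrapped form follows the description in this PR rather than any Grafana documentation.

```python
import json
from datetime import datetime, timezone


def make_alert(name: str, instance: str) -> dict:
    """Build one alert object in the Alertmanager v2 shape (labels are illustrative)."""
    return {
        "labels": {"alertname": name, "instance": instance, "severity": "critical"},
        "annotations": {"summary": f"{name} on {instance}"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def bare_array_body(alerts: list[dict]) -> str:
    """Canonical AM v2 body: the JSON document is the array itself."""
    return json.dumps(alerts)


def wrapped_body(alerts: list[dict]) -> str:
    """Wrapped body as described in this PR: the array sits under an "alerts" key."""
    return json.dumps({"alerts": alerts})


def body_shape(body: str) -> str:
    """Classify a request body as "bare-array", "wrapped", or "unknown"."""
    parsed = json.loads(body)
    if isinstance(parsed, list):
        return "bare-array"
    if isinstance(parsed, dict) and isinstance(parsed.get("alerts"), list):
        return "wrapped"
    return "unknown"
```

A receiver that only accepts one shape rejects the other even though both carry identical alert objects, which matches the symptom described above.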

Pivoted to real prometheus-alertmanager from nixpkgs (AM isn't in Homebrew). Runs as a dedicated launchd agent on :9093 with SMTP config reusing ~/.secrets/grafana-smtp-password. Grafana now has Alertmanager added as a datasource so alert state is still viewable in its UI.

Smoke-tested end-to-end: test alert POSTed to AM → alertmanager_notifications_total{email}=1, 0 failures, email delivered.
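The counter check in the smoke test could be scripted roughly as below. This is a sketch, not the check used in this PR: it assumes the metrics text has already been fetched from Alertmanager's /metrics endpoint, and it assumes the counters carry an integration="email" label (the PR abbreviates this as {email}). The simple label regex does not handle a "}" inside a label value.

```python
import re

# One sample line: metric name, optional {labels}, whitespace, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_metric(exposition: str, name: str, **want_labels: str) -> float:
    """Sum all samples of `name` whose labels include every key/value in want_labels."""
    total = 0.0
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if all(labels.get(key) == value for key, value in want_labels.items()):
            total += float(match.group("value"))
    return total


def email_delivery_ok(exposition: str) -> bool:
    """True when at least one email notification was sent and none failed."""
    sent = sum_metric(exposition, "alertmanager_notifications_total", integration="email")
    failed = sum_metric(
        exposition, "alertmanager_notifications_failed_total", integration="email"
    )
    return sent >= 1 and failed == 0
```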

2. Open-WebUI: uv tool install + import probe

  • Migrated from pip install --user to uv tool install --python 3.11 open-webui — isolated venv with a uv-managed lockfile.
  • Updater runs an import probe (python -c "import open_webui.main, greenlet, sqlalchemy, ...") against the tool venv before kickstart -k. A broken dep resolution now aborts the restart instead of crash-looping the service behind the tunnel. The existing process keeps running.
  • Runner handles first-install idempotently; updater yields to runner on fresh installs (avoids a uv tool install race between the two agents at load).
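The probe-before-restart gate described above can be sketched as follows. This is an illustration only, not the updater from this PR: the function names are hypothetical, and the interpreter path and the restart callback (for example, something that runs kickstart -k) are supplied by the caller.

```python
import subprocess
import sys
from typing import Callable, Sequence


def probe_imports(python_exe: str, modules: Sequence[str], timeout: float = 60.0) -> bool:
    """Return True if `python_exe` can import every module in a fresh process."""
    code = "import " + ", ".join(modules)
    try:
        result = subprocess.run(
            [python_exe, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing interpreter or a hung import counts as a failed probe.
        return False
    return result.returncode == 0


def restart_if_healthy(
    python_exe: str, modules: Sequence[str], restart: Callable[[], None]
) -> bool:
    """Call `restart` only when the import probe passes; report whether it ran."""
    if not probe_imports(python_exe, modules):
        return False  # leave the currently running process alone
    restart()
    return True


if __name__ == "__main__":
    # Demo against the current interpreter; a real updater would point at the tool venv.
    print(probe_imports(sys.executable, ["json", "os"]))
```

Running the probe in a separate process means a broken dependency surfaces as a non-zero exit code instead of taking down the updater itself.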

3. Activation-scripts footgun — fix + document

Folded every custom-named activation script into system.activationScripts.extraActivation.text = lib.mkAfter … across ollama, open-webui, syncthing, monitoring, smb-mount, icloud-backup.

Updated modules/services/AGENTS.md with the full fixed list of nix-darwin phases and a loud warning. Verified via grep on /run/current-system/activate — 23 real socketfilterfw / kickstart calls embedded in activate, up from zero before.
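A scripted version of that grep check might look like the sketch below. It counts non-comment lines that mention each command name rather than individual occurrences, so its totals may differ from a plain grep; the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Command names the PR greps for in the composed activate script.
NEEDLES = ("socketfilterfw", "kickstart")


def count_calls(script_text: str, needles=NEEDLES) -> Counter:
    """Count non-comment lines that mention each needle."""
    counts = Counter()
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore comments and the shebang line
        for needle in needles:
            if needle in stripped:
                counts[needle] += 1
    return counts


def count_calls_in_file(path: str = "/run/current-system/activate") -> Counter:
    """Read the activate script from disk and count calls in it."""
    return count_calls(Path(path).read_text())
```

A total of zero after a rebuild would indicate that the folded blocks did not make it into the activate script.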

Side-effect win: ollama was auto-restarted on this rebuild for the first time since the hook was added; API now reports 0.21.2 (was 0.20.4 for 16 days).

4. Observability cleanup

  • Alloy: added loki.process regex stage that extracts service_name from the log filename basename. {service_name="open-webui"} now works in LogQL instead of filtering on filename.
  • Dashboard: replaced 3 hardcoded per-target probe queries with a single probe_success{job="blackbox"} + {{instance}} legend — auto-scales to whatever's in blackbox.targets.
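The filename-to-label extraction can be illustrated with a small regex sketch. The pattern below is an assumption, not the expression used in the Alloy config, and the log paths in the usage note are hypothetical; the named-group syntax shown is also accepted by the RE2-style regexes that Alloy uses.

```python
import re
from typing import Optional

# Capture the basename without its trailing ".log" as service_name.
SERVICE_FROM_FILENAME = re.compile(r"(?:^|/)(?P<service_name>[^/]+)\.log$")


def service_name_from_filename(filename: str) -> Optional[str]:
    """Return the log file's basename minus ".log", or None if it does not match."""
    match = SERVICE_FROM_FILENAME.search(filename)
    return match.group("service_name") if match else None
```

With this pattern a path such as /var/log/services/open-webui.log yields open-webui, while open-webui.err.log would yield open-webui.err, so split stdout/stderr logs would need a slightly different expression.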

Files changed

flake.lock                                         |  30 +--
modules/hosts/studio.nix                           |   9 +-
modules/services/AGENTS.md                         |  95 +++++-
modules/services/grafana/dashboards/service-health.json | 48 +---
modules/services/icloud-backup.nix                 |  18 +-
modules/services/monitoring.nix                    | 227 +++++++++++++++---
modules/services/ollama.nix                        |  25 +--
modules/services/open-webui.nix                    | 188 ++++++++++----
modules/services/smb-mount.nix                     |  29 +--
modules/services/syncthing.nix                     |  17 +-
10 files changed, 485 insertions(+), 201 deletions(-)

Testing / verification

Activated on studio and smoke-tested:

  • nix flake check / darwin-rebuild build .#studio clean
  • All 23 folded activation blocks present in /run/current-system/activate (was 0)
  • uv tool list shows open-webui v0.9.2; /health returns {"status":true} behind Cloudflare
  • Ollama auto-restart fired; running pid matches rebuild time; /api/version now 0.21.2
  • Loki streams carry service_name labels for all 8 running services
  • Prometheus: 1 active AM discovered, 0 errors, 0 dropped
  • Alertmanager: smoke alert accepted, routed, alertmanager_notifications_total{email}=1 with 0 failures
  • Real test email delivered to inbox via smtp2go

Follow-ups

  • ~/.secrets/grafana-admin-password and ~/.secrets/grafana-prometheus-token were created during the Grafana-AM detour and are now unreferenced. They were harmless to leave and have already been deleted.
  • Grafana service account prometheus-alerter still in DB; can be deleted via UI or API.
  • Any other repo whose services/*.nix files define custom-named activation scripts would benefit from the same audit.
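One way to run such an audit is sketched below. It is a rough text scan, not a Nix evaluator: it only catches the fully dotted system.activationScripts.<name> form, and the set of known phases contains only the names quoted in this PR, whose list ends in "etc.", so it is incomplete and should be extended from the nix-darwin source before relying on it.

```python
import re

# Phase names quoted in this PR; the real nix-darwin list is longer.
KNOWN_PHASES = {"preActivation", "extraActivation", "postActivation", "launchd", "homebrew"}

# Matches the dotted attribute path form only, e.g. system.activationScripts.foo.text
ACTIVATION_ATTR = re.compile(r"system\.activationScripts\.([A-Za-z_][A-Za-z0-9_'-]*)")


def custom_activation_names(nix_source: str, known=KNOWN_PHASES) -> list[str]:
    """Return sorted activation-script names that are not in the known phase set."""
    found = ACTIVATION_ATTR.findall(nix_source)
    return sorted({name for name in found if name not in known})
```

Any name this returns is a candidate for folding into extraActivation, subject to checking it against the full phase list.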

Commits

  • 81bd870 add monitoring alert emails
  • 0df1fb1 use alertmanager
bryan changed title from Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun to WIP: Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun 2026-04-24 21:08:10 +00:00
This pull request has changes conflicting with the target branch.
  • flake.lock
Reference
bryan/nix-configs!5