WIP: Fix silent 65-hour outage: alerting + open-webui hardening + activation-script footgun #5
Summary
Fallout from a 65-hour Open-WebUI outage (Apr 21 19:05 → Apr 24 12:11) that was visible only through Cloudflare's "server not responding" error, because the monitoring stack was structurally broken in four different ways, each of which hid the next.
What went wrong (root-cause chain)
1. `ServiceDown` rules had been firing for 65h, but no Alertmanager was wired. Alerts evaluated into the void.
2. Open-WebUI was installed via `pip --user`, which stomped the shared user site-packages and dropped `greenlet`. The service crash-looped 40+ times against the DB session.
3. `nix-darwin` only composes a fixed set of named phases (`preActivation`, `extraActivation`, `postActivation`, `launchd`, `homebrew`, etc.) into the final activate script. Every custom-named `system.activationScripts.{foo}.text` in this repo (ollama firewall/restart, monitoring-setup, syncthing-firewall, etc.) was defined, built, and never executed.
4. The ollama restart hook from `4c9b35b` was one of those silently skipped scripts, so ollama stayed on 0.20.4 for weeks while brew symlinked through 0.21.0 → 0.21.2.

Fixes (in this PR)
1. Alerting pipeline: Prometheus → Alertmanager → smtp2go → email
Initial wiring tried to use Grafana's embedded Alertmanager (fewer moving parts, reuses the existing smtp2go config). That worked for Grafana's own alerting flow but hit a Grafana 12.4 bug: its `/api/v2/alerts` endpoint expects a non-spec wrapped payload (`{alerts: [...]}`) while Prometheus sends the canonical bare-array AM v2 format, so every real alert was rejected with a 400.

Pivoted to the real `prometheus-alertmanager` from nixpkgs (AM isn't in Homebrew). It runs as a dedicated launchd agent on :9093 with SMTP config reusing `~/.secrets/grafana-smtp-password`. Grafana now has Alertmanager added as a datasource so alert state is still viewable in its UI.

Smoke-tested end-to-end: test alert POSTed to AM → `alertmanager_notifications_total{email}=1`, 0 failures, email delivered.

2. Open-WebUI: `uv tool install` + import probe
- Switched from `pip install --user` to `uv tool install --python 3.11 open-webui`: isolated venv with a uv-managed lockfile.
- Runs an import probe (`python -c "import open_webui.main, greenlet, sqlalchemy, ..."`) against the tool venv before `kickstart -k`. A broken dep resolution now aborts the restart instead of crash-looping the service behind the tunnel. The existing process keeps running.
- […] `uv tool install` race between the two agents at load).

3. Activation-scripts footgun: fix + document
Folded every custom-named activation script into `system.activationScripts.extraActivation.text = lib.mkAfter …` across ollama, open-webui, syncthing, monitoring, smb-mount, icloud-backup.

Updated `modules/services/AGENTS.md` with the full fixed list of nix-darwin phases and a loud warning. Verified via `grep` on `/run/current-system/activate`: 23 real `socketfilterfw`/`kickstart` calls embedded in activate, up from zero before.

Side-effect win: ollama was auto-restarted on this rebuild for the first time since the hook was added; API now reports 0.21.2 (was 0.20.4 for 16 days).
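
For reviewers unfamiliar with the footgun, the folding described above looks roughly like this in a service module. This is a minimal sketch, not the code in this PR: the module shape, the `ollamaRestart` name, the echo message, and the launchd label `org.example.ollama` are hypothetical placeholders. It assumes `extraActivation.text` merges as a lines-style option, so several modules can each append a fragment with `lib.mkAfter`.

```nix
{ lib, ... }:
{
  # Before (silently skipped): nix-darwin builds this script but never
  # composes a custom-named phase into the final activate script.
  # "ollamaRestart" is a hypothetical name for illustration.
  #
  # system.activationScripts.ollamaRestart.text = ''
  #   launchctl kickstart -k system/org.example.ollama || true
  # '';

  # After: append to one of the fixed phases that nix-darwin does run.
  # lib.mkAfter orders this fragment after other contributions to the
  # same option, so each service module can add its own piece.
  system.activationScripts.extraActivation.text = lib.mkAfter ''
    echo "restarting ollama..." >&2
    # Hypothetical launchd label; substitute the real service target.
    launchctl kickstart -k system/org.example.ollama || true
  '';
}
```

A quick way to confirm a fragment landed is the same check used in this PR: grep the built `/run/current-system/activate` for the command you expect to see.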
4. Observability cleanup
- `loki.process` regex stage that extracts `service_name` from the log filename basename. `{service_name="open-webui"}` now works in LogQL instead of filtering on filename.
- `probe_success{job="blackbox"}` + `{{instance}}` legend: auto-scales to whatever's in `blackbox.targets`.

Files changed
Testing / verification
Activated on `studio` and smoke-tested:
- `nix flake check` / `darwin-rebuild build .#studio` clean
- 23 `socketfilterfw`/`kickstart` calls in `/run/current-system/activate` (was 0)
- `uv tool list` shows `open-webui v0.9.2`; `/health` returns `{"status":true}` behind Cloudflare
- ollama `/api/version` now `0.21.2`
- `service_name` labels for all 8 running services
- `alertmanager_notifications_total{email}=1` with 0 failures

Follow-ups
- `~/.secrets/grafana-admin-password` and `~/.secrets/grafana-prometheus-token` exist from the Grafana-AM detour and are now unreferenced. Harmless to leave; deletable at leisure (already done).
- `prometheus-alerter` still in DB; can be deleted via UI or API.
- Any `services/*.nix` with custom activation-script names would benefit from the same audit, if they exist elsewhere.

Commits
- `81bd870` add monitoring alert emails
- `0df1fb1` use alertmanager