Watchstander — the cross-vendor reliability layer for AI data centersTop 5Deep-reviewed
by April Zheng
What they're building
Watchstander — the cross-vendor reliability layer for AI data centers. It watches a GPU fleet, predicts a hardware failure ~18 hours before it happens, opens a pre-emptive incident, and — only after a human approves — drains the node and files the vendor RMA. sense → predict → act. At frontier scale a single unplanned GPU interruption reportedly costs ~$2.5M, so an 18-hour head start is the whole game. Three tracks: 🔭 1. Sense — normalizes NVIDIA (DCGM) + AMD (ROCm/Redfish) telemetry onto one schema. 🔮 2. Predict — a model trained on real GWDG Xid-79 failure data flags a degrading node 18.17 h early (honest metrics: 0.18/0.56). 🤝 3. Act + Trust — triage → cited diagnosis → fleet-pattern detection → cross-vendor RMA, every destructive action human-gated. How we integrated the sponsor stack Each does real work in the loop (not one token call); offline-deterministic by default, live with keys. ✅ = exercised live (evidence in docs/evidence/). 🧠 Nebius Token Factory ✅ — model-routed inference: Llama-3.3-70B triage, Qwen3-235B diagnosis + RMA summary, plus Qwen3-Embedding-8B re-ranking evidence. Timeout/retry/graceful-JSON hardened. 🔎 Tavily ✅ — multi-query fan-out per signature, scored + de-duped, include_answer synthesis threaded into the Nebius diagnosis. 🛠️ Composio ✅ — the real actions: GitHub ticket + Slack on-call page + cross-vendor RMA. make verify-live posted a real issue (#21) + Slack message. 🧩 mem0 ✅ — causal long-term memory: recall feeds the diagnosis and a confirmed recurrence overrides the drain confidence floor. ⚡ Nebius AI Cloud — Serverless AI Builders Challenge: predictor trained/backtested on AI Cloud GPU; inference repointable via NEBIUS_BASE_URL. 🤖 OpenClaw — the agent runtime: handle_alert + human-approval gate, make openclaw / POST /alert / openclaw.toml. Teammates: April Zheng; Dave Petrovikj; Alice Xu
AI code reviewrepo: real
Exceptional and fully verifiable. The live demo (watchstander.onrender.com/demo) is an interactive Mission Control with REACTIVE/PREDICTION/PREEMPTIVE tracks and a working human-approval drain gate. agent.py implements a real sense->triage->enrich->diagnose->act->observe loop with a confidence floor on destructive actions, fleet-pattern clustering, and mem0 recurrence override. All three sponsors are load-bearing and chained: Nebius (nebius_reasoner.py) does deliberate model routing on api.studio.nebius.ai (Llama-3.3-70B triage, Qwen3-235B diagnosis, Qwen3-Embedding-8B rerank); Tavily (tavily_searcher.py) does multi-query fan-out with include_answer='advanced' threaded into the Nebius diagnosis; Composio (composio_actions.py) executes GITHUB_CREATE_AN_ISSUE + SLACK_CHAT_POST_MESSAGE. Live side-effects are proven: GitHub issue #21 on gigaapril/watchstander was confirmed created by 'make verify-live', and evidence logs show a live Nebius/Tavily/mem0 run. The predictor is a real logistic model trained on the GWDG Zenodo Xid-79 dataset with an honest model card and modest, non-inflated metrics. ⚑ Forecast metrics are deliberately modest (0.18 precision / 0.56 recall) — honestly disclosed, small dataset (10 events/5 nodes). Docs are verbose. None of these undermine the build.