Comments

  • jamiemallers 1 hour ago
    The agent-based pull model is the right architecture here. We ran something similar internally and the key insight is exactly what you landed on: keep the agent dumb, evaluate server-side. The moment you put alerting logic on the agent, you need to redeploy agents every time you tweak a threshold, and coordinating that across 20 clusters is a nightmare.
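
    Roughly, the shape that worked for us looks like the sketch below (Go-ish, all names invented, not anyone's actual API): the agent only fetches check definitions, runs the probes, and ships raw results back; every threshold and alerting decision stays on the server.

        package main

        import (
            "bytes"
            "encoding/json"
            "net/http"
            "time"
        )

        // Check definitions come from the server; the agent never interprets them
        // beyond "probe this URL". Thresholds and alerting live server-side.
        type Check struct {
            ID  string `json:"id"`
            URL string `json:"url"`
        }

        // Result is intentionally raw: status code, latency, error string.
        // Whether this counts as "down" is the server's call.
        type Result struct {
            CheckID   string `json:"check_id"`
            Status    int    `json:"status"`
            LatencyMS int64  `json:"latency_ms"`
            Error     string `json:"error,omitempty"`
        }

        func main() {
            server := "https://control-plane.internal.example" // hypothetical endpoint
            for {
                var checks []Check
                if resp, err := http.Get(server + "/api/agent/checks"); err == nil {
                    json.NewDecoder(resp.Body).Decode(&checks)
                    resp.Body.Close()
                }

                var results []Result
                for _, c := range checks {
                    start := time.Now()
                    r := Result{CheckID: c.ID}
                    if probe, err := http.Get(c.URL); err != nil {
                        r.Error = err.Error()
                    } else {
                        r.Status = probe.StatusCode
                        probe.Body.Close()
                    }
                    r.LatencyMS = time.Since(start).Milliseconds()
                    results = append(results, r)
                }

                body, _ := json.Marshal(results)
                http.Post(server+"/api/agent/results", "application/json", bytes.NewReader(body))
                time.Sleep(30 * time.Second)
            }
        }

    The payoff is that changing a threshold or adding an alert rule is purely a server-side change; the fleet of agents never needs a redeploy.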

    K8s auto-discovery via Ingresses/Services/HTTPRoutes is clever. One edge case to watch: teams using custom CRDs for routing (Istio VirtualServices, Traefik IngressRoutes). You'll get requests for those pretty fast once people adopt this in real clusters. A plugin/annotation system where users can teach the agent about custom resource types would scale better than hard-coding each one.
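
    For the plugin angle, the dynamic client makes this fairly painless: the user hands the agent a list of GroupVersionResources to watch and the agent needs no typed code per CRD. Minimal sketch (the annotation key is made up for illustration):

        package main

        import (
            "context"
            "fmt"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/apimachinery/pkg/runtime/schema"
            "k8s.io/client-go/dynamic"
            "k8s.io/client-go/rest"
        )

        func main() {
            cfg, err := rest.InClusterConfig() // agent runs inside the cluster
            if err != nil {
                panic(err)
            }
            dyn, err := dynamic.NewForConfig(cfg)
            if err != nil {
                panic(err)
            }

            // User-supplied routing CRDs to discover, instead of hard-coding
            // Istio/Traefik/etc. into the agent binary.
            extra := []schema.GroupVersionResource{
                {Group: "networking.istio.io", Version: "v1beta1", Resource: "virtualservices"},
                {Group: "traefik.io", Version: "v1alpha1", Resource: "ingressroutes"},
            }

            for _, gvr := range extra {
                list, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
                if err != nil {
                    continue // CRD not installed in this cluster
                }
                for _, item := range list.Items {
                    // Each discovered object becomes a monitor candidate; an
                    // annotation can override the probed URL or opt out.
                    url := item.GetAnnotations()["monitoring.example/url"] // hypothetical annotation key
                    fmt.Println(item.GetNamespace(), item.GetName(), url)
                }
            }
        }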

    The "what's actually down vs a blip" problem is where most monitoring tools quietly fail. Two things that help: (1) requiring N consecutive failures before marking down, with N configurable per-monitor (a database might need N=1, a CDN edge might need N=3), and (2) correlating failures across monitors. If 5 services behind the same ingress controller all fail simultaneously, that's one incident, not five.

    Curious about your status page auto-generation. Do you group services by namespace, by cluster, or something else? In our experience the auto-generated grouping is never quite what customers want to show publicly, so having an easy way to override the hierarchy matters a lot.

    • canto 1 hour ago
      "A plugin/annotation system where users can teach the agent about custom resource types would scale better than hard-coding each one." - this is a fantastic observation and feedback! Many thanks!

      "requiring N consecutive failures before marking down" - I do have the code for it, it's just hidden currently. StatusDude supports 2 types of worker/agents - cloud agents - that will re-verify from multiregion the service status and private agents - the ones we're talking about here - that I might just bring this option back as it makes more sense.

      Correlating failures is a bit tricky, since it usually requires some sort of manual dependency mapping, but for k8s Ingresses and similar resources I should be able to infer the relationship and at least send alerts with appropriate priorities and ordering.
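
      Roughly what I have in mind (simplified sketch, not the real data model): group failing monitors by the Ingress they were discovered from and open one incident per group instead of one alert per monitor.

          package main

          import "fmt"

          // Failure is a monitor that just crossed its failure threshold, tagged with
          // the k8s resource it was discovered from.
          type Failure struct {
              Monitor string
              Ingress string // e.g. "ingress-nginx/public" for everything behind that controller
          }

          // correlate groups simultaneous failures that share an Ingress, so five
          // services behind one broken controller become a single incident.
          func correlate(failures []Failure) map[string][]string {
              incidents := map[string][]string{}
              for _, f := range failures {
                  incidents[f.Ingress] = append(incidents[f.Ingress], f.Monitor)
              }
              return incidents
          }

          func main() {
              failures := []Failure{
                  {"api", "ingress-nginx/public"},
                  {"web", "ingress-nginx/public"},
                  {"docs", "ingress-nginx/public"},
                  {"billing-db", "internal/db"},
              }
              for ingress, monitors := range correlate(failures) {
                  fmt.Printf("incident behind %s: %v\n", ingress, monitors)
              }
          }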

      As for the status page auto-generation - currently grouping is based on namespace, since I didn't want to bloat the user dashboard too much. Each monitor is tagged with cluster ID, namespace, and labels, and status pages pick up monitors based on labels. Users are free to modify these and show exactly what they want :)
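
      To make the label pickup concrete, it's essentially a selector match, roughly like this (simplified sketch of the idea, not the actual implementation):

          package main

          import "fmt"

          // Each monitor carries the tags assigned at discovery time.
          type Monitor struct {
              Name   string
              Labels map[string]string // cluster id, namespace, plus user-editable labels
          }

          // A status page shows monitors whose labels contain every key/value in its
          // selector; users edit labels or the selector to control what's public.
          func selectMonitors(monitors []Monitor, selector map[string]string) []Monitor {
              var out []Monitor
              for _, m := range monitors {
                  match := true
                  for k, v := range selector {
                      if m.Labels[k] != v {
                          match = false
                          break
                      }
                  }
                  if match {
                      out = append(out, m)
                  }
              }
              return out
          }

          func main() {
              monitors := []Monitor{
                  {"api", map[string]string{"cluster": "prod-eu", "namespace": "core", "public": "true"}},
                  {"admin", map[string]string{"cluster": "prod-eu", "namespace": "core", "public": "false"}},
              }
              for _, m := range selectMonitors(monitors, map[string]string{"public": "true"}) {
                  fmt.Println(m.Name) // only "api" lands on the public page
              }
          }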