Description
Problem
Peekaping can become a single point of failure in a single node deployment. If the instance running the monitoring service goes down — for example due to hardware issues, restarts, updates, or network outages — all monitoring functionality is immediately lost. In production environments this is critical, because the monitoring system must remain reliable even if parts of the infrastructure fail.
Proposed solution
Enable a high-availability setup with ≥3 replicas where exactly one replica performs monitoring checks at any time. Use a leader-election mechanism for scheduler ownership and automatic failover so that a standby replica takes over when the active one fails. Prevent duplicate checks during transitions.
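For illustration, here is a minimal sketch of what the single-active-scheduler election could look like, assuming a shared datastore that all replicas can reach. The `scheduler_lease` table, the intervals, and the function names are hypothetical and not existing Peekaping code:

```go
package leader

import (
	"context"
	"database/sql"
	"time"
)

const (
	heartbeatInterval = 5 * time.Second  // how often the leader renews its lease
	leaseTTL          = 15 * time.Second // standbys may take over once the lease is this stale
)

// tryAcquire attempts to take or renew the single scheduler lease.
// It returns true when this replica is (still) the leader.
// Assumes a hypothetical table scheduler_lease(name PRIMARY KEY, holder, expires_at)
// in a Postgres-compatible datastore with a registered driver.
func tryAcquire(ctx context.Context, db *sql.DB, instanceID string) (bool, error) {
	res, err := db.ExecContext(ctx, `
		INSERT INTO scheduler_lease (name, holder, expires_at)
		VALUES ('scheduler', $1, now() + $2 * interval '1 second')
		ON CONFLICT (name) DO UPDATE
		SET holder = EXCLUDED.holder, expires_at = EXCLUDED.expires_at
		WHERE scheduler_lease.holder = EXCLUDED.holder
		   OR scheduler_lease.expires_at < now()`,
		instanceID, leaseTTL.Seconds())
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

// Run keeps contending for the lease and starts/stops the scheduler
// as leadership is gained or lost.
func Run(ctx context.Context, db *sql.DB, instanceID string, startScheduler, stopScheduler func()) {
	leading := false
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			if leading {
				stopScheduler()
			}
			return
		case <-ticker.C:
			ok, err := tryAcquire(ctx, db, instanceID)
			if err != nil {
				ok = false // treat datastore errors as "not leader" to stay safe
			}
			switch {
			case ok && !leading:
				startScheduler()
				leading = true
			case !ok && leading:
				stopScheduler() // lost the lease: stop checks to avoid duplicates
				leading = false
			}
		}
	}
}
```

Only the replica whose lease renewal succeeds runs the check scheduler; all other replicas keep serving the API/UI and keep retrying, which gives automatic failover without duplicate checks.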
Scope & Non-Goals
- In scope: multi-replica operation; single active scheduler; leader election; clean failover behavior
- Out of scope: provisioning/operation of external reverse proxies or load balancers — users handle these
Suggestions for documentation
- Provide reference deployments for Docker Swarm (simple HA entry point) and Kubernetes (for advanced setups)
- Include guidance on running multiple replicas, health/readiness checks, and safe configuration of any shared state (see the probe sketch after this list)
- Suggest Traefik as an example reverse proxy in the docs
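As a rough example of the health/readiness guidance, each replica could expose probes along these lines. The `/healthz` and `/readyz` paths and the `isLeader` flag are assumptions for the sketch, not Peekaping's current API:

```go
package health

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

// isLeader is flipped by the leader-election loop (see the sketch above).
var isLeader atomic.Bool

// SetLeader is called by the election loop when leadership changes.
func SetLeader(v bool) { isLeader.Store(v) }

// RegisterProbes exposes liveness and readiness endpoints for the
// orchestrator (Docker Swarm healthcheck or Kubernetes probes).
func RegisterProbes(mux *http.ServeMux) {
	// Liveness: the process is up, regardless of leadership.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: every replica can serve the API/UI, so all replicas
	// report ready; leadership is exposed for observability only.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]any{
			"ready":  true,
			"leader": isLeader.Load(),
		})
	})
}
```

In Kubernetes these endpoints would back the liveness/readiness probes; in Docker Swarm a HEALTHCHECK can hit the same paths. Because every replica reports ready, the reverse proxy (e.g. Traefik) can route requests to any of them.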
Technical notes (informative)
- Centralize scheduler state (e.g., datastore/coordination backend) to avoid split-brain and ensure idempotent job execution
- Define reasonable leader heartbeat and takeover intervals; document expected failover timing
- Clarify rolling-update behavior to avoid unnecessary leadership churn during upgrades
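Building on the hypothetical lease sketch above, one way to keep failover timing predictable and rolling updates smooth is to release the lease on graceful shutdown. Again, `Run`, `scheduler_lease`, and the intervals are illustrative assumptions rather than existing code:

```go
package leader

import (
	"context"
	"database/sql"
	"os/signal"
	"syscall"
)

// With the intervals above, worst-case failover is roughly
// leaseTTL + heartbeatInterval: the lease must expire and a standby
// must reach its next acquisition attempt (~20s with a 15s TTL and 5s heartbeat).

// release drops the lease explicitly so a standby can take over on its next
// heartbeat instead of waiting for expiry. Calling it only on SIGTERM keeps
// rolling updates fast without causing leadership churn while the replica is healthy.
func release(db *sql.DB, instanceID string) error {
	_, err := db.Exec(
		`DELETE FROM scheduler_lease WHERE name = 'scheduler' AND holder = $1`,
		instanceID)
	return err
}

// RunWithGracefulHandover wraps Run so the lease is handed over when the
// orchestrator stops this replica during an upgrade.
func RunWithGracefulHandover(db *sql.DB, instanceID string, start, stop func()) {
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer cancel()
	Run(ctx, db, instanceID, start, stop)
	_ = release(db, instanceID) // best effort: the lease TTL still covers a hard crash
}
```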
Alternatives
To the best of my knowledge, there is no comparable open-source monitoring tool with built-in HA. Practical alternatives are cloud-hosted services (e.g. UptimeRobot) for enterprise use cases, or a different kind of solution altogether, such as an observability platform like New Relic.