prometheus-alert-rules

A curated collection of Prometheus alert rules and recording rules covering a wide range of systems and services. Rules are organized by component and validated automatically with promtool on every commit.

Usage

Prometheus configuration

# prometheus.yml

global:
  scrape_interval: 15s
  ...

rule_files:
  - 'rules/*.yml'

scrape_configs:
  ...

Alert Rules

Each file in rules/ targets a specific system or service. All alert names use CamelCase and include severity labels (page, warning, info).

File	Description
`backup.yml`	Backup job failures and missing daily backup status
`blackbox.yml`	Blackbox exporter probe failures, SSL certificate expiry, and slow probes
`clock.yml`	Host clock skew and NTP synchronization issues
`consul.yml`	Consul service healthchecks, missing master nodes, and unhealthy agents
`cpu.yml`	High CPU load, underutilization, steal, I/O wait, and context switching
`disk.yml`	Disk space, inodes, read/write rates, latency, and filesystem errors
`docker.yml`	Container CPU, memory, volume usage, and throttle rate
`elasticsearch.yml`	Heap usage, disk space, cluster health, shards, and indexing/query performance
`galera.yml`	MySQL Galera cluster size, replication state, and flow control
`haproxy.yml`	HTTP error rates, connection errors, pending requests, and security blocks
`hwmon.yml`	Physical component temperature and overtemperature alarms
`jvm.yml`	JVM heap/non-heap memory, GC time, thread counts, and file descriptors
`kafka.yml`	Topic replicas, consumer group lag, and offset anomalies
`kernel.yml`	Kernel version deviations across the fleet
`lynis.yml`	Lynis hardening index, warnings, suggestions, and audit freshness
`md.yml`	Software RAID array state and disk failures
`memory.yml`	Host memory, OOM kills, swap usage, and ECC errors
`mimir-alerts.yml`	Grafana Mimir component health, ingestion, compaction, and ruler alerts
`mimir-rules.yml`	Grafana Mimir recording rules for API and component metrics
`mysql.yml`	MySQL availability, replication lag, connections, and slow queries
`network.yml`	Network throughput, interface saturation, conntrack, and errors
`nomad.yml`	Nomad job failures, lost jobs, queued jobs, and blocked evaluations
`nvme.yml`	NVMe drive temperature, wear, media errors, and power-on hours
`php-fpm.yml`	PHP-FPM max children, listen queue, busy processes, and slow requests
`pressure.yml`	Linux PSI (Pressure Stall Information) for CPU, memory, and I/O
`prometheus-alerts.yml`	Prometheus and Alertmanager self-monitoring, target scraping, and TSDB health
`rabbitmq.yml`	RabbitMQ node health, memory, file descriptors, queues, and connections
`reboot.yml`	Node reboots and frequent reboot detection
`redis.yml`	Redis availability, replication, memory, connections, and backups
`smartd.yml`	SMART disk health, temperature, reallocated/pending sectors, and wear
`systemd.yml`	Systemd service crashes
`thumbor.yml`	Thumbor availability, error rate, response time, and storage hit ratio
`timezones.yml`	Sleep-peacefully timezone-based alert suppression windows
`up.yml`	Core component availability (API server, Cadvisor, Kubelet, node exporter, etc.)
`vault.yml`	HashiCorp Vault seal state, availability, and token management

Textfile Collector Scripts

The textfile-collector/ directory contains shell scripts for use with node_exporter's textfile collector. They expose metrics not natively available from exporters.

Script	Description
`lynis.sh`	Exports Lynis audit report metrics (hardening index, warnings, suggestions)
`nvme_metrics.sh`	Exports NVMe drive health metrics via `nvme-cli`
`smartmon.sh`	Exports SMART disk health metrics via `smartctl`

Copy the scripts to your textfile collector directory (e.g. /var/lib/node_exporter/textfile_collector/) and schedule them via cron or a systemd timer.

Development

Prerequisites

Docker — used to run promtool for rule validation
shellcheck — used to lint shell scripts

Validate alert rules

make test

This runs promtool check rules on all files in rules/ inside a Docker container.

Lint textfile-collector scripts

make shellcheck

Adding a new alert rule

Add or edit the appropriate YAML file in rules/.
Run make test to validate the rule syntax.
Follow the existing structure and conventions:

---
groups:
  - name: <group-name>
    rules:
      - alert: <AlertName>
        expr: >
          (<PromQL expression>)
          * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
        for: <duration>
        labels:
          severity: <page|warning|info>
        annotations:
          summary: <Human readable description> (instance {{ $labels.instance }})
          description: "<detailed description>\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Name		Name	Last commit message	Last commit date
Latest commit History 237 Commits
.github		.github
rules		rules
textfile-collector		textfile-collector
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

prometheus-alert-rules

Table of Contents

Usage

Prometheus configuration

Alert Rules

Textfile Collector Scripts

Development

Prerequisites

Validate alert rules

Lint textfile-collector scripts

Adding a new alert rule

About

Uh oh!

Releases 87

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

prometheus-alert-rules

Table of Contents

Usage

Prometheus configuration

Alert Rules

Textfile Collector Scripts

Development

Prerequisites

Validate alert rules

Lint textfile-collector scripts

Adding a new alert rule

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 87

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages