Single faulty host with postgres-continuous-discovery breaks metrics collection for all hosts #890
Description:
We've encountered a critical issue where a single problematic host configured with postgres-continuous-discovery can completely stop metrics collection from all monitored hosts. Here are our detailed findings:
Environment:
- pgwatch3 versions tested: 3.6.0, 3.7.0 (install https://github.com/cybertec-postgresql/pgwatch/releases/download/v3.7.0/pgwatch_Linux_x86_64.deb)
- PostgreSQL versions: Various (issue appeared after upgrading one host)
- PostgreSQL version for pgwatch: 16.9
Steps to Reproduce:
- Configure a host with kind: postgres-continuous-discovery that either:
  - Has insufficient permissions (blocks on btrim function)
  - Has incorrect pg_hba.conf settings (blocks connections)
- Wait for pgwatch3 to attempt database discovery
- Observe that metrics stop flowing from ALL hosts, not just the problematic one
Observed Behavior:
When a continuous-discovery host fails with either:
[ERROR] [sql:select /* pgwatch_generated */ datname from pg_database where not datistemplate and datallowconn and has_database_privilege (datname, 'CONNECT') and case when length(trim($1)) > 0 then datname ~ $1 else true end and case when length(trim($2)) > 0 then not datname ~ $2 else true end] [args:[.* (template|postgres)]] [err:ERROR: permission denied for function btrim (SQLSTATE 42501)] [pid:968232] [time:3.164024ms]
or connection errors:
FATAL: no pg_hba.conf entry for host [...] (SQLSTATE 28000)
The entire metrics collection system becomes blocked. Metrics stop being collected from all hosts until:
- The problematic host is disabled (is_enabled: false)
- Or changed to kind: postgres
Expected Behavior:
- Metrics should continue to be collected from all other hosts
- Only the problematic host should be marked as failed
- Errors should be logged but not block the entire collection process
Workaround:
Changing the problematic host to kind: postgres (explicit DB listing) allows metrics to flow again, though with connection errors for that specific host.
Configuration Example:
/etc/pgwatch/sources.yaml
- name: problem-host
  group: test
  conn_str: postgresql://pgwatch:<password>@<problem-host>:<port>/postgres
  custom_metrics: {}
  custom_metrics_standby: {}
  kind: postgres-continuous-discovery
  include_pattern: .*
  exclude_pattern: (template|postgres)
  preset_metrics: full
  preset_metrics_standby: ""
  is_enabled: true
  custom_tags:
    env: test
    host: problem-host
  host_config:
    dcs_type: ""
    dcs_endpoints: []
    scope: ""
    namespace: ""
    username: ""
    password: ""
    ca_file: ""
    cert_file: ""
    key_file: ""
    logs_glob_path: ""
    logs_match_regex: ""
    per_metric_disabled_intervals: []
  only_if_master: false
config pgwatch.service
[Unit]
Description=pgwatch metrics collector
After=network.target
[Service]
# Sinks
Environment="PW_BATCHING_DELAY=950ms"
Environment="PW_RETENTION=30"
Environment="PW_REAL_DBNAME_FIELD=real_dbname"
Environment="PW_SYSTEM_IDENTIFIER_FIELD=sys_id"
Environment="PW_SINK=postgres://pgwatch@127.0.0.1:5433/pgwatch_metrics"
# Sources
Environment="PW_REFRESH=120"
Environment="PW_MIN_DB_SIZE_MB=10"
Environment="PW_MAX_PARALLEL_CONNECTIONS_PER_DB=4"
Environment="PW_SOURCES=/etc/pgwatch/sources.yaml"
# Metrics
Environment="PW_CREATE_HELPERS=false"
Environment="PW_DIRECT_OS_STATS=true"
Environment="PW_INSTANCE_LEVEL_CACHE_MAX_SECONDS=30"
Environment="PW_METRICS=postgresql://pgwatch@127.0.0.1:5433/pgwatch_metrics"
# WebUI
# Environment="PW_WEBDISABLE=all"
Environment="PW_WEBUSER=web_user"
Environment="PW_WEBPASSWORD=web_user_password"
Environment="PW_WEBADDR=mywebhost:8080"
Type=simple
User=pgwatch
ExecStart=/usr/bin/pgwatch \
--log-file=/var/log/pgwatch.log \
--log-file-format=text \
--log-level=error
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Questions:
- Is this the intended behavior for continuous-discovery hosts?
- Could pgwatch3 implement better error isolation?
- Are there recommended permissions/configuration for continuous-discovery?
Additional Context:
This became particularly problematic during PostgreSQL upgrades where temporary permission issues or connection problems would take down monitoring for our entire infrastructure.