Single faulty host with postgres-continuous-discovery breaks metrics collection for all hosts #890

@Timz2009

Description:
We've encountered a critical issue where a single problematic host configured with postgres-continuous-discovery can completely stop metrics collection from all monitored hosts. Here are our detailed findings:

Environment:

Steps to Reproduce:

  1. Configure a host with kind: postgres-continuous-discovery that either:
     • Has insufficient permissions (blocks on btrim function)
     • Has incorrect pg_hba.conf settings (blocks connections)
  2. Wait for pgwatch3 to attempt database discovery
  3. Observe that metrics stop flowing from ALL hosts, not just the problematic one

Observed Behavior:
A continuous-discovery host can fail with either a permission error:
[ERROR] [sql:select /* pgwatch_generated */ datname from pg_database where not datistemplate and datallowconn and has_database_privilege (datname, 'CONNECT') and case when length(trim($1)) > 0 then datname ~ $1 else true end and case when length(trim($2)) > 0 then not datname ~ $2 else true end] [args:[.* (template|postgres)]] [err:ERROR: permission denied for function btrim (SQLSTATE 42501)] [pid:968232] [time:3.164024ms]
or a connection error:
FATAL: no pg_hba.conf entry for host [...] (SQLSTATE 28000)
In both cases the entire metrics collection system becomes blocked. Metrics stop being collected from all hosts until:

  • The problematic host is disabled (is_enabled: false)
  • Or changed to kind: postgres

Expected Behavior:

  • Metrics should continue to be collected from all other hosts
  • Only the problematic host should be marked as failed
  • Errors should be logged but not block the entire collection process
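
To make the expected behavior concrete, here is a minimal Python sketch of per-source error isolation. It is not pgwatch's actual code (pgwatch is written in Go, and its internals are not shown in this report); the names `SourceStatus`, `discover_all`, `healthy_host`, and `problem_host` are hypothetical and exist only for illustration. The idea is that a failed discovery is logged and recorded against that one source while the loop carries on with the rest.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, List

logger = logging.getLogger("discovery_sketch")


@dataclass
class SourceStatus:
    """Outcome of one discovery attempt (hypothetical type, for illustration)."""
    name: str
    databases: List[str] = field(default_factory=list)
    error: str = ""

    @property
    def failed(self) -> bool:
        return bool(self.error)


def discover_all(sources: Dict[str, Callable[[], List[str]]]) -> Dict[str, SourceStatus]:
    """Run discovery for every source; a failure in one never stops the others."""
    results: Dict[str, SourceStatus] = {}
    for name, discover in sources.items():
        status = SourceStatus(name=name)
        try:
            status.databases = discover()
        except Exception as exc:
            # Log the error and mark only this source as failed.
            logger.error("discovery failed for source %s: %s", name, exc)
            status.error = str(exc)
        results[name] = status
    return results


# Stand-ins for real discovery calls (assumptions, not pgwatch functions).
def healthy_host() -> List[str]:
    return ["app_db", "reporting"]


def problem_host() -> List[str]:
    raise PermissionError("permission denied for function btrim (SQLSTATE 42501)")


statuses = discover_all({"problem-host": problem_host, "healthy-host": healthy_host})
for status in statuses.values():
    print(status.name, "FAILED" if status.failed else status.databases)
```

A real collector would probably also want a per-source timeout so that a hanging connection cannot stall the loop; that is left out here to keep the sketch short.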

Workaround:
Changing the problematic host to kind: postgres (explicit DB listing) allows metrics to flow again, though with connection errors for that specific host.

Configuration Example:
/etc/pgwatch/sources.yaml

- name: problem-host
  group: test
  conn_str: postgresql://pgwatch:<password>@<problem-host>:<port>/postgres
  custom_metrics: {}
  custom_metrics_standby: {}
  kind: postgres-continuous-discovery
  include_pattern: .*
  exclude_pattern: (template|postgres)
  preset_metrics: full
  preset_metrics_standby: ""
  is_enabled: true
  custom_tags:
    env: test
    host: problem-host
  host_config:
    dcs_type: ""
    dcs_endpoints: []
    scope: ""
    namespace: ""
    username: ""
    password: ""
    ca_file: ""
    cert_file: ""
    key_file: ""
    logs_glob_path: ""
    logs_match_regex: ""
    per_metric_disabled_intervals: []
  only_if_master: false
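
For readers unfamiliar with how include_pattern and exclude_pattern interact, the following Python sketch mimics the two `case when length(trim($n)) > 0 ...` conditions from the discovery query in the error log above. It is an approximation under stated assumptions: `filter_databases` is a hypothetical helper, Python's `re.search` stands in for PostgreSQL's unanchored `~` operator (the two regex dialects differ in some details), and the other conditions of the query (datistemplate, datallowconn, has_database_privilege) are not modeled.

```python
import re
from typing import Iterable, List


def filter_databases(names: Iterable[str], include_pattern: str, exclude_pattern: str) -> List[str]:
    """Keep names that match include_pattern and do not match exclude_pattern.

    An empty (or all-spaces) pattern disables that check, mirroring
    `case when length(trim($n)) > 0 then ... else true end` in the query.
    """
    selected = []
    for name in names:
        if include_pattern.strip(" ") and not re.search(include_pattern, name):
            continue
        if exclude_pattern.strip(" ") and re.search(exclude_pattern, name):
            continue
        selected.append(name)
    return selected


# Example database names are made up for illustration.
candidates = ["postgres", "app_db", "template_copy", "reporting"]
selected = filter_databases(candidates, ".*", "(template|postgres)")
print(selected)
```

Because the match is unanchored, the exclude pattern in this configuration would also drop any database whose name merely contains "template" or "postgres".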

pgwatch.service unit file:

[Unit]
Description=pgwatch metrics collector
After=network.target

[Service]
# Sinks
Environment="PW_BATCHING_DELAY=950ms"
Environment="PW_RETENTION=30"
Environment="PW_REAL_DBNAME_FIELD=real_dbname"
Environment="PW_SYSTEM_IDENTIFIER_FIELD=sys_id"
Environment="PW_SINK=postgres://pgwatch@127.0.0.1:5433/pgwatch_metrics"

# Sources
Environment="PW_REFRESH=120"
Environment="PW_MIN_DB_SIZE_MB=10"
Environment="PW_MAX_PARALLEL_CONNECTIONS_PER_DB=4"
Environment="PW_SOURCES=/etc/pgwatch/sources.yaml"

# Metrics
Environment="PW_CREATE_HELPERS=false"
Environment="PW_DIRECT_OS_STATS=true"
Environment="PW_INSTANCE_LEVEL_CACHE_MAX_SECONDS=30"
Environment="PW_METRICS=postgresql://pgwatch@127.0.0.1:5433/pgwatch_metrics"

# WebUI
# Environment="PW_WEBDISABLE=all"
Environment="PW_WEBUSER=web_user"
Environment="PW_WEBPASSWORD=web_user_password"
Environment="PW_WEBADDR=mywebhost:8080"

Type=simple
User=pgwatch
ExecStart=/usr/bin/pgwatch \
  --log-file=/var/log/pgwatch.log \
  --log-file-format=text \
  --log-level=error
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Questions:

  • Is this the intended behavior for continuous-discovery hosts?
  • Could pgwatch3 implement better error isolation?
  • Are there recommended permissions/configuration for continuous-discovery?

Additional Context:
This became particularly problematic during PostgreSQL upgrades where temporary permission issues or connection problems would take down monitoring for our entire infrastructure.

Labels: bug (Something isn't working)
