Single faulty host with postgres-continuous-discovery breaks metrics collection for all hosts #890

**Description:**
We've encountered a critical issue where a single problematic host configured with `postgres-continuous-discovery` can completely stop metrics collection from all monitored hosts. Here are our detailed findings:

**Environment:**
- pgwatch3 versions tested: 3.6.0, 3.7.0 (installed from https://github.com/cybertec-postgresql/pgwatch/releases/download/v3.7.0/pgwatch_Linux_x86_64.deb)
- PostgreSQL versions: Various (issue appeared after upgrading one host)
- PostgreSQL version for pgwatch: 16.9

**Steps to Reproduce:**
1. Configure a host with `kind: postgres-continuous-discovery` that either:
   - has insufficient permissions (the discovery query is denied on the `btrim` function), or
   - has incorrect `pg_hba.conf` settings (connections are rejected)
2. Wait for pgwatch3 to attempt database discovery
3. Observe that metrics stop flowing from ALL hosts, not just the problematic one

**Observed Behavior:**
When a continuous-discovery host fails with either a permission error on the discovery query:
`[ERROR] [sql:select /* pgwatch_generated */
        datname
        from pg_database
        where not datistemplate
        and datallowconn
        and has_database_privilege (datname, 'CONNECT')
        and case when length(trim($1)) > 0 then datname ~ $1 else true end
        and case when length(trim($2)) > 0 then not datname ~ $2 else true end] [args:[.* (template|postgres)]] [err:ERROR: permission denied for function btrim (SQLSTATE 42501)] [pid:968232] [time:3.164024ms]`
or a connection error:
`FATAL: no pg_hba.conf entry for host [...] (SQLSTATE 28000)`
the entire metrics collection system becomes blocked. Metrics stop being collected from all hosts until the problematic host is either:
- disabled (`is_enabled: false`), or
- changed to `kind: postgres`

**Expected Behavior:**
- Metrics should continue to be collected from all other hosts
- Only the problematic host should be marked as failed
- Errors should be logged but not block the entire collection process
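
To illustrate the kind of isolation we have in mind, here is a minimal Python sketch. It is not pgwatch code (pgwatch itself is written in Go), and `collect_from_all`, `fake_resolve`, and the host names are hypothetical names made up for this example. The only point is that a discovery failure on one source is recorded for that source and the loop moves on to the next one.

```python
def collect_from_all(sources, resolve_databases):
    """Resolve databases per source; one failing source must not stop the rest."""
    resolved, failed = {}, {}
    for name in sources:
        try:
            resolved[name] = resolve_databases(name)
        except Exception as exc:
            # Record (and, in a real collector, log) the error, then continue.
            failed[name] = str(exc)
    return resolved, failed


def fake_resolve(name):
    # Hypothetical stand-in for the discovery query; fails for one host only.
    if name == "problem-host":
        raise PermissionError("permission denied for function btrim")
    return ["appdb"]


resolved, failed = collect_from_all(["healthy-host", "problem-host"], fake_resolve)
print(resolved)  # {'healthy-host': ['appdb']}
print(failed)    # {'problem-host': 'permission denied for function btrim'}
```

With this shape, `healthy-host` keeps its resolved database list even though `problem-host` raised, which matches the behavior described in the bullets above.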

**Workaround:**
Changing the problematic host to `kind: postgres` (explicit DB listing) allows metrics to flow again from the other hosts, though connection errors are still logged for that specific host.

**Configuration Example:**
`/etc/pgwatch/sources.yaml`:
```yaml
- name: problem-host
  group: test
  conn_str: postgresql://pgwatch:<password>@<problem-host>:<port>/postgres
  custom_metrics: {}
  custom_metrics_standby: {}
  kind: postgres-continuous-discovery
  include_pattern: .*
  exclude_pattern: (template|postgres)
  preset_metrics: full
  preset_metrics_standby: ""
  is_enabled: true
  custom_tags:
    env: test
    host: problem-host
  host_config:
    dcs_type: ""
    dcs_endpoints: []
    scope: ""
    namespace: ""
    username: ""
    password: ""
    ca_file: ""
    cert_file: ""
    key_file: ""
    logs_glob_path: ""
    logs_match_regex: ""
    per_metric_disabled_intervals: []
  only_if_master: false
```
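
For readers unfamiliar with how `include_pattern` and `exclude_pattern` are applied, the following Python sketch mirrors the two `case when length(trim($n)) > 0 ...` conditions from the discovery query quoted under Observed Behavior. It is an approximation under stated assumptions: `filter_databases` and the database names are hypothetical, Python's `re.search` stands in for PostgreSQL's unanchored `~` operator, and the two regex dialects are not identical.

```python
import re


def filter_databases(datnames, include_pattern, exclude_pattern):
    """Approximate the include/exclude logic of the discovery query."""
    selected = []
    for datname in datnames:
        # A blank include pattern means "include everything".
        if include_pattern.strip() and not re.search(include_pattern, datname):
            continue
        # A blank exclude pattern means "exclude nothing".
        if exclude_pattern.strip() and re.search(exclude_pattern, datname):
            continue
        selected.append(datname)
    return selected


names = ["postgres", "appdb", "template_copy", "metrics"]
print(filter_databases(names, ".*", "(template|postgres)"))
# ['appdb', 'metrics']
```

The trimming step in the SQL is where `btrim` comes in, which is why a role without execute permission on that function fails before any database is listed.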
`pgwatch.service` unit file:
```ini
[Unit]
Description=pgwatch metrics collector
After=network.target

[Service]
# Sinks
Environment="PW_BATCHING_DELAY=950ms"
Environment="PW_RETENTION=30"
Environment="PW_REAL_DBNAME_FIELD=real_dbname"
Environment="PW_SYSTEM_IDENTIFIER_FIELD=sys_id"
Environment="PW_SINK=postgres://pgwatch@127.0.0.1:5433/pgwatch_metrics"

# Sources
Environment="PW_REFRESH=120"
Environment="PW_MIN_DB_SIZE_MB=10"
Environment="PW_MAX_PARALLEL_CONNECTIONS_PER_DB=4"
Environment="PW_SOURCES=/etc/pgwatch/sources.yaml"

# Metrics
Environment="PW_CREATE_HELPERS=false"
Environment="PW_DIRECT_OS_STATS=true"
Environment="PW_INSTANCE_LEVEL_CACHE_MAX_SECONDS=30"
Environment="PW_METRICS=postgresql://pgwatch@127.0.0.1:5433/pgwatch_metrics"

# WebUI
# Environment="PW_WEBDISABLE=all"
Environment="PW_WEBUSER=web_user"
Environment="PW_WEBPASSWORD=web_user_password"
Environment="PW_WEBADDR=mywebhost:8080"

Type=simple
User=pgwatch
ExecStart=/usr/bin/pgwatch \
  --log-file=/var/log/pgwatch.log \
  --log-file-format=text \
  --log-level=error
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

**Questions:**
- Is this the intended behavior for continuous-discovery hosts?
- Could pgwatch3 implement better error isolation?
- Are there recommended permissions/configuration for continuous-discovery?

**Additional Context:**
This became particularly problematic during PostgreSQL upgrades, when temporary permission or connection problems on a single host took down monitoring for our entire infrastructure.

