Bug report
We upgraded some servers to the latest deb release on repos.influxdata.com, only because that repo does not keep old versions (filing a separate ticket about that).
The latest version seems to fail in a way I can't diagnose: the log output is the same as before, yet the response from the prometheus output metrics endpoint intermittently contains only a limited subset of the actual data.
It seems to be reporting only the last batch of metrics written, not all metrics collected since the last scrape.
In the config below, the exec input pulls cgroup memory stats for each container running on the host. It returns ~3000-4000 distinct metrics each run, and that count is stable for a given number of running containers.
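For context, the capture script is essentially a loop over the per-container cgroup memory controllers that prints one influx line protocol record per container. A rough sketch of that shape (the real script is environment-specific; the paths, measurement name, and capsule_name tag here are illustrative assumptions, not the actual script):
#!/usr/bin/env bash
# Illustrative sketch only -- the real /usr/local/bin/capture_memory_metrics
# is environment-specific. Emits one influx line protocol record per
# container cgroup (assumed cgroup v1 layout under /sys/fs/cgroup/memory).
for cg in /sys/fs/cgroup/memory/docker/*/; do
    name=$(basename "$cg")
    usage=$(cat "$cg/memory.usage_in_bytes")
    limit=$(cat "$cg/memory.limit_in_bytes")
    echo "container_memory,capsule_name=${name} usage=${usage}i,limit=${limit}i"
done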
Relevant telegraf.conf:
/etc/telegraf/telegraf.conf
[agent]
collection_jitter = "3s"
flush_interval = "10s"
flush_jitter = "5s"
interval = "15s"
round_interval = true
/etc/telegraf/telegraf.d/default_inputs.conf
[[inputs.cpu]]
fieldpass = ["usage_*"]
percpu = false
totalcpu = true
[[inputs.disk]]
fieldpass = ["free", "total", "used", "inodes_used", "inodes_total"]
ignore_fs = ["tmpfs", "devtmpfs"]
mount_points = ["/", "/data_0", "/data"]
[[inputs.diskio]]
[[inputs.exec]]
commands = ["/usr/local/bin/capture_memory_metrics"]
data_format = "influx"
timeout = "15s"
[[inputs.mem]]
fielddrop = ["*_percent"]
[[inputs.net]]
interfaces = ["eth0", "bond0", "bond1", "br0"]
[[inputs.system]]
fieldpass = ["uptime", "load*"]
/etc/telegraf/telegraf.d/default_outputs.conf
[[outputs.prometheus_client]]
listen = ":9126"
System info:
Telegraf 1.0.0 (upgraded from 0.13.2)
Steps to reproduce:
(This is an assumption, based on the cases that do exhibit the problem.)
- create an exec input that returns > 1000 metrics (a hypothetical generator is sketched after this list)
- run with that input and the prometheus_client output
- observe the number of metrics scraped each second on the Prometheus endpoint
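For the first step, any exec command that emits more than 1000 influx-protocol lines should do. A hypothetical generator (metric and tag names are made up to mimic the shape of our real script's output):
#!/usr/bin/env bash
# Hypothetical load generator: emits 2000 influx line protocol metrics,
# shaped like the real capture script's output.
for i in $(seq 1 2000); do
    echo "container_memory,capsule_name=capsule_${i} usage=$((i * 1024))i"
done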
Expected behavior:
Should consistently see the correct number of metrics. In our case, this is what happened when we scraped every second and counted the metrics scraped using 0.13.2:
$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Number of capsules metrics scraped:3980
Actual behavior:
After upgrading to 1.0.0 we see:
$ for i in {1..999}; do echo -n "Number of capsules metrics scraped:"; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:3
Number of capsules metrics scraped:0
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
Number of capsules metrics scraped:1000
These numbers match the log messages about the number of metrics in each batch, so the bug appears to be that the prometheus output only reports the last batch of metrics written, not all metrics collected since the last scrape.
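To line the dips up against the 10s flush_interval, the same scrape loop with a timestamp prefix makes the pattern easier to see (this is just the earlier command with date added):
$ for i in {1..999}; do echo -n "$(date +%T) capsules scraped: "; curl 2>/dev/null localhost:9126/metrics | grep -o -P 'capsule_name="[a-z0-9_]+"' | wc -l; sleep 1; done
If the count resets to a small number right after each flush, that would be consistent with the output dropping everything except the most recent batch.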