Skip to content

[inputs.system] new uptime metric with wrong type, renamed cpu metrics and other #18919

@Daryes

Description

@Daryes

Use Case

Telegraf version : unreleased yet (would be 1.39.0)
System : multiples
module : inputs.system

Pull discussion : #18533

Hello,

First, thanks for the work done to include some OS informations, this will allow to drop the usage of scripts retrieving said data.

This said, I will chime on some of the latest changes applied to the inputs.system module.

  1. New "uptime" metric wrongly changed to the gauge type.

Unless I'm mistaken, but given the code, the new "uptime" metric is registered as a gauge type, while it should have stayed a counter.
Influxdb and others does not care about this, but it is another matter for prometheus and alike :
the type is send back as a comment with the metric, and used by prometheus.

Given the prometheus documentation :

  • A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
    (...). Do not use a counter to expose a value that can decrease
  • A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

The opposite can be said for a gauge, do not use a gauge for a metric that only increases.
Which is the case for the "uptime" metric, its value never decreases, aside from a restart.

Its type should be corrected back to "counter".

  1. Deprecation of the metric "uptime_format"

I've seen the part of the discussion relating the deprecation since 7 years.

but honestly, I would like this to be kept, as the wording format is different from what can be achieved with grafana and others.
They usually allow only a static format which cannot be changed or easily altered, while uptime_format has a much more concise and/or understandable result for anybody.

  • grafana (duration s): 1 week, 6 days, 8 hours
  • grafana (duration d): 13 d 07:47:31
  • uptime_format : 13 days, 7:47

After 7 years, it should be grandfathered instead of removed.

Also, its true this is a string metric, which should be not encouraged.
Its proper place would be in fact as a tag/label on the "uptime" metric, but it will cause some problems when used as a tag/label, as its value will change each time when gathered.
This would create new series (tag+metric) each time, each minute (or /10s), which would have other consequences.
So there is also not much point to think about moving it as a tag/label on the "uptime" metric.

  1. Renaming of the "cpu" to "cpu_logical"

I think this is a mistake to rename a metric existing since the first version of Telegraf without a real reason.
Normalizing against the recently introduced "cpu_physical" metric is not enough to warrant this, given the impact : cpu, ram, drive, network and system are the first series of metrics used when monitoring anything.
I will not go in the debate where the term logical is not always applicable depending of the cpu, nor with physical vs socket.

One of the selling argument of telegraf is its stability with its metrics, implicitly, the metric names staying unaltered.
Changing this name will require to update all existing dashboard using it, either directly or in formulaes.
There is no need to open such a can of worm.

Thanks.

ps : would it be possible to implement a warning in the CI against the change of existing metrics name, to remind about the impact ?
At first glance, I'm not seeing a simple way for that, but maybe someone has an idea.

Expected behavior

N/A

Actual behavior

N/A

Additional info

N/A

Metadata

Metadata

Assignees

Labels

bugunexpected problem or unintended behaviorregressionsomething that used to work, but is now broken

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions