-
Notifications
You must be signed in to change notification settings - Fork 2k
consul: log unhealthy script check output #10858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Logs unhealthy script check output to the agent's log. Prior to this commit script check output was only available by querying the Consul API and therefore often unavailable when diagnosing issues after they are fixed. This change logs failures along with their (quoted) output: ``` >>> Repro: consul agent -dev nomad agent -dev nomad run https://gist.githubusercontent.com/schmichael/126dcb98df2bb3eb010f74ab254c1b7b/raw/071e5f95fc38d52b1c2f5a99b27bbb9ee6fcfeb2/script.hcl >>> Agent log output: ... 2021-07-06T10:06:01.590-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-8f188b2b53b772955b0c6f1fe912cb249808b7f9 task=redis health=critical output=""This is the output for state 2\n"" 2021-07-06T10:06:01.637-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-51947a330e9b43483fd5eb6839042df08d6580a5 task=redis health=warning output=""This is the output for state 1\n"" ... ```
I don't think we should be logging unbounded output from scripts we don't own into agent logs. Can we at least make this opt-in through agent config? |
Good point. The default max size in Consul is 4kb which is a lot to dump in logs ... and the output could be larger since Nomad doesn't do any truncation! What if we truncated to 200 bytes? If that's insufficient we could make the size configurable with 0 disabling it altogether, but I'd rather wait for the request instead of making another difficult-to-discover configuration parameter. |
@@ -366,6 +370,11 @@ func newScriptCheckCallback(s *scriptCheck) taskletCallback { | |||
outputMsg = string(output) | |||
} | |||
|
|||
// If the check is unhealthy, log the output | |||
if state == api.HealthCritical || state == api.HealthWarning { | |||
s.logger.Warn("unhealthy script check", "health", state, "output", strconv.Quote(outputMsg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the output message lands in Consul anyways, maybe we'd get most of the benefit of this if we logged the script check's error code rather than textual output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about logging just the exist code by default, then opt-into full text output by setting DEBUG or TRACE mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is that the output in Consul is ephemeral. Consul does not log the output, so if you're not observing Consul at the time the failure happens, the output is gone forever.
We could make this Consul's problem I suppose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this Consul's problem I suppose.
Checked with Consul and @mkeeler felt it would be best to do this in Nomad since we're the one running the script. Nomad can also annotate the log line with more metadata than Consul has available (eg Alloc ID). Lacking the alloc id in the log line would really complicate using this to create a timeline of events after an outage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I don't love that the allocation can write arbitrary strings to the client log but I'm sure if we looked hard enough we could find other cases of that.
This would be much more involved, but it also seems like the Event Stream would be a better way to propagate and consume these messages. There isn't any plumbing to make that work with non-raft events though, so it would be a challenge. |
Fixes double quoting of escaped values. Before fix: ```` logger.Warn(..., "output", strconv.Quote(out)) output=""This is the output for state 1\n"" ```` After fix: ``` logger.Warn(..., "output", hclog.Quote(out)) output="This is the output for state 1\n" ```
Should we be concerned about treating script check output as secret? E.g. a script might execute something like |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Logs unhealthy script check output to the agent's log. Prior to this
commit script check output was only available by querying the Consul API
and therefore often unavailable when diagnosing issues after they are
fixed.
This change logs failures along with their (quoted) output: