consul: log unhealthy script check output #10858

schmichael · 2021-07-06T17:15:40Z

Logs unhealthy script check output to the agent's log. Prior to this
commit script check output was only available by querying the Consul API
and therefore often unavailable when diagnosing issues after they are
fixed.

This change logs failures along with their (quoted) output:

```
>>> Repro:
consul agent -dev
nomad agent -dev
nomad run https://gist.githubusercontent.com/schmichael/126dcb98df2bb3eb010f74ab254c1b7b/raw/071e5f95fc38d52b1c2f5a99b27bbb9ee6fcfeb2/script.hcl

>>> Agent log output:
...
2021-07-06T10:06:01.590-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-8f188b2b53b772955b0c6f1fe912cb249808b7f9 task=redis health=critical output=""This is the output for state 2\n""
2021-07-06T10:06:01.637-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-51947a330e9b43483fd5eb6839042df08d6580a5 task=redis health=warning output=""This is the output for state 1\n""
...
```

shoenig · 2021-07-06T17:20:10Z

I don't think we should be logging unbounded output from scripts we don't own into agent logs. Can we at least make this opt-in through agent config?

schmichael · 2021-07-06T17:26:25Z

> I don't think we should be logging unbounded output from scripts we don't own into agent logs. Can we at least make this opt-in through agent config?

Good point. The default max size in Consul is 4kb which is a lot to dump in logs ... and the output could be larger since Nomad doesn't do any truncation!

What if we truncated to 200 bytes?

If that's insufficient we could make the size configurable with 0 disabling it altogether, but I'd rather wait for the request instead of making another difficult-to-discover configuration parameter.
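For illustration, the truncation idea above could be sketched roughly as follows. This is not the code from this PR: `truncateOutput` and `maxLoggedOutput` are hypothetical names, and the rune-boundary handling is an added assumption so that a multi-byte UTF-8 character is not split at the cut point. A limit of 0 is treated as "do not log output", matching the configurable variant described above.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// maxLoggedOutput is a hypothetical constant for the 200 byte cap
// proposed in the discussion.
const maxLoggedOutput = 200

// truncateOutput returns at most max bytes of s, backing up to a rune
// boundary so a multi-byte UTF-8 character is never split.
// A max of 0 (or less) returns "" and false, meaning "do not log output".
func truncateOutput(s string, max int) (string, bool) {
	if max <= 0 {
		return "", false
	}
	if len(s) <= max {
		return s, true
	}
	cut := max
	// s[cut] is the first byte that would be dropped; if it is a UTF-8
	// continuation byte, move the cut left until it starts a rune.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut], true
}

func main() {
	out, ok := truncateOutput("This is the output for state 2\n", maxLoggedOutput)
	if ok {
		// Quote the untrusted output so newlines and control
		// characters cannot break the log line.
		fmt.Println(strconv.Quote(out))
	}
}
```

Truncating before quoting keeps the cap on the script's raw bytes; the quoted form may be somewhat longer once escapes are added.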

tgross · 2021-07-06T17:28:58Z

client/allocrunner/taskrunner/script_check_hook.go

@@ -366,6 +370,11 @@ func newScriptCheckCallback(s *scriptCheck) taskletCallback {
 			outputMsg = string(output)
 		}

+		// If the check is unhealthy, log the output
+		if state == api.HealthCritical || state == api.HealthWarning {
+			s.logger.Warn("unhealthy script check", "health", state, "output", strconv.Quote(outputMsg))


If the output message lands in Consul anyways, maybe we'd get most of the benefit of this if we logged the script check's error code rather than textual output?

How about logging just the exit code by default, then opting into full text output by setting DEBUG or TRACE mode?
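A minimal sketch of that suggestion, under stated assumptions: `checkLogFields` is a hypothetical helper, and the `verbose` flag stands in for the agent running at DEBUG or TRACE level. It only builds the key/value pairs that would be passed to a structured logger; it does not reflect how this PR was actually implemented.

```go
package main

import "fmt"

// checkLogFields builds the key/value pairs for an unhealthy-check log
// line. The exit code is always included; the script's raw output is
// appended only when verbose is true (hypothetically, DEBUG or TRACE).
func checkLogFields(health string, exitCode int, output string, verbose bool) []interface{} {
	fields := []interface{}{"health", health, "exit_code", exitCode}
	if verbose {
		fields = append(fields, "output", output)
	}
	return fields
}

func main() {
	// Default: only the health state and exit code are reported.
	fmt.Println(checkLogFields("critical", 2, "This is the output for state 2\n", false)...)
}
```

The returned slice has the alternating key/value shape that variadic structured loggers accept, so the same helper could feed a `Warn` call either way.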

I think the problem is that the output in Consul is ephemeral. Consul does not log the output, so if you're not observing Consul at the time the failure happens, the output is gone forever.

We could make this Consul's problem I suppose.

> We could make this Consul's problem I suppose.

Checked with Consul and @mkeeler felt it would be best to do this in Nomad since we're the one running the script. Nomad can also annotate the log line with more metadata than Consul has available (eg Alloc ID). Lacking the alloc id in the log line would really complicate using this to create a timeline of events after an outage.

That makes sense. I don't love that the allocation can write arbitrary strings to the client log but I'm sure if we looked hard enough we could find other cases of that.

shoenig · 2021-07-06T17:32:51Z

This would be much more involved, but it also seems like the Event Stream would be a better way to propagate and consume these messages. There isn't any plumbing to make that work with non-raft events though, so it would be a challenge.

Fixes double quoting of escaped values.

Before fix:

```
logger.Warn(..., "output", strconv.Quote(out))
output=""This is the output for state 1\n""
```

After fix:

```
logger.Warn(..., "output", hclog.Quote(out))
output="This is the output for state 1\n"
```

shoenig · 2021-07-07T18:48:56Z

Should we be concerned about treating script check output as secret? E.g. a script might execute something like curl -v, dumping headers into output. IIUC that output has been only accessible by reading the check status from Consul which can be protected with an ACL.

hashicorp-cla · 2022-03-12T17:07:41Z

All committers have signed the CLA.

github-actions · 2025-03-30T02:23:49Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

schmichael requested a review from tgross July 6, 2021 17:15

schmichael marked this pull request as draft July 6, 2021 17:18

tgross reviewed Jul 6, 2021

schmichael mentioned this pull request Jul 6, 2021

add Quote type to enable safe concise output of untrusted strings hashicorp/go-hclog#96

Merged

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2021 18:40 View deployment

vercel bot temporarily deployed to Preview – nomad July 7, 2021 18:40 Inactive

consul: truncate output strings to 200 characters

cffb4c3

vercel bot temporarily deployed to Preview – nomad July 7, 2021 19:11 Inactive

vercel bot deployed to Preview – nomad-storybook-and-ui July 7, 2021 19:11 View deployment

schmichael closed this Jan 25, 2023

schmichael deleted the f-script-logging branch January 25, 2023 01:13

github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2025
