Skip to content

Conversation

schmichael
Copy link
Member

Logs unhealthy script check output to the agent's log. Prior to this
commit script check output was only available by querying the Consul API
and therefore often unavailable when diagnosing issues after they are
fixed.

This change logs failures along with their (quoted) output:

>>> Repro:
consul agent -dev
nomad agent -dev
nomad run https://gist.githubusercontent.com/schmichael/126dcb98df2bb3eb010f74ab254c1b7b/raw/071e5f95fc38d52b1c2f5a99b27bbb9ee6fcfeb2/script.hcl

>>> Agent log output:
...
2021-07-06T10:06:01.590-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-8f188b2b53b772955b0c6f1fe912cb249808b7f9 task=redis health=critical output=""This is the output for state 2\n""
2021-07-06T10:06:01.637-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-51947a330e9b43483fd5eb6839042df08d6580a5 task=redis health=warning output=""This is the output for state 1\n""
...

Logs unhealthy script check output to the agent's log. Prior to this
commit script check output was only available by querying the Consul API
and therefore often unavailable when diagnosing issues after they are
fixed.

This change logs failures along with their (quoted) output:

```
>>> Repro:
consul agent -dev
nomad agent -dev
nomad run https://gist.githubusercontent.com/schmichael/126dcb98df2bb3eb010f74ab254c1b7b/raw/071e5f95fc38d52b1c2f5a99b27bbb9ee6fcfeb2/script.hcl

>>> Agent log output:
...
2021-07-06T10:06:01.590-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-8f188b2b53b772955b0c6f1fe912cb249808b7f9 task=redis health=critical output=""This is the output for state 2\n""
2021-07-06T10:06:01.637-0700 [WARN] client.alloc_runner.task_runner.task_hook.script_checks: unhealthy script check: alloc_id=38f31314-6782-55af-053a-50b52d5e11c6 check_id=_nomad-check-51947a330e9b43483fd5eb6839042df08d6580a5 task=redis health=warning output=""This is the output for state 1\n""
...
```
@schmichael schmichael requested a review from tgross July 6, 2021 17:15
@schmichael schmichael marked this pull request as draft July 6, 2021 17:18
@shoenig
Copy link
Contributor

shoenig commented Jul 6, 2021

I don't think we should be logging unbounded output from scripts we don't own into agent logs. Can we at least make this opt-in through agent config?

@schmichael
Copy link
Member Author

I don't think we should be logging unbounded output from scripts we don't own into agent logs. Can we at least make this opt-in through agent config?

Good point. The default max size in Consul is 4kb which is a lot to dump in logs ... and the output could be larger since Nomad doesn't do any truncation!

What if we truncated to 200 bytes?

If that's insufficient we could make the size configurable with 0 disabling it altogether, but I'd rather wait for the request instead of making another difficult-to-discover configuration parameter.

@@ -366,6 +370,11 @@ func newScriptCheckCallback(s *scriptCheck) taskletCallback {
outputMsg = string(output)
}

// If the check is unhealthy, log the output
if state == api.HealthCritical || state == api.HealthWarning {
s.logger.Warn("unhealthy script check", "health", state, "output", strconv.Quote(outputMsg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the output message lands in Consul anyways, maybe we'd get most of the benefit of this if we logged the script check's error code rather than textual output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about logging just the exist code by default, then opt-into full text output by setting DEBUG or TRACE mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that the output in Consul is ephemeral. Consul does not log the output, so if you're not observing Consul at the time the failure happens, the output is gone forever.

We could make this Consul's problem I suppose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this Consul's problem I suppose.

Checked with Consul and @mkeeler felt it would be best to do this in Nomad since we're the one running the script. Nomad can also annotate the log line with more metadata than Consul has available (eg Alloc ID). Lacking the alloc id in the log line would really complicate using this to create a timeline of events after an outage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I don't love that the allocation can write arbitrary strings to the client log but I'm sure if we looked hard enough we could find other cases of that.

@shoenig
Copy link
Contributor

shoenig commented Jul 6, 2021

This would be much more involved, but it also seems like the Event Stream would be a better way to propagate and consume these messages. There isn't any plumbing to make that work with non-raft events though, so it would be a challenge.

Fixes double quoting of escaped values.

Before fix:

````
logger.Warn(..., "output", strconv.Quote(out))

output=""This is the output for state 1\n""
````

After fix:

```
logger.Warn(..., "output", hclog.Quote(out))

output="This is the output for state 1\n"
```
@shoenig
Copy link
Contributor

shoenig commented Jul 7, 2021

Should we be concerned about treating script check output as secret? E.g. a script might execute something like curl -v, dumping headers into output. IIUC that output has been only accessible by reading the check status from Consul which can be protected with an ACL.

@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@schmichael schmichael closed this Jan 25, 2023
@schmichael schmichael deleted the f-script-logging branch January 25, 2023 01:13
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants