Surface resource constraint problems in TaskRun Status #876
tekton-robot merged 1 commit into master from unknown repository
Conversation
func isPodExceedingNodeResources(pod *corev1.Pod) bool {
	for _, podStatus := range pod.Status.Conditions {
		if podStatus.Reason == corev1.PodReasonUnschedulable && strings.Contains(podStatus.Message, "Insufficient") {
The strings.Contains() here feels brittle to me but I wasn't sure how else to narrow down an Unschedulable status to one of insufficient resources.
i agree! seems a bit unfortunate but i guess that happens sometimes with errors :D
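For context, the full helper reads roughly like the sketch below; only the first lines appear in the diff hunk above, so the early `return true` and the fall-through to `false` are assumptions about the rest of the function:

```go
import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isPodExceedingNodeResources reports whether the pod is stuck Unschedulable
// because no node currently has enough resources (CPU, memory, ...) for it.
// Kubernetes only exposes this detail via the condition's free-text Message,
// hence the string match on "Insufficient" discussed above.
func isPodExceedingNodeResources(pod *corev1.Pod) bool {
	for _, podStatus := range pod.Status.Conditions {
		if podStatus.Reason == corev1.PodReasonUnschedulable && strings.Contains(podStatus.Message, "Insufficient") {
			return true
		}
	}
	return false
}
```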
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous. Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.

This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:

1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it.
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.
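As an illustration of point 2 above, emitting such an event from a reconciler could look roughly like this sketch; the function name, recorder wiring, and message text are assumptions rather than the exact code in this commit:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// recordExceededNodeResources attaches a warning event to the TaskRun object,
// which is what makes it visible in `kubectl describe taskrun <name>`.
func recordExceededNodeResources(recorder record.EventRecorder, tr runtime.Object) {
	recorder.Event(tr, corev1.EventTypeWarning, "ExceededNodeResources",
		"Insufficient resources to schedule pod for TaskRun")
}
```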
bobcatfish left a comment
Thanks for this @sbwsg ! Just some rambling thoughts about code organization but I think we should go ahead anyway.
p.s. excellent commit message :D
/lgtm
/approve
/meow space
if tr.Spec.TaskRef != nil {
	msg = fmt.Sprintf("References a Task %s/%s that doesn't exist", tr.Namespace, tr.Spec.TaskRef.Name)
} else {
	msg = fmt.Sprintf("References a TaskSpec with missing information")
this is outside the scope of your change, but in spite of what the previous logic indicated (which I may have added... 😇 ) I think there are a variety of things that could have gone wrong here (if i remember right, even templating problems) - so one option would be to make this super generic, e.g. something like "invalid task"?
} else {
	reason = "Pending"
	msg = getWaitingMessage(pod)
}
it feels like we're starting to build up a lot of logic in this file/package around looking at the pod and from that determining a reason and a msg - I wonder if we could move some of this out into its own package, with tests?
anyway i dont feel strongly enough about this to block merging tho, maybe it's something we can revisit
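To sketch what that refactor might look like (the package and function names below are hypothetical, not part of this PR), the pod-inspection logic could live behind a single function that the reconciler calls:

```go
// Package podconvert (hypothetical name) would own the logic that maps a
// pending pod's conditions onto a TaskRun reason and message.
package podconvert

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// ReasonAndMessage inspects a pending pod and returns the reason/message pair
// the TaskRun status should carry while the pod is waiting to be scheduled.
func ReasonAndMessage(pod *corev1.Pod) (reason, msg string) {
	for _, cond := range pod.Status.Conditions {
		if cond.Reason == corev1.PodReasonUnschedulable && strings.Contains(cond.Message, "Insufficient") {
			return "ExceededNodeResources", "TaskRun pod exceeded available node resources"
		}
	}
	// Placeholder for the existing getWaitingMessage(pod) behaviour.
	return "Pending", "pod is pending"
}
```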
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, sbwsg

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.

Changes
When a TaskRun's Pod is unschedulable due to resource constraints, either on the node or in a namespace with a ResourceQuota, the Status of that TaskRun is left somewhat ambiguous.
Prior to this commit, when resources are limited on a Node the TaskRun will be held in a Succeeded/Unknown state with Reason of "Pending". When resources are limited due to a ResourceQuota the TaskRun will fail with a "CouldntGetTask" reason.
This commit addresses the issue of ambiguous or incorrect TaskRun Status in resource constrained environments by:
1. Marking a TaskRun as Succeeded/Unknown with an ExceededNodeResources reason when a node doesn't have enough resources. Kubernetes will, in this case, attempt to reschedule the pod when space becomes available for it (see the sketch below).
2. Emitting an event to indicate that a TaskRun's pod hit the resource ceiling on a Node. This shows up in the TaskRun's `kubectl describe` output.
3. Marking a TaskRun as Succeeded/False with an ExceededResourceQuota reason when a namespace with ResourceQuota rejects the TR's Pod outright.

This PR is intended to build towards #734 by first clearly indicating when Pods are running into resource constraints. A future PR will attempt to actually tackle the problem of ResourceQuotas flatly failing TaskRuns without any kind of rescheduling.
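Concretely, case 1 amounts to keeping the Succeeded condition at Unknown rather than flipping it to False. A rough sketch of building that condition, assuming Knative's `apis.Condition` type (the import path, function name, and message wording are assumptions, not the exact code in this PR):

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// exceededNodeResourcesCondition keeps the TaskRun "in progress" (Unknown)
// so it can continue once Kubernetes finds room to reschedule its pod.
func exceededNodeResourcesCondition(podName string) *apis.Condition {
	return &apis.Condition{
		Type:    apis.ConditionSucceeded,
		Status:  corev1.ConditionUnknown,
		Reason:  "ExceededNodeResources",
		Message: fmt.Sprintf("TaskRun pod %q exceeded available node resources", podName),
	}
}
```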
Screenshots
Before, TaskRun Status doesn't reflect resource constraint issues for the pods:

After, TaskRun Status reflects problems scheduling pods due to resource constraints:
Here's the ExceededNodeResources event appearing in the TaskRun's `kubectl describe` output:

Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
Question for reviewers: Does this change warrant documentation? It doesn't look like we document Status types in the taskrun.md doc.
See the contribution guide
for more details.
Release Notes