Log thread dump when health check exceeds 10s #11266

Merged: timja merged 1 commit into jenkinsci:master from gbhat618:log-thread-dump-when-health-exceeds-10s on Nov 10, 2025

@gbhat618 commented Nov 4, 2025

A stuck /health request causes Kubernetes liveness probes to fail, which leads to the pod being deleted before the root cause can be diagnosed. This PR logs a thread dump when the health check takes longer than a threshold timeout (10s by default), so that the diagnostic information is captured before the pod goes away.
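
For illustration, the idea can be sketched as a java.util.Timer watchdog around the check. This is a minimal standalone sketch, not the code in this PR: the class HealthWatchdogSketch, its method names, and the use of ThreadMXBean.dumpAllThreads to build the dump are assumptions made for the example. Only the Timer/TimerTask pairing and the log message text are taken from the diff and logs in this description.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.time.Duration;
import java.util.Arrays;
import java.util.Timer;
import java.util.TimerTask;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Jenkins.
final class HealthWatchdogSketch {
    private static final Logger LOGGER = Logger.getLogger(HealthWatchdogSketch.class.getName());

    /** Runs the check; if it is still running once the threshold elapses, logs a dump of all threads. */
    static <T> T runWithWatchdog(Supplier<T> check, Duration threshold) {
        Timer timer = new Timer("health-watchdog", true); // daemon thread, so it never blocks JVM exit
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                LOGGER.log(Level.SEVERE, "health check did not complete in timely fashion:\n" + threadDump());
            }
        }, threshold.toMillis());
        try {
            return check.get();
        } finally {
            timer.cancel(); // a check that finishes in time never triggers the dump
        }
    }

    /** Assumption: builds the dump via the platform ThreadMXBean (ThreadInfo.toString truncates deep stacks). */
    static String threadDump() {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        return Arrays.stream(infos).map(ThreadInfo::toString).collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Simulate a slow check: 300 ms of work against a 100 ms threshold.
        String status = runWithWatchdog(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "done";
        }, Duration.ofMillis(100));
        System.out.println(status);
    }
}
```

The watchdog only logs. It does not interrupt or fail the check, so the request thread is still in its stuck state when the dump is taken.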

The timeout can be configured via the system property: -Djenkins.health.HealthCheckAction.thresholdTimeout=PT30S (example for 30 seconds)
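
The example value PT30S is in ISO-8601 duration format, which java.time.Duration.parse accepts. The following is a minimal sketch of how such a property could be resolved. The helper class ThresholdTimeoutSketch is hypothetical, and falling back to the 10s default on a missing or malformed value is an assumption of this sketch; this description does not say how Jenkins handles malformed values.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

// Hypothetical illustration class; not part of Jenkins.
final class ThresholdTimeoutSketch {
    static final String PROPERTY = "jenkins.health.HealthCheckAction.thresholdTimeout";
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(10);

    /** Parses an ISO-8601 duration such as "PT30S"; assumption: bad or missing input yields the default. */
    static Duration resolve(String raw) {
        if (raw == null || raw.isBlank()) {
            return DEFAULT_TIMEOUT;
        }
        try {
            return Duration.parse(raw.trim());
        } catch (DateTimeParseException e) {
            return DEFAULT_TIMEOUT;
        }
    }

    public static void main(String[] args) {
        // Run with -Djenkins.health.HealthCheckAction.thresholdTimeout=PT30S to see the override take effect.
        System.out.println(resolve(System.getProperty(PROPERTY)).toMillis() + " ms");
    }
}
```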

Testing done

Tested by patching HealthCheckAction with TimeUnit.SECONDS.sleep(xx). Example patch with an 11s sleep:
diff --git a/core/src/main/java/jenkins/health/HealthCheckAction.java b/core/src/main/java/jenkins/health/HealthCheckAction.java
index 0b79dd65dd..8a02a923fd 100644
--- a/core/src/main/java/jenkins/health/HealthCheckAction.java
+++ b/core/src/main/java/jenkins/health/HealthCheckAction.java
@@ -34,6 +34,7 @@ import java.io.IOException;
 import java.time.Duration;
 import java.util.Timer;
 import java.util.TimerTask;
+import java.util.concurrent.TimeUnit;
 import java.util.logging.Level;
 import java.util.logging.Logger;
 import java.util.stream.Collectors;
@@ -84,6 +85,14 @@ public final class HealthCheckAction extends InvisibleAction implements Unprotec
             }
         }, THRESHOLD_TIMEOUT.toMillis());

+        LOGGER.info("sleeping to simulate health check delay");
+        try {
+            TimeUnit.SECONDS.sleep(11);
+        } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+        }
+        LOGGER.info("finished sleeping");
+
         try {
             for (var healthCheck : ExtensionList.lookup(HealthCheck.class)) {
                 var check = healthCheck.check();

Tested by running java -jar war/target/jenkins.war, then curl -vvv -L http://localhost:8080/health

  • With the default timeout and the 11s sleep patch, the thread dump is printed:
    2025-11-04 16:42:43.844+0000 [id=17]	INFO	jenkins.health.HealthCheckAction#doIndex: sleeping to simulate health check delay
    2025-11-04 16:42:53.941+0000 [id=95]	SEVERE	j.health.HealthCheckAction$1#run: health check did not complete in timely fashion:
    [LF]>
    [LF]> "Common-Cleaner" Id=13 Group=InnocuousThreadGroup TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@9f7e518
    ...
    [LF]> "Handling GET /health/ from [0:0:0:0:0:0:0:1] : Jetty (winstone)-17" Id=17 Group=main TIMED_WAITING
    [LF]> 	at java.base@…/java.lang.Thread.sleep(Native Method)
    [LF]> 	at java.base@…/java.lang.Thread.sleep(Thread.java:344)
    [LF]> 	at java.base@…/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446)
    [LF]> 	at jenkins.health.HealthCheckAction.doIndex(HealthCheckAction.java:90)
    ...
    [LF]> "Signal Dispatcher" Id=4 Group=system RUNNABLE
    2025-11-04 16:42:54.851+0000 [id=17]	INFO	jenkins.health.HealthCheckAction#doIndex: finished sleeping
    
  • With the same 11s sleep patch but with -Djenkins.health.HealthCheckAction.thresholdTimeout=PT30S, no thread dump is printed:
    2025-11-04 16:45:56.522+0000 [id=18]	INFO	jenkins.health.HealthCheckAction#doIndex: sleeping to simulate health check delay
    2025-11-04 16:46:07.528+0000 [id=18]	INFO	jenkins.health.HealthCheckAction#doIndex: finished sleeping
    
  • With the sleep patched to 31s and -Djenkins.health.HealthCheckAction.thresholdTimeout=PT30S, the thread dump is printed:
    2025-11-04 16:48:58.284+0000 [id=16]	INFO	jenkins.health.HealthCheckAction#doIndex: sleeping to simulate health check delay
    2025-11-04 16:49:28.390+0000 [id=78]	SEVERE	j.health.HealthCheckAction$1#run: health check did not complete in timely fashion:
    [LF]>
    [LF]> "Common-Cleaner" Id=13 Group=InnocuousThreadGroup TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@3f4d5cb5
    ...
    [LF]> "Handling GET /health/ from [0:0:0:0:0:0:0:1] : Jetty (winstone)-16" Id=16 Group=main TIMED_WAITING
    [LF]> 	at java.base@…/java.lang.Thread.sleep(Native Method)
    [LF]> 	at java.base@…/java.lang.Thread.sleep(Thread.java:344)
    [LF]> 	at java.base@…/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446)
    [LF]> 	at jenkins.health.HealthCheckAction.doIndex(HealthCheckAction.java:90)
    ...
    [LF]> "Signal Dispatcher" Id=4 Group=system RUNNABLE
    2025-11-04 16:49:29.290+0000 [id=16]	INFO	jenkins.health.HealthCheckAction#doIndex: finished sleeping
    

Proposed changelog entries

  • Log a thread dump when a /health check exceeds the 10-second default timeout to help diagnose stuck requests. This timeout is configurable via the jenkins.health.HealthCheckAction.thresholdTimeout system property.

Proposed changelog category

/label rfe

Proposed upgrade guidelines

N/A

Submitter checklist

  • The Jira issue, if it exists, is well-described.
  • The changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developers, depending on the change) and are in the imperative mood (see examples). Fill in the Proposed upgrade guidelines section only if there are breaking changes or changes that may require extra steps from users during upgrade.
  • There is automated testing or an explanation as to why this change has no tests.
  • New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadocs, as appropriate.
  • New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO"), if applicable.
  • New or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content Security Policy (CSP) directives (see documentation).
  • For dependency updates, there are links to external changelogs and, if possible, full differentials.
  • For new APIs and extension points, there is a link to at least one consumer.

Desired reviewers

@jglick, @Vlatombe

Before the changes are marked as ready-for-merge:

Maintainer checklist

  • There are at least two (2) approvals for the pull request and no outstanding requests for change.
  • Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
  • Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
  • Proper changelog labels are set so that the changelog can be generated automatically.
  • If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@comment-ops-bot added the rfe label (For changelog: Minor enhancement. use `major-rfe` for changes to be highlighted) Nov 4, 2025

@MarkEWaite left a comment

This PR is now ready for merge. We will merge it after approximately 24 hours if there is no negative feedback.

/label ready-for-merge

@comment-ops-bot added the ready-for-merge label (The PR is ready to go, and it will be merged soon if there is no negative feedback) Nov 8, 2025
@timja merged commit b251733 into jenkinsci:master Nov 10, 2025
19 checks passed
@gbhat618 deleted the log-thread-dump-when-health-exceeds-10s branch November 10, 2025 12:45
@MarkEWaite

  • This timeout is configurable via the jenkins.health.HealthCheckAction.thresholdTimeout system property.

@gbhat618 could you add documentation to the Jenkins system properties documentation page for this new property?

@gbhat618

could you add documentation ..

OK, good to know. Trying it in jenkins-infra/jenkins.io#8537
