Skip to content

Conversation

@jglick
Copy link
Member

@jglick jglick commented Sep 24, 2025

SlaveComputer.kill was being called during shutdown, but only the setNumExecutors part seems desirable and not redundant. History:

  • 19860fa added Computer.kill during cleanup. Among other things, this called closeChannel (synchronously), something we want to do to make sure the Remoting channel is closed gracefully and not just via socket close when the process exits.
  • ab5d148 added disconnect which also closes the channel (asynchronously).
  • 8d771bc introduced the pending list and made sure disconnections completed within 10s before shutting down.

But Computer.kill was doing a lot of other things too: setNumExecutors, presumably to block any late queue scheduling, fine; but also trying to close log file handles (they would be closed anyway when we exit) and then deleting the whole log directory (why?!). kill was originally designed to clean up everything about a Computer when a Node was removed, but this seems overkill (excuse the pun) for shutdown.

I also noticed that the 10s timeout was applied to disconnection but to each agent, rather than cumulatively, which could have allowed shutdown to drag on for a long time. Compare ForkJoinPool.invokeAll with timeout.

Testing done

In CloudBees CI scalability testing, when hundreds of agents are connected, just this overhead part (not, mind you, the actual disconnections) could take as much as 20s; typical thread dump excerpt:

"Thread-18" #56 […] prio=5 os_prio=0 cpu=1579.97ms elapsed=37.05s tid=0x… nid=… runnable  [0x…]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileDescriptor.close0([email protected]/Native Method)
	at java.io.FileDescriptor.close([email protected]/FileDescriptor.java:304)
	- locked <0x…> (a java.io.FileDescriptor)
	at java.io.FileDescriptor$1.close([email protected]/FileDescriptor.java:89)
	at sun.nio.ch.FileChannelImpl$Closer.run([email protected]/FileChannelImpl.java:116)
	at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup([email protected]/CleanerImpl.java:178)
	at jdk.internal.ref.PhantomCleanable.clean([email protected]/PhantomCleanable.java:133)
	at sun.nio.ch.FileChannelImpl.implCloseChannel([email protected]/FileChannelImpl.java:210)
	at java.nio.channels.spi.AbstractInterruptibleChannel.close([email protected]/AbstractInterruptibleChannel.java:113)
	- locked <0x…> (a java.lang.Object)
	at sun.nio.ch.ChannelOutputStream.close([email protected]/ChannelOutputStream.java:111)
	at hudson.util.io.RewindableFileOutputStream.closeCurrent(RewindableFileOutputStream.java:105)
	at hudson.util.io.RewindableFileOutputStream.close(RewindableFileOutputStream.java:92)
	- locked <0x…> (a hudson.util.io.RewindableRotatingFileOutputStream)
	at hudson.slaves.SlaveComputer.kill(SlaveComputer.java:920)
	at hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:98)
	at jenkins.model.Jenkins.lambda$_cleanUpDisconnectComputers$17(Jenkins.java:3778)
	at jenkins.model.Jenkins$$Lambda/0x0000782955c9d858.run(Unknown Source)
	at hudson.model.Queue._withLock(Queue.java:1412)
	at hudson.model.Queue.withLock(Queue.java:1286)
	at jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3774)
	at jenkins.model.Jenkins.cleanUp(Jenkins.java:3655)
	at hudson.WebAppMain._contextDestroyed(WebAppMain.java:430)
	at hudson.WebAppMain.contextDestroyed(WebAppMain.java:422)

With this patch, that overhead is reduced 50–100× to a fraction of a second. This is important because we typically accept the default Kubernetes terminationGracePeriod of 30s, so if Jenkins is doing a ton of processing during shutdown, it risks being SIGKILLed and not handling final stages of termination. (jenkinsci/workflow-cps-plugin#1088 helps speed up termination as well.)

Proposed changelog entries

  • Streamlined logic to disconnect agents during controller shutdown.

Proposed changelog category

/label bug

Proposed upgrade guidelines

N/A

Maintainer checklist

  • There are at least two (2) approvals for the pull request and no outstanding requests for change.
  • Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
  • Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
  • Proper changelog labels are set so that the changelog can be generated automatically.
  • If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
  • If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

@comment-ops-bot comment-ops-bot bot added the bug For changelog: Minor bug. Will be listed after features label Sep 24, 2025
ViewJob.interruptReloadThread();
}

protected void killComputer(Computer c) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was only needed for access from other packages.

*/
@Restricted(NoExternalUse.class)
@GuardedBy("hudson.model.Queue.lock")
/*package*/ void inflictMortalWound() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cute name, but unnecessary given NoExternalUse.

(Also why was it marked @Restricted when it was not public? This ought to be a compiler error.)

@jglick
Copy link
Member Author

jglick commented Sep 24, 2025

Windows test failures perhaps mean that we do want to close log files during JenkinsRule on Windows at least.

I did not manage to reproduce them in either a Windows 10 or a Windows Server 2025 VM.

@NotMyFault
Copy link
Member

/label ready-for-merge


This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
Please see the merge process documentation for more information about the merge process.
Thanks!

@comment-ops-bot comment-ops-bot bot added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Sep 28, 2025
Copy link

@A1exKH A1exKH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@krisstern krisstern merged commit a68faf9 into jenkinsci:master Sep 30, 2025
18 checks passed
@jglick jglick deleted the _cleanUpDisconnectComputers branch September 30, 2025 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants