Speed up Jenkins._cleanUpDisconnectComputers
#11102
        ViewJob.interruptReloadThread();
    }

    protected void killComputer(Computer c) {
Was only needed for access from other packages.
     */
    @Restricted(NoExternalUse.class)
    @GuardedBy("hudson.model.Queue.lock")
    /*package*/ void inflictMortalWound() {
Cute name, but unnecessary given NoExternalUse.
(Also why was it marked @Restricted when it was not public? This ought to be a compiler error.)
Windows test failures perhaps mean that we do want to close log files during shutdown. I did not manage to reproduce them in either a Windows 10 or a Windows Server 2025 VM.
/label ready-for-merge

This PR is now ready for merge. We will merge it after ~24 hours if there is no negative feedback.
A1exKH
left a comment
LGTM.
SlaveComputer.kill was being called during shutdown, but only the setNumExecutors part seems desirable and not redundant. History:

- Computer.kill was called during cleanup. Among other things, this called closeChannel (synchronously), something we want to do to make sure the Remoting channel is closed gracefully and not just via socket close when the process exits.
- A call to disconnect was added, which also closes the channel (asynchronously).
- A pending list was added to make sure disconnections completed within 10s before shutting down.

But Computer.kill was doing a lot of other things too: setNumExecutors, presumably to block any late queue scheduling, fine; but also trying to close log file handles (they would be closed anyway when we exit) and then deleting the whole log directory (why?!). kill was originally designed to clean up everything about a Computer when a Node was removed, but this seems overkill (excuse the pun) for shutdown.

I also noticed that the 10s timeout on disconnection was applied to each agent separately, rather than cumulatively, which could have allowed shutdown to drag on for a long time. Compare ForkJoinPool.invokeAll with timeout.

Testing done
In CloudBees CI scalability testing, when hundreds of agents are connected, just this overhead part (not, mind you, the actual disconnections) could take as much as 20s; typical thread dump excerpt:
With this patch, that overhead is reduced 50–100× to a fraction of a second. This is important because we typically accept the default Kubernetes terminationGracePeriod of 30s, so if Jenkins is doing a ton of processing during shutdown, it risks being SIGKILLed and not handling final stages of termination. (jenkinsci/workflow-cps-plugin#1088 helps speed up termination as well.)

Proposed changelog entries
Proposed changelog category
/label bug
Proposed upgrade guidelines
N/A
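
For readers unfamiliar with the timeout point in the description (a limit applied to each agent versus one limit shared by all of them), here is a minimal sketch of the cumulative variant using the standard ExecutorService.invokeAll overload that takes a timeout. It is not the code from this patch: the class name ShutdownTimeoutSketch and the helpers slowDisconnect and awaitCumulatively are hypothetical, and the "disconnect" is simulated with a sleep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ShutdownTimeoutSketch {

    /** Hypothetical stand-in for disconnecting one agent; it only sleeps. */
    public static Callable<Void> slowDisconnect(long millis) {
        return () -> {
            Thread.sleep(millis);
            return null;
        };
    }

    /**
     * Runs all tasks under ONE shared deadline and returns how many finished
     * before it expired (that is, how many were not cancelled by invokeAll).
     */
    public static int awaitCumulatively(List<Callable<Void>> tasks, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // invokeAll returns once every task is done or the timeout expires,
            // whichever comes first; unfinished tasks are cancelled.
            List<Future<Void>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            int finished = 0;
            for (Future<Void> f : futures) {
                if (!f.isCancelled()) {
                    finished++;
                }
            }
            return finished;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Void>> tasks = new ArrayList<>();
        tasks.add(slowDisconnect(10));     // a quick agent
        tasks.add(slowDisconnect(5_000));  // two slow agents
        tasks.add(slowDisconnect(5_000));
        long start = System.nanoTime();
        int finished = awaitCumulatively(tasks, 500);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(finished + " of " + tasks.size()
                + " finished after " + elapsedMs + " ms");
    }
}
```

By contrast, calling get(timeout) on each future in turn gives every agent its own budget, so the total wait can grow with the number of slow agents instead of being capped by a single deadline.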