-
Notifications
You must be signed in to change notification settings - Fork 67
Add fast-track to skip uninstall/install if NVIDIA driver modules present #454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add fast-track to skip uninstall/install if NVIDIA driver modules present #454
Conversation
…sent Signed-off-by: Karthik Vetrivel <[email protected]>
…s and fix scenario handling Signed-off-by: Karthik Vetrivel <[email protected]>
0bb2e4c to
b107ac5
Compare
…nge detection Signed-off-by: Karthik Vetrivel <[email protected]>
b107ac5 to
ba7e6de
Compare
|
Thanks @karthikvetrivel, this looks very promising! I was curious as to why |
ubuntu22.04/nvidia-driver
Outdated
| _mount_rootfs | ||
|
|
||
| # Ensure persistence daemon is running | ||
| _ensure_persistence_running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this method can be shortened to this
| _ensure_persistence_running | |
| _ensure_persistenced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Thanks! Yes, I picked one Ubuntu nvidia-driver version to test on and get feedback before porting to others. |
4650182 to
bf0bb88
Compare
Signed-off-by: Karthik Vetrivel <[email protected]>
bf0bb88 to
0a036ed
Compare
cdesiniotis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reviewed the rhel9 directory.
Signed-off-by: Karthik Vetrivel <[email protected]>
0fb8195 to
d4a6dff
Compare
Signed-off-by: Karthik Vetrivel <[email protected]>
f0d0d41 to
b660caa
Compare
rhel9/nvidia-driver
Outdated
|
|
||
| trap "make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build clean > /dev/null" EXIT | ||
| # Skip cleanup trap for DTK builds - modules are copied after this function returns | ||
| if [ "${PACKAGE_TAG:-}" != "builtin" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question - could you elaborate on why this change is needed? What exactly does the clean make target do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding, make clean removes the compiled .ko module files. For DTK builds, the modules need to persist after _create_driver_package() returns because ocp_dtk_entrypoint copies them to a shared volume afterward. If the cleanup trap runs, the modules are deleted before the copy happens.
Feel free to push back on my understanding here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting that this change is needed in general and is not related to the fast-track install code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this isn't related to a fast track optimization. This path isn't hit on a fast track reinstall. I can move these changes to a follow-up PR if preferred!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would prefer if this change (and any other changes not related to the fast-track optimization) were in a separate PR with a clear description of the problem addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
rhel9/nvidia-driver
Outdated
| nvidia-topologyd | ||
| } | ||
|
|
||
| _ensure_persistence() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this being called? I don't see this method being invoked in the rhel9 scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, replaced a prior _start_daemons method.
ubuntu22.04/nvidia-driver
Outdated
| _unmount_rootfs | ||
| _update_package_cache | ||
| _resolve_kernel_version || exit 1 | ||
| _install_prerequisites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- is installing prerequisite packages actually required in this case? If yes, which packages are required for the userspace-only install?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, the prerequisites (kernel headers, kernel image, kernel modules) are specifically for kernel module compilation, which doesn't happen here—we run nvidia-installer --no-kernel-module. Will remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also remove the call to _update_package_cache while we are at it. If we aren't installing any new packages then we don't need to update the package cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Also removed in RHEL9.
rhel9/nvidia-driver
Outdated
| fi | ||
|
|
||
| _mount_rootfs | ||
| _ensure_persistence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we only starting the nvidia-persistenced daemon for this code path? Shouldn't we be restarting all of the daemons that we bring up in _start_daemons()?
My intuition tells me that none of the daemons should still be running at this point -- that is, none of the daemons should still be running after a previously running driver container terminates / restarts. If my intuition is correct, then couldn't we unconditionally call _start_daemons() here? Let me know if you have observed different behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you're right. Whenever we hit _ensure_persistence in this code path, we log persistence not found. I'll start the daemons here (and mirror the changes to Ubuntu 22.04).
Speaking of daemons, but I noticed that we don't ever stop nvidia-topologyd in _unload_driver() but we stop every other daemon. Is there a reason why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of daemons, but I noticed that we don't ever stop nvidia-topologyd in _unload_driver() but we stop every other daemon. Is there a reason why?
No reason AFAIK. We should also stop it in _unload_driver().
ubuntu22.04/nvidia-driver
Outdated
| _unmount_rootfs | ||
| _update_package_cache | ||
| _resolve_kernel_version || exit 1 | ||
| _install_prerequisites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also remove the call to _update_package_cache while we are at it. If we aren't installing any new packages then we don't need to update the package cache.
4d6bb0f to
aae8035
Compare
aae8035 to
883b5a9
Compare
Signed-off-by: Karthik Vetrivel <[email protected]>
883b5a9 to
34e89fc
Compare
This PR is a part of this endeavor:
Relevant PRs: