Conversation

@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is a part of this endeavor:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

…s and fix scenario handling

Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel karthikvetrivel marked this pull request as ready for review October 17, 2025 15:13
@karthikvetrivel karthikvetrivel marked this pull request as draft November 6, 2025 14:18
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from 0bb2e4c to b107ac5 Compare November 7, 2025 14:00
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from b107ac5 to ba7e6de Compare November 7, 2025 20:35
@tariq1890
Contributor

Thanks @karthikvetrivel, this looks very promising! I was curious as to why Ubuntu 24.04 wasn't included in the changeset.

_mount_rootfs

# Ensure persistence daemon is running
_ensure_persistence_running

Contributor

The name of this method can be shortened:

Suggested change:
- _ensure_persistence_running
+ _ensure_persistenced

Member Author

Updated.

@karthikvetrivel
Member Author

> Thanks @karthikvetrivel, this looks very promising! I was curious as to why Ubuntu 24.04 wasn't included in the changeset.

Thanks! Yes, I picked one Ubuntu nvidia-driver version to test on and get feedback before porting to others.

@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from 4650182 to bf0bb88 Compare November 20, 2025 16:48
Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from bf0bb88 to 0a036ed Compare November 20, 2025 17:46
Contributor

@cdesiniotis cdesiniotis left a comment

Just reviewed the rhel9 directory.

@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from 0fb8195 to d4a6dff Compare December 1, 2025 15:20
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from f0d0d41 to b660caa Compare December 8, 2025 19:55
@karthikvetrivel karthikvetrivel marked this pull request as ready for review December 8, 2025 20:58

trap "make -s -j ${MAX_THREADS} SYSSRC=/lib/modules/${KERNEL_VERSION}/build clean > /dev/null" EXIT
# Skip cleanup trap for DTK builds - modules are copied after this function returns
if [ "${PACKAGE_TAG:-}" != "builtin" ]; then

Contributor

Question - could you elaborate on why this change is needed? What exactly does the clean make target do?

Member Author

@karthikvetrivel karthikvetrivel Dec 11, 2025

To my understanding, make clean removes the compiled .ko module files. For DTK builds, the modules need to persist after _create_driver_package() returns because ocp_dtk_entrypoint copies them to a shared volume afterward. If the cleanup trap runs, the modules are deleted before the copy happens.

Feel free to push back on my understanding here.
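
For readers unfamiliar with the mechanism described above, here is a minimal sketch of a conditional EXIT trap. It is not the driver container's code: `build_modules` and the `module.ko` file are hypothetical stand-ins, the subshell stands in for the process whose exit fires the trap, and only the `PACKAGE_TAG` check is taken from the diff.

```shell
# Sketch only: build_modules and module.ko are hypothetical stand-ins for the
# real build step. The PACKAGE_TAG check mirrors the diff under review.
build_modules() {
    workdir=$1
    (
        cd "$workdir" || exit 1
        # Register cleanup unless this is a DTK ("builtin") build, where the
        # caller still needs the build artifacts after this subshell exits.
        if [ "${PACKAGE_TAG:-}" != "builtin" ]; then
            trap 'rm -f module.ko' EXIT
        fi
        touch module.ko   # stand-in for the compiled kernel module
    )
}
```

With `PACKAGE_TAG=builtin` the artifact should still be present once the subshell has exited; with any other value the trap removes it on exit, which is the behavior the comment above describes for `make clean`.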

Contributor

Are you suggesting that this change is needed in general and is not related to the fast-track install code path?

Member Author

Yes, this isn't related to the fast-track optimization; this path isn't hit on a fast-track reinstall. I can move these changes to a follow-up PR if preferred!

Contributor

Yes, I would prefer if this change (and any other changes not related to the fast-track optimization) were in a separate PR with a clear description of the problem addressed.

Member Author

Removed.

nvidia-topologyd
}

_ensure_persistence() {

Contributor

Where is this being called? I don't see this method being invoked in the rhel9 scripts.

Member Author

@karthikvetrivel karthikvetrivel Dec 11, 2025

Updated, replaced a prior _start_daemons method.

_unmount_rootfs
_update_package_cache
_resolve_kernel_version || exit 1
_install_prerequisites

Contributor

Question -- is installing prerequisite packages actually required in this case? If yes, which packages are required for the userspace-only install?

Member Author

You're right, the prerequisites (kernel headers, kernel image, kernel modules) are specifically for kernel module compilation, which doesn't happen here because we run nvidia-installer --no-kernel-module. Will remove.

Contributor

Let's also remove the call to _update_package_cache while we are at it. If we aren't installing any new packages then we don't need to update the package cache.

Member Author

Updated. Also removed in RHEL9.
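
To illustrate the outcome of this thread, here is a rough sketch of a userspace-only install step with no package-cache refresh and no prerequisite installation. The names `run`, `DRY_RUN`, and `_userspace_only_install` are hypothetical and exist only for this example; the one detail taken from the thread is the `nvidia-installer --no-kernel-module` invocation, and the real script likely passes further options not shown here.

```shell
# Sketch only: run, DRY_RUN and _userspace_only_install are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

_userspace_only_install() {
    # No _update_package_cache or _install_prerequisites call here: kernel
    # headers and related packages are only needed to compile kernel modules,
    # and this path skips module compilation entirely.
    run nvidia-installer --no-kernel-module
}
```

The dry-run default keeps the sketch safe to try on a machine without the installer present.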

fi

_mount_rootfs
_ensure_persistence

Contributor

Why are we only starting the nvidia-persistenced daemon for this code path? Shouldn't we be restarting all of the daemons that we bring up in _start_daemons()?

My intuition tells me that none of the daemons should still be running at this point -- that is, none of the daemons should still be running after a previously running driver container terminates / restarts. If my intuition is correct, then couldn't we unconditionally call _start_daemons() here? Let me know if you have observed different behavior.

Member Author

I think you're right. Whenever we hit _ensure_persistence in this code path, we log persistence not found. I'll start the daemons here (and mirror the changes to Ubuntu 22.04).

Speaking of daemons, I noticed that we never stop nvidia-topologyd in _unload_driver(), but we stop every other daemon. Is there a reason why?

Contributor

> Speaking of daemons, I noticed that we never stop nvidia-topologyd in _unload_driver(), but we stop every other daemon. Is there a reason why?

No reason AFAIK. We should also stop it in _unload_driver().
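
As a rough illustration of treating every daemon uniformly on unload, the sketch below walks one list and stops whatever is still running. It is not the actual `_unload_driver()` code: the `RUN_DIR/<name>/<name>.pid` layout, the `DAEMONS` list, and the helper names `_daemon_pid` and `_stop_daemons` are all assumptions made for this example. Only the two daemon names come from the thread.

```shell
# Sketch only: the pid-file layout and helper names are assumptions.
RUN_DIR=${RUN_DIR:-/var/run}
DAEMONS="nvidia-persistenced nvidia-topologyd"

# Print the PID of a daemon if its pid file exists and the process is alive.
_daemon_pid() {
    pid_file="$RUN_DIR/$1/$1.pid"
    [ -f "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    kill -0 "$pid" 2>/dev/null || return 1
    echo "$pid"
}

# Send SIGTERM to every daemon in the list that is still running.
_stop_daemons() {
    for name in $DAEMONS; do
        if pid=$(_daemon_pid "$name"); then
            echo "stopping $name (pid $pid)"
            kill -TERM "$pid"
        else
            echo "$name is not running"
        fi
    done
}
```

Keeping the daemon names in one list means a newly added daemon is stopped on unload without a separate code change, which is the gap noted above for nvidia-topologyd.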

_unmount_rootfs
_update_package_cache
_resolve_kernel_version || exit 1
_install_prerequisites

Contributor

Let's also remove the call to _update_package_cache while we are at it. If we aren't installing any new packages then we don't need to update the package cache.

@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from 4d6bb0f to aae8035 Compare December 12, 2025 14:51
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from aae8035 to 883b5a9 Compare December 12, 2025 18:44
@karthikvetrivel karthikvetrivel force-pushed the eat/avoid-reinstall-gpu-container branch from 883b5a9 to 34e89fc Compare December 16, 2025 21:11
4 participants