
Conversation

roclark (Member) commented on Feb 23, 2021

The TensorFlow image on NGC is quite large (10+ GB) and contains a lot of data Bobber doesn't need. Using one of the CUDA images as the base instead cuts several gigabytes with no change in performance or functionality compared to the existing Bobber image. To reduce the size further, multi-stage builds let us compile many of the testing tools, such as NCCL and mdtest, inside a beefier build image and copy only the necessary binaries into the final, slimmer runtime image.
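
Roughly, the base-image swap looks like the sketch below (illustrative only: the image tag and TensorFlow install are placeholder assumptions, not the exact changes in this PR; a multi-stage sketch follows the commit list further down):

```dockerfile
# Slimmer CUDA runtime base instead of the full NGC TensorFlow image.
# Tag is a placeholder; pin whichever CUDA/cuDNN combination Bobber targets.
FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04

# Install only what the tests need, then pull TensorFlow from PyPI so the
# image still provides the framework without the rest of the NGC container.
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip openmpi-bin && \
    rm -rf /var/lib/apt/lists/* && \
    pip3 install --no-cache-dir tensorflow
```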

Closes #2
Fixes #82

Signed-Off-By: Robert Clark [email protected]

roclark added the enhancement and docker labels on Feb 23, 2021
roclark requested review from fredvx and joehandzik on Feb 23, 2021
roclark self-assigned this on Feb 23, 2021
roclark (Member, Author) commented on Feb 23, 2021

I've run this on a single node so far, and the results are indistinguishable between the 6.1.1 image and this new, lighter image. I'm marking this as a draft until I can do some multi-node testing to verify functionality, though I expect the same results given that single-node runs already use mpirun.

joehandzik (Contributor)

My only real concern is whether we lost some network functionality somehow, but I think DeepOps uses the base CUDA image for some multi-node testing, so I'm hopeful. As you say, let's hold this until we can do a multi-node test.

roclark force-pushed the update-base-image branch from c447d28 to 6fc44f0 on April 5, 2021
Bobber currently uses a TensorFlow image from NGC as the base to use some
of the TensorFlow functionality in the tests. While this is efficient,
the Bobber image is quite large (12GB+ at present). By moving to a CUDA
base image and installing TensorFlow inside the Bobber image, it might be
possible to reduce the overall image size by several gigabytes with no
change in functionality or performance. This will require a thorough
investigation of the potential impacts of such a change.

Signed-Off-By: Robert Clark <[email protected]>
roclark force-pushed the update-base-image branch 3 times, most recently from 28882a4 to 6b490a8 on March 15, 2022
roclark (Member, Author) commented on Mar 16, 2022

I was able to verify functionality on multi-node just now, and it appears to work well (though I don't have the fastest storage for this cluster, so it isn't scaling, but that's expected). Given that this will resolve issue #82, that there doesn't appear to be a performance regression, and that it works well for both single- and multi-node runs, I'll go ahead and move this out of draft and merge.

roclark marked this pull request as ready for review on March 16, 2022
roclark added 2 commits March 16, 2022 15:10
The Bobber image requires compiling multiple binaries across many
repositories, which creates a lot of unnecessary files and bloats the
image. With multi-stage builds, much of the compilation and dependency
installation can be done in a beefier build image, and only the
necessary components are copied into the final runtime image (sketched
below). This reduces the final image by several gigabytes with no loss
of performance or functionality.

Signed-Off-By: Robert Clark <[email protected]>
Newer versions of FIO changed the way results are displayed, causing
the parser to complain that the results are invalid. The newer versions
contain an extra value in the results line which can safely be ignored.

Signed-Off-By: Robert Clark <[email protected]>
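
For illustration, the multi-stage pattern described in the first commit above looks roughly like the following; the image tags, package list, and build steps are placeholder assumptions rather than the exact Bobber Dockerfile, with mdtest (built from the IOR repository) standing in for the full set of testing tools:

```dockerfile
# Build stage: heavier CUDA devel image with compilers, headers, and MPI.
# Tags are placeholders; pin whichever CUDA version Bobber targets.
FROM nvidia/cuda:11.4.2-devel-ubuntu20.04 AS builder

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential git automake autoconf libtool pkg-config \
        libopenmpi-dev openmpi-bin && \
    rm -rf /var/lib/apt/lists/*

# mdtest ships with IOR; compile it in the build stage only.
RUN git clone https://github.com/hpc/ior.git /ior && \
    cd /ior && ./bootstrap && ./configure && make

# Runtime stage: slim CUDA runtime image; copy in just the built binaries.
FROM nvidia/cuda:11.4.2-runtime-ubuntu20.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends openmpi-bin && \
    rm -rf /var/lib/apt/lists/*

COPY --from=builder /ior/src/ior /ior/src/mdtest /usr/local/bin/
```
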
roclark force-pushed the update-base-image branch from 6b490a8 to e9d7d1a on March 16, 2022
fredvx left a comment

LGTM.

roclark merged commit 0d2808e into main on Mar 21, 2022
roclark deleted the update-base-image branch on March 21, 2022
roclark added this to the Release 6.3.1 milestone on Mar 22, 2022
Successfully merging this pull request may close these issues: "Error while building container" and "Investigate replacing base image with CUDA image".