chore(test): Add E2E tests for Kubeflow Trainer#2470
Conversation
7c96d36 to
b20e562
Compare
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
38ca1e6 to
957d8af
Compare
da2d9d8 to
5af466d
Compare
|
|
||
|
|
||
| # TODO (andreyvelich): Change return type to IntOrString. | ||
| def get_num_proc_per_node(resources_per_node: dict) -> object: |
There was a problem hiding this comment.
@astefanutti I added this util func to get numProcPerNode based on CPU and GPU configuration, as we discussed before.
WDYT ?
There was a problem hiding this comment.
@andreyvelich ah that's interesting, intuitively I would have defined that logic backend side in the torch plugin.
Not all the runtimes may have that "auto" logic, and some may be cgroups friendly.
Also the capping logic should take into account fractional CPU and millicore request / limit, which may be easier to achieved controller-side.
Adding #2407 for reference.
There was a problem hiding this comment.
Oh, that is good point, yeah maybe we could move this part to the plugin itself.
Let's add the TODO statement to refactor it in the followup PRs, so we can integrate initial E2E tests working?
There was a problem hiding this comment.
Sounds good, I agree with you, better integrate E2E asap, and we can work on #2407 separately.
There was a problem hiding this comment.
SGTM, we can construct Torch environment variables based on input numProcPerNode as we do in the current torch plugin.
There was a problem hiding this comment.
I think, what we try to say here is if user doesn't explicitly set numProcPerNode in TrainJob, we should construct this value based on Trainer container resources (if they are configured).
And we should do it inside the torch plugin on the server side, not in the get_num_proc_per_node() function on the client side.
Does it make sense @tenzen-y ?
There was a problem hiding this comment.
Sorry for your confusion. That is what I wanted to say.
|
This should be ready for review. |
|
FYI, I was able to use large GitHub runners in our E2Es, after enabling it for I will create issue to inform Kubeflow community that we can use these runners. |
0f60a1f to
a036b42
Compare
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
a036b42 to
7d2b51f
Compare
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
b577e24 to
761a9fa
Compare
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
62f0611 to
657c36e
Compare
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
6835a6e to
188aeee
Compare
|
@tenzen-y Hopefully, I fixed all of the test flakiness. Please take a look when you can. |
| echo "Install Kind" | ||
| go install sigs.k8s.io/kind@v0.27.0 | ||
|
|
There was a problem hiding this comment.
| echo "Install Kind" | |
| go install sigs.k8s.io/kind@v0.27.0 |
You might forget to remove this.
| trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" | ||
| "github.com/onsi/ginkgo/v2" | ||
| "github.com/onsi/gomega" | ||
| "k8s.io/client-go/kubernetes/scheme" | ||
| "sigs.k8s.io/controller-runtime/pkg/client" | ||
| "sigs.k8s.io/controller-runtime/pkg/client/config" |
There was a problem hiding this comment.
| trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" | |
| "github.com/onsi/ginkgo/v2" | |
| "github.com/onsi/gomega" | |
| "k8s.io/client-go/kubernetes/scheme" | |
| "sigs.k8s.io/controller-runtime/pkg/client" | |
| "sigs.k8s.io/controller-runtime/pkg/client/config" | |
| "github.com/onsi/ginkgo/v2" | |
| "github.com/onsi/gomega" | |
| "k8s.io/client-go/kubernetes/scheme" | |
| "sigs.k8s.io/controller-runtime/pkg/client" | |
| "sigs.k8s.io/controller-runtime/pkg/client/config" | |
| trainer "github.com/kubeflow/trainer/pkg/apis/trainer/v1alpha1" |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Electronic-Waste
left a comment
There was a problem hiding this comment.
@andreyvelich Thanks for this! Just a few comments.
| # Kubernetes versions for e2e tests on Kind cluster. | ||
| kubernetes-version: ["1.29.14", "1.30.0", "1.31.0"] |
| # TODO (andreyvelich): Update JobSet to v0.8.0 once this PR is merged: https://github.com/kubeflow/trainer/pull/2466 | ||
| # "pydantic>=2.10.0", | ||
| "jobset @ git+https://github.com/kubernetes-sigs/jobset.git@v0.7.2#subdirectory=sdk/python", |
There was a problem hiding this comment.
It might be better to use v0.7.3 since we used this version before: #2445
There was a problem hiding this comment.
Actually, let me ping it directly to updated commit, once we merge this PR:
kubernetes-sigs/jobset#810
| # Input and output location for Notebooks executed with Papermill. | ||
| NOTEBOOK_INPUT=$(PROJECT_DIR)/examples/pytorch/image-classification/mnist.ipynb | ||
| NOTEBOOK_OUTPUT=$(PROJECT_DIR)/trainer_output.ipynb | ||
| PAPERMILL_TIMEOUT=900 | ||
| .PHONY: test-e2e-notebook | ||
| test-e2e-notebook: ## Run Jupyter Notebook with Papermill. | ||
| NOTEBOOK_INPUT=$(NOTEBOOK_INPUT) NOTEBOOK_OUTPUT=$(NOTEBOOK_OUTPUT) PAPERMILL_TIMEOUT=$(PAPERMILL_TIMEOUT) ./hack/e2e-run-notebook.sh |
There was a problem hiding this comment.
I would suggest that we use NOTEBOOK_INPUT_LIST and NOTEBOOK_OUTPUT_LIST for e2e test with Notebook. We may add more testcases in the future:)
There was a problem hiding this comment.
It looks like you mentioned future work. So, we can keep current approach for now.
There are no guarantees and plans for additional tests for now.
There was a problem hiding this comment.
Yeah, I agree @Electronic-Waste, but I want to discuss in the future how we can parallelize these steps, so we can run multiple Notebooks at the same time.
As @tenzen-y suggested, let's discuss it as a followup.
| # The default output for Notebook after Papermill execution. | ||
| trainer_output.ipynb |
There was a problem hiding this comment.
How about changing it to *_output.ipynb?
There was a problem hiding this comment.
As you mentioned in #2470 (comment), let's add artifacts to gitignore.
| # TODO (andreyvelich): Discuss how we can upload artifacts for multiple Notebooks. | ||
| - name: Upload notebook | ||
| uses: actions/upload-artifact@v4 | ||
| if: always() | ||
| with: | ||
| name: mnist_output_${{ matrix.kubernetes-version }}.ipynb | ||
| path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/mnist_output_${{ matrix.kubernetes-version }}.ipynb | ||
| retention-days: 1 |
There was a problem hiding this comment.
Option1: Output files to a seperate directory
We can output these *_output.ipynb files to a certain directory like ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output, and upload the whole dir like:
- name: Upload all artifacts
uses: actions/upload-artifact@v4
if: always()
with:
name: all-artifacts
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/output/*
retention-days: 1
Option2: regex match
- name: Upload all artifacts
uses: actions/upload-artifact@v4
if: always()
with:
name: all-artifacts
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/*_output_${{ matrix.kubernetes-version }}.ipynb
retention-days: 1
There was a problem hiding this comment.
Instead of output, it might be better to use artifacts.
There was a problem hiding this comment.
Great suggestion! I didn't know that upload artifact supports regex.
Let me create PR.
There was a problem hiding this comment.
|
@Electronic-Waste Could you address in the following in your side? If yes, could you say
/approve |
|
@tenzen-y For sure. I'm glad to help. It's better to merge this PR asap:) /hold cancel |
Otherwise, Andrey will address those in this PR. |
Thanks! |
Electronic-Waste
left a comment
There was a problem hiding this comment.
Thanks for this!
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Electronic-Waste, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Add e2e tests for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add timeout for papermill Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add output as part of make command Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add k8s version to setup cluster Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Kind k8s version Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix 1.29 version Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create script to run Notebook Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Download dataset when local_rank=0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update test/e2e/e2e_test.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor Go e2e tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Bump k8s to 1.29.14 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Install Kind from go mod Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix path for Kind package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Go e2e Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Reduce number of CPUs Export Notebook as artifact Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Print logs due to flaky test Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix artifact path Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * docker pull image Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix path Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add k8s version to output name Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove install Kind cmd Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add e2e tests for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add timeout for papermill Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add output as part of make command Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add k8s version to setup cluster Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Kind k8s version Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix 1.29 version Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Create script to run Notebook Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Download dataset when local_rank=0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update test/e2e/e2e_test.go Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Refactor Go e2e tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Bump k8s to 1.29.14 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Install Kind from go mod Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix path for Kind package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix Go e2e Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Reduce number of CPUs Export Notebook as artifact Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Print logs due to flaky test Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix artifact path Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * docker pull image Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix path Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add k8s version to output name Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Remove install Kind cmd Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Fixes: #2213
This PR adds E2E tests for: