CI Make nightly cuml.accel integration test stricter #7631
rapids-bot[bot] merged 2 commits into rapidsai:main from
Conversation
@@ -1,6 +1,6 @@
-- reason: AUC standard deviation differs slightly with cuml.accel in sklearn 1.8
+- reason: AUC standard deviation differs slightly with cuml.accel in sklearn >= 1.7.2
I used 1.8.0 and 1.7.2 locally and noticed that this xfail should also include 1.7.2. Eventually 1.7.2 might become the "intermediate" scikit-learn version we test against, so I'm updating this now while the memory is fresh.
 - "sklearn.metrics._plot.tests.test_roc_curve_display::test_roc_curve_from_cv_results_legend_label[single-None]"
 - "sklearn.metrics._plot.tests.test_roc_curve_display::test_roc_curve_from_cv_results_legend_label[single-curve_kwargs1]"
-- reason: Search CV sample weight equivalence differs with cuml.accel in sklearn 1.8
+- reason: Search CV sample weight equivalence differs with cuml.accel in sklearn 1.7.2
rapids-logger "Analyzing test results"
./python/cuml/cuml_accel_tests/upstream/summarize-results.py \
  --config ./python/cuml/cuml_accel_tests/upstream/scikit-learn/test_config.yaml \
  "${RAPIDS_TESTS_DIR}/junit-cuml-accel-scikit-learn.xml"
I think we can also drop this call (and the set +e above) and just run the tests like normal. The summarize script doesn't get us much of anything IMO.
I thought about that and decided to keep it so that those who want to know can track the pass rate manually. If someone wants to quote the pass rate, I'd prefer they use a number from a CI run rather than compute something locally (which would make it virtually impossible to ever understand how they came up with the number).
I mean, they can always take the numbers output in the pytest summary and calculate it themselves if they want to. I suspect this case will never come up and the whole thing is unnecessary. Still, now that CI has passed, I'm not sure ripping it out is worth another CI cycle.
jameslamb
left a comment
Giving this a ci-codeowners approval, sounds great to me 😁
/merge
This changes the nightly cuml.accel integration test with scikit-learn to use a strict "fail on anything" setup, the same as we use on Pull Requests.

This solves the problem that we have to choose an arbitrary threshold to declare "CI passes" (there is no great way to justify 80% over 85% or 87.325%) and that different versions of scikit-learn have a different number of tests. For example, 1.8.0 has about 44,000 test cases and v1.7.2 has 41,472; about 41,000 are shared between those two versions, about 1,000 only exist in 1.7.2, and about 4,000 are new in 1.8.0. This means the pass rate can change quite a bit without cuml.accel having gotten any worse.

We could also reconsider how we calculate the pass rate. For example, the denominator of the pass rate includes skipped tests. Virtually all of the ~2,500 skipped tests that are only in 1.8.0 are related to the array API. The reason they are skipped has more to do with what is installed in the test environment (pytorch, cupy, etc.) and which environment variables are set than with the quality of cuml.accel.

The important thing is that we do not start failing tests we used to pass, or start passing tests we used to fail. And of course, if new versions bring new tests that we fail, that needs fixing.

Authors:
- Tim Head (https://github.com/betatim)

Approvers:
- Jim Crist-Harif (https://github.com/jcrist)
- James Lamb (https://github.com/jameslamb)

URL: rapidsai#7631
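The point about skipped tests in the denominator can be illustrated with a toy calculation. This is only a sketch, not the repository's summarize-results.py; the test and skip counts are rounded figures from the description above, and the failure count is made up for illustration:

```python
# Illustrative sketch: how the pass rate shifts depending on whether
# skipped tests count in the denominator. Counts are rounded/hypothetical.
import xml.etree.ElementTree as ET

# A minimal junit-style summary, shaped like pytest's junit XML output.
JUNIT = '<testsuite name="pytest" tests="44000" failures="100" errors="0" skipped="2500"/>'

suite = ET.fromstring(JUNIT)
tests = int(suite.get("tests"))
failures = int(suite.get("failures")) + int(suite.get("errors"))
skipped = int(suite.get("skipped"))
passed = tests - failures - skipped

rate_with_skips = passed / tests               # skipped tests inflate the denominator
rate_without_skips = passed / (tests - skipped)

print(f"{rate_with_skips:.3f} vs {rate_without_skips:.3f}")  # → 0.941 vs 0.998
```

With ~2,500 array-API skips in the denominator the rate looks several points worse, even though none of those skips say anything about cuml.accel itself.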