
[SPARK-53076][CORE][TESTS] Disable SparkBloomFilterSuite #51788


Closed · wants to merge 1 commit

Conversation

@dongjoon-hyun (Member) commented Aug 2, 2025

What changes were proposed in this pull request?

This PR aims to disable SparkBloomFilterSuite due to its excessive running time.

  • SPARK-53077 is filed to re-enable it with a reasonable running time.

Why are the changes needed?

Previously, the common/sketch module took less than 10s.

$ mvn package --pl common/sketch
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  7.177 s
[INFO] Finished at: 2025-08-02T08:25:43-07:00
[INFO] ------------------------------------------------------------------------

After SparkBloomFilterSuite was added, the suite alone took over 12 minutes, which is too long for a unit test.

[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyRandomDistribution(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyEvenOdd(long, double, int, org.junit.jupiter.api.TestInfo):#1 started
[info] Test run finished: 0 failed, 0 ignored, 2 total, 721.939s

Does this PR introduce any user-facing change?

No, this is a test change.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Aug 2, 2025
@dongjoon-hyun (Member, Author)

cc @ishnagy, @peter-toth, @LuciferYang

@dongjoon-hyun (Member, Author)

Could you review this PR if you have some time, @beliefer ?

@LuciferYang (Contributor)

I hope this test can be re-enabled within a short period of time.

@LuciferYang (Contributor)

Merged into master. Thanks, @dongjoon-hyun.

@dongjoon-hyun (Member, Author)

Thank you so much, @LuciferYang. Yes, I hope so too.

@dongjoon-hyun deleted the SPARK-53076 branch August 3, 2025 06:30
@peter-toth (Contributor)

Thanks @dongjoon-hyun.

@ishnagy do you think we could validate the V2 filter implementation with fewer than 1G items added?

@beliefer (Contributor) commented Aug 3, 2025

> Could you review this PR if you have some time, @beliefer?

Thank you for the ping. Disabling the test case looks good to me in the short term.

@ishnagy (Contributor) commented Aug 3, 2025

> @ishnagy do you think we could validate the V2 filter implementation with fewer than 1G items added?

The short answer is a (qualified) yes.

The longer answer is a bit more nuanced.

This test was never meant to be run regularly (e.g. as part of a generic pre-merge test suite). I originally submitted the PR with an @Disabled annotation on these tests, precisely so that these long-running test cases would not hog any general-purpose test runs. Since the sketch module has virtually no external dependencies, it would be perfectly sufficient to run any heavy-duty Bloom filter tests only when the actual BF logic changes (which I assume happens quite rarely). There was an attempt to start working out such test classifications in d9d69801, but it was quickly cut short.

On the necessity of these tests: I'm a bit worried that by forcing this test code into the "generic bucket", constraining it by the time/resource limits we have there, and significantly cutting down the test cases just to meet those "artificial" constraints, we will introduce blind spots similar to those of the original Scala suite, which tested with at most 100k insertions. This particular int/long truncation problem may be detectable with only 100M elements, but I think the implementations should be proven safe not just in the range we can afford to test regularly (which is somewhat pointless), but at 1-2 orders of magnitude higher. We may not be able to run that on every change, but we can still afford it once in a while, when the implementation changes.
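To make the scale concrete: with the textbook Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2, a 3% FPP filter needs more than Integer.MAX_VALUE bits somewhere near 300M items, which is roughly where an int-truncated bit index would start to bite. Here is a minimal Java sketch of that arithmetic (the formula is the standard one, used here as an assumption; it is not a quote of Spark's implementation):

// Minimal sizing sketch, assuming the textbook formula
// m = -n * ln(p) / (ln 2)^2; not Spark's actual code.
public class BloomSizingSketch {
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        double fpp = 0.03;
        for (long n : new long[] {100_000_000L, 200_000_000L, 300_000_000L, 350_000_000L}) {
            long m = optimalNumOfBits(n, fpp);
            // Beyond Integer.MAX_VALUE bits, an int-typed bit index silently truncates.
            System.out.printf("n=%,d -> m=%,d bits, exceeds Integer.MAX_VALUE: %b%n",
                    n, m, m > Integer.MAX_VALUE);
        }
    }
}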

  • Short-term action item: I'll check whether 100M insertions are enough to demonstrate the problem with the V1 implementation, and prep a PR with a lowered item count (ETA: tomorrow).
  • Mid-term: I would be perfectly fine with leaving these tests disabled (without removing them), even putting back all the test cases we cut out while chasing a lower running time. I'd be happy to work on some tagging facilities with which we can keep everyday testing fast without completely disabling the slower tests; see the sketch after this list.
  • I'm already working on some performance improvement patches (introducing a non-compatible V3 version), which should make the insert/query operations a lot quicker and thus the high-item-count tests a lot less painful.
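As a sketch of the tagging idea from the mid-term item above (the class and test names are hypothetical; JUnit 5's @Tag combined with Surefire's group filtering is one way to do it):

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Hypothetical example: tag the heavy accuracy tests so that everyday runs
// can exclude them (mvn test -DexcludedGroups=slow) while dedicated runs
// include them (mvn test -Dgroups=slow).
class SparkBloomFilterHeavySuite {

    @Tag("slow")
    @Test
    void accuracyWithVeryHighInsertionCount() {
        // long-running insert/query accuracy checks would go here
    }
}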

@LuciferYang (Contributor)

> Since the sketch module has virtually no external dependencies, it would be perfectly sufficient to run any heavy-duty Bloom filter tests only when the actual BF logic changes (which I assume happens quite rarely).

I've always been curious about this point you mentioned. If we routinely mark it as @Disabled, do we have a way to enforce that the submitter or code reviewer manually verifies it when changes occur? If not, potentially erroneous changes introduced into a release would result in a poor user experience.

@LuciferYang (Contributor)

@dongjoon-hyun @peter-toth @ishnagy If it indeed requires a large amount of data to reproduce this issue, I have an idea, and I'm not sure if everyone agrees:

  1. Add an environment variable, such as VERIFY_SPARK_BF_SUITE.

  2. Replace @Disabled with @EnabledIfEnvironmentVariable (a fuller sketch follows this list):

     -@Disabled("TODO(SPARK-53077): Re-enable with a resonable test time.")
     +@EnabledIfEnvironmentVariable(named = "VERIFY_SPARK_BF_SUITE", matches = "true")

    This way, we can ensure the test is executed by running VERIFY_SPARK_BF_SUITE=true build/mvn package --pl common/sketch, while it will be skipped when running build/mvn package --pl common/sketch or VERIFY_SPARK_BF_SUITE=false build/mvn package --pl common/sketch.

  3. Modify the YAML file for daily tests to ensure that at least one daily test will verify this test.

By doing so, the change pipeline will no longer execute this test, but it will still be verified during the daily test process. This way, even if there are change-related errors, we can detect and correct them within at most one day.
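
A self-contained sketch of step 2 above (the class skeleton and test name are illustrative, not the suite's real shape):

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;

// Illustrative sketch: the test runs only when VERIFY_SPARK_BF_SUITE=true is
// present in the environment, and is reported as skipped otherwise.
class SparkBloomFilterSuite {

    @EnabledIfEnvironmentVariable(named = "VERIFY_SPARK_BF_SUITE", matches = "true")
    @Test
    void heavyAccuracyCheck() {
        // heavy accuracy test body
    }
}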

@dongjoon-hyun (Member, Author)

Thank you all.

I support all approaches to improve this situation, including:

  • the fundamental performance improvements,
  • reducing test data size,
  • introducing the environment variable, because it's better than a hard-coded value, although we would need to rename it to have a SPARK_ prefix like SPARK_GENERATE_BENCHMARK_FILES.

@dongjoon-hyun (Member, Author) commented Aug 3, 2025

For the record, the following is the ASF Infra policy and why this is important.

ASF INFRA POLICY

MONITORING

[Screenshot of the ASF Infra policy's monitoring section, taken 2025-08-03]

SPARK EFFORT

  • SPARK-48094 Reduce GitHub Action usage according to ASF project allowance

@ishnagy (Contributor) commented Aug 4, 2025

I made some measurements with the V1 implementation:

100M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (3.050257 %) [00m18s] T:  ~9.6%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (3.053887 %) [00m09s] T:  ~9.3%

150M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (3.080157 %) [00m28s] T: ~15.0%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (3.079987 %) [00m15s] T: ~15.4%

200M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (3.861257 %) [00m37s] T: ~19.8%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (3.860424 %) [00m20s] T: ~20.6%

250M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (3.676172 %) [00m47s] T: ~25.1%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (3.675387 %) [00m25s] T: ~25.8%

300M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (3.210548 %) [00m57s] T: ~30.5%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (3.209847 %) [00m30s] T: ~30.1%

350M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (5.377388 %) [01m07s] T: ~35.8%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (5.377483 %) [00m36s] T: ~37.1%

400M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (8.170380 %) [01m17s] T: ~41.2%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (8.170716 %) [00m40s] T: ~41.2%

500M
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (15.392861 %) [01m36s] T: ~51.3%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (15.391692 %) [00m50s] T: ~51.5%

1G
testAccuracyRandomDistribution: acceptableFpp(3.000000 %) < actualFpp (59.890330 %) [03m07s] T: 100.0%
testAccuracyEvenOdd:            acceptableFpp(3.000000 %) < actualFpp (59.888499 %) [01m37s] T: 100.0%

I'd say that the truncation bug (at this FPP level) does not cause reliably detectable discrepancies under 350M insertions.

The running time scales approximately linearly with the insertion count, so with the bare minimum of insertions we could cut the running time by about a factor of three. That's not nothing (12 min -> 4 min on GitHub), but 4 minutes is still 1-2 orders of magnitude longer than the old running time.

I saw #51806 went in recently; I can create a patch on top of it. Let me know if you think it still makes sense.

@dongjoon-hyun @LuciferYang @peter-toth

@dongjoon-hyun (Member, Author) commented Aug 4, 2025

Thank you so much, @ishnagy .

It sounds good to me. The 4-minute run will be covered in the non-ANSI build test pipeline.

@ishnagy (Contributor) commented Aug 5, 2025

Just for recordkeeping, linking the related PRs together: I have submitted #51845 to reduce the insertion count.
