SPARK-1181. 'mvn test' fails out of the box since sbt assembly does not necessarily exist #77

Closed
srowen wants to merge 1 commit

Conversation

@srowen (Member) commented Mar 4, 2014

The test suite requires that "sbt assembly" has been run in order for some tests (like DriverSuite) to pass. The tests themselves say as much.

This means that a "mvn test" from a fresh clone fails.

There's a pretty simple fix: have Maven's test-compile phase invoke "sbt assembly". I suppose the only downside is re-invoking "sbt assembly" each time tests are run.
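A rough sketch of what that binding could look like (not necessarily the exact change in this PR; the exec-maven-plugin and the sbt/sbt launcher path are assumptions about how the call would be wired up):

```
<!-- Sketch only: bind "sbt assembly" to Maven's test-compile phase. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>sbt-assembly</id>
      <phase>test-compile</phase>
      <goals>
        <goal>exec</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <!-- Assumes the in-tree sbt launcher script; adjust the path as needed. -->
    <executable>sbt/sbt</executable>
    <arguments>
      <argument>assembly</argument>
    </arguments>
  </configuration>
</plugin>
```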

I'm open to ideas about how to set this up more intelligently but it would be a generally good thing if the Maven build's tests passed out of the box.

@markhamstra (Contributor)

The standard Maven build procedure should be to run `mvn -DskipTests package` first (which builds the assembly) and then `mvn test`. The "Building Spark with Maven" page should be updated to clearly explain that procedure.
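For a fresh clone, that workflow would look roughly like this:

```
# Build everything, including the assembly, but skip tests for now.
mvn -DskipTests package
# Then run the test suite against the assembly built above.
mvn test
```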

Requiring Maven users to invoke "sbt assembly" not only forces them to download SBT itself, but also ends up duplicating artifacts in .ivy2 and .m2.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@JoshRosen (Contributor)

In Maven, you can run tests that depend on packages/assemblies during Maven's integration-test phase, which automatically runs after the Maven package phase. I'm not sure what the SBT equivalent of integration-test is.

This approach would require us to move the integration tests into a separate test directory or package.
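As a minimal sketch of that idea, using the stock maven-failsafe-plugin (which by convention picks up *IT test classes and runs them during the integration-test phase, after package); this shows the generic Maven mechanism only, not anything Spark's build actually configures:

```
<!-- Sketch: run assembly-dependent tests after the package phase. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```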

@AmplabJenkins: Merged build finished.

@AmplabJenkins: One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12997/

@srowen (Member, Author) commented Mar 4, 2014

OK, that works: package and then test. In the canonical Maven lifecycle, packaging comes after test, so the test phase would not depend on packaging. In practice this is at worst a fine workaround.

I agree that making these into integration tests is likely the right answer. I will close this, however, as I think that's for another day; other, bigger build changes might make this point irrelevant.

@srowen srowen closed this Mar 4, 2014
@srowen srowen deleted the SPARK-1181 branch March 23, 2014 13:11
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
ashangit pushed a commit to ashangit/spark that referenced this pull request Dec 11, 2018
[SPARK-24917] make chunk size configurable
clems4ever pushed a commit to clems4ever/spark that referenced this pull request Feb 11, 2019
[SPARK-24917] make chunk size configurable
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
hn5092 added a commit to hn5092/spark that referenced this pull request Nov 10, 2019
hn5092 added a commit to hn5092/spark that referenced this pull request Nov 19, 2019
cloud-fan pushed a commit that referenced this pull request Aug 11, 2020
### What changes were proposed in this pull request?
This PR added a physical rule to remove redundant project nodes. A `ProjectExec` is redundant when
1. It has the same output attributes and order as its child's output when ordering of these attributes is required.
2. It has the same output attributes as its child's output when attribute output ordering is not required.

For example:
After Filter:
```
== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17]
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17]
```
The `Project a#14L, b#15L, c#16, key#17` is redundant because its output is exactly the same as filter's output.

Before Aggregate:
```
== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17]
```
The `Project key#17, a#14L, b#15L` is redundant because hash aggregate doesn't require its child plan's output to be in a specific order.
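For illustration, the core of such a rule can be sketched in a few lines of Scala; the object name and the redundancy check below are simplifications (not the actual rule added by this commit), and a real rule also has to honor the ordering distinction described above:

```
// Simplified sketch of a physical rule that drops a ProjectExec whose
// project list is exactly its child's output. The real rule additionally
// handles the case where the parent does not care about attribute ordering.
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}

object RemoveRedundantProjectsSketch extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    case ProjectExec(projectList, child) if projectList == child.output =>
      child
  }
}
```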

### Why are the changes needed?

It removes unnecessary query nodes and makes query plan cleaner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29031 from allisonwang-db/remove-project.

Lead-authored-by: allisonwang-db <[email protected]>
Co-authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>