[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC #8
Closed
…ring scripts Migrate Apache download closer.cgi refs to new closer.lua. This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately. Author: Sean Owen <[email protected]> Closes apache#8557 from srowen/SPARK-10398. (cherry picked from commit 3f63bd6) Signed-off-by: Sean Owen <[email protected]>
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392). The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.

Issue reproduction on master:

```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```

Author: 0x0FFF <[email protected]> Closes apache#8556 from 0x0FFF/SPARK-10392.
…verride clone method https://issues.apache.org/jira/browse/SPARK-10422 Author: Yin Huai <[email protected]> Closes apache#8578 from yhuai/SPARK-10422. (cherry picked from commit 03f3e91) Signed-off-by: Davies Liu <[email protected]>
Author: Davies Liu <[email protected]> Closes apache#8543 from davies/preserve_page. (cherry picked from commit 62b4690) Signed-off-by: Andrew Or <[email protected]>
…explain by default New screenshots after this fix: <img width="627" alt="s1" src="https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png"> Default: <img width="462" alt="s2" src="https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png"> After clicking `+details`: <img width="377" alt="s3" src="https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png"> Author: zsxwing <[email protected]> Closes apache#8570 from zsxwing/SPARK-10411. (cherry picked from commit 0349b5b) Signed-off-by: Andrew Or <[email protected]>
From JIRA: Running spark-submit on YARN with --num-executors equal to 0 when not using dynamic allocation should error out. In Spark 1.5.0 it continues and ends up hanging. yarn.ClientArguments still has the check, so something else must have changed.

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 0 ....

Spark 1.4.1 errors with: java.lang.IllegalArgumentException: Number of executors was 0, but must be at least 1 (or 0 if dynamic executor allocation is enabled).

Author: Holden Karau <[email protected]> Closes apache#8580 from holdenk/SPARK-10332-spark-submit-to-yarn-executors-0-message. (cherry picked from commit 67580f1) Signed-off-by: Sean Owen <[email protected]>
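A minimal sketch of the kind of guard this restores (illustrative names, not the exact `yarn.ClientArguments` code):

```scala
// Hypothetical validation mirroring the Spark 1.4.1 error message.
def validateNumExecutors(numExecutors: Int, dynamicAllocationEnabled: Boolean): Unit = {
  if (!dynamicAllocationEnabled && numExecutors <= 0) {
    throw new IllegalArgumentException(
      s"Number of executors was $numExecutors, but must be at least 1 " +
        "(or 0 if dynamic executor allocation is enabled).")
  }
}
```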
…rting results Author: robbins <[email protected]> Closes apache#8589 from robbinspg/InputStreamSuite-fix. (cherry picked from commit 754f853) Signed-off-by: Andrew Or <[email protected]>
…eue to be clear Author: robbins <[email protected]> Closes apache#8582 from robbinspg/InputOutputMetricsSuite.
Author: robbins <[email protected]> Closes apache#8605 from robbinspg/DAGSchedulerSuite-fix. (cherry picked from commit 2e1c175) Signed-off-by: Andrew Or <[email protected]>
…with checkpoint file in cluster mode Author: xutingjun <[email protected]> Closes apache#8477 from XuTingjun/streaming-attempt.
We should make sure the scaladoc for params includes their default values throughout the models in ml/ Author: Holden Karau <[email protected]> Closes apache#8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml. (cherry picked from commit 22eab70) Signed-off-by: Joseph K. Bradley <[email protected]>
…amming guides and python docs

- Fixed information around Python API tags in streaming programming guides
- Added missing stuff in Python docs

Author: Tathagata Das <[email protected]> Closes apache#8595 from tdas/SPARK-10440. (cherry picked from commit 7a4f326) Signed-off-by: Reynold Xin <[email protected]>
To keep full compatibility of the Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array". Please refer to [SPARK-10434] [1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-10434 Author: Cheng Lian <[email protected]> Closes apache#8586 from liancheng/spark-10434/fix-parquet-array-type. (cherry picked from commit bca8c07) Signed-off-by: Cheng Lian <[email protected]>
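For context, a sketch of the field naming at issue, assuming an `array<int>` column with nullable elements (the group names follow Spark 1.4's legacy Parquet layout, so treat this as illustrative rather than exact):

```
optional group ints (LIST) {
  repeated group bag {
    optional int32 array;   // SPARK-10434 keeps this inner field named
  }                         // "array" rather than "array_element"
}
```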
…in the main README. Author: Stephen Hopper <[email protected]> Closes apache#8646 from enragedginger/master. (cherry picked from commit 9d8e838) Signed-off-by: Sean Owen <[email protected]>
Author: Jacek Laskowski <[email protected]> Closes apache#8629 from jaceklaskowski/docs-fixes. (cherry picked from commit 6ceed85) Signed-off-by: Sean Owen <[email protected]>
The copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set the parent. This fixes it and adds a test case. Author: Yanbo Liang <[email protected]> Closes apache#8637 from yanboliang/spark-10470. (cherry picked from commit f7b55db) Signed-off-by: Xiangrui Meng <[email protected]>
https://issues.apache.org/jira/browse/SPARK-10441 This is the backport of apache#8597 for the 1.5 branch. Author: Yin Huai <[email protected]> Closes apache#8655 from yhuai/timestampJson-1.5.
…ion about rate limiting and backpressure Author: Tathagata Das <[email protected]> Closes apache#8656 from tdas/SPARK-10492 and squashes the following commits: 986cdd6 [Tathagata Das] Added information on backpressure (cherry picked from commit 52b24a6) Signed-off-by: Tathagata Das <[email protected]>
…or nested structs We used to work around SPARK-10301 with a quick fix in branch-1.5 (PR apache#8515), but it doesn't cover the case described in SPARK-10428. So this PR backports PR apache#8509, which had once been considered too big a change to merge into branch-1.5 at the last minute, to fix both SPARK-10301 and SPARK-10428 for Spark 1.5. Also added more test cases for SPARK-10428. This PR looks big, but the essential change is only ~200 LOC; all other changes are for testing. In particular, PR apache#8454 is also backported here because the `ParquetInteroperabilitySuite` introduced in PR apache#8515 depends on it. This should be safe since apache#8454 only touches testing code. Author: Cheng Lian <[email protected]> Closes apache#8583 from liancheng/spark-10301/for-1.5.
…ream and throw a better exception when reading QueueInputDStream Output a warning when serializing QueueInputDStream rather than throwing an exception, so that unit tests can still use it. Moreover, this PR also throws a better exception when deserializing QueueInputDStream, to help the user find the problem easily. The previous exception was hard to understand: https://issues.apache.org/jira/browse/SPARK-8553 Author: zsxwing <[email protected]> Closes apache#8624 from zsxwing/SPARK-10071 and squashes the following commits: 847cfa8 [zsxwing] Output a warning when writing QueueInputDStream and throw a better exception when reading QueueInputDStream (cherry picked from commit 820913f) Signed-off-by: Tathagata Das <[email protected]>
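A minimal, self-contained sketch of the approach (illustrative class and messages, not the actual `QueueInputDStream` code):

```scala
import java.io.{NotSerializableException, ObjectInputStream, ObjectOutputStream}

class QueueBackedStream extends Serializable {
  // Serializing only warns, so unit tests that serialize a streaming graph
  // containing this stream can still run.
  private def writeObject(oos: ObjectOutputStream): Unit = {
    Console.err.println("WARN: queueStream doesn't support checkpointing")
    oos.defaultWriteObject()
  }

  // Deserializing fails fast with a message that names the real problem.
  private def readObject(ois: ObjectInputStream): Unit = {
    throw new NotSerializableException("queueStream doesn't support checkpointing")
  }
}
```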
The YARN backend doesn't like it when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application. So, for pyspark, avoid that call and instead throw an exception with the exit code. SparkSubmit handles that exception and exits with the given exit code, while YARN uses the exit code as the failure code for the Spark app. Author: Marcelo Vanzin <[email protected]> Closes apache#7751 from vanzin/SPARK-9416. (cherry picked from commit f68d024)
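Sketched in isolation, the idea looks like this (the exception type and names here are hypothetical; the PR's actual class may differ):

```scala
// Hypothetical exception carrying the child process's exit code, so the
// launcher, not the user code, decides how the JVM exits.
case class UserAppExitException(exitCode: Int)
  extends RuntimeException(s"User application exited with $exitCode")

def finishUserApp(exitCode: Int): Unit = {
  // Instead of System.exit(exitCode), which YARN cannot interpret:
  if (exitCode != 0) throw UserAppExitException(exitCode)
}
```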
The fix for SPARK-7736 introduced a race where a port value of "-1" could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race. Author: Marcelo Vanzin <[email protected]> Closes apache#8258 from vanzin/SPARK-7736. (cherry picked from commit c1840a8)
…ld be 0.0 (original: 1.0) Small typo in the example for `LabeledPoint` in the MLlib docs. Author: Sean Paradiso <[email protected]> Closes apache#8680 from sparadiso/docs_mllib_smalltypo. (cherry picked from commit 1dc7548) Signed-off-by: Xiangrui Meng <[email protected]>
Data spill with UnsafeRow causes an assert failure.

```
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```

To reproduce with code (thanks andrewor14):

```scala
bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0

sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()
```

Author: Cheng Hao <[email protected]> Closes apache#8635 from chenghao-intel/unsafe_spill. (cherry picked from commit e048111) Signed-off-by: Andrew Or <[email protected]>
From JIRA: Add documentation for tungsten-sort. From the mailing list: "I saw a new 'spark.shuffle.manager=tungsten-sort' implemented in https://issues.apache.org/jira/browse/SPARK-7081, but its corresponding description can't be found in http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html (currently there are only the 'sort' and 'hash' options)." Author: Holden Karau <[email protected]> Closes apache#8638 from holdenk/SPARK-10469-document-tungsten-sort. (cherry picked from commit a76bde9) Signed-off-by: Andrew Or <[email protected]>
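The documented setting is applied like any other shuffle configuration; a sketch for Spark 1.5, where `tungsten-sort` was an experimental alternative to `sort` and `hash`:

```scala
import org.apache.spark.SparkConf

// Select the shuffle implementation this change documents.
val conf = new SparkConf()
  .setAppName("tungsten-sort-demo")
  .set("spark.shuffle.manager", "tungsten-sort")
```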
…or.cores This is a regression introduced in apache#4960, this commit fixes it and adds a test. tnachen andrewor14 please review, this should be an easy one. Author: Iulian Dragos <[email protected]> Closes apache#8653 from dragos/issue/mesos/fine-grained-maxExecutorCores. (cherry picked from commit f0562e8) Signed-off-by: Andrew Or <[email protected]>
…osExecutor.cores" This reverts commit 8cf1619.
Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example, sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against. Note that this only applies to the project build files (items in project/); it is distinct from the version of Scala we target for the actual Spark compilation. Author: Ahir Reddy <[email protected]> Closes apache#8709 from ahirreddy/sbt-scala-version-fix. (cherry picked from commit 9bbe33f) Signed-off-by: Sean Owen <[email protected]>
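Concretely, the removed setting was of this form (sbt build definitions are Scala):

```scala
// Removed from project/plugins.sbt: pinning the meta-build's Scala version
// can conflict with the Scala version the installed sbt was compiled
// against (e.g. sbt 0.13.9 uses 2.10.5).
scalaVersion := "2.10.4"
```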
…s" if it is too flaky If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now. https://issues.apache.org/jira/browse/SPARK-10540 Author: Yin Huai <[email protected]> Closes apache#8705 from yhuai/SPARK-10540-ignore. (cherry picked from commit 6ce0886) Signed-off-by: Yin Huai <[email protected]>
Cherry-pick this to branch 1.5. Author: Rohit Agarwal <[email protected]> Closes apache#8701 from tgravescs/SPARK-9924-1.5 and squashes the following commits: 16e1c5f [Rohit Agarwal] [SPARK-9924] [WEB UI] Don't schedule checkForLogs while some of them are already running.
…l the test This commit ensures that if an assertion fails within a thread, it ultimately fails the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly. Author: Andrew Or <[email protected]> Closes apache#8723 from andrewor14/fix-threading-suite. (cherry picked from commit d74c6a1) Signed-off-by: Andrew Or <[email protected]>
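A minimal sketch of the failure mode and the fix pattern (illustrative, not the suite's actual code): an assertion thrown on a child thread never reaches the test thread unless it is recorded and re-thrown.

```scala
import java.util.concurrent.atomic.AtomicReference

// Record the child thread's failure and surface it on the test thread.
val failure = new AtomicReference[Throwable](null)
val t = new Thread {
  override def run(): Unit = {
    try assert(2 + 2 == 5, "fails, but only on this thread")
    catch { case e: Throwable => failure.set(e) } // would otherwise be lost
  }
}
t.start()
t.join()
Option(failure.get()).foreach(e => throw e) // now the enclosing test fails
```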
…asks important error information When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user. Manual testing shows the exception chained properly, and the test suite still looks fine as well. This contribution is my original work and I license the work to the project under the project's open source license. Author: Daniel Imfeld <[email protected]> Closes apache#8725 from dimfeld/dimfeld-patch-1. (cherry picked from commit 6d83678) Signed-off-by: Sean Owen <[email protected]>
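The pattern, as a minimal sketch (the message text and method shape are illustrative):

```scala
// Chain the underlying error as the cause so it survives translation
// into the higher-level exception type.
def initCodec(loadNative: () => Unit): Unit = {
  try {
    loadNative()
  } catch {
    case e: Throwable =>
      // The second argument preserves the original stack trace for the user.
      throw new IllegalArgumentException("snappy-java failed to initialize", e)
  }
}
```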
https://issues.apache.org/jira/browse/SPARK-10554 Fixes NPE when ShutdownHook tries to cleanup temporary folders Author: Nithin Asokan <[email protected]> Closes apache#8720 from nasokan/SPARK-10554. (cherry picked from commit 8285e3b) Signed-off-by: Sean Owen <[email protected]>
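A plausible shape of the fix (illustrative): `File.listFiles` returns `null`, not an empty array, when a directory has already disappeared, so recursive cleanup in a shutdown hook must guard against it.

```scala
import java.io.File

// Null-safe recursive delete for shutdown-hook cleanup.
def deleteRecursively(file: File): Unit = {
  val children = file.listFiles()
  if (children != null) children.foreach(deleteRecursively) // null if already gone
  file.delete()
}
```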
spark.mesos.mesosExecutor.cores when launching Mesos executors (regression) (cherry picked from commit 03e8d0a) backported to branch-1.5 /cc andrewor14 Author: Iulian Dragos <[email protected]> Closes apache#8732 from dragos/issue/mesos/fine-grained-maxExecutorCores-1.5.
Compare: e982c06 to 3789ac8
liancheng pushed a commit that referenced this pull request on Mar 10, 2016:
…nerate

## What changes were proposed in this pull request?

An analysis exception occurs while running the following query:

```
SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
```

```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
'Project ['ints]
+- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
   +- SubqueryAlias nestedarray
      +- LocalRelation [a#0], [[[[1,2,3]]]]
```

## How was this patch tested?

Added new unit tests in SQLQuerySuite and HiveQlSuite.

Author: Dilip Biswal <[email protected]> Closes apache#11538 from dilipbiswal/SPARK-13698.
liancheng pushed a commit that referenced this pull request on Apr 26, 2016:
…onfig option.

## What changes were proposed in this pull request?

Currently, the `OptimizeIn` optimizer replaces an `In` expression with an `InSet` expression if the size of the set is greater than a constant, 10. This issue aims to add a configuration, `spark.sql.optimizer.inSetConversionThreshold`, for that. After this PR, `OptimizeIn` is configurable.

```scala
scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")

scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
:     +- INPUT
+- Generate explode([1,2]), false, false, [a#16]
   +- Scan OneRowRelation[]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase).

Author: Dongjoon Hyun <[email protected]> Closes apache#12562 from dongjoon-hyun/SPARK-14796.
liancheng pushed a commit that referenced this pull request on May 3, 2016:
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**

```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**

```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase).

Author: Dongjoon Hyun <[email protected]> Closes apache#12590 from dongjoon-hyun/SPARK-14830.
liancheng pushed a commit that referenced this pull request on Jun 10, 2016:
## What changes were proposed in this pull request?

This issue adds a new optimizer, `ReorderAssociativeOperator`, by taking advantage of the integral associative property. Currently, Spark works like the following:

1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`.
2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`.

This PR can handle Case 2 for **Add/Multiply** expressions whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The following is the plan comparison between `before` and `after` this issue.

**Before**

```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```

**After**

```scala
scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain
== Physical Plan ==
WholeStageCodegen
:  +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8]
:     +- INPUT
+- Generate explode([1]), false, false, [a#7]
   +- Scan OneRowRelation[]

scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain
== Physical Plan ==
*Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19]
+- Generate explode([1]), false, false, [a#18]
   +- Scan OneRowRelation[]
```

This PR was greatly generalized by cloud-fan's key ideas; he should be credited for the work he did.

## How was this patch tested?

Pass the Jenkins tests, including the new testsuite.

Author: Dongjoon Hyun <[email protected]> Closes apache#12850 from dongjoon-hyun/SPARK-15076.
liancheng pushed a commit that referenced this pull request on Oct 28, 2016:
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**

```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**

```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase).

Author: Dongjoon Hyun <[email protected]> Closes apache#12590 from dongjoon-hyun/SPARK-14830. (cherry picked from commit 6e63201) Signed-off-by: Michael Armbrust <[email protected]>
Please refer to SPARK-10400 for more details.
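Per the JIRA ticket, the option behind `SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC` ultimately became `spark.sql.parquet.writeLegacyFormat`; a usage sketch under that assumption (note the inverted sense: "legacy" means the Spark 1.4-compatible layout):

```scala
import org.apache.spark.sql.SQLContext

def configureParquetLayout(sqlContext: SQLContext): Unit = {
  // Old (removed by this PR): "spark.sql.parquet.followParquetFormatSpec"
  // New name: "true" writes the Spark 1.4-compatible legacy layout,
  // "false" (the default) follows the Parquet format spec.
  sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
}
```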