[SPARK-2890][SQL] Allow reading of data when case insensitive resolution could cause possible ambiguity. #2209


Closed
wants to merge 2 commits into from

Conversation

marmbrus
Contributor

Throwing an error in the constructor makes it impossible to run queries, even when there is no actual ambiguity. Remove this check in favor of throwing an error during analysis, when the query actually is ambiguous.

Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging

@yhuai
Contributor

yhuai commented Aug 30, 2014

Does reading Parquet files in HiveContext trigger the problem? If we have two columns c1 and C1, we will not be able to read C1 when we are using case-insensitive resolution, right?

val deduplicatedFields = convertedFields.groupBy(_.name).map {
  case (fieldName, versions) if versions.size == 1 => versions.head
  case (fieldName, versions) if versions.size > 1 =>
    logWarning(s"Resolving attributes case insensitively is ambiguous for $fieldName")
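A minimal, self-contained sketch of the deduplication in the snippet above, using a simplified stand-in for Spark SQL's StructField (the field names and the `deduplicate` helper here are illustrative, not the actual Spark API):

```scala
// Simplified stand-in for Spark SQL's StructField.
case class StructField(name: String, dataType: String, nullable: Boolean)

// Group fields by their (already lower-cased) name; when several columns
// collapse to the same name, keep the first and warn, as the patch does.
def deduplicate(convertedFields: Seq[StructField]): Seq[StructField] =
  convertedFields.groupBy(_.name).map {
    case (_, versions) if versions.size == 1 => versions.head
    case (fieldName, versions) =>
      println(s"Resolving attributes case insensitively is ambiguous for $fieldName")
      versions.head
  }.toSeq

val fields = Seq(
  StructField("c1", "IntegerType", nullable = true),
  StructField("c1", "StringType", nullable = true), // was C1 before lower-casing
  StructField("c2", "LongType", nullable = true))

val deduplicated = deduplicate(fields)
// Only one field per lower-cased name survives.
```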

Could we provide more information on which column (with the original column name) we will keep in the lowerCaseSchema?

@marmbrus
Contributor Author

I actually encountered the error with a jsonRDD, but yeah it could happen with parquet files as well. Your comment about joins though makes me think that we should just get rid of this check entirely. We can throw an error when your query is invalid, but throwing an exception just because at some point in a query something could be ambiguous seems overly restrictive.

@yhuai
Contributor

yhuai commented Aug 31, 2014

Sounds good. I was not sure how to correctly query those results with ambiguous schemas when I added that check. It seems a more informative log entry is better than an exception.

@SparkQA

SparkQA commented Sep 5, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 6, 2014

Tests timed out after a configured wait of 120m.

@marmbrus
Contributor Author

Jenkins, test this please.

    StructField(f.name.toLowerCase(), lowerCaseSchema(f.dataType), f.nullable)))
val convertedFields = fields.map(f =>
  StructField(f.name.toLowerCase, lowerCaseSchema(f.dataType), f.nullable))
val deduplicatedFields = convertedFields.groupBy(_.name).map {

Ahh, this reorders the schema and breaks things. Props to @andyk.
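The reordering mentioned here can be reproduced with a minimal sketch (hypothetical field names): `groupBy` returns a `Map`, whose iteration order is unrelated to the input order, so rebuilding a schema from its output can silently reorder columns.

```scala
// Simplified stand-in for a schema field.
case class StructField(name: String)

val fields = Seq("b", "a", "c").map(StructField(_))

// groupBy returns a Map; mapping over it rebuilds the fields, but their
// positional order is no longer guaranteed to match the original schema,
// which breaks positional consumers of the schema.
val rebuilt = fields.groupBy(_.name).map(_._2.head).toSeq
// Same set of fields, possibly in a different order.
```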

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2209 at commit 729cca4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 13, 2014

Tests timed out after a configured wait of 120m.

@JoshRosen
Contributor

Jenkins will actually show you how long the tests took, which can be helpful in narrowing down why we're seeing these timeouts. In this case, it looks like the majority of the time is spent in certain Hive compatibility tests:

https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/90/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/

@marmbrus
Contributor Author

@JoshRosen I am hoping that #2164 will fix the test time outs.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2209 at commit 729cca4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2209 at commit 729cca4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder extends compression.Encoder[IntegerType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])
    • class Encoder extends compression.Encoder[LongType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])

@marmbrus
Contributor Author

Merged to master. Thanks for looking this over!
