[SPARK-2890][SQL] Allow reading of data when case insensitive resolution could cause possible ambiguity. #2209


Closed
wants to merge 2 commits into from

Conversation

marmbrus
Contributor

Throwing an error in the constructor makes it impossible to run queries, even when there is no actual ambiguity. Remove this check in favor of throwing an error during analysis, when the query actually is ambiguous.

Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 29, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging

@yhuai
Contributor

yhuai commented Aug 30, 2014

Does reading Parquet files in HiveContext trigger the problem? If we have two columns c1 and C1, we will not be able to read C1 when we are using case-insensitive resolution, right?

val deduplicatedFields = convertedFields.groupBy(_.name).map {
  case (fieldName, versions) if versions.size == 1 => versions.head
  case (fieldName, versions) if versions.size > 1 =>
    logWarning(s"Resolving attributes case insensitively is ambiguous for $fieldName")
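A minimal, self-contained sketch of the deduplication in the snippet above, using a simplified stand-in for Spark SQL's StructField (the field names and the `deduplicate` helper here are illustrative, not the actual Spark API):

```scala
// Simplified stand-in for Spark SQL's StructField.
case class StructField(name: String, dataType: String, nullable: Boolean)

// Group fields by their (already lower-cased) name; when several columns
// collapse to the same name, keep the first and warn, as the patch does.
def deduplicate(convertedFields: Seq[StructField]): Seq[StructField] =
  convertedFields.groupBy(_.name).map {
    case (_, versions) if versions.size == 1 => versions.head
    case (fieldName, versions) =>
      println(s"Resolving attributes case insensitively is ambiguous for $fieldName")
      versions.head
  }.toSeq

val fields = Seq(
  StructField("c1", "IntegerType", nullable = true),
  StructField("c1", "StringType", nullable = true), // was C1 before lower-casing
  StructField("c2", "LongType", nullable = true))

val deduplicated = deduplicate(fields)
// Only one field per lower-cased name survives.
```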

Could we provide more information on which column (with the original column name) we will keep in the lowerCaseSchema?

@marmbrus
Contributor Author

I actually encountered the error with a jsonRDD, but yeah it could happen with parquet files as well. Your comment about joins though makes me think that we should just get rid of this check entirely. We can throw an error when your query is invalid, but throwing an exception just because at some point in a query something could be ambiguous seems overly restrictive.

@yhuai
Contributor

yhuai commented Aug 31, 2014

Sounds good. I was not sure how to correctly query those results with ambiguous schemas when I added that check. It seems a more informative log entry is better than an exception.

@SparkQA

SparkQA commented Sep 5, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 6, 2014

Tests timed out after a configured wait of 120m.

@marmbrus
Contributor Author

Jenkins, test this please.

    StructField(f.name.toLowerCase(), lowerCaseSchema(f.dataType), f.nullable)))
val convertedFields = fields.map(f =>
  StructField(f.name.toLowerCase, lowerCaseSchema(f.dataType), f.nullable))
val deduplicatedFields = convertedFields.groupBy(_.name).map {

Ahh, this reorders the schema and breaks things. Props to @andyk.
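The reordering mentioned here can be reproduced with a minimal sketch (hypothetical field names): `groupBy` returns a `Map`, whose iteration order is unrelated to the input order, so rebuilding a schema from its output can silently reorder columns.

```scala
// Simplified stand-in for a schema field.
case class StructField(name: String)

val fields = Seq("b", "a", "c").map(StructField(_))

// groupBy returns a Map; mapping over it rebuilds the fields, but their
// positional order is no longer guaranteed to match the original schema,
// which breaks positional consumers of the schema.
val rebuilt = fields.groupBy(_.name).map(_._2.head).toSeq
// Same set of fields, possibly in a different order.
```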

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2209 at commit a703ff4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2209 at commit a703ff4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class LowerCaseSchema(child: LogicalPlan) extends UnaryNode with Logging

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2209 at commit 729cca4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 13, 2014

Tests timed out after a configured wait of 120m.

@JoshRosen
Contributor

Jenkins will actually show you how long the tests took, which can be helpful in narrowing down why we're seeing these timeouts. In this case, it looks like the majority of the time is spent in certain Hive compatibility tests:

https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/90/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/

@marmbrus
Contributor Author

@JoshRosen I am hoping that #2164 will fix the test time outs.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2209 at commit 729cca4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2209 at commit 729cca4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder extends compression.Encoder[IntegerType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])
    • class Encoder extends compression.Encoder[LongType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])

@marmbrus
Contributor Author

Merged to master. Thanks for looking this over!
