
[SPARK-42655][SQL] Incorrect ambiguous column reference error #40258


Closed
wants to merge 1 commit into from

Conversation

@shrprasa (Contributor) commented Mar 2, 2023

What changes were proposed in this pull request?
The result of attribute resolution should consider only unique values for the reference. If it has duplicate values, it will incorrectly result in an ambiguous-reference error.

Why are the changes needed?
The query below fails incorrectly with an ambiguous-reference error.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

Before the fix, attributes matched were:
attributes: Vector(id#17, id#17)
Thus, it throws ambiguous reference error. But if we consider only unique matches, it will return correct result.
unique attributes: Vector(id#17)

Does this PR introduce any user-facing change?
Yes. Users migrating from Spark 2.3 to 3.x will hit this error, as the scenario worked in Spark 2.3 but fails in Spark 3.2. After the fix, it will work correctly as it did in Spark 2.3.

How was this patch tested?
Added unit test.

@github-actions github-actions bot added the SQL label Mar 2, 2023
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch 3 times, most recently from a637f83 to d40293e Compare March 3, 2023 10:12
@srowen (Member) left a comment

I think this is too drastic and the wrong fix - you're actually changing the col names, and only on select. It's just the error that would ideally show the original col names right?

@shrprasa (Contributor Author) commented Mar 3, 2023

@srowen Please ignore that change; it was work in progress to check a few things.
The reason we get the ambiguous error in the scenario below, and why it's not correct, is that attribute resolution returns two values, but both values are the same. Thus, it should not throw an ambiguous-reference error.

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

Before the fix, attributes matched were:
attributes: Vector(id#17, id#17)
Thus, it throws ambiguous reference error. But if we consider only unique matches, it will return correct result.
unique attributes: Vector(id#17)

@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from d40293e to 5d91223 Compare March 3, 2023 18:14
@shrprasa shrprasa changed the title [WIP][SPARK-42655]:Incorrect ambiguous column reference error [SPARK-42655][SQL]:Incorrect ambiguous column reference error Mar 3, 2023
@shrprasa shrprasa requested a review from srowen March 3, 2023 18:21
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch 2 times, most recently from e7114e7 to ea5fe9b Compare March 3, 2023 18:32
@shrprasa (Contributor Author) commented Mar 4, 2023

@srowen @dongjoon-hyun Can you please review this PR?

@shrprasa (Contributor Author) commented Mar 7, 2023

Gentle Ping @srowen @dongjoon-hyun @mridulm @HyukjinKwon

@HyukjinKwon HyukjinKwon changed the title [SPARK-42655][SQL]:Incorrect ambiguous column reference error [SPARK-42655][SQL] Incorrect ambiguous column reference error Mar 7, 2023
@srowen (Member) commented Mar 7, 2023

I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'

@shrprasa (Contributor Author) commented Mar 7, 2023

I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'

Thanks for replying. Can you please tag someone who would be the right person to review this change?

@shrprasa (Contributor Author) commented Mar 8, 2023

Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR?

@yaooqinn (Member) commented Mar 8, 2023

Can you try set spark.sql.caseSensitive=true?

@shrprasa (Contributor Author) commented Mar 8, 2023

Can you try set spark.sql.caseSensitive=true?

Yes, I have tried it. With caseSensitive set to true, it works, since id and ID are then treated as separate columns.
The issue arises when column names are supposed to be treated as case-insensitive.
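For context on the workaround under discussion, the setting can be flipped at runtime. This is a minimal sketch assuming a spark-shell session where `spark` is the active SparkSession:

```scala
// Assumes a spark-shell session where `spark` is the active SparkSession.
spark.conf.set("spark.sql.caseSensitive", "true")

// Equivalently, via SQL:
spark.sql("SET spark.sql.caseSensitive=true")

// With case sensitivity enabled, "id" and "ID" resolve to distinct columns,
// so df3.select("id") no longer hits the ambiguity check.
```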

@yaooqinn (Member) commented Mar 8, 2023

You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected.

@shrprasa (Contributor Author) commented Mar 8, 2023

You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected.

In the physical plan, both the id and ID columns are projected from the same column in the dataframe: _1#6
_1#6 AS id#17, _1#6 AS ID#17
So, there is no ambiguity.

Also, the matched attribute results are the same: attributes: Vector(id#17, id#17)
Just because we have duplicates in the matched result, it's being considered ambiguous.

If the matched attribute result were Vector(id#17, ID#17), then it would have been a valid error.

And even if the dataset has columns in different cases, Spark, being case-insensitive by default, should consider both columns the same.

@srowen (Member) commented Mar 8, 2023

I don't get it, it is due to case sensitivity; that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column right? that's what I was saying. Or: does your change still result in an error without case sensitivity? it should

@shrprasa (Contributor Author) commented Mar 9, 2023

I don't get it, it is due to case sensitivity; that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column right? that's what I was saying. Or: does your change still result in an error without case sensitivity? it should

The issue is not with the error message. The problem is that in this case the error should not be thrown; the select query should return a result. After this change, the ambiguous error will not be thrown, as we fix the duplicate attribute match.

@srowen (Member) commented Mar 9, 2023

Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns

@shrprasa (Contributor Author) commented Mar 9, 2023

Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns

It's not ambiguous because, when we are selecting using a list of column names, both id and ID get their value from the same column 'id' in the source dataframe.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

@srowen (Member) commented Mar 9, 2023

That isn't relevant. You are selecting from a DataFrame with cols id and ID. Imagine for instance they do not come from the same source, it's clearly ambiguous. It wouldn't make sense if it were different in this case.

@shrprasa (Contributor Author) commented Mar 9, 2023

It's very much relevant, as this is the only case that requires the fix. If they do not come from the same source, the plan will reflect that, and the ambiguous error will still be thrown even after this fix.

@srowen (Member) commented Mar 9, 2023

Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example, imagine a DataFrame defined by SELECT 3 as id, 3 as ID. Would you also say selecting "id" is unambiguous? and it makes sense to you if I change a 3 to a 4 that this query is no longer semantically valid?

@shrprasa (Contributor Author)

Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example, imagine a DataFrame defined by SELECT 3 as id, 3 as ID. Would you also say selecting "id" is unambiguous? and it makes sense to you if I change a 3 to a 4 that this query is no longer semantically valid?

If it's valid as per the plan then yes.

@shrprasa (Contributor Author) commented Mar 13, 2023

Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who can review this PR.

@shrprasa (Contributor Author)

df3.select("id").show()

@cloud-fan The example you have shared will behave the same even after this fix. It will give ambiguous error.
The use case which the fix is trying to solve is different. Can you please try these two cases:
Case 1: which works fine
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df3 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df3.select("id").show()

Case 2: which doesn't work fine and the fix is to solve this issue
val df2 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df4 = df2.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df4.select("id").show()

@yaooqinn (Member)

@shrprasa
At the dataset definition phase, especially for intermediate datasets, Spark is lenient/lazy with case sensitivity, because the checks happen during SQL analysis, which is not required for defining a Dataset. This gives the user more freedom, but also more room for confusion. In the read phase, on the other hand, SQL analysis is a mandatory step and the checks are performed, so the configuration provided by Spark at this stage is sufficient to resolve all ambiguities.

@cloud-fan (Contributor)

@shrprasa do you know how the case 1 works?

@shrprasa (Contributor Author)

@shrprasa do you know how the case 1 works?

Yes. It works because the resolved column has just one match:
attributes: Vector(id#17)

But for the second case, the match result is
attributes: Vector(id#17, id#17)
Since there is more than one value, even though both are exactly the same, it fails. This fix proposes to take the distinct values of the match result.
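The dedup argument can be illustrated with plain Scala collections; the `Attr` case class below is a simplified stand-in for illustration, not Spark's actual `AttributeReference`:

```scala
// Simplified stand-in for Spark's AttributeReference (illustration only).
case class Attr(name: String, exprId: Long)

// Case like the one above: both matches are literally the same attribute,
// so .distinct collapses them to a single, unambiguous match.
val matches = Vector(Attr("id", 17L), Attr("id", 17L))
val unique = matches.distinct
assert(unique == Vector(Attr("id", 17L)))

// A genuinely ambiguous reference (two different source columns, hence
// different expression IDs) still yields more than one match after .distinct.
val stillAmbiguous = Vector(Attr("id", 17L), Attr("id", 18L)).distinct
assert(stillAmbiguous.size == 2)
```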

@cloud-fan (Contributor)

It works because the resolved column has just one match

But there are two id columns. Does Spark already do deduplication somewhere?

@shrprasa (Contributor Author)

It works because the resolved column has just one match

But there are two id columns. Does Spark already do deduplication somewhere?

I'm not sure about deduplication earlier, but even if it happens at some stage, in the second use case the column name may not have been lowercased by that point, which is why the two columns id and ID would still be treated as different.
Only in the end result of the column match do we see that both matches are the same id#17.

@@ -258,7 +258,7 @@ package object expressions {
case (Seq(), _) =>
val name = nameParts.head
val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
(attributes.filterNot(_.qualifiedAccessOnly), nameParts.tail)
(attributes.distinct.filterNot(_.qualifiedAccessOnly), nameParts.tail)
Contributor

shall we fix def unique in this class? It should look at expr Id.

Contributor Author

The unique method is not used in this flow; it's used in many places when returning the result. Any change to unique would increase the scope.

@cloud-fan (Contributor)

I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.

@shrprasa (Contributor Author) commented Mar 24, 2023

I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.

@cloud-fan As I said in a previous comment:
I wasn't sure about deduplication, but even if it happens at some stage, in the second use case the column name might not have been lowercased by that time, which is why the two columns id and ID would still be treated as different. Only in the end result of the column match do we see that both matches are the same id#17.
The speculation was right: dedup is happening in the unique method.

For case 1:
unique before:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569, id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))
unique after:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))

For Case 2:
unique before:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))
unique after:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))

In most places we call unique before returning the result. So what's the negative impact you think it would have if we return unique results for the column match as well?

One positive outcome is that it fixes this wrong ambiguous error, which is thrown just because the result of the match has two duplicate values.
attributes:: Vector(id#18608, id#18608)
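Given the maps above, the exprId-keyed dedup that was suggested for `unique` can be sketched in plain Scala; `Attr` and `uniqueByExprId` are hypothetical stand-ins, not Spark's actual types:

```scala
// Hypothetical stand-in for Spark's AttributeReference (illustration only).
case class Attr(name: String, exprId: Long)

// Dedup keyed on exprId rather than on the whole attribute, so that
// id#18608 and ID#18608 (same underlying column, different display name)
// collapse to one entry, as in Case 2 above.
def uniqueByExprId(attrs: Seq[Attr]): Seq[Attr] =
  attrs.distinctBy(_.exprId)

val byName = Seq(Attr("id", 18608L), Attr("ID", 18608L))
assert(byName.distinct.size == 2)        // plain distinct keeps both: names differ
assert(uniqueByExprId(byName).size == 1) // exprId-keyed dedup collapses them
```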

@shrprasa (Contributor Author)

FWIW Both the use cases were working fine in Spark 2.3

@shrprasa (Contributor Author)

@cloud-fan Can you please check my last comments?

@cloud-fan (Contributor)

FWIW Both the use cases were working fine in Spark 2.3

Sorry I missed this point. Do you know how it worked in 2.3? Did 2.3 also call distinct before returning the result?

@cloud-fan (Contributor)

according to the code in 2.3, I think we should call distinct in line 345

@shrprasa (Contributor Author)

according to the code in 2.3, I think we should call distinct in line 345

@cloud-fan
Yes, that should also work, but making the change there would extend its impact to many more scenarios, whereas the place where I applied distinct keeps the scope very limited.

@cloud-fan (Contributor)

If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree to make code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.

@shrprasa (Contributor Author)

If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree to make code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.

Ok, I will update the PR with suggested change.

@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from ea5fe9b to b2da643 Compare March 31, 2023 15:34
@shrprasa shrprasa requested a review from cloud-fan March 31, 2023 15:41
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from b2da643 to e4f003a Compare March 31, 2023 16:46
@shrprasa (Contributor Author) commented Apr 1, 2023

@cloud-fan I have made the change. All Tests have passed. Can you please review?

@shrprasa (Contributor Author) commented Apr 4, 2023

Gentle ping @cloud-fan

@cloud-fan (Contributor) commented Apr 4, 2023

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in b283c6a Apr 4, 2023
cloud-fan pushed a commit that referenced this pull request Apr 4, 2023
Closes #40258 from shrprasa/col_ambiguous_issue.

Authored-by: Shrikant Prasad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b283c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@shrprasa (Contributor Author) commented Apr 4, 2023

Thanks a lot @cloud-fan for the guidance and support in getting this issue fixed.

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
Closes apache#40258 from shrprasa/col_ambiguous_issue.

Authored-by: Shrikant Prasad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b283c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@bsikander (Contributor)

@shrprasa do you think this issue is similar to the issue that I just posted: https://stackoverflow.com/questions/77553257/select-behavior-different-between-pyspark-2-4-8-and-3-3-2

Trying to understand the behavior.

5 participants