
[SPARK-42655][SQL] Incorrect ambiguous column reference error #40258


Closed
wants to merge 1 commit into from

Conversation

@shrprasa (Contributor) commented Mar 2, 2023

What changes were proposed in this pull request?
The result of attribute resolution should consider only unique values for the reference. If it has duplicate values, it will incorrectly result in an ambiguous-reference error.

Why are the changes needed?
The query below fails incorrectly with an ambiguous-reference error.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

Before the fix, attributes matched were:
attributes: Vector(id#17, id#17)
Thus, it throws ambiguous reference error. But if we consider only unique matches, it will return correct result.
unique attributes: Vector(id#17)

Does this PR introduce any user-facing change?
Yes. Users migrating from Spark 2.3 to 3.x will hit this error, as the scenario worked in Spark 2.3 but fails in Spark 3.2. After the fix, it will work correctly as it did in Spark 2.3.

How was this patch tested?
Added unit test.

@github-actions github-actions bot added the SQL label Mar 2, 2023
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch 3 times, most recently from a637f83 to d40293e Compare March 3, 2023 10:12
@srowen (Member) left a comment

I think this is too drastic and the wrong fix - you're actually changing the col names, and only on select. It's just the error that would ideally show the original col names right?

@shrprasa (Contributor Author) commented Mar 3, 2023

@srowen Please ignore that change; it was work in progress to check a few things.
The reason we get the ambiguous error in the scenario below, and why it's not correct, is that attribute resolution returns two values, but both values are the same. Thus, it should not throw an ambiguous-reference error.

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

Before the fix, attributes matched were:
attributes: Vector(id#17, id#17)
Thus, it throws ambiguous reference error. But if we consider only unique matches, it will return correct result.
unique attributes: Vector(id#17)

@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from d40293e to 5d91223 Compare March 3, 2023 18:14
@shrprasa shrprasa changed the title [WIP][SPARK-42655]:Incorrect ambiguous column reference error [SPARK-42655][SQL]:Incorrect ambiguous column reference error Mar 3, 2023
@shrprasa shrprasa requested a review from srowen March 3, 2023 18:21
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch 2 times, most recently from e7114e7 to ea5fe9b Compare March 3, 2023 18:32
@shrprasa (Contributor Author) commented Mar 4, 2023

@srowen @dongjoon-hyun Can you please review this PR?

@shrprasa (Contributor Author) commented Mar 7, 2023

Gentle Ping @srowen @dongjoon-hyun @mridulm @HyukjinKwon

@HyukjinKwon HyukjinKwon changed the title [SPARK-42655][SQL]:Incorrect ambiguous column reference error [SPARK-42655][SQL] Incorrect ambiguous column reference error Mar 7, 2023
@srowen (Member) commented Mar 7, 2023

I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'

@shrprasa (Contributor Author) commented Mar 7, 2023

I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'

Thanks for replying. Can you please tag someone who would be the right person to review this change?

@shrprasa (Contributor Author) commented Mar 8, 2023

Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR?

@yaooqinn (Member) commented Mar 8, 2023

Can you try set spark.sql.caseSensitive=true?

@shrprasa (Contributor Author) commented Mar 8, 2023

Can you try set spark.sql.caseSensitive=true?

Yes, I have tried it. With caseSensitive set to true, it works, since id and ID are then treated as separate columns.
The issue arises when column names are supposed to be treated as case-insensitive.
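For context on the workaround under discussion, the setting can be flipped at runtime. This is a minimal sketch assuming a spark-shell session where `spark` is the active SparkSession:

```scala
// Assumes a spark-shell session where `spark` is the active SparkSession.
spark.conf.set("spark.sql.caseSensitive", "true")

// Equivalently, via SQL:
spark.sql("SET spark.sql.caseSensitive=true")

// With case sensitivity enabled, "id" and "ID" resolve to distinct columns,
// so df3.select("id") no longer hits the ambiguity check.
```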

@yaooqinn (Member) commented Mar 8, 2023

You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected.

@shrprasa (Contributor Author) commented Mar 8, 2023

You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected.

In the physical plan, both the id and ID columns are projected from the same column in the dataframe: _1#6
_1#6 AS id#17, _1#6 AS ID#17
So, there is no ambiguity.

Also, the matched attribute results are the same: attributes: Vector(id#17, id#17)
Just because we have duplicates in the matched result, it's being considered ambiguous.

If the matched attribute result were Vector(id#17, ID#17), then it would have been a valid error.

And even if the dataset has columns in different cases, Spark, being case-insensitive by default, should consider both columns the same.

@srowen (Member) commented Mar 8, 2023

I don't get it, it is due to case sensitivity; that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column right? that's what I was saying. Or: does your change still result in an error without case sensitivity? it should

@shrprasa (Contributor Author) commented Mar 9, 2023

I don't get it, it is due to case sensitivity; that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column right? that's what I was saying. Or: does your change still result in an error without case sensitivity? it should

The issue is not with the error message. The problem is that in this case the error should not be thrown; the select query should return a result. After this change, the ambiguous error will not be thrown, as we fix the duplicate attribute match.

@srowen (Member) commented Mar 9, 2023

Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns

@shrprasa (Contributor Author) commented Mar 9, 2023

Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns

It's not ambiguous because, when we are selecting using a list of column names, both id and ID get their value from the same column 'id' in the source dataframe.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

@srowen (Member) commented Mar 9, 2023

That isn't relevant. You are selecting from a DataFrame with cols id and ID. Imagine for instance they do not come from the same source, it's clearly ambiguous. It wouldn't make sense if it were different in this case.

@shrprasa (Contributor Author) commented Mar 9, 2023

It's very much relevant, as this is the only case that requires the fix. If they do not come from the same source, the plan will reflect that, and the ambiguous error will still be thrown even after this fix.

@srowen (Member) commented Mar 9, 2023

Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example, imagine a DataFrame defined by SELECT 3 as id, 3 as ID. Would you also say selecting "id" is unambiguous? and it makes sense to you if I change a 3 to a 4 that this query is no longer semantically valid?

@shrprasa (Contributor Author)

Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example, imagine a DataFrame defined by SELECT 3 as id, 3 as ID. Would you also say selecting "id" is unambiguous? and it makes sense to you if I change a 3 to a 4 that this query is no longer semantically valid?

If it's valid as per the plan then yes.

@shrprasa (Contributor Author) commented Mar 13, 2023

Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who can review this PR.

@shrprasa (Contributor Author)

df3.select("id").show()

@cloud-fan The example you have shared will behave the same even after this fix. It will give ambiguous error.
The use case which the fix is trying to solve is different. Can you please try these two cases:
Case 1: which works fine
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df3 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df3.select("id").show()

Case 2: which doesn't work fine and the fix is to solve this issue
val df2 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df4 = df2.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df4.select("id").show()

@yaooqinn (Member)

@shrprasa
At the dataset definition phase, especially for intermediate datasets, Spark is lenient/lazy with case sensitivity, because the checks happen during SQL analysis, which is not required for defining a Dataset. This gives the user more freedom, but also more room for confusion. In the read phase, on the other hand, SQL analysis is a mandatory step and the checks are performed, so the configuration provided by Spark at this stage is sufficient to resolve all ambiguities.

@cloud-fan (Contributor)

@shrprasa do you know how the case 1 works?

@shrprasa (Contributor Author)

@shrprasa do you know how the case 1 works?

Yes. It works because the resolved column has just one match:
attributes: Vector(id#17)

But for the second case, the match result is
attributes: Vector(id#17, id#17)
Since there is more than one value, even though both are exactly the same, it fails. This fix proposes to take the distinct values of the match result.
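The dedup argument can be illustrated with plain Scala collections; the `Attr` case class below is a simplified stand-in for illustration, not Spark's actual `AttributeReference`:

```scala
// Simplified stand-in for Spark's AttributeReference (illustration only).
case class Attr(name: String, exprId: Long)

// Case like the one above: both matches are literally the same attribute,
// so .distinct collapses them to a single, unambiguous match.
val matches = Vector(Attr("id", 17L), Attr("id", 17L))
val unique = matches.distinct
assert(unique == Vector(Attr("id", 17L)))

// A genuinely ambiguous reference (two different source columns, hence
// different expression IDs) still yields more than one match after .distinct.
val stillAmbiguous = Vector(Attr("id", 17L), Attr("id", 18L)).distinct
assert(stillAmbiguous.size == 2)
```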

@cloud-fan (Contributor)

It works because the resolved column has just one match

But there are two id columns. Does Spark already do deduplication somewhere?

@shrprasa (Contributor Author)

It works because the resolved column has just one match

But there are two id columns. Does Spark already do deduplication somewhere?

I'm not sure about deduplication earlier, but even if it happens at some stage, in the second use case the column name may not have been lowercased by that point, which is why the two columns id and ID would still be treated as different.
Only in the end result of the column match do we see that both matches are the same id#17.

@@ -258,7 +258,7 @@ package object expressions {
case (Seq(), _) =>
val name = nameParts.head
val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
(attributes.filterNot(_.qualifiedAccessOnly), nameParts.tail)
(attributes.distinct.filterNot(_.qualifiedAccessOnly), nameParts.tail)
Contributor

shall we fix def unique in this class? It should look at expr Id.

Contributor Author

The unique method is not used in this flow; it's used in many places when returning the result. Any change to unique would increase the scope.

@cloud-fan (Contributor)

I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.

@shrprasa (Contributor Author) commented Mar 24, 2023

I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.

@cloud-fan As I said in a previous comment:
I wasn't sure about deduplication, but even if it happens at some stage, in the second use case the column name might not have been lowercased by that time, which is why the two columns id and ID would still be treated as different. Only in the end result of the column match do we see that both matches are the same id#17.
The speculation was right: dedup is happening in the unique method.

For case 1:
unique before:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569, id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))
unique after:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))

For Case 2:
unique before:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))
unique after:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))

In most places we call unique before returning the result. So what's the negative impact you think it would have if we return unique results for the column match as well?

One positive outcome is that it fixes this wrong ambiguous error, which is thrown just because the result of the match has two duplicate values.
attributes:: Vector(id#18608, id#18608)
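Given the maps above, the exprId-keyed dedup that was suggested for `unique` can be sketched in plain Scala; `Attr` and `uniqueByExprId` are hypothetical stand-ins, not Spark's actual types:

```scala
// Hypothetical stand-in for Spark's AttributeReference (illustration only).
case class Attr(name: String, exprId: Long)

// Dedup keyed on exprId rather than on the whole attribute, so that
// id#18608 and ID#18608 (same underlying column, different display name)
// collapse to one entry, as in Case 2 above.
def uniqueByExprId(attrs: Seq[Attr]): Seq[Attr] =
  attrs.distinctBy(_.exprId)

val byName = Seq(Attr("id", 18608L), Attr("ID", 18608L))
assert(byName.distinct.size == 2)        // plain distinct keeps both: names differ
assert(uniqueByExprId(byName).size == 1) // exprId-keyed dedup collapses them
```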

@shrprasa (Contributor Author)

FWIW Both the use cases were working fine in Spark 2.3

@shrprasa (Contributor Author)

@cloud-fan Can you please check my last comments?

@cloud-fan (Contributor)

FWIW Both the use cases were working fine in Spark 2.3

Sorry I missed this point. Do you know how it worked in 2.3? Did 2.3 also call distinct before returning the result?

@cloud-fan (Contributor)

according to the code in 2.3, I think we should call distinct in line 345

@shrprasa (Contributor Author)

according to the code in 2.3, I think we should call distinct in line 345

@cloud-fan
Yes, that should also work, but making the change there would extend its impact to many more scenarios, whereas the place where I applied distinct keeps the scope very limited.

@cloud-fan (Contributor)

If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree to make code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.

@shrprasa (Contributor Author)

If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree to make code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.

Ok, I will update the PR with suggested change.

@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from ea5fe9b to b2da643 Compare March 31, 2023 15:34
@shrprasa shrprasa requested a review from cloud-fan March 31, 2023 15:41
@shrprasa shrprasa force-pushed the col_ambiguous_issue branch from b2da643 to e4f003a Compare March 31, 2023 16:46
@shrprasa (Contributor Author) commented Apr 1, 2023

@cloud-fan I have made the change. All Tests have passed. Can you please review?

@shrprasa (Contributor Author) commented Apr 4, 2023

Gentle ping @cloud-fan

@cloud-fan (Contributor) commented Apr 4, 2023

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in b283c6a Apr 4, 2023
cloud-fan pushed a commit that referenced this pull request Apr 4, 2023
Closes #40258 from shrprasa/col_ambiguous_issue.

Authored-by: Shrikant Prasad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b283c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@shrprasa (Contributor Author) commented Apr 4, 2023

Thanks a lot @cloud-fan for the guidance and support in getting this issue fixed.

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
Closes apache#40258 from shrprasa/col_ambiguous_issue.

Authored-by: Shrikant Prasad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b283c6a)
Signed-off-by: Wenchen Fan <[email protected]>
@bsikander (Contributor)

@shrprasa do you think this issue is similar to the issue that I just posted: https://stackoverflow.com/questions/77553257/select-behavior-different-between-pyspark-2-4-8-and-3-3-2

Trying to understand the behavior.

5 participants