[SPARK-42655][SQL] Incorrect ambiguous column reference error #40258
Conversation
I think this is too drastic and the wrong fix: you're actually changing the column names, and only on select. It's just the error that would ideally show the original column names, right?
@srowen Please ignore that change. It was work in progress to check a few things.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
df3.explain()
Before the fix, the attributes matched were: Vector(id#17, id#17)
@srowen @dongjoon-hyun Can you please review this PR?
Gentle ping @srowen @dongjoon-hyun @mridulm @HyukjinKwon
I'm not sure about the change, and not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'.
Thanks for replying. Can you please tag someone who would be the right person to review this change?
Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR?
Can you try `spark.sql.caseSensitive`?
Yes, I have tried it. With caseSensitive set to true it will work, as id and ID will then be treated as separate columns.
You first defined a case-sensitive data set, then queried in a case-insensitive way; I guess the error is expected.
In the physical plan, both the id and ID columns are projected from the same column in the dataframe: _1#6. Also, the matched attributes are the same: attributes: Vector(id#17, id#17). If the match result were Vector(id#17, ID#17), the error would have been valid. And even if the dataset has columns in different cases, Spark, being case-insensitive by default, should consider both columns the same.
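The duplicate match described above can be sketched with a small self-contained model. The `Attr` class and the lowercase index below are illustrative stand-ins, not Spark's actual `AttributeReference` or resolution code; the point is only that when both projections carry the same underlying attribute, a case-insensitive lookup of "id" returns that one attribute twice:

```scala
// Illustrative stand-in for Spark's AttributeReference: equality covers
// the name and the expression ID, so two references to the same
// attribute compare equal.
case class Attr(name: String, exprId: Long)

object AmbiguityDemo extends App {
  // df3's output in this sketch: the "id" and "ID" projections both carry
  // the same attribute id#17, mirroring the matched Vector(id#17, id#17).
  val idAttr = Attr("id", 17)
  val output = Seq(idAttr, Attr("col2", 18), Attr("col3", 19),
                   Attr("col4", 20), Attr("col5", 21), idAttr)

  // Case-insensitive resolution: index the output by lowercased name.
  val direct: Map[String, Seq[Attr]] = output.groupBy(_.name.toLowerCase)

  val matches = direct("id")
  assert(matches.size == 2)          // two hits -> spurious "ambiguous" error
  assert(matches.distinct.size == 1) // dedup leaves one real candidate
}
```

In this model the ambiguity is purely an artifact of indexing the same attribute under two spellings, which is why deduplicating the matches resolves it.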
I don't get it; it is due to case sensitivity, and that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column, right? That's what I was saying. Or: does your change still result in an error without case sensitivity? It should.
The issue is not with the error message. The problem is that in this case the error should not be thrown; the select query should return a result. After this change, the ambiguous error will not be thrown, as we are fixing the duplicate attribute match.
Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns.
It's not ambiguous because, when we are selecting using a list of column names, both id and ID get their value from the same column 'id' in the source dataframe, as df3.explain() shows.
That isn't relevant. You are selecting from a DataFrame with columns id and ID. Imagine, for instance, that they do not come from the same source; it's clearly ambiguous. It wouldn't make sense for the behavior to differ in this case.
It's very much relevant, as this is the only case which requires the fix. If they do not come from the same source, the plan will reflect that, and it will throw the ambiguous error even after this fix.
Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example: imagine a DataFrame defined by
If it's valid as per the plan, then yes.
Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR, or direct it to someone who can?
@cloud-fan The example you have shared will behave the same even after this fix: it will give the ambiguous error. Case 2 is the one which doesn't work, and the fix is to solve that issue.
@shrprasa
@shrprasa do you know how case 1 works?
Yes. It works because the resolved column has just one match, but for the second case, the match result is Vector(id#17, id#17).
But there are two id columns. Does Spark already do deduplication somewhere?
Not sure about the deduplication before, but even if it was doing it at some stage, in the second use case it might not have converted the column names to lowercase by that time; that is why it would still treat the id and ID columns as different.
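The timing point in that comment can be made concrete with a sketch (plain strings here, not Spark code): deduplicating column names before lowercasing leaves "id" and "ID" as two entries, while lowercasing first lets the two spellings collapse into one:

```scala
object DedupTimingDemo extends App {
  val cols = Seq("id", "col2", "col3", "col4", "col5", "ID")

  // Dedup before lowercasing: "id" and "ID" are distinct strings,
  // so two entries for the same underlying column survive.
  val dedupFirst = cols.distinct.map(_.toLowerCase)
  assert(dedupFirst.count(_ == "id") == 2)

  // Lowercase first, then dedup: the two spellings become one entry.
  val lowerFirst = cols.map(_.toLowerCase).distinct
  assert(lowerFirst.count(_ == "id") == 1)
}
```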
```diff
@@ -258,7 +258,7 @@ package object expressions {
       case (Seq(), _) =>
         val name = nameParts.head
         val attributes = collectMatches(name, direct.get(name.toLowerCase(Locale.ROOT)))
-        (attributes.filterNot(_.qualifiedAccessOnly), nameParts.tail)
+        (attributes.distinct.filterNot(_.qualifiedAccessOnly), nameParts.tail)
```
Shall we fix `def unique` in this class? It should look at the exprId.
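The suggestion to make `unique` look at the exprId can be illustrated with a small sketch (`Attr` and `uniqueByExprId` are illustrative names, not Spark's actual classes or API). Plain `distinct` uses full structural equality, so two attributes that share an exprId but differ in name casing would still count as two candidates, while an exprId-aware dedup collapses them:

```scala
// Illustrative stand-in for Spark's AttributeReference.
case class Attr(name: String, exprId: Long)

object UniqueDemo extends App {
  // The same underlying column (exprId 17) referenced under two spellings.
  val matches = Seq(Attr("id", 17), Attr("ID", 17))

  // distinct compares whole values, so the differing names keep both.
  assert(matches.distinct.size == 2)

  // An exprId-aware dedup keeps one candidate per underlying column.
  def uniqueByExprId(attrs: Seq[Attr]): Seq[Attr] =
    attrs.groupBy(_.exprId).values.map(_.head).toSeq

  assert(uniqueByExprId(matches).size == 1)
}
```

This is why exprId-based deduplication covers more cases than `.distinct`: it treats the two spellings as one column even when the names are not byte-identical.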
The unique method is not used in this flow. It's used in many places while returning the result, so making any changes to unique would increase the scope.
I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.
@cloud-fan As I had said in my previous comment about case 1 and case 2: in most of the places we are calling unique before returning the result. So what's the negative impact you think it will have if we return unique results for the column match as well? One positive outcome is that it will fix this wrong ambiguous error being thrown just because the result of the match has two duplicate values.
FWIW, both use cases were working fine in Spark 2.3.
@cloud-fan Can you please check my last comments?
Sorry, I missed this point. Do you know how it worked in 2.3? Did 2.3 also call
According to the code in 2.3, I think we should call
@cloud-fan
If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree with making code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.
Ok, I will update the PR with the suggested change.
@cloud-fan I have made the change. All tests have passed. Can you please review?
Gentle ping @cloud-fan
Thanks, merging to master/3.4!
**What changes were proposed in this pull request?**
The result of attribute resolution should consider only unique values for the reference. If it has duplicate values, it will incorrectly result in an ambiguous reference error.

**Why are the changes needed?**
The below query fails incorrectly due to an ambiguous reference error.

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()

org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.

df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]

Before the fix, the attributes matched were:
attributes: Vector(id#17, id#17)
Thus, it throws the ambiguous reference error. But if we consider only unique matches, it returns the correct result:
unique attributes: Vector(id#17)

**Does this PR introduce any user-facing change?**
Yes. Users migrating from Spark 2.3 to 3.x will face this error, as the scenario used to work in Spark 2.3 but fails in Spark 3.2. After the fix, it will work correctly as it did in Spark 2.3.

**How was this patch tested?**
Added unit test.

Closes #40258 from shrprasa/col_ambiguous_issue.
Authored-by: Shrikant Prasad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b283c6a)
Signed-off-by: Wenchen Fan <[email protected]>
Thanks a lot @cloud-fan for the guidance and support in getting this issue fixed.
@shrprasa do you think this issue is similar to the one I just posted: https://stackoverflow.com/questions/77553257/select-behavior-different-between-pyspark-2-4-8-and-3-3-2 ? Trying to understand the behavior.