[SPARK-10442][SQL] fix string to boolean cast #8698
Test build #42267 has finished for PR 8698 at commit
cc @yhuai, looks like no Hive compatibility tests are broken :)
Would be nice to have a test for a persisted table partitioned by a boolean column:
sqlContext.range(2).selectExpr("(id % 2 = 0) as b", "id").write.partitionBy("b").saveAsTable("t")
sqlContext.table("t").show()
Currently this snippet produces a wrong answer (all boolean values are |
Although this change doesn't break any existing Hive compatibility tests, it's still a breaking change. We might want a separate SQL option that lets users fall back to the old behavior. The partitioned table case should be fixed in a separate PR (don't use |
A compatibility option would be reasonable. My vote would be for the |
if (!ctx.mutableStates.exists(_._2 == "trueStrings")) {
  ctx.addMutableState("java.util.Set", "trueStrings",
    """
      trueStrings = new java.util.HashSet();
can't this be a static variable somewhere?
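A minimal sketch of what a static variable could look like here, written in Java because the generated code in the hunk above is Java. The class name `BooleanStrings`, its members, and the contents of the sets are assumptions for illustration only; the hunk does not show which strings the PR actually puts in `trueStrings`, and this is not Spark's API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder class (name and members are illustrative, not Spark's API).
public final class BooleanStrings {
    // Built once at class-load time and shared by all callers, instead of
    // being re-created as mutable state in each generated class.
    // The set contents are an assumption, not taken from the PR diff.
    public static final Set<String> TRUE_STRINGS = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList("t", "true", "y", "yes", "1")));

    public static final Set<String> FALSE_STRINGS = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList("f", "false", "n", "no", "0")));

    private BooleanStrings() {
        // No instances: this class only holds static lookups.
    }

    // Exact, case-sensitive membership checks.
    public static boolean isTrueString(String s) {
        return TRUE_STRINGS.contains(s);
    }

    public static boolean isFalseString(String s) {
        return FALSE_STRINGS.contains(s);
    }
}
```

With a holder like this, generated code would only need to call `BooleanStrings.isTrueString(...)` rather than register and initialize a set per generated class.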
I think having a config flag is reasonable. If we want that, it should be a single flag that dictates whether we follow Hive or our own standards. However, in this case it seems like too much work to introduce a flag, and the benefit isn't huge yet. So I would just follow Vertica's approach.
+1 to @rxin 's suggestions |
e7b50f6 to 8706165
Test build #42311 has finished for PR 8698 at commit
LGTM |
LGTM. I am merging it to master. |
When we cast a string to boolean in Hive, it returns true if the length of the string is > 0, and Spark SQL follows this behavior.
However, this behavior is very different from other SQL systems:
- true for 't' 'true' '1', false for 'f' 'false' '0', throw exception for others.
- true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
- true for 't' 'true' 'y' 'yes' 'on' '1', false for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
- true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
Whether we should change the cast behavior to match other SQL systems is not decided yet; this PR is a test to see how many compatibility tests would fail if we did.
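To make the difference concrete, here is a small Java sketch contrasting the Hive-style rule (any non-empty string is true) with the "null for others" variant that accepts 't' 'true' 'y' 'yes' '1' and 'f' 'false' 'n' 'no' '0'. The class and method names are hypothetical. Returning null for a null input and matching case-sensitively are assumptions, since the description does not specify either.

```java
// Hypothetical illustration of two cast rules; not Spark's or Hive's actual code.
public final class StringToBooleanCast {
    private StringToBooleanCast() {}

    // Hive-style rule from the description: true if the string length is > 0.
    // Assumption: a null input yields null.
    public static Boolean hiveStyle(String s) {
        if (s == null) {
            return null;
        }
        return s.length() > 0;
    }

    // "null for others" rule: fixed sets of true/false strings, null otherwise.
    // Assumption: matching is exact and case-sensitive.
    public static Boolean nullForOthers(String s) {
        if (s == null) {
            return null;
        }
        switch (s) {
            case "t":
            case "true":
            case "y":
            case "yes":
            case "1":
                return Boolean.TRUE;
            case "f":
            case "false":
            case "n":
            case "no":
            case "0":
                return Boolean.FALSE;
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hiveStyle("false"));      // prints "true"
        System.out.println(nullForOthers("false"));  // prints "false"
        System.out.println(nullForOthers("maybe"));  // prints "null"
    }
}
```

The first line of output shows why the Hive-style rule is surprising: the string 'false' is non-empty, so it casts to true.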