Skip to content

[SPARK-10442][SQL] fix string to boolean cast #8698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

When we cast string to boolean in hive, it returns true if the length of string is > 0, and spark SQL follows this behavior.

However, this behavior is very different from other SQL systems:

  1. presto will return true for 't' 'true' '1', false for 'f' 'false' '0', throw exception for others.
  2. redshift will return true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
  3. postgresql will return true for 't' 'true' 'y' 'yes' 'on' '1', false for 'f' 'false' 'n' 'no' 'off' '0', throw exception for others.
  4. vertica will return true for 't' 'true' 'y' 'yes' '1', false for 'f' 'false' 'n' 'no' '0', null for others.
  5. impala throw exception when try to cast string to boolean.
  6. mysql, oracle, sqlserver don't have boolean type

Whether we should change the cast behavior according to other SQL system or not is not decided yet, this PR is a test to see if we changed, how many compatibility tests will fail.

@SparkQA
Copy link

SparkQA commented Sep 10, 2015

Test build #42267 has finished for PR 8698 at commit e7b50f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cc @yhuai, looks no hive compatibility tests is broken :)

@liancheng
Copy link
Contributor

Would be nice to have a test for a persisted table partitioned by a boolean column.

sqlContext.range(2).selectExpr("(id % 2 = 0) as b", "id").write.partitionBy("b").saveAsTable("t")
sqlContext.table("t").show()

Currently this snippet produces wrong answer (all boolean values are true).

@liancheng
Copy link
Contributor

Although this change doesn't break any existing Hive compatibility tests, it's still a breaking change. We might want to have a separate SQL option to let users be able to fallback to the old behavior. The partitioned table case should be fixed in a separate PR (don't use Cast for boolean columns).

@marmbrus
Copy link
Contributor

A compatibility option would be reasonable. My vote would be for the vertica option. @rxin ?

if (!ctx.mutableStates.exists(_._2 == "trueStrings")) {
ctx.addMutableState("java.util.Set", "trueStrings",
"""
trueStrings = new java.util.HashSet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this be a static variable somewhere?

@rxin
Copy link
Contributor

rxin commented Sep 10, 2015

I think having a config flag is reasonable. If we want that, it should be a single flag that dictates whether we should follow Hive, or our own standards.

However, in this case it seems it is too much work to bring a flag, and the benefit isn't huge yet. So I would just follow Vertica's approach.

@marmbrus
Copy link
Contributor

+1 to @rxin 's suggestions

@cloud-fan cloud-fan changed the title [SPARK-10442][SQL][WIP] fix string to boolean cast [SPARK-10442][SQL] fix string to boolean cast Sep 11, 2015
@SparkQA
Copy link

SparkQA commented Sep 11, 2015

Test build #42311 has finished for PR 8698 at commit 8706165.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Copy link
Contributor

LGTM

@yhuai
Copy link
Contributor

yhuai commented Sep 11, 2015

LGTM. I am merging it to master.

@asfgit asfgit closed this in d5d6473 Sep 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants