[SPARK-26424][SQL] Use java.time API in date/timestamp expressions #23358


Closed
wants to merge 25 commits into master from MaxGekk:new-time-cast

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 20, 2018

What changes were proposed in this pull request?

In this PR, I propose to switch the DateFormatClass, ToUnixTimestamp, FromUnixTime and UnixTime expressions to the java.time API for parsing/formatting dates and timestamps. The API is already implemented by the TimestampFormatter/DateFormatter classes. One benefit is that those classes support parsing timestamps with microsecond precision. The old behaviour can be switched back on via the SQL config spark.sql.legacy.timeParser.enabled (false by default).
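To illustrate the microsecond-precision point, here is a minimal Java sketch (the class name is illustrative, not from the PR): java.time can parse fractional seconds beyond milliseconds, which the legacy SimpleDateFormat-based path cannot.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosParse {
    public static void main(String[] args) {
        // Six 'S' letters parse exactly six fractional-second digits.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ts = LocalDateTime.parse("2018-12-20 10:11:12.123456", fmt);
        // The full microsecond value survives the parse.
        System.out.println(ts.get(ChronoField.MICRO_OF_SECOND)); // 123456
    }
}
```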

How was this patch tested?

It was tested by existing test suites - DateFunctionsSuite, DateExpressionsSuite, JsonSuite, CsvSuite, SQLQueryTestSuite as well as PySpark tests.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100346 has finished for PR 23358 at commit 964780f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100363 has finished for PR 23358 at commit 9cd8f33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100369 has finished for PR 23358 at commit 30d9226.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.appendPattern(pattern)
.parseDefaulting(ChronoField.YEAR_OF_ERA, 1970)
.parseDefaulting(ChronoField.ERA, 1)
Member Author

Era is required in STRICT mode

Contributor

is 1 a reasonable default value for ERA?

Member Author

I think so. This is our current era: https://docs.oracle.com/javase/8/docs/api/java/time/temporal/ChronoField.html#ERA : "The value of the era that was active on 1970-01-01 (ISO) must be assigned the value 1."
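A minimal Java sketch of the defaulting discussed here (class name is illustrative): with ResolverStyle.STRICT, a pattern using 'y' (year-of-era) cannot be resolved without an ERA value, so ERA is defaulted to 1 (the era active on 1970-01-01 in the ISO chronology, i.e. CE).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.ResolverStyle;
import java.time.temporal.ChronoField;

public class EraDefault {
    public static void main(String[] args) {
        // Without the ERA default, STRICT resolution of 'yyyy' fails,
        // because year-of-era alone cannot be turned into a year.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .appendPattern("yyyy-MM-dd")
            .parseDefaulting(ChronoField.ERA, 1)
            .toFormatter()
            .withResolverStyle(ResolverStyle.STRICT);
        System.out.println(LocalDate.parse("2018-12-20", fmt)); // 2018-12-20
    }
}
```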

.appendPattern(pattern)
.parseDefaulting(ChronoField.YEAR_OF_ERA, 1970)
Member Author

@MaxGekk MaxGekk Dec 21, 2018

A year must always be present in timestamps/dates. The probability that a user is satisfied with the default value 1970 is pretty low. I don't think that a user who wants to parse, say, 14 Nov means 14 Nov 1970. I would guess the current year, but that approach is error prone.

.parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
.parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
Member Author

@MaxGekk MaxGekk Dec 21, 2018

Hours must always be present in the time part. A default value causes a conflict if the timestamp pattern contains a (AM or PM). If there are no hours at all, we set the time part to zero later.
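The conflict can be reproduced with a short Java sketch (class name illustrative): if HOUR_OF_DAY is defaulted to 0 while the pattern carries a clock-hour plus AM/PM, the resolver derives its own HOUR_OF_DAY and the two values clash.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Locale;

public class HourDefaultConflict {
    public static void main(String[] args) {
        // 'hh:mm a' parses CLOCK_HOUR_OF_AMPM + AMPM_OF_DAY, not HOUR_OF_DAY,
        // so the default 0 is injected and then conflicts with the resolved 22.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .appendPattern("hh:mm a")
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .toFormatter(Locale.US);
        try {
            fmt.parse("10:30 PM");
            System.out.println("no conflict");
        } catch (DateTimeParseException e) {
            System.out.println("conflict");
        }
    }
}
```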

}

protected def toInstantWithZoneId(temporalAccessor: TemporalAccessor, zoneId: ZoneId): Instant = {
val localDateTime = LocalDateTime.from(temporalAccessor)
val localTime = if (temporalAccessor.query(TemporalQueries.localTime) == null) {
Member Author

If the parsed timestamp does not have a time part at all, set all of its fields (hours, minutes, seconds, etc.) to zero.
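A small Java sketch of the check (class name illustrative): for a date-only parse, the localTime query returns null, which is the cue to substitute midnight for the time part.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

public class LocalTimeQuery {
    public static void main(String[] args) {
        // No time fields are parsed here, so the query yields null.
        TemporalAccessor parsed =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").parse("2018-12-20");
        LocalTime time = parsed.query(TemporalQueries.localTime);
        System.out.println(time == null ? "midnight" : time.toString());
    }
}
```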

*/
@throws(classOf[ParseException])
@throws(classOf[DateTimeParseException])
@throws(classOf[DateTimeException])
Member Author

@MaxGekk MaxGekk Dec 21, 2018

These annotations are required by whole-stage codegen; otherwise I got an error that some exception catch blocks are not reachable/used.

@@ -36,7 +50,8 @@ class Iso8601TimestampFormatter(
pattern: String,
timeZone: TimeZone,
locale: Locale) extends TimestampFormatter with DateTimeFormatterHelper {
private val formatter = buildFormatter(pattern, locale)
@transient
private lazy val formatter = buildFormatter(pattern, locale)
Member Author

The Iso8601TimestampFormatter class became serializable, but the built formatter is still not.

@@ -27,7 +29,19 @@ import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.sql.internal.SQLConf

sealed trait TimestampFormatter {
sealed trait TimestampFormatter extends Serializable {
Member Author

Making it serializable; otherwise I got a "task not serializable" exception from the generated code.
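The @transient lazy val pattern used above can be sketched in plain Java (class and method names are illustrative): the wrapper is Serializable, while the non-serializable DateTimeFormatter is transient and rebuilt on first use after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SerializableFormatter implements Serializable {
    private final String pattern;
    private transient DateTimeFormatter formatter; // not serialized

    public SerializableFormatter(String pattern) { this.pattern = pattern; }

    private DateTimeFormatter formatter() {
        if (formatter == null) { // lazily (re)built after deserialization
            formatter = DateTimeFormatter.ofPattern(pattern);
        }
        return formatter;
    }

    public String format(LocalDate d) { return formatter().format(d); }

    public static void main(String[] args) throws Exception {
        SerializableFormatter f = new SerializableFormatter("yyyy/MM/dd");
        // Round-trip through Java serialization, as a Spark task closure would be.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(f);
        oos.close();
        SerializableFormatter f2 = (SerializableFormatter) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(f2.format(LocalDate.of(2018, 12, 21))); // 2018/12/21
    }
}
```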

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100371 has finished for PR 23358 at commit e4baad6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 21, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100393 has finished for PR 23358 at commit 2851ede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100398 has finished for PR 23358 at commit 2ac29d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

As @hvanhovell mentioned offline, the implementation based on java.time classes changes behavior since it uses IsoChronology, which is in fact the proleptic Gregorian calendar. This could cause some problems when manipulating old dates. I am going to mention that in the migration guide.

I think we can support other calendars, like the Julian calendar, in the future by using libraries such as ThreeTen-Extra.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

R build failed:

> devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  there is no package called 'pkgbuild'

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100419 has finished for PR 23358 at commit d9d3616.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100424 has finished for PR 23358 at commit d9d3616.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100426 has finished for PR 23358 at commit 0e1afc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The migration guide update LGTM.

Do we have an example to show the calendar problem? AFAIK SQL standard follows Gregorian calendar, so we shouldn't support Julian calendar.

@MaxGekk
Member Author

MaxGekk commented Dec 25, 2018

Do we have an example to show the calendar problem?

There is a shift in dates:

[image: side-by-side comparison of Julian and Gregorian calendars]

Also, the number of days in a year differs: 365.2425 days in the Gregorian calendar vs 365.25 days in the Julian calendar. So, if you do day arithmetic, you get different results (in years) between the two calendars.
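The shift can be demonstrated with a short Java sketch (class name illustrative): the legacy java.util.GregorianCalendar switches to the Julian calendar before 1582-10-15, while java.time's LocalDate uses the proleptic Gregorian calendar throughout, so the same instant maps to different dates.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarShift {
    public static void main(String[] args) {
        // 1000-01-01 interpreted by the hybrid (Julian-before-cutover) calendar.
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(1000, 0, 1);
        long epochDay = Math.floorDiv(cal.getTimeInMillis(), 86_400_000L);
        // The same instant interpreted by the proleptic Gregorian calendar
        // lands on a different day of January 1000.
        LocalDate gregorian = LocalDate.ofEpochDay(epochDay);
        System.out.println(
            gregorian.equals(LocalDate.of(1000, 1, 1)) ? "same" : "shifted");
    }
}
```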

@cloud-fan
Contributor

LGTM

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 7c7fccf Dec 27, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

In the PR, I propose to switch the `DateFormatClass`, `ToUnixTimestamp`, `FromUnixTime`, `UnixTime` on java.time API for parsing/formatting dates and timestamps. The API has been already implemented by the `Timestamp`/`DateFormatter` classes. One of benefit is those classes support parsing timestamps with microsecond precision. Old behaviour can be switched on via SQL config: `spark.sql.legacy.timeParser.enabled` (`false` by default).

## How was this patch tested?

It was tested by existing test suites - `DateFunctionsSuite`, `DateExpressionsSuite`, `JsonSuite`, `CsvSuite`, `SQLQueryTestSuite` as well as PySpark tests.

Closes apache#23358 from MaxGekk/new-time-cast.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

This PR fixes the codegen bug introduced by apache#23358 .

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/158/

```
Line 44, Column 93: A method named "apply" is not declared in any enclosing class
nor any supertype, nor through a static import
```

## How was this patch tested?

Manual. `DateExpressionsSuite` should be passed with Scala-2.11.

Closes apache#23394 from dongjoon-hyun/SPARK-26424.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
jackylee-ch pushed the same two commits to jackylee-ch/spark on Feb 18, 2019.
@MaxGekk MaxGekk deleted the new-time-cast branch August 17, 2019 13:35