[SPARK-26424][SQL] Use java.time API in date/timestamp expressions #23358


Closed
wants to merge 25 commits into master from MaxGekk:new-time-cast

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 20, 2018

What changes were proposed in this pull request?

In this PR, I propose to switch the DateFormatClass, ToUnixTimestamp, FromUnixTime and UnixTime expressions to the java.time API for parsing/formatting dates and timestamps. The API is already implemented by the TimestampFormatter/DateFormatter classes. One benefit is that those classes support parsing timestamps with microsecond precision. The old behaviour can be switched back on via the SQL config spark.sql.legacy.timeParser.enabled (false by default).
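To illustrate the microsecond-precision point, here is a minimal Java sketch (the class name is illustrative, not from the PR): java.time can parse fractional seconds beyond milliseconds, which the legacy SimpleDateFormat-based path cannot.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoField;

public class MicrosParse {
    public static void main(String[] args) {
        // Six 'S' letters parse exactly six fractional-second digits.
        DateTimeFormatter fmt =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime ts = LocalDateTime.parse("2018-12-20 10:11:12.123456", fmt);
        // The full microsecond value survives the parse.
        System.out.println(ts.get(ChronoField.MICRO_OF_SECOND)); // 123456
    }
}
```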

How was this patch tested?

It was tested by existing test suites - DateFunctionsSuite, DateExpressionsSuite, JsonSuite, CsvSuite, SQLQueryTestSuite as well as PySpark tests.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100346 has finished for PR 23358 at commit 964780f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100363 has finished for PR 23358 at commit 9cd8f33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100369 has finished for PR 23358 at commit 30d9226.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.appendPattern(pattern)
.parseDefaulting(ChronoField.YEAR_OF_ERA, 1970)
.parseDefaulting(ChronoField.ERA, 1)
Member Author

Era is required in STRICT mode

Contributor

is 1 a reasonable default value for ERA?

Member Author

I think so. This is our current era: https://docs.oracle.com/javase/8/docs/api/java/time/temporal/ChronoField.html#ERA : "The value of the era that was active on 1970-01-01 (ISO) must be assigned the value 1."
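A minimal Java sketch of the defaulting discussed here (class name is illustrative): with ResolverStyle.STRICT, a pattern using 'y' (year-of-era) cannot be resolved without an ERA value, so ERA is defaulted to 1 (the era active on 1970-01-01 in the ISO chronology, i.e. CE).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.ResolverStyle;
import java.time.temporal.ChronoField;

public class EraDefault {
    public static void main(String[] args) {
        // Without the ERA default, STRICT resolution of 'yyyy' fails,
        // because year-of-era alone cannot be turned into a year.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .appendPattern("yyyy-MM-dd")
            .parseDefaulting(ChronoField.ERA, 1)
            .toFormatter()
            .withResolverStyle(ResolverStyle.STRICT);
        System.out.println(LocalDate.parse("2018-12-20", fmt)); // 2018-12-20
    }
}
```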

.appendPattern(pattern)
.parseDefaulting(ChronoField.YEAR_OF_ERA, 1970)
Member Author

@MaxGekk MaxGekk Dec 21, 2018

A year must always be present in timestamps/dates. The probability that a user is satisfied with the default value 1970 is pretty low. I don't think that a user who wants to parse, say, 14 Nov means 14 Nov 1970. I would guess the current year, but that approach is error prone.

.parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
.parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
Member Author

@MaxGekk MaxGekk Dec 21, 2018

Hours must always be present in the time part. A default value causes a conflict if the timestamp pattern contains a (AM or PM). If there are no hours at all, we set the time part to zero later.
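The conflict can be reproduced with a short Java sketch (class name illustrative): if HOUR_OF_DAY is defaulted to 0 while the pattern carries a clock-hour plus AM/PM, the resolver derives its own HOUR_OF_DAY and the two values clash.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Locale;

public class HourDefaultConflict {
    public static void main(String[] args) {
        // 'hh:mm a' parses CLOCK_HOUR_OF_AMPM + AMPM_OF_DAY, not HOUR_OF_DAY,
        // so the default 0 is injected and then conflicts with the resolved 22.
        DateTimeFormatter fmt = new DateTimeFormatterBuilder()
            .appendPattern("hh:mm a")
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .toFormatter(Locale.US);
        try {
            fmt.parse("10:30 PM");
            System.out.println("no conflict");
        } catch (DateTimeParseException e) {
            System.out.println("conflict");
        }
    }
}
```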

}

protected def toInstantWithZoneId(temporalAccessor: TemporalAccessor, zoneId: ZoneId): Instant = {
val localDateTime = LocalDateTime.from(temporalAccessor)
val localTime = if (temporalAccessor.query(TemporalQueries.localTime) == null) {
Member Author

If the parsed timestamp does not have a time part at all, set all of its fields (hours, minutes, seconds, etc.) to zero.
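A small Java sketch of the check (class name illustrative): for a date-only parse, the localTime query returns null, which is the cue to substitute midnight for the time part.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalQueries;

public class LocalTimeQuery {
    public static void main(String[] args) {
        // No time fields are parsed here, so the query yields null.
        TemporalAccessor parsed =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").parse("2018-12-20");
        LocalTime time = parsed.query(TemporalQueries.localTime);
        System.out.println(time == null ? "midnight" : time.toString());
    }
}
```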

*/
@throws(classOf[ParseException])
@throws(classOf[DateTimeParseException])
@throws(classOf[DateTimeException])
Member Author

@MaxGekk MaxGekk Dec 21, 2018

These annotations are required by whole-stage codegen; otherwise I got an error that some exception catch blocks are not reachable/used.

@@ -36,7 +50,8 @@ class Iso8601TimestampFormatter(
pattern: String,
timeZone: TimeZone,
locale: Locale) extends TimestampFormatter with DateTimeFormatterHelper {
private val formatter = buildFormatter(pattern, locale)
@transient
private lazy val formatter = buildFormatter(pattern, locale)
Member Author

The Iso8601TimestampFormatter class became serializable, but the built formatter is still not.

@@ -27,7 +29,19 @@ import org.apache.commons.lang3.time.FastDateFormat

import org.apache.spark.sql.internal.SQLConf

sealed trait TimestampFormatter {
sealed trait TimestampFormatter extends Serializable {
Member Author

Making it serializable; otherwise I got a "task not serializable" exception from the generated code.
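The @transient lazy val pattern used above can be sketched in plain Java (class and method names are illustrative): the wrapper is Serializable, while the non-serializable DateTimeFormatter is transient and rebuilt on first use after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SerializableFormatter implements Serializable {
    private final String pattern;
    private transient DateTimeFormatter formatter; // not serialized

    public SerializableFormatter(String pattern) { this.pattern = pattern; }

    private DateTimeFormatter formatter() {
        if (formatter == null) { // lazily (re)built after deserialization
            formatter = DateTimeFormatter.ofPattern(pattern);
        }
        return formatter;
    }

    public String format(LocalDate d) { return formatter().format(d); }

    public static void main(String[] args) throws Exception {
        SerializableFormatter f = new SerializableFormatter("yyyy/MM/dd");
        // Round-trip through Java serialization, as a Spark task closure would be.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(f);
        oos.close();
        SerializableFormatter f2 = (SerializableFormatter) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(f2.format(LocalDate.of(2018, 12, 21))); // 2018/12/21
    }
}
```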

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100371 has finished for PR 23358 at commit e4baad6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 21, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100393 has finished for PR 23358 at commit 2851ede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100398 has finished for PR 23358 at commit 2ac29d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

As @hvanhovell mentioned offline, the implementation based on java.time classes changes behavior since it uses IsoChronology, which is in fact the proleptic Gregorian calendar. This could cause some problems when manipulating old dates. I am going to mention that in the migration guide.

I think we can support other calendars, like the Julian calendar, in the future by using libraries such as ThreeTen-Extra.

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

R build failed:

> devtools::install_version('testthat', version = '1.0.2', repos='http://cran.us.r-project.org')
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  there is no package called 'pkgbuild'

@MaxGekk
Member Author

MaxGekk commented Dec 24, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100419 has finished for PR 23358 at commit d9d3616.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100424 has finished for PR 23358 at commit d9d3616.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 24, 2018

Test build #100426 has finished for PR 23358 at commit 0e1afc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The migration guide update LGTM.

Do we have an example to show the calendar problem? AFAIK SQL standard follows Gregorian calendar, so we shouldn't support Julian calendar.

@MaxGekk
Member Author

MaxGekk commented Dec 25, 2018

Do we have an example to show the calendar problem?

There is a shift in dates:

[image: side-by-side comparison of Julian and Gregorian calendars]

Also, the number of days in a year differs: 365.2425 days in the Gregorian calendar vs 365.25 days in the Julian calendar. So, if you do day arithmetic, you get different results (in years) between the two calendars.
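The shift can be demonstrated with a short Java sketch (class name illustrative): the legacy java.util.GregorianCalendar switches to the Julian calendar before 1582-10-15, while java.time's LocalDate uses the proleptic Gregorian calendar throughout, so the same instant maps to different dates.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarShift {
    public static void main(String[] args) {
        // 1000-01-01 interpreted by the hybrid (Julian-before-cutover) calendar.
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(1000, 0, 1);
        long epochDay = Math.floorDiv(cal.getTimeInMillis(), 86_400_000L);
        // The same instant interpreted by the proleptic Gregorian calendar
        // lands on a different day of January 1000.
        LocalDate gregorian = LocalDate.ofEpochDay(epochDay);
        System.out.println(
            gregorian.equals(LocalDate.of(1000, 1, 1)) ? "same" : "shifted");
    }
}
```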

@cloud-fan
Contributor

LGTM

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 7c7fccf Dec 27, 2018
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

In the PR, I propose to switch the `DateFormatClass`, `ToUnixTimestamp`, `FromUnixTime`, `UnixTime` on java.time API for parsing/formatting dates and timestamps. The API has been already implemented by the `Timestamp`/`DateFormatter` classes. One of benefit is those classes support parsing timestamps with microsecond precision. Old behaviour can be switched on via SQL config: `spark.sql.legacy.timeParser.enabled` (`false` by default).

## How was this patch tested?

It was tested by existing test suites - `DateFunctionsSuite`, `DateExpressionsSuite`, `JsonSuite`, `CsvSuite`, `SQLQueryTestSuite` as well as PySpark tests.

Closes apache#23358 from MaxGekk/new-time-cast.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
## What changes were proposed in this pull request?

This PR fixes the codegen bug introduced by apache#23358 .

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/158/

```
Line 44, Column 93: A method named "apply" is not declared in any enclosing class
nor any supertype, nor through a static import
```

## How was this patch tested?

Manual. `DateExpressionsSuite` should be passed with Scala-2.11.

Closes apache#23394 from dongjoon-hyun/SPARK-26424.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
jackylee-ch pushed the same two commits to jackylee-ch/spark on Feb 18, 2019.
@MaxGekk MaxGekk deleted the new-time-cast branch August 17, 2019 13:35