[SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext #4562

Closed · wants to merge 3 commits
Changes from all commits
@@ -287,7 +287,11 @@ case class ParquetRelation2(
   }
 }

-parquetSchema = maybeSchema.getOrElse(readSchema())
+try {
+  parquetSchema = readSchema().getOrElse(maybeSchema.getOrElse(maybeMetastoreSchema.get))
+} catch {
+  case e => throw new SparkException(s"Failed to find schema for ${paths.mkString(",")}", e)
+}
Contributor:

How about this:

parquetSchema = {
  if (maybeSchema.isDefined) {
    maybeSchema.get
  } else {
    (readSchema(), maybeMetastoreSchema) match {
      case (Some(dataSchema), _) => dataSchema
      case (None, Some(metastoreSchema)) => metastoreSchema
      case (None, None) =>
        throw new SparkException("Failed to get the schema.")
    }
  }
}

We first check whether maybeSchema is defined. If not, we read the schema from the existing data. If no data exists yet, we are dealing with a newly created empty table, and we use the maybeMetastoreSchema defined in the options.
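For comparison, the same precedence chain can be written with Option.orElse. This is only an editor's sketch, assuming readSchema() returns Option[StructType] as it does after this PR; it is not code from the discussion:

parquetSchema =
  maybeSchema                      // 1. an explicitly supplied schema wins
    .orElse(readSchema())          // 2. otherwise read it from existing Parquet footers
    .orElse(maybeMetastoreSchema)  // 3. otherwise a newly created empty table: use the metastore schema
    .getOrElse(throw new SparkException("Failed to get the schema."))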

Contributor:

Also, it seems we do not need the try ... catch here.

Contributor (author):

After reading the source code, I am wondering whether maybeMetastoreSchema is redundant; shouldn't it always be converted into maybeSchema when the ParquetRelation2 instance is created?

Contributor:

Based on Cheng's comment at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L194, I think it is better to keep maybeMetastoreSchema and just fix the bug for now.


 partitionKeysIncludedInParquetSchema =
   isPartitioned &&

@@ -308,7 +312,7 @@ case class ParquetRelation2(
   }
 }

-private def readSchema(): StructType = {
+private def readSchema(): Option[StructType] = {
   // Sees which file(s) we need to touch in order to figure out the schema.
   val filesToTouch =
   // Always tries the summary files first if users don't require a merged schema. In this case,

@@ -611,8 +615,9 @@ object ParquetRelation2 {
 // internally.
 private[sql] val METASTORE_SCHEMA = "metastoreSchema"

-private[parquet] def readSchema(footers: Seq[Footer], sqlContext: SQLContext): StructType = {
-  footers.map { footer =>
+private[parquet] def readSchema(
+    footers: Seq[Footer], sqlContext: SQLContext): Option[StructType] = {
+  val mergedSchema = footers.map { footer =>
     val metadata = footer.getParquetMetadata.getFileMetaData
     val parquetSchema = metadata.getSchema
     val maybeSparkSchema = metadata

@@ -630,11 +635,14 @@ object ParquetRelation2 {
         sqlContext.conf.isParquetBinaryAsString,
         sqlContext.conf.isParquetINT96AsTimestamp))
   }
-  }.reduce { (left, right) =>
-    try left.merge(right) catch { case e: Throwable =>
-      throw new SparkException(s"Failed to merge incompatible schemas $left and $right", e)
-    }
+  }.foldLeft[StructType](null) {
Contributor:

How about using None instead of null?

Contributor (author):

Yeah, I tried that too, but using null seems simpler, since Option requires some extra value-extraction code.

Contributor:

All right. Instead of putting a large code block inside an Option, how about using a temporary val and then wrapping it in Option at the end of this method? (An Option-based variant is sketched after this hunk.)

+    case (null, right) => right
+    case (left, right) => try left.merge(right) catch { case e: Throwable =>
+      throw new SparkException(s"Failed to merge incompatible schemas $left and $right", e)
+    }
+  }
+
+  Option(mergedSchema)
 }
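Editor's note: a null-free variant along the lines the reviewer suggests could fold over Option directly. A minimal sketch, where footerSchemas stands for the Seq[StructType] produced by the map above (the name is hypothetical):

val mergedSchema: Option[StructType] =
  footerSchemas.foldLeft(Option.empty[StructType]) {
    case (None, right) => Some(right)  // first schema becomes the seed
    case (Some(left), right) =>        // merge each subsequent schema into the accumulator
      try Some(left.merge(right)) catch { case e: Throwable =>
        throw new SparkException(s"Failed to merge incompatible schemas $left and $right", e)
      }
  }
mergedSchema  // already an Option[StructType]; no Option(...) wrapping needed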

 /**

@@ -208,14 +208,14 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
         ParquetRelation2(
           paths,
           Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json),
-          None,
+          Some(metastoreSchema),
           Some(partitionSpec))(hive))
     } else {
       val paths = Seq(metastoreRelation.hiveQlTable.getDataLocation.toString)
-      LogicalRelation(
-        ParquetRelation2(
+      LogicalRelation(ParquetRelation2(
         paths,
-        Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json))(hive))
+        Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json),
+        Some(metastoreSchema))(hive))
Contributor:

OK, we can leave this file unchanged.

Contributor:

Yeah, evil case insensitivity...

     }
   }

@@ -121,13 +121,54 @@ class ParquetDataSourceOnMetastoreSuite extends ParquetMetastoreSuiteBase {

   override def beforeAll(): Unit = {
     super.beforeAll()
+
+    sql(s"""
+      create table test_parquet
+      (
+        intField INT,
+        stringField STRING
+      )
+      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
+      STORED AS
+        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
+        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
+    """)
+
+    val rdd = sparkContext.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
+    jsonRDD(rdd).registerTempTable("jt")
+    sql("""
+      create table test_parquet_jt ROW FORMAT
+        | SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
+        | STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
+        | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
+        | AS select * from jt""".stripMargin)
+

Contributor:

Also add a test for CREATE TABLE ... STORED AS PARQUET AS ...?

Contributor (author):

STORED AS PARQUET has only been supported since Hive 0.13; the unit test may fail on Hive 0.12 if we do that.

Contributor:

How about using if (HiveShim.version == "0.13.1") to check the Hive version, like what we did in e0490e2?

Contributor (author):

Oh, I thought STORED AS PARQUET AS ... is just syntactic sugar. Unfortunately, all of the test suites are implemented in the sql sub-project, but HiveShim is in the hive sub-project with hive package-private visibility.

Let's put this test in another PR?
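Editor's note: the version-gated pattern referenced above would look roughly like the sketch below. As the author points out, HiveShim is not actually visible from this suite, so this is illustrative only; the table name test_parquet_ctas is also an assumption.

// Hypothetical sketch: gate the CTAS test on the Hive version, as in e0490e2.
if (HiveShim.version == "0.13.1") {
  test("CTAS with STORED AS PARQUET") {
    // Relies on syntax that Hive supports only from 0.13 onwards.
    sql("CREATE TABLE test_parquet_ctas STORED AS PARQUET AS SELECT * FROM jt")
    checkAnswer(
      sql("SELECT a, b FROM test_parquet_ctas WHERE a = 1"),
      Seq(Row(1, "str1")))
    sql("DROP TABLE test_parquet_ctas")
  }
}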

     conf.setConf(SQLConf.PARQUET_USE_DATA_SOURCE_API, "true")
   }

   override def afterAll(): Unit = {
     super.afterAll()
+    sql("DROP TABLE test_parquet")
+    sql("DROP TABLE jt")
+    sql("DROP TABLE test_parquet_jt")
+
     setConf(SQLConf.PARQUET_USE_DATA_SOURCE_API, originalConf.toString)
   }

+  test("scan from an empty parquet table") {
+    checkAnswer(sql("SELECT count(*) FROM test_parquet"), Row(0))
+  }
+
+  test("scan from an empty parquet table with upper case") {
+    checkAnswer(sql("SELECT count(INTFIELD) FROM TEST_parquet"), Row(0))
+  }
+
+  test("scan from an non empty parquet table #1") {
+    checkAnswer(
+      sql(s"SELECT a, b FROM test_parquet_jt WHERE a = '1'"),
+      Seq(Row(1, "str1"))
+    )
+  }
 }

 class ParquetDataSourceOffMetastoreSuite extends ParquetMetastoreSuiteBase {