[SPARK-5775] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
The bug fixed here was caused by a change in PartitionTableScan that affects reading a partitioned table.
- When the partition column is requested from a Parquet table, the table scan needs to add the column back to the output Rows.
- To update the Row object created by PartitionTableScan, the Row was first cast to SpecificMutableRow and then updated.
- This cast was unsafe, since there is no guarantee that the newHadoopRDD used internally will instantiate the output Rows as MutableRow.
In particular, when reading a table with complex types (e.g. struct or array), the newHadoopRDD uses a parquet.io.api.RecordMaterializer produced by org.apache.spark.sql.parquet.RowReadSupport. When complex types are involved, this consumer is created as an org.apache.spark.sql.parquet.CatalystGroupConverter (a) and not as an org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter (b). The choice is made in the org.apache.spark.sql.parquet.CatalystConverter.createRootConverter factory.
The consumer (a) will output GenericRow, while the consumer (b) produces SpecificMutableRow.
Therefore, any query that selects a partition column together with a complex-type column gets its rows back as GenericRow, and the unsafe cast fails (see https://issues.apache.org/jira/browse/SPARK-5775 for an example).
The fix originally proposed here replaced the unsafe class cast with a pattern match on the Row type: the Row is updated in place if it is of a mutable type, and a new Row is created otherwise.
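The original approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the code from this PR: `SketchRow`, `ImmutableSketchRow`, `MutableSketchRow` and `PartitionFill` are hypothetical stand-ins for Spark SQL's row classes, and the sketch assumes the row already reserves a slot for the partition column.

```scala
// Hypothetical stand-ins for Spark SQL's row classes; not the real API.
sealed trait SketchRow { def values: Seq[Any] }

// Immutable row, playing the role of GenericRow.
final class ImmutableSketchRow(private val data: Array[Any]) extends SketchRow {
  def values: Seq[Any] = data.toList
}

// Mutable row, playing the role of SpecificMutableRow.
final class MutableSketchRow(private val data: Array[Any]) extends SketchRow {
  def values: Seq[Any] = data.toList
  def update(ordinal: Int, value: Any): Unit = { data(ordinal) = value }
}

object PartitionFill {
  // Writes a partition value into the slot at `ordinal`.
  // Assumption: the row already has a slot reserved for the partition column.
  def withPartitionValue(row: SketchRow, ordinal: Int, value: Any): SketchRow =
    row match {
      case m: MutableSketchRow =>
        m.update(ordinal, value) // safe: the row is known to be mutable
        m
      case g: ImmutableSketchRow =>
        // Cannot update in place, so build a new row with the value filled in.
        new ImmutableSketchRow(g.values.updated(ordinal, value).toArray)
    }
}
```

Because the match is on a sealed type, the compiler can check that both row kinds are handled, which the blind cast could not guarantee.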
This PR now implements the solution as updated by liancheng in aa39460:
The fix checks whether all requested columns are of primitive type, in a manner symmetrical to the check in org.apache.spark.sql.parquet.CatalystConverter.createRootConverter.
- If all columns are of primitive type, the Row can safely be cast to a MutableRow.
- Otherwise, a new GenericRow is created, and the partition column is written into this new row structure.
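The schema-level check described above can be sketched as follows. This is only an illustration under stated assumptions: `SketchType` and its cases, as well as `PrimitiveCheck`, are hypothetical stand-ins for Spark SQL's data types and are not the identifiers used in this PR.

```scala
// Hypothetical stand-ins for Spark SQL data types; not the real API.
sealed trait SketchType
case object IntSketchType extends SketchType
case object StringSketchType extends SketchType
final case class ArraySketchType(element: SketchType) extends SketchType
final case class StructSketchType(fields: List[(String, SketchType)]) extends SketchType

object PrimitiveCheck {
  // A type is primitive when it is neither an array nor a struct.
  def isPrimitive(t: SketchType): Boolean = t match {
    case IntSketchType | StringSketchType => true
    case _: ArraySketchType | _: StructSketchType => false
  }

  // True when every requested column is primitive, i.e. when the reader is
  // expected to emit mutable rows that can be updated in place.
  def canUpdateInPlace(requested: Seq[SketchType]): Boolean =
    requested.forall(isPrimitive)
}
```

Since the check depends only on the requested schema, a caller could presumably evaluate it once per scan and pick the in-place or copy path for all rows, rather than inspecting each row.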
This fix is unit-tested in sql/hive/src/test/scala/org/apache/spark/sql/parquet/parquetSuites.scala
Author: Anselme Vignon <[email protected]>
Author: Cheng Lian <[email protected]>
Closes #4697 from anselmevignon/local_dev and squashes the following commits:
6a4c53d [Anselme Vignon] style corrections
52f73fc [Cheng Lian] cherry-pick & merge from aa39460
8fc6a8c [Anselme Vignon] correcting tests on temporary tables
24928ea [Anselme Vignon] corrected mirror bug (see SPARK-5775) for newParquet
7c829cb [Anselme Vignon] bugfix, hopefully correct this time
005a7f8 [Anselme Vignon] added test cleanup
22cec52 [Anselme Vignon] lint compatible changes
ae48f7c [Anselme Vignon] unittesting SPARK-5775
f876dea [Anselme Vignon] starting to write tests
dbceaa3 [Anselme Vignon] cutting lines
4eb04e9 [Anselme Vignon] bugfix SPARK-5775