
Commit 892519d

[SPARK-4386] Improve performance when writing Parquet files
Convert the type of RowWriteSupport.attributes to Array. Profiling writes of very wide tables shows that time is spent predominantly in the apply method on the attributes var. attributes was previously backed by a LinearSeqOptimized, whose apply is O(N), which made each row write O(N^2) in the number of columns. Measurements on a 575-column table showed a 6x improvement in write times from this change.
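
To see why the apply cost dominates, here is a minimal standalone Scala sketch (not part of the commit; the benchmark harness and row count are illustrative) comparing indexed access on a List, a typical LinearSeqOptimized, against an Array:

object SeqVsArrayIndexing {
  def main(args: Array[String]): Unit = {
    val numColumns = 575   // table width from the commit message
    val numRows    = 10000 // illustrative row count, not from the commit

    val asList: List[Int]   = List.tabulate(numColumns)(identity)
    val asArray: Array[Int] = asList.toArray

    def time(label: String)(body: => Unit): Unit = {
      val start = System.nanoTime()
      body
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    // "Write" each row by visiting every column through apply(i).
    time("List apply  (O(N) per access, O(N^2) per row)") {
      var sink = 0L
      for (_ <- 0 until numRows; i <- 0 until numColumns) sink += asList(i)
    }
    time("Array apply (O(1) per access, O(N) per row)") {
      var sink = 0L
      for (_ <- 0 until numRows; i <- 0 until numColumns) sink += asArray(i)
    }
  }
}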
1 parent 0e532cc commit 892519d

1 file changed (+2 -2 lines)


sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala

Lines changed: 2 additions & 2 deletions
@@ -130,15 +130,15 @@ private[parquet] object RowReadSupport {
 private[parquet] class RowWriteSupport extends WriteSupport[Row] with Logging {
 
   private[parquet] var writer: RecordConsumer = null
-  private[parquet] var attributes: Seq[Attribute] = null
+  private[parquet] var attributes: Array[Attribute] = null
 
   override def init(configuration: Configuration): WriteSupport.WriteContext = {
     val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
     val metadata = new JHashMap[String, String]()
     metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)
 
     if (attributes == null) {
-      attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
+      attributes = ParquetTypesConverter.convertFromString(origAttributesStr).toArray
     }
 
     log.debug(s"write support initialized for requested schema $attributes")
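
The .toArray in the second hunk pays a one-time O(N) copy during init so that every later attributes(i) lookup in the per-row write path is O(1). A minimal sketch of that trade-off, with hypothetical stand-ins (parseSchema and the simplified Attribute are not the Spark definitions):

case class Attribute(name: String) // simplified stand-in for Catalyst's Attribute

object InitOnceSketch {
  // Hypothetical stand-in for ParquetTypesConverter.convertFromString.
  def parseSchema(s: String): Seq[Attribute] =
    s.split(",").toSeq.map(Attribute(_))

  var attributes: Array[Attribute] = null

  def init(schemaString: String): Unit = {
    if (attributes == null) {
      // One O(N) copy here; attributes(i) is O(1) in the per-row hot path.
      attributes = parseSchema(schemaString).toArray
    }
  }
}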
