Skip to content

[Bug]: No parallelization for ReadFromParquet (or any Python Read transforms) in Spark RDD Runner #24422

@cozos

Description

@cozos

What happened?

The python iobase.Read transform is a splittable dofn. Since SparkRunner does not support splittable dofns, all Read operations end up on one Spark task/partition. This does not scale on any moderate+ sized workload. To fix this, I had to set the option --experiments=pre_optimize=all, which expands the SDF into a pair + split + read. But this option is hidden/undocumented/magic. I think it would be better if translations.expand_sdf was enabled on all runners that don't support SDFs.

Issue Priority

Priority: 2

Issue Component

Component: runner-spark

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions