What happened?
The Python `iobase.Read` transform is a splittable DoFn (SDF). Since SparkRunner does not support splittable DoFns, all Read operations end up on a single Spark task/partition. This does not scale on any moderately sized or larger workload. To work around it, I had to set the option `--experiments=pre_optimize=all`, which expands the SDF into a pair + split + read sequence. But this option is hidden/undocumented/magic. I think it would be better if `translations.expand_sdf` were enabled on all runners that don't support SDFs.
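For reference, this is roughly how the workaround is applied when submitting a Python pipeline to Spark (the script name `my_pipeline.py` is a placeholder; the flags are the ones described above):

```shell
# Workaround: force SDF expansion before the pipeline is translated for Spark.
# Without --experiments=pre_optimize=all, all Read work lands on one partition.
python my_pipeline.py \
  --runner=SparkRunner \
  --experiments=pre_optimize=all
```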
Issue Priority
Priority: 2
Issue Component
Component: runner-spark