[SPARK-52669][PySpark] Improve how PySpark chooses pythonExec in YARN client mode #51357
What changes were proposed in this pull request?
This PR fixes a Python version mismatch issue in PySpark cluster execution. The problem occurred when running PySpark jobs in YARN client mode where the driver and executors had different Python minor versions (e.g. 3.10 vs 3.6), causing runtime errors. The fix ensures consistent Python versions by:

- Keeping the existing strategy of reading `os.environ` first; if the relevant variables are not set, falling back to the Spark configuration values `spark.pyspark.driver.python` and `spark.pyspark.python`, so the driver and executors can use the same Python interpreter from the archived environment (e.g. `./environment/bin/python`).
- Verifying Python version consistency through RDD operations that report the executor environments.

The solution was tested by running the sample run.py script, which now correctly reports matching Python versions across the driver and executors.
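A minimal sketch of the fallback order described above, assuming a hypothetical helper named `resolve_python_exec` (this is not the actual patch code, which lives in PySpark's context/launcher logic and may differ in structure):

```python
import os


def resolve_python_exec(conf):
    """Hypothetical helper: pick the Python interpreter for the job.

    Order, as described in this PR: existing environment variables first,
    then the Spark configuration values, then a plain default.
    """
    # 1. Old behaviour: honour the explicit environment variables.
    python_exec = os.environ.get("PYSPARK_DRIVER_PYTHON") or os.environ.get("PYSPARK_PYTHON")

    # 2. New fallback: read the Spark configuration when the env vars are unset.
    if not python_exec:
        python_exec = (
            conf.get("spark.pyspark.driver.python", None)
            or conf.get("spark.pyspark.python", None)
        )

    # 3. Last resort: the default interpreter on PATH.
    return python_exec or "python3"
```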
Why are the changes needed?
PySpark requires the driver and executors to use compatible Python versions (the same minor version). Without these changes, when running in a Python development environment or executing Python scripts directly, the driver would fail to locate the PYSPARK_PYTHON variable, forcing users to manually export PYSPARK_PYTHON for every script, which is inconvenient. The example below shows the conf-based alternative this PR enables.
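As a hedged illustration of the conf-based fallback (the application name, archive name, and paths here are illustrative, not taken from the patch), a job could declare the executor interpreter via Spark configuration instead of environment variables:

```python
from pyspark.sql import SparkSession

# Illustrative only: rely on the conf fallback instead of exporting
# PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON before every run.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("conf-based-python-exec")  # name is illustrative
    # Ship a packed Python environment and unpack it as ./environment on executors.
    .config("spark.yarn.dist.archives", "pyspark_env.tar.gz#environment")
    # Executors use the archived interpreter; in client mode the driver keeps
    # whatever interpreter launched this script.
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)
```

With the fallback in place, the executor-side interpreter is taken from `spark.pyspark.python` when `PYSPARK_PYTHON` is not set in the driver's environment.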
Does this PR introduce any user-facing change?
No
How was this patch tested?
Testing steps:
1. Deployed the fixed configuration in YARN client mode.
2. Ran the sample run.py script (a sketch of a comparable script is shown after this list).
3. Verified that the driver and executors reported matching Python versions.

Additional verification:
- Tested with both working and broken configurations to confirm the error cases.
- Validated that archive path resolution (the `#environment` alias) works as expected.
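The actual run.py is not included in this description; the following is a minimal sketch of a comparable verification script that compares the driver's Python version with the versions seen inside executor tasks:

```python
# Sketch of a run.py-style check (not the script used in the PR): compare the
# driver's Python version with the versions reported by executor tasks.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext

driver_version = "%d.%d.%d" % sys.version_info[:3]


def executor_version(_):
    import sys  # evaluated on the executor
    return "%d.%d.%d" % sys.version_info[:3]


# Run a small RDD job so every partition reports its interpreter version.
executor_versions = sorted(
    set(sc.parallelize(range(8), 4).map(executor_version).collect())
)

print("driver python   :", driver_version)
print("executor python :", executor_versions)

spark.stop()
```

A mismatch between the two printed values reproduces the original failure mode; with this change applied, both should report the same minor version.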
Was this patch authored or co-authored using generative AI tooling?
No