
[SPARK-52669][PySpark]Improvement PySpark choose pythonExec in cluster yarn client mode #51357


Open
wants to merge 1 commit into
base: master

Conversation

@gwdgithubnom commented Jul 3, 2025

What changes were proposed in this pull request?

This PR fixes a Python version mismatch issue in PySpark cluster execution. The problem occurred when running PySpark jobs in YARN client mode where the driver and executors had different Python minor versions (e.g., 3.10 vs 3.6), causing runtime errors. The fix ensures consistent Python versions by:

  1. Keeping the old strategy of reading os.environ first; if the relevant variable is unset, falling back to the Spark configuration entries "spark.pyspark.driver.python" and "spark.pyspark.python", so that the driver and executors can use the same Python interpreter from the archived environment (e.g., ./environment/bin/python).

  2. Verifying the Python version consistency through RDD operations that report executor environments.

The solution was tested by running the sample run.py script which now correctly reports matching Python versions across driver and executors.
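For reference, here is a minimal sketch of that resolution order. The helper name resolve_python_exec and the standalone conf argument are illustrative only; the actual change lives in context.py and may be shaped differently.

import os

def resolve_python_exec(conf):
    # 1. Old strategy: honor the environment variable when it is set
    #    (launch_container.sh exports it inside YARN containers).
    exec_path = os.environ.get("PYSPARK_PYTHON")
    if exec_path:
        return exec_path
    # 2. Otherwise fall back to the Spark configuration entries.
    exec_path = (conf.get("spark.pyspark.driver.python", None)
                 or conf.get("spark.pyspark.python", None))
    if exec_path:
        return exec_path
    # 3. Default interpreter when nothing is configured.
    return "python3"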

Why are the changes needed?

PySpark requires the driver and executors to use compatible Python versions (same minor version). Without these changes, when running in a Python development environment or executing Python scripts directly, the driver node would fail to locate the PYSPARK_PYTHON variable. This would force users to manually define the PYSPARK_PYTHON environment variable for every script, which is inconvenient.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Testing steps:

  1. Deployed the fixed configuration in YARN client mode

  2. Ran the sample run.py script with:

import sys  # referenced inside the lambda, evaluated on the executors

# "spark" is the SparkSession created earlier in run.py.
spark.range(1).rdd.map(lambda x: (x,
   f"Executor Python version: {sys.version}",
   f"Executor Python executable: {sys.executable}")).collect()

  3. Verified:

  • Driver and executor Python versions match in output
  • No RuntimeError occurs
  • UI shows correct environment configuration

Additional verification:

  • Tested with both working and broken configurations to confirm the error cases
  • Validated that archive path resolution (#environment) works as expected; see the sketch below
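For context, a hedged sketch of how such an archived environment is typically wired up (Spark 3.1+; the archive file name pyspark_env.tar.gz is a hypothetical example):

from pyspark.sql import SparkSession

# "pyspark_env.tar.gz#environment" unpacks the archive into a folder named
# "environment" in each executor's working directory, so executors can run
# the relative path ./environment/bin/python.
spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_env.tar.gz#environment")
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)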

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the PYTHON label Jul 3, 2025
@gwdgithubnom changed the title to [SPARK-52669][PySpark]Resolve the issue where PySpark fails to run due to different Python minor versions. Run Python directly. Jul 3, 2025
@gwdgithubnom changed the title to [SPARK-52669][PySpark]PySpark fails to run due to different Python minor versions. Run Python directly. Jul 3, 2025
@gwdgithubnom changed the title to [SPARK-52669][PySpark]Improvement PySpark fails to run due to different Python minor versions. Run Python directly. Jul 3, 2025
@gwdgithubnom changed the title to [SPARK-52669][PySpark]Improvement PySpark choose pythonExec in cluster yarn client mode Jul 3, 2025
@gwdgithubnom (Author) commented:

Hi @HyukjinKwon, could you please take a look at this PR when you have time? It's related to the pyspark module (context.py).
When we run in client mode, the driver and executors can hit a Python version mismatch error. During Spark context initialization in local/client mode (non-YARN deployment), the launch_container.sh script, which typically sets the PYSPARK_PYTHON environment variable, is not executed. Consequently, this critical environment variable may remain unset on the client node.

To address this, we should implement a more robust Python interpreter resolution mechanism that:

  1. First checks Spark configuration parameters (spark.pyspark.driver.python for driver-specific configuration and spark.pyspark.python for shared configuration)
  2. Only falls back to the default "python3" if no valid configuration is found

The implementation should prioritize configuration over environment variables, as Spark's configuration system provides a more reliable and explicit way to manage such settings across different deployment environments.
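A minimal sketch of the conf-first lookup proposed here, assuming a SparkConf-style conf with get(key, default); the function name choose_python_exec is illustrative, not the actual implementation:

import os

def choose_python_exec(conf):
    # 1. Prefer explicit Spark configuration, per the priority argued above.
    for key in ("spark.pyspark.driver.python", "spark.pyspark.python"):
        value = conf.get(key, None)
        if value:
            return value
    # 2. Fall back to environment variables (e.g., set by launch_container.sh).
    for key in ("PYSPARK_DRIVER_PYTHON", "PYSPARK_PYTHON"):
        if key in os.environ:
            return os.environ[key]
    # 3. Default when nothing is configured.
    return "python3"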

(screenshot attached)
