
Conversation

cocoatomo
Contributor

Problem

The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
But a folloing command does not run IPython but a default Python executable.

$ IPYTHON=1 ./bin/pyspark
Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
...

The spark/bin/pyspark script at commit b235e01 decides which executable and options to use in the following way.

  1. if PYSPARK_PYTHON is unset
    • → default it to "python"
  2. if IPYTHON_OPTS is set
    • → set IPYTHON to "1"
  3. if a Python script is passed to ./bin/pyspark → run it with ./bin/spark-submit
    • out of this issue's scope
  4. if IPYTHON is set to "1"
    • → execute $PYSPARK_PYTHON (default: ipython) with the arguments in $IPYTHON_OPTS
    • otherwise execute $PYSPARK_PYTHON

Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is "1".
In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on which command is run.
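
A condensed sketch of that ordering (paraphrased from the steps above, not the actual bin/pyspark script) makes the problem visible:

#!/usr/bin/env bash
# Sketch only: the order of the checks is what matters.

# (1) PYSPARK_PYTHON is defaulted to "python" before IPYTHON is ever consulted
if [ -z "$PYSPARK_PYTHON" ]; then
  PYSPARK_PYTHON="python"
fi

# (2) any IPYTHON_OPTS implies IPYTHON=1
if [ -n "$IPYTHON_OPTS" ]; then
  IPYTHON=1
fi

# (4) by this point PYSPARK_PYTHON is already "python", so the "ipython"
#     fallback never applies when PYSPARK_PYTHON was unset
if [ "$IPYTHON" = "1" ]; then
  exec "$PYSPARK_PYTHON" $IPYTHON_OPTS
else
  exec "$PYSPARK_PYTHON"
fi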

| PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command |
| --- | --- | --- | --- | --- |
| (unset → defaults to python) | (unset) | (unset) | python | (same) |
| (unset → defaults to python) | (unset) | 1 | python | ipython |
| (unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option |
| (unset → defaults to python) | an_option | 1 | python an_option | ipython an_option |
| ipython | (unset) | (unset) | ipython | (same) |
| ipython | (unset) | 1 | ipython | (same) |
| ipython | an_option | (unset → set to 1) | ipython an_option | (same) |
| ipython | an_option | 1 | ipython an_option | (same) |

Suggestion

The pyspark script should first determine whether the user wants to run IPython or another executable.

  1. if IPYTHON_OPTS is set
    • set IPYTHON to "1"
  2. if IPYTHON is "1"
    • PYSPARK_PYTHON defaults to "ipython" if not set
  3. PYSPARK_PYTHON defaults to "python" if it is still not set

See the pull request for the detailed modifications; a rough sketch of the suggested ordering is below.
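
The following is illustrative only and not the actual patch:

#!/usr/bin/env bash
# Sketch: decide on IPython first, then apply the executable default.

if [ -n "$IPYTHON_OPTS" ]; then
  IPYTHON=1
fi

if [ "$IPYTHON" = "1" ]; then
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-ipython}"
  exec "$PYSPARK_PYTHON" $IPYTHON_OPTS
else
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-python}"
  exec "$PYSPARK_PYTHON"
fi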

@AmplabJenkins

Can one of the admins verify this patch?

@mattf

mattf commented Sep 30, 2014

thanks for identifying this issue and doing the analysis.

the whole business of having a separate IPYTHON env variable complicates the situation. what about deprecating it?

say, introduce a PYSPARK_PYTHON_OPTS and change the docs to "set PYSPARK_PYTHON=ipython and PYSPARK_PYTHON_OPTS=notebook..."

for backward compatibility the top of the file can detect IPYTHON and IPYTHON_OPTS and set up defaults correctly
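
One possible shape for that backward-compatibility shim (purely illustrative, not the merged code):

# Map the deprecated variables onto the new ones when the new ones are unset.
if [ -n "$IPYTHON_OPTS" ]; then
  PYSPARK_PYTHON_OPTS="${PYSPARK_PYTHON_OPTS:-$IPYTHON_OPTS}"
fi
if [ "$IPYTHON" = "1" ] || [ -n "$IPYTHON_OPTS" ]; then
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-ipython}"
fi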

@mattf

mattf commented Sep 30, 2014

also, 'test "$IPYTHON" = "1"' should be written as 'test -n "$IPYTHON"'; requiring the value to be 1 isn't very shell-ish
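
For illustration, the two checks differ for any value other than "1":

IPYTHON=yes
test "$IPYTHON" = "1" && echo "equality check"   # prints nothing
test -n "$IPYTHON" && echo "non-empty check"     # prints for any non-empty value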

@cocoatomo
Contributor Author

Thank you for the comment.

I agree that using the PYSPARK_PYTHON and PYSPARK_PYTHON_OPTS environment variables is simpler and that the IPYTHON flag should not be exposed.

I will keep backward compatibility for IPYTHON and IPYTHON_OPTS.

Please review the additional commit.

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2554 at commit 42e02d5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2554 at commit 42e02d5.

  • This patch merges cleanly.

@mattf

mattf commented Oct 1, 2014

much nicer. you could even remove the doc note about backward compatibility.

+1 lgtm

@JoshRosen
Contributor

Thanks for the very thorough description of this issue. It looks like IPYTHON=1 ./bin/pyspark works as expected in Spark 1.0.2 and 1.1.0, so this appears to be an issue only in the master branch.

I think that the original motivation for the IPYTHON=1 flag was that older versions of IPython didn't use PYTHONSTARTUP, so this required us to use the %run magic to load the PySpark shell's startup file. I think IPYTHON=1 was added as a convenience method and to avoid writing code to detect whether a particular Python executable was IPython (in retrospect, we probably should have just hidden this complexity from users and performed that auto-detection). In January, it looks like we removed support for IPython < 1.0 (82a1d38).

The approach in this PR is very nice, since we no longer require special handling / detection of IPython. This looks good to me, too, so I'd like to merge it (pending Jenkins).

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2554 at commit 42e02d5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PStatsParam(AccumulatorParam):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21136/

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2554 at commit 42e02d5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21138/

@JoshRosen
Contributor

This looks great, but I noticed one minor problem when running some manual tests:

If I run IPYTHON=1 ./bin/pyspark test.py (where test.py is just some dummy file that prints "Hello World!"), then this produces an error:

WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0.
Use ./bin/spark-submit <python file>

[TerminalIPythonApp] CRITICAL | Bad config encountered during initialization:
[TerminalIPythonApp] CRITICAL | Unrecognized flag: '-u'

The problem here is that spark-submit's PythonRunner passes the -u flag to configure Python to use unbuffered output, but IPython doesn't support this flag. It looks like we can use the PYTHONUNBUFFERED environment variable instead (source), which I think should also work with IPython.
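
For reference, the two mechanisms are roughly equivalent for CPython ("app.py" below is just a placeholder script name):

$ python -u app.py                      # -u flag: unbuffered output; rejected by ipython
$ PYTHONUNBUFFERED=YES python app.py    # same effect via the environment variable
$ PYTHONUNBUFFERED=YES ipython app.py   # works for ipython as well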

@JoshRosen
Contributor

Switching to PYTHONUNBUFFERED should be a one- or two-line fix. Just remove the -u flag and add the new environment variable to the ProcessBuilder's environment:

val builder = new ProcessBuilder(Seq(pythonExec, "-u", formattedPythonFile) ++ otherArgs)

…ad of -u option

Because IPython cannot recognize the -u option, we will use the PYTHONUNBUFFERED environment variable,
which has exactly the same effect as the -u option.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2554 at commit d2a9b06.

  • This patch merges cleanly.

@cocoatomo
Contributor Author

Thank you for the suggestions, @mattf and @JoshRosen .

I deleted the sentence about IPYTHON and IPYTHON_OPTS,
and replaced the "-u" option with PYTHONUNBUFFERED.

To confirm that PYTHONUNBUFFERED is set,
we can run a Python executable with the following script passed as an argument.

# env.py
import os
print os.environ['PYTHONUNBUFFERED']

$ PYSPARK_PYTHON=ipython ./bin/pyspark env.py
...
YES

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2554 at commit d2a9b06.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21209/

@JoshRosen
Contributor

This looks good to me; I tested it out locally and everything works as expected. Thanks!

@asfgit asfgit closed this in 5b4a5b1 Oct 2, 2014
asfgit pushed a commit that referenced this pull request Oct 9, 2014
…upport improvements:

This pull request addresses a few issues related to PySpark's IPython support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in `bin/pyspark`.

Author: Josh Rosen <[email protected]>

Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:

7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
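
Illustrative usage of the variables described in that follow-up (a sketch, not taken from the commit itself):

# Driver runs under the IPython notebook; workers use a separate Python.
PYSPARK_DRIVER_PYTHON=ipython \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON=python2.7 \
./bin/pyspark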