
Conversation

cocoatomo
Contributor

Problem

The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
But a folloing command does not run IPython but a default Python executable.

$ IPYTHON=1 ./bin/pyspark
Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
...

The spark/bin/pyspark script at commit b235e01 decides which executable and options to use in the following way.

  1. if PYSPARK_PYTHON is unset
    • → default it to "python"
  2. if IPYTHON_OPTS is set
    • → set IPYTHON to "1"
  3. if a Python script is passed to ./bin/pyspark → run it with ./bin/spark-submit
    • out of this issue's scope
  4. if IPYTHON is set to "1"
    • → execute $PYSPARK_PYTHON (default: ipython) with the arguments in $IPYTHON_OPTS
    • otherwise execute $PYSPARK_PYTHON

Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is "1".
In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on which command is run.
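
A condensed sketch of that ordering (paraphrased from the steps above, not the actual bin/pyspark script) makes the problem visible:

#!/usr/bin/env bash
# Sketch only: the order of the checks is what matters.

# (1) PYSPARK_PYTHON is defaulted to "python" before IPYTHON is ever consulted
if [ -z "$PYSPARK_PYTHON" ]; then
  PYSPARK_PYTHON="python"
fi

# (2) any IPYTHON_OPTS implies IPYTHON=1
if [ -n "$IPYTHON_OPTS" ]; then
  IPYTHON=1
fi

# (4) by this point PYSPARK_PYTHON is already "python", so the "ipython"
#     fallback never applies when PYSPARK_PYTHON was unset
if [ "$IPYTHON" = "1" ]; then
  exec "$PYSPARK_PYTHON" $IPYTHON_OPTS
else
  exec "$PYSPARK_PYTHON"
fi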

| PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command |
| --- | --- | --- | --- | --- |
| (unset → defaults to python) | (unset) | (unset) | python | (same) |
| (unset → defaults to python) | (unset) | 1 | python | ipython |
| (unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option |
| (unset → defaults to python) | an_option | 1 | python an_option | ipython an_option |
| ipython | (unset) | (unset) | ipython | (same) |
| ipython | (unset) | 1 | ipython | (same) |
| ipython | an_option | (unset → set to 1) | ipython an_option | (same) |
| ipython | an_option | 1 | ipython an_option | (same) |

Suggestion

The pyspark script should first determine whether the user wants to run IPython or another executable.

  1. if IPYTHON_OPTS is set
    • set IPYTHON to "1"
  2. if IPYTHON is "1"
    • PYSPARK_PYTHON defaults to "ipython" if not set
  3. PYSPARK_PYTHON defaults to "python" if it is still not set

See the pull request for the detailed modifications; a rough sketch of the suggested ordering is below.
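
The following is illustrative only and not the actual patch:

#!/usr/bin/env bash
# Sketch: decide on IPython first, then apply the executable default.

if [ -n "$IPYTHON_OPTS" ]; then
  IPYTHON=1
fi

if [ "$IPYTHON" = "1" ]; then
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-ipython}"
  exec "$PYSPARK_PYTHON" $IPYTHON_OPTS
else
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-python}"
  exec "$PYSPARK_PYTHON"
fi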

@AmplabJenkins

Can one of the admins verify this patch?

@mattf

mattf commented Sep 30, 2014

thanks for identifying this issue and doing the analysis.

the whole business of having a separate IPYTHON env variable complicates the situation. what about deprecating it?

say, introduce a PYSPARK_PYTHON_OPTS and change the docs to "set PYSPARK_PYTHON=ipython and PYSPARK_PYTHON_OPTS=notebook..."

for backward compatibility the top of the file can detect IPYTHON and IPYTHON_OPTS and set up defaults correctly
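
One possible shape for that backward-compatibility shim (purely illustrative, not the merged code):

# Map the deprecated variables onto the new ones when the new ones are unset.
if [ -n "$IPYTHON_OPTS" ]; then
  PYSPARK_PYTHON_OPTS="${PYSPARK_PYTHON_OPTS:-$IPYTHON_OPTS}"
fi
if [ "$IPYTHON" = "1" ] || [ -n "$IPYTHON_OPTS" ]; then
  PYSPARK_PYTHON="${PYSPARK_PYTHON:-ipython}"
fi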

@mattf

mattf commented Sep 30, 2014

also, 'test "$IPYTHON" = "1"' should be written as 'test -n "$IPYTHON"'; requiring the value to be 1 isn't very shell-ish
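
For illustration, the two checks differ for any value other than "1":

IPYTHON=yes
test "$IPYTHON" = "1" && echo "equality check"   # prints nothing
test -n "$IPYTHON" && echo "non-empty check"     # prints for any non-empty value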

@cocoatomo
Contributor Author

Thank you for the comment.

I agree that using the PYSPARK_PYTHON and PYSPARK_PYTHON_OPTS environment variables is simpler and that the IPYTHON flag should not be exposed.

I will keep backward compatibility for IPYTHON and IPYTHON_OPTS.

Please review the additional commit.

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2554 at commit 42e02d5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have started for PR 2554 at commit 42e02d5.

  • This patch merges cleanly.

@mattf

mattf commented Oct 1, 2014

much nicer. you could even remove the doc note about backward compatibility.

+1 lgtm

@JoshRosen
Contributor

Thanks for the very thorough description of this issue. It looks like IPYTHON=1 ./bin/pyspark works as expected in Spark 1.0.2 and 1.1.0, so this appears to be an issue only in the master branch.

I think that the original motivation for the IPYTHON=1 flag was that older versions of IPython didn't use PYTHONSTARTUP, so this required us to use the %run magic to load the PySpark shell's startup file. I think IPYTHON=1 was added as a convenience method and to avoid writing code to detect whether a particular Python executable was IPython (in retrospect, we probably should have just hidden this complexity from users and performed that auto-detection). In January, it looks like we removed support for IPython < 1.0 (82a1d38).

The approach in this PR is very nice, since we no longer require special handling / detection of IPython. This looks good to me, too, so I'd like to merge it (pending Jenkins).

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2554 at commit 42e02d5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PStatsParam(AccumulatorParam):

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21136/

@SparkQA

SparkQA commented Oct 1, 2014

QA tests have finished for PR 2554 at commit 42e02d5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21138/

@JoshRosen
Contributor

This looks great, but I noticed one minor problem when running some manual tests:

If I run IPYTHON=1 ./bin/pyspark test.py (where test.py is just some dummy file that prints "Hello World!"), then this produces an error:

WARNING: Running python applications through ./bin/pyspark is deprecated as of Spark 1.0.
Use ./bin/spark-submit <python file>

[TerminalIPythonApp] CRITICAL | Bad config encountered during initialization:
[TerminalIPythonApp] CRITICAL | Unrecognized flag: '-u'

The problem here is that spark-submit's PythonRunner passes the -u flag to configure Python to use unbuffered output, but IPython doesn't support this flag. It looks like we can use the PYTHONUNBUFFERED environment variable instead (source), which I think should also work with IPython.
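
For reference, the two mechanisms are roughly equivalent for CPython ("app.py" below is just a placeholder script name):

$ python -u app.py                      # -u flag: unbuffered output; rejected by ipython
$ PYTHONUNBUFFERED=YES python app.py    # same effect via the environment variable
$ PYTHONUNBUFFERED=YES ipython app.py   # works for ipython as well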

@JoshRosen
Contributor

Switching to PYTHONUNBUFFERED should be a one- or two-line fix. Just remove the -u flag and add the new environment variable to the ProcessBuilder's environment:

val builder = new ProcessBuilder(Seq(pythonExec, "-u", formattedPythonFile) ++ otherArgs)

…ad of -u option

Because IPython cannot recognize the -u option, we will use the PYTHONUNBUFFERED environment variable,
which has exactly the same effect as the -u option.

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have started for PR 2554 at commit d2a9b06.

  • This patch merges cleanly.

@cocoatomo
Contributor Author

Thank you for the suggestions, @mattf and @JoshRosen .

I deleted the sentence about IPYTHON and IPYTHON_OPTS,
and replaced the "-u" option with PYTHONUNBUFFERED.

To confirm that PYTHONUNBUFFERED is set,
we can run a Python executable with the following script passed as an argument.

# env.py
import os
print os.environ['PYTHONUNBUFFERED']

$ PYSPARK_PYTHON=ipython ./bin/pyspark env.py
...
YES

@SparkQA

SparkQA commented Oct 2, 2014

QA tests have finished for PR 2554 at commit d2a9b06.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21209/

@JoshRosen
Contributor

This looks good to me; I tested it out locally and everything works as expected. Thanks!

@asfgit asfgit closed this in 5b4a5b1 Oct 2, 2014
asfgit pushed a commit that referenced this pull request Oct 9, 2014
…upport improvements:

This pull request addresses a few issues related to PySpark's IPython support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in `bin/pyspark`.

Author: Josh Rosen <[email protected]>

Closes #2651 from JoshRosen/SPARK-3772 and squashes the following commits:

7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
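
Illustrative usage of the variables described in that follow-up (a sketch, not taken from the commit itself):

# Driver runs under the IPython notebook; workers use a separate Python.
PYSPARK_DRIVER_PYTHON=ipython \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON=python2.7 \
./bin/pyspark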