
Commit cf6cbe9

andrewor14 authored and pwendell committed
[SPARK-1824] Remove <master> from Python examples
A recent PR (#552) fixed this for all Scala / Java examples. We need to do it for python too. Note that this blocks on #799, which makes `bin/pyspark` go through Spark submit. With only the changes in this PR, the only way to run these examples is through Spark submit. Once #799 goes in, you can use `bin/pyspark` to run them too. For example,

```
bin/pyspark examples/src/main/python/pi.py 100 --master local-cluster[4,1,512]
```

Author: Andrew Or <[email protected]>

Closes #802 from andrewor14/python-examples and squashes the following commits:

cf50b9f [Andrew Or] De-indent python comments (minor)
50f80b1 [Andrew Or] Remove pyFiles from SparkContext construction
c362f69 [Andrew Or] Update docs to use spark-submit for python applications
7072c6a [Andrew Or] Merge branch 'master' of github.com:apache/spark into python-examples
427a5f0 [Andrew Or] Update docs
d32072c [Andrew Or] Remove <master> from examples + update usages
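The core of the change is the same in every example touched below: the positional `<master>` argument to `SparkContext` goes away, and the master URL is instead supplied by `spark-submit` (or `bin/pyspark`) via `--master`. A minimal sketch of the before/after pattern — the script and app names here are illustrative, not taken from any one example:

```python
from pyspark import SparkContext

# Old pattern (removed by this commit): the script itself consumed a master URL.
#   sc = SparkContext(sys.argv[1], "MyExample")

# New pattern: only the application name is set in code; the master comes from
# the launcher, e.g. `./bin/spark-submit --master local[2] my_example.py`.
sc = SparkContext(appName="MyExample")
```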
1 parent 4b8ec6f commit cf6cbe9

File tree

12 files changed: +77, -72 lines


docs/index.md

Lines changed: 7 additions & 4 deletions
@@ -43,12 +43,15 @@ The `--master` option specifies the
 locally with one thread, or `local[N]` to run locally with N threads. You should start by using
 `local` for testing. For a full list of options, run Spark shell with the `--help` option.

-Spark also provides a Python interface. To run an example Spark application written in Python, use
-`bin/pyspark <program> [params]`. For example,
+Spark also provides a Python interface. To run Spark interactively in a Python interpreter, use
+`bin/pyspark`. As in Spark shell, you can also pass in the `--master` option to configure your
+master URL.

-    ./bin/pyspark examples/src/main/python/pi.py local[2] 10
+    ./bin/pyspark --master local[2]

-or simply `bin/pyspark` without any arguments to run Spark interactively in a python interpreter.
+Example applications are also provided in Python. For example,
+
+    ./bin/spark-submit examples/src/main/python/pi.py 10

 # Launching on a Cluster

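As the updated text above notes, `bin/pyspark` now routes through Spark submit, so the interactive shell already provides a `SparkContext` bound to `sc`, configured from whatever `--master` was passed on the command line; no master appears in user code. A small hypothetical session, purely for illustration:

```python
# Inside a `./bin/pyspark --master local[2]` session, `sc` is pre-created.
>>> data = sc.parallelize(range(1000))
>>> data.filter(lambda x: x % 2 == 0).count()
500
```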

docs/python-programming-guide.md

Lines changed: 17 additions & 15 deletions
@@ -60,13 +60,9 @@ By default, PySpark requires `python` to be available on the system `PATH` and u

 All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.net/), are bundled with PySpark and automatically imported.

-Standalone PySpark applications should be run using the `bin/spark-submit` script, which automatically
-configures the Java and Python environment for running Spark.
-
-
 # Interactive Use

-The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options:
+The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line:

 {% highlight bash %}
 $ sbt/sbt assembly
@@ -83,20 +79,24 @@ The Python shell can be used explore data interactively and is a simple way to l
 {% endhighlight %}

 By default, the `bin/pyspark` shell creates SparkContext that runs applications locally on all of
-your machine's logical cores.
-To connect to a non-local cluster, or to specify a number of cores, set the `MASTER` environment variable.
-For example, to use the `bin/pyspark` shell with a [standalone Spark cluster](spark-standalone.html):
+your machine's logical cores. To connect to a non-local cluster, or to specify a number of cores,
+set the `--master` flag. For example, to use the `bin/pyspark` shell with a
+[standalone Spark cluster](spark-standalone.html):

 {% highlight bash %}
-$ MASTER=spark://IP:PORT ./bin/pyspark
+$ ./bin/pyspark --master spark://1.2.3.4:7077
 {% endhighlight %}

 Or, to use exactly four cores on the local machine:

 {% highlight bash %}
-$ MASTER=local[4] ./bin/pyspark
+$ ./bin/pyspark --master local[4]
 {% endhighlight %}

+Under the hood `bin/pyspark` is a wrapper around the
+[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit), so these
+two scripts share the same list of options. For a complete list of options, run `bin/pyspark` with
+the `--help` option.

 ## IPython

@@ -115,13 +115,14 @@ the [IPython Notebook](http://ipython.org/notebook.html) with PyLab graphing sup
 $ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
 {% endhighlight %}

-IPython also works on a cluster or on multiple cores if you set the `MASTER` environment variable.
+IPython also works on a cluster or on multiple cores if you set the `--master` flag.


 # Standalone Programs

-PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `bin/spark-submit`.
-The Quick Start guide includes a [complete example](quick-start.html#standalone-applications) of a standalone Python application.
+PySpark can also be used from standalone Python scripts by creating a SparkContext in your script
+and running the script using `bin/spark-submit`. The Quick Start guide includes a
+[complete example](quick-start.html#standalone-applications) of a standalone Python application.

 Code dependencies can be deployed by passing .zip or .egg files in the `--py-files` option of `spark-submit`:

@@ -138,6 +139,7 @@ You can set [configuration properties](configuration.html#spark-properties) by p
 {% highlight python %}
 from pyspark import SparkConf, SparkContext
 conf = (SparkConf()
+         .setMaster("local")
          .setAppName("My app")
          .set("spark.executor.memory", "1g"))
 sc = SparkContext(conf = conf)
@@ -164,6 +166,6 @@ some example applications.
 PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
 You can run them by passing the files to `pyspark`; e.g.:

-    ./bin/spark-submit examples/src/main/python/wordcount.py local[2] README.md
+    ./bin/spark-submit examples/src/main/python/wordcount.py README.md

-Each program prints usage help when run without arguments.
+Each program prints usage help when run without the sufficient arguments.
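The guide's configuration snippet only gains a `.setMaster("local")` line, but put together, a standalone script configured this way is a short file. A self-contained sketch under the documented pattern — the file name `my_app.py` and the app name are placeholders, not taken from the commit:

```python
# my_app.py -- hypothetical standalone script; submit with:
#   ./bin/spark-submit my_app.py
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local")   # optional; if omitted, the master passed to spark-submit is used
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)

print sc.parallelize([1, 2, 3, 4]).sum()
```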

examples/src/main/python/als.py

Lines changed: 9 additions & 9 deletions
@@ -46,15 +46,15 @@ def update(i, vec, mat, ratings):
     return np.linalg.solve(XtX, Xty)

 if __name__ == "__main__":
-    if len(sys.argv) < 2:
-        print >> sys.stderr, "Usage: als <master> <M> <U> <F> <iters> <slices>"
-        exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonALS", pyFiles=[realpath(__file__)])
-    M = int(sys.argv[2]) if len(sys.argv) > 2 else 100
-    U = int(sys.argv[3]) if len(sys.argv) > 3 else 500
-    F = int(sys.argv[4]) if len(sys.argv) > 4 else 10
-    ITERATIONS = int(sys.argv[5]) if len(sys.argv) > 5 else 5
-    slices = int(sys.argv[6]) if len(sys.argv) > 6 else 2
+    """
+    Usage: als [M] [U] [F] [iterations] [slices]"
+    """
+    sc = SparkContext(appName="PythonALS")
+    M = int(sys.argv[1]) if len(sys.argv) > 1 else 100
+    U = int(sys.argv[2]) if len(sys.argv) > 2 else 500
+    F = int(sys.argv[3]) if len(sys.argv) > 3 else 10
+    ITERATIONS = int(sys.argv[4]) if len(sys.argv) > 4 else 5
+    slices = int(sys.argv[5]) if len(sys.argv) > 5 else 2

     print "Running ALS with M=%d, U=%d, F=%d, iters=%d, slices=%d\n" % \
           (M, U, F, ITERATIONS, slices)

examples/src/main/python/kmeans.py

Lines changed: 6 additions & 6 deletions
@@ -45,14 +45,14 @@ def closestPoint(p, centers):


 if __name__ == "__main__":
-    if len(sys.argv) < 5:
-        print >> sys.stderr, "Usage: kmeans <master> <file> <k> <convergeDist>"
+    if len(sys.argv) != 4:
+        print >> sys.stderr, "Usage: kmeans <file> <k> <convergeDist>"
         exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonKMeans")
-    lines = sc.textFile(sys.argv[2])
+    sc = SparkContext(appName="PythonKMeans")
+    lines = sc.textFile(sys.argv[1])
     data = lines.map(parseVector).cache()
-    K = int(sys.argv[3])
-    convergeDist = float(sys.argv[4])
+    K = int(sys.argv[2])
+    convergeDist = float(sys.argv[3])

     kPoints = data.takeSample(False, K, 1)
     tempDist = 1.0

examples/src/main/python/logistic_regression.py

Lines changed: 5 additions & 5 deletions
@@ -47,12 +47,12 @@ def readPointBatch(iterator):
     return [matrix]

 if __name__ == "__main__":
-    if len(sys.argv) != 4:
-        print >> sys.stderr, "Usage: logistic_regression <master> <file> <iters>"
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: logistic_regression <file> <iterations>"
         exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonLR", pyFiles=[realpath(__file__)])
-    points = sc.textFile(sys.argv[2]).mapPartitions(readPointBatch).cache()
-    iterations = int(sys.argv[3])
+    sc = SparkContext(appName="PythonLR")
+    points = sc.textFile(sys.argv[1]).mapPartitions(readPointBatch).cache()
+    iterations = int(sys.argv[2])

     # Initialize w to a random value
     w = 2 * np.random.ranf(size=D) - 1

examples/src/main/python/mllib/kmeans.py

Lines changed: 5 additions & 5 deletions
@@ -33,12 +33,12 @@ def parseVector(line):


 if __name__ == "__main__":
-    if len(sys.argv) < 4:
-        print >> sys.stderr, "Usage: kmeans <master> <file> <k>"
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: kmeans <file> <k>"
         exit(-1)
-    sc = SparkContext(sys.argv[1], "KMeans")
-    lines = sc.textFile(sys.argv[2])
+    sc = SparkContext(appName="KMeans")
+    lines = sc.textFile(sys.argv[1])
     data = lines.map(parseVector)
-    k = int(sys.argv[3])
+    k = int(sys.argv[2])
     model = KMeans.train(data, k)
     print "Final centers: " + str(model.clusterCenters)

examples/src/main/python/mllib/logistic_regression.py

Lines changed: 5 additions & 5 deletions
@@ -39,12 +39,12 @@ def parsePoint(line):


 if __name__ == "__main__":
-    if len(sys.argv) != 4:
-        print >> sys.stderr, "Usage: logistic_regression <master> <file> <iters>"
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: logistic_regression <file> <iterations>"
         exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonLR")
-    points = sc.textFile(sys.argv[2]).map(parsePoint)
-    iterations = int(sys.argv[3])
+    sc = SparkContext(appName="PythonLR")
+    points = sc.textFile(sys.argv[1]).map(parsePoint)
+    iterations = int(sys.argv[2])
     model = LogisticRegressionWithSGD.train(points, iterations)
     print "Final weights: " + str(model.weights)
     print "Final intercept: " + str(model.intercept)

examples/src/main/python/pagerank.py

Lines changed: 5 additions & 5 deletions
@@ -36,19 +36,19 @@ def parseNeighbors(urls):


 if __name__ == "__main__":
-    if len(sys.argv) < 3:
-        print >> sys.stderr, "Usage: pagerank <master> <file> <number_of_iterations>"
+    if len(sys.argv) != 3:
+        print >> sys.stderr, "Usage: pagerank <file> <iterations>"
         exit(-1)

     # Initialize the spark context.
-    sc = SparkContext(sys.argv[1], "PythonPageRank")
+    sc = SparkContext(appName="PythonPageRank")

     # Loads in input file. It should be in format of:
     # URL neighbor URL
     # URL neighbor URL
     # URL neighbor URL
     # ...
-    lines = sc.textFile(sys.argv[2], 1)
+    lines = sc.textFile(sys.argv[1], 1)

     # Loads all URLs from input file and initialize their neighbors.
     links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()
@@ -57,7 +57,7 @@ def parseNeighbors(urls):
     ranks = links.map(lambda (url, neighbors): (url, 1.0))

     # Calculates and updates URL ranks continuously using PageRank algorithm.
-    for iteration in xrange(int(sys.argv[3])):
+    for iteration in xrange(int(sys.argv[2])):
         # Calculates URL contributions to the rank of other URLs.
         contribs = links.join(ranks).flatMap(lambda (url, (urls, rank)):
             computeContribs(urls, rank))
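For context, the loop shown above distributes each page's rank over its outgoing links and then re-aggregates. The hunk ends before the `computeContribs` helper and the rank-update step, so the following is only a sketch of how that logic is conventionally written in PySpark PageRank examples, not the file's exact contents:

```python
from operator import add

def computeContribs(urls, rank):
    """Distribute a page's rank evenly over its neighbors (sketch, not the file's body)."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)

# A typical update paired with the flatMap shown in the diff (assumed, not shown there):
#   ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)
```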

examples/src/main/python/pi.py

Lines changed: 5 additions & 5 deletions
@@ -23,11 +23,11 @@


 if __name__ == "__main__":
-    if len(sys.argv) == 1:
-        print >> sys.stderr, "Usage: pi <master> [<slices>]"
-        exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonPi")
-    slices = int(sys.argv[2]) if len(sys.argv) > 2 else 2
+    """
+    Usage: pi [slices]
+    """
+    sc = SparkContext(appName="PythonPi")
+    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 2
     n = 100000 * slices
     def f(_):
         x = random() * 2 - 1
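The hunk above only covers the argument handling. As a rough, self-contained illustration of what the updated example boils down to — a sketch in the same style, not the exact file — a Monte Carlo pi estimate under the new `spark-submit`-driven entry point looks like this:

```python
import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    """
    Usage: pi [slices]
    """
    sc = SparkContext(appName="PythonPi")
    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * slices

    def f(_):
        # Sample a point in the unit square; count it if it lands inside the unit circle.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(xrange(1, n + 1), slices).map(f).reduce(add)
    print "Pi is roughly %f" % (4.0 * count / n)
```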

examples/src/main/python/sort.py

Lines changed: 4 additions & 4 deletions
@@ -21,11 +21,11 @@


 if __name__ == "__main__":
-    if len(sys.argv) < 3:
-        print >> sys.stderr, "Usage: sort <master> <file>"
+    if len(sys.argv) != 2:
+        print >> sys.stderr, "Usage: sort <file>"
         exit(-1)
-    sc = SparkContext(sys.argv[1], "PythonSort")
-    lines = sc.textFile(sys.argv[2], 1)
+    sc = SparkContext(appName="PythonSort")
+    lines = sc.textFile(sys.argv[1], 1)
     sortedCount = lines.flatMap(lambda x: x.split(' ')) \
         .map(lambda x: (int(x), 1)) \
         .sortByKey(lambda x: x)
