Commit 83ce29b

Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
2 parents: 52e6752 + f10de04

File tree

3 files changed: +135 -20 lines changed


core/src/main/scala/org/apache/spark/util/Utils.scala

Lines changed: 2 additions & 0 deletions
@@ -1149,6 +1149,8 @@ private[spark] object Utils extends Logging {
     try {
       f
     } catch {
+      case ct: ControlThrowable =>
+        throw ct
       case t: Throwable =>
         logError(s"Uncaught exception in thread ${Thread.currentThread().getName}", t)
         throw t
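
Background on the `Utils.scala` change above: Scala implements `break` and non-local `return` as exceptions that extend `ControlThrowable`, so a catch-all `Throwable` handler would report ordinary control flow as an uncaught error. The sketch below only illustrates the pattern; the wrapper name `withErrorLogging` and the `println` logging are placeholders for illustration, not the actual Spark code.

    import scala.util.control.{Breaks, ControlThrowable}

    // Hypothetical wrapper (illustrative only): log genuinely uncaught exceptions,
    // but let Scala's flow-control throwables pass through untouched.
    def withErrorLogging[T](f: => T): T = {
      try {
        f
      } catch {
        case ct: ControlThrowable =>
          // break / non-local return are implemented as ControlThrowable; rethrow as-is
          throw ct
        case t: Throwable =>
          println(s"Uncaught exception in thread ${Thread.currentThread().getName}: $t")
          throw t
      }
    }

    // Without the ControlThrowable case, this break would be logged as an
    // "uncaught exception" even though it is ordinary control flow.
    val loop = new Breaks
    loop.breakable {
      withErrorLogging {
        loop.break()
      }
    }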

docs/python-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -121,7 +121,7 @@ IPython also works on a cluster or on multiple cores if you set the `MASTER` env
 # Standalone Programs
 
 PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `bin/spark-submit`.
-The Quick Start guide includes a [complete example](quick-start.html#a-standalone-app-in-python) of a standalone Python application.
+The Quick Start guide includes a [complete example](quick-start.html#standalone-applications) of a standalone Python application.
 
 Code dependencies can be deployed by passing .zip or .egg files in the `--py-files` option of `spark-submit`:

docs/quick-start.md

Lines changed: 132 additions & 19 deletions
@@ -6,7 +6,9 @@ title: Quick Start
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone applications in Scala, Java, and Python.
+This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
+interactive shell (in Python or Scala),
+then show how to write standalone applications in Java, Scala, and Python.
 See the [programming guide](scala-programming-guide.html) for a more complete reference.
 
 To follow along with this guide, first download a packaged release of Spark from the
@@ -17,8 +19,12 @@ you can download a package for any version of Hadoop.
 
 ## Basics
 
-Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively.
-Start the shell by running the following in the Spark directory.
+Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively.
+It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries)
+or Python. Start it by running the following in the Spark directory:
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
 
     ./bin/spark-shell
 
@@ -33,7 +39,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values
 
 {% highlight scala %}
 scala> textFile.count() // Number of items in this RDD
-res0: Long = 74
+res0: Long = 126
 
 scala> textFile.first() // First item in this RDD
 res1: String = # Apache Spark
@@ -53,12 +59,53 @@ scala> textFile.filter(line => line.contains("Spark")).count() // How many lines
 res3: Long = 15
 {% endhighlight %}
 
+</div>
+<div data-lang="python" markdown="1">
+
+    ./bin/pyspark
+
+Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
+
+{% highlight python %}
+>>> textFile = sc.textFile("README.md")
+{% endhighlight %}
+
+RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+
+{% highlight python %}
+>>> textFile.count() # Number of items in this RDD
+126
+
+>>> textFile.first() # First item in this RDD
+u'# Apache Spark'
+{% endhighlight %}
+
+Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+
+{% highlight python %}
+>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
+{% endhighlight %}
+
+We can chain together transformations and actions:
+
+{% highlight python %}
+>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?
+15
+{% endhighlight %}
+
+</div>
+</div>
+
+
 ## More on RDD Operations
 RDD actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:
 
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
 {% highlight scala %}
 scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
-res4: Long = 16
+res4: Long = 15
 {% endhighlight %}
 
 This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We'll use `Math.max()` function to make this code easier to understand:
@@ -68,26 +115,69 @@ scala> import java.lang.Math
 import java.lang.Math
 
 scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
-res5: Int = 16
+res5: Int = 15
 {% endhighlight %}
 
 One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
 
 {% highlight scala %}
 scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
-wordCounts: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
+wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
 {% endhighlight %}
 
 Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
 
 {% highlight scala %}
 scala> wordCounts.collect()
-res6: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
+res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
 {% endhighlight %}
 
+</div>
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
+15
+{% endhighlight %}
+
+This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Python [anonymous functions (lambdas)](https://docs.python.org/2/reference/expressions.html#lambda),
+but we can also pass any top-level Python function we want.
+For example, we'll define a `max` function to make this code easier to understand:
+
+{% highlight python %}
+>>> def max(a, b):
+...     if a > b:
+...         return a
+...     else:
+...         return b
+...
+
+>>> textFile.map(lambda line: len(line.split())).reduce(max)
+15
+{% endhighlight %}
+
+One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
+
+{% highlight python %}
+>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
+{% endhighlight %}
+
+Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+
+{% highlight python %}
+>>> wordCounts.collect()
+[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
+{% endhighlight %}
+
+</div>
+</div>
+
 ## Caching
 Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small "hot" dataset or when running an iterative algorithm like PageRank. As a simple example, let's mark our `linesWithSpark` dataset to be cached:
 
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
 {% highlight scala %}
 scala> linesWithSpark.cache()
 res7: spark.RDD[String] = spark.FilteredRDD@17e51082
@@ -99,12 +189,33 @@ scala> linesWithSpark.count()
 res9: Long = 15
 {% endhighlight %}
 
-It may seem silly to use Spark to explore and cache a 30-line text file. The interesting part is
+It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
 a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
 
-# A Standalone Application
+</div>
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+>>> linesWithSpark.cache()
+
+>>> linesWithSpark.count()
+15
+
+>>> linesWithSpark.count()
+15
+{% endhighlight %}
+
+It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
+that these same functions can be used on very large data sets, even when they are striped across
+tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
+a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+
+</div>
+</div>
+
+# Standalone Applications
 Now say we wanted to write a standalone application using the Spark API. We will walk through a
 simple application in both Scala (with SBT), Java (with Maven), and Python.
 
@@ -115,7 +226,7 @@ We'll create a very simple Spark application in Scala. So simple, in fact, that
 named `SimpleApp.scala`:
 
 {% highlight scala %}
-/*** SimpleApp.scala ***/
+/* SimpleApp.scala */
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf
@@ -194,7 +305,7 @@ This example will use Maven to compile an application jar, but any similar build
 We'll create a very simple Spark application, `SimpleApp.java`:
 
 {% highlight java %}
-/*** SimpleApp.java ***/
+/* SimpleApp.java */
 import org.apache.spark.api.java.*;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.function.Function;
@@ -309,24 +420,26 @@ Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark i
 As with the Scala and Java examples, we use a SparkContext to create RDDs.
 We can pass Python functions to Spark, which are automatically serialized along with any variables
 that they reference.
-For applications that use custom classes or third-party libraries, we can add those code
-dependencies to SparkContext to ensure that they will be available on remote machines; this is
-described in more detail in the [Python programming guide](python-programming-guide.html).
+For applications that use custom classes or third-party libraries, we can also add code
+dependencies to `spark-submit` through its `--py-files` argument by packaging them into a
+.zip file (see `spark-submit --help` for details).
 `SimpleApp` is simple enough that we do not need to specify any code dependencies.
 
-We can run this application using the `bin/pyspark` script:
+We can run this application using the `bin/spark-submit` script:
 
 {% highlight python %}
-$ cd $SPARK_HOME
-$ ./bin/pyspark SimpleApp.py
+# Use spark-submit to run your application
+$ YOUR_SPARK_HOME/bin/spark-submit \
+  --master local[4] \
+  SimpleApp.py
 ...
 Lines with a: 46, Lines with b: 23
 {% endhighlight python %}
 
 </div>
 </div>
 
-# Where to go from here
+# Where to Go from Here
 Congratulations on running your first Spark application!
 
 * For an in-depth overview of the API see "Programming Guides" menu section.
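
For reference, the complete `SimpleApp.scala` that this section of the guide builds up looks roughly like the sketch below. This is a reconstruction under assumptions: `YOUR_SPARK_HOME` is a placeholder exactly as in the guide, and the counts in the comment are the example output quoted above.

    /* SimpleApp.scala -- a sketch of the standalone application described in the guide */
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val logFile = "YOUR_SPARK_HOME/README.md"  // placeholder path, as in the guide
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        // Count lines containing "a" and "b"; the guide's sample run prints
        // "Lines with a: 46, Lines with b: 23" for the bundled README.md
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }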
