
Commit ea70cac

Baohe Zhang authored and Dhruve Ashar committed
[YSPARK-1595] Move TestSparkPython to spark-starter (apache#22)
* Add spark python oozie example
* Minor update
* Delete script and update workflow
* parameterize spark_latest and remove unnecessary options
* Add README.md
1 parent 2fbd75a commit ea70cac

File tree: 4 files changed, +154 -0 lines changed


src/main/resources/data/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and structured
data processing, MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>


## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html)
and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at
["Building Spark with Maven"](http://spark.apache.org/docs/latest/building-with-maven.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn-cluster" or "yarn-client" to run on YARN, and "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run all automated tests](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting).

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions. See also
["Third Party Hadoop Distributions"](http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html)
for guidance on building a Spark application that works with a particular
distribution.

## Configuration

Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
Instructions for running this Oozie application:

- Create a directory `spark_python/` in HDFS for the Oozie application.

- Upload `workflow.xml` to `spark_python/apps/spark/`.

- Upload the Python script `spark-starter/src/main/python/python_word_count.py` to `spark_python/apps/lib/`.

- Upload the resource file `spark-starter/src/main/resources/data/README.md` to `spark_python/data/`.

- Update `nameNode` and `jobTracker` in `job.properties` if you are running on a cluster other than AR.

- Export `OOZIE_URL`, for example `export OOZIE_URL=https://axonitered-oozie.red.ygrid.yahoo.com:4443/oozie/`.

- Submit the Oozie job using `oozie job -run -config job.properties -auth KERBEROS` (a consolidated command sketch follows these steps).
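
The steps above can be condensed into the following shell sketch. It assumes the `spark_python/` directories are created under your HDFS home directory (matching the relative `wfRoot` in `job.properties`) and that the `spark-starter` repository is checked out locally; adjust paths and the Oozie URL for your cluster.

    # Sketch only: relative HDFS paths resolve under your HDFS home directory.
    # Create the HDFS layout and upload the workflow, script, and input data.
    hdfs dfs -mkdir -p spark_python/apps/spark spark_python/apps/lib spark_python/data
    hdfs dfs -put workflow.xml spark_python/apps/spark/
    hdfs dfs -put spark-starter/src/main/python/python_word_count.py spark_python/apps/lib/
    hdfs dfs -put spark-starter/src/main/resources/data/README.md spark_python/data/

    # Point the Oozie client at the cluster and submit the workflow.
    export OOZIE_URL=https://axonitered-oozie.red.ygrid.yahoo.com:4443/oozie/
    oozie job -run -config job.properties -auth KERBEROS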
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
nameNode=hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020
jobTracker=axonitered-jt1.red.ygrid.yahoo.com:8032
wfRoot=spark_python
sparkTag=spark_latest
oozie.libpath=/user/${user.name}/${wfRoot}/apps/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${wfRoot}/apps/spark
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkPythonOozieTest'>
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>

    <start to='SparkPythonWordCount' />

    <action name='SparkPythonWordCount'>
        <spark xmlns="uri:oozie:spark-action:0.2">
            <configuration>
                <property>
                    <name>oozie.action.sharelib.for.spark</name>
                    <value>${sparkTag}</value>
                </property>
            </configuration>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>SparkPythonWordCount</name>
            <jar>python_word_count.py</jar>
            <spark-opts>--queue default</spark-opts>
            <arg>${wfRoot}/data/README.md</arg>
            <arg>${wfRoot}/output/python_word_count_output</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>
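
For reference, the `<jar>` element above points at `python_word_count.py`, which is uploaded separately and is not part of this change. A minimal PySpark word-count sketch that matches the workflow's two `<arg>` values (input path, then output path) could look like the following; the actual script under `spark-starter/src/main/python/` may differ.

    # Hypothetical sketch of python_word_count.py; see spark-starter/src/main/python/
    # for the real script shipped with this example.
    from __future__ import print_function

    import sys

    from pyspark import SparkContext

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python_word_count.py <input> <output>", file=sys.stderr)
            sys.exit(1)

        # The Oozie spark action passes the two <arg> values as argv[1] and argv[2].
        input_path, output_path = sys.argv[1], sys.argv[2]

        sc = SparkContext(appName="SparkPythonWordCount")
        counts = (sc.textFile(input_path)
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile(output_path)
        sc.stop()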
