
Commit ea70cac

Baohe Zhang authored and Dhruve Ashar committed
[YSPARK-1595] Move TestSparkPython to spark-starter (apache#22)
* Add spark python oozie example
* Minor update
* Delete script and update workflow
* parameterize spark_latest and remove unnecessary options
* Add README.md
1 parent 2fbd75a commit ea70cac

File tree: 4 files changed, +154 -0 lines changed


src/main/resources/data/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and structured
data processing, MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>


## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html)
and [project wiki](https://cwiki.apache.org/confluence/display/SPARK).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at
["Building Spark with Maven"](http://spark.apache.org/docs/latest/building-with-maven.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn-cluster" or "yarn-client" to run on YARN, and "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run all automated tests](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting).

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions. See also
["Third Party Hadoop Distributions"](http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html)
for guidance on building a Spark application that works with a particular
distribution.

## Configuration

Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
Instructions for running this Oozie application:

- Create a directory `spark_python/` in HDFS for the Oozie application.

- Upload `workflow.xml` to `spark_python/apps/spark/`.

- Upload the Python script `spark-starter/src/main/python/python_word_count.py` to `spark_python/apps/lib/`.

- Upload the resource file `spark-starter/src/main/resources/data/README.md` to `spark_python/data/`.

- Update `nameNode` and `jobTracker` in `job.properties` if you are running on a cluster other than AR.

- Export `OOZIE_URL`, for example `export OOZIE_URL=https://axonitered-oozie.red.ygrid.yahoo.com:4443/oozie/`.

- Submit the Oozie job using `oozie job -run -config job.properties -auth KERBEROS` (a consolidated command sketch follows these steps).
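
The steps above can be condensed into the following shell sketch. It assumes the `spark_python/` directories are created under your HDFS home directory (matching the relative `wfRoot` in `job.properties`) and that the `spark-starter` repository is checked out locally; adjust paths and the Oozie URL for your cluster.

    # Sketch only: relative HDFS paths resolve under your HDFS home directory.
    # Create the HDFS layout and upload the workflow, script, and input data.
    hdfs dfs -mkdir -p spark_python/apps/spark spark_python/apps/lib spark_python/data
    hdfs dfs -put workflow.xml spark_python/apps/spark/
    hdfs dfs -put spark-starter/src/main/python/python_word_count.py spark_python/apps/lib/
    hdfs dfs -put spark-starter/src/main/resources/data/README.md spark_python/data/

    # Point the Oozie client at the cluster and submit the workflow.
    export OOZIE_URL=https://axonitered-oozie.red.ygrid.yahoo.com:4443/oozie/
    oozie job -run -config job.properties -auth KERBEROS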
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
nameNode=hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020
jobTracker=axonitered-jt1.red.ygrid.yahoo.com:8032
wfRoot=spark_python
sparkTag=spark_latest
oozie.libpath=/user/${user.name}/${wfRoot}/apps/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${wfRoot}/apps/spark
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkPythonOozieTest'>
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>

    <start to='SparkPythonWordCount' />

    <action name='SparkPythonWordCount'>
        <spark xmlns="uri:oozie:spark-action:0.2">
            <configuration>
                <property>
                    <name>oozie.action.sharelib.for.spark</name>
                    <value>${sparkTag}</value>
                </property>
            </configuration>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>SparkPythonWordCount</name>
            <jar>python_word_count.py</jar>
            <spark-opts>--queue default</spark-opts>
            <arg>${wfRoot}/data/README.md</arg>
            <arg>${wfRoot}/output/python_word_count_output</arg>
        </spark>
        <ok to="end" />
        <error to="fail" />
    </action>

    <kill name="fail">
        <message>Workflow failed, error
            message[${wf:errorMessage(wf:lastErrorNode())}]
        </message>
    </kill>
    <end name='end' />
</workflow-app>
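
For reference, the `<jar>` element above points at `python_word_count.py`, which is uploaded separately and is not part of this change. A minimal PySpark word-count sketch that matches the workflow's two `<arg>` values (input path, then output path) could look like the following; the actual script under `spark-starter/src/main/python/` may differ.

    # Hypothetical sketch of python_word_count.py; see spark-starter/src/main/python/
    # for the real script shipped with this example.
    from __future__ import print_function

    import sys

    from pyspark import SparkContext

    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: python_word_count.py <input> <output>", file=sys.stderr)
            sys.exit(1)

        # The Oozie spark action passes the two <arg> values as argv[1] and argv[2].
        input_path, output_path = sys.argv[1], sys.argv[2]

        sc = SparkContext(appName="SparkPythonWordCount")
        counts = (sc.textFile(input_path)
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile(output_path)
        sc.stop()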
