Commit 0edb093

joel-hamill authored and Suzanne Scala committed

[DOCS-1985] [WIP] Spark doc fixup (apache#156)

* [WIP] Spark doc fixup
* More
* Remove extraneous files
* Feedback
* Added missing quotation per DOCS-1989
* Feedback from Michael
* Corbin's feedback

1 parent 00f45be · 19 files changed (+310, −224 lines)

.gitignore (1 addition, 0 deletions)

````diff
@@ -1,3 +1,4 @@
+.idea/
 .cache/
 build/
 dcos-commons-tools/
````

docs/custom-docker.md (3 additions, 3 deletions)

````diff
@@ -19,9 +19,9 @@ You can customize the Docker image in which Spark runs by extending the standard
 
 1. Then, build an image from your Dockerfile.
 
-        $ docker build -t username/image:tag .
-        $ docker push username/image:tag
+        docker build -t username/image:tag .
+        docker push username/image:tag
 
 1. Reference your custom Docker image with the `--docker-image` option when running a Spark job.
 
-        $ dcos spark run --docker-image=myusername/myimage:v1 --submit-args="http://external.website/mysparkapp.py 30"
+        dcos spark run --docker-image=myusername/myimage:v1 --submit-args="http://external.website/mysparkapp.py 30"
````
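The workflow in the diff above starts from a Dockerfile that extends a standard Spark image. A minimal sketch, assuming a `mesosphere/spark` base image and an illustrative extra dependency (both are placeholders, not taken from this commit):

```shell
# Write a hypothetical Dockerfile that extends a base Spark image.
# The base image tag and the added package are assumptions for illustration.
mkdir -p /tmp/custom-spark
cd /tmp/custom-spark

cat > Dockerfile <<'EOF'
FROM mesosphere/spark:latest
# Example only: add a dependency your Spark jobs need at runtime
RUN apt-get update && apt-get install -y python3
EOF

# The doc's build-and-push steps would then be (not executed here):
echo "docker build -t username/image:tag ."
echo "docker push username/image:tag"
```

The image must be pushed to a registry the cluster can reach before `--docker-image` can reference it.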

docs/fault-tolerance.md (3 additions, 3 deletions)

````diff
@@ -7,13 +7,13 @@ enterprise: 'no'
 
 Failures such as host, network, JVM, or application failures can affect the behavior of three types of Spark components:
 
-- DC/OS Spark Service
+- DC/OS Apache Spark Service
 - Batch Jobs
 - Streaming Jobs
 
-# DC/OS Spark Service
+# DC/OS Apache Spark Service
 
-The DC/OS Spark service runs in Marathon and includes the Mesos Cluster Dispatcher and the Spark History Server. The Dispatcher manages jobs you submit via `dcos spark run`. Job data is persisted to Zookeeper. The Spark History Server reads event logs from HDFS. If the service dies, Marathon will restart it, and it will reload data from these highly available stores.
+The DC/OS Apache Spark service runs in Marathon and includes the Mesos Cluster Dispatcher and the Spark History Server. The Dispatcher manages jobs you submit via `dcos spark run`. Job data is persisted to Zookeeper. The Spark History Server reads event logs from HDFS. If the service dies, Marathon will restart it, and it will reload data from these highly available stores.
 
 # Batch Jobs
 
````

docs/hdfs.md (19 additions, 16 deletions)

````diff
@@ -5,16 +5,19 @@ menu_order: 20
 enterprise: 'no'
 ---
 
-To configure Spark for a specific HDFS cluster, configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`. For example:
+You can configure Spark for a specific HDFS cluster.
 
-    {
-      "hdfs": {
-        "config-url": "http://mydomain.com/hdfs-config"
-      }
-    }
+To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`, use this example where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
 
+```json
+{
+  "hdfs": {
+    "config-url": "http://mydomain.com/hdfs-config"
+  }
+}
+```
 
-where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs.[Learn more][8].
+For more information, see [Inheriting Hadoop Cluster Configuration][8].
 
 For DC/OS HDFS, these configuration files are served at `http://<hdfs.framework-name>.marathon.mesos:<port>/v1/connection`, where `<hdfs.framework-name>` is a configuration variable set in the HDFS package, and `<port>` is the port of its marathon app.
 
@@ -24,13 +27,13 @@ You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spa
 
 ## HDFS Configuration
 
-Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
+After you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
 
 ## Installation
 
-1. A krb5.conf file tells Spark how to connect to your KDC. Base64 encode this file:
+1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
 
-        $ cat krb5.conf | base64
+        cat krb5.conf | base64
 
 1. Add the following to your JSON configuration file to enable Kerberos in Spark:
 
@@ -42,11 +45,11 @@ Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect t
       }
     }
 
-1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Spark.
+1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Apache Spark.
 
     Base64 encode your keytab:
 
-        $ cat spark.keytab | base64
+        cat spark.keytab | base64
 
     And add the following to your configuration file:
 
@@ -61,25 +64,25 @@ Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect t
 
 1. Install Spark with your custom configuration, here called `options.json`:
 
-        $ dcos package install --options=options.json spark
+        dcos package install --options=options.json spark
 
 ## Job Submission
 
-To authenticate to a Kerberos KDC, DC/OS Spark supports keytab files as well as ticket-granting tickets (TGTs).
+To authenticate to a Kerberos KDC, DC/OS Apache Spark supports keytab files as well as ticket-granting tickets (TGTs).
 
 Keytabs are valid infinitely, while tickets can expire. Especially for long-running streaming jobs, keytabs are recommended.
 
 ### Keytab Authentication
 
 Submit the job with the keytab:
 
-        $ dcos spark run --submit-args="--principal user@REALM --keytab <keytab-file-path>..."
+        dcos spark run --submit-args="--principal user@REALM --keytab <keytab-file-path>..."
 
 ### TGT Authentication
 
 Submit the job with the ticket:
 
-        $ dcos spark run --principal user@REALM --tgt <ticket-file-path>
+        dcos spark run --principal user@REALM --tgt <ticket-file-path>
 
 **Note:** These credentials are security-critical. We highly recommended configuring SSL encryption between the Spark components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.
 
````
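The `cat krb5.conf | base64` step above can be exercised locally with a dummy `krb5.conf` (realm and KDC host are placeholders); decoding the result should round-trip back to the original file:

```shell
# Create a placeholder krb5.conf (values are illustrative only)
cat > krb5.conf <<'EOF'
[libdefaults]
  default_realm = EXAMPLE.COM

[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com
  }
EOF

# Encode as a single line, suitable for pasting into a JSON options file
KRB5_B64="$(base64 < krb5.conf | tr -d '\n')"

# Sanity check: decoding must reproduce the original file byte-for-byte
printf '%s' "$KRB5_B64" | base64 -d | cmp -s - krb5.conf && echo "round-trip OK"
```

Stripping the newlines that `base64` inserts keeps the encoded value on one line, which is what a JSON string field expects.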

docs/history-server.md (9 additions, 9 deletions)

````diff
@@ -4,18 +4,18 @@ menu_order: 30
 enterprise: 'no'
 ---
 
-DC/OS Spark includes The [Spark History Server][3]. Because the history server requires HDFS, you must explicitly enable it.
+DC/OS Apache Spark includes The [Spark History Server][3]. Because the history server requires HDFS, you must explicitly enable it.
 
 1. Install HDFS:
 
-        $ dcos package install hdfs
+        dcos package install hdfs
 
     **Note:** HDFS requires 5 private nodes.
 
 1. Create a history HDFS directory (default is `/history`). [SSH into your cluster][10] and run:
 
-        $ docker run -it mesosphere/hdfs-client:1.0.0-2.6.0 bash
-        $ ./bin/hdfs dfs -mkdir /history
+        docker run -it mesosphere/hdfs-client:1.0.0-2.6.0 bash
+        ./bin/hdfs dfs -mkdir /history
 
 1. Create `spark-history-options.json`:
 
@@ -25,26 +25,26 @@ DC/OS Spark includes The [Spark History Server][3]. Because the history server r
 
 1. Install The Spark History Server:
 
-        $ dcos package install spark-history --options=spark-history-options.json
+        dcos package install spark-history --options=spark-history-options.json
 
 1. Create `spark-dispatcher-options.json`;
 
     {
      "service": {
-       "spark-history-server-url": "http://<dcos_url>/service/spark-history
+       "spark-history-server-url": "http://<dcos_url>/service/spark-history"
      },
      "hdfs": {
        "config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
      }
     }
 
-1. Install The Spark Dispatcher:
+1. Install the Spark dispatcher:
 
-        $ dcos package install spark --options=spark-dispatcher-options.json
+        dcos package install spark --options=spark-dispatcher-options.json
 
 1. Run jobs with the event log enabled:
 
-        $ dcos spark run --submit-args="--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://hdfs/history ... --class MySampleClass http://external.website/mysparkapp.jar"
+        dcos spark run --submit-args="--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://hdfs/history ... --class MySampleClass http://external.website/mysparkapp.jar"
 
 1. Visit your job in the dispatcher at `http://<dcos_url>/service/spark/`. It will include a link to the history server entry for that job.
 
````
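One of the changes above (per DOCS-1989) adds a missing closing quotation mark to `spark-dispatcher-options.json`; with that fix the file parses as valid JSON, which can be checked locally:

```shell
# Reproduce the corrected spark-dispatcher-options.json from the diff
cat > spark-dispatcher-options.json <<'EOF'
{
  "service": {
    "spark-history-server-url": "http://<dcos_url>/service/spark-history"
  },
  "hdfs": {
    "config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
  }
}
EOF

# Any JSON validator works; python3's json.tool is used here as one option
python3 -m json.tool spark-dispatcher-options.json > /dev/null && echo "valid JSON"
```

Without the closing quote, the same check fails with a parse error, which is the bug the commit fixes.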

docs/img/spark-gui-install.png (42.3 KB, added)

docs/index.md (7 additions, 5 deletions)

````diff
@@ -7,20 +7,22 @@ feature_maturity: stable
 enterprise: 'no'
 ---
 
+Welcome to the documentation for the DC/OS Apache Spark. For more information about new and changed features, see the [release notes](https://github.com/mesosphere/spark-build/releases/).
+
 Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see the [Apache Spark documentation][1].
 
-Apache DC/OS Spark consists of [Apache Spark with a few custom commits][17] along with [DC/OS-specific packaging][18].
+DC/OS Apache Spark consists of [Apache Spark with a few custom commits][17] along with [DC/OS-specific packaging][18].
 
-DC/OS Spark includes:
+DC/OS Apache Spark includes:
 
 * [Mesos Cluster Dispatcher][2]
 * [Spark History Server][3]
-* DC/OS Spark CLI
+* DC/OS Apache Spark CLI
 * Interactive Spark shell
 
 # Benefits
 
-* Utilization: DC/OS Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
+* Utilization: DC/OS Apache Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
 * Improved efficiency
 * Simple Management
 * Multi-team support
@@ -52,4 +54,4 @@ DC/OS Spark includes:
 [5]: https://docs.mesosphere.com/service-docs/kafka/
 [6]: https://zeppelin.incubator.apache.org/
 [17]: https://github.com/mesosphere/spark
-[18]: https://github.com/mesosphere/spark-build
+[18]: https://github.com/mesosphere/spark-build
````

docs/install.md (75 additions, 43 deletions)

````diff
@@ -5,98 +5,130 @@ feature_maturity: stable
 enterprise: 'no'
 ---
 
-Spark is available in the Universe and can be installed by using either the web interface or the DC/OS CLI.
+Spark is available in the Universe and can be installed by using either the GUI or the DC/OS CLI.
 
-## <a name="install-enterprise"></a>Prerequisites
+**Prerequisites:**
 
-- Depending on your security mode in Enterprise DC/OS, you may [need to provision a service account](https://docs.mesosphere.com/service-docs/spark/spark-auth/) before installing Spark. Only someone with `superuser` permission can create the service account.
-  - `strict` [security mode](https://docs.mesosphere.com/1.9/installing/custom/configuration-parameters/#security) requires a service account.
-  - `permissive` security mode a service account is optional.
-  - `disabled` security mode does not require a service account.
+- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.9/installing/).
+- Depending on your [security mode](https://docs.mesosphere.com/1.9/overview/security/security-modes/), Spark requires service authentication for access to DC/OS. For more information, see [Configuring DC/OS Access for Spark](https://docs.mesosphere.com/service-docs/spark/spark-auth/).
+
+  | Security mode | Service Account |
+  |---------------|-----------------|
+  | Disabled      | Not available   |
+  | Permissive    | Optional        |
+  | Strict        | Required        |
 
 # Default Installation
+To install the DC/OS Apache Spark service, run the following command on the DC/OS CLI. This installs the Spark DC/OS service, Spark CLI, dispatcher, and, optionally, the history server. See [Custom Installation][7] to install the history server.
 
-To start a basic Spark cluster, run the following command on the DC/OS CLI.
+```bash
+dcos package install spark
+```
 
-    $ dcos package install spark
+Go to the **Services** > **Deployments** tab of the DC/OS GUI to monitor the deployment. When it has finished deploying, visit Spark at `http://<dcos-url>/service/spark/`.
 
-This command installs the dispatcher, and, optionally, the history server. See [Custom Installation][7] to install the history server.
+You can also [install Spark via the DC/OS GUI](https://docs.mesosphere.com/1.9/usage/webinterface/#universe).
 
-Go to the **Services** > **Deployments** tab of the DC/OS web interface to monitor the deployment. Once it is
-complete, visit Spark at `http://<dcos-url>/service/spark/`.
 
-You can also [install Spark via the DC/OS web interface](https://docs.mesosphere.com/1.9/usage/webinterface/#universe).
+## Spark CLI
+You can install the Spark CLI with this command. This is useful if you already have a Spark cluster running, but need the Spark CLI.
 
-**Note:** If you install Spark via the web interface, run the following command from the DC/OS CLI to install the Spark CLI:
+**Important:** If you install Spark via the DC/OS GUI, you must install the Spark CLI as a separate step from the DC/OS CLI.
 
-    $ dcos package install spark --cli
+```bash
+dcos package install spark --cli
+```
 
 <a name="custom"></a>
 
 # Custom Installation
 
 You can customize the default configuration properties by creating a JSON options file and passing it to `dcos package install --options`. For example, to install the history server, create a file called `options.json`:
 
-    {
-      "history-server": {
-        "enabled": true
-      }
-    }
+```json
+{
+  "history-server": {
+    "enabled": true
+  }
+}
+```
 
-Then, install Spark with your custom configuration:
+Install Spark with the configuration specified in the `options.json` file:
 
-    $ dcos package install --options=options.json spark
+```bash
+dcos package install --options=options.json spark
+```
 
-Run the following command to see all configuration options:
+**Tip:** Run this command to see all configuration options:
 
-    $ dcos package describe spark --config
+```bash
+dcos package describe spark --config
+```
 
 ## Customize Spark Distribution
 
-DC/OS Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built distributions, primarily used to select Hadoop versions. To use one of these distributions, first select your desired Spark distribution from [here](https://github.com/mesosphere/spark-build/blob/master/docs/spark-versions.md), then select the corresponding docker image from [here](https://hub.docker.com/r/mesosphere/spark/tags/), then use those values to set the following configuration variables:
+DC/OS Apache Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built distributions, primarily used to select Hadoop versions.
 
-    {
-      "service": {
-        "spark-dist-uri": "<spark-dist-uri>"
-        "docker-image": "<docker-image>"
-      }
-    }
+To use one of these distributions, select your Spark distribution from [here](https://github.com/mesosphere/spark-build/blob/master/docs/spark-versions.md), then select the corresponding Docker image from [here](https://hub.docker.com/r/mesosphere/spark/tags/), then use those values to set the following configuration variables:
+
+```json
+{
+  "service": {
+    "spark-dist-uri": "<spark-dist-uri>"
+    "docker-image": "<docker-image>"
+  }
+}
+```
 
 # Minimal Installation
 
-For development purposes, you may wish to install Spark on a local DC/OS cluster. For this, you can use [dcos-vagrant][16].
+For development purposes, you can install Spark on a local DC/OS cluster. For this, you can use [dcos-vagrant][16].
 
 1. Install DC/OS Vagrant:
 
     Install a minimal DC/OS Vagrant according to the instructions [here][16].
 
 1. Install Spark:
 
-        $ dcos package install spark
+    ```bash
+    dcos package install spark
+    ```
 
 1. Run a simple Job:
 
-        $ dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar"
+    ```bash
+    dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar"
+    ```
 
-NOTE: A limited resource environment such as DC/OS Vagrant restricts some of the features available in DC/OS Spark. For example, unless you have enough resources to start up a 5-agent cluster, you will not be able to install DC/OS HDFS, and you thus won't be able to enable the history server.
+NOTE: A limited resource environment such as DC/OS Vagrant restricts some of the features available in DC/OS Apache Spark. For example, unless you have enough resources to start up a 5-agent cluster, you will not be able to install DC/OS HDFS, and you thus won't be able to enable the history server.
 
 Also, a limited resource environment can restrict how you size your executors, for example with `spark.executor.memory`.
 
 # Multiple Installations
 
-Installing multiple instances of the DC/OS Spark package provides basic multi-team support. Each dispatcher displays only the jobs submitted to it by a given team, and each team can be assigned different resources.
+Installing multiple instances of the DC/OS Apache Spark package provides basic multi-team support. Each dispatcher displays only the jobs submitted to it by a given team, and each team can be assigned different resources.
+
+To install multiple instances of the DC/OS Apache Spark package, set each `service.name` to a unique name (e.g.: `spark-dev`) in your JSON configuration file during installation. For example, create a JSON options file named `multiple.json`:
+
+```json
+{
+  "service": {
+    "name": "spark-dev"
+  }
+}
+```
 
-To install mutiple instances of the DC/OS Spark package, set each `service.name` to a unique name (e.g.: "spark-dev") in your JSON configuration file during installation:
+Install Spark with the options file specified:
 
-    {
-      "service": {
-        "name": "spark-dev"
-      }
-    }
+```bash
+dcos package install --options=multiple.json spark
+```
 
-To use a specific Spark instance from the DC/OS Spark CLI:
+Alternatively, you can specify a Spark instance directly from the CLI. For example:
 
-    $ dcos config set spark.app_id <service.name>
+```bash
+dcos config set spark.app_id spark-dev
+```
 
 [7]: #custom
 [16]: https://github.com/mesosphere/dcos-vagrant
````
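The options-file pattern above composes: assuming the package accepts several top-level sections in one file, the `history-server` and `service.name` examples from this page can be merged into a single options file (the service name `spark-dev` is the doc's own example):

```shell
# Combine the custom-install and multiple-install examples into one options file
cat > options.json <<'EOF'
{
  "service": {
    "name": "spark-dev"
  },
  "history-server": {
    "enabled": true
  }
}
EOF

# Validate the file before passing it to the installer
python3 -m json.tool options.json > /dev/null && echo "valid JSON"

# Then install with it (not executed here):
echo "dcos package install --options=options.json spark"
```

Validating locally catches problems like the missing-quote bug this commit fixes before `dcos package install` ever sees the file.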

docs/job-scheduling.md (1 addition, 1 deletion)

````diff
@@ -59,7 +59,7 @@ The following is a description of the most common Spark on Mesos scheduling prop
 <tr>
   <td>spark.executor.cores</td>
   <td>All available cores in the offer</td>
-  <td>Coarse-grained mode only. DC/OS Spark >= 1.6.1. Executor CPU allocation.</td>
+  <td>Coarse-grained mode only. DC/OS Apache Spark >= 1.6.1. Executor CPU allocation.</td>
 </tr>
 
 <tr>
````
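Like other Spark properties, `spark.executor.cores` is passed as a `--conf` inside `--submit-args`. A sketch that only assembles the command string (the core count, class name, and jar URL are illustrative placeholders in the style of the doc's examples):

```shell
# Build the submit command string; nothing is actually submitted here
SUBMIT_ARGS='--conf spark.executor.cores=4 --class MySampleClass http://external.website/mysparkapp.jar'
CMD="dcos spark run --submit-args=\"$SUBMIT_ARGS\""
echo "$CMD"
```

Per the table row above, the setting takes effect in coarse-grained mode only, on DC/OS Apache Spark >= 1.6.1.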
