## docs/fault-tolerance.md (+3 −3)
```diff
--- a/docs/fault-tolerance.md
+++ b/docs/fault-tolerance.md
@@ -7,13 +7,13 @@ enterprise: 'no'
 
 Failures such as host, network, JVM, or application failures can affect the behavior of three types of Spark components:
 
-- DC/OS Spark Service
+- DC/OS Apache Spark Service
 - Batch Jobs
 - Streaming Jobs
 
-# DC/OS Spark Service
+# DC/OS Apache Spark Service
 
-The DC/OS Spark service runs in Marathon and includes the Mesos Cluster Dispatcher and the Spark History Server. The Dispatcher manages jobs you submit via `dcos spark run`. Job data is persisted to Zookeeper. The Spark History Server reads event logs from HDFS. If the service dies, Marathon will restart it, and it will reload data from these highly available stores.
+The DC/OS Apache Spark service runs in Marathon and includes the Mesos Cluster Dispatcher and the Spark History Server. The Dispatcher manages jobs you submit via `dcos spark run`. Job data is persisted to ZooKeeper. The Spark History Server reads event logs from HDFS. If the service dies, Marathon will restart it, and it will reload data from these highly available stores.
```
## docs/hdfs.md (+19 −16)
````diff
--- a/docs/hdfs.md
+++ b/docs/hdfs.md
@@ -5,16 +5,19 @@ menu_order: 20
 enterprise: 'no'
 ---
 
-To configure Spark for a specific HDFS cluster, configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`. For example:
+You can configure Spark for a specific HDFS cluster.
 
-    {
-      "hdfs": {
-        "config-url": "http://mydomain.com/hdfs-config"
-      }
-    }
+To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`, use this example, where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
 
+```json
+{
+  "hdfs": {
+    "config-url": "http://mydomain.com/hdfs-config"
+  }
+}
+```
 
-where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs. [Learn more][8].
+For more information, see [Inheriting Hadoop Cluster Configuration][8].
 
 For DC/OS HDFS, these configuration files are served at `http://<hdfs.framework-name>.marathon.mesos:<port>/v1/connection`, where `<hdfs.framework-name>` is a configuration variable set in the HDFS package, and `<port>` is the port of its marathon app.
 
@@ -24,13 +27,13 @@ You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spark.
 
 ## HDFS Configuration
 
-Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
+After you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
 
 ## Installation
 
-1. A krb5.conf file tells Spark how to connect to your KDC. Base64 encode this file:
+1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
 
-        $ cat krb5.conf | base64
+        cat krb5.conf | base64
 
 1. Add the following to your JSON configuration file to enable Kerberos in Spark:
 
@@ -42,11 +45,11 @@ Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it.
     }
     }
 
-1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Spark.
+1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Apache Spark.
 
     Base64 encode your keytab:
 
-        $ cat spark.keytab | base64
+        cat spark.keytab | base64
 
     And add the following to your configuration file:
 
@@ -61,25 +64,25 @@ Once you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it.
 
 1. Install Spark with your custom configuration, here called `options.json`:
 
-To authenticate to a Kerberos KDC, DC/OS Spark supports keytab files as well as ticket-granting tickets (TGTs).
+To authenticate to a Kerberos KDC, DC/OS Apache Spark supports keytab files as well as ticket-granting tickets (TGTs).
 
 Keytabs are valid infinitely, while tickets can expire. Especially for long-running streaming jobs, keytabs are recommended.
 
 ### Keytab Authentication
 
 Submit the job with the keytab:
 
-    $ dcos spark run --submit-args="--principal user@REALM --keytab <keytab-file-path>..."
+    dcos spark run --submit-args="--principal user@REALM --keytab <keytab-file-path>..."
 
 ### TGT Authentication
 
 Submit the job with the ticket:
 
-    $ dcos spark run --principal user@REALM --tgt <ticket-file-path>
+    dcos spark run --principal user@REALM --tgt <ticket-file-path>
 
 **Note:** These credentials are security-critical. We highly recommended configuring SSL encryption between the Spark components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.
````
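The Kerberos steps above hinge on base64-encoding local files (`krb5.conf`, `spark.keytab`) so their contents can be pasted into the JSON options file. A minimal sketch of that round trip, assuming GNU coreutils `base64` (the `-w 0` flag, which suppresses line wrapping, is a GNU extension; the file contents here are hypothetical stand-ins, not from the docs):

```shell
# Stand-in krb5.conf with hypothetical contents, so the sketch is runnable.
printf '[libdefaults]\ndefault_realm = EXAMPLE.COM\n' > krb5.conf

# Encode on a single line so the value can be pasted into a JSON string.
base64 -w 0 krb5.conf > krb5.conf.b64

# Sanity check: decoding must reproduce the original file byte for byte.
base64 -d krb5.conf.b64 | diff - krb5.conf && echo "round-trip OK"
```

The same round-trip check applies to `spark.keytab` before its encoded value goes into the history server configuration.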
## docs/index.md (+7 −5)
```diff
--- a/docs/index.md
+++ b/docs/index.md
@@ -7,20 +7,22 @@ feature_maturity: stable
 enterprise: 'no'
 ---
 
+Welcome to the documentation for DC/OS Apache Spark. For more information about new and changed features, see the [release notes](https://github.com/mesosphere/spark-build/releases/).
+
 Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see the [Apache Spark documentation][1].
 
-Apache DC/OS Spark consists of [Apache Spark with a few custom commits][17] along with [DC/OS-specific packaging][18].
+DC/OS Apache Spark consists of [Apache Spark with a few custom commits][17] along with [DC/OS-specific packaging][18].
 
-DC/OS Spark includes:
+DC/OS Apache Spark includes:
 
 * [Mesos Cluster Dispatcher][2]
 * [Spark History Server][3]
-* DC/OS Spark CLI
+* DC/OS Apache Spark CLI
 * Interactive Spark shell
 
 # Benefits
 
-* Utilization: DC/OS Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
+* Utilization: DC/OS Apache Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
```
## docs/install.md (+75 −43)
````diff
--- a/docs/install.md
+++ b/docs/install.md
@@ -5,98 +5,130 @@ feature_maturity: stable
 enterprise: 'no'
 ---
 
-Spark is available in the Universe and can be installed by using either the web interface or the DC/OS CLI.
+Spark is available in the Universe and can be installed by using either the GUI or the DC/OS CLI.
 
-## <a name="install-enterprise"></a>Prerequisites
+**Prerequisites:**
 
-- Depending on your security mode in Enterprise DC/OS, you may [need to provision a service account](https://docs.mesosphere.com/service-docs/spark/spark-auth/) before installing Spark. Only someone with `superuser` permission can create the service account.
-  - `strict` [security mode](https://docs.mesosphere.com/1.9/installing/custom/configuration-parameters/#security) requires a service account.
-  - In `permissive` security mode, a service account is optional.
-  - `disabled` security mode does not require a service account.
+- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.9/installing/).
+- Depending on your [security mode](https://docs.mesosphere.com/1.9/overview/security/security-modes/), Spark requires service authentication for access to DC/OS. For more information, see [Configuring DC/OS Access for Spark](https://docs.mesosphere.com/service-docs/spark/spark-auth/).
+
+  | Security mode | Service Account |
+  |---------------|-----------------|
+  | Disabled      | Not available   |
+  | Permissive    | Optional        |
+  | Strict        | Required        |
 
 # Default Installation
+
+To install the DC/OS Apache Spark service, run the following command on the DC/OS CLI. This installs the Spark DC/OS service, Spark CLI, dispatcher, and, optionally, the history server. See [Custom Installation][7] to install the history server.
 
-To start a basic Spark cluster, run the following command on the DC/OS CLI.
+```bash
+dcos package install spark
+```
 
-    $ dcos package install spark
+Go to the **Services** > **Deployments** tab of the DC/OS GUI to monitor the deployment. When it has finished deploying, visit Spark at `http://<dcos-url>/service/spark/`.
 
-This command installs the dispatcher, and, optionally, the history server. See [Custom Installation][7] to install the history server.
+You can also [install Spark via the DC/OS GUI](https://docs.mesosphere.com/1.9/usage/webinterface/#universe).
 
-Go to the **Services** > **Deployments** tab of the DC/OS web interface to monitor the deployment. Once it is
-complete, visit Spark at `http://<dcos-url>/service/spark/`.
-
-You can also [install Spark via the DC/OS web interface](https://docs.mesosphere.com/1.9/usage/webinterface/#universe).
+## Spark CLI
+
+You can install the Spark CLI with this command. This is useful if you already have a Spark cluster running, but need the Spark CLI.
 
-**Note:** If you install Spark via the web interface, run the following command from the DC/OS CLI to install the Spark CLI:
+**Important:** If you install Spark via the DC/OS GUI, you must install the Spark CLI as a separate step from the DC/OS CLI.
 
-    $ dcos package install spark --cli
+```bash
+dcos package install spark --cli
+```
 
 <a name="custom"></a>
 
 # Custom Installation
 
 You can customize the default configuration properties by creating a JSON options file and passing it to `dcos package install --options`. For example, to install the history server, create a file called `options.json`:
 
-    {
-      "history-server": {
-        "enabled": true
-      }
-    }
+```json
+{
+  "history-server": {
+    "enabled": true
+  }
+}
+```
 
-Then, install Spark with your custom configuration:
+Install Spark with the configuration specified in the `options.json` file:
 
-Run the following command to see all configuration options:
+**Tip:** Run this command to see all configuration options:
 
-    $ dcos package describe spark --config
+```bash
+dcos package describe spark --config
+```
 
 ## Customize Spark Distribution
 
-DC/OS Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built distributions, primarily used to select Hadoop versions. To use one of these distributions, first select your desired Spark distribution from [here](https://github.com/mesosphere/spark-build/blob/master/docs/spark-versions.md), then select the corresponding docker image from [here](https://hub.docker.com/r/mesosphere/spark/tags/), then use those values to set the following configuration variables:
+DC/OS Apache Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built distributions, primarily used to select Hadoop versions.
 
-    {
-      "service": {
-        "spark-dist-uri": "<spark-dist-uri>",
-        "docker-image": "<docker-image>"
-      }
-    }
+To use one of these distributions, select your Spark distribution from [here](https://github.com/mesosphere/spark-build/blob/master/docs/spark-versions.md), then select the corresponding Docker image from [here](https://hub.docker.com/r/mesosphere/spark/tags/), then use those values to set the following configuration variables:
+
+```json
+{
+  "service": {
+    "spark-dist-uri": "<spark-dist-uri>",
+    "docker-image": "<docker-image>"
+  }
+}
+```
````
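The `spark-dist-uri`/`docker-image` override above is plain JSON, so it can be sanity-checked before it is passed to the installer. A sketch, using hypothetical placeholder values and Python's stdlib `json` module for validation (neither the values nor the validation step are part of the docs):

```shell
# Write the override from the Customize Spark Distribution section above;
# the URI and image tag are hypothetical placeholders, not real releases.
cat > dist-options.json <<'EOF'
{
  "service": {
    "spark-dist-uri": "https://example.com/spark-dist.tgz",
    "docker-image": "mesosphere/spark:example-tag"
  }
}
EOF

# Validate that the file parses as JSON before passing it to
#   dcos package install spark --options=dist-options.json
python3 -m json.tool dist-options.json > /dev/null && echo "valid JSON"
```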
````diff
 # Minimal Installation
 
-For development purposes, you may wish to install Spark on a local DC/OS cluster. For this, you can use [dcos-vagrant][16].
+For development purposes, you can install Spark on a local DC/OS cluster. For this, you can use [dcos-vagrant][16].
 
 1. Install DC/OS Vagrant:
 
    Install a minimal DC/OS Vagrant according to the instructions [here][16].
 
 1. Install Spark:
 
-       $ dcos package install spark
+   ```bash
+   dcos package install spark
+   ```
 
 1. Run a simple Job:
 
-       $ dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar"
+   ```bash
+   dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.5.0.jar"
+   ```
 
-NOTE: A limited resource environment such as DC/OS Vagrant restricts some of the features available in DC/OS Spark. For example, unless you have enough resources to start up a 5-agent cluster, you will not be able to install DC/OS HDFS, and you thus won't be able to enable the history server.
+NOTE: A limited resource environment such as DC/OS Vagrant restricts some of the features available in DC/OS Apache Spark. For example, unless you have enough resources to start up a 5-agent cluster, you will not be able to install DC/OS HDFS, and you thus won't be able to enable the history server.
 
 Also, a limited resource environment can restrict how you size your executors, for example with `spark.executor.memory`.
 
 # Multiple Installations
 
-Installing multiple instances of the DC/OS Spark package provides basic multi-team support. Each dispatcher displays only the jobs submitted to it by a given team, and each team can be assigned different resources.
-
-To install mutiple instances of the DC/OS Spark package, set each `service.name` to a unique name (e.g.: "spark-dev") in your JSON configuration file during installation:
+Installing multiple instances of the DC/OS Apache Spark package provides basic multi-team support. Each dispatcher displays only the jobs submitted to it by a given team, and each team can be assigned different resources.
+
+To install multiple instances of the DC/OS Apache Spark package, set each `service.name` to a unique name (e.g.: `spark-dev`) in your JSON configuration file during installation. For example, create a JSON options file named `multiple.json`:
+
+```json
+{
+  "service": {
+    "name": "spark-dev"
+  }
+}
+```
````
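The multiple-installation pattern above amounts to one options file per team, each carrying a unique `service.name`. A sketch of generating such files (the team names and filenames are illustrative, not from the docs):

```shell
# One options file per team, each with a unique service.name,
# per the Multiple Installations section above.
for team in dev staging; do
  printf '{\n  "service": {\n    "name": "spark-%s"\n  }\n}\n' "$team" \
    > "spark-$team-options.json"
done

# Each file would then be installed with:
#   dcos package install spark --options=spark-<team>-options.json
cat spark-dev-options.json
```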
0 commit comments