Commit 39a9737

Spark integration with Openstack Swift
1 parent c977658 commit 39a9737

4 files changed: +215 additions, -107 deletions

core/pom.xml

Lines changed: 0 additions & 4 deletions
@@ -35,10 +35,6 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-openstack</artifactId>
-    </dependency>
     <dependency>
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>

docs/openstack-integration.md

Lines changed: 214 additions & 87 deletions
@@ -1,110 +1,237 @@
-yout: global
-title: Accessing Openstack Swift storage from Spark
+layout: global
+title: Accessing Openstack Swift from Spark
 ---
 
-# Accessing Openstack Swift storage from Spark
+# Accessing Openstack Swift from Spark
 
 Spark's file interface allows it to process data in Openstack Swift using the same URI
 formats that are supported for Hadoop. You can specify a path in Swift as input through a
-URI of the form `swift://<container.service_provider>/path`. You will also need to set your
-Swift security credentials, through `SparkContext.hadoopConfiguration`.
-
-#Configuring Hadoop to use Openstack Swift
-Openstack Swift driver was merged in Hadoop verion 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users that wish to use previous Hadoop versions will need to configure Swift driver manually. Current Swift driver
+URI of the form `swift://<container.PROVIDER>/path`. You will also need to set your
+Swift security credentials, through `core-site.xml` or via `SparkContext.hadoopConfiguration`.
+The Openstack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver
 requires Swift to use the Keystone authentication method. There are recent efforts to support
-also temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
-To configure Hadoop to work with Swift one need to modify core-sites.xml of Hadoop and
-setup Swift FS.
-
-    <configuration>
-      <property>
-        <name>fs.swift.impl</name>
-        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
-      </property>
-    </configuration>
+temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
 
-#Configuring Swift
+# Configuring Swift
 The proxy server of Swift should include the `list_endpoints` middleware. More information is
-available [here] (https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py)
-
-#Configuring Spark
-To use Swift driver, Spark need to be compiled with `hadoop-openstack-2.3.0.jar`
-distributted with Hadoop 2.3.0. For the Maven builds, Spark's main pom.xml should include
-
-    <swift.version>2.3.0</swift.version>
+available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
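A minimal sketch of the relevant `proxy-server.conf` sections, assuming a stock Swift proxy (the other pipeline entries are illustrative and vary by deployment):

    [pipeline:main]
    # list-endpoints must appear in the pipeline before proxy-server;
    # the surrounding filters here are only examples.
    pipeline = catch_errors healthcheck cache list-endpoints proxy-server

    [filter:list-endpoints]
    use = egg:swift#list_endpoints
    # list_endpoints_path = /endpoints/  (the default path the middleware serves)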

+# Compilation of Spark
+Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with Hadoop 2.3.0.
+For Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include
 
+    <dependencyManagement>
+      ...
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-openstack</artifactId>
-        <version>${swift.version}</version>
+        <version>2.3.0</version>
       </dependency>
+      ...
+    </dependencyManagement>
 
-in addition, pom.xml of the `core` and `yarn` projects should include
+In addition, both the `core` and `yarn` projects should add `hadoop-openstack` to the `dependencies` section of their `pom.xml`:
 
+    <dependencies>
+      ...
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-openstack</artifactId>
       </dependency>
+      ...
+    </dependencies>
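With the poms updated, Spark can be rebuilt against Hadoop 2.3.0. A sketch of the Maven invocation, assuming no extra build profiles (flags such as `-Pyarn` depend on your deployment):

    mvn -Dhadoop.version=2.3.0 -DskipTests clean package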
+# Configuration of Spark
+Create `core-site.xml` and place it inside Spark's `conf` directory. There are two main categories of parameters that should be
+configured: the declaration of the Swift driver, and the parameters required by Keystone.
+
+Configuring Hadoop to use the Swift file system is achieved via
+
+<table class="table">
+<tr><th>Property Name</th><th>Value</th></tr>
+<tr>
+  <td>fs.swift.impl</td>
+  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
+</tr>
+</table>
+
+Additional parameters required by Keystone should be provided to the Swift driver. These
+parameters are used to perform authentication in Keystone prior to accessing Swift. The following table
+contains the list of Keystone-related parameters. `PROVIDER` can be any name.
+
+<table class="table">
+<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.auth.url</td>
+  <td>Keystone Authentication URL</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td>
+  <td>Keystone endpoints prefix</td>
+  <td>Optional</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.tenant</td>
+  <td>Tenant</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.username</td>
+  <td>Username</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.password</td>
+  <td>Password</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.http.port</td>
+  <td>HTTP port</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.region</td>
+  <td>Keystone region</td>
+  <td>Mandatory</td>
+</tr>
+<tr>
+  <td>fs.swift.service.PROVIDER.public</td>
+  <td>Indicates whether all URLs are public</td>
+  <td>Mandatory</td>
+</tr>
+</table>
+
+For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`.
+Then `core-site.xml` should include:
 
+    <configuration>
+      <property>
+        <name>fs.swift.impl</name>
+        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.url</name>
+        <value>http://127.0.0.1:5000/v2.0/tokens</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
+        <value>endpoints</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.http.port</name>
+        <value>8080</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.region</name>
+        <value>RegionOne</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.public</name>
+        <value>true</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.tenant</name>
+        <value>test</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.username</name>
+        <value>tester</value>
+      </property>
+      <property>
+        <name>fs.swift.service.SparkTest.password</name>
+        <value>testing</value>
+      </property>
+    </configuration>
 
-Additional parameters has to be provided to the Swift driver. Swift driver will use those
-parameters to perform authentication in Keystone prior accessing Swift. List of mandatory
-parameters is : `fs.swift.service.<PROVIDER>.auth.url`,
-`fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`,
-`fs.swift.service.<PROVIDER>.username`,
-`fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`,
-`fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.public`, where
-`PROVIDER` is any name. `fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone
-authentication URL.
-
-Create core-sites.xml with the mandatory parameters and place it under /spark/conf
-directory. For example:
-
-
-    <property>
-      <name>fs.swift.service.<PROVIDER>.auth.url</name>
-      <value>http://127.0.0.1:5000/v2.0/tokens</value>
-    </property>
-    <property>
-      <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
-      <value>endpoints</value>
-    </property>
-      <name>fs.swift.service.<PROVIDER>.http.port</name>
-      <value>8080</value>
-    </property>
-    <property>
-      <name>fs.swift.service.<PROVIDER>.region</name>
-      <value>RegionOne</value>
-    </property>
-    <property>
-      <name>fs.swift.service.<PROVIDER>.public</name>
-      <value>true</value>
-    </property>
-
-We left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
-`fs.swift.service.<PROVIDER>.password`. The best way to provide those parameters to
-SparkContext in run time, which seems to be impossible yet.
-Another approach is to adapt Swift driver to obtain those values from system environment
-variables. For now we provide them via core-sites.xml.
-Assume a tenant `test` with user `tester` was defined in Keystone, then the core-sites.xml
-shoud include:
-
-    <property>
-      <name>fs.swift.service.<PROVIDER>.tenant</name>
-      <value>test</value>
-    </property>
-    <property>
-      <name>fs.swift.service.<PROVIDER>.username</name>
-      <value>tester</value>
-    </property>
-    <property>
-      <name>fs.swift.service.<PROVIDER>.password</name>
-      <value>testing</value>
-    </property>
-# Usage
-Assume there exists Swift container `logs` with an object `data.log`. To access `data.log`
-from Spark the `swift://` scheme should be used. For example:
-
-    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
+Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and
+`fs.swift.service.PROVIDER.password` contain sensitive information, and keeping them in `core-site.xml` is not always a good approach.
+We suggest keeping those parameters in `core-site.xml` for testing purposes when running Spark via `spark-shell`. For job submissions they should be provided via `sparkContext.hadoopConfiguration`.
+
+# Usage examples
+Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone contains tenant `test` and user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`. To access `data.log`
+from Spark, the `swift://` scheme should be used.
+
+## Running Spark via spark-shell
+Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, and
+`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via the `swift://` scheme.
+
+    val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
+    sfdata.count()
+
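Alternatively, the Keystone parameters can be set from the shell itself instead of `core-site.xml`. A minimal sketch, assuming the same `SparkTest` provider and the test credentials above:

    // Set the Keystone credentials at runtime, then read from Swift as before.
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")

    val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
    sfdata.count()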
+## Job submission via spark-submit
+In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, and
+`fs.swift.service.SparkTest.password`, since they are set on the Hadoop configuration at runtime. Example of Java usage:
+
+    /* SimpleApp.java */
+    import org.apache.spark.api.java.*;
+    import org.apache.spark.SparkConf;
+
+    public class SimpleApp {
+      public static void main(String[] args) {
+        String logFile = "swift://logs.SparkTest/data.log";
+        SparkConf conf = new SparkConf().setAppName("Simple Application");
+        JavaSparkContext sc = new JavaSparkContext(conf);
+        // Provide the Keystone credentials at runtime instead of in core-site.xml.
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
+        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
+
+        JavaRDD<String> logData = sc.textFile(logFile).cache();
+
+        long num = logData.count();
+
+        System.out.println("Total number of lines: " + num);
+      }
+    }
+
+The directory structure is
+
+    find .
+    ./src
+    ./src/main
+    ./src/main/java
+    ./src/main/java/SimpleApp.java
+
+The Maven `pom.xml` is
+
+    <project>
+      <groupId>edu.berkeley</groupId>
+      <artifactId>simple-project</artifactId>
+      <modelVersion>4.0.0</modelVersion>
+      <name>Simple Project</name>
+      <packaging>jar</packaging>
+      <version>1.0</version>
+      <repositories>
+        <repository>
+          <id>Akka repository</id>
+          <url>http://repo.akka.io/releases</url>
+        </repository>
+      </repositories>
+      <build>
+        <plugins>
+          <plugin>
+            <groupId>org.apache.maven.plugins</groupId>
+            <artifactId>maven-compiler-plugin</artifactId>
+            <version>2.3</version>
+            <configuration>
+              <source>1.6</source>
+              <target>1.6</target>
+            </configuration>
+          </plugin>
+        </plugins>
+      </build>
+      <dependencies>
+        <dependency> <!-- Spark dependency -->
+          <groupId>org.apache.spark</groupId>
+          <artifactId>spark-core_2.10</artifactId>
+          <version>1.0.0</version>
+        </dependency>
+      </dependencies>
+
+    </project>
+
+Compile and execute:
+
+    mvn package
+    $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar
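If the Swift configuration is correct, the job prints a single line of the form `Total number of lines: N`, where `N` is the number of lines in `data.log`.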

pom.xml

Lines changed: 1 addition & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -132,8 +132,7 @@
132132
<codahale.metrics.version>3.0.0</codahale.metrics.version>
133133
<avro.version>1.7.6</avro.version>
134134
<jets3t.version>0.7.1</jets3t.version>
135-
<swift.version>2.3.0</swift.version>
136-
135+
137136
<PermGen>64m</PermGen>
138137
<MaxPermGen>512m</MaxPermGen>
139138
</properties>
@@ -585,11 +584,6 @@
585584
</exclusion>
586585
</exclusions>
587586
</dependency>
588-
<dependency>
589-
<groupId>org.apache.hadoop</groupId>
590-
<artifactId>hadoop-openstack</artifactId>
591-
<version>${swift.version}</version>
592-
</dependency>
593587
<dependency>
594588
<groupId>org.apache.hadoop</groupId>
595589
<artifactId>hadoop-yarn-api</artifactId>
@@ -1030,11 +1024,6 @@
10301024
<artifactId>hadoop-client</artifactId>
10311025
<scope>provided</scope>
10321026
</dependency>
1033-
<dependency>
1034-
<groupId>org.apache.hadoop</groupId>
1035-
<artifactId>hadoop-openstack</artifactId>
1036-
<scope>provided</scope>
1037-
</dependency>
10381027
<dependency>
10391028
<groupId>org.apache.hadoop</groupId>
10401029
<artifactId>hadoop-yarn-api</artifactId>

yarn/pom.xml

Lines changed: 0 additions & 4 deletions
@@ -55,10 +55,6 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
     </dependency>
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-openstack</artifactId>
-    </dependency>
     <dependency>
       <groupId>org.scalatest</groupId>
       <artifactId>scalatest_${scala.binary.version}</artifactId>
