
Commit eb22295

Merge pull request #1010 from gilv/master
SPARK-938 - Openstack Swift object storage support
2 parents 13f8cfd + 39a9737 commit eb22295

File tree

3 files changed: +239 -2 lines changed


core/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
     </dependency>
-    <dependency>
+    <dependency>
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>
     </dependency>

docs/openstack-integration.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
---
layout: global
title: Accessing Openstack Swift from Spark
---

# Accessing Openstack Swift from Spark

Spark's file interface allows it to process data in Openstack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://<container>.<PROVIDER>/path`. You will also need to set your
Swift security credentials, through `core-sites.xml` or via `SparkContext.hadoopConfiguration`.
The Openstack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver
requires Swift to use the Keystone authentication method. There are recent efforts to support
temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).

# Configuring Swift
The Swift proxy server should include the `list_endpoints` middleware. More information is
available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

# Compilation of Spark
Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with Hadoop 2.3.0.
For Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include

    <dependencyManagement>
      ...
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-openstack</artifactId>
        <version>2.3.0</version>
      </dependency>
      ...
    </dependencyManagement>

In addition, both the `core` and `yarn` projects should add `hadoop-openstack` to the `dependencies` section of their `pom.xml`:

    <dependencies>
      ...
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-openstack</artifactId>
      </dependency>
      ...
    </dependencies>

# Configuration of Spark
Create `core-sites.xml` and place it inside the `/spark/conf` directory. There are two main categories of parameters that should be
configured: the declaration of the Swift driver and the parameters required by Keystone.

Configuring Hadoop to use the Swift file system is achieved via

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>

Additional parameters required by Keystone should be provided to the Swift driver. These
parameters are used to authenticate with Keystone in order to access Swift. The following table
contains a list of the Keystone parameters. `PROVIDER` can be any name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.url</td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.tenant</td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.username</td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.password</td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.http.port</td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.region</td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.public</td>
  <td>Indicates if all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`.
Then `core-sites.xml` should include:

    <configuration>
      <property>
        <name>fs.swift.impl</name>
        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.auth.url</name>
        <value>http://127.0.0.1:5000/v2.0/tokens</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
        <value>endpoints</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.http.port</name>
        <value>8080</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.region</name>
        <value>RegionOne</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.public</name>
        <value>true</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.tenant</name>
        <value>test</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.username</name>
        <value>tester</value>
      </property>
      <property>
        <name>fs.swift.service.SparkTest.password</name>
        <value>testing</value>
      </property>
    </configuration>

Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username` and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in `core-sites.xml` is not always a good approach.
We suggest keeping those parameters in `core-sites.xml` only for testing purposes, when running Spark via `spark-shell`. For job submissions they should be provided via `SparkContext.hadoopConfiguration`.

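For example, a minimal sketch (Scala) of providing the sensitive parameters programmatically rather than in `core-sites.xml`, reusing the example credentials from this page:

    // Provide the sensitive Keystone parameters for the SparkTest provider
    // through the Hadoop configuration; the non-sensitive properties can still
    // come from core-sites.xml.
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
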
# Usage examples
Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and that Keystone contains tenant `test` and user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`. To access `data.log`
from Spark, the `swift://` scheme should be used.

## Running Spark via spark-shell
Make sure that `core-sites.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username` and
`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via the `swift://` scheme.

    val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
    sfdata.count()

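Writing results back to Swift uses the same scheme. A minimal sketch, assuming the `logs` container is writable; the output path `swift://logs.SparkTest/errors` and the `"error"` filter are purely illustrative:

    // Keep only lines mentioning "error" and store them back in the same
    // Swift container; the output path must not already exist.
    val errors = sfdata.filter(line => line.contains("error"))
    errors.saveAsTextFile("swift://logs.SparkTest/errors")
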
## Job submission via spark-submit
In this case `core-sites.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username` and
`fs.swift.service.SparkTest.password`. Example of Java usage:

    /* SimpleApp.java */
    import org.apache.spark.api.java.*;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function;

    public class SimpleApp {
      public static void main(String[] args) {
        String logFile = "swift://logs.SparkTest/data.log";
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Provide the sensitive Keystone credentials for the SparkTest provider
        // programmatically instead of keeping them in core-sites.xml.
        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
        sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");

        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long num = logData.count();

        System.out.println("Total number of lines: " + num);
      }
    }

The directory structure is

    find .
    ./src
    ./src/main
    ./src/main/java
    ./src/main/java/SimpleApp.java

The Maven `pom.xml` is

    <project>
      <groupId>edu.berkeley</groupId>
      <artifactId>simple-project</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <repositories>
        <repository>
          <id>Akka repository</id>
          <url>http://repo.akka.io/releases</url>
        </repository>
      </repositories>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3</version>
            <configuration>
              <source>1.6</source>
              <target>1.6</target>
            </configuration>
          </plugin>
        </plugins>
      </build>
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>1.0.0</version>
        </dependency>
      </dependencies>
    </project>

Compile and execute:

    mvn package
    SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -132,7 +132,7 @@
     <codahale.metrics.version>3.0.0</codahale.metrics.version>
     <avro.version>1.7.6</avro.version>
     <jets3t.version>0.7.1</jets3t.version>
-
+
     <PermGen>64m</PermGen>
     <MaxPermGen>512m</MaxPermGen>
   </properties>

0 commit comments