
Commit e9c3761

Merge pull request #1010 from gilv/master
SPARK-938 - Openstack Swift object storage support
2 parents 9422c4e + 9233fef

File tree: 2 files changed (+270, −1 lines)


core/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@
       </exclusion>
     </exclusions>
   </dependency>
-  <dependency>
+  <dependency>
     <groupId>net.java.dev.jets3t</groupId>
     <artifactId>jets3t</artifactId>
   </dependency>

(The changed <dependency> line apparently differs only in whitespace.)

docs/openstack-integration.md

Lines changed: 269 additions & 0 deletions
---
layout: global
title: OpenStack Integration
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Accessing OpenStack Swift from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your Swift security credentials, through <code>core-site.xml</code> or via <code>SparkContext.hadoopConfiguration</code>. The OpenStack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver requires Swift to use the Keystone authentication method. There are recent efforts to support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).

# Configuring Swift

The Swift proxy server should include the <code>list_endpoints</code> middleware. More information is available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

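As a rough sketch, enabling the middleware in Swift's <code>proxy-server.conf</code> might look like the following; the pipeline shown is hypothetical and will differ per deployment:

{% highlight ini %}
# Hypothetical proxy-server.conf excerpt. The pipeline contents vary by
# deployment; the important part is that list_endpoints appears in it.
[pipeline:main]
pipeline = healthcheck cache authtoken keystoneauth list_endpoints proxy-server

[filter:list_endpoints]
use = egg:swift#list_endpoints
{% endhighlight %}
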
# Dependencies

Spark should be compiled with <code>hadoop-openstack-2.3.0.jar</code>, which is distributed with Hadoop 2.3.0. For Maven builds, the <code>dependencyManagement</code> section of Spark's main <code>pom.xml</code> should include:

{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>2.3.0</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}

In addition, both the <code>core</code> and <code>yarn</code> projects should add <code>hadoop-openstack</code> to the <code>dependencies</code> section of their <code>pom.xml</code>:

{% highlight xml %}
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
  </dependency>
  ...
</dependencies>
{% endhighlight %}
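
For sbt-based application builds, a roughly equivalent declaration (a sketch, assuming the same Hadoop 2.3.0 artifact) is:

{% highlight scala %}
// Hypothetical build.sbt line; the version should match the Hadoop build in use.
libraryDependencies += "org.apache.hadoop" % "hadoop-openstack" % "2.3.0"
{% endhighlight %}
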
# Configuration Parameters

Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory. There are two main categories of parameters that should be configured: the declaration of the Swift driver, and the parameters required by Keystone.

Configuration of Hadoop to use the Swift file system is achieved via:

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>

Additional parameters are required by Keystone and should be provided to the Swift driver. These parameters are used to authenticate with Keystone in order to access Swift. The following table lists the Keystone parameters. <code>PROVIDER</code> can be any name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume <code>PROVIDER=SparkTest</code> and that Keystone contains user <code>tester</code> with password <code>testing</code>, defined for tenant <code>test</code>. Then <code>core-site.xml</code> should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}

Notice that <code>fs.swift.service.PROVIDER.tenant</code>, <code>fs.swift.service.PROVIDER.username</code>, and <code>fs.swift.service.PROVIDER.password</code> contain sensitive information, and keeping them in <code>core-site.xml</code> is not always a good approach. We suggest keeping these parameters in <code>core-site.xml</code> only for testing purposes when running Spark via <code>spark-shell</code>. For job submissions they should be provided via <code>SparkContext.hadoopConfiguration</code>.

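In the Scala API, a minimal sketch of this (mirroring the Java example in the Sample Application section below) is:

{% highlight scala %}
// Set the sensitive Keystone parameters on the driver instead of
// storing them in core-site.xml.
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
{% endhighlight %}
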
# Usage examples

Assume Keystone's authentication URL is <code>http://127.0.0.1:5000/v2.0/tokens</code>, and that Keystone contains tenant <code>test</code> and user <code>tester</code> with password <code>testing</code>. In our example we define <code>PROVIDER=SparkTest</code>. Assume that Swift contains a container <code>logs</code> with an object <code>data.log</code>. To access <code>data.log</code> from Spark, the <code>swift://</code> scheme should be used.

## Running Spark via spark-shell

Make sure that <code>core-site.xml</code> contains <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>, and <code>fs.swift.service.SparkTest.password</code>. Run Spark via <code>spark-shell</code> and access Swift via the <code>swift://</code> scheme:

{% highlight scala %}
val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
sfdata.count()
{% endhighlight %}

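Writing results back to Swift uses the same scheme; a minimal sketch, assuming the credentials above also grant write access to the <code>logs</code> container:

{% highlight scala %}
// Hypothetical follow-up: keep only the error lines and write them
// back to a new path in the same container.
val errors = sfdata.filter(line => line.contains("ERROR"))
errors.saveAsTextFile("swift://logs.SparkTest/errors")
{% endhighlight %}
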
## Sample Application

In this case <code>core-site.xml</code> need not contain <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>, or <code>fs.swift.service.SparkTest.password</code>, since they are set programmatically. Example of Java usage:

{% highlight java %}
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "swift://logs.SparkTest/data.log";
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Provide the sensitive Keystone parameters programmatically rather
    // than keeping them in core-site.xml. The provider name ("SparkTest")
    // must match the one used in the swift:// URI.
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");

    JavaRDD<String> logData = sc.textFile(logFile).cache();
    long num = logData.count();

    System.out.println("Total number of lines: " + num);
  }
}
{% endhighlight %}

The directory structure is:

{% highlight bash %}
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
{% endhighlight %}

The Maven <code>pom.xml</code> should contain:

{% highlight xml %}
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>
{% endhighlight %}

Compile and execute:

{% highlight bash %}
mvn package
$SPARK_HOME/bin/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
{% endhighlight %}
