Commit eff538d
SPARK-938 - Openstack Swift object storage support
Documentation on how to integrate Spark with Openstack Swift.
1 parent b6c37ef commit eff538d

File tree

4 files changed: +96 −68 lines changed


core/pom.xml

Lines changed: 5 additions & 1 deletion
@@ -35,7 +35,11 @@
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </dependency>
-   <dependency>
+   <dependency>
+     <groupId>org.apache.hadoop</groupId>
+     <artifactId>hadoop-openstack</artifactId>
+   </dependency>
+   <dependency>
      <groupId>net.java.dev.jets3t</groupId>
      <artifactId>jets3t</artifactId>
    </dependency>

docs/openstack-integration.md

Lines changed: 76 additions & 67 deletions
@@ -8,76 +8,85 @@ title: Accessing Openstack Swift storage from Spark

Spark's file interface allows it to process data in Openstack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift://<container.service_provider>/path`. You will also need to set your Swift security credentials, through `SparkContext.hadoopConfiguration`.

# Configuring Hadoop to use Openstack Swift
-Openstack Swift driver was merged in Hadoop verion 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)) Users that wish to use previous Hadoop versions will need to configure Swift driver manually.
-<h2>Hadoop 2.3.0 and above.</h2>
-An Openstack Swift driver was merged into Haddop 2.3.0 . Current Hadoop driver requieres Swift to use Keystone authentication. There are additional efforts to support temp auth for Hadoop [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
+The Openstack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use previous Hadoop versions will need to configure the Swift driver manually. The current Swift driver requires Swift to use the Keystone authentication method. There are recent efforts to also support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).
To configure Hadoop to work with Swift, one needs to modify core-sites.xml of Hadoop and set up the Swift FS.

-    <configuration>
-        <property>
-            <name>fs.swift.impl</name>
-            <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
-        </property>
-    </configuration>
+    <configuration>
+        <property>
+            <name>fs.swift.impl</name>
+            <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
+        </property>
+    </configuration>

+# Configuring Swift
+The proxy server of Swift should include the `list_endpoints` middleware. More information is available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

-<h2>Configuring Spark - stand alone cluster</h2>
-You need to configure the compute-classpath.sh and add Hadoop classpath for
+# Configuring Spark
+To use the Swift driver, Spark needs to be compiled with `hadoop-openstack-2.3.0.jar`, distributed with Hadoop 2.3.0.
+For Maven builds, Spark's main pom.xml should include

-
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/common/lib/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/hdfs/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/tools/lib/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/hdfs/lib/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/mapreduce/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/mapreduce/lib/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/yarn/*
-    CLASSPATH = <YOUR HADOOP PATH>/share/hadoop/yarn/lib/*
-
-Additional parameters has to be provided to the Hadoop from Spark. Swift driver of Hadoop uses those parameters to perform authentication in Keystone needed to access Swift.
-List of mandatory parameters is : `fs.swift.service.<PROVIDER>.auth.url`, `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
-`fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.public`.
-Create core-sites.xml and place it under /spark/conf directory. Configure core-sites.xml with general Keystone parameters, for example
-
-
-    <property>
-        <name>fs.swift.service.<PROVIDER>.auth.url</name>
-        <value>http://127.0.0.1:5000/v2.0/tokens</value>
-    </property>
-    <property>
-        <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
-        <value>endpoints</value>
-    </property>
-        <name>fs.swift.service.<PROVIDER>.http.port</name>
-        <value>8080</value>
-    </property>
-    <property>
-        <name>fs.swift.service.<PROVIDER>.region</name>
-        <value>RegionOne</value>
-    </property>
-    <property>
-        <name>fs.swift.service.<PROVIDER>.public</name>
-        <value>true</value>
-    </property>
-
-We left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, `fs.swift.service.<PROVIDER>.password`. The best way is to provide them to SparkContext in run time, which seems to be impossible yet.
-Another approach is to change Hadoop Swift FS driver to provide them via system environment variables. For now we provide them via core-sites.xml
-
-    <property>
-        <name>fs.swift.service.<PROVIDER>.tenant</name>
-        <value>test</value>
-    </property>
-    <property>
-        <name>fs.swift.service.<PROVIDER>.username</name>
-        <value>tester</value>
-    </property>
-    <property>
-        <name>fs.swift.service.<PROVIDER>.password</name>
-        <value>testing</value>
-    </property>
-    <property>
-<h3> Usage </h3>
-Assume you have a Swift container `logs` with an object `data.log`. You can use `swift://` scheme to access objects from Swift.
-
-    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
+    <swift.version>2.3.0</swift.version>
+
+
+    <dependency>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-openstack</artifactId>
+        <version>${swift.version}</version>
+    </dependency>
+
+In addition, the pom.xml of the `core` and `yarn` projects should include
+
+    <dependency>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-openstack</artifactId>
+    </dependency>
+
+
+Additional parameters have to be provided to the Swift driver. The driver will use those parameters to perform authentication in Keystone prior to accessing Swift. The list of mandatory parameters is: `fs.swift.service.<PROVIDER>.auth.url`, `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`,
+`fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`, `fs.swift.service.<PROVIDER>.public`, where `PROVIDER` is any name. `fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone authentication URL.
+
+Create core-sites.xml with the mandatory parameters and place it under the /spark/conf directory. For example:
+
+
+    <property>
+        <name>fs.swift.service.<PROVIDER>.auth.url</name>
+        <value>http://127.0.0.1:5000/v2.0/tokens</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
+        <value>endpoints</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.http.port</name>
+        <value>8080</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.region</name>
+        <value>RegionOne</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.public</name>
+        <value>true</value>
+    </property>
+
+We are left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, and `fs.swift.service.<PROVIDER>.password`. The best way would be to provide those parameters to the SparkContext at run time, which does not seem to be possible yet.
+Another approach is to adapt the Swift driver to obtain those values from system environment variables. For now, we provide them via core-sites.xml.
+Assume a tenant `test` with user `tester` was defined in Keystone; then core-sites.xml should include:
+
+    <property>
+        <name>fs.swift.service.<PROVIDER>.tenant</name>
+        <value>test</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.username</name>
+        <value>tester</value>
+    </property>
+    <property>
+        <name>fs.swift.service.<PROVIDER>.password</name>
+        <value>testing</value>
+    </property>
+# Usage
+Assume there exists a Swift container `logs` with an object `data.log`. To access `data.log` from Spark, the `swift://` scheme should be used.
+For example:
+
+    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")

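The introduction of the doc above expects credentials to flow through `SparkContext.hadoopConfiguration`, while the body notes that run-time provisioning does not seem to work with the current driver yet. As a minimal Scala sketch of that configuration route, assuming a hypothetical provider name `Generic` in place of `<PROVIDER>`, hypothetical `SWIFT_*` environment variable names (the fallback the text mentions), and the remaining mandatory keys already present in core-sites.xml:

    // A sketch only: "Generic" and the SWIFT_* variables are hypothetical,
    // not part of this commit or of the Hadoop Swift driver.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("SwiftExample"))

    // Mirror the core-sites.xml entries on the Hadoop configuration that
    // Spark hands to the Swift driver.
    sc.hadoopConfiguration.set("fs.swift.service.Generic.auth.url",
      "http://127.0.0.1:5000/v2.0/tokens")
    sc.hadoopConfiguration.set("fs.swift.service.Generic.tenant",
      sys.env.getOrElse("SWIFT_TENANT", "test"))
    sc.hadoopConfiguration.set("fs.swift.service.Generic.username",
      sys.env.getOrElse("SWIFT_USERNAME", "tester"))
    sc.hadoopConfiguration.set("fs.swift.service.Generic.password",
      sys.env.getOrElse("SWIFT_PASSWORD", "testing"))

    // Read the object and run a simple action.
    val sfdata = sc.textFile("swift://logs.Generic/data.log")
    println("lines: " + sfdata.count())

Whether the driver honors values set this way is subject to the caveat above about run-time provisioning; the core-sites.xml route remains the one the documentation relies on.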

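For the "Configuring Swift" section above, the commit only links to the middleware source. As a rough, assumption-laden sketch of what enabling `list_endpoints` on the Swift proxy can look like in a typical paste.deploy proxy-server.conf (the exact pipeline varies per deployment and is not part of this commit):

    [pipeline:main]
    # Abbreviated, hypothetical pipeline; list-endpoints sits before proxy-server.
    pipeline = catch_errors healthcheck cache list-endpoints proxy-server

    [filter:list-endpoints]
    use = egg:swift#list_endpoints
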
pom.xml

Lines changed: 11 additions & 0 deletions
@@ -132,6 +132,7 @@
    <codahale.metrics.version>3.0.0</codahale.metrics.version>
    <avro.version>1.7.6</avro.version>
    <jets3t.version>0.7.1</jets3t.version>
+   <swift.version>2.3.0</swift.version>

    <PermGen>64m</PermGen>
    <MaxPermGen>512m</MaxPermGen>
@@ -584,6 +585,11 @@
        </exclusion>
      </exclusions>
    </dependency>
+   <dependency>
+     <groupId>org.apache.hadoop</groupId>
+     <artifactId>hadoop-openstack</artifactId>
+     <version>${swift.version}</version>
+   </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-api</artifactId>
@@ -1024,6 +1030,11 @@
      <artifactId>hadoop-client</artifactId>
      <scope>provided</scope>
    </dependency>
+   <dependency>
+     <groupId>org.apache.hadoop</groupId>
+     <artifactId>hadoop-openstack</artifactId>
+     <scope>provided</scope>
+   </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-api</artifactId>

yarn/pom.xml

Lines changed: 4 additions & 0 deletions
@@ -55,6 +55,10 @@
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </dependency>
+   <dependency>
+     <groupId>org.apache.hadoop</groupId>
+     <artifactId>hadoop-openstack</artifactId>
+   </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.binary.version}</artifactId>
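Taken together, the three pom changes make `hadoop-openstack` available to the build. A rebuild along Spark's usual Maven lines should then pull the driver onto the classpath, for instance (a sketch, assuming the standard `hadoop.version` property; not part of this commit):

    mvn -Dhadoop.version=2.3.0 -DskipTests clean package
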
