 ---
 layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
 ---
 
+* This will become a table of contents (this text will be scraped).
+{:toc}
 
-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient. The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms. Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework. We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures. Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance. We provide powerful new operations to simplify graph
-construction and transformation. Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
-<p align="center">
-  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+<p style="text-align: center;">
+  <img src="img/graphx_logo.png"
+       title="GraphX Logo"
+       alt="GraphX"
+       width="65%" />
 </p>
 
-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
+
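+To make the property graph concrete, the following is a minimal sketch of
+building one from two RDDs, one of vertex properties and one of edge
+properties. The `Graph`, `Edge`, and `VertexId` names and the
+`Graph(vertices, edges)` constructor are assumptions about the alpha API;
+consult the API docs for the exact names.
+
+{% highlight scala %}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+
+// Assuming an existing SparkContext `sc`.
+// Vertex property: a user name; edge property: a relationship label.
+val users: RDD[(VertexId, String)] =
+  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
+val relationships: RDD[Edge[String]] =
+  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
+
+// A directed graph with a String property on each vertex and edge.
+val graph: Graph[String, String] = Graph(users, relationships)
+{% endhighlight %}
+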
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data have driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)). By restricting the types of computation that
+can be expressed and introducing new techniques to partition and distribute
+graphs, these systems can efficiently execute sophisticated graph algorithms
+orders of magnitude faster than more general *data-parallel* systems.
+
+<p style="text-align: center;">
+  <img src="img/data_parallel_vs_graph_parallel.png"
+       title="Data-Parallel vs. Graph-Parallel"
+       alt="Data-Parallel vs. Graph-Parallel"
+       width="50%" />
+</p>
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical
+graph-analytics pipeline: constructing the graph, modifying its structure, or
+expressing computation that spans multiple graphs. As a consequence, existing
+graph analytics pipelines compose graph-parallel and data-parallel systems,
+leading to extensive data movement and duplication and a complicated
+programming model.
+
+<p style="text-align: center;">
+  <img src="img/graph_analytics_pipeline.png"
+       title="Graph Analytics Pipeline"
+       alt="Graph Analytics Pipeline"
+       width="50%" />
+</p>
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view. In preliminary experiments we find that the GraphX
+system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing entire analytics pipelines.
+
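+As an illustration of these two views, the sketch below operates on the same
+data first as collections and then as a graph. It reuses the hypothetical
+`graph` from the sketch above; the `inDegrees` member is an assumption about
+the API.
+
+{% highlight scala %}
+// Collection view: vertices and edges are ordinary RDDs, so the usual
+// data-parallel operators apply without copying the underlying data.
+val numUsers = graph.vertices.count()
+val numFollows = graph.edges.filter(e => e.attr == "follows").count()
+
+// Graph view: a structural operation over the very same data.
+val inDegrees = graph.inDegrees // RDD of (VertexId, Int) pairs
+{% endhighlight %}
+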
+<p style="text-align: center;">
+  <img src="img/graphx_performance_comparison.png"
+       title="GraphX Performance Comparison"
+       alt="GraphX Performance Comparison"
+       width="50%" />
+</p>
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by
+exposing a richer property graph API and a more streamlined version of the
+Pregel abstraction, and through system optimizations that improve performance
+and reduce memory overhead. While we plan to eventually deprecate Bagel, we
+will continue to support the API and the
+[Bagel programming guide](bagel-programming-guide.html). However, we encourage
+Bagel users to explore the new GraphX API and comment on issues that may
+complicate the transition from Bagel.
+
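+For readers migrating from Bagel, the sketch below shows roughly what the
+streamlined Pregel abstraction looks like in GraphX, using single-source
+shortest paths as the running example. The
+`Pregel(graph, initialMsg)(vprog, sendMsg, mergeMsg)` signature and the
+`weightedGraph` input (a graph with `Double` edge weights) are illustrative
+assumptions, not the final API.
+
+{% highlight scala %}
+val sourceId: VertexId = 1L // hypothetical source vertex
+// Start every vertex at distance 0.0 (source) or infinity (all others).
+val initialGraph = weightedGraph.mapVertices((id, _) =>
+  if (id == sourceId) 0.0 else Double.PositiveInfinity)
+
+val shortestPaths = Pregel(initialGraph, Double.PositiveInfinity)(
+  // Vertex program: keep the smaller of the current and incoming distance.
+  (id, dist, newDist) => math.min(dist, newDist),
+  // Send messages: relax each edge that improves its destination.
+  triplet =>
+    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
+      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
+    else Iterator.empty,
+  // Merge messages: keep the minimum candidate distance.
+  (a, b) => math.min(a, b))
+{% endhighlight %}
+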
+# The Property Graph
+<a name="property_graph"></a>
+
+<p style="text-align: center;">
+  <img src="img/edge_cut_vs_vertex_cut.png"
+       title="Edge Cut vs. Vertex Cut"
+       alt="Edge Cut vs. Vertex Cut"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/property_graph.png"
+       title="The Property Graph"
+       alt="The Property Graph"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/vertex_routing_edge_tables.png"
+       title="RDD Graph Representation"
+       alt="RDD Graph Representation"
+       width="50%" />
+</p>
+
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
+
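+The `mapReduceTriplets` operator takes a *map* function that runs over each
+edge triplet (source vertex, edge, destination vertex) and emits messages
+addressed to vertices, and a commutative, associative *reduce* function that
+combines messages arriving at the same vertex. The signature sketched below
+is an assumption based on that description; it computes in-degrees, which is
+essentially how the `inDegrees` member used earlier can be implemented.
+
+{% highlight scala %}
+// Every triplet sends the message 1 to its destination vertex, and
+// messages arriving at the same vertex are summed.
+val inDeg = graph.mapReduceTriplets[Int](
+  triplet => Iterator((triplet.dstId, 1)), // map: one message per edge
+  (a, b) => a + b)                         // reduce: combine messages
+{% endhighlight %}
+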
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
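+The intent is that common algorithms run directly on a graph, along these
+lines (the `pageRank` and `connectedComponents` method names are assumptions
+about the evolving alpha API):
+
+{% highlight scala %}
+// Run PageRank until the per-vertex change falls below the 0.001
+// tolerance; the result is a graph whose vertex property is the rank.
+val ranks = graph.pageRank(0.001).vertices
+
+// Label each vertex with the smallest vertex id in its component.
+val components = graph.connectedComponents().vertices
+{% endhighlight %}
+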
+# Graph Builders
+<a name="graph_builders"></a>
+
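+A graph builder constructs a graph directly from external data in a standard
+on-disk format, for example (the `GraphLoader.edgeListFile` name and the file
+path are illustrative):
+
+{% highlight scala %}
+// Build a graph from a file of whitespace-separated "srcId dstId" pairs,
+// one edge per line; vertex and edge properties get default values.
+val webGraph = GraphLoader.edgeListFile(sc, "hdfs://data/web-graph.txt")
+{% endhighlight %}
+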
+<p style="text-align: center;">
+  <img src="img/tables_and_graphs.png"
+       title="Tables and Graphs"
+       alt="Tables and Graphs"
+       width="50%" />
+</p>
+
+# Examples
 
 Suppose I want to build a graph from some text files, restrict the graph
 to important relationships and users, run page-rank on the sub-graph, and
 then finally return attributes associated with the top users. I can do
 all of this in just a few lines with GraphX:
 
-```scala
+{% highlight scala %}
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")
 
@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
 
 println(userInfoWithPageRank.top(5))
 
-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>. This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
-    sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
-    ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory. To run one of them, use `./run-example <class>
-<params>`. For example:
-
-    ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs. You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
 
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.