
Commit 23d2995
Merge pull request #1 from jegonzal/graphx: ProgrammingGuide
2 parents: 729277e + b1eeefb

12 files changed: +134 −159 lines

docs/_layouts/global.html
Lines changed: 5 additions & 5 deletions

@@ -21,7 +21,7 @@
     <link rel="stylesheet" href="css/main.css">

     <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
-
+
     <link rel="stylesheet" href="css/pygments-default.css">

     <!-- Google analytics script -->
@@ -67,10 +67,10 @@
       <li class="divider"></li>
       <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
       <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
-      <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+      <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
     </ul>
   </li>
-
+
   <li class="dropdown">
     <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
     <ul class="dropdown-menu">
@@ -79,7 +79,7 @@
       <li class="divider"></li>
       <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
       <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
-      <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
+      <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
     </ul>
   </li>
@@ -161,7 +161,7 @@ <h2>Heading</h2>
     <script src="js/vendor/jquery-1.8.0.min.js"></script>
     <script src="js/vendor/bootstrap.min.js"></script>
     <script src="js/main.js"></script>
-
+
     <!-- A script to fix internal hash links because we have an overlapping top bar.
          Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 -->
    <script>

docs/graphx-programming-guide.md
Lines changed: 126 additions & 151 deletions

@@ -1,63 +1,141 @@
 ---
 layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
 ---
 
+* This will become a table of contents (this text will be scraped).
+{:toc}
 
-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient. The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms. Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework. We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures. Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance. We provide powerful new operations to simplify graph
-construction and transformation. Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
-<p align="center">
-  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+<p style="text-align: center;">
+  <img src="img/graphx_logo.png"
+       title="GraphX Logo"
+       alt="GraphX"
+       width="65%" />
 </p>
 
-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
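The property graph introduced in the overview above pairs a vertex table with an edge table. As an editor's sketch of the idea on plain Scala collections (the names `Edge`, `Graph`, and `VertexId` here are illustrative stand-ins; the real API lives in `org.apache.spark.graphx` and operates on distributed RDDs):

```scala
// A property graph modeled on plain Scala collections: one table of
// vertices keyed by id, one table of edges. Illustrative types only;
// GraphX's real Graph runs these tables as RDDs.
object PropertyGraphSketch {
  type VertexId = Long

  // An edge carries its endpoint ids plus an arbitrary property.
  case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  // The property graph: vertex properties plus edge properties.
  case class Graph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]])

  def main(args: Array[String]): Unit = {
    // Users carry (name, age); edges carry a relationship label.
    val graph = Graph(
      vertices = Map(1L -> ("alice", 28), 2L -> ("bob", 27), 3L -> ("carol", 65)),
      edges = Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows"))
    )
    // The "collection view": the vertex table filters like any Map.
    val over30 = graph.vertices.filter { case (_, (_, age)) => age > 30 }
    println(over30.keys.toList.sorted) // List(3)
  }
}
```

Keeping vertices and edges as two plain tables is what lets the same data be viewed either as collections or as a graph without duplication.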
+
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data has driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
+expressed and introducing new techniques to partition and distribute graphs,
+these systems can efficiently execute sophisticated graph algorithms orders of
+magnitude faster than more general *data-parallel* systems.
+
+<p style="text-align: center;">
+  <img src="img/data_parallel_vs_graph_parallel.png"
+       title="Data-Parallel vs. Graph-Parallel"
+       alt="Data-Parallel vs. Graph-Parallel"
+       width="50%" />
+</p>
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical
+graph-analytics pipeline: constructing the graph, modifying its structure, or
+expressing computation that spans multiple graphs. As a consequence, existing
+graph analytics pipelines compose graph-parallel and data-parallel systems,
+leading to extensive data movement and duplication and a complicated
+programming model.
+
+<p style="text-align: center;">
+  <img src="img/graph_analytics_pipeline.png"
+       title="Graph Analytics Pipeline"
+       alt="Graph Analytics Pipeline"
+       width="50%" />
+</p>
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication, and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view. In preliminary experiments we find that the GraphX
+system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing entire analytics pipelines.
+
+<p style="text-align: center;">
+  <img src="img/graphx_performance_comparison.png"
+       title="GraphX Performance Comparison"
+       alt="GraphX Performance Comparison"
+       width="50%" />
+</p>
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by exposing
+a richer property graph API, a more streamlined version of the Pregel abstraction,
+and system optimizations to improve performance and reduce memory
+overhead. While we plan to eventually deprecate Bagel, we will continue to
+support the API and the [Bagel programming guide](bagel-programming-guide.html).
+However, we encourage Bagel users to explore the new GraphX API and comment on
+issues that may complicate the transition from Bagel.
+
+# The Property Graph
+<a name="property_graph"></a>
+
+<p style="text-align: center;">
+  <img src="img/edge_cut_vs_vertex_cut.png"
+       title="Edge Cut vs. Vertex Cut"
+       alt="Edge Cut vs. Vertex Cut"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/property_graph.png"
+       title="The Property Graph"
+       alt="The Property Graph"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/vertex_routing_edge_tables.png"
+       title="RDD Graph Representation"
+       alt="RDD Graph Representation"
+       width="50%" />
+</p>
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
+# Graph Builders
+<a name="graph_builders"></a>
+
+<p style="text-align: center;">
+  <img src="img/tables_and_graphs.png"
+       title="Tables and Graphs"
+       alt="Tables and Graphs"
+       width="50%" />
+</p>
+
+# Examples
 
 Suppose I want to build a graph from some text files, restrict the graph
 to important relationships and users, run page-rank on the sub-graph, and
 then finally return attributes associated with the top users. I can do
 all of this in just a few lines with GraphX:
 
-```scala
+{% highlight scala %}
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")
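The `mapReduceTriplets` operator that the new guide stubs out above follows a map-then-reduce-per-vertex pattern: a map function emits a message along each edge, and a reduce function combines the messages arriving at each vertex. A minimal sketch on local collections, assuming a simplified signature (the real GraphX operator works over edge *triplets* that also carry the source and destination vertex properties, and runs over RDDs):

```scala
// Illustration of the mapReduceTriplets pattern on local collections.
// Simplified, hypothetical signature; not the GraphX API itself.
object MapReduceTripletsSketch {
  case class Edge(srcId: Long, dstId: Long)

  // map: emit one (vertexId, message) pair per edge;
  // reduce: combine all messages addressed to the same vertex.
  def mapReduceTriplets[A](edges: Seq[Edge],
                           map: Edge => (Long, A),
                           reduce: (A, A) => A): Map[Long, A] =
    edges.map(map).groupBy(_._1).map { case (vid, msgs) =>
      vid -> msgs.map(_._2).reduce(reduce)
    }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1, 2), Edge(3, 2), Edge(2, 1))
    // In-degree: every edge sends the message 1 to its destination.
    val inDegree = mapReduceTriplets[Int](edges, e => (e.dstId, 1), _ + _)
    println(inDegree.toList.sorted) // List((1,1), (2,2))
  }
}
```

The same two functions, swapped for rank contributions and summation, give one iteration of PageRank, which is why this operator is the workhorse of the graph-parallel view.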

@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
 
 println(userInfoWithPageRank.top(5))
 
-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>. This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
-    sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
-    ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory. To run one of them, use `./run-example <class>
-<params>`. For example:
-
-    ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs. You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
 
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.
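The tail of the example visible in the hunk above joins PageRank values back onto user attributes with `outerJoinVertices`. A hedged sketch of that join step on plain `Map`s, with a hypothetical standalone helper mirroring its shape ("outer" because vertices missing from the other table still appear, with `None`):

```scala
// Sketch of the outerJoinVertices step: attach an optional joined value
// (e.g., a PageRank score) to every vertex. Illustrative only; the real
// GraphX method is defined on Graph and joins against a vertex RDD.
object OuterJoinVerticesSketch {
  def outerJoinVertices[VD, U, VD2](vertices: Map[Long, VD],
                                    other: Map[Long, U])
                                   (f: (Long, VD, Option[U]) => VD2): Map[Long, VD2] =
    vertices.map { case (vid, attr) => vid -> f(vid, attr, other.get(vid)) }

  def main(args: Array[String]): Unit = {
    val users = Map(1L -> "alice", 2L -> "bob")
    val ranks = Map(1L -> 0.73) // vertex 2 was dropped by the subgraph step
    val joined = outerJoinVertices(users, ranks) {
      (_, name, rank) => (name, rank.getOrElse(0.0))
    }
    println(joined.toList.sortBy(_._1)) // List((1,(alice,0.73)), (2,(bob,0.0)))
  }
}
```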

docs/img/edge_cut_vs_vertex_cut.png (77.9 KB)
docs/img/graph_analytics_pipeline.png (417 KB)
docs/img/graphx_figures.pptx (1.07 MB, binary file not shown)
docs/img/graphx_logo.png (39.4 KB)
docs/img/property_graph.png (77.2 KB)
docs/img/tables_and_graphs.png (95.1 KB)
