
Commit e23b9d6

Author: Davies Liu (committed)
delete R docs for RDD API
1 parent 222e4ff commit e23b9d6

File tree: 4 files changed (+7, -323 lines)

R/README.md

Lines changed: 2 additions & 2 deletions

@@ -52,7 +52,7 @@ The SparkR documentation (Rd files and HTML files) are not a part of the source
 SparkR comes with several sample programs in the `examples/src/main/r` directory.
 To run one of them, use `./bin/sparkR <filename> <args>`. For example:

-    ./bin/sparkR examples/src/main/r/pi.R local[2]
+    ./bin/sparkR examples/src/main/r/dataframe.R

 You can also run the unit-tests for SparkR by running (you need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first):

@@ -63,5 +63,5 @@ You can also run the unit-tests for SparkR by running (you need to install the [
 The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit jobs to YARN clusters. You will need to set YARN conf dir before doing so. For example on CDH you can run
 ```
 export YARN_CONF_DIR=/etc/hadoop/conf
-./bin/spark-submit --master yarn examples/src/main/r/pi.R 4
+./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
 ```

docs/index.md

Lines changed: 2 additions & 2 deletions

@@ -54,14 +54,14 @@ Example applications are also provided in Python. For example,

     ./bin/spark-submit examples/src/main/python/pi.py 10

-Spark also provides an experimental R API since 1.4 (only RDD and DataFrames APIs included).
+Spark also provides an experimental R API since 1.4 (only DataFrames APIs included).
 To run Spark interactively in a R interpreter, use `bin/sparkR`:

     ./bin/sparkR --master local[2]

 Example applications are also provided in R. For example,

-    ./bin/spark-submit examples/src/main/r/pi.R 10
+    ./bin/spark-submit examples/src/main/r/dataframe.R

 # Launching on a Cluster
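The updated line ("only DataFrames APIs included") reflects that the public R surface in Spark 1.4 is the DataFrame API rather than RDDs. As a minimal sketch of that surface, assuming the Spark 1.4 SparkR entry points `sparkR.init` and `sparkRSQL.init` (nothing below is taken from the patch itself):

```r
# Minimal SparkR DataFrame sketch (assumes the Spark 1.4 SparkR API).
library(SparkR)

sc <- sparkR.init(master = "local[2]", appName = "DataFrameSketch")
sqlContext <- sparkRSQL.init(sc)

# Turn a local R data.frame (the built-in 'faithful' dataset) into a
# distributed DataFrame, then inspect it from the driver.
df <- createDataFrame(sqlContext, faithful)
printSchema(df)
head(df)

sparkR.stop()
```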

docs/programming-guide.md

Lines changed: 1 addition & 182 deletions
@@ -105,28 +105,6 @@ from pyspark import SparkContext, SparkConf

 </div>

-<div data-lang="r" markdown="1">
-
-Spark {{site.SPARK_VERSION}} works with R 3.1 or higher.
-
-To run Spark applications in R, use the `bin/spark-submit` script located in the Spark directory.
-This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
-You can also use `bin/sparkR` to launch an interactive R shell.
-
-If you wish to access HDFS data, you need to use a build of Spark linking
-to your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
-[Prebuilt packages](http://spark.apache.org/downloads.html) are also available on the Spark homepage
-for common HDFS versions.
-
-Finally, you need to import the SparkR library into your program. Add the following line:
-
-{% highlight r %}
-library(SparkR)
-{% endhighlight %}
-
-</div>
-
 </div>

@@ -175,17 +153,6 @@ sc = SparkContext(conf=conf)

 </div>

-<div data-lang="r" markdown="1">
-
-The first thing a Spark program must do is to create a SparkContext object, which tells Spark
-how to access a cluster.
-
-{% highlight r %}
-sc = sparkR.init(master, appName)
-{% endhighlight %}
-
-</div>
-
 </div>

 The `appName` parameter is a name for your application to show on the cluster UI.
@@ -279,23 +246,6 @@ your notebook before you start to try Spark from the IPython notebook.

 </div>

-<div data-lang="r" markdown="1">
-
-In the SparkR shell, a special interpreter-aware SparkContext is already created for you, in the
-variable called `sc`. Making your own SparkContext will not work. You can set which master the
-context connects to using the `--master` argument. You can also add dependencies
-(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
-to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. SonaType)
-can be passed to the `--repositories` argument. For example, to run `bin/sparkR` on exactly four cores, use:
-
-{% highlight bash %}
-$ ./bin/sparkR --master local[4]
-{% endhighlight %}
-
-For a complete list of options, run `bin/sparkR --help`. Behind the scenes,
-`sparkR` invokes the more general [`spark-submit` script](submitting-applications.html).
-
-</div>
 </div>

 # Resilient Distributed Datasets (RDDs)
@@ -354,20 +304,6 @@ We describe operations on distributed datasets later on.

 </div>

-<div data-lang="r" markdown="1">
-
-Parallelized collections are created by calling `SparkContext`'s `parallelize` method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
-
-{% highlight r %}
-data <- c(1, 2, 3, 4, 5)
-distData <- parallelize(sc, data)
-{% endhighlight %}
-
-Once created, the distributed dataset (`distData`) can be operated on in parallel. For example, we can call `reduce(distData, "+")` to add up the elements of the list.
-We describe operations on distributed datasets later on.
-
-</div>
-
 </div>

 One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
@@ -540,27 +476,6 @@ See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/mai
 the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
 for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.

-</div>
-<div data-lang="r" markdown="1">
-
-SparkR can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html).
-
-Text file RDDs can be created using `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
-
-{% highlight r %}
-distFile <- textFile(sc, "data.txt")
-{% endhighlight %}
-
-Once created, `distFile` can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the `map` and `reduce` operations as follows: `reduce(map(distFile, length), "+")`.
-
-Some notes on reading files with Spark:
-
-* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
-
-* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile(sc, "/my/directory")`, `textFile(sc, "/my/directory/*.txt")`, and `textFile(sc, "/my/directory/*.gz")`.
-
-* The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
-
 </div>
 </div>

@@ -660,34 +575,6 @@ before the `reduce`, which would cause `lineLengths` to be saved in memory after

 </div>

-<div data-lang="r" markdown="1">
-
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight r %}
-lines <- textFile(sc, "data.txt")
-lineLengths <- map(lines, length)
-totalLength <- reduce(lineLengths, "+")
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight r %}
-persist(lineLengths)
-{% endhighlight %}
-
-before the `reduce`, which would cause `lineLengths` to be saved in memory after the first time it is computed.
-
-</div>
-
 </div>

 ### Passing Functions to Spark
@@ -855,29 +742,6 @@ def doStuff(self, rdd):

 </div>

-<div data-lang="r" markdown="1">
-
-Spark's API relies heavily on passing functions in the driver program to run on the cluster.
-There are three recommended ways to do this:
-
-* [Anonymous functions](http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions),
-  for simple functions that can be written as an anonymous function.
-* Top-level functions in a module.
-
-For example, to pass a longer function, consider the code below:
-
-{% highlight r %}
-"""MyScript.R"""
-myFunc <- function(s) {
-  words = strsplit(s, " ")[[1]]
-  length(words)
-}
-
-sc <- sparkR.init(...)
-map(textFile(sc, "file.txt"), myFunc)
-{% endhighlight %}
-</div>
-
 </div>

 ### Understanding closures <a name="ClosuresLink"></a>
926790
{% endhighlight %}
927791
</div>
928792

929-
<div data-lang="r" markdown="1">
930-
{% highlight r %}
931-
counter <- 0
932-
rdd <- parallelize(sc, data)
933-
934-
# Wrong: Don't do this!!
935-
rdd.foreach(function(x){ counter = counter + x })
936-
937-
cat("Counter value: ", counter)
938-
{% endhighlight %}
939-
</div>
940-
941793
</div>
942794

943795
#### Local vs. cluster modes
@@ -1054,30 +906,6 @@ We could also use `counts.sortByKey()`, for example, to sort the pairs alphabeti

 </div>

-<div data-lang="r" markdown="1">
-
-While most Spark operations work on RDDs containing any type of objects, a few special operations are
-only available on RDDs of key-value pairs.
-The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements
-by a key.
-
-In R, these operations work on RDDs containing built-in R list such as `list(1, 2)`.
-Simply create such lists and then call your desired operation.
-
-For example, the following code uses the `reduceByKey` operation on key-value pairs to count how
-many times each line of text occurs in a file:
-
-{% highlight r %}
-lines <- textFile(sc, "data.txt")
-pairs <- map(lines, function(s) list(s, 1))
-counts <- reduceByKey(pairs, "+")
-{% endhighlight %}
-
-We could also use `sortByKey(counts)`, for example, to sort the pairs alphabetically, and finally
-`collect(counts)` to bring them back to the driver program as a list of objects.
-
-</div>
-
 </div>

@@ -1493,15 +1321,6 @@ broadcastVar.value();

 </div>

-<div data-lang="r" markdown="1">
-
-{% highlight r %}
-> broadcastVar <- broadcast(sc, c(1, 2, 3))
-> value(broadcastVar)
-[1] 1 2 3
-{% endhighlight %}
-
-</div>
 </div>

 After the broadcast variable is created, it should be used instead of the value `v` in any functions
@@ -1761,7 +1580,7 @@ For Python examples, use `spark-submit` instead:

 For R examples, use `spark-submit` instead:

-    ./bin/spark-submit examples/src/main/r/pi.R
+    ./bin/spark-submit examples/src/main/r/dataframe.R

 For help on optimizing your programs, the [configuration](configuration.html) and
 [tuning](tuning.html) guides provide information on best practices. They are especially important for
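The removed sections above documented the R RDD API (`parallelize`, `textFile`, `map`/`reduce`, `reduceByKey`, `broadcast`); the guide now points R users at `examples/src/main/r/dataframe.R` instead. For orientation, here is a hedged sketch in the spirit of that example, assuming the Spark 1.4 SparkR DataFrame API (`createDataFrame`, `jsonFile`, `registerTempTable`, `sql`); the sample data and file path are assumptions, not the file's verbatim contents:

```r
# A sketch in the spirit of examples/src/main/r/dataframe.R (not its exact
# contents), assuming the Spark 1.4 SparkR DataFrame API.
library(SparkR)

sc <- sparkR.init(appName = "SparkR-DataFrame-example")
sqlContext <- sparkRSQL.init(sc)

# Convert a local R data.frame into a SparkR DataFrame (sample values assumed).
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
printSchema(df)

# Load a JSON file that ships with Spark and query it with SQL.
path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")
people <- jsonFile(sqlContext, path)
registerTempTable(people, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)

sparkR.stop()
```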
