**R/README.md** (2 additions, 2 deletions)
```diff
@@ -52,7 +52,7 @@ The SparkR documentation (Rd files and HTML files) are not a part of the source
 SparkR comes with several sample programs in the `examples/src/main/r` directory.
 To run one of them, use `./bin/sparkR <filename> <args>`. For example:
 
-    ./bin/sparkR examples/src/main/r/pi.R local[2]
+    ./bin/sparkR examples/src/main/r/dataframe.R
 
 You can also run the unit-tests for SparkR by running (you need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first):
 
@@ -63,5 +63,5 @@ You can also run the unit-tests for SparkR by running (you need to install the [
 The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit jobs to YARN clusters. You will need to set YARN conf dir before doing so. For example on CDH you can run
```
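The testthat prerequisite mentioned in the first hunk is an ordinary CRAN install; a minimal example (the actual test command is not shown in the captured hunks):

```r
# One-time install of the testthat package required for the SparkR unit tests.
install.packages("testthat")
```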
**docs/programming-guide.md** (1 addition, 182 deletions)
```diff
@@ -105,28 +105,6 @@ from pyspark import SparkContext, SparkConf
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-Spark {{site.SPARK_VERSION}} works with R 3.1 or higher.
-
-To run Spark applications in R, use the `bin/spark-submit` script located in the Spark directory.
-This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
-You can also use `bin/sparkR` to launch an interactive R shell.
-
-If you wish to access HDFS data, you need to use a build of Spark linking
-to your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
-[Prebuilt packages](http://spark.apache.org/downloads.html) are also available on the Spark homepage
-for common HDFS versions.
-
-Finally, you need to import the SparkR library into your program. Add the following line:
-
-{% highlight r %}
-library(SparkR)
-{% endhighlight %}
-
-</div>
-
 </div>
 
 
@@ -175,17 +153,6 @@ sc = SparkContext(conf=conf)
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-The first thing a Spark program must do is to create a SparkContext object, which tells Spark
-how to access a cluster.
-
-{% highlight r %}
-sc = sparkR.init(master, appName)
-{% endhighlight %}
-
-</div>
-
 </div>
 
 The `appName` parameter is a name for your application to show on the cluster UI.
```
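For readers of the removed R snippet above, a slightly fuller initialization might look like the sketch below. The explicit argument values, the named-argument form, and the `<-` assignment are illustrative assumptions rather than content of the diff.

```r
library(SparkR)

# Hypothetical values: a local master with two worker threads and an arbitrary app name.
sc <- sparkR.init(master = "local[2]", appName = "MyRApp")
```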
```diff
@@ -279,23 +246,6 @@ your notebook before you start to try Spark from the IPython notebook.
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-In the SparkR shell, a special interpreter-aware SparkContext is already created for you, in the
-variable called `sc`. Making your own SparkContext will not work. You can set which master the
-context connects to using the `--master` argument. You can also add dependencies
-(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
-to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. SonaType)
-can be passed to the `--repositories` argument. For example, to run `bin/sparkR` on exactly four cores, use:
-
-{% highlight bash %}
-$ ./bin/sparkR --master local[4]
-{% endhighlight %}
-
-For a complete list of options, run `bin/sparkR --help`. Behind the scenes,
-`sparkR` invokes the more general [`spark-submit` script](submitting-applications.html).
-
-</div>
 </div>
 
 # Resilient Distributed Datasets (RDDs)
@@ -354,20 +304,6 @@ We describe operations on distributed datasets later on.
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-Parallelized collections are created by calling `SparkContext`'s `parallelize` method on an existing collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
-
-{% highlight r %}
-data <- c(1, 2, 3, 4, 5)
-distData <- parallelize(sc, data)
-{% endhighlight %}
-
-Once created, the distributed dataset (`distData`) can be operated on in parallel. For example, we can call `reduce(distData, "+")` to add up the elements of the list.
-We describe operations on distributed datasets later on.
-
-</div>
-
 </div>
 
 One important parameter for parallel collections is the number of *partitions* to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 10)`). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
```
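The surviving paragraph above gives the Scala/Python form `sc.parallelize(data, 10)`. Under the removed R snippet, the partition count would presumably be passed in the same position; the extra argument below is an assumption for illustration, not something the diff documents.

```r
data <- c(1, 2, 3, 4, 5)

# Assumed: an optional partition count, mirroring sc.parallelize(data, 10) in the other APIs.
distData <- parallelize(sc, data, 10)
```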
```diff
@@ -540,27 +476,6 @@ See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/mai
 the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
 for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.
 
-</div>
-<div data-lang="r" markdown="1">
-
-SparkR can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), etc. Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html).
-
-Text file RDDs can be created using `textFile` method. This method takes an URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, etc URI) and reads it as a collection of lines. Here is an example invocation:
-
-{% highlight r %}
-distFile <- textFile(sc, "data.txt")
-{% endhighlight %}
-
-Once created, `distFile` can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the `map` and `reduce` operations as follows: `reduce(map(distFile, length), "+")`.
-
-Some notes on reading files with Spark:
-
-* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
-
-* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile(sc, "/my/directory")`, `textFile(sc, "/my/directory/*.txt")`, and `textFile(sc, "/my/directory/*.gz")`.
-
-* The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
-
 </div>
 </div>
 
```
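Pulling the removed bullets together into one sketch: the paths are placeholders, and the optional partition argument is inferred from the prose rather than from a verified function signature.

```r
# A single text file read as an RDD of lines (as in the removed snippet).
distFile <- textFile(sc, "data.txt")

# Directories, wildcards, and compressed files, per the removed bullets.
dirRDD <- textFile(sc, "/my/directory")
txtRDD <- textFile(sc, "/my/directory/*.txt")
gzRDD  <- textFile(sc, "/my/directory/*.gz")

# Assumed optional second argument: request more partitions than one per HDFS block.
wideRDD <- textFile(sc, "data.txt", 100)
```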
```diff
@@ -660,34 +575,6 @@ before the `reduce`, which would cause `lineLengths` to be saved in memory after
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight r %}
-lines <- textFile(sc, "data.txt")
-lineLengths <- map(lines, length)
-totalLength <- reduce(lineLengths, "+")
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight r %}
-persist(lineLengths)
-{% endhighlight %}
-
-before the `reduce`, which would cause `lineLengths` to be saved in memory after the first time it is computed.
-
-</div>
-
 </div>
 
 ### Passing Functions to Spark
@@ -855,29 +742,6 @@ def doStuff(self, rdd):
 
 </div>
 
-<div data-lang="r" markdown="1">
-
-Spark's API relies heavily on passing functions in the driver program to run on the cluster.
```
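The last hunk is cut off above. As a rough sketch of the kind of usage that section described, written in the style of the removed `map`/`reduce` snippets earlier in this diff (the function names come from those snippets and are not verified against any current SparkR release):

```r
library(SparkR)
# Assumes the pre-DataFrame SparkR RDD API (textFile/map/reduce) shown in the removed snippets.

# A named function defined in the driver program and shipped to the cluster.
wordsPerLine <- function(line) {
  length(strsplit(line, " ")[[1]])
}

lines <- textFile(sc, "data.txt")
counts <- map(lines, wordsPerLine)

# An anonymous function passed inline works the same way.
lineLengths <- map(lines, function(line) nchar(line))

totalWords <- reduce(counts, "+")
```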