```{r}
out <- dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
head(collect(out))
```

Like `dapply`, `dapplyCollect` can apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the function should be a `data.frame`, but no schema is required in this case. Note that `dapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.
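
For instance, a minimal sketch that performs the same conversion as the `dapply` example above but returns a local R `data.frame` directly (reusing `carsSubDF`; the `"kmpg"` column name is just illustrative):

```{r}
ldf <- dapplyCollect(
         carsSubDF,
         function(x) {
           # add a km-per-gallon column; no schema argument is needed
           x <- cbind(x, "kmpg" = x$mpg * 1.61)
         })
head(ldf, 3)
```
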
Like `gapply`, `gapplyCollect` applies a function to each partition of a `SparkDataFrame` and collects the result back to an R `data.frame`. The output of the function should be a `data.frame`, but no schema is required in this case. Note that `gapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.

```{r}
result <- gapplyCollect(
  carsDF,    # assumes the carsDF SparkDataFrame created earlier in the vignette
  "cyl",
  function(key, x) {
    # compute the maximum mpg for each number of cylinders
    y <- data.frame(key, max(x$mpg))
    colnames(y) <- c("cyl", "max_mpg")
    y
  })
head(result)
```

### SQL Queries
A `SparkDataFrame` can also be registered as a temporary view in Spark SQL so that one can run SQL queries over its data. The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
```{r}
# Register the SparkDataFrame as a temporary view and query it with SQL.
# people.json is the example data shipped with the Spark distribution.
people <- read.df(paste0(sparkR.conf("spark.home"),
                         "/examples/src/main/resources/people.json"), "json")
createOrReplaceTempView(people, "people")
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
```

* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words).
* Rather than clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two options for the column:
* character string: This can be a string of the whole document. It will be parsed automatically. Additional stop words can be added in `customizedStopWords`.
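
As a rough sketch of the character-string option (the two-document corpus below is made up purely for illustration), each entry of the `features` column is a whole raw document:

```{r}
corpus <- createDataFrame(data.frame(features = c(
  "spark makes distributed data processing simple",
  "sparkr lets r users work with spark from r")))

# Fit an LDA model with 2 topics and look at the per-document topic mixtures.
model <- spark.lda(corpus, k = 2, maxIter = 10)
posterior <- spark.posterior(model, corpus)
head(posterior)
```
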
`spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, and `nonnegative`. For a complete list, refer to the help file.
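
As a rough sketch (the explicit ratings below are made up for illustration), a model can be fit on a `SparkDataFrame` of `(user, item, rating)` triples and then used for prediction:

```{r}
# Toy explicit ratings: (user, item, rating) triples.
ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0),
                list(1, 2, 4.0), list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))

# Fit ALS with a small rank and some regularization, then predict ratings.
model <- spark.als(df, "rating", "user", "item", rank = 10, reg = 0.1, nonnegative = TRUE)
head(predict(model, df))
```
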
The following example shows how to save/load an ML model in SparkR.
```{r}
t <- as.data.frame(Titanic)
training <- createDataFrame(t)

# A minimal save/load sketch: fit a Gaussian GLM, write it out, and read it back.
gaussianGLM <- spark.glm(training, Freq ~ Sex + Age, family = "gaussian")
modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
write.ml(gaussianGLM, modelPath)
gaussianGLM2 <- read.ml(modelPath)
summary(gaussianGLM2)

unlink(modelPath)
```

There are three main object classes in SparkR you may be working with.

* `SparkDataFrame`: the central component of SparkR, an S4 class representing a distributed collection of data organized into named columns. It has two slots, `sdf` and `env`.
    + `sdf` stores a reference to the corresponding Spark Dataset in the Spark JVM backend.
    + `env` saves the meta-information of the object such as `isCached`.
It can be created by data import methods or by transforming an existing `SparkDataFrame`. We can manipulate a `SparkDataFrame` with numerous data processing functions and feed the result into machine learning algorithms.
* `Column`: an S4 class representing a column of a `SparkDataFrame`. The slot `jc` saves a reference to the corresponding `Column` object in the Spark JVM backend.
It can be obtained from a `SparkDataFrame` with the `$` operator, e.g., `df$col`. More often, it is used together with other functions, for example, with `select` to select particular columns, with `filter` and constructed conditions to select rows, or with aggregation functions to compute aggregate statistics for each group (see the sketch after this list).
* `GroupedData`: an S4 class representing grouped data created by `groupBy` or by transforming other `GroupedData`. Its `sgd` slot saves a reference to a `RelationalGroupedDataset` object in the backend.
This is often an intermediate object, holding the grouping information and followed by aggregation operations.
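
A minimal sketch tying these classes together (assuming the `carsDF` `SparkDataFrame` created earlier in the vignette):

```{r}
mpgCol <- carsDF$mpg                     # a Column object
head(select(carsDF, mpgCol))             # select a particular column
head(filter(carsDF, carsDF$mpg > 20))    # select rows with a constructed condition

gdf <- groupBy(carsDF, "cyl")            # a GroupedData object
head(agg(gdf, mpg = "avg"))              # aggregate statistics for each group
```
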
### Architecture
A complete description of the architecture can be found in the references, in particular the paper *SparkR: Scaling R Programs with Spark*.
Under the hood, SparkR uses the Spark SQL engine. This avoids the overhead of running interpreted R code, and the optimized SQL execution engine in Spark uses structural information about the data and computation flow to perform a number of optimizations that speed up the computation.