
Commit 796a8e7

Author: Wayne Zhang
Message: fix typo and style in vignettes
Parent: 8639025

File tree: 1 file changed (+17, -17 lines)


R/pkg/vignettes/sparkr-vignettes.Rmd

Lines changed: 17 additions & 17 deletions
@@ -65,7 +65,7 @@ We can view the first few rows of the `SparkDataFrame` by `head` or `showDF` fun
 head(carsDF)
 ```
 
-Common data processing operations such as `filter`, `select` are supported on the `SparkDataFrame`.
+Common data processing operations such as `filter` and `select` are supported on the `SparkDataFrame`.
 ```{r}
 carsSubDF <- select(carsDF, "model", "mpg", "hp")
 carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
@@ -379,7 +379,7 @@ out <- dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
 head(collect(out))
 ```
 
-Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function should be a `data.frame`, but no schema is required in this case. Note that `dapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
+Like `dapply`, `dapplyCollect` can apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of the function should be a `data.frame`, but no schema is required in this case. Note that `dapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.
 
 ```{r}
 out <- dapplyCollect(
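
The chunk is truncated at the hunk boundary. For context, a minimal sketch of a complete `dapplyCollect` call, modeled on the `dapply` example in the hunk header (the function body is illustrative, not necessarily the one in the file):

```r
# Apply a function to each partition of carsSubDF and collect the combined
# output as a local R data.frame; unlike dapply, no schema argument is needed.
out <- dapplyCollect(
  carsSubDF,
  function(x) {
    x <- cbind(x, "kmpg" = x$mpg * 1.61)  # illustrative: add a km-per-gallon column
    x
  })
head(out)
```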
@@ -405,7 +405,7 @@ result <- gapply(
 head(arrange(result, "max_mpg", decreasing = TRUE))
 ```
 
-Like gapply, `gapplyCollect` applies a function to each partition of a `SparkDataFrame` and collect the result back to R `data.frame`. The output of the function should be a `data.frame` but no schema is required in this case. Note that `gapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
+Like `gapply`, `gapplyCollect` applies a function to each partition of a `SparkDataFrame` and collects the result back to an R `data.frame`. The output of the function should be a `data.frame` but no schema is required in this case. Note that `gapplyCollect` can fail if the output of the UDF on all partitions cannot be pulled into the driver's memory.
 
 ```{r}
 result <- gapplyCollect(
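
Again the chunk is cut off at the hunk boundary. A hedged sketch of a complete `gapplyCollect` call, reusing the `max_mpg` computation from the `gapply` example above (grouping column and function body are illustrative):

```r
# Group carsDF by number of cylinders, compute max mpg per group, and collect
# the result straight back as a local R data.frame (no schema argument).
result <- gapplyCollect(
  carsDF,
  "cyl",
  function(key, x) {
    y <- data.frame(cyl = key[[1]], max_mpg = max(x$mpg))
    y
  })
head(result)
```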
@@ -458,20 +458,20 @@ options(ops)
 
 
 ### SQL Queries
-A `SparkDataFrame` can also be registered as a temporary view in Spark SQL and that allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
+A `SparkDataFrame` can also be registered as a temporary view in Spark SQL so that one can run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
 
 ```{r}
 people <- read.df(paste0(sparkR.conf("spark.home"),
                          "/examples/src/main/resources/people.json"), "json")
 ```
 
-Register this SparkDataFrame as a temporary view.
+Register this `SparkDataFrame` as a temporary view.
 
 ```{r}
 createOrReplaceTempView(people, "people")
 ```
 
-SQL statements can be run by using the sql method.
+SQL statements can be run using the sql method.
 ```{r}
 teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 head(teenagers)
@@ -780,7 +780,7 @@ head(predict(isoregModel, newDF))
 `spark.gbt` fits a [gradient-boosted tree](https://en.wikipedia.org/wiki/Gradient_boosting) classification or regression model on a `SparkDataFrame`.
 Users can call `summary` to get a summary of the fitted model, `predict` to make predictions, and `write.ml`/`read.ml` to save/load fitted models.
 
-Similar to the random forest example above, we use the `longley` dataset to train a gradient-boosted tree and make predictions:
+We use the `longley` dataset to train a gradient-boosted tree and make predictions:
 
 ```{r, warning=FALSE}
 df <- createDataFrame(longley)
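
The `longley` chunk is truncated after its first line. A hedged sketch of how the fit continues, following the `summary`/`predict` pattern the paragraph describes (the `maxDepth` and `maxIter` values are illustrative):

```r
# Fit a gradient-boosted tree regression model on the longley data.
df <- createDataFrame(longley)
gbtModel <- spark.gbt(df, Employed ~ ., type = "regression", maxDepth = 5, maxIter = 20)
summary(gbtModel)
head(predict(gbtModel, df))  # predictions on the training frame
```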
@@ -851,9 +851,9 @@ head(select(kmeansPredictions, "model", "mpg", "hp", "wt", "prediction"), n = 20
 
 * Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words).
 
-* Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
+* Rather than clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.
 
-To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two type options for the column:
+To use LDA, we need to specify a `features` column in `data` where each entry represents a document. There are two options for the column:
 
 * character string: This can be a string of the whole document. It will be parsed automatically. Additional stop words can be added in `customizedStopWords`.
 
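The hunk ends inside the two-option list, after the character-string option. As a hedged illustration of that option (the corpus, column name, and `k` are invented for the sketch; `spark.lda` parses the strings into bag-of-words vectors itself):

```r
# Each row of the character column "features" is treated as one document.
corpus <- createDataFrame(data.frame(
  features = c("spark is a fast general engine",
               "sparkr exposes spark to r users",
               "topic models cluster text documents"),
  stringsAsFactors = FALSE))
ldaModel <- spark.lda(corpus, features = "features", k = 2)
summary(ldaModel)
```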
@@ -901,7 +901,7 @@ perplexity
 
 `spark.als` learns latent factors in [collaborative filtering](https://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) via [alternating least squares](http://dl.acm.org/citation.cfm?id=1608614).
 
-There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, `nonnegative`. For a complete list, refer to the help file.
+There are multiple options that can be configured in `spark.als`, including `rank`, `reg`, and `nonnegative`. For a complete list, refer to the help file.
 
 ```{r, eval=FALSE}
 ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0),
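
The `ratings` chunk is cut off mid-list. A hedged sketch of the complete ALS fit, exercising the three options named in the paragraph (the extra rating rows and parameter values are illustrative):

```r
# Build a (user, item, rating) SparkDataFrame and fit ALS on it.
ratings <- list(list(0, 0, 4.0), list(0, 1, 2.0), list(1, 1, 3.0), list(1, 2, 4.0),
                list(2, 1, 1.0), list(2, 2, 5.0))
df <- createDataFrame(ratings, c("user", "item", "rating"))
model <- spark.als(df, "rating", "user", "item", rank = 10, reg = 0.1, nonnegative = TRUE)
head(predict(model, df))
```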
@@ -981,7 +981,7 @@ testSummary
 
 
 ### Model Persistence
-The following example shows how to save/load an ML model by SparkR.
+The following example shows how to save/load an ML model in SparkR.
 ```{r}
 t <- as.data.frame(Titanic)
 training <- createDataFrame(t)
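
The persistence chunk is truncated after the `training` frame is created. A hedged sketch of the save/load round trip the section describes (the GLM fit and temporary path are illustrative):

```r
# Fit a model, persist it with write.ml, and restore it with read.ml.
gaussianGLM <- spark.glm(training, Freq ~ Sex + Age, family = "gaussian")
modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
write.ml(gaussianGLM, modelPath)
gaussianGLM2 <- read.ml(modelPath)  # the reloaded model supports summary/predict
summary(gaussianGLM2)
unlink(modelPath)
```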
@@ -1079,19 +1079,19 @@ There are three main object classes in SparkR you may be working with.
     + `sdf` stores a reference to the corresponding Spark Dataset in the Spark JVM backend.
     + `env` saves the meta-information of the object such as `isCached`.
 
-It can be created by data import methods or by transforming an existing `SparkDataFrame`. We can manipulate `SparkDataFrame` by numerous data processing functions and feed that into machine learning algorithms.
+    It can be created by data import methods or by transforming an existing `SparkDataFrame`. We can manipulate `SparkDataFrame` by numerous data processing functions and feed that into machine learning algorithms.
 
-* `Column`: an S4 class representing column of `SparkDataFrame`. The slot `jc` saves a reference to the corresponding Column object in the Spark JVM backend.
+* `Column`: an S4 class representing a column of `SparkDataFrame`. The slot `jc` saves a reference to the corresponding `Column` object in the Spark JVM backend.
 
-It can be obtained from a `SparkDataFrame` by `$` operator, `df$col`. More often, it is used together with other functions, for example, with `select` to select particular columns, with `filter` and constructed conditions to select rows, with aggregation functions to compute aggregate statistics for each group.
+    It can be obtained from a `SparkDataFrame` by the `$` operator, e.g., `df$col`. More often, it is used together with other functions, for example, with `select` to select particular columns, with `filter` and constructed conditions to select rows, with aggregation functions to compute aggregate statistics for each group.
 
-* `GroupedData`: an S4 class representing grouped data created by `groupBy` or by transforming other `GroupedData`. Its `sgd` slot saves a reference to a RelationalGroupedDataset object in the backend.
+* `GroupedData`: an S4 class representing grouped data created by `groupBy` or by transforming other `GroupedData`. Its `sgd` slot saves a reference to a `RelationalGroupedDataset` object in the backend.
 
-This is often an intermediate object with group information and followed up by aggregation operations.
+    This is often an intermediate object with group information and followed up by aggregation operations.
 
 ### Architecture
 
-A complete description of architecture can be seen in reference, in particular the paper *SparkR: Scaling R Programs with Spark*.
+A complete description of the architecture can be seen in the references, in particular the paper *SparkR: Scaling R Programs with Spark*.
 
 Under the hood of SparkR is Spark SQL engine. This avoids the overheads of running interpreted R code, and the optimized SQL execution engine in Spark uses structural information about data and computation flow to perform a bunch of optimizations to speed up the computation.
 
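To make the three classes concrete, a short sketch (assuming the vignette's `carsDF`) of a `Column` obtained with `$` and used in `filter`/`select`, and a `GroupedData` consumed by an aggregation:

```r
# Column: carsDF$hp is an S4 Column object usable in a filter condition.
fastCars <- select(filter(carsDF, carsDF$hp >= 200), "model", "mpg", "hp")

# GroupedData: intermediate result of groupBy, followed up by aggregation.
byCyl <- groupBy(carsDF, carsDF$cyl)
head(agg(byCyl, mpg_mean = avg(carsDF$mpg)))
```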
