Commit 78ff03d

Merge pull request #110 from fe2s/spark-shell-docs-2
#109: document spark-shell and pyspark configuration parameters
2 parents bab0584 + 82b33cf commit 78ff03d

File tree

3 files changed: +51 −21 lines


doc/dataframe.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -75,7 +75,7 @@ It is used by spark-redis internally when reading DataFrame back to Spark memory
 
 ### Specifying Redis key
 
-By default, spark-redis generates UUID identifier for each row to ensure
+By default spark-redis generates UUID identifier for each row to ensure
 their uniqueness. However, you can also provide your own column as a key. This is controlled with `key.column` option:
 
 ```scala
@@ -157,7 +157,7 @@ df.write
 
 ### Persistence model
 
-By default, DataFrames are persisted as Redis Hashes. It allows to write data with Spark and query from non-Spark environment.
+By default DataFrames are persisted as Redis Hashes. It allows to write data with Spark and query from non-Spark environment.
 It also enables projection query optimization when only a small subset of columns are selected. On the other hand, there is currently
 a limitation with Hash model - it doesn't support nested DataFrame schema. One option to overcome it is making your DataFrame schema flat.
 If it is not possible due to some constraints, you may consider using Binary persistence model.
````
doc/getting-started.md

Lines changed: 27 additions & 17 deletions
````diff
@@ -20,25 +20,21 @@ cd spark-redis
 mvn clean package -DskipTests
 ```
 
-## Using the library
-Add Spark-Redis to Spark with the `--jars` command line option. For example, use it from spark-shell, include it in the following manner:
+### Using the library with spark shell
+Add Spark-Redis to Spark with the `--jars` command line option.
 
-```
+```bash
 $ bin/spark-shell --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
+```
 
-Welcome to
-      ____              __
-     / __/__  ___ _____/ /__
-    _\ \/ _ \/ _ `/ __/ '_/
-   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
-      /_/
+By default it connects to `localhost:6379` without any password, you can change the connection settings in the following manner:
 
-Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
+```bash
+$ bin/spark-shell --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379" --conf "spark.redis.auth=passwd"
 ```
 
-The following sections contain code snippets that demonstrate the use of Spark-Redis. To use the sample code, you'll need to replace `your.redis.server` and `6379` with your Redis database's IP address or hostname and port, respectively.
 
-### Configuring Connections to Redis using SparkConf
+### Configuring connection to Redis in a self-contained application
 
 Below is an example configuration of SparkContext with redis configuration:
 
@@ -47,21 +43,33 @@ import com.redislabs.provider.redis._
 
 ...
 
-sc = new SparkContext(new SparkConf()
+val sc = new SparkContext(new SparkConf()
       .setMaster("local")
       .setAppName("myApp")
-
       // initial redis host - can be any node in cluster mode
       .set("spark.redis.host", "localhost")
-
       // initial redis port
       .set("spark.redis.port", "6379")
-
       // optional redis AUTH password
-      .set("spark.redis.auth", "")
+      .set("spark.redis.auth", "passwd")
 )
 ```
 
+The SparkSession can be configured in a similar manner:
+
+```scala
+val spark = SparkSession
+  .builder()
+  .appName("myApp")
+  .master("local[*]")
+  .config("spark.redis.host", "localhost")
+  .config("spark.redis.port", "6379")
+  .config("spark.redis.auth", "passwd")
+  .getOrCreate()
+
+val sc = spark.sparkContext
+```
+
 ### Create RDD
 
 ```scala
@@ -83,6 +91,8 @@ df.write
 ### Create Stream
 
 ```scala
+import com.redislabs.provider.redis._
+
 val ssc = new StreamingContext(sc, Seconds(1))
 val redisStream = ssc.createRedisStream(Array("foo", "bar"),
   storageLevel = StorageLevel.MEMORY_AND_DISK_2)
````
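The fallback behavior this commit documents (connect to `localhost:6379` with no password unless `spark.redis.host`, `spark.redis.port`, or `spark.redis.auth` are set) can be sketched in plain Python; `redis_connection` is an illustrative stand-in, not the connector's internal code:

```python
def redis_connection(conf):
    """Resolve (host, port, auth) from a dict of spark.redis.* settings,
    falling back to the documented defaults: localhost:6379, no AUTH."""
    return (
        conf.get("spark.redis.host", "localhost"),
        int(conf.get("spark.redis.port", "6379")),
        conf.get("spark.redis.auth"),  # None means no password is sent
    )
```

So an empty configuration yields `("localhost", 6379, None)`, while setting the three properties via `--conf` or `SparkConf` overrides each default independently.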

doc/python.md

Lines changed: 22 additions & 2 deletions
````diff
@@ -8,9 +8,16 @@ Here is an example:
 1. Run `pyspark` providing the spark-redis jar file
 
 ```bash
-$ ./bin/pyspark --jars /your/path/to/spark-redis-<version>-jar-with-dependencies.jar
+$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
 ```
 
+By default it connects to `localhost:6379` without any password, you can change the connection settings in the following manner:
+
+```bash
+$ bin/spark-shell --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar --conf "spark.redis.host=localhost" --conf "spark.redis.port=6379" --conf "spark.redis.auth=passwd"
+```
+
 2. Read DataFrame from json, write/read from Redis:
 ```python
 df = spark.read.json("examples/src/main/resources/people.json")
@@ -19,7 +26,7 @@ loadedDf = spark.read.format("org.apache.spark.sql.redis").option("table", "peop
 loadedDf.show()
 ```
 
-2. Check the data with redis-cli:
+3. Check the data with redis-cli:
 
 ```bash
 127.0.0.1:6379> hgetall people:Justin
@@ -29,3 +36,16 @@ loadedDf.show()
 4) "Justin"
 ```
 
+The self-contained application can be configured in the following manner:
+
+```python
+SparkSession\
+    .builder\
+    .appName("myApp")\
+    .config("spark.redis.host", "localhost")\
+    .config("spark.redis.port", "6379")\
+    .config("spark.redis.auth", "passwd")\
+    .getOrCreate()
+```
````
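The shell examples in this commit pass each setting as a separate `--conf "key=value"` flag. How those flags end up as configuration entries can be sketched with a small parser; `parse_conf_flags` is an illustrative sketch, not Spark's own argument handling:

```python
def parse_conf_flags(argv):
    """Collect the "key=value" pairs passed via repeated --conf flags
    into a dict, as in the spark-shell/pyspark invocations above."""
    conf, args = {}, iter(argv)
    for arg in args:
        if arg == "--conf":
            key, _, value = next(args).partition("=")
            conf[key] = value
    return conf
```

Applied to the documented command line, the three `spark.redis.*` flags come back as a dict that mirrors the `SparkSession.builder.config(...)` calls shown above.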
