
Commit 528dd34

zsxwing authored and Andrew Or committed
[SPARK-4361][Doc] Add more docs for Hadoop Configuration
I'm trying to point out reusing a Configuration in these APIs is dangerous. Any better idea?

Author: zsxwing <[email protected]>

Closes #3225 from zsxwing/SPARK-4361 and squashes the following commits:

fe4e3d5 [zsxwing] Add more docs for Hadoop Configuration

(cherry picked from commit af2a2a2)
Signed-off-by: Andrew Or <[email protected]>
1 parent 9e828f4 commit 528dd34
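To make the warning concrete: the hazard is mutating a single JobConf/Configuration that several RDDs share. Below is a minimal sketch of the risky pattern, assuming a spark-shell style `sc: SparkContext`; the paths are illustrative and not part of the patch.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Risky: one JobConf reused and mutated across two RDDs. Each hadoopRDD call
// wraps the conf in a Broadcast, but on the driver both RDDs still reference
// the same underlying object, so the later mutation may also affect rddA.
val shared = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(shared, "/data/a")   // illustrative path
val rddA = sc.hadoopRDD(shared, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])

FileInputFormat.setInputPaths(shared, "/data/b")   // mutation after rddA exists
val rddB = sc.hadoopRDD(shared, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])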

File tree

2 files changed: +46 −2 lines

core/src/main/scala/org/apache/spark/SparkContext.scala

Lines changed: 18 additions & 2 deletions
@@ -288,7 +288,12 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   // the bound port to the cluster manager properly
   ui.foreach(_.bind())
 
-  /** A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse. */
+  /**
+   * A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
+   */
   val hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(conf)
 
   // Add each JAR given through the constructor
@@ -694,7 +699,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
    * using the older MapReduce API (`org.apache.hadoop.mapred`).
    *
-   * @param conf JobConf for setting up the dataset
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
    * @param inputFormatClass Class of the InputFormat
    * @param keyClass Class of the keys
    * @param valueClass Class of the values
@@ -830,6 +838,14 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
    * and extra configuration options to pass to the input format.
    *
+   * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param fClass Class of the InputFormat
+   * @param kClass Class of the keys
+   * @param vClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
    * operation will create many references to the same object.
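The @param notes above boil down to a simple rule: build a fresh JobConf (or Configuration) for every RDD, seeded from sc.hadoopConfiguration, and never mutate one that an existing RDD was created with. A minimal sketch against the old mapred API, assuming `sc: SparkContext`; the helper name and paths are illustrative.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Hypothetical helper: one fresh JobConf per dataset, copied from the shared conf.
def confFor(path: String): JobConf = {
  val conf = new JobConf(sc.hadoopConfiguration)  // copy; do not mutate the shared conf
  FileInputFormat.setInputPaths(conf, path)
  conf
}

// Each RDD gets its own conf; the conf is broadcast with the RDD, and nothing
// mutates it after the RDD has been created.
val logsA = sc.hadoopRDD(confFor("/data/logs-a"), classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
val logsB = sc.hadoopRDD(confFor("/data/logs-b"), classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])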

core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala

Lines changed: 28 additions & 0 deletions
@@ -373,6 +373,15 @@ class JavaSparkContext(val sc: SparkContext)
    * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
    * etc).
    *
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param inputFormatClass Class of the InputFormat
+   * @param keyClass Class of the keys
+   * @param valueClass Class of the values
+   * @param minPartitions Minimum number of Hadoop Splits to generate.
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -395,6 +404,14 @@ class JavaSparkContext(val sc: SparkContext)
    * Get an RDD for a Hadoop-readable dataset from a Hadooop JobConf giving its InputFormat and any
    * other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
    *
+   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param inputFormatClass Class of the InputFormat
+   * @param keyClass Class of the keys
+   * @param valueClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -476,6 +493,14 @@ class JavaSparkContext(val sc: SparkContext)
    * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
    * and extra configuration options to pass to the input format.
    *
+   * @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
+   *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
+   *             sure you won't modify the conf. A safe approach is always creating a new conf for
+   *             a new RDD.
+   * @param fClass Class of the InputFormat
+   * @param kClass Class of the keys
+   * @param vClass Class of the values
+   *
    * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
    * record, directly caching the returned RDD will create many references to the same object.
    * If you plan to directly cache Hadoop writable objects, you should first copy them using
@@ -675,6 +700,9 @@ class JavaSparkContext(val sc: SparkContext)
 
   /**
    * Returns the Hadoop configuration used for the Hadoop code (e.g. file systems) we reuse.
+   *
+   * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
+   * plan to set some global configurations for all Hadoop RDDs.
    */
   def hadoopConfiguration(): Configuration = {
     sc.hadoopConfiguration
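For the note on hadoopConfiguration(): settings that genuinely should apply to every Hadoop RDD can be set once on the shared configuration, while per-dataset options belong on a fresh Configuration handed to the new-API methods. A sketch under the same assumptions (`sc` exists; the keys and paths are illustrative, not mandated by the patch):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Global setting, applied once, before any Hadoop RDDs are created:
sc.hadoopConfiguration.setInt("io.file.buffer.size", 64 * 1024)

// Per-dataset option on a fresh copy, because the conf passed here is broadcast
// with the RDD and should not be shared or mutated afterwards.
val recordConf = new Configuration(sc.hadoopConfiguration)
recordConf.set("textinputformat.record.delimiter", "\n\n")  // illustrative option

val records = sc.newAPIHadoopFile("/data/records",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], recordConf)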
