Commit c0169da

JerryLead authored and committed: update to the latest version
2 parents 52799e3 + b0a46d8

File tree: 11 files changed (+76, -28 lines)


docs/programming-guide.md

Lines changed: 6 additions & 0 deletions

@@ -934,6 +934,12 @@ for details.
     <td> Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them.
     This always shuffles all data over the network. </td>
   </tr>
+  <tr>
+    <td> <b>repartitionAndSortWithinPartitions</b>(<i>partitioner</i>) </td>
+    <td> Repartition the RDD according to the given partitioner and, within each resulting partition,
+    sort records by their keys. This is more efficient than calling <code>repartition</code> and then sorting within
+    each partition because it can push the sorting down into the shuffle machinery. </td>
+  </tr>
 </table>

 ### Actions
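The table row added above can be illustrated with a toy model. This is a plain-Python sketch of the behavior, not the Spark API: pairs are hashed to partitions by key, and each partition comes out sorted by key. In Spark the sort is fused into the shuffle itself, which is what makes it cheaper than `repartition` followed by a separate per-partition sort.

```python
def repartition_and_sort_within_partitions(pairs, num_partitions):
    """Toy model of repartitionAndSortWithinPartitions: route each
    (key, value) pair to a partition by hashing its key, then sort
    every partition by key. Spark performs the sort inside the
    shuffle machinery instead of as a separate pass."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return [sorted(p) for p in partitions]
```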

docs/sql-programming-guide.md

Lines changed: 41 additions & 7 deletions (several hunks below only remove trailing whitespace, so their - and + lines read identically)

@@ -146,7 +146,7 @@ describes the various methods for loading data into a SchemaRDD.

 Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first
 method uses reflection to infer the schema of an RDD that contains specific types of objects. This
-reflection based approach leads to more concise code and works well when you already know the schema
+reflection based approach leads to more concise code and works well when you already know the schema
 while writing your Spark application.

 The second method for creating SchemaRDDs is through a programmatic interface that allows you to

@@ -566,7 +566,7 @@ for teenName in teenNames.collect():

 ### Configuration

-Configuration of Parquet can be done using the `setConf` method on SQLContext or by running
+Configuration of Parquet can be done using the `setConf` method on SQLContext or by running
 `SET key=value` commands using SQL.

 <table class="table">

@@ -575,8 +575,8 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
   <td><code>spark.sql.parquet.binaryAsString</code></td>
   <td>false</td>
   <td>
-    Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do
-    not differentiate between binary data and strings when writing out the Parquet schema. This
+    Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do
+    not differentiate between binary data and strings when writing out the Parquet schema. This
     flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
   </td>
 </tr>

@@ -591,10 +591,20 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
   <td><code>spark.sql.parquet.compression.codec</code></td>
   <td>gzip</td>
   <td>
-    Sets the compression codec used when writing Parquet files. Acceptable values include:
+    Sets the compression codec used when writing Parquet files. Acceptable values include:
     uncompressed, snappy, gzip, lzo.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.parquet.filterPushdown</code></td>
+  <td>false</td>
+  <td>
+    Turn on Parquet filter pushdown optimization. This feature is turned off by default because of a known
+    bug in Parquet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
+    However, if your table doesn't contain any nullable string or binary columns, it's still safe to turn
+    this feature on.
+  </td>
+</tr>
 <tr>
   <td><code>spark.sql.hive.convertMetastoreParquet</code></td>
   <td>true</td>

@@ -945,7 +955,7 @@ options.

 ## Migration Guide for Shark Users

-### Scheduling
+### Scheduling
 To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
 users can set the `spark.sql.thriftserver.scheduler.pool` variable:

@@ -992,7 +1002,7 @@ Several caching related features are not supported yet:
 ## Compatibility with Apache Hive

 Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark
-SQL is based on Hive 0.12.0.
+SQL is based on Hive 0.12.0 and 0.13.1.

 #### Deploying in Existing Hive Warehouses

@@ -1031,6 +1041,7 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Sampling
 * Explain
 * Partitioned tables
+* View
 * All Hive DDL Functions, including:
   * `CREATE TABLE`
   * `CREATE TABLE AS SELECT`

@@ -1046,6 +1057,7 @@ Spark SQL supports the vast majority of Hive features, such as:
 * `STRING`
 * `BINARY`
 * `TIMESTAMP`
+* `DATE`
 * `ARRAY<>`
 * `MAP<>`
 * `STRUCT<>`

@@ -1146,6 +1158,7 @@ evaluated by the SQL execution engine. A full list of the functions supported c
 * Datetime type
   - `TimestampType`: Represents values comprising values of fields year, month, day,
     hour, minute, and second.
+  - `DateType`: Represents values comprising values of fields year, month, day.
 * Complex types
   - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
     elements with the type of `elementType`. `containsNull` is used to indicate if

@@ -1253,6 +1266,13 @@ import org.apache.spark.sql._
   TimestampType
   </td>
 </tr>
+<tr>
+  <td> <b>DateType</b> </td>
+  <td> java.sql.Date </td>
+  <td>
+  DateType
+  </td>
+</tr>
 <tr>
   <td> <b>ArrayType</b> </td>
   <td> scala.collection.Seq </td>

@@ -1379,6 +1399,13 @@ please use factory methods provided in
   DataType.TimestampType
   </td>
 </tr>
+<tr>
+  <td> <b>DateType</b> </td>
+  <td> java.sql.Date </td>
+  <td>
+  DataType.DateType
+  </td>
+</tr>
 <tr>
   <td> <b>ArrayType</b> </td>
   <td> java.util.List </td>

@@ -1526,6 +1553,13 @@ from pyspark.sql import *
   TimestampType()
   </td>
 </tr>
+<tr>
+  <td> <b>DateType</b> </td>
+  <td> datetime.date </td>
+  <td>
+  DateType()
+  </td>
+</tr>
 <tr>
   <td> <b>ArrayType</b> </td>
   <td> list, tuple, or array </td>
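The `spark.sql.parquet.filterPushdown` option documented above saves work by consulting per-row-group statistics before reading data. This is a toy Python model of the idea (illustrative names only, not the Parquet API): row groups whose max statistic rules out the predicate are skipped entirely.

```python
def scan_with_pushdown(row_groups, threshold):
    """Toy model of Parquet filter pushdown for the predicate
    `value >= threshold`: a row group whose max statistic falls
    below the threshold is skipped without being read."""
    matched, groups_read = [], 0
    for group in row_groups:
        if max(group) < threshold:   # stats prove no row can match
            continue
        groups_read += 1             # this group must actually be scanned
        matched.extend(v for v in group if v >= threshold)
    return matched, groups_read
```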

examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala

Lines changed: 4 additions & 3 deletions

@@ -29,9 +29,10 @@ object HiveFromSpark {
     val sc = new SparkContext(sparkConf)
     val path = s"${System.getenv("SPARK_HOME")}/examples/src/main/resources/kv1.txt"

-    // A local hive context creates an instance of the Hive Metastore in process, storing
-    // the warehouse data in the current directory. This location can be overridden by
-    // specifying a second parameter to the constructor.
+    // A hive context adds support for finding tables in the MetaStore and writing queries
+    // using HiveQL. Users who do not have an existing Hive deployment can still create a
+    // HiveContext. When not configured by the hive-site.xml, the context automatically
+    // creates metastore_db and warehouse in the current directory.
     val hiveContext = new HiveContext(sc)
     import hiveContext._

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SparkSQLParser.scala

Lines changed: 2 additions & 2 deletions

@@ -97,10 +97,10 @@ class SqlLexical(val keywords: Seq[String]) extends StdLexical {

   /** Generate all variations of upper and lower case of a given string */
   def allCaseVersions(s: String, prefix: String = ""): Stream[String] = {
-    if (s == "") {
+    if (s.isEmpty) {
       Stream(prefix)
     } else {
-      allCaseVersions(s.tail, prefix + s.head.toLower) ++
+      allCaseVersions(s.tail, prefix + s.head.toLower) #:::
        allCaseVersions(s.tail, prefix + s.head.toUpper)
     }
   }
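A rough Python analogue of `allCaseVersions` above, using a generator where the Scala code uses a `Stream` (`#:::` is `Stream`'s lazy append, so the variations are produced on demand rather than built eagerly):

```python
def all_case_versions(s, prefix=""):
    """Yield every upper/lower-case variation of s, lazily,
    mirroring the recursive structure of the Scala allCaseVersions:
    lower-case head first, then upper-case head, for each tail variant."""
    if not s:
        yield prefix
    else:
        yield from all_case_versions(s[1:], prefix + s[0].lower())
        yield from all_case_versions(s[1:], prefix + s[0].upper())
```

A keyword of length n yields 2^n variations, which is why producing them lazily matters for the lexer.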

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala

Lines changed: 5 additions & 9 deletions

@@ -277,7 +277,8 @@ class SqlParser extends AbstractSparkSQLParser {
     | SUM ~> "(" ~> DISTINCT ~> expression <~ ")" ^^ { case exp => SumDistinct(exp) }
     | COUNT ~ "(" ~> "*" <~ ")" ^^ { case _ => Count(Literal(1)) }
     | COUNT ~ "(" ~> expression <~ ")" ^^ { case exp => Count(exp) }
-    | COUNT ~> "(" ~> DISTINCT ~> expression <~ ")" ^^ { case exp => CountDistinct(exp :: Nil) }
+    | COUNT ~> "(" ~> DISTINCT ~> repsep(expression, ",") <~ ")" ^^
+      { case exps => CountDistinct(exps) }
     | APPROXIMATE ~ COUNT ~ "(" ~ DISTINCT ~> expression <~ ")" ^^
       { case exp => ApproxCountDistinct(exp) }
     | APPROXIMATE ~> "(" ~> floatLit ~ ")" ~ COUNT ~ "(" ~ DISTINCT ~ expression <~ ")" ^^

@@ -340,18 +341,13 @@ class SqlParser extends AbstractSparkSQLParser {
     | floatLit ^^ { f => Literal(f.toDouble) }
     )

-  private val longMax = BigDecimal(s"${Long.MaxValue}")
-  private val longMin = BigDecimal(s"${Long.MinValue}")
-  private val intMax = BigDecimal(s"${Int.MaxValue}")
-  private val intMin = BigDecimal(s"${Int.MinValue}")
-
   private def toNarrowestIntegerType(value: String) = {
     val bigIntValue = BigDecimal(value)

     bigIntValue match {
-      case v if v < longMin || v > longMax => v
-      case v if v < intMin || v > intMax => v.toLong
-      case v => v.toInt
+      case v if bigIntValue.isValidInt => v.toIntExact
+      case v if bigIntValue.isValidLong => v.toLongExact
+      case v => v
     }
   }
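The rewritten `toNarrowestIntegerType` leans on `BigDecimal`'s built-in `isValidInt`/`isValidLong` range checks instead of hand-rolled min/max constants. A rough Python sketch of the same selection logic (Python integers are unbounded, so a tag stands in for the Scala result type):

```python
def to_narrowest_integer_type(value):
    """Pick the narrowest type that can hold an integer literal:
    int if it fits in 32 bits, long if it fits in 64, otherwise
    keep it as an arbitrary-precision decimal."""
    n = int(value)
    if -2**31 <= n <= 2**31 - 1:
        return ("int", n)
    if -2**63 <= n <= 2**63 - 1:
        return ("long", n)
    return ("decimal", n)
```

Note the checks run narrowest-first, matching the reordered Scala cases (the old code tested widest-first).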

sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala

Lines changed: 4 additions & 0 deletions

@@ -225,6 +225,8 @@ class SchemaRDD(
    * {{{
    *   schemaRDD.limit(10)
    * }}}
+   *
+   * @group Query
    */
   def limit(limitNum: Int): SchemaRDD =
     new SchemaRDD(sqlContext, Limit(Literal(limitNum), logicalPlan))

@@ -355,6 +357,8 @@ class SchemaRDD(
    * Return the number of elements in the RDD. Unlike the base RDD implementation of count, this
    * implementation leverages the query optimizer to compute the count on the SchemaRDD, which
    * supports features such as filter pushdown.
+   *
+   * @group Query
    */
   @Experimental
   override def count(): Long = aggregate(Count(Literal(1))).collect().head.getLong(0)

sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala

Lines changed: 3 additions & 3 deletions

@@ -39,8 +39,8 @@ import scala.collection.JavaConversions._

 /**
  * Allows creation of parquet based tables using the syntax
- * `CREATE TABLE ... USING org.apache.spark.sql.parquet`. Currently the only option required
- * is `path`, which should be the location of a collection of, optionally partitioned,
+ * `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.parquet`. Currently the only option
+ * required is `path`, which should be the location of a collection of, optionally partitioned,
  * parquet files.
  */
 class DefaultSource extends RelationProvider {

@@ -49,7 +49,7 @@ class DefaultSource extends RelationProvider {
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
     val path =
-      parameters.getOrElse("path", sys.error("'path' must be specifed for parquet tables."))
+      parameters.getOrElse("path", sys.error("'path' must be specified for parquet tables."))

     ParquetRelation2(path)(sqlContext)
   }

sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala

Lines changed: 1 addition & 1 deletion

@@ -67,7 +67,7 @@ private[sql] class DDLParser extends StandardTokenParsers with PackratParsers wi
   protected lazy val ddl: Parser[LogicalPlan] = createTable

   /**
-   * CREATE FOREIGN TEMPORARY TABLE avroTable
+   * CREATE TEMPORARY TABLE avroTable
    * USING org.apache.spark.sql.avro
    * OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
    */

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

Lines changed: 7 additions & 0 deletions

@@ -992,4 +992,11 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
       "nulldata2 on nulldata1.value <=> nulldata2.value"),
       (1 to 2).map(i => Seq(i)))
   }
+
+  test("Multi-column COUNT(DISTINCT ...)") {
+    val data = TestData(1, "val_1") :: TestData(2, "val_2") :: Nil
+    val rdd = sparkContext.parallelize((0 to 1).map(i => data(i)))
+    rdd.registerTempTable("distinctData")
+    checkAnswer(sql("SELECT COUNT(DISTINCT key,value) FROM distinctData"), 2)
+  }
 }
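The test above exercises the parser change that lets `COUNT(DISTINCT ...)` take several columns. Its semantics, counting distinct column-value tuples while skipping rows with a NULL in any counted column (SQL's usual NULL handling), can be sketched in plain Python:

```python
def count_distinct(rows, cols):
    """Count distinct combinations of the given columns over a list
    of dict rows; rows with None in any counted column are skipped,
    following SQL's NULL handling for COUNT(DISTINCT ...)."""
    seen = set()
    for row in rows:
        values = tuple(row[c] for c in cols)
        if any(v is None for v in values):
            continue
        seen.add(values)
    return len(seen)
```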

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala

Lines changed: 1 addition & 1 deletion

@@ -379,7 +379,7 @@ private[hive] object HiveQl {
   protected def nameExpressions(exprs: Seq[Expression]): Seq[NamedExpression] = {
     exprs.zipWithIndex.map {
       case (ne: NamedExpression, _) => ne
-      case (e, i) => Alias(e, s"c_$i")()
+      case (e, i) => Alias(e, s"_c$i")()
     }
   }
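The rename from `c_$i` to `_c$i` aligns the generated aliases with Hive's default names for unaliased output columns (`_c0`, `_c1`, ...). A hypothetical Python sketch of the same rule, modeling each expression as an (optional_name, expression) pair:

```python
def name_expressions(exprs):
    """Assign Hive-style default column names (_c0, _c1, ...) to
    expressions that carry no name, keeping existing names as-is.
    exprs is a list of (optional_name, expression) pairs; only the
    resulting column names are returned."""
    return [name if name is not None else f"_c{i}"
            for i, (name, expr) in enumerate(exprs)]
```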
