docs/sql-programming-guide.md

The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) in Hive 0.12. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.12. In order to use Hive, you must first run `sbt/sbt -Phive-thriftserver assembly/assembly` (or use `-Phive-thriftserver` for Maven).

To start the JDBC server, run the following in the Spark directory:
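
A minimal sketch of what that looks like, including a beeline connection test (the script path, host, and port here are assumptions, not taken from this section):

```
$ ./sbin/start-thriftserver.sh
$ ./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
```
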
You may also use the beeline script that comes with Hive.

#### Reducer number

In Shark, the default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark SQL deprecates this property in favor of a new property, `spark.sql.shuffle.partitions`, whose default value is 200. Users may customize this property via `SET`:

```
SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
```

You may also put this property in `hive-site.xml` to override the default value.

For now, the `mapred.reduce.tasks` property is still recognized, and is converted to `spark.sql.shuffle.partitions` automatically.
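
Since the conversion is automatic, the following two statements should behave identically (a sketch based only on the statement above):

```
-- Deprecated Shark/Hive property; still recognized and converted automatically.
SET mapred.reduce.tasks=10;
-- The Spark SQL property it is converted to.
SET spark.sql.shuffle.partitions=10;
```
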
#### Caching

The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to let users control table caching explicitly:

```
CACHE TABLE logs_last_month;
UNCACHE TABLE logs_last_month;
```

**NOTE:** `CACHE TABLE tbl` is lazy: it only marks table `tbl` as "needs to be cached if necessary", but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be cached, you may simply count the table immediately after executing `CACHE TABLE`:

```
CACHE TABLE logs_last_month;
SELECT COUNT(1) FROM logs_last_month;
```

#### Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.

**Major Hive Features**

* Tables with buckets: a bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet (see the sketch below).
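
For concreteness, this is the kind of bucketed-table DDL that is not yet supported (the table and column names are hypothetical):

```
-- Hive DDL declaring hash buckets within each partition;
-- Spark SQL does not support buckets yet.
CREATE TABLE page_views (user_id INT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```
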
**Esoteric Hive Features**

* Tables with partitions using different input formats: In Spark SQL, all table partitions need to have the same input format.
* Non-equi outer join: For the uncommon use case of using outer joins with non-equi join conditions (e.g. the condition `key < 10`), Spark SQL will output wrong results for the `NULL` tuple (see the sketch after this list).
* `UNIONTYPE`
* Unique join
* Single query multi insert
* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment.
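
A sketch of the non-equi outer join case flagged above (the table `src` and column `key` are hypothetical):

```
-- Outer join with a non-equi condition; at this version Spark SQL
-- may output wrong results for the NULL tuple of such a query.
SELECT a.key, b.key
FROM src a LEFT OUTER JOIN src b ON a.key < 10;
```
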
**Hive Input/Output Formats**
**Hive Optimizations**

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are not necessary due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

* Block level bitmap indexes and virtual columns (used to build indexes)
* Automatically convert a join to map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using `SET spark.sql.shuffle.partitions=[num_tasks];`. We are going to add auto-setting of parallelism in the next release.
* Meta-data only query: For queries that can be answered by using only metadata, Spark SQL still launches tasks to compute the result (see the sketch after this list).
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
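
As an illustration of a metadata-only query (the partitioned table `logs` with partition column `dt` is hypothetical), Hive can answer the following from the metastore alone, while Spark SQL still launches tasks to scan the data:

```
-- dt is a partition column, so its distinct values are recorded in the
-- metastore; Spark SQL nevertheless runs tasks to compute this result.
SELECT DISTINCT dt FROM logs;
```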