
Commit cfcf461 (1 parent: 061880f)

Updated documents and build scripts for the newly added hive-thriftserver profile

4 files changed: +46 −22 lines

dev/create-release/create-release.sh

Lines changed: 5 additions & 5 deletions

@@ -53,15 +53,15 @@ if [[ ! "$@" =~ --package-only ]]; then
     -Dusername=$GIT_USERNAME -Dpassword=$GIT_PASSWORD \
     -Dmaven.javadoc.skip=true \
     -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 \
-    -Pyarn -Phive -Phadoop-2.2 -Pspark-ganglia-lgpl\
+    -Pyarn -Phive -Phive-thriftserver -Phadoop-2.2 -Pspark-ganglia-lgpl\
     -Dtag=$GIT_TAG -DautoVersionSubmodules=true \
     --batch-mode release:prepare
 
   mvn -DskipTests \
     -Darguments="-DskipTests=true -Dmaven.javadoc.skip=true -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -Dgpg.passphrase=${GPG_PASSPHRASE}" \
     -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 \
     -Dmaven.javadoc.skip=true \
-    -Pyarn -Phive -Phadoop-2.2 -Pspark-ganglia-lgpl\
+    -Pyarn -Phive -Phive-thriftserver -Phadoop-2.2 -Pspark-ganglia-lgpl\
     release:perform
 
   cd ..
@@ -111,10 +111,10 @@ make_binary_release() {
   spark-$RELEASE_VERSION-bin-$NAME.tgz.sha
 }
 
-make_binary_release "hadoop1" "-Phive -Dhadoop.version=1.0.4"
-make_binary_release "cdh4" "-Phive -Dhadoop.version=2.0.0-mr1-cdh4.2.0"
+make_binary_release "hadoop1" "-Phive -Phive-thriftserver -Dhadoop.version=1.0.4"
+make_binary_release "cdh4" "-Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.2.0"
 make_binary_release "hadoop2" \
-  "-Phive -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Pyarn.version=2.2.0"
+  "-Phive -Phive-thriftserver -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Pyarn.version=2.2.0"
 
 # Copy data
 echo "Copying release tarballs"

dev/run-tests

Lines changed: 1 addition & 1 deletion

@@ -65,7 +65,7 @@ echo "========================================================================="
 # (either resolution or compilation) prompts the user for input either q, r,
 # etc to quit or retry. This echo is there to make it not block.
 if [ -n "$_RUN_SQL_TESTS" ]; then
-  echo -e "q\n" | SBT_MAVEN_PROFILES="$SBT_MAVEN_PROFILES -Phive" sbt/sbt clean package \
+  echo -e "q\n" | SBT_MAVEN_PROFILES="$SBT_MAVEN_PROFILES -Phive -Phive-thriftserver" sbt/sbt clean package \
     assembly/assembly test | grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"
 else
   echo -e "q\n" | sbt/sbt clean package assembly/assembly test | \

dev/scalastyle

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@
 # limitations under the License.
 #
 
-echo -e "q\n" | sbt/sbt -Phive scalastyle > scalastyle.txt
+echo -e "q\n" | sbt/sbt -Phive -Phive-thriftserver scalastyle > scalastyle.txt
 # Check style with YARN alpha built too
 echo -e "q\n" | sbt/sbt -Pyarn -Phadoop-0.23 -Dhadoop.version=0.23.9 yarn-alpha/scalastyle \
   >> scalastyle.txt

docs/sql-programming-guide.md

Lines changed: 39 additions & 15 deletions

@@ -578,7 +578,9 @@ evaluated by the SQL execution engine. A full list of the functions supported c
 
 The Thrift JDBC server implemented here corresponds to the [`HiveServer2`]
 (https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) in Hive 0.12. You can test
-the JDBC server with the beeline script comes with either Spark or Hive 0.12.
+the JDBC server with the beeline script that comes with either Spark or Hive 0.12. To use it you
+must first run `sbt/sbt -Phive-thriftserver assembly/assembly` (or enable the `-Phive-thriftserver`
+profile when building with Maven).
 
 To start the JDBC server, run the following in the Spark directory:
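For illustration (not part of this diff), connecting to the started server with Spark's bundled beeline might look like the following, assuming HiveServer2's default port 10000 on localhost:

    ./bin/beeline
    beeline> !connect jdbc:hive2://localhost:10000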

@@ -605,7 +607,9 @@ You may also use the beeline script comes with Hive.
 
 #### Reducer number
 
-In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark SQL deprecates this property by a new property `spark.sql.shuffle.partitions`, whose default value is 200. Users may customize this property via `SET`:
+In Shark, the default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`.
+Spark SQL deprecates this property in favor of a new property, `spark.sql.shuffle.partitions`, whose
+default value is 200. Users may customize this property via `SET`:
 
 ```
 SET spark.sql.shuffle.partitions=10;
@@ -615,18 +619,23 @@ GROUP BY page ORDER BY c DESC LIMIT 10;
 
 You may also put this property in `hive-site.xml` to override the default value.
 
-For now, the `mapred.reduce.tasks` property is still recognized, and is converted to `spark.sql.shuffle.partitions` automatically.
+For now, the `mapred.reduce.tasks` property is still recognized, and is converted to
+`spark.sql.shuffle.partitions` automatically.
 
 #### Caching
 
-The `shark.cache` table property no longer exists, and tables whose name end with `_cached` are no longer automcatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to let user control table caching explicitly:
+The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no
+longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
+let the user control table caching explicitly:
 
 ```
 CACHE TABLE logs_last_month;
 UNCACHE TABLE logs_last_month;
 ```
 
-**NOTE** `CACHE TABLE tbl` is lazy, it only marks table `tbl` as "need to by cached if necessary", but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be cached, you may simply count the table immediately after executing `CACHE TABLE`:
+**NOTE** `CACHE TABLE tbl` is lazy: it only marks table `tbl` as "to be cached if necessary", but
+doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be
+cached, you may simply count the table immediately after executing `CACHE TABLE`:
 
 ```
 CACHE TABLE logs_last_month;
@@ -699,20 +708,25 @@ Spark SQL supports the vast majority of Hive features, such as:
 
 #### Unsupported Hive Functionality
 
-Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.
+Below is a list of Hive features that we don't support yet. Most of these features are rarely used
+in Hive deployments.
 
 **Major Hive Features**
 
-* Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet.
+* Tables with buckets: a bucket is the hash partitioning within a Hive table partition. Spark SQL
+  doesn't support buckets yet.
 
 **Esoteric Hive Features**
 
-* Tables with partitions using different input formats: In Spark SQL, all table partitions need to have the same input format.
-* Non-equi outer join: For the uncommon use case of using outer joins with non-equi join conditions (e.g. condition "`key < 10`"), Spark SQL will output wrong result for the `NULL` tuple.
+* Tables with partitions using different input formats: In Spark SQL, all table partitions need to
+  have the same input format.
+* Non-equi outer join: For the uncommon use case of using outer joins with non-equi join conditions
+  (e.g. condition "`key < 10`"), Spark SQL will output the wrong result for the `NULL` tuple.
 * `UNIONTYPE`
 * Unique join
 * Single query multi insert
-* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment.
+* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at
+  the moment.
 
 **Hive Input/Output Formats**
 
@@ -721,15 +735,25 @@ Below is a list of Hive features that we don't support yet. Most of these featur
 
 **Hive Optimizations**
 
-A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are not necessary due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.
+A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are
+not necessary due to Spark SQL's in-memory computational model. Others are slotted for future
+releases of Spark SQL.
 
 * Block level bitmap indexes and virtual columns (used to build indexes)
-* Automatically convert a join to map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
-* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using "SET spark.sql.shuffle.partitions=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
-* Meta-data only query: For queries that can be answered by using only meta data, Spark SQL still launches tasks to compute the result.
+* Automatically convert a join to map join: For joining a large table with multiple small tables,
+  Hive automatically converts the join into a map join. We are adding this auto conversion in the
+  next release.
+* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you
+  need to control the degree of parallelism post-shuffle using "SET
+  spark.sql.shuffle.partitions=[num_tasks];". We are going to add auto-setting of parallelism in the
+  next release.
+* Meta-data only query: For queries that can be answered by using only metadata, Spark SQL still
+  launches tasks to compute the result.
 * Skew data flag: Spark SQL does not follow the skew data flags in Hive.
 * `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
-* Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
+* Merge multiple small files for query results: if the result output contains multiple small files,
+  Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS
+  metadata. Spark SQL does not support that.
 
 ## Running the Spark SQL CLI
 
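Spelled out, the forced-caching pattern from the guide's NOTE above might look like this (a sketch reusing the guide's example table; any query that scans the whole table would do):

    CACHE TABLE logs_last_month;
    -- Touch every row so the table is actually materialized in the cache:
    SELECT COUNT(*) FROM logs_last_month;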