@@ -51,9 +51,11 @@ Perform these steps from Unix shell on an HBase edge node.
     ```
 
 1. Export the snapshot
-    ```
+1. Install [hadoop connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md)
+1. Copy to a GCS bucket
+    ```
     hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers NUM_MAPPERS
+    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
     ```
 1. Create hashes for the table to be used during the data validation step.
 [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
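
The `ExportSnapshot` invocation above relies on several environment variables being set beforehand. A minimal sketch with hypothetical values (bucket and path names are placeholders, not from this guide); `BUCKET_NAME` carries the `gs://` scheme so the copy lands in GCS rather than HDFS:

```shell
# Hypothetical values; substitute your own.
SNAPSHOT_NAME=my-table-snapshot
BUCKET_NAME=gs://my-migration-bucket      # gs:// scheme targets GCS
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
NUM_MAPPERS=16
# The copy destination the command above resolves to:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data"
```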
@@ -101,14 +103,15 @@ Exporting HBase snapshots from Bigtable is not supported.
     ```
 1. Run the export.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar export \
+    java -jar bigtable-beam-import-2.0.0.jar export \
     --runner=dataflow \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --destinationPath=$BUCKET_NAME/hbase_export/ \
     --tempLocation=$BUCKET_NAME/hbase_temp/ \
-    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
+    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
+    --region=$REGION
     ```
 
 
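The export and import commands all size the Dataflow job with the same expression, capping workers at three per cluster node (the guide's rule of thumb, not a Dataflow limit). A runnable sketch with a hypothetical cluster size:

```shell
CLUSTER_NUM_NODES=4                               # hypothetical cluster size
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)   # three workers per node
echo "$MAX_NUM_WORKERS"                           # → 12
```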
@@ -140,7 +143,7 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
 
 1. Run the import.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar importsnapshot \
+    java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
     --runner=DataflowRunner \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
@@ -153,6 +156,35 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
     --region=$REGION
     ```
 
+### Snappy compressed Snapshots
+
+1. Set the environment variables.
+    ```
+    PROJECT_ID=your-project-id
+    INSTANCE_ID=your-instance-id
+    TABLE_NAME=your-table-name
+    REGION=us-central1
+
+    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
+    SNAPSHOT_NAME=your-snapshot-name
+    ```
+
+1. Run the import.
+    ```
+    java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
+    --runner=DataflowRunner \
+    --project=$PROJECT_ID \
+    --bigtableInstanceId=$INSTANCE_ID \
+    --bigtableTableId=$TABLE_NAME \
+    --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
+    --snapshotName=$SNAPSHOT_NAME \
+    --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
+    --tempLocation=$SNAPSHOT_GCS_PATH/temp \
+    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
+    --region=$REGION \
+    --experiments=use_runner_v2 \
+    --sdkContainerImage=gcr.io/cloud-bigtable-ecosystem/unified-harness:latest
+    ```
 
 ### Sequence Files
 
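The snappy import added above keeps the exported data and the Dataflow staging and temp directories under a single GCS prefix. A sketch of that layout, with a hypothetical bucket name:

```shell
BUCKET_NAME=gs://my-migration-bucket              # hypothetical bucket
SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
# Snapshot data plus the pipeline's staging and temp dirs:
for sub in data staging temp; do
  echo "$SNAPSHOT_GCS_PATH/$sub"
done
```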
@@ -168,25 +200,18 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
     ```
 1. Run the import.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar import \
+    java -jar bigtable-beam-import-2.0.0.jar import \
     --runner=dataflow \
     --project=$PROJECT_ID \
-    --bigtableInstanceId=$INSTANCE_D \
+    --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
-    --sourcePattern='$BUCKET_NAME/hbase-export/part-*' \
+    --sourcePattern=$BUCKET_NAME/hbase-export/part-* \
     --tempLocation=$BUCKET_NAME/hbase_temp \
     --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
-    --zone=$CLUSTER_ZONE
+    --zone=$CLUSTER_ZONE \
+    --region=$REGION
     ```
 
----
-**NOTE**
-
-Snappy compressed files are not supported with Import pipelines, we are working to add support for Snappy compressed files.
-
----
-
-
 ## Validating data
 
 Once your snapshot or sequence file is imported, you should run the validator to
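The quoting change to `--sourcePattern` matters: inside single quotes, `$BUCKET_NAME` reached the pipeline unexpanded. A sketch of the difference, assuming bash and a hypothetical bucket (bash passes an unmatched `part-*` glob through literally, leaving it for the pipeline to resolve against GCS):

```shell
BUCKET_NAME=gs://my-migration-bucket   # hypothetical
# Single-quoted: the variable is NOT expanded (broken pattern).
echo '$BUCKET_NAME/hbase-export/part-*'
# Unquoted: the shell substitutes the variable; with no local match,
# bash leaves the part-* glob intact for the pipeline to expand.
echo $BUCKET_NAME/hbase-export/part-*
```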
@@ -203,10 +228,10 @@ check if there are any rows with mismatched data.
     ```
 1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP`.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar sync-table \
+    java -jar bigtable-beam-import-2.0.0.jar sync-table \
     --runner=dataflow \
     --project=$PROJECT_ID \
-    --bigtableInstanceId=$INSTANCE_D \
+    --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
     --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
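
The `--outputPrefix` keeps each run's results distinct by embedding a Unix timestamp through command substitution. A sketch with a hypothetical prefix, showing the `$( )` form (note: `${date +"%s"}` would be an invalid parameter expansion, not a command substitution):

```shell
SNAPSHOT_GCS_PATH=gs://my-migration-bucket/hbase-migration-snap  # hypothetical
TS=$(date +"%s")   # seconds since the epoch
echo "$SNAPSHOT_GCS_PATH/sync-table/output-$TS"
```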