@@ -51,9 +51,11 @@ Perform these steps from Unix shell on an HBase edge node.
     ```
 
 1. Export the snapshot
-    ```
+1. Install [hadoop connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md)
+1. Copy to a GCS bucket
+    ```
     hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers NUM_MAPPERS
+    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
     ```
 1. Create hashes for the table to be used during the data validation step.
 [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
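
The `ExportSnapshot` invocation above relies on several environment variables being set beforehand. A minimal sketch with hypothetical values (bucket and path names are placeholders, not from this guide); `BUCKET_NAME` carries the `gs://` scheme so the copy lands in GCS rather than HDFS:

```shell
# Hypothetical values; substitute your own.
SNAPSHOT_NAME=my-table-snapshot
BUCKET_NAME=gs://my-migration-bucket      # gs:// scheme targets GCS
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
NUM_MAPPERS=16
# The copy destination the command above resolves to:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data"
```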
@@ -101,14 +103,15 @@ Exporting HBase snapshots from Bigtable is not supported.
     ```
 1. Run the export.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar export \
+    java -jar bigtable-beam-import-2.0.0.jar export \
     --runner=dataflow \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --destinationPath=$BUCKET_NAME/hbase_export/ \
     --tempLocation=$BUCKET_NAME/hbase_temp/ \
-    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
+    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
+    --region=$REGION
     ```
 
 
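The export and import commands all size the Dataflow job with the same expression, capping workers at three per cluster node (the guide's rule of thumb, not a Dataflow limit). A runnable sketch with a hypothetical cluster size:

```shell
CLUSTER_NUM_NODES=4                               # hypothetical cluster size
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)   # three workers per node
echo "$MAX_NUM_WORKERS"                           # → 12
```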
@@ -140,7 +143,7 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
 
 1. Run the import.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar importsnapshot \
+    java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
     --runner=DataflowRunner \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
@@ -153,6 +156,35 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
     --region=$REGION
     ```
 
+### Snappy compressed Snapshots
+
+1. Set the environment variables.
+    ```
+    PROJECT_ID=your-project-id
+    INSTANCE_ID=your-instance-id
+    TABLE_NAME=your-table-name
+    REGION=us-central1
+
+    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
+    SNAPSHOT_NAME=your-snapshot-name
+    ```
+
+1. Run the import.
+    ```
+    java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
+    --runner=DataflowRunner \
+    --project=$PROJECT_ID \
+    --bigtableInstanceId=$INSTANCE_ID \
+    --bigtableTableId=$TABLE_NAME \
+    --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
+    --snapshotName=$SNAPSHOT_NAME \
+    --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
+    --tempLocation=$SNAPSHOT_GCS_PATH/temp \
+    --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
+    --region=$REGION \
+    --experiments=use_runner_v2 \
+    --sdkContainerImage=gcr.io/cloud-bigtable-ecosystem/unified-harness:latest
+    ```
 
 ### Sequence Files
 
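The snappy import added above keeps the exported data and the Dataflow staging and temp directories under a single GCS prefix. A sketch of that layout, with a hypothetical bucket name:

```shell
BUCKET_NAME=gs://my-migration-bucket              # hypothetical bucket
SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
# Snapshot data plus the pipeline's staging and temp dirs:
for sub in data staging temp; do
  echo "$SNAPSHOT_GCS_PATH/$sub"
done
```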
@@ -168,25 +200,18 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
     ```
 1. Run the import.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar import \
+    java -jar bigtable-beam-import-2.0.0.jar import \
     --runner=dataflow \
     --project=$PROJECT_ID \
-    --bigtableInstanceId=$INSTANCE_D \
+    --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
-    --sourcePattern='$BUCKET_NAME/hbase-export/part-*' \
+    --sourcePattern=$BUCKET_NAME/hbase-export/part-* \
     --tempLocation=$BUCKET_NAME/hbase_temp \
     --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
-    --zone=$CLUSTER_ZONE
+    --zone=$CLUSTER_ZONE \
+    --region=$REGION
     ```
 
----
-**NOTE**
-
-Snappy compressed files are not supported with Import pipelines, we are working to add support for Snappy compressed files.
-
----
-
-
 ## Validating data
 
 Once your snapshot or sequence file is imported, you should run the validator to
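The quoting change to `--sourcePattern` matters: inside single quotes, `$BUCKET_NAME` reached the pipeline unexpanded. A sketch of the difference, assuming bash and a hypothetical bucket (bash passes an unmatched `part-*` glob through literally, leaving it for the pipeline to resolve against GCS):

```shell
BUCKET_NAME=gs://my-migration-bucket   # hypothetical
# Single-quoted: the variable is NOT expanded (broken pattern).
echo '$BUCKET_NAME/hbase-export/part-*'
# Unquoted: the shell substitutes the variable; with no local match,
# bash leaves the part-* glob intact for the pipeline to expand.
echo $BUCKET_NAME/hbase-export/part-*
```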
@@ -203,10 +228,10 @@ check if there are any rows with mismatched data.
     ```
 1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP`.
     ```
-    java -jar bigtable-beam-import-2.0.0-alpha1.jar sync-table \
+    java -jar bigtable-beam-import-2.0.0.jar sync-table \
     --runner=dataflow \
     --project=$PROJECT_ID \
-    --bigtableInstanceId=$INSTANCE_D \
+    --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
     --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
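
The `--outputPrefix` keeps each run's results distinct by embedding a Unix timestamp through command substitution. A sketch with a hypothetical prefix, showing the `$( )` form (note: `${date +"%s"}` would be an invalid parameter expansion, not a command substitution):

```shell
SNAPSHOT_GCS_PATH=gs://my-migration-bucket/hbase-migration-snap  # hypothetical
TS=$(date +"%s")   # seconds since the epoch
echo "$SNAPSHOT_GCS_PATH/sync-table/output-$TS"
```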