Commit c189d46

docs: fix typos in README and add --region option (#3520)

* fix: fix typos in README and add --region option
* fix
* add instructions on importing from snappy compressed snapshots

1 parent 53481bd commit c189d46

File tree

1 file changed: +44 −19 lines

  • bigtable-dataflow-parent/bigtable-beam-import

bigtable-dataflow-parent/bigtable-beam-import/README.md

Lines changed: 44 additions & 19 deletions
@@ -51,9 +51,11 @@ Perform these steps from Unix shell on an HBase edge node.
 ```
 
 1. Export the snapshot
-```
+1. Install [hadoop connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md)
+1. Copy to a GCS bucket
+```
 hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
-    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers NUM_MAPPERS
+    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
 ```
 1. Create hashes for the table to be used during the data validation step.
    [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
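The corrected `-copy-to` argument in the hunk above is a direct concatenation of `$BUCKET_NAME` and `$SNAPSHOT_EXPORT_PATH`. A minimal sketch of how that destination URI resolves, using hypothetical bucket and path values (not from the commit):

```shell
# Hypothetical example values -- substitute your own bucket and export path.
BUCKET_NAME=gs://my-migration-bucket
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
NUM_MAPPERS=16

# The -copy-to destination joins the two variables with no separator,
# so SNAPSHOT_EXPORT_PATH must carry its own leading slash:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data"
# prints gs://my-migration-bucket/hbase-migration-snap/data
```

The hunk's other fix, `NUM_MAPPERS` to `$NUM_MAPPERS`, matters for the same reason: without the `$`, the literal string rather than the variable's value is passed to ExportSnapshot.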
@@ -101,14 +103,15 @@ Exporting HBase snapshots from Bigtable is not supported.
 ```
 1. Run the export.
 ```
-java -jar bigtable-beam-import-2.0.0-alpha1.jar export \
+java -jar bigtable-beam-import-2.0.0.jar export \
 --runner=dataflow \
 --project=$PROJECT_ID \
 --bigtableInstanceId=$INSTANCE_ID \
 --bigtableTableId=$TABLE_NAME \
 --destinationPath=$BUCKET_NAME/hbase_export/ \
 --tempLocation=$BUCKET_NAME/hbase_temp/ \
---maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
+--maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
+--region=$REGION
 ```
 
 
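The `--maxNumWorkers` value in the export command is computed with POSIX `expr`, sizing the Dataflow worker pool at three times the Bigtable cluster's node count. A small sketch of that arithmetic with a made-up node count (the 3x multiplier is the README's heuristic; the node count here is only an example):

```shell
# Hypothetical node count for the Bigtable cluster.
CLUSTER_NUM_NODES=4

# POSIX expr form, as used in the command ('*' must be escaped from the shell):
expr 3 \* $CLUSTER_NUM_NODES    # prints 12

# Equivalent built-in shell arithmetic:
echo $((3 * CLUSTER_NUM_NODES))    # prints 12
```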
@@ -140,7 +143,7 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
 
 1. Run the import.
 ```
-java -jar bigtable-beam-import-2.0.0-alpha1.jar importsnapshot \
+java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
 --runner=DataflowRunner \
 --project=$PROJECT_ID \
 --bigtableInstanceId=$INSTANCE_ID \
@@ -153,6 +156,35 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
 --region=$REGION
 ```
 
+### Snappy compressed Snapshots
+
+1. Set the environment variables.
+```
+PROJECT_ID=your-project-id
+INSTANCE_ID=your-instance-id
+TABLE_NAME=your-table-name
+REGION=us-central1
+
+SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
+SNAPSHOT_NAME=your-snapshot-name
+```
+
+1. Run the import.
+```
+java -jar bigtable-beam-import-2.0.0.jar importsnapshot \
+--runner=DataflowRunner \
+--project=$PROJECT_ID \
+--bigtableInstanceId=$INSTANCE_ID \
+--bigtableTableId=$TABLE_NAME \
+--hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
+--snapshotName=$SNAPSHOT_NAME \
+--stagingLocation=$SNAPSHOT_GCS_PATH/staging \
+--tempLocation=$SNAPSHOT_GCS_PATH/temp \
+--maxWorkerNodes=$(expr 3 \* $CLUSTER_NUM_NODES) \
+--region=$REGION \
+--experiments=use_runner_v2 \
+--sdkContainerImage=gcr.io/cloud-bigtable-ecosystem/unified-harness:latest
+```
 
 ### Sequence Files
 
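In the new Snappy section, every GCS flag of the `importsnapshot` command hangs off the single `SNAPSHOT_GCS_PATH` prefix, which itself depends on `BUCKET_NAME` being set first. A minimal sketch of how those paths derive, with a hypothetical bucket name:

```shell
# Hypothetical bucket; BUCKET_NAME must be set before SNAPSHOT_GCS_PATH,
# since the latter interpolates it.
BUCKET_NAME=gs://my-migration-bucket
SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"

# The importsnapshot flags all resolve under this one prefix:
echo "$SNAPSHOT_GCS_PATH/data"      # --hbaseSnapshotSourceDir
echo "$SNAPSHOT_GCS_PATH/staging"   # --stagingLocation
echo "$SNAPSHOT_GCS_PATH/temp"      # --tempLocation
```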
@@ -168,25 +200,18 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
 ```
 1. Run the import.
 ```
-java -jar bigtable-beam-import-2.0.0-alpha1.jar import \
+java -jar bigtable-beam-import-2.0.0.jar import \
 --runner=dataflow \
 --project=$PROJECT_ID \
---bigtableInstanceId=$INSTANCE_D \
+--bigtableInstanceId=$INSTANCE_ID \
 --bigtableTableId=$TABLE_NAME \
---sourcePattern='$BUCKET_NAME/hbase-export/part-*' \
+--sourcePattern=$BUCKET_NAME/hbase-export/part-* \
 --tempLocation=$BUCKET_NAME/hbase_temp \
 --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
---zone=$CLUSTER_ZONE
+--zone=$CLUSTER_ZONE \
+--region=$REGION
 ```
 
----
-**NOTE**
-
-Snappy compressed files are not supported with Import pipelines, we are working to add support for Snappy compressed files.
-
----
-
 
 ## Validating data
 
 Once your snapshot or sequence file is imported, you should run the validator to
@@ -203,10 +228,10 @@ check if there are any rows with mismatched data.
 ```
 1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/data-verification/output-TIMESTAMP`.
 ```
-java -jar bigtable-beam-import-2.0.0-alpha1.jar sync-table \
+java -jar bigtable-beam-import-2.0.0.jar sync-table \
 --runner=dataflow \
 --project=$PROJECT_ID \
---bigtableInstanceId=$INSTANCE_D \
+--bigtableInstanceId=$INSTANCE_ID \
 --bigtableTableId=$TABLE_NAME \
 --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-${date +"%s"} \
 --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
