@@ -45,10 +45,12 @@ Note, there are different versions of the Delta Lake docker
| ----------------- | -------- | ------ | ------ | ----------- | ----- | ---------- | ------ | ----- |
| 0.8.1_2.3.0 | amd64 | 0.8.1 | latest | 2.3.0 | 3.3.2 | 3.6.3 | 1.5.3 | 0.9.0 |
| 0.8.1_2.3.0_arm64 | arm64 | 0.8.1 | latest | 2.3.0 | 3.3.2 | 3.6.3 | 1.5.3 | 0.9.0 |
-| latest | amd64 | 0.9.0 | latest | 2.3.0 | 3.3.2 | 3.6.3 | 1.5.3 | 0.9.0 |
-| latest | arm64 | 0.9.0 | latest | 2.3.0 | 3.3.2 | 3.6.3 | 1.5.3 | 0.9.0 |
+| 1.0.0_3.0.0 | amd64 | 0.12.0 | latest | 3.0.0 | 3.5.0 | 3.6.3 | 1.5.3 | 0.9.0 |
+| 1.0.0_3.0.0_arm64 | arm64 | 0.12.0 | latest | 3.0.0 | 3.5.0 | 3.6.3 | 1.5.3 | 0.9.0 |
+| latest | amd64 | 0.12.0 | latest | 3.0.0 | 3.5.0 | 3.6.3 | 1.5.3 | 0.9.0 |
+| latest | arm64 | 0.12.0 | latest | 3.0.0 | 3.5.0 | 3.6.3 | 1.5.3 | 0.9.0 |

-\*\* Note, the arm64 version is built for ARM64 platforms like Mac M1
+> Note, the arm64 version is built for ARM64 platforms like Mac M1

Download the appropriate tag, e.g.:
@@ -75,7 +77,7 @@ Once the image has been built or you have downloaded the correct image, you can

In the following instructions, the variable `${DELTA_PACKAGE_VERSION}` refers to the Delta Lake package version.

-The current version is `delta-core_2.12:2.3.0`, which corresponds to the Apache Spark 3.3.x release line.
+The current version is `delta-spark_2.12:3.0.0`, which corresponds to the Apache Spark 3.5.x release line.

## Choose an Interface
@@ -98,7 +100,7 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
   python3
   ```

-   > Note: The Delta Rust Python bindings are already installed in this docker. To do this manually in your own environment, run the command: `pip3 install deltalake==0.9.0`
+   > Note: The Delta Rust Python bindings are already installed in this docker. To do this manually in your own environment, run the command: `pip3 install deltalake==0.12.0`

1. Run some basic commands in the shell to write to and read from Delta Lake with Pandas
@@ -126,13 +128,13 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```python
   ## Output
-       0
-   0   0
-   1   1
-   2   2
-   ... ...
-   8   9
-   9  10
+      data
+   0     0
+   1     1
+   2     2
+   ...
+   8     9
+   9    10

   ```

1. Review the files
@@ -144,7 +146,7 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```python
   ## Output
-   ['0-d4920663-30e9-4a1a-afde-59bc4ebd24b5-0.parquet', '1-f27a5ea6-a15f-4ca1-91b3-72bcf64fbc09-0.parquet']
+   ['0-6944fddf-60e3-4eab-811d-1398e9f64073-0.parquet', '1-66c7ee6e-6aab-4c74-866d-a82790102652-0.parquet']

   ```

1. Review history
@@ -156,7 +158,7 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```python
   ## Output
-   [{'timestamp': 1682475171964, 'delta-rs': '0.8.0'}, {'timestamp': 1682475171985, 'operation': 'WRITE', 'operationParameters': {'partitionBy': '[]', 'mode': 'Append'}, 'clientVersion': 'delta-rs.0.8.0'}]
+   [{'timestamp': 1698002214493, 'operation': 'WRITE', 'operationParameters': {'mode': 'Append', 'partitionBy': '[]'}, 'clientVersion': 'delta-rs.0.17.0', 'version': 1}, {'timestamp': 1698002207527, 'operation': 'CREATE TABLE', 'operationParameters': {'mode': 'ErrorIfExists', 'protocol': '{"minReaderVersion":1,"minWriterVersion":1}', 'location': 'file:///tmp/deltars_table', 'metadata': '{"configuration":{},"created_time":1698002207525,"description":null,"format":{"options":{},"provider":"parquet"},"id":"bf749aab-22b6-484b-bd73-dc1680ee4384","name":null,"partition_columns":[],"schema":{"fields":[{"metadata":{},"name":"data","nullable":true,"type":"long"}],"type":"struct"}}'}, 'clientVersion': 'delta-rs.0.17.0', 'version': 0}]

   ```

1. Time Travel (load older version of table)
@@ -171,12 +173,12 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```python
   ## Output
-      0
-   0  0
-   1  1
-   2  2
-   3  3
-   4  4
+      data
+   0     0
+   1     1
+   2     2
+   3     3
+   4     4

   ```

1. Follow the delta-rs Python documentation [here](https://delta-io.github.io/delta-rs/python/usage.html#)
@@ -189,9 +191,9 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```bash
   total 12
-   4 drwxr-xr-x 2 NBuser 4096 Apr 26 02:12 _delta_log
-   4 -rw-r--r-- 1 NBuser 1689 Apr 26 02:12 0-d4920663-30e9-4a1a-afde-59bc4ebd24b5-0.parquet
-   4 -rw-r--r-- 1 NBuser 1691 Apr 26 02:12 1-f27a5ea6-a15f-4ca1-91b3-72bcf64fbc09-0.parquet
+   4 -rw-r--r-- 1 NBuser 1689 Oct 22 19:16 0-6944fddf-60e3-4eab-811d-1398e9f64073-0.parquet
+   4 -rw-r--r-- 1 NBuser 1691 Oct 22 19:16 1-66c7ee6e-6aab-4c74-866d-a82790102652-0.parquet
+   4 drwxr-xr-x 2 NBuser 4096 Oct 22 19:16 _delta_log

   ```

1. [Optional] Skip ahead to try out the [Delta Rust API](#delta-rust-api) and [ROAPI](#optional-roapi)
@@ -225,11 +227,15 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
3. Launch a pyspark interactive shell session

   ```bash
+
   $SPARK_HOME/bin/pyspark --packages io.delta:${DELTA_PACKAGE_VERSION} \
+     --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
     --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
   ```

+   > Note: `DELTA_PACKAGE_VERSION` is set in `./startup.sh`
+
4. Run some basic commands in the shell

   ```python
@@ -277,16 +283,20 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
   ```

   ```bash
-   total 36
-   4 drwxr-xr-x 2 NBuser 4096 Apr 26 02:30 _delta_log
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:30 .part-00000-bdee316b-8623-4423-b59c-6a809addaea8-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:30 .part-00001-6b373d50-5bdd-496a-9e21-ab4164176f11-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:30 .part-00002-9721ce9e-e043-4875-bcff-08f7d7c3d3f0-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:30 .part-00003-61aaf450-c318-452a-aea5-5a44c909fd74-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:30 part-00000-bdee316b-8623-4423-b59c-6a809addaea8-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:30 part-00001-6b373d50-5bdd-496a-9e21-ab4164176f11-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:30 part-00002-9721ce9e-e043-4875-bcff-08f7d7c3d3f0-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 486 Apr 26 02:30 part-00003-61aaf450-c318-452a-aea5-5a44c909fd74-c000.snappy.parquet
+   total 52
+   4 drwxr-xr-x 2 NBuser 4096 Oct 22 19:23 _delta_log
+   4 -rw-r--r-- 1 NBuser 296 Oct 22 19:23 part-00000-dc0fd6b3-9c0f-442f-a6db-708301b27bd2-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00000-dc0fd6b3-9c0f-442f-a6db-708301b27bd2-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:23 part-00001-d379441e-1ee4-4e78-8616-1d9635df1c7b-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00001-d379441e-1ee4-4e78-8616-1d9635df1c7b-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:23 part-00003-c08dcac4-5ea9-4329-b85d-9110493e8757-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00003-c08dcac4-5ea9-4329-b85d-9110493e8757-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:23 part-00005-5db8dd16-2ab1-4d76-9b4d-457c5641b1c8-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00005-5db8dd16-2ab1-4d76-9b4d-457c5641b1c8-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:23 part-00007-cad760e0-3c26-4d22-bed6-7d75a9459a0f-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00007-cad760e0-3c26-4d22-bed6-7d75a9459a0f-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:23 part-00009-b58e8445-07b7-4e2a-9abf-6fea8d0c3e3f-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:23 .part-00009-b58e8445-07b7-4e2a-9abf-6fea8d0c3e3f-c000.snappy.parquet.crc

   ```
### Scala Shell
@@ -299,17 +309,21 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark

   ```bash
   $SPARK_HOME/bin/spark-shell --packages io.delta:${DELTA_PACKAGE_VERSION} \
+     --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
     --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
   ```

4. Run some basic commands in the shell

+   > Note: if you've already written to the Delta table in the Python shell example, use `.mode("overwrite")` to overwrite the current Delta table. You can always time travel to rewind.
+
   ```scala
   // Create a Spark DataFrame
   val data = spark.range(0, 5)

   // Write to a Delta Lake table
+
   (data
     .write
     .format("delta")
@@ -350,22 +364,29 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
   ```

   ```bash
-   total 36
-   4 drwxr-xr-x 2 NBuser 4096 Apr 26 02:31 _delta_log
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:31 .part-00000-e0353d3e-7473-4ff7-9b58-e977d48d008a-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:31 .part-00001-0e2c89cf-3f9b-4698-b059-6dd41d4e3aed-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:31 .part-00002-06bf68f9-16d8-4c08-ba8e-7b0b00d52b8e-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 12 Apr 26 02:31 .part-00003-5963f002-d98a-421f-9c2d-22376b7f87e4-c000.snappy.parquet.crc
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:31 part-00000-e0353d3e-7473-4ff7-9b58-e977d48d008a-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:31 part-00001-0e2c89cf-3f9b-4698-b059-6dd41d4e3aed-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 478 Apr 26 02:31 part-00002-06bf68f9-16d8-4c08-ba8e-7b0b00d52b8e-c000.snappy.parquet
-   4 -rw-r--r-- 1 NBuser 486 Apr 26 02:31 part-00003-5963f002-d98a-421f-9c2d-22376b7f87e4-c000.snappy.parquet
+   total 52
+   4 drwxr-xr-x 2 NBuser 4096 Oct 22 19:28 _delta_log
+   4 -rw-r--r-- 1 NBuser 296 Oct 22 19:28 part-00000-f1f417f7-df64-4c7c-96f2-6a452ae2b49e-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00000-f1f417f7-df64-4c7c-96f2-6a452ae2b49e-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:28 part-00001-b28acb6f-f08a-460f-a24e-4d9c1affee86-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00001-b28acb6f-f08a-460f-a24e-4d9c1affee86-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:28 part-00003-29079c58-d1ad-4604-9c04-0f00bf09546d-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00003-29079c58-d1ad-4604-9c04-0f00bf09546d-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:28 part-00005-04424aa7-48e1-4212-bd57-52552c713154-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00005-04424aa7-48e1-4212-bd57-52552c713154-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:28 part-00007-e7a54a4f-bee4-4371-a35d-d284e28eb9f8-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00007-e7a54a4f-bee4-4371-a35d-d284e28eb9f8-c000.snappy.parquet.crc
+   4 -rw-r--r-- 1 NBuser 478 Oct 22 19:28 part-00009-086e6cd9-e8c6-4f16-9658-b15baf22905d-c000.snappy.parquet
+   4 -rw-r--r-- 1 NBuser 12 Oct 22 19:28 .part-00009-086e6cd9-e8c6-4f16-9658-b15baf22905d-c000.snappy.parquet.crc

   ```

</details>

### Delta Rust API

+> Note: Use a docker volume in case you run into "no room left on device" limits:
+> `docker volume create rustbuild`
+> `docker run --name delta_quickstart -v rustbuild:/tmp --rm -it --entrypoint bash deltaio/delta-docker:3.0.0`
+
1. Open a bash shell (if on windows use git bash, WSL, or any shell configured for bash commands)

2. Run a container from the image with a bash entrypoint ([build](#build-entry-point) | [DockerHub](#image-entry-point))
@@ -377,28 +398,26 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
   cargo run --example read_delta_table
   ```

+   > You can also use a different location to build and run the examples:
+
+   ```bash
+   cd rs
+   CARGO_TARGET_DIR=/tmp cargo run --example read_delta_table
+   ```
+
   > If using [Delta Lake DockerHub](https://go.delta.io/dockerhub), sometimes the Rust environment hasn't been configured. To resolve this, run the command `source "$HOME/.cargo/env"`

   ```bash
   === Delta table metadata ===
-   DeltaTable(../quickstart_docker/rs/data/COVID-19_NYT)
+   DeltaTable(/opt/spark/work-dir/rs/data/COVID-19_NYT)
   version: 0
   metadata: GUID=7245fd1d-8a6d-4988-af72-92a95b646511, name=None, description=None, partitionColumns=[], createdTime=Some(1619121484605), configuration={}
   min_version: read=1, write=2
   files count: 8

   === Delta table files ===
-   [
-     Path { raw: "part-00000-a496f40c-e091-413a-85f9-b1b69d4b3b4e-c000.snappy.parquet" },
-     Path { raw: "part-00001-9d9d980b-c500-4f0b-bb96-771a515fbccc-c000.snappy.parquet" },
-     Path { raw: "part-00002-8826af84-73bd-49a6-a4b9-e39ffed9c15a-c000.snappy.parquet" },
-     Path { raw: "part-00003-539aff30-2349-4b0d-9726-c18630c6ad90-c000.snappy.parquet" },
-     Path { raw: "part-00004-1bb9c3e3-c5b0-4d60-8420-23261f58a5eb-c000.snappy.parquet" },
-     Path { raw: "part-00005-4d47f8ff-94db-4d32-806c-781a1cf123d2-c000.snappy.parquet" },
-     Path { raw: "part-00006-d0ec7722-b30c-4e1c-92cd-b4fe8d3bb954-c000.snappy.parquet" },
-     Path { raw: "part-00007-4582392f-9fc2-41b0-ba97-a74b3afc8239-c000.snappy.parquet" }
-   ]
+   [Path { raw: "part-00000-a496f40c-e091-413a-85f9-b1b69d4b3b4e-c000.snappy.parquet" }, Path { raw: "part-00001-9d9d980b-c500-4f0b-bb96-771a515fbccc-c000.snappy.parquet" }, Path { raw: "part-00002-8826af84-73bd-49a6-a4b9-e39ffed9c15a-c000.snappy.parquet" }, Path { raw: "part-00003-539aff30-2349-4b0d-9726-c18630c6ad90-c000.snappy.parquet" }, Path { raw: "part-00004-1bb9c3e3-c5b0-4d60-8420-23261f58a5eb-c000.snappy.parquet" }, Path { raw: "part-00005-4d47f8ff-94db-4d32-806c-781a1cf123d2-c000.snappy.parquet" }, Path { raw: "part-00006-d0ec7722-b30c-4e1c-92cd-b4fe8d3bb954-c000.snappy.parquet" }, Path { raw: "part-00007-4582392f-9fc2-41b0-ba97-a74b3afc8239-c000.snappy.parquet" }]
   ```

4. Execute `examples/read_delta_datafusion.rs` to query the `covid19_nyt` Delta Lake table using `datafusion`
@@ -408,37 +427,29 @@ The current version is `delta-core_2.12:2.3.0` which corresponds to Apache Spark
   ```

   ```bash
+   === Datafusion query ===
+   [RecordBatch { schema: Schema { fields: [Field { name: "cases", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "county", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "date", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
   [
-   RecordBatch {
-     schema: Schema {
-       fields: [
-         Field { name: "cases", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None },
-         Field { name: "county", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None },
-         Field { name: "date", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }
-       ], metadata: {}
-     },
-     columns: [PrimitiveArray<Int32> [
-       1,
-       1,
-       1,
-       1,
-       1,
-     ], StringArray [
-       "Snohomish",
-       "Snohomish",
-       "Snohomish",
-       "Cook",
-       "Snohomish",
-     ], StringArray [
-       "2020-01-21",
-       "2020-01-22",
-       "2020-01-23",
-       "2020-01-24",
-       "2020-01-24",
-     ]],
-     row_count: 5
-   }
-   ]
+   1,
+   1,
+   1,
+   1,
+   1,
+   ], StringArray
+   [
+   "Snohomish",
+   "Snohomish",
+   "Snohomish",
+   "Cook",
+   "Snohomish",
+   ], StringArray
+   [
+   "2020-01-21",
+   "2020-01-22",
+   "2020-01-23",
+   "2020-01-24",
+   "2020-01-24",
+   ]], row_count: 5 }]
   ```

</p>