Thanks to all for the quick turnaround resolving #301.
Unfortunately, we've hit a much deeper snag when upgrading to 0.11.0 or later. Almost as soon as we re-open, in read mode, an Array we had previously been writing to, we get native code errors or what appears to be a deadlock. This only happens after writing a very large number of overlapping chunks, and it does not happen with 0.10.1.
With our production code this is the behaviour:
TileDB-Java 0.11.0 (TileDB 2.9.0)
[2023-07-25 17:49:14.391] [Process: 2565256] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:142)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
TileDB-Java 0.13.0 (TileDB 2.11.0)
[2023-07-25 13:57:01.358] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:03.753] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.940] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.942] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.943] [Process: 889580] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
[2023-07-25 13:57:09.289] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:11.144] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:142)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
TileDB-Java 0.14.1 (TileDB 2.12)
[2023-07-25 13:43:49.710] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:49.712] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
[2023-07-25 13:43:50.810] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:50.811] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:142)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
TileDB-Java 0.15.2 (TileDB 2.13.2)
Hang or deadlock. Worker stack traces (collected via jstack) are:
"pool-1-thread-1" #23 prio=5 os_prio=0 cpu=89546.88ms elapsed=340.37s tid=0x0000020972a92800 nid=0x3d32c runnable [0x000000be392fe000]
java.lang.Thread.State: RUNNABLE
at io.tiledb.libtiledb.tiledbJNI.tiledb_query_submit(Native Method)
at io.tiledb.libtiledb.tiledb.tiledb_query_submit(tiledb.java:2853)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
TileDB-Java 0.16.1 (TileDB 2.14.1)
Works for a while, dies later.
[2023-07-25 14:20:19.628] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:19.632] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:144)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
TileDB-Java 0.17.8 (TileDB 2.15.4)
[2023-07-25 16:45:09.726] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.729] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:09.921] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.922] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.401] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.402] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.762] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.763] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.033] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.034] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.776] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.777] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:144)
at io.tiledb.java.api.Query.submit(Query.java:130)
...
I've put together a limited example which reproduces this:
It fails like this:
...
Inserting rectangle: [11078, 21057]
Inserting rectangle: [12093, 21090]
Not consolidating H:\code\tiledb-java-torture\tiledb_14210934035379110573\0
Creating TileDB array: H:\code\tiledb-java-torture\tiledb_14210934035379110573\1
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
Exception during execution
java.util.concurrent.CompletionException: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:404)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
at io.tiledb.java.api.Context.handleError(Context.java:144)
at io.tiledb.java.api.Query.submit(Query.java:130)
at com.glencoesoftware.tiledb.Main.processTile(Main.java:359)
at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:402)
... 3 more
The above output snippet is from Windows 10. Linux behaves similarly but not identically.
The code reflects the pattern from our production code that relies on TileDB fairly well:
- Process a large number of tiles writing them in a non-adjacent fashion from multiple workers to a 5-dimensional TileDB Array
- Downsample from the Array [1] and write to new 5-dimensional Array for each new "resolution"
About 20 channels (~11000 fragments) is enough to produce the errors. If less data is processed, everything proceeds as normal. The issue occurs with or without consolidation.
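For readers without access to the reproduction, the write pattern described above can be sketched roughly as follows. This is a stdlib-only illustration, not the actual reproduction code: the class name `TileWorkloadSketch`, the helpers `shuffledTiles` and `writeAll`, the tile and plane sizes, and the worker count are all hypothetical, the grid here is non-overlapping for simplicity, and the `writeTile` callback is a placeholder for wherever the real code opens the Array and calls `Query.submit()`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class TileWorkloadSketch {

  /** Tile origins (x, y) for one plane, shuffled so that writes are non-adjacent. */
  static List<int[]> shuffledTiles(int width, int height, int tileSize, long seed) {
    List<int[]> tiles = new ArrayList<>();
    for (int y = 0; y < height; y += tileSize) {
      for (int x = 0; x < width; x += tileSize) {
        tiles.add(new int[] {x, y});
      }
    }
    Collections.shuffle(tiles, new Random(seed));
    return tiles;
  }

  /** Runs one write task per tile on a fixed pool; returns the number of tiles written. */
  static int writeAll(List<int[]> tiles, int workers, Consumer<int[]> writeTile) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    AtomicInteger written = new AtomicInteger();
    try {
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      for (int[] tile : tiles) {
        futures.add(CompletableFuture.runAsync(() -> {
          // Placeholder: the real code would submit a TileDB write query here.
          writeTile.accept(tile);
          written.incrementAndGet();
        }, pool));
      }
      // join() throws CompletionException if any tile task failed.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
    return written.get();
  }

  public static void main(String[] args) {
    // Assumed sizes, for illustration only.
    int width = 4096, height = 4096, tileSize = 512;
    for (int resolution = 0; resolution < 3; resolution++) {
      List<int[]> tiles = shuffledTiles(width, height, tileSize, 42L);
      // For resolution > 0 the real code would read from the previous
      // Array, downsample, and write to a new Array.
      int n = writeAll(tiles, 4, tile -> { });
      System.out.println("resolution " + resolution + ": " + n + " tiles");
      width /= 2;
      height /= 2;
    }
  }
}
```

Each task corresponds to one write and therefore, presumably, one fragment, which is how the fragment count grows into the thousands. A failure in any worker surfaces from `join()` as a `CompletionException` wrapping the original error, consistent with the trace shown above.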