Concurrency issues after upgrade to 0.11.0 #304

@chris-allan

Thanks to all for the quick turnaround resolving #301.

Unfortunately, we've hit a much deeper snag when upgrading to 0.11.0 or any later version. Almost as soon as we re-open, in read mode, an Array we had previously been writing to, we get native code errors or what appears to be a deadlock. This only happens after writing a very large number of overlapping chunks, and it does not happen with 0.10.1.

With our production code, this is the behaviour for each version we tried:

TileDB-Java 0.11.0 (TileDB 2.9.0)

[2023-07-25 17:49:14.391] [Process: 2565256] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:142)
	at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.13.0 (TileDB 2.11.0)

[2023-07-25 13:57:01.358] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:03.753] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.940] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.942] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:08.943] [Process: 889580] [error] [Global] [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
[2023-07-25 13:57:09.289] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:57:11.144] [Process: 889580] [error] [Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: [TileDB::FragmentMetadata] Error: Trying to access metadata that's not loaded
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:142)
	at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.14.1 (TileDB 2.12)

[2023-07-25 13:43:49.710] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:49.712] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
[2023-07-25 13:43:50.810] [Process: 1772240] [error] [1690288813910863900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 13:43:50.811] [Process: 1772240] [error] [1690288813910863900-Global] Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: Error: Internal TileDB uncaught exception; device or resource busy: device or resource busy
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:142)
	at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.15.2 (TileDB 2.13.2)

Hang or deadlock. Worker stack traces (collected via jstack) are:

"pool-1-thread-1" #23 prio=5 os_prio=0 cpu=89546.88ms elapsed=340.37s tid=0x0000020972a92800 nid=0x3d32c runnable  [0x000000be392fe000]
   java.lang.Thread.State: RUNNABLE
        at io.tiledb.libtiledb.tiledbJNI.tiledb_query_submit(Native Method)
        at io.tiledb.libtiledb.tiledb.tiledb_query_submit(tiledb.java:2853)
        at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.16.1 (TileDB 2.14.1)

Works for a while, dies later.

[2023-07-25 14:20:19.628] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:19.632] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 14:20:20.317] [Process: 1313796] [error] [1690290969279989200-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:144)
	at io.tiledb.java.api.Query.submit(Query.java:130)
...

TileDB-Java 0.17.8 (TileDB 2.15.4)

[2023-07-25 16:45:09.726] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.729] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:09.921] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:09.922] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.401] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.402] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.617] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:10.762] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:10.763] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.033] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.034] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
[2023-07-25 16:45:11.776] [Process: 452168] [error] [1690299697002774900-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-25 16:45:11.777] [Process: 452168] [error] [1690299697002774900-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
...
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:144)
	at io.tiledb.java.api.Query.submit(Query.java:130)
...

I've put together a limited example which reproduces this:

It fails like this:

...
Inserting rectangle: [11078, 21057]
Inserting rectangle: [12093, 21090]
Not consolidating H:\code\tiledb-java-torture\tiledb_14210934035379110573\0
Creating TileDB array: H:\code\tiledb-java-torture\tiledb_14210934035379110573\1
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] [TileDB::Task] Error: Caught std::exception: device or resource busy: device or resource busy
[2023-07-26 13:28:54.114] [Process: 2200528] [error] [1690373955299503400-Global] C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
Exception during execution
java.util.concurrent.CompletionException: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
	at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
	at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:404)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.tiledb.java.api.TileDBError: C API: TileDB Internal, std::exception; device or resource busy: device or resource busy
	at io.tiledb.java.api.ContextCallback.call(ContextCallback.java:56)
	at io.tiledb.java.api.Context.handleError(Context.java:144)
	at io.tiledb.java.api.Query.submit(Query.java:130)
	at com.glencoesoftware.tiledb.Main.processTile(Main.java:359)
	at com.glencoesoftware.tiledb.Main.lambda$2(Main.java:402)
	... 3 more

The output snippet above is from Windows 10. Linux behaves similarly but not identically.
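The trace above shows the TileDBError arriving wrapped in a java.util.concurrent.CompletionException. For anyone triaging similar logs, here is a minimal sketch of peeling off those wrappers to reach the underlying error. It uses only the JDK; the class name UnwrapCause, the helper rootCause, and the IllegalStateException standing in for TileDBError are all hypothetical and not part of the reproduction code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

public class UnwrapCause {

    // Strip CompletionException / ExecutionException wrappers until the
    // first exception that is not one of those wrappers is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException
                || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // IllegalStateException is a stand-in for the native-side error.
        CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
            throw new IllegalStateException("device or resource busy");
        });
        try {
            future.join();
        } catch (CompletionException e) {
            Throwable cause = rootCause(e);
            System.out.println(cause.getClass().getName() + ": " + cause.getMessage());
        }
    }
}
```

Logging the unwrapped cause rather than the wrapper keeps the native error message at the top of the report.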

The example code reflects fairly well the pattern our TileDB-based production code follows:

  1. Process a large number of tiles, writing them in a non-adjacent fashion from multiple workers to a 5-dimensional TileDB Array
  2. Downsample from the Array [1] and write to a new 5-dimensional Array for each new "resolution"
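To make the access pattern concrete, the two steps can be sketched without TileDB at all. This is only an illustration of the shape of the workload, under loud assumptions: an in-memory map stands in for the Array, the grid is 2-dimensional rather than 5-dimensional, and the class and method names (TilePatternSketch, writeTiles, downsample) are invented here. It does not reproduce the error.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TilePatternSketch {

    // Step 1: several workers write tiles in shuffled (non-adjacent) order.
    // The map is a hypothetical stand-in for a TileDB Array, keyed by "x,y".
    static Map<String, Integer> writeTiles(int tilesX, int tilesY, int workers) {
        Map<String, Integer> array = new ConcurrentHashMap<>();
        List<int[]> tiles = new ArrayList<>();
        for (int y = 0; y < tilesY; y++) {
            for (int x = 0; x < tilesX; x++) {
                tiles.add(new int[] {x, y});
            }
        }
        Collections.shuffle(tiles, new Random(42));
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (int[] t : tiles) {
                // Each "write" stores a dummy value derived from the tile position.
                futures.add(CompletableFuture.runAsync(
                        () -> array.put(t[0] + "," + t[1], t[0] + t[1]), pool));
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return array;
    }

    // Step 2: read the previous level back and write a half-size level,
    // combining each 2x2 block of source tiles into one destination tile.
    static Map<String, Integer> downsample(Map<String, Integer> src, int tilesX, int tilesY) {
        Map<String, Integer> dst = new ConcurrentHashMap<>();
        for (int y = 0; y < (tilesY + 1) / 2; y++) {
            for (int x = 0; x < (tilesX + 1) / 2; x++) {
                int sum = 0;
                for (int dy = 0; dy < 2; dy++) {
                    for (int dx = 0; dx < 2; dx++) {
                        sum += src.getOrDefault((2 * x + dx) + "," + (2 * y + dy), 0);
                    }
                }
                dst.put(x + "," + y, sum);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        Map<String, Integer> level0 = writeTiles(4, 4, 3);
        Map<String, Integer> level1 = downsample(level0, 4, 4);
        System.out.println("level 0 tiles: " + level0.size());
        System.out.println("level 1 tiles: " + level1.size());
    }
}
```

In the real workload each write in step 1 is a separate TileDB write query, and step 2 re-opens the just-written Array in read mode, which is the point at which the errors appear.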

About 20 channels, or roughly 11,000 fragments, is enough to produce the errors. If less data is processed, things proceed as normal. The issue occurs with or without consolidation.
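For anyone trying to check whether they are near that fragment count, a rough way to count fragments is to count the subdirectories of the array's fragment directory. The sketch below uses only the JDK. It assumes the caller passes the directory that holds one subdirectory per fragment (in recent TileDB array layouts this is believed to be the `__fragments` folder inside the array directory, but that is an assumption to verify against your TileDB version); FragmentCount, countFragments, and makeSampleLayout are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class FragmentCount {

    // Count the immediate subdirectories of the given directory.
    // Assumption: each fragment is stored as one subdirectory.
    static long countFragments(Path fragmentsDir) {
        try (Stream<Path> entries = Files.list(fragmentsDir)) {
            return entries.filter(Files::isDirectory).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build a throwaway directory with the given number of subdirectories
    // plus one regular file, purely to exercise countFragments.
    static Path makeSampleLayout(int fragments) {
        try {
            Path dir = Files.createTempDirectory("fragments_demo");
            for (int i = 0; i < fragments; i++) {
                Files.createDirectory(dir.resolve("fragment_" + i));
            }
            Files.createFile(dir.resolve("not_a_fragment.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With an argument, count that directory; otherwise use the sample layout.
        Path dir = args.length > 0 ? Path.of(args[0]) : makeSampleLayout(3);
        System.out.println(dir + ": " + countFragments(dir) + " fragment directories");
    }
}
```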
