[FEA] SparkResourceAdaptor locking performance issues #3905

@abellina

Description

We have reports of the SparkResourceAdaptor causing a performance regression between 25.02 and 25.04, due to coarse locks being held while we perform some expensive STL and metric operations. We have confirmed that shrinking those critical sections improves performance immensely and removes the regression.

This issue tracks fixing this the proper way, by speeding up the processing done within these locks.
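
As a rough illustration of the kind of change involved, here is a minimal sketch of the general pattern (the class and field names below are made up for illustration, not the actual spark_resource_adaptor code): copy the small amount of state we need while holding the mutex, then do the expensive STL/metric work outside the critical section.

```cpp
#include <map>
#include <mutex>
#include <vector>

struct thread_metrics {
  long allocations     = 0;
  long blocked_time_ns = 0;
};

class adaptor_example {
  std::mutex state_mutex;
  std::map<long, thread_metrics> threads;  // keyed by thread id

 public:
  // Before: the whole aggregation ran under state_mutex.
  // After: snapshot under the lock, do the expensive processing outside it.
  std::vector<thread_metrics> snapshot_metrics() {
    std::vector<thread_metrics> copy;
    {
      std::lock_guard<std::mutex> guard(state_mutex);
      copy.reserve(threads.size());
      for (auto const& entry : threads) {
        copy.push_back(entry.second);  // cheap copy while locked
      }
    }
    // Expensive work (sorting, string formatting, metric math) now happens
    // without other threads waiting on state_mutex.
    return copy;
  }
};
```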

List of changes I am proposing:

  1. Use read/write locks in RmmSpark against Rmm.class (#3924: "Use Rmm read/write locks in RmmSpark").
  2. Add logging macros in SparkResourceAdaptor that consistently check whether logging is enabled, so we stop doing expensive pre-log operations we do not need; see the first sketch after this list (#3931: "Decouple logger object from spark_resource_adaptor").
  3. Make full_thread_state a shared pointer (#3966).
  4. Stop doing deadlock detection in the critical path and leave it to allocation failures and the deadlock watchdog. Also stop calling into JNI during deadlock protection; instead, Java sends the state of its threads to JNI, which is a less expensive Java -> native path (#3977: "Only check for deadlocks in deadlock busting thread").
  5. Use state-specific collections in the adaptor, especially for blocked threads ordered by priority. This prevents expensive loops over the threads map in critical sections; see the second sketch after this list.
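
For item 2, here is a minimal sketch of the macro idea, with hypothetical macro and logger names (the real PR also decouples the logger object; this only illustrates the "check the level before evaluating the message" part). The expensive message expression is never evaluated unless that log level is enabled.

```cpp
#include <iostream>
#include <sstream>
#include <string>

enum class log_level { debug, info, off };

struct simple_logger {
  log_level enabled = log_level::info;
  bool should_log(log_level level) const {
    return enabled != log_level::off && level >= enabled;
  }
  void write(std::string const& msg) { std::cerr << msg << "\n"; }
};

// msg_expr is only evaluated inside the if, i.e. when the level is enabled.
#define LOG_AT(logger, level, msg_expr)       \
  do {                                        \
    if ((logger).should_log(level)) {         \
      std::ostringstream oss;                 \
      oss << msg_expr;                        \
      (logger).write(oss.str());              \
    }                                         \
  } while (0)

std::string expensive_thread_dump() {  // stands in for a costly pre-log operation
  return "thread state dump...";
}

int main() {
  simple_logger logger;
  logger.enabled = log_level::off;
  // expensive_thread_dump() is never called here because logging is disabled.
  LOG_AT(logger, log_level::debug, "state: " << expensive_thread_dump());
}
```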

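For item 5, here is a minimal sketch of keeping blocked threads in their own priority-ordered container, again with hypothetical types and fields; it also assumes item 3's shared pointers so the same full_thread_state object can live in both the full thread map and the blocked set.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <set>

struct full_thread_state {
  long thread_id;
  int64_t priority;  // higher value wakes first, per this sketch's convention
};

struct by_priority {
  bool operator()(std::shared_ptr<full_thread_state> const& a,
                  std::shared_ptr<full_thread_state> const& b) const {
    // Descending priority, tie-broken by thread id so distinct threads with
    // equal priority can coexist in the set.
    if (a->priority != b->priority) { return a->priority > b->priority; }
    return a->thread_id < b->thread_id;
  }
};

class adaptor_state_example {
  // The existing map of all threads, which critical sections no longer scan.
  std::map<long, std::shared_ptr<full_thread_state>> all_threads;
  // State-specific collection: only the blocked threads, ordered by priority.
  std::set<std::shared_ptr<full_thread_state>, by_priority> blocked_threads;

 public:
  void mark_blocked(std::shared_ptr<full_thread_state> const& t) {
    blocked_threads.insert(t);
  }

  // O(log n) pick of the highest-priority blocked thread instead of an
  // O(n) loop over all_threads inside the lock.
  std::shared_ptr<full_thread_state> next_to_wake() {
    if (blocked_threads.empty()) { return nullptr; }
    auto it = blocked_threads.begin();
    auto t  = *it;
    blocked_threads.erase(it);
    return t;
  }
};
```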