Skip to content

Conversation

@abellina
Copy link
Collaborator

  1. This PR changes BUFN deadlock detection so it only happens in the deadlock busting thread, not in every free.
  2. It removes a JNI->Java callback that was done, instead taking in the java values when needed (in the thread) as a parameter.
  3. It also removes a static lock in ThreadStateRegistry

This is a performance improvement for the state machine and is part of #3905

@abellina abellina force-pushed the detect_deadlock_in_background_thread_only branch from c4501b3 to 9485a51 Compare November 19, 2025 20:17
@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 19, 2025

Greptile Summary

  • Moves BUFN deadlock detection from critical path (every free/alloc) to dedicated watchdog thread for performance improvement
  • Replaces expensive JNI->Java callback (isThreadBlocked) with Java->JNI parameter passing of blocked thread IDs
  • Removes static synchronized lock in ThreadStateRegistry by switching to ConcurrentHashMap

Confidence Score: 4/5

  • This PR is safe to merge with careful monitoring of deadlock detection behavior in production.
  • Score reflects architectural improvement with one concern: BUFN thread wake-up logic removed from wake_next_highest_priority_blocked means progress now depends entirely on watchdog thread frequency, which could delay deadlock resolution if watchdog doesn't run frequently enough.
  • Pay close attention to src/main/cpp/src/SparkResourceAdaptorJni.cpp - the removed BUFN wake-up logic in deallocation path changes deadlock resolution timing.

Important Files Changed

Filename Overview
src/main/cpp/src/SparkResourceAdaptorJni.cpp Removes JNI callback in critical path, delegates BUFN deadlock detection to watchdog thread, and simplifies wake logic to only handle THREAD_BLOCKED states
src/main/java/com/nvidia/spark/rapids/jni/ThreadStateRegistry.java Replaces synchronized HashMap with ConcurrentHashMap, adds blockedThreadIds() method to collect blocked thread state

Sequence Diagram

sequenceDiagram
    participant DT as Deadlock Thread
    participant TSR as ThreadStateRegistry
    participant SRA as SparkResourceAdaptor
    participant Native as C++ spark_resource_adaptor
    participant WT as Worker Thread

    DT->>TSR: blockedThreadIds()
    TSR->>TSR: iterate knownThreads
    TSR->>TSR: check thread state for each
    TSR-->>DT: return blocked thread IDs array
    DT->>SRA: checkAndBreakDeadlocks()
    SRA->>Native: checkAndBreakDeadlocks(handle, blockedThreadIds)
    Native->>Native: check_and_break_deadlocks(java_blocked_thread_ids)
    Native->>Native: check_and_update_for_bufn(lock, java_blocked_thread_ids)
    Native->>Native: is_in_deadlock() with java thread state
    alt Deadlock detected
        Native->>Native: transition thread to BUFN_THROW or RUNNING
        Native->>Native: notify blocked threads
    end
    
    WT->>Native: do_deallocate()
    Native->>Native: dealloc_core()
    Native->>Native: wake_next_highest_priority_blocked()
    Note over Native: Only wakes THREAD_BLOCKED, not BUFN threads
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

revans2
revans2 previously approved these changes Nov 19, 2025
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also why is the cudf version changing? is there something needed by it to be different?

full_thread_state const& state,
std::optional<std::unordered_set<long>> const& java_blocked_thread_ids)
{
LOG_INFO("is_thread_bufn_or_above: state: {}, java_blocked_thread_ids: {}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Generally we have always had the logging follow a CSV/TSV like structure so that we can parse it and do post processing/analytics on it. Is the needed? Could this be moved to more of that format?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I can change or remove. It is "just in case", I never like to be deadlocked without finding a way to debug it.

@abellina
Copy link
Collaborator Author

yeah let me check the cudf thing, it's not intended.

@abellina abellina changed the base branch from main to release/25.12 November 19, 2025 20:38
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@abellina
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Nov 19, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@abellina
Copy link
Collaborator Author

build

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@pxLi
Copy link
Member

pxLi commented Nov 20, 2025

submodule sync is syncing the latest cudf commits (w/ more breaking changes) and corresponding pinned deps to release/25.12, the cudf ref updates in this change may cause commit roll back after merge. Can we revert the cudf refer update, and trigger CI later? thanks

Aborted the submodule run to allow this one to get merged first

@pxLi pxLi merged commit 8f5f8c2 into NVIDIA:release/25.12 Nov 20, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants