YARN-11838: YARN ConcurrentModificationException When Refreshing Node Attributes #7828

Open
wants to merge 1 commit into
base: trunk

shameersss1
Contributor

@shameersss1 shameersss1 commented Jul 24, 2025

Description of PR

Refer to YARN-11838 for more details.

The issue is as follows:

  1. The LOG statement that prints newNodeToAttributesMap iterates over host.attributes.
  2. host.attributes is modified by another thread at the same time, which leads to a ConcurrentModificationException.

There are two ways to solve this:

  1. Take the read lock before the LOG statement so that host.attributes cannot be modified while it is being logged.
  2. Create a defensive copy of host.attributes (under the read lock, because a modification can happen during the copy as well).

The rationale for choosing option 2 is to avoid inconsistent logging. Suppose we take the read lock only around the LOG statement. Once the LOG statement has executed, another thread can modify host.attributes, so we would log one thing and process another.

Creating a defensive copy ensures that the value does not change afterwards, i.e. what is logged is also what gets processed.
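
A minimal sketch of option 2, not the actual patch: `HostAttributes`, `put` and `snapshotNames` are hypothetical names used only for illustration, and a plain `HashMap` guarded by a `ReentrantReadWriteLock` is assumed to stand in for host.attributes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the class that owns the per-host attributes.
final class HostAttributes {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // Guarded by lock: attribute name -> attribute value.
  private final Map<String, String> attributes = new HashMap<>();

  // Writers mutate the map only while holding the write lock.
  void put(String name, String value) {
    lock.writeLock().lock();
    try {
      attributes.put(name, value);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Option 2: copy the key set while holding the read lock, so no writer
  // can change the map while the copy constructor iterates over it.
  Set<String> snapshotNames() {
    lock.readLock().lock();
    try {
      return new HashSet<>(attributes.keySet());
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    HostAttributes host = new HostAttributes();
    host.put("rack", "r1");
    Set<String> snapshot = host.snapshotNames();
    host.put("gpu", "true"); // later change, not visible in the snapshot
    System.out.println("logged and processed: " + snapshot);
  }
}
```

Because the snapshot is detached from the map, it can be logged and then processed without holding any lock, and both steps see the same contents.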

How was this patch tested?

Added unit test

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@sjlee
Contributor

sjlee commented Jul 24, 2025

@shameersss1 Thanks for your contribution. I haven't sat down and looked at the larger code yet, but a couple of questions:

  • Why are we using the read lock for a mutation operation? Shouldn't we be using the write lock? The read lock will still permit concurrent operation and is not the right thing to use here, no?
  • Regarding the unit test, I wonder how it is passing even with the read lock? Maybe the concurrency is not enough to reproduce the problem? It would be great if you could reproduce the problem with the old code first and prove that the new code fixes it.
  • Have you done a full analysis of all the reads and writes to this hashmap so that all read access is protected by the read lock and all write access by the write lock? That is the correct thing to do here.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 21m 57s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 46m 9s | | trunk passed |
| +1 💚 | compile | 1m 6s | | trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | compile | 0m 54s | | trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | checkstyle | 0m 56s | | trunk passed |
| +1 💚 | mvnsite | 1m 0s | | trunk passed |
| +1 💚 | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 0m 50s | | trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | spotbugs | 1m 58s | | trunk passed |
| +1 💚 | shadedclient | 41m 16s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 49s | | the patch passed |
| +1 💚 | compile | 0m 57s | | the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javac | 0m 57s | | the patch passed |
| +1 💚 | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | javac | 0m 47s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 43s | | the patch passed |
| +1 💚 | mvnsite | 0m 50s | | the patch passed |
| +1 💚 | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | spotbugs | 1m 56s | | the patch passed |
| +1 💚 | shadedclient | 41m 56s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 ❌ | unit | 119m 51s | /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 💚 | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 286m 58s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/artifact/out/Dockerfile |
| GITHUB PR | #7828 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 3f403c42a993 5.15.0-139-generic #149-Ubuntu SMP Fri Apr 11 22:06:13 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / c48dd49 |
| Default Java | Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/testReport/ |
| Max. process+thread count | 912 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7828/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@shameersss1
Contributor Author

Thanks @sjlee for the review. Please find the answers inline.

> Why are we using the read lock for a mutation operation? Shouldn't we be using the write lock? The read lock will still permit concurrent operation and is not the right thing to use here, no?

The method refreshNodeAttributesToScheduler does not do any writing. It only reads the variable host.attributes, which can be modified by another thread, leading to a ConcurrentModificationException. We also create a defensive copy so that further access is safe.

The read lock ensures that refreshNodeAttributesToScheduler can still be executed by multiple threads concurrently (since there is no writing), while the critical section newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet())); is protected from concurrent writers.
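
The read/write lock semantics relied on here can be checked with a small probe. This is only a sketch: `ReadLockSharingDemo` and its methods are hypothetical names, and it assumes a `java.util.concurrent.locks.ReentrantReadWriteLock` like the one used elsewhere in the class.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ReadLockSharingDemo {
  // Tries to take the given lock from a separate thread without blocking,
  // releases it again if it was acquired, and reports whether it succeeded.
  static boolean tryFromOtherThread(Lock lock) {
    AtomicBoolean acquired = new AtomicBoolean();
    Thread t = new Thread(() -> {
      if (lock.tryLock()) {
        acquired.set(true);
        lock.unlock();
      }
    });
    t.start();
    try {
      t.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return acquired.get();
  }

  // Returns {anotherReaderGotIn, writerGotIn}, measured while the calling
  // thread is holding the read lock.
  static boolean[] probeWhileReadLocked() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.readLock().lock();
    try {
      boolean reader = tryFromOtherThread(lock.readLock());
      boolean writer = tryFromOtherThread(lock.writeLock());
      return new boolean[] {reader, writer};
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    boolean[] result = probeWhileReadLocked();
    System.out.println("another reader: " + result[0] + ", writer: " + result[1]);
  }
}
```

While one thread holds the read lock, a second reader is admitted but a writer is refused, which is why reading and copying under the read lock is safe as long as every mutation takes the write lock.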

> Regarding the unit test, I wonder how it is passing even with the read lock? Maybe the concurrency is not enough to reproduce the problem? It would be great if you could reproduce the problem with the old code first and prove that the new code fixes it.

Since this is a race condition, reproducing it in a unit test is difficult without introducing artificial sleeps into the core code flow. Yes, the unit test passes even without this change. The purpose of the unit test is more of a protective measure.
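
Although the cross-thread race is timing dependent, the fail-fast check that raises the exception can be shown deterministically in a single thread by forcing the interleaving by hand. The sketch below is not a reproduction of the YARN code path; `FailFastDemo` is a hypothetical name and the map contents are made up.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class FailFastDemo {
  // Simulates the bad interleaving: an iteration over the key set has
  // started (as a log statement would do), then the map gains a new key
  // before the iteration continues.
  static boolean iterationFailsAfterInterleavedPut() {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("rack", "r1");
    attributes.put("zone", "z1");

    Iterator<String> it = attributes.keySet().iterator();
    it.next();                      // the "logging" iteration is under way
    attributes.put("gpu", "true");  // structural change by the "other thread"
    try {
      it.next();                    // fail-fast check detects the change
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println("CME raised: " + iterationFailsAfterInterleavedPut());
  }
}
```

Note that only structural changes (adding or removing a key) trip the check; overwriting the value of an existing key does not.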

> Have you done a full analysis of all the reads and writes to this hashmap so that all read access is protected by the read lock and all write access by the write lock? That is the correct thing to do here.

As per my analysis, host.attributes is accessed when node attributes are added, removed, and replaced. Every access except this one is protected by the read/write lock.

@shameersss1
Contributor Author

The unit test failure seems flaky; on rerun it passed locally:

[INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 285.3 s -- in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:13 min
[INFO] Finished at: 2025-07-25T15:25:01+05:30

@shameersss1
Contributor Author

@slfan1989 @TaoYang526 @zeekling could you please review?

@sjlee
Contributor

sjlee commented Jul 26, 2025

I see that you're copying the key set while holding the read lock to avoid the issue. I do think it is one correct way to address the issue. That's a valid fix.

My only point would be that guarding the logging call might be a cheaper and still correct fix, as it avoids copying. I don't think a keySet() call would cause iteration so that is still safe without the read lock. Let me know what you think.
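
The alternative described here can be sketched as follows. This is an illustration only, not code from the patch: `GuardedLogging`, `liveNames` and `describe` are hypothetical names. It relies on the documented behaviour that `Map.keySet()` returns a view backed by the map, so obtaining it does not iterate, while formatting the view for a log message does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class GuardedLogging {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // Guarded by lock: attribute name -> attribute value.
  private final Map<String, String> attributes = new HashMap<>();

  void put(String name, String value) {
    lock.writeLock().lock();
    try {
      attributes.put(name, value);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // No copy and no iteration: this hands back a live view of the keys.
  Set<String> liveNames() {
    return attributes.keySet();
  }

  // Option 1: the string conversion iterates over the view, so it is done
  // while holding the read lock to keep writers out.
  String describe(Set<String> names) {
    lock.readLock().lock();
    try {
      return "attributes=" + names;
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    GuardedLogging host = new GuardedLogging();
    host.put("rack", "r1");
    Set<String> view = host.liveNames();
    host.put("gpu", "true"); // visible through the view
    System.out.println(host.describe(view));
  }
}
```

The trade-off is that the view keeps tracking the map: any later use of it that iterates needs the same guard, and its contents may differ from what was logged.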

@shameersss1
Contributor Author

shameersss1 commented Jul 26, 2025

> I see that you're copying the key set while holding the read lock to avoid the issue. I do think it is one correct way to address the issue. That's a valid fix.
>
> My only point would be that guarding the logging call might be a cheaper and still correct fix, as it avoids copying. I don't think a keySet() call would cause iteration so that is still safe without the read lock. Let me know what you think.

The lock is required for creating the copy (since copying iterates). The only advantage I see with copying is that the log statement will be consistent with what we process. If we don't copy, another thread might modify host.attributes after we execute the LOG statement, which would lead to inconsistent logging and processing.

On a side note, I don't anticipate a host having so many attributes that the copy becomes expensive.

@sjlee Any thoughts on this?

// other threads might access host.attributes
readLock.lock();
try {
  newNodeToAttributesMap.put(hostName, new HashSet<>(host.attributes.keySet()));
Contributor

Judging from the stack, the problem occurred when the log was printed, and the wrong line was modified.

Contributor Author

The issue is as follows:

  1. The LOG statement that prints newNodeToAttributesMap iterates over host.attributes.
  2. host.attributes is modified by another thread at the same time, which leads to a ConcurrentModificationException.

There are two ways to solve this:

  1. As you said, take the read lock before the LOG statement so that host.attributes cannot be modified while it is being logged.
  2. Create a defensive copy of host.attributes (under the read lock, because a modification can happen during the copy as well).

The rationale for choosing option 2 is to avoid inconsistent logging. Suppose we take the read lock only around the LOG statement. Once the LOG statement has executed, another thread can modify host.attributes, so we would log one thing and process another.

Creating a defensive copy ensures that the value does not change afterwards, i.e. what is logged is also what gets processed.

@shameersss1 shameersss1 requested a review from zeekling July 28, 2025 08:52
@TaoYang526
Contributor

@shameersss1 Thanks for fixing this issue, LGTM.

@shameersss1
Contributor Author

@slfan1989 - Gentle reminder for review

@violetnspct

@shameersss1 Should you be adding unit tests to cover the following two edge cases, or are those already covered?

  1. Lock acquisition failure. Important because lock acquisition could fail under high contention.
  2. Exception during the locked section. Important to verify that the lock is released in error conditions.

@shameersss1
Contributor Author

> @shameersss1 Should you be adding unit tests to cover the following two edge cases, or are those already covered?
>
> 1. Lock acquisition failure. Important because lock acquisition could fail under high contention.
> 2. Exception during the locked section. Important to verify that the lock is released in error conditions.

The locking is consistent with the other methods in the class, which use a try {} finally {} block to release the lock, so I don't see any concerns here.
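
A small sketch of why the try/finally pattern covers the second edge case, assuming a `ReentrantReadWriteLock`; `LockReleaseDemo` and the simulated failure are hypothetical. On the first edge case, `Lock.lock()` blocks until the lock is acquired rather than failing, so contention delays the caller instead of producing an error.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockReleaseDemo {
  // Throws inside the locked section and reports how many read locks are
  // still held afterwards.
  static int readLocksHeldAfterFailure() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    try {
      lock.readLock().lock();
      try {
        throw new IllegalStateException("simulated failure in the locked section");
      } finally {
        // Runs on both the normal and the exceptional path.
        lock.readLock().unlock();
      }
    } catch (IllegalStateException expected) {
      // Swallowed only so the demo can inspect the lock state afterwards.
    }
    return lock.getReadLockCount();
  }

  public static void main(String[] args) {
    System.out.println("read locks still held: " + readLocksHeldAfterFailure());
  }
}
```

A unit test along these lines would assert that the count is zero after the exception has propagated.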

6 participants