HDFS-17815. Fix upload fsimage failure when checkpoint takes a long time #7845

lfxy · 2025-07-31T17:20:43Z

Our HDFS federation cluster has a capacity of more than 500 PB, and one nameservice (NS) contains over 600 million files. A single checkpoint takes nearly two hours.

We found that checkpoints frequently fail because the standby NameNode cannot upload (PUT) the fsimage to the active NameNode, which leads to repeated checkpoints. We have dfs.recent.image.check.enabled=true configured. After debugging, we found the cause: the standby NN sets lastCheckpointTime to the start time of the checkpoint rather than its end time. In our cluster, the standby NN's lastCheckpointTime is approximately 80 minutes earlier than the active NN's lastCheckpointTime.

The standby NN starts the next checkpoint once the interval since its own lastCheckpointTime exceeds dfs.namenode.checkpoint.period. Because the active NN's lastCheckpointTime is later than the standby NN's, the interval measured on the active NN is still less than dfs.namenode.checkpoint.period, so the fsimage upload is rejected, and the checkpoint fails and is retried.
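
The mismatch can be illustrated with a small, self-contained sketch. This is not the NameNode code: the class and method names are hypothetical, the 120-minute period is an assumed value, and the acceptance rule is reduced to a pure time comparison (it ignores the transaction-count condition and the time the second checkpoint itself takes). It only shows how recording the start time on the standby makes the two nodes disagree about whether a full period has elapsed.

```java
// Hypothetical model of the interval check; not taken from the Hadoop source.
class CheckpointIntervalSketch {

    /** Minutes elapsed since the given last-checkpoint timestamp. */
    static long interval(long nowMin, long lastCheckpointMin) {
        return nowMin - lastCheckpointMin;
    }

    /** Simplified rule: a checkpoint is due (or acceptable) once a full period has elapsed. */
    static boolean periodElapsed(long nowMin, long lastCheckpointMin, long periodMin) {
        return interval(nowMin, lastCheckpointMin) >= periodMin;
    }

    public static void main(String[] args) {
        long periodMin = 120;      // assumed dfs.namenode.checkpoint.period, in minutes
        long checkpointStart = 0;  // standby begins the checkpoint
        long checkpointEnd = 80;   // active NN has the image about 80 minutes later

        long activeLast = checkpointEnd;          // active NN records the end time
        long standbyLastBuggy = checkpointStart;  // standby records the start time
        long standbyLastFixed = checkpointEnd;    // standby records the end time

        // Earliest moment at which the standby considers the next checkpoint due.
        long nextBuggy = standbyLastBuggy + periodMin;
        long nextFixed = standbyLastFixed + periodMin;

        System.out.println("start time recorded: active accepts = "
            + periodElapsed(nextBuggy, activeLast, periodMin));
        System.out.println("end time recorded:   active accepts = "
            + periodElapsed(nextFixed, activeLast, periodMin));
    }
}
```

Under these assumptions, the standby that records the start time asks to upload when the active NN has seen only 40 minutes pass, so the upload is refused; when the end time is recorded, both nodes see the same 120-minute interval.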

hadoop-yetus · 2025-07-31T22:29:09Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 51s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
-1 ❌	test4tests	0m 0s		The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
			_ trunk Compile Tests _
+1 💚	mvninstall	48m 57s		trunk passed
+1 💚	compile	1m 31s		trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	compile	1m 11s		trunk passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	checkstyle	1m 12s		trunk passed
+1 💚	mvnsite	1m 19s		trunk passed
+1 💚	javadoc	1m 15s		trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 43s		trunk passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	spotbugs	3m 15s		trunk passed
+1 💚	shadedclient	43m 29s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 7s		the patch passed
+1 💚	compile	1m 16s		the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javac	1m 16s		the patch passed
+1 💚	compile	1m 6s		the patch passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	javac	1m 6s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	1m 2s		the patch passed
+1 💚	mvnsite	1m 11s		the patch passed
+1 💚	javadoc	1m 2s		the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 32s		the patch passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	spotbugs	3m 12s		the patch passed
+1 💚	shadedclient	43m 11s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	148m 18s		hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 42s		The patch does not generate ASF License warnings.
		306m 59s

Subsystem	Report/Notes
Docker	ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7845/1/artifact/out/Dockerfile
GITHUB PR	#7845
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux 36c83d3474c2 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `2dc4048`
Default Java	Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7845/1/testReport/
Max. process+thread count	2279 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7845/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Fix upload fsimage failure when checkpoint takes a long time

2dc4048

github-actions bot added HDFS trunk labels Jul 31, 2025
