FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

robn · 2025-07-11T06:37:43Z

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

After #17445 was pulled into FreeBSD @markjdb discovered a fairly serious performance regression due to my misunderstanding a difference between how Linux and FreeBSD lock out pages during writebacks.

Mark reverted that change in FreeBSD (freebsd/freebsd-src@738a9a7), and I took another swing at it. Here's what I've got.

Description

First commit reverts the original (238eab7), using the commit message, author and timestamp from freebsd/freebsd-src@738a9a7. That's mostly there to not complicate importing OpenZFS into FreeBSD in the future, but it's useful history too

The second commit is the do-over.

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state. So in the async case, keeping the page busy until the itx callback fires could have the page locked out until after the txg sync happens, maybe even whole seconds in the future. Linux, by constrast, does writeback under a much "softer" lock, that keeps the page from being evicted while it's being written out, but doesn't make it unusable.

So this commit splits the difference between the old and the new: for sync, we log the pages with a callback and call zil_commit(). For async, we do what we did before - undirty and return OK, and let the kernel unbusy them. The data is still copied to the DMU and to the ZIL, so it will still be written out just fine, and the page is now separately clean and evictable.

How Has This Been Tested?

The mmap test tag passes on FreeBSD 14.3. With #17398 rebased on top, so does the failmode tag.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

…completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit 238eab7. Reported by: Ihor Antonov <[email protected]> Discussed with: Rob Norris <[email protected]> References: freebsd/freebsd-src@738a9a7 Ported-by: Rob Norris <[email protected]> Signed-off-by: Rob Norris <[email protected]>

In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>

amotin

Looks good to me. I just wonder: on Linux we have to always use sync semantics, because otherwise we may not receive anything on msync() call after all the pages written asynchronously, that would result in not providing promised guaranties. I wonder if this is different on FreeBSD, or with this change we are only doing best we can do (better than before) at reasonable cost?

markjdb · 2025-07-13T17:06:53Z

Looks good to me. I just wonder: on Linux we have to always use sync semantics, because otherwise we may not receive anything on msync() call after all the pages written asynchronously, that would result in not providing promised guaranties. I wonder if this is different on FreeBSD, or with this change we are only doing best we can do (better than before) at reasonable cost?

Sorry, I'm not very familiar with this area of ZFS, so I have some dumb questions:

Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?
We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state.

It's basically a read-only state: the page is being written to storage asynchronously, and must not be modified in that state. In particular, there are no writable mappings of the page while it's in this state, so a write from userspace will trigger a page fault, which will block until the writeback is complete.

On Linux, I see there is some per-mapping STABLE_WRITES flag which I think provides the same effect, but it's not the default. It seems to only be used if there's some data checksumming in play which requires written-back data to be consistent with checksums computed before the writeback begins.

Offhand, I'm not sure exactly why FreeBSD requires writes to be stable always.

amotin · 2025-07-13T21:49:25Z

@markjdb:

Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?

Ah. I've mixed it with different context. There are several cases when we have to force kernel/mmap caches to be flushed, see zn_flush_cached_data() uses. And in that case we have to always request it to be sync, even if we don't care there, since if we do it async, then on later msync(MS_SYNC) from application kernel won't call us at all. In the context of Linux putpage though the zil_commit() use is correct. But the other question is: since we immediately copy the pages into DMU (at least until we support O_DIRECT in this context), why should we hold the pages at all for async writes, not allowing kernel to recycle them? IIRC it was also related to later syncs.

We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

Will we be called in any way to trigger zil_commit() once we release all the pages? IIRC on Linux we won't.

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state.

It's basically a read-only state: the page is being written to storage asynchronously, and must not be modified in that state. In particular, there are no writable mappings of the page while it's in this state, so a write from userspace will trigger a page fault, which will block until the writeback is complete.

On Linux, I see there is some per-mapping STABLE_WRITES flag which I think provides the same effect, but it's not the default. It seems to only be used if there's some data checksumming in play which requires written-back data to be consistent with checksums computed before the writeback begins.

Offhand, I'm not sure exactly why FreeBSD requires writes to be stable always.

Having pages stable would be great for O_DIRECT. But for it to be possible, kernel should give us pages in granularity of recordsize, which I suspect might not be true, and should be ready to wait for synchronous writes without performance penalty. I have doubts it is realistic enough to bother implementing it.

markjdb · 2025-07-14T14:27:24Z

@markjdb:

Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?

Ah. I've mixed it with different context. There are several cases when we have to force kernel/mmap caches to be flushed, see zn_flush_cached_data() uses. And in that case we have to always request it to be sync, even if we don't care there, since if we do it async, then on later msync(MS_SYNC) from application kernel won't call us at all. In the context of Linux putpage though the zil_commit() use is correct. But the other question is: since we immediately copy the pages into DMU (at least until we support O_DIRECT in this context), why should we hold the pages at all for async writes, not allowing kernel to recycle them? IIRC it was also related to later syncs.

Well, the problem with the original commit is that we were holding the pages for too long for async writes. Specifically, we were holding them until the corresponding TXG flush (I think), so pages could stay sbusy for several seconds(!).

We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

Will we be called in any way to trigger zil_commit() once we release all the pages? IIRC on Linux we won't.

In the async case, no, I don't think so.

In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17533

…completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit 238eab7. Reported by: Ihor Antonov <[email protected]> Discussed with: Rob Norris <[email protected]> References: freebsd/freebsd-src@738a9a7 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Ported-by: Rob Norris <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533

In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533

markjdb and others added 2 commits July 11, 2025 10:17

robn assigned robn and unassigned robn Jul 11, 2025

amotin approved these changes Jul 11, 2025

View reviewed changes

amotin added the Status: Code Review Needed Ready for review and testing label Jul 11, 2025

markjdb approved these changes Jul 13, 2025

View reviewed changes

amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 14, 2025

amotin mentioned this pull request Jul 15, 2025

Sometimes a delay of up to zfs_txg_timeout seconds when doing ftruncate #14290

Closed

behlendorf approved these changes Jul 15, 2025

View reviewed changes

behlendorf closed this in ee2a2d9 Jul 15, 2025

robn mentioned this pull request Aug 1, 2025

Linux: zfs_putpage: complete async page writeback immediately #17584

Closed

14 tasks

robn mentioned this pull request Aug 13, 2025

2.3.4 staging prep #17595

Merged

14 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

Uh oh!

robn commented Jul 11, 2025

Uh oh!

amotin left a comment •

edited

Loading

Uh oh!

markjdb commented Jul 13, 2025

Uh oh!

amotin commented Jul 13, 2025 •

edited

Loading

Uh oh!

markjdb commented Jul 14, 2025

Uh oh!

Uh oh!

FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

Uh oh!

Conversation

robn commented Jul 11, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

amotin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

markjdb commented Jul 13, 2025

Uh oh!

amotin commented Jul 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markjdb commented Jul 14, 2025

Uh oh!

Uh oh!

amotin left a comment •

edited

Loading

amotin commented Jul 13, 2025 •

edited

Loading