Skip to content

FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Jul 11, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

After #17445 was pulled into FreeBSD @markjdb discovered a fairly serious performance regression due to my misunderstanding a difference between how Linux and FreeBSD lock out pages during writebacks.

Mark reverted that change in FreeBSD (freebsd/freebsd-src@738a9a7), and I took another swing at it. Here's what I've got.

Description

First commit reverts the original (238eab7), using the commit message, author and timestamp from freebsd/freebsd-src@738a9a7. That's mostly there to not complicate importing OpenZFS into FreeBSD in the future, but it's useful history too

The second commit is the do-over.

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state. So in the async case, keeping the page busy until the itx callback fires could have the page locked out until after the txg sync happens, maybe even whole seconds in the future. Linux, by constrast, does writeback under a much "softer" lock, that keeps the page from being evicted while it's being written out, but doesn't make it unusable.

So this commit splits the difference between the old and the new: for sync, we log the pages with a callback and call zil_commit(). For async, we do what we did before - undirty and return OK, and let the kernel unbusy them. The data is still copied to the DMU and to the ZIL, so it will still be written out just fine, and the page is now separately clean and evictable.

How Has This Been Tested?

The mmap test tag passes on FreeBSD 14.3. With #17398 rebased on top, so does the failmode tag.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

markjdb and others added 2 commits July 11, 2025 10:17
…completes"

This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7.

Reported by:    Ihor Antonov <[email protected]>
Discussed with: Rob Norris <[email protected]>

References: freebsd/freebsd-src@738a9a7
Ported-by: Rob Norris <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn assigned robn and unassigned robn Jul 11, 2025
Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I just wonder: on Linux we have to always use sync semantics, because otherwise we may not receive anything on msync() call after all the pages written asynchronously, that would result in not providing promised guaranties. I wonder if this is different on FreeBSD, or with this change we are only doing best we can do (better than before) at reasonable cost?

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jul 11, 2025
@markjdb
Copy link
Contributor

markjdb commented Jul 13, 2025

Looks good to me. I just wonder: on Linux we have to always use sync semantics, because otherwise we may not receive anything on msync() call after all the pages written asynchronously, that would result in not providing promised guaranties. I wonder if this is different on FreeBSD, or with this change we are only doing best we can do (better than before) at reasonable cost?

Sorry, I'm not very familiar with this area of ZFS, so I have some dumb questions:

  • Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?
  • We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state.

It's basically a read-only state: the page is being written to storage asynchronously, and must not be modified in that state. In particular, there are no writable mappings of the page while it's in this state, so a write from userspace will trigger a page fault, which will block until the writeback is complete.

On Linux, I see there is some per-mapping STABLE_WRITES flag which I think provides the same effect, but it's not the default. It seems to only be used if there's some data checksumming in play which requires written-back data to be consistent with checksums computed before the writeback begins.

Offhand, I'm not sure exactly why FreeBSD requires writes to be stable always.

@amotin
Copy link
Member

amotin commented Jul 13, 2025

@markjdb:

Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?

Ah. I've mixed it with different context. There are several cases when we have to force kernel/mmap caches to be flushed, see zn_flush_cached_data() uses. And in that case we have to always request it to be sync, even if we don't care there, since if we do it async, then on later msync(MS_SYNC) from application kernel won't call us at all. In the context of Linux putpage though the zil_commit() use is correct. But the other question is: since we immediately copy the pages into DMU (at least until we support O_DIRECT in this context), why should we hold the pages at all for async writes, not allowing kernel to recycle them? IIRC it was also related to later syncs.

We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

Will we be called in any way to trigger zil_commit() once we release all the pages? IIRC on Linux we won't.

On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state.

It's basically a read-only state: the page is being written to storage asynchronously, and must not be modified in that state. In particular, there are no writable mappings of the page while it's in this state, so a write from userspace will trigger a page fault, which will block until the writeback is complete.

On Linux, I see there is some per-mapping STABLE_WRITES flag which I think provides the same effect, but it's not the default. It seems to only be used if there's some data checksumming in play which requires written-back data to be consistent with checksums computed before the writeback begins.

Offhand, I'm not sure exactly why FreeBSD requires writes to be stable always.

Having pages stable would be great for O_DIRECT. But for it to be possible, kernel should give us pages in granularity of recordsize, which I suspect might not be true, and should be ready to wait for synchronous writes without performance penalty. I have doubts it is realistic enough to bother implementing it.

@markjdb
Copy link
Contributor

markjdb commented Jul 14, 2025

@markjdb:

Why do you say Linux always uses sync semantics? It looks like the zil_commit() call is conditional, but maybe commit is always true in practice? Or do you mean something else?

Ah. I've mixed it with different context. There are several cases when we have to force kernel/mmap caches to be flushed, see zn_flush_cached_data() uses. And in that case we have to always request it to be sync, even if we don't care there, since if we do it async, then on later msync(MS_SYNC) from application kernel won't call us at all. In the context of Linux putpage though the zil_commit() use is correct. But the other question is: since we immediately copy the pages into DMU (at least until we support O_DIRECT in this context), why should we hold the pages at all for async writes, not allowing kernel to recycle them? IIRC it was also related to later syncs.

Well, the problem with the original commit is that we were holding the pages for too long for async writes. Specifically, we were holding them until the corresponding TXG flush (I think), so pages could stay sbusy for several seconds(!).

We are calling zil_commit() for synchronous writes. That seems to be how VOP_FSYNC is implemented as well. Why isn't that sufficient for msync(MS_SYNC)?

Will we be called in any way to trigger zil_commit() once we release all the pages? IIRC on Linux we won't.

In the async case, no, I don't think so.

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 14, 2025
behlendorf pushed a commit that referenced this pull request Jul 15, 2025
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17533
@robn robn mentioned this pull request Aug 13, 2025
14 tasks
amotin pushed a commit to amotin/zfs that referenced this pull request Aug 13, 2025
…completes"

This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7.

Reported by:    Ihor Antonov <[email protected]>
Discussed with: Rob Norris <[email protected]>

References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Ported-by: Rob Norris <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17533
amotin pushed a commit to amotin/zfs that referenced this pull request Aug 13, 2025
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Mark Johnston <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17533
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants