-
Notifications
You must be signed in to change notification settings - Fork 1.9k
FreeBSD: zfs_putpages: don't undirty pages until after write completes (try #2) #17533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit 238eab7. Reported by: Ihor Antonov <[email protected]> Discussed with: Rob Norris <[email protected]> References: freebsd/freebsd-src@738a9a7 Ported-by: Rob Norris <[email protected]> Signed-off-by: Rob Norris <[email protected]>
In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I just wonder: on Linux we have to always use sync semantics, because otherwise we may not receive anything on msync()
call after all the pages written asynchronously, that would result in not providing promised guaranties. I wonder if this is different on FreeBSD, or with this change we are only doing best we can do (better than before) at reasonable cost?
Sorry, I'm not very familiar with this area of ZFS, so I have some dumb questions:
It's basically a read-only state: the page is being written to storage asynchronously, and must not be modified in that state. In particular, there are no writable mappings of the page while it's in this state, so a write from userspace will trigger a page fault, which will block until the writeback is complete. On Linux, I see there is some per-mapping STABLE_WRITES flag which I think provides the same effect, but it's not the default. It seems to only be used if there's some data checksumming in play which requires written-back data to be consistent with checksums computed before the writeback begins. Offhand, I'm not sure exactly why FreeBSD requires writes to be stable always. |
Ah. I've mixed it with different context. There are several cases when we have to force kernel/mmap caches to be flushed, see
Will we be called in any way to trigger
Having pages stable would be great for O_DIRECT. But for it to be possible, kernel should give us pages in granularity of recordsize, which I suspect might not be true, and should be ready to wait for synchronous writes without performance penalty. I have doubts it is realistic enough to bother implementing it. |
Well, the problem with the original commit is that we were holding the pages for too long for async writes. Specifically, we were holding them until the corresponding TXG flush (I think), so pages could stay sbusy for several seconds(!).
In the async case, no, I don't think so. |
In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17533
…completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit 238eab7. Reported by: Ihor Antonov <[email protected]> Discussed with: Rob Norris <[email protected]> References: freebsd/freebsd-src@738a9a7 Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Ported-by: Rob Norris <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533
In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (238eab7) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Mark Johnston <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17533
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
After #17445 was pulled into FreeBSD @markjdb discovered a fairly serious performance regression due to my misunderstanding a difference between how Linux and FreeBSD lock out pages during writebacks.
Mark reverted that change in FreeBSD (freebsd/freebsd-src@738a9a7), and I took another swing at it. Here's what I've got.
Description
First commit reverts the original (238eab7), using the commit message, author and timestamp from freebsd/freebsd-src@738a9a7. That's mostly there to not complicate importing OpenZFS into FreeBSD in the future, but it's useful history too
The second commit is the do-over.
On FreeBSD, writeback is done with the page in an "sbusy" state, which (aiui) is a full and total lockout - it cannot be used or modified while in that state. So in the async case, keeping the page busy until the itx callback fires could have the page locked out until after the txg sync happens, maybe even whole seconds in the future. Linux, by constrast, does writeback under a much "softer" lock, that keeps the page from being evicted while it's being written out, but doesn't make it unusable.
So this commit splits the difference between the old and the new: for sync, we log the pages with a callback and call
zil_commit()
. For async, we do what we did before - undirty and returnOK
, and let the kernel unbusy them. The data is still copied to the DMU and to the ZIL, so it will still be written out just fine, and the page is now separately clean and evictable.How Has This Been Tested?
The
mmap
test tag passes on FreeBSD 14.3. With #17398 rebased on top, so does thefailmode
tag.Types of changes
Checklist:
Signed-off-by
.