
Add allocation profile export and zleak utility for import #17576

Open: pcd1193182 (Contributor) wants to merge 1 commit into base: master

Sponsored by: [Klara, Inc.; Wasabi Technology, Inc.]

Motivation and Context

When attempting to debug performance problems on large systems, one of the major factors that affect performance is free space fragmentation. This heavily affects the allocation process, which is an area of active development in ZFS. Unfortunately, fragmenting a large pool for testing purposes is time-consuming; it usually involves filling the pool and then repeatedly overwriting data until the free space becomes fragmented, which can take many hours. And even if the time is available, artificial workloads rarely generate the same fragmentation patterns as the natural workloads they're attempting to mimic. Finally, for budgetary or procurement reasons, it may be difficult to source storage large enough to match what is used at customer or production sites.

Description

The core idea of the solution is this: if we know which regions are allocated on the production system we're trying to mimic, we don't need to repeat the process that produced them. We can skip straight to the final state by doing raw allocations of the same regions on the test system, with no data underlying them and no block pointers pointing to them.

This patch has two parts. First, in zdb, we add the ability to export the full allocation map of the pool. zdb iterates over each vdev, printing every allocated segment in the ms_allocatable range tree. This can be done while the pool is online, though if the export takes long enough we can run into issues as the older TXG we are reading starts to get overwritten. A checkpoint is a good way to preserve the system state at a single point in time for analysis while the system is serving IO.

The second part is a new utility called zleak (and its supporting library and kernel changes). This is a small Python program that invokes a new ioctl (via libzfs_core): zfs_ioc_raw_alloc. This ioctl takes in an nvlist of allocations to perform, and then allocates them. It does not currently store those allocations anywhere to make them reversible, and there is no corresponding raw_free ioctl (which would be extremely dangerous); this is an irreversible process, only intended for performance testing. The only way to reclaim the space afterwards is to destroy the pool or roll back to a checkpoint.

We verify that the system receiving the allocation profile has the same layout as the source system, to prevent any issues with violating ZFS's expectations or triggering assertions. This includes number of vdevs, number of metaslabs per vdev, and metaslab size. There is a -f option to allow profiles to skip the check for number of metaslabs per vdev, in which case allocations beyond the last metaslab will be dropped.
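The layout check described above can be sketched roughly as follows. This is an illustration only, not the code in this patch: `VdevLayout`, `check_layout`, and `drop_out_of_range` are hypothetical names, and the sketch assumes each segment is an `(offset, length)` pair in bytes that never straddles a metaslab boundary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VdevLayout:
    """Hypothetical per-vdev layout summary (not a real ZFS structure)."""
    metaslab_count: int
    metaslab_size: int  # bytes


# Hypothetical segment representation: (offset, length) within one vdev.
Segment = Tuple[int, int]


def check_layout(source: List[VdevLayout], target: List[VdevLayout],
                 force: bool = False) -> None:
    """Raise ValueError if the target pool cannot accept the source profile.

    With force=True, only the metaslab-count comparison is skipped,
    mirroring the -f behaviour described in the text.
    """
    if len(source) != len(target):
        raise ValueError("number of vdevs differs")
    for vdev_id, (src, dst) in enumerate(zip(source, target)):
        if src.metaslab_size != dst.metaslab_size:
            raise ValueError(f"vdev {vdev_id}: metaslab size differs")
        if src.metaslab_count != dst.metaslab_count and not force:
            raise ValueError(f"vdev {vdev_id}: metaslab count differs")


def drop_out_of_range(segments: List[Segment],
                      layout: VdevLayout) -> List[Segment]:
    """Keep only segments that end at or before the last target metaslab."""
    limit = layout.metaslab_count * layout.metaslab_size
    return [(off, length) for off, length in segments
            if off + length <= limit]
```

In this sketch a forced import onto a vdev with fewer metaslabs silently loses the tail of the profile, so the resulting fragmentation only approximates the source system.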

How Has This Been Tested?

Tested with the ZFS test suite to ensure no regressions. The new utility and functionality have been used to perform performance testing multiple times. Also manually verified space map contents to ensure that the allocation mapping matches the original system.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


pcd1193182 force-pushed the frag_copy branch 2 times, most recently from 1a44f2e to 0f38484 (July 28, 2025 21:01)
behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label (Jul 29, 2025)
When attempting to debug performance problems on large systems, one of
the major factors that affect performance is free space
fragmentation. This heavily affects the allocation process, which is an
area of active development in ZFS. Unfortunately, fragmenting a large
pool for testing purposes is time consuming; it usually involves filling
the pool and then repeatedly overwriting data until the free space
becomes fragmented, which can take many hours. And even if the time is
available, artificial workloads rarely generate the same fragmentation
patterns as the natural workloads they're attempting to mimic.

This patch has two parts. First, in zdb, we add the ability to export
the full allocation map of the pool. It iterates over each vdev,
printing every allocated segment in the ms_allocatable range tree. This
can be done while the pool is online, though in that case the allocation
map may actually be from several different TXGs as new ones are loaded
on demand.

The second is a new utility called zleak (and its supporting library and
kernel changes). This is a small python program that invokes a new ioctl
(via libzfs_core): zfs_ioc_raw_alloc. This ioctl takes in an nvlist of
allocations to perform, and then allocates them. It does not currently
store those allocations anywhere to make them reversible, and there is
no corresponding raw_free ioctl (which would be extremely dangerous);
this is an irreversible process, only intended for performance
testing. The only way to reclaim the space afterwards is to destroy the
pool or roll back to a checkpoint.

Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.