[SH] add userfault support #5261

Merged
merged 18 commits into from
Jun 26, 2025

Conversation

kalyazin
Contributor

@kalyazin kalyazin commented Jun 13, 2025

Changes

Implement userfault support in Secret Freedom. The goal of this change is to be able to resume Secret-Free VMs via UFFD.

Major changes:

  • Firecracker sends the guest_memfd and the userfault bitmap memfd to the UFFD handler. The UFFD handler writes to the guest_memfd to populate guest pages and clears bits in the userfault bitmap (memfd) to stop KVM from sending vCPU fault notifications.
  • vCPU faults on guest_memfd cause VM exits. Once a vCPU exits to userspace on a fault, it sends a fault request to the VMM thread via a pipe, and the VMM thread forwards it to the UFFD handler.
  • Firecracker- and KVM-triggered faults are delivered to the UFFD handler via minor UFFD notifications, and the UFFD handler unblocks the faulting process via UFFDIO_CONTINUE.

Reason

This is needed to be able to restore snapshots where the VM was backed by guest_memfd.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • [ ] I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • [ ] I have mentioned all user-facing changes in CHANGELOG.md.
  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [ ] When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • [ ] I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • [ ] I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 29.70297% with 284 lines in your changes missing coverage. Please review.

Project coverage is 81.83%. Comparing base (00ac2f3) to head (7622c4c).
Report is 18 commits behind head on feature/secret-hiding.

Files with missing lines        Patch %   Lines
src/vmm/src/lib.rs              1.78%     110 Missing ⚠️
src/vmm/src/builder.rs          34.90%    69 Missing ⚠️
src/vmm/src/vstate/vcpu.rs      26.56%    47 Missing ⚠️
src/vmm/src/persist.rs          37.77%    28 Missing ⚠️
src/vmm/src/vstate/memory.rs    0.00%     19 Missing ⚠️
src/vmm/src/vstate/vm.rs        81.03%    11 Missing ⚠️
Additional details and impacted files
@@                    Coverage Diff                    @@
##           feature/secret-hiding    #5261      +/-   ##
=========================================================
- Coverage                  82.52%   81.83%   -0.69%     
=========================================================
  Files                        250      250              
  Lines                      27386    27700     +314     
=========================================================
+ Hits                       22599    22668      +69     
- Misses                      4787     5032     +245     
Flag Coverage Δ
5.10-c5n.metal 81.96% <23.76%> (-0.95%) ⬇️
5.10-m5n.metal 82.00% <23.76%> (-0.91%) ⬇️
5.10-m6a.metal 81.11% <23.76%> (-0.98%) ⬇️
5.10-m6g.metal 77.92% <23.19%> (-0.78%) ⬇️
5.10-m6i.metal 81.95% <23.76%> (-0.95%) ⬇️
5.10-m7a.metal-48xl 81.10% <23.76%> (-0.98%) ⬇️
5.10-m7g.metal 77.92% <23.19%> (-0.78%) ⬇️
5.10-m7i.metal-24xl 81.91% <23.76%> (-0.95%) ⬇️
5.10-m7i.metal-48xl 81.91% <23.76%> (-0.95%) ⬇️
5.10-m8g.metal-24xl 77.91% <23.19%> (-0.78%) ⬇️
5.10-m8g.metal-48xl 77.91% <23.19%> (-0.78%) ⬇️
6.1-c5n.metal 82.00% <23.76%> (-0.95%) ⬇️
6.1-m5n.metal 82.00% <23.76%> (-0.95%) ⬇️
6.1-m6a.metal 81.16% <23.76%> (-0.98%) ⬇️
6.1-m6g.metal 77.92% <23.19%> (-0.78%) ⬇️
6.1-m6i.metal 81.99% <23.76%> (-0.96%) ⬇️
6.1-m7a.metal-48xl 81.15% <23.76%> (-0.98%) ⬇️
6.1-m7g.metal 77.91% <23.19%> (-0.79%) ⬇️
6.1-m7i.metal-24xl 82.00% <23.76%> (-0.96%) ⬇️
6.1-m7i.metal-48xl 82.01% <23.76%> (-0.95%) ⬇️
6.1-m8g.metal-24xl 77.91% <23.19%> (-0.78%) ⬇️
6.1-m8g.metal-48xl 77.91% <23.19%> (-0.78%) ⬇️
6.14-c5n.metal 82.05% <29.20%> (-0.87%) ⬇️
6.14-m5n.metal 82.04% <29.20%> (-0.89%) ⬇️
6.14-m6a.metal 81.21% <29.20%> (-0.90%) ⬇️
6.14-m6g.metal 77.95% <28.67%> (-0.70%) ⬇️
6.14-m6i.metal 82.04% <29.20%> (-0.88%) ⬇️
6.14-m7a.metal-48xl 81.20% <29.20%> (-0.90%) ⬇️
6.14-m7g.metal 77.96% <28.67%> (-0.70%) ⬇️
6.14-m7i.metal-24xl 82.06% <29.20%> (-0.87%) ⬇️
6.14-m7i.metal-48xl 82.06% <29.20%> (-0.87%) ⬇️
6.14-m8g.metal-24xl 77.95% <28.67%> (-0.71%) ⬇️
6.14-m8g.metal-48xl 77.96% <28.67%> (-0.70%) ⬇️

@kalyazin kalyazin force-pushed the sh_uf branch 2 times, most recently from 286efbe to 4e10e54 Compare June 13, 2025 19:55
@kalyazin kalyazin marked this pull request as ready for review June 16, 2025 07:26
@kalyazin kalyazin changed the title [WIP][SH] add userfault support to UFFD handlers [SH] add userfault support to UFFD handlers Jun 16, 2025
@kalyazin kalyazin force-pushed the sh_uf branch 3 times, most recently from b6185cb to 60abeb9 Compare June 17, 2025 10:42
@kalyazin kalyazin force-pushed the sh_uf branch 4 times, most recently from d5e7aa8 to 40101cd Compare June 19, 2025 11:41
@kalyazin kalyazin mentioned this pull request Jun 19, 2025
10 tasks
@kalyazin kalyazin changed the title [SH] add userfault support to UFFD handlers [SH] add userfault support Jun 19, 2025
@kalyazin kalyazin self-assigned this Jun 19, 2025
JackThomson2
JackThomson2 previously approved these changes Jun 19, 2025
@kalyazin kalyazin force-pushed the sh_uf branch 4 times, most recently from ea14bbc to 6ff118a Compare June 25, 2025 17:38
@kalyazin kalyazin requested a review from roypat June 25, 2025 21:06
Contributor

@roypat roypat left a comment

last round of comments :)

we'll need to keep track that at some point later we implement proper error handling instead of panic!-ing all over the place in Firecracker, but let's just get this into the feature branch to unblock everyone else and deal with that later

kalyazin and others added 18 commits June 26, 2025 10:03
This is needed because if guest_memfd is used to back guest memory, vCPU
fault notifications are delivered via the UFFD UDS socket.

Signed-off-by: Nikita Kalyazin <[email protected]>
It is used by Secret-Free-enabled UFFD handlers to disable vCPU fault
notifications from the kernel.

Signed-off-by: Nikita Kalyazin <[email protected]>
Accept receiving 3 fds instead of 1, where fds[1] is guest_memfd and
fds[2] is userfault bitmap memfd.

Also handle the FaultRequest message over the UDS socket by calling a
new callback in the Runtime and sending a FaultReply.
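
A minimal sketch of how a handler might tell the two handshake shapes apart, assuming the fds arrive as a plain slice in the order described above (UFFD first, then guest_memfd, then the userfault bitmap memfd). The `HandshakeFds` type and `classify_fds` function are hypothetical names used for illustration, not the ones in this PR.

```rust
/// Hypothetical view of the file descriptors received in the handshake.
#[derive(Debug, PartialEq, Eq)]
enum HandshakeFds {
    /// Regular VM: only the UFFD is sent.
    Regular { uffd: i32 },
    /// Secret Free VM: UFFD, guest_memfd and userfault bitmap memfd.
    SecretFree {
        uffd: i32,
        guest_memfd: i32,
        userfault_bitmap_memfd: i32,
    },
}

/// Classify the received fds by count (1 or 3); anything else is rejected.
/// Assumption: fds[0] is the UFFD in both cases.
fn classify_fds(fds: &[i32]) -> Result<HandshakeFds, String> {
    match fds {
        [uffd] => Ok(HandshakeFds::Regular { uffd: *uffd }),
        [uffd, guest_memfd, bitmap] => Ok(HandshakeFds::SecretFree {
            uffd: *uffd,
            guest_memfd: *guest_memfd,
            userfault_bitmap_memfd: *bitmap,
        }),
        _ => Err(format!("expected 1 or 3 fds, got {}", fds.len())),
    }
}
```

For example, `classify_fds(&[3, 4, 5])` yields the `SecretFree` variant, while a 2-fd message is reported as an error rather than guessed at.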

Co-authored-by: Patrick Roy <[email protected]>
Signed-off-by: Patrick Roy <[email protected]>
Signed-off-by: Nikita Kalyazin <[email protected]>
There are two ways a UFFD handler receives a fault notification if
Secret Freedom is enabled (which is inferred from 3 fds sent by
Firecracker instead of 1):
 - a VMM- or KVM-triggered fault is delivered via a minor UFFD fault
   event.  The handler is supposed to respond to it via memcpying the
   content of the page (if the page hasn't already been populated)
   followed by a UFFDIO_CONTINUE call.
 - a vCPU-triggered fault is delivered via a FaultRequest message on
   the UDS socket.  The handler is supposed to reply with a pwrite64
   call on the guest_memfd to populate the page followed by a FaultReply
   message on the UDS socket.

In both cases, the handler also needs to clear the bit in the userfault
bitmap at the corresponding offset in order to stop further fault
notifications for the same page.
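
The two paths above can be summarised as a small decision table. The sketch below only models the order of steps, it performs no I/O. The `FaultSource`, `Step` and `plan` names are hypothetical. Two points are assumptions rather than statements from the commit message: the bit is cleared right after the page is populated, and an already populated page skips the populate step on the FaultRequest path as well.

```rust
/// How the fault reached the handler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FaultSource {
    /// VMM- or KVM-triggered fault, delivered as a minor UFFD event.
    MinorUffdEvent,
    /// vCPU-triggered fault, delivered as a FaultRequest on the UDS socket.
    FaultRequest,
}

/// One action the handler takes while resolving a fault.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    MemcpyPage,
    PwriteGuestMemfd,
    ClearBitmapBit,
    UffdioContinue,
    SendFaultReply,
}

/// Return the steps for one fault. `already_populated` stands for
/// "the bit in the userfault bitmap is already clear".
fn plan(source: FaultSource, already_populated: bool) -> Vec<Step> {
    let mut steps = Vec::new();
    if !already_populated {
        // Populate the page through the mechanism that matches the path.
        steps.push(match source {
            FaultSource::MinorUffdEvent => Step::MemcpyPage,
            FaultSource::FaultRequest => Step::PwriteGuestMemfd,
        });
        // Stop further notifications for this page (assumed ordering).
        steps.push(Step::ClearBitmapBit);
    }
    // Always unblock whoever is waiting on the fault.
    steps.push(match source {
        FaultSource::MinorUffdEvent => Step::UffdioContinue,
        FaultSource::FaultRequest => Step::SendFaultReply,
    });
    steps
}
```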

UFFD handlers use the userfault bitmap for two purposes:
 - communicate to the kernel whether a fault at the corresponding
   guest_memfd offset will cause a VM exit
 - keep track of pages that have already been populated in order to
   avoid overwriting the content of the page that is already
   initialised.
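
A rough sketch of the handler-side bitmap operations, assuming one bit per 4 KiB page and the usual kernel bitmap layout (bit `n` lives in 64-bit word `n / 64` at position `n % 64`). In the real handler the words would be a shared mapping of the memfd. Here they are an in-process `Vec` so the example stays self-contained. `UserfaultBitmap` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed page size; the PR text does not state one.
const PAGE_SIZE: u64 = 4096;

/// In-memory stand-in for the memfd-backed userfault bitmap.
struct UserfaultBitmap {
    words: Vec<AtomicU64>,
}

impl UserfaultBitmap {
    fn new(words: Vec<u64>) -> Self {
        Self {
            words: words.into_iter().map(AtomicU64::new).collect(),
        }
    }

    /// Map a guest_memfd offset to (word index, bit mask).
    fn locate(offset: u64) -> (usize, u64) {
        let page = offset / PAGE_SIZE;
        ((page / 64) as usize, 1u64 << (page % 64))
    }

    /// True if a fault on this page would still be reported,
    /// i.e. the page has not been populated yet.
    fn is_set(&self, offset: u64) -> bool {
        let (word, mask) = Self::locate(offset);
        self.words[word].load(Ordering::Acquire) & mask != 0
    }

    /// Clear the bit atomically. Returns true only for the caller that
    /// actually flipped it, so concurrent faults on one page can tell
    /// which of them was first.
    fn clear(&self, offset: u64) -> bool {
        let (word, mask) = Self::locate(offset);
        self.words[word].fetch_and(!mask, Ordering::AcqRel) & mask != 0
    }
}
```

Using `fetch_and` rather than a separate load and store is a design choice for this sketch: it keeps a concurrent clear of a neighbouring bit in the same word from being lost.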

Signed-off-by: Nikita Kalyazin <[email protected]>
These are used for communication of page faults between Firecracker and
a UFFD handler.

Signed-off-by: Nikita Kalyazin <[email protected]>
If configured, userfault bitmap is registered with KVM and controls
whether KVM will exit to userspace on a fault of the corresponding page.

We are going to allocate the bitmap in a memfd in Firecracker, set bits
for all pages to request notifications for vCPU faults and send
it to the UFFD handler to delegate clearing the bits as pages get
populated.
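
As an illustration of the sizing involved, the sketch below computes the bitmap length and its initial contents, assuming one bit per page and 64-bit words. Leaving the unused bits of the last word clear is an assumption of this sketch, and both function names are hypothetical.

```rust
/// Bytes needed for a bitmap with one bit per page, rounded up to
/// whole 64-bit words.
fn bitmap_len_bytes(mem_size: u64, page_size: u64) -> u64 {
    let pages = mem_size.div_ceil(page_size);
    pages.div_ceil(64) * 8
}

/// Initial bitmap contents: every page's bit set, so that every first
/// vCPU fault is reported to userspace.
fn initial_bitmap(mem_size: u64, page_size: u64) -> Vec<u64> {
    let pages = mem_size.div_ceil(page_size);
    let mut words = vec![u64::MAX; pages.div_ceil(64) as usize];
    let tail = pages % 64;
    if tail != 0 {
        // Only the low `tail` bits of the last word map to real pages.
        if let Some(last) = words.last_mut() {
            *last = (1u64 << tail) - 1;
        }
    }
    words
}
```

Under these assumptions a 128 MiB guest with 4 KiB pages has 32768 pages and needs a 4096-byte bitmap.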

Since the KVM userfault patches are still in review,
set_user_memory_region2 is not aware of the userfault flag and the
userfault bitmap address in its input structure.  Define it in
Firecracker code temporarily.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is needed to instruct the kernel to exit to userspace when a vCPU
fault occurs and the corresponding bit in the userfault bitmap is set.

The userfault bitmap is allocated in a memfd by Firecracker and sent to
the UFFD handler.

This also sends 3 fds to the UFFD handler in the handshake:
 - UFFD (original)
 - guest_memfd: for the handler to be able to populate guest memory
 - userfault bitmap memfd: for the handler to be able to disable exits
   to userspace for the pages that have already been populated

Signed-off-by: Nikita Kalyazin <[email protected]>
This will be removed after upgrading to a new version of mmap support
kernel patches.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because vCPUs reason in GPAs while the secret-free UFFD
protocol is guest_memfd-offset-based.

Note that offset_to_gpa is not used yet, but will likely be needed to
support async PF to pass the GPA to a new ioctl when notifying KVM of a
fault resolution.
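
A minimal sketch of the two conversions, assuming guest memory is described by a list of regions that each map a contiguous GPA range to a contiguous guest_memfd range. The `Region` struct and its field names are hypothetical. Only the name offset_to_gpa comes from the commit message, and `gpa_to_offset` is assumed by analogy.

```rust
/// Hypothetical description of one guest memory region.
#[derive(Debug, Clone, Copy)]
struct Region {
    /// First guest physical address covered by the region.
    gpa_start: u64,
    /// Region size in bytes.
    size: u64,
    /// Offset of the region's first byte inside the guest_memfd.
    memfd_offset: u64,
}

/// Translate a guest physical address to a guest_memfd offset.
/// Returns None if no region contains the address.
fn gpa_to_offset(regions: &[Region], gpa: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| gpa >= r.gpa_start && gpa - r.gpa_start < r.size)
        .map(|r| r.memfd_offset + (gpa - r.gpa_start))
}

/// Translate a guest_memfd offset back to a guest physical address.
/// Returns None if no region contains the offset.
fn offset_to_gpa(regions: &[Region], offset: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| offset >= r.memfd_offset && offset - r.memfd_offset < r.size)
        .map(|r| r.gpa_start + (offset - r.memfd_offset))
}
```

The two address spaces differ whenever the GPA layout has a hole: with a region at GPA 0x1_0000_0000 placed at memfd offset 0xC000_0000, GPA 0x1_0000_1000 maps to offset 0xC000_1000.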

Signed-off-by: Nikita Kalyazin <[email protected]>
It contains two parts:
 - external: between the VMM thread and the UFFD handler
 - internal: between vCPUs and the VMM thread

An outline of the workflow:
 - When a vCPU fault occurs, vCPU exits to userspace
 - The vCPU thread sends the exit syndrome over the vCPU-to-VMM
   channel and writes to the eventfd
 - The VMM thread forwards the syndrome to the UFFD handler via the UDS
   socket
 - The UFFD handler populates the page, clears the corresponding bit in
   the userfault bitmap and sends a reply to Firecracker
 - The VMM thread receives the reply and updates a vCPU condvar to
   notify the vCPU that the fault has been resolved
 - The vCPU resumes execution
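
The internal half of this workflow can be sketched with standard library primitives only. This is an illustration under stated assumptions, not the code in this PR: a blocking `mpsc` receive stands in for the eventfd wakeup, the round trip to the UFFD handler is elided, and `Userfault`, `Resolved` and `simulate` are hypothetical names.

```rust
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical "exit syndrome" a vCPU sends to the VMM thread.
struct Userfault {
    vcpu: usize,
    offset: u64,
}

/// Per-vCPU "fault resolved" flag guarded by a condvar.
type Resolved = Arc<(Mutex<bool>, Condvar)>;

/// Run one fault per vCPU through the channel and return the offsets
/// the VMM side forwarded, sorted for a deterministic result.
fn simulate(offsets: Vec<u64>) -> Vec<u64> {
    let (to_vmm, from_vcpus) = mpsc::channel::<Userfault>();
    let flags: Vec<Resolved> = (0..offsets.len())
        .map(|_| Arc::new((Mutex::new(false), Condvar::new())))
        .collect();

    let vcpus: Vec<_> = offsets
        .into_iter()
        .enumerate()
        .map(|(vcpu, offset)| {
            let tx = to_vmm.clone();
            let flag = Arc::clone(&flags[vcpu]);
            thread::spawn(move || {
                // vCPU exited to userspace on a fault: report it, then block.
                tx.send(Userfault { vcpu, offset }).unwrap();
                let (lock, cvar) = &*flag;
                let mut resolved = lock.lock().unwrap();
                while !*resolved {
                    resolved = cvar.wait(resolved).unwrap();
                }
                // The vCPU would re-enter the guest here.
            })
        })
        .collect();
    drop(to_vmm); // the loop below ends once every vCPU thread has finished

    let mut forwarded = Vec::new();
    for fault in from_vcpus {
        // Forwarding to the UFFD handler and waiting for its reply go here.
        forwarded.push(fault.offset);
        let (lock, cvar) = &*flags[fault.vcpu];
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    }
    for handle in vcpus {
        handle.join().unwrap();
    }
    forwarded.sort();
    forwarded
}
```

The `while !*resolved` loop guards against spurious condvar wakeups, and setting the flag under the mutex means a notification sent before the vCPU starts waiting is not lost.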

Note that as a result of this change, the ability to exit the VM
gracefully is lost (at least on x86).  In the existing implementation,
the VMM thread initiated an exit if an event was read from the eventfd
but, for an unknown reason, no VcpuResponse::Exited responses were read.
Since the exit_evt eventfd is now also used by vCPUs to notify the VMM
thread of VM exits caused by page faults, this situation (an eventfd
event, but no response in the channel) can also occur because we have
already read all VcpuResponse::Userfault responses while handling the
previous eventfd event.

Signed-off-by: Nikita Kalyazin <[email protected]>
kvmclock is currently not supported by Secret Freedom and calling
kvmclock_ctrl will always fail.

Signed-off-by: Nikita Kalyazin <[email protected]>
In a regular VM, we mmap the memory snapshot file and supply the address
in the KVM memory slot.  In Secret Free VMs, we provide guest_memfd in
the memory slot instead.  There is no way we can restore a Secret Free
VM from a file, unless we prepopulate the guest_memfd with the file
content, which is inefficient and is not practically useful.

Signed-off-by: Nikita Kalyazin <[email protected]>
It is not supported by Secret Freedom.

Signed-off-by: Nikita Kalyazin <[email protected]>
This includes both functional and performance tests.

Signed-off-by: Nikita Kalyazin <[email protected]>
Do not add a balloon device to a Secret Free VM as it is not currently
supported.

Signed-off-by: Nikita Kalyazin <[email protected]>
When taking a snapshot from a Secret Free VM, we create a bounce buffer
to be able to pass it to the host kernel to store in a file.  Exclude it
from the memory monitor calculation.

Signed-off-by: Nikita Kalyazin <[email protected]>
This is because the error type has changed due to the implementation of
snapshot restore support for Secret Free VMs.

Signed-off-by: Nikita Kalyazin <[email protected]>
Graceful shutdown is currently broken on x86_64.

Signed-off-by: Nikita Kalyazin <[email protected]>
@kalyazin kalyazin enabled auto-merge (rebase) June 26, 2025 11:43
@kalyazin kalyazin merged commit a487771 into firecracker-microvm:feature/secret-hiding Jun 26, 2025
5 of 7 checks passed
@kalyazin kalyazin deleted the sh_uf branch June 26, 2025 13:27
3 participants