Pick required FIPS changes + Add FIPS RNG performance overhaul #372


Merged

Conversation

kerneltoast

This adds the missing baseline FIPS changes ported over by @jallisonciq, in addition to my new FIPS RNG performance overhaul.

jallisonciq and others added 10 commits June 25, 2025 16:53
        Using the kernel crypto API, the SHA3-256 algorithm is used as the
        conditioning element to replace the LFSR in the Jitter RNG. All other
        parts of the Jitter RNG are unchanged.

        The application and use of the SHA-3 conditioning operation is identical
        to the user space Jitter RNG 3.4.0 by applying the following concept:

        - the Jitter RNG initializes a SHA-3 state which acts as the "entropy
          pool" when the Jitter RNG is allocated.

        - When a new time delta is obtained, it is inserted into the "entropy
          pool" with a SHA-3 update operation. Note, in most cases this
          operation is a simple memcpy() onto the SHA-3 stack.

        - To cause a true SHA-3 operation for each time delta operation, a
          second SHA-3 operation is performed hashing Jitter RNG status
          information. The final message digest is also inserted into the
          "entropy pool" with a SHA-3 update operation. Yet, this data is not
          considered to provide any entropy, but it shall stir the entropy pool.

        - To generate a random number, a SHA-3 final operation is performed to
          calculate a message digest followed by an immediate SHA-3 init to
          re-initialize the "entropy pool". The obtained message digest is one
          block of the Jitter RNG that is returned to the caller.

        Mathematically speaking, the random number generated by the Jitter RNG
        is:

        aux_i = SHA-3(Jitter RNG state data)

        Jitter RNG block = SHA-3(time_i || aux_i || time_(i-1) || aux_(i-1) ||
                                 ... || time_(i-255) || aux_(i-255))

        when assuming that the OSR = 1, i.e. the default value.
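As a rough userspace model of the scheme described above (not the kernel implementation), the SHA-3 "entropy pool" can be sketched in Python with hashlib's sha3_256; the class and method names here are purely illustrative:

```python
import hashlib

class JitterPoolModel:
    """Toy model of the SHA-3 "entropy pool": an incrementally updated
    SHA3-256 state that is finalized and re-initialized on generation."""

    def __init__(self):
        # The Jitter RNG initializes a SHA-3 state acting as the pool.
        self.pool = hashlib.sha3_256()

    def insert_time_delta(self, delta: int, state_data: bytes) -> None:
        # Insert the raw time delta; update() mostly just buffers it.
        self.pool.update(delta.to_bytes(8, "little"))
        # Force a true SHA-3 operation by hashing RNG status data, then
        # stir the pool with the digest (credited with zero entropy).
        aux = hashlib.sha3_256(state_data).digest()
        self.pool.update(aux)

    def generate_block(self) -> bytes:
        # SHA-3 final produces one 256-bit output block, after which the
        # pool is immediately re-initialized.
        block = self.pool.digest()
        self.pool = hashlib.sha3_256()
        return block
```

With OSR = 1, 256 time deltas would be inserted before each `generate_block()` call, matching the 256-bit output block size described above.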

        This operation implies that the Jitter RNG has an output-blocksize of
        256 bits instead of the 64 bits of the LFSR-based Jitter RNG that is
        replaced with this patch.

        The patch also replaces the varying number of invocations of the
        conditioning function with one fixed number of invocations. The use
        of the conditioning function is consistent with the userspace Jitter
        RNG library version 3.4.0.

        The code is tested with a system that exhibited the least amount of
        entropy generated by the Jitter RNG: the SiFive Unmatched RISC-V
        system. The measured entropy rate is well above the heuristically
        implied entropy value of 1 bit of entropy per time delta. On all other
        tested systems, the measured entropy rate is even higher by orders
        of magnitude. The measurement was performed using updated tooling
        provided with the user space Jitter RNG library test framework.

        The performance of the Jitter RNG with this patch is roughly on par
        with the performance of the Jitter RNG without the patch.

        Signed-off-by: Stephan Mueller <[email protected]>
        Signed-off-by: Herbert Xu <[email protected]>

            Back-port of commit bb897c5
            Author: Stephan Müller <[email protected]>
            Date:   Fri Apr 21 08:08:04 2023 +0200

Signed-off-by: Jeremy Allison <[email protected]>
    I.G 9.7.B for FIPS 140-3 specifies that variables temporarily holding
    cryptographic information should be zeroized once they are no longer
    needed. Accomplish this by using kfree_sensitive for buffers that
    previously held the private key.

    Signed-off-by: Hailey Mothershead <[email protected]>
    Signed-off-by: Herbert Xu <[email protected]>

        Back-ported from commit 23e4099
        Author: Hailey Mothershead <[email protected]>
        Date:   Mon Apr 15 22:19:15 2024 +0000

Signed-off-by: Jeremy Allison <[email protected]>
private_key is overwritten with the key parameter passed in by the
caller (if present), or alternatively a newly generated private key.
However, it is possible that the caller provides a key (or the newly
generated key) which is shorter than the previous key. In that
scenario, some key material from the previous key would not be
overwritten. The easiest solution is to explicitly zeroize the entire
private_key array first.
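The hazard and the fix can be illustrated with a small userspace analog in Python (the kernel code operates on a fixed-size private_key array; the 32-byte size and function name here are hypothetical):

```python
KEY_BYTES = 32  # illustrative fixed size of the key buffer

def set_private_key(buf: bytearray, new_key: bytes) -> None:
    """Install new_key into a fixed-size key buffer without leaving
    stale material behind from a longer previous key."""
    assert len(buf) == KEY_BYTES and len(new_key) <= KEY_BYTES
    # Zeroize the entire buffer first, so a shorter new key cannot
    # leave trailing bytes of the old key in place.
    buf[:] = bytes(KEY_BYTES)
    buf[:len(new_key)] = new_key
```

Without the zeroizing step, installing a 16-byte key over a 32-byte one would leave the last 16 bytes of the old key readable.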

Note that this patch slightly changes the behavior of this function:
previously, if the ecc_gen_privkey failed, the old private_key would
remain. Now, the private_key is always zeroized. This behavior is
consistent with the case where params.key is set and ecc_is_key_valid
fails.

Signed-off-by: Joachim Vandersmissen <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
[ Upstream commit ba3c557 ]

When the mpi_ec_ctx structure is initialized, some fields are not
cleared, causing a crash when an uninitialized field is referenced
as the structure is released. Initially, this issue was ignored because
memory for mpi_ec_ctx is allocated with the __GFP_ZERO flag.
For example, this error will be triggered when calculating the
Za value for SM2 separately.

Fixes: d58bb7e ("lib/mpi: Introduce ec implementation to MPI library")
Cc: [email protected] # v6.5
Signed-off-by: Tianjia Zhang <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
None of the ciphers used by the DRBG have an alignment requirement; thus,
they all return 0 from .crypto_init, resulting in inconsistent alignment
across all buffers.

Align all buffers to at least a cache line to improve performance. This is
especially useful when multiple DRBG instances are used, since it prevents
false sharing of cache lines between the different instances.
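The layout idea can be sketched generically (hypothetical helpers; the actual patch adjusts the DRBG's buffer alignment in kernel C):

```python
CACHE_LINE = 64  # common cache-line size; architecture-dependent

def align_up(n: int, a: int = CACHE_LINE) -> int:
    """Round n up to the next multiple of a (a must be a power of 2)."""
    return (n + a - 1) & ~(a - 1)

def aligned_offsets(sizes, a=CACHE_LINE):
    """Lay out several per-instance buffers so that each starts on its
    own cache line, preventing false sharing between instances."""
    offsets, pos = [], 0
    for size in sizes:
        pos = align_up(pos, a)   # bump start to the next cache line
        offsets.append(pos)
        pos += size
    return offsets
```

Starting every buffer on its own cache line means a write by one DRBG instance never invalidates a line another instance is reading.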

Signed-off-by: Sultan Alsawaf <[email protected]>
Like pin_user_pages_fast(), but with the internal-only FOLL_FAST_ONLY flag.

This complements the get_user_pages*() API, which already has
get_user_pages_fast_only().

Signed-off-by: Sultan Alsawaf <[email protected]>
There is no reason this refcount should be a signed int. Convert it to an
unsigned int, thereby also making it less likely to ever overflow.

Signed-off-by: Sultan Alsawaf <[email protected]>
Since crypto_devrandom_read_iter() is invoked directly by user tasks and is
accessible by every task in the system, there are glaring priority
inversions on crypto_reseed_rng_lock and crypto_default_rng_lock.

Tasks of arbitrary scheduling priority access crypto_devrandom_read_iter().
When a low-priority task owns one of the mutex locks, higher-priority tasks
waiting on that mutex lock are stalled until the low-priority task is done.

Fix the priority inversions by converting the mutex locks into rt_mutex
locks which have PI support.

Signed-off-by: Sultan Alsawaf <[email protected]>
@kerneltoast
Author

Please closely review the commits I authored in this PR, since they are new original works.

@PlaidCat
Collaborator

Do you have any testing you can share in the PR header?

@kerneltoast
Author

Do you have any testing you can share in the PR header?

I can't think of a great way to tabulate the test methodology for this, but what I tested was:

  • vm.overcommit_memory=2 (for full overcommit to trigger failure of the optimistic page pin attempt)
  • An extremely large request of 250MB in a single getrandom() call without GRND_RANDOM
  • Request sizes less than one page, equal to one page, and greater than one page, with and without GRND_RANDOM

I printk'd all the edge cases to verify that they were hit and that they worked correctly, too.


@jallisonciq left a comment


I worked with Sultan whilst he was creating these patches (and I may be the reason for some of the comments :-). I am completely happy with the code logic, but as I told him I'm going to have to trust him on the locking design.

Collaborator

@bmastbergen left a comment


I won't pretend to have a full understanding of what is going on here, but from the commit messages and comments I think I get the gist of it. With that being said, LGTM 🥌

When the kernel is booted with fips=1, the RNG exposed to userspace is
hijacked away from the CRNG and redirected to crypto_devrandom_read_iter(),
which utilizes the DRBG.

Notably, crypto_devrandom_read_iter() maintains just two global DRBG
instances _for the entire system_, and the two instances serve separate
request types: one instance for GRND_RANDOM requests (crypto_reseed_rng),
and one instance for non-GRND_RANDOM requests (crypto_default_rng). So in
essence, for requests of a single type, there is just one global RNG for
all CPUs in the entire system, which scales _very_ poorly.

To make matters worse, the temporary buffer used to ferry data between the
DRBG and userspace is woefully small at only 256 bytes, which doesn't do a
good job of maximizing throughput from the DRBG. This results in lost
performance when userspace requests >256 bytes; it is observed that DRBG
throughput improves by 70% on an i9-13900H when the buffer size is
increased to 4096 bytes (one page). Going beyond the size of one page up to
the DRBG maximum request limit of 65536 bytes produces diminishing returns
of only 3% improved throughput in comparison. And going below the size of
one page produces progressively less throughput at each power of 2: there's
a 5% loss going from 4096 bytes to 2048 bytes and a 9% loss going from 2048
bytes to 1024 bytes.
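The buffer-size effect comes down to how many times userspace-visible copying interrupts the DRBG; a read loop using a page-sized intermediate buffer can be modeled as follows (sizes from the message above; `drbg_generate` is a stand-in for the real DRBG call):

```python
PAGE_SIZE = 4096  # intermediate buffer size chosen in this patch

def read_random(total: int, drbg_generate) -> bytes:
    """Fill a request of `total` bytes by asking the DRBG for at most
    one page at a time, as the patched read path does."""
    out = bytearray()
    while len(out) < total:
        n = min(PAGE_SIZE, total - len(out))
        out += drbg_generate(n)  # one DRBG invocation per page
    return bytes(out)
```

With the old 256-byte buffer, a large request pays 16x as many DRBG invocations per page of output, which is where the observed ~70% throughput gap comes from.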

Thus, this implements per-CPU DRBG instances, each with a page-sized
buffer, to drive the DRBG itself more effectively. On top of that, for
non-GRND_RANDOM requests, the DRBG's operations now occur under a local
lock that disables preemption on non-PREEMPT_RT kernels, which not only
keeps each CPU's DRBG instance isolated from the others, but also improves
temporal cache locality while the DRBG actively generates a new string of
random bytes.

Prefaulting one user destination page at a time is also employed to prevent
a DRBG instance from getting blocked on page faults, thereby maximizing the
use of the DRBG so that the only bottleneck is the DRBG itself.

Signed-off-by: Sultan Alsawaf <[email protected]>
@kerneltoast kerneltoast force-pushed the {sultan}_fips-rng_rocky9_6 branch from d1bd048 to 472f628 on June 28, 2025 02:06
@kerneltoast
Copy link
Author

FYI, I added a few more small comments and micro-optimized 2 lines of code. Going to merge it since it's not any different from what it was before. Thanks for the reviews 🙂

@kerneltoast kerneltoast merged commit 472f628 into fips-9-compliant/5.14.0-570.18.1.el9_6 Jun 28, 2025
2 of 4 checks passed
@kerneltoast kerneltoast deleted the {sultan}_fips-rng_rocky9_6 branch June 28, 2025 02:13