Integrate formally verified AArch64 Keccak-x1 assembly from s2n-bignum/mlkem-native #2539

Open
wants to merge 8 commits into
base: main

hanno-becker
Contributor

@hanno-becker hanno-becker commented Jul 14, 2025

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.


NOTE: This PR integrates code that is not yet in s2n-bignum's main, see awslabs/s2n-bignum#238.

This PR is ready for review, but ideally we hold off on merging until the Keccak code has been integrated into s2n-bignum, so that we can just re-import s2n-bignum using @torben-hansen's importer script.


Previously, a static dispatch would choose between the C implementation of Keccak-F1600 or the assembly implementations (one scalar, one SIMD) provided by OpenSSL. The C<->ASM interface was Keccak1600_Absorb and Keccak1600_Squeeze.

This PR lowers the C<->ASM interface to the core Keccak permutation itself; the Absorb/Squeeze assembly wrappers in keccak1600-armv8.pl are removed accordingly.
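To make the lowered interface concrete, here is a minimal sketch (not the AWS-LC code) of an absorb loop written in C on top of a pluggable permutation. The names `keccak_f1600_fn`, `absorb_blocks` and `counting_permutation` are hypothetical, and the counting stub stands in for a real Keccak-F1600 implementation purely so the control flow can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permutation signature: operates in place on the 25-lane state. */
typedef void (*keccak_f1600_fn)(uint64_t state[25]);

/* XOR full rate-sized blocks into the state, permuting after each one.
 * Returns the number of trailing bytes that did not fill a whole block. */
static size_t absorb_blocks(uint64_t state[25], const uint8_t *in, size_t len,
                            size_t rate, keccak_f1600_fn permute) {
  assert(rate > 0 && rate < 200 && rate % 8 == 0);
  while (len >= rate) {
    for (size_t i = 0; i < rate / 8; i++) {
      uint64_t lane = 0;
      /* Load each lane little-endian, independent of host byte order. */
      for (size_t b = 0; b < 8; b++) {
        lane |= (uint64_t)in[8 * i + b] << (8 * b);
      }
      state[i] ^= lane;
    }
    permute(state); /* the only call that would cross the C<->ASM boundary */
    in += rate;
    len -= rate;
  }
  return len;
}

/* Stand-in permutation used only to make the sketch testable: counts calls
 * and leaves the state untouched. It is NOT Keccak-F1600. */
static unsigned permute_calls = 0;
static void counting_permutation(uint64_t state[25]) {
  (void)state;
  permute_calls++;
}
```

With a rate of 136 bytes (the SHA3-256 rate), absorbing 2 * 136 + 10 bytes through `absorb_blocks` invokes the permutation twice and reports 10 leftover bytes. Only the permutation pointer would need to change to switch between the C and assembly backends.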

Moreover, the PR integrates the Keccak-F1600 implementations from s2n-bignum/mlkem-native into the build and replaces the above static dispatch by a runtime dispatch based on CPU detection / CPU capabilities:

  1. If ASM is disabled, we use the C implementation.
  2. If ASM is enabled:
     • For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum, leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
     • For Arm-based Apple CPUs, we use Neon Keccak assembly from s2n-bignum, leveraging the AArch64 SHA3 extension.
     • Otherwise, fall back to scalar Keccak implementation from OpenSSL, not using lazy rotations.
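The decision list above can be sketched as a small selection function. This is an illustration only: the enum, the `keccak_cpu_info` struct and `keccak_select_impl` are hypothetical names, and the real code may combine these checks with build-time feature macros rather than a single struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four code paths described in the list. */
enum keccak_impl {
  KECCAK_IMPL_C,              /* portable C implementation */
  KECCAK_IMPL_S2N_SCALAR,     /* s2n-bignum scalar, lazy rotations */
  KECCAK_IMPL_S2N_NEON_SHA3,  /* s2n-bignum Neon, SHA3 extension */
  KECCAK_IMPL_OPENSSL_SCALAR  /* OpenSSL scalar fallback */
};

/* Hypothetical summary of the build configuration and the detected CPU. */
struct keccak_cpu_info {
  bool asm_enabled;
  bool is_neoverse_n1_v1_v2;
  bool is_apple_arm;
};

/* Mirrors the order of the list: ASM off, Neoverse, Apple, then fallback. */
static enum keccak_impl keccak_select_impl(const struct keccak_cpu_info *cpu) {
  assert(cpu != NULL);
  if (!cpu->asm_enabled) {
    return KECCAK_IMPL_C;
  }
  if (cpu->is_neoverse_n1_v1_v2) {
    return KECCAK_IMPL_S2N_SCALAR;
  }
  if (cpu->is_apple_arm) {
    return KECCAK_IMPL_S2N_NEON_SHA3;
  }
  return KECCAK_IMPL_OPENSSL_SCALAR;
}
```

A CPU such as Cortex-A72 matches neither of the two specific checks and therefore lands on the OpenSSL scalar path.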

Lazy rotations improve performance by roughly 7% to 13% in the measurements below on CPUs with free barrel shifting, which includes Neoverse N1, V1, and V2. Not all CPUs have free barrel shifting (e.g. Apple M1 or Cortex-A72), so we don't use them by default.
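As a rough illustration of the idea (a simplified model, not the s2n-bignum code), a lane can be carried as a raw value plus a pending rotation, and the rotation is folded into the rotated operand of the next XOR instead of being materialised on its own. On AArch64 that fold maps to a single `eor` with a `ror`-shifted register operand. The `lazy_lane` type and helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by r bits (r taken modulo 64). */
static inline uint64_t ror64(uint64_t x, unsigned r) {
  r &= 63;
  return r ? (x >> r) | (x << (64 - r)) : x;
}

/* A lane whose logical value is ror64(raw, ror); the rotation is pending. */
struct lazy_lane {
  uint64_t raw;
  unsigned ror;
};

/* Materialise the logical value (what an eager implementation would store). */
static uint64_t lazy_value(struct lazy_lane l) {
  return ror64(l.raw, l.ror);
}

/* XOR two lazy lanes without materialising either one. The result keeps a's
 * pending rotation; b is folded in with the relative rotation, which is the
 * part a shifted-register operand can absorb. */
static struct lazy_lane lazy_xor(struct lazy_lane a, struct lazy_lane b) {
  assert(a.ror < 64 && b.ror < 64);
  struct lazy_lane out;
  out.raw = a.raw ^ ror64(b.raw, (b.ror - a.ror) & 63);
  out.ror = a.ror;
  return out;
}
```

This works because rotation distributes over XOR and rotations compose additively modulo 64, so `lazy_value(lazy_xor(a, b))` equals `lazy_value(a) ^ lazy_value(b)`. Whether the fold is actually free depends on the CPU's barrel shifter, which is why the technique helps on some cores and not on others.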

Neoverse V1 and V2 do support SHA3 instructions, but they are only implemented on 1/4 of Neon units, and are thus slower than a scalar implementation.

Finally, while keccak1600-armv8.pl includes an implementation based on SHA3 instructions, this implementation was never used. It is now obsolete with the introduction of the verified SHA3-instruction-based implementation from s2n-bignum, and is removed from keccak1600-armv8.pl. This leaves only the scalar assembly implementation for the core Keccak permutation in keccak1600-armv8.pl.

Performance impact

Apple M1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 71.5 MB/s  | 88.3 MB/s  | +16.8 MB/s | +23.5% |
|           | 256b  | 584.5 MB/s | 754.7 MB/s | +170.2 MB/s| +29.1% |
|           | 1350b | 633.8 MB/s | 815.2 MB/s | +181.4 MB/s| +28.6% |
|           | 8kb   | 694.4 MB/s | 872.4 MB/s | +178.0 MB/s| +25.6% |
|           | 16kb  | 696.9 MB/s | 864.8 MB/s | +167.9 MB/s| +24.1% |
| SHA3-256  | 16b   | 71.6 MB/s  | 88.6 MB/s  | +17.0 MB/s | +23.7% |
|           | 256b  | 600.8 MB/s | 759.0 MB/s | +158.2 MB/s| +26.3% |
|           | 1350b | 638.8 MB/s | 817.5 MB/s | +178.7 MB/s| +28.0% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s| +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s| +25.0% |
| SHA3-384  | 16b   | 71.9 MB/s  | 86.8 MB/s  | +14.9 MB/s | +20.7% |
|           | 256b  | 402.3 MB/s | 505.4 MB/s | +103.1 MB/s| +25.6% |
|           | 1350b | 493.1 MB/s | 636.0 MB/s | +142.9 MB/s| +29.0% |
|           | 8kb   | 507.3 MB/s | 639.7 MB/s | +132.4 MB/s| +26.1% |
|           | 16kb  | 507.2 MB/s | 626.2 MB/s | +119.0 MB/s| +23.5% |
| SHA3-512  | 16b   | 70.6 MB/s  | 89.2 MB/s  | +18.6 MB/s | +26.3% |
|           | 256b  | 305.7 MB/s | 390.8 MB/s | +85.1 MB/s | +27.8% |
|           | 1350b | 347.2 MB/s | 436.7 MB/s | +89.5 MB/s | +25.8% |
|           | 8kb   | 355.0 MB/s | 446.3 MB/s | +91.3 MB/s | +25.7% |
|           | 16kb  | 356.1 MB/s | 445.7 MB/s | +89.6 MB/s | +25.2% |
| SHAKE-128 | 16b   | 68.8 MB/s  | 87.4 MB/s  | +18.6 MB/s | +27.0% |
|           | 256b  | 572.2 MB/s | 747.5 MB/s | +175.3 MB/s| +30.6% |
|           | 1350b | 780.8 MB/s | 1016.4 MB/s| +235.6 MB/s| +30.2% |
|           | 8kb   | 932.8 MB/s | 1215.4 MB/s| +282.6 MB/s| +30.3% |
|           | 16kb  | 932.4 MB/s | 1215.9 MB/s| +283.5 MB/s| +30.4% |
| SHAKE-256 | 16b   | 69.0 MB/s  | 87.6 MB/s  | +18.6 MB/s | +27.0% |
|           | 256b  | 574.7 MB/s | 750.1 MB/s | +175.4 MB/s| +30.5% |
|           | 1350b | 629.4 MB/s | 817.0 MB/s | +187.6 MB/s| +29.8% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s| +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s| +25.0% |

Neoverse-V2

| Algorithm | Size  |       Main |        New |       Gain |     % |
|:----------|:------|------------|------------|------------|------:|
| SHA3-224  | 16b   | 53.4 MB/s  | 56.9 MB/s  | +3.5 MB/s  | +6.7% |
|           | 256b  | 449.9 MB/s | 487.0 MB/s | +37.1 MB/s | +8.2% |
|           | 1350b | 500.0 MB/s | 541.5 MB/s | +41.5 MB/s | +8.3% |
|           | 8kb   | 537.9 MB/s | 585.3 MB/s | +47.4 MB/s | +8.8% |
|           | 16kb  | 530.7 MB/s | 577.5 MB/s | +46.8 MB/s | +8.8% |
| SHA3-256  | 16b   | 53.5 MB/s  | 57.2 MB/s  | +3.7 MB/s  | +7.0% |
|           | 256b  | 451.6 MB/s | 488.1 MB/s | +36.5 MB/s | +8.1% |
|           | 1350b | 500.1 MB/s | 542.0 MB/s | +41.9 MB/s | +8.4% |
|           | 8kb   | 503.0 MB/s | 546.9 MB/s | +43.9 MB/s | +8.7% |
|           | 16kb  | 500.2 MB/s | 544.9 MB/s | +44.7 MB/s | +8.9% |
| SHA3-384  | 16b   | 53.8 MB/s  | 57.7 MB/s  | +3.9 MB/s  | +7.2% |
|           | 256b  | 306.9 MB/s | 333.3 MB/s | +26.4 MB/s | +8.6% |
|           | 1350b | 386.6 MB/s | 420.5 MB/s | +33.9 MB/s | +8.8% |
|           | 8kb   | 389.9 MB/s | 424.5 MB/s | +34.6 MB/s | +8.9% |
|           | 16kb  | 384.9 MB/s | 420.1 MB/s | +35.2 MB/s | +9.1% |
| SHA3-512  | 16b   | 53.4 MB/s  | 57.8 MB/s  | +4.4 MB/s  | +8.3% |
|           | 256b  | 233.5 MB/s | 254.0 MB/s | +20.5 MB/s | +8.8% |
|           | 1350b | 266.7 MB/s | 290.2 MB/s | +23.5 MB/s | +8.8% |
|           | 8kb   | 271.9 MB/s | 295.8 MB/s | +23.9 MB/s | +8.8% |
|           | 16kb  | 268.7 MB/s | 292.7 MB/s | +24.0 MB/s | +8.9% |
| SHAKE-128 | 16b   | 49.6 MB/s  | 53.1 MB/s  | +3.5 MB/s  | +7.0% |
|           | 256b  | 432.9 MB/s | 468.0 MB/s | +35.1 MB/s | +8.1% |
|           | 1350b | 547.5 MB/s | 592.5 MB/s | +45.0 MB/s | +8.2% |
|           | 8kb   | 621.6 MB/s | 676.1 MB/s | +54.5 MB/s | +8.8% |
|           | 16kb  | 613.4 MB/s | 667.7 MB/s | +54.3 MB/s | +8.9% |
| SHAKE-256 | 16b   | 49.7 MB/s  | 53.2 MB/s  | +3.5 MB/s  | +7.2% |
|           | 256b  | 432.9 MB/s | 469.1 MB/s | +36.2 MB/s | +8.4% |
|           | 1350b | 494.6 MB/s | 537.9 MB/s | +43.3 MB/s | +8.8% |
|           | 8kb   | 502.3 MB/s | 546.6 MB/s | +44.3 MB/s | +8.8% |
|           | 16kb  | 499.6 MB/s | 545.2 MB/s | +45.6 MB/s | +9.1% |

Neoverse-N1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 32.7 MB/s  | 36.5 MB/s  | +3.8 MB/s  | +11.7% |
|           | 256b  | 277.2 MB/s | 311.2 MB/s | +34.0 MB/s | +12.3% |
|           | 1350b | 309.5 MB/s | 347.2 MB/s | +37.7 MB/s | +12.2% |
|           | 8kb   | 334.4 MB/s | 375.8 MB/s | +41.4 MB/s | +12.4% |
|           | 16kb  | 331.1 MB/s | 372.5 MB/s | +41.4 MB/s | +12.5% |
| SHA3-256  | 16b   | 33.0 MB/s  | 36.8 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 279.4 MB/s | 312.5 MB/s | +33.1 MB/s | +11.9% |
|           | 1350b | 310.0 MB/s | 348.3 MB/s | +38.3 MB/s | +12.4% |
|           | 8kb   | 312.8 MB/s | 352.1 MB/s | +39.3 MB/s | +12.6% |
|           | 16kb  | 312.4 MB/s | 353.0 MB/s | +40.6 MB/s | +13.0% |
| SHA3-384  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 190.7 MB/s | 214.1 MB/s | +23.4 MB/s | +12.3% |
|           | 1350b | 240.2 MB/s | 269.9 MB/s | +29.7 MB/s | +12.4% |
|           | 8kb   | 242.7 MB/s | 273.2 MB/s | +30.5 MB/s | +12.6% |
|           | 16kb  | 240.4 MB/s | 271.7 MB/s | +31.3 MB/s | +13.0% |
| SHA3-512  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.6% |
|           | 256b  | 145.1 MB/s | 162.8 MB/s | +17.7 MB/s | +12.2% |
|           | 1350b | 165.7 MB/s | 186.2 MB/s | +20.5 MB/s | +12.3% |
|           | 8kb   | 169.1 MB/s | 190.0 MB/s | +20.9 MB/s | +12.4% |
|           | 16kb  | 167.5 MB/s | 189.2 MB/s | +21.7 MB/s | +13.0% |
| SHAKE-128 | 16b   | 30.3 MB/s  | 33.6 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 263.7 MB/s | 293.2 MB/s | +29.5 MB/s | +11.2% |
|           | 1350b | 338.2 MB/s | 379.4 MB/s | +41.2 MB/s | +12.2% |
|           | 8kb   | 387.2 MB/s | 435.4 MB/s | +48.2 MB/s | +12.5% |
|           | 16kb  | 383.6 MB/s | 432.9 MB/s | +49.3 MB/s | +12.9% |
| SHAKE-256 | 16b   | 30.5 MB/s  | 33.8 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 264.9 MB/s | 294.5 MB/s | +29.6 MB/s | +11.2% |
|           | 1350b | 306.5 MB/s | 344.1 MB/s | +37.6 MB/s | +12.3% |
|           | 8kb   | 312.0 MB/s | 351.5 MB/s | +39.5 MB/s | +12.7% |
|           | 16kb  | 312.1 MB/s | 352.7 MB/s | +40.6 MB/s | +13.0% |

Neoverse-V1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 45.6 MB/s  | 49.3 MB/s  | +3.7 MB/s  |  +8.2% |
|           | 256b  | 382.6 MB/s | 419.4 MB/s | +36.8 MB/s |  +9.6% |
|           | 1350b | 422.5 MB/s | 464.0 MB/s | +41.5 MB/s |  +9.8% |
|           | 8kb   | 454.9 MB/s | 500.5 MB/s | +45.6 MB/s | +10.0% |
|           | 16kb  | 449.2 MB/s | 495.5 MB/s | +46.3 MB/s | +10.3% |
| SHA3-256  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  |  +8.5% |
|           | 256b  | 383.2 MB/s | 420.8 MB/s | +37.6 MB/s |  +9.8% |
|           | 1350b | 422.8 MB/s | 464.4 MB/s | +41.6 MB/s |  +9.8% |
|           | 8kb   | 425.8 MB/s | 467.7 MB/s | +41.9 MB/s |  +9.8% |
|           | 16kb  | 424.0 MB/s | 468.1 MB/s | +44.1 MB/s | +10.4% |
| SHA3-384  | 16b   | 45.7 MB/s  | 49.7 MB/s  | +4.0 MB/s  |  +8.7% |
|           | 256b  | 261.3 MB/s | 284.5 MB/s | +23.2 MB/s |  +8.9% |
|           | 1350b | 327.8 MB/s | 359.6 MB/s | +31.8 MB/s |  +9.7% |
|           | 8kb   | 330.5 MB/s | 362.7 MB/s | +32.2 MB/s |  +9.8% |
|           | 16kb  | 326.3 MB/s | 360.6 MB/s | +34.3 MB/s | +10.5% |
| SHA3-512  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  |  +8.3% |
|           | 256b  | 198.4 MB/s | 216.7 MB/s | +18.3 MB/s |  +9.2% |
|           | 1350b | 226.1 MB/s | 247.5 MB/s | +21.4 MB/s |  +9.5% |
|           | 8kb   | 230.2 MB/s | 252.0 MB/s | +21.8 MB/s |  +9.4% |
|           | 16kb  | 227.7 MB/s | 250.3 MB/s | +22.6 MB/s |  +9.9% |
| SHAKE-128 | 16b   | 42.1 MB/s  | 45.8 MB/s  | +3.7 MB/s  |  +8.9% |
|           | 256b  | 366.4 MB/s | 402.3 MB/s | +35.9 MB/s |  +9.8% |
|           | 1350b | 463.5 MB/s | 508.8 MB/s | +45.3 MB/s |  +9.8% |
|           | 8kb   | 525.7 MB/s | 580.0 MB/s | +54.3 MB/s | +10.3% |
|           | 16kb  | 519.4 MB/s | 574.4 MB/s | +55.0 MB/s | +10.6% |
| SHAKE-256 | 16b   | 42.3 MB/s  | 46.0 MB/s  | +3.7 MB/s  |  +8.8% |
|           | 256b  | 367.6 MB/s | 404.2 MB/s | +36.6 MB/s |  +9.9% |
|           | 1350b | 418.8 MB/s | 459.9 MB/s | +41.1 MB/s |  +9.8% |
|           | 8kb   | 425.1 MB/s | 466.9 MB/s | +41.8 MB/s |  +9.8% |
|           | 16kb  | 423.7 MB/s | 467.4 MB/s | +43.7 MB/s | +10.3% |

Cortex-A72

| Algorithm | Size  |       Main |        New |       Gain |     % |
|:----------|:------|------------|------------|------------|------:|
| SHA3-224  | 16b   | 19.9 MB/s  | 19.6 MB/s  | -0.3 MB/s  | -1.2% |
|           | 256b  | 169.9 MB/s | 168.1 MB/s | -1.8 MB/s  | -1.0% |
|           | 1350b | 195.7 MB/s | 189.3 MB/s | -6.4 MB/s  | -3.3% |
|           | 8kb   | 211.7 MB/s | 204.8 MB/s | -6.9 MB/s  | -3.2% |
|           | 16kb  | 212.2 MB/s | 205.3 MB/s | -6.9 MB/s  | -3.2% |
| SHA3-256  | 16b   | 19.6 MB/s  | 19.7 MB/s  | +0.1 MB/s  | +0.6% |
|           | 256b  | 168.9 MB/s | 168.7 MB/s | -0.2 MB/s  | -0.1% |
|           | 1350b | 195.2 MB/s | 189.0 MB/s | -6.2 MB/s  | -3.2% |
|           | 8kb   | 198.6 MB/s | 191.8 MB/s | -6.8 MB/s  | -3.4% |
|           | 16kb  | 200.7 MB/s | 193.8 MB/s | -6.9 MB/s  | -3.4% |
| SHA3-384  | 16b   | 20.0 MB/s  | 19.8 MB/s  | -0.2 MB/s  | -0.9% |
|           | 256b  | 118.3 MB/s | 115.6 MB/s | -2.7 MB/s  | -2.3% |
|           | 1350b | 151.6 MB/s | 146.8 MB/s | -4.8 MB/s  | -3.2% |
|           | 8kb   | 154.2 MB/s | 148.9 MB/s | -5.3 MB/s  | -3.4% |
|           | 16kb  | 154.5 MB/s | 149.1 MB/s | -5.4 MB/s  | -3.5% |
| SHA3-512  | 16b   | 20.0 MB/s  | 19.7 MB/s  | -0.3 MB/s  | -1.5% |
|           | 256b  | 90.2 MB/s  | 87.8 MB/s  | -2.4 MB/s  | -2.6% |
|           | 1350b | 104.9 MB/s | 100.6 MB/s | -4.3 MB/s  | -4.1% |
|           | 8kb   | 107.4 MB/s | 102.7 MB/s | -4.7 MB/s  | -4.3% |
|           | 16kb  | 107.5 MB/s | 102.9 MB/s | -4.6 MB/s  | -4.3% |
| SHAKE-128 | 16b   | 16.8 MB/s  | 17.7 MB/s  | +0.9 MB/s  | +5.0% |
|           | 256b  | 157.2 MB/s | 159.2 MB/s | +2.0 MB/s  | +1.3% |
|           | 1350b | 211.4 MB/s | 206.0 MB/s | -5.4 MB/s  | -2.6% |
|           | 8kb   | 245.1 MB/s | 236.1 MB/s | -9.0 MB/s  | -3.7% |
|           | 16kb  | 245.9 MB/s | 237.6 MB/s | -8.3 MB/s  | -3.4% |
| SHAKE-256 | 16b   | 17.6 MB/s  | 17.8 MB/s  | +0.2 MB/s  | +1.3% |
|           | 256b  | 158.9 MB/s | 158.1 MB/s | -0.8 MB/s  | -0.5% |
|           | 1350b | 192.5 MB/s | 186.9 MB/s | -5.6 MB/s  | -3.0% |
|           | 8kb   | 198.0 MB/s | 191.1 MB/s | -6.9 MB/s  | -3.5% |
|           | 16kb  | 200.4 MB/s | 193.2 MB/s | -7.2 MB/s  | -3.6% |

Signed-off-by: Hanno Becker [email protected]

@codecov-commenter

codecov-commenter commented Jul 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.73%. Comparing base (0beb210) to head (90ab7b7).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2539   +/-   ##
=======================================
  Coverage   78.72%   78.73%           
=======================================
  Files         645      645           
  Lines      110641   110644    +3     
  Branches    15648    15654    +6     
=======================================
+ Hits        87105    87117   +12     
+ Misses      22835    22828    -7     
+ Partials      701      699    -2     


@hanno-becker hanno-becker changed the title Integrate formally verified AArch64 Keccak-x1 assembly from s2n-bignum Integrate formally verified AArch64 Keccak-x1 assembly from s2n-bignum/mlkem-native Jul 15, 2025
@hanno-becker hanno-becker force-pushed the keccak_asm branch 2 times, most recently from 0b42805 to d975c8d Compare July 15, 2025 06:45
@hanno-becker hanno-becker marked this pull request as ready for review July 15, 2025 07:50
@hanno-becker hanno-becker requested a review from a team as a code owner July 15, 2025 07:50
@hanno-becker hanno-becker force-pushed the keccak_asm branch 4 times, most recently from d8cf093 to 19a652e Compare July 24, 2025 10:55
This commit imports AArch64 assembly for two implementations
of the Keccak-F1600 permutation from s2n-bignum.

The first implementation leverages the 'lazy rotation' technique
described in [1] to accelerate scalar Keccak computations on AArch64
CPUs with free Barrel shifting (that is, where Barrel shifted instructions
have the same performance characteristics as unshifted ones).
Notable examples are Neoverse N1, V1 and V2. Notable non-examples
are Cortex-A72 and Apple M1; on those CPUs, the existing scalar
assembly from OpenSSL is faster.

This commit does not yet integrate the assembly into AWS-LC.

[1]: https://eprint.iacr.org/2022/1243

     Hybrid scalar/vector implementations of
     Keccak and SPHINCS+ on AArch64

Signed-off-by: Hanno Becker <[email protected]>
include(CheckCSourceCompiles)
set(CMAKE_REQUIRED_FLAGS_BACKUP "${CMAKE_REQUIRED_FLAGS}")
set(CMAKE_REQUIRED_FLAGS "-march=armv8.4-a+sha3")
check_c_source_compiles("
Contributor

I think this works, but the rest of the code implements compiler feature probes under https://github.com/aws/aws-lc/tree/main/tests/compiler_features_tests and uses check_compiler. This should use the same mechanism.

Contributor Author

@hanno-becker hanno-becker Jul 29, 2025

This required a bit of rework on check_compiler, but I think it's working now. Please take a look at fd56682

Contributor Author

Tested this locally, and on a docker image with gcc7 (no sha3 support). Output is

-- neon_sha3_check.c probe is negative, NOT enabling MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION:
--     Change Dir: /workspace/build/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_128bc/fast"
/usr/bin/make -f CMakeFiles/cmTC_128bc.dir/build.make CMakeFiles/cmTC_128bc.dir/build
make[1]: Entering directory '/workspace/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o
/usr/bin/cc   -Wredundant-decls -Wextra -Wunused -Wcomment -Wchar-subscripts -Wuninitialized -Wshadow -Wwrite-strings -Wformat-security -Wunused-result -Wno-overlength-strings  -Wall -fvisibility=hidden -fno-common -Wno-c11-extensions -Wvla -Wtype-limits -Wno-unused-parameter -Werror -Wformat=2 -Wsign-compare -Wmissing-field-initializers -Wwrite-strings -Wno-free-nonheap-object -Wmissing-braces -Wimplicit-fallthrough -Wformat-signedness -Wmissing-prototypes -Wold-style-definition -Wstrict-prototypes  -DAWS_LC_STDALIGN_AVAILABLE -DAWS_LC_BUILTIN_SWAP_SUPPORTED -Wshadow -D_XOPEN_SOURCE=700  -fPIE   -Werror -march=armv8.4-a+sha3 -o CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o   -c /workspace/tests/compiler_features_tests/neon_sha3_check.c
cc1: compiler_error: unknown value 'armv8.4-a+sha3' for -march
cc1: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a did you mean 'armv8.1-a'?
cc1: compiler_error: unrecognized command line option '-Wno-c11-extensions' [-Werror]
cc1: all warnings being treated as errors
CMakeFiles/cmTC_128bc.dir/build.make:65: recipe for target 'CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o' failed
make[1]: *** [CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o] Error 1
make[1]: Leaving directory '/workspace/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_128bc/fast' failed
make: *** [cmTC_128bc/fast] Error 2

-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Generating test executable mem_test.
-- Generating test executable mem_set_test.
-- Generating test executable dynamic_loading_test.
-- Generating test executable rwlock_static_init.
-- Installing: /workspace/build/tool-openssl/c_rehash
-- Installing: /workspace/build/tool-openssl/c_rehash_test
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/build

which seems right (check_compiler does print the error output).

@@ -83,6 +83,9 @@ void OPENSSL_cpuid_setup(void) {
// Check if the CPU model is Neoverse V1 or V2,
Contributor

outdated comment

Contributor Author

Removed.


void KeccakF1600(uint64_t A[KECCAK1600_ROWS][KECCAK1600_ROWS]) {
#if defined(KECCAK1600_S2N_BIGNUM_ASM)
#if defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
Contributor

Are there other ways to arrange this to remove one or two levels of #if/#ifdef nesting? At this point we are three levels deep.
For other s2n-bignum integrations, we would typically stub out any missing functions.
I understand it's hard to always pass the s2n-bignum files into the build if sha3 is used in the asm implementation and not its actual encoding.

Contributor Author

I'm not sure how to get rid of the #defines altogether, but I would welcome concrete suggestions. For now, I think the following is already a bit better:

void KeccakF1600(uint64_t A[KECCAK1600_ROWS][KECCAK1600_ROWS]) {
#if defined(KECCAK1600_S2N_BIGNUM_ASM) && defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
    if (keccak_use_s2n_bignum_alt()) {
        sha3_keccak_f1600_alt((uint64_t *)A, iotas);
        return;
    }
#endif // KECCAK1600_S2N_BIGNUM_ASM && MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION

#if defined(KECCAK1600_S2N_BIGNUM_ASM)
    if (keccak_use_s2n_bignum_main()) {
        sha3_keccak_f1600((uint64_t *)A, iotas);
        return;
    }
#endif // KECCAK1600_S2N_BIGNUM_ASM

    KeccakF1600_hw((uint64_t *) A);
}


// Scalar implementation from OpenSSL provided by keccak1600-armv8.pl
extern void KeccakF1600_hw(uint64_t state[25]);

Contributor

Suggested change
OPENSSL_STATIC_ASSERT(KECCAK1600_ROWS * KECCAK1600_ROWS == 25, unexpected_array_size_for_A)

Contributor Author

@hanno-becker hanno-becker Jul 29, 2025

Done
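For context, here is a minimal sketch of what such a compile-time check guards. It uses C11 `_Static_assert` in place of the project's `OPENSSL_STATIC_ASSERT` macro, assumes `KECCAK1600_ROWS` is 5, and uses a hypothetical stub (`permute_flat_stub`) where the real code would call the flat 25-lane assembly routine.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for this sketch; it mirrors the 5x5 lane layout. */
#define KECCAK1600_ROWS 5

/* Fails the build if the 2D state no longer matches a flat 25-lane array. */
_Static_assert(KECCAK1600_ROWS * KECCAK1600_ROWS == 25,
               "unexpected array size for A");

/* Hypothetical stand-in for a flat-state routine: it only touches the last
 * lane so that the effect of the cast can be observed. */
static void permute_flat_stub(uint64_t state[25]) {
  state[24] ^= 1;
}

/* Wrapper with the 2D signature that forwards to the flat-state routine. */
static void permute_2d(uint64_t A[KECCAK1600_ROWS][KECCAK1600_ROWS]) {
  permute_flat_stub((uint64_t *)A);
}
```

After calling `permute_2d` on a zeroed state, `A[4][4]` is the lane that was modified, which is the correspondence between the two views that the assertion protects.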

Previously, a static dispatch would choose between the C implementation
of Keccak-F1600 or the assembly implementations (one scalar, one SIMD)
provided by OpenSSL. The C<->ASM interface was Keccak1600_Absorb and
Keccak1600_Squeeze.

This commit lowers the C<->ASM interface to the core Keccak permutation
itself; the Absorb/Squeeze assembly wrappers in keccak1600-armv8.pl
are removed accordingly.

Moreover, the commit integrates the Keccak-F1600 implementations from
s2n-bignum into the build and replaces the above static dispatch by
a runtime dispatch based on CPU detection / CPU capabilities:

1. If ASM is disabled, we use the C implementation.
2. If ASM is enabled:
  - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum,
    leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
  - For Arm-based Apple CPUs, we use Neon Keccak assembly from s2n-bignum,
    leveraging the AArch64 SHA3 extension.
  - Otherwise, fall back to scalar Keccak implementation from OpenSSL,
    not using lazy rotations.

Lazy rotations improve performance by roughly 7% to 13% in the
measurements below on CPUs with free barrel shifting, which includes
Neoverse N1, V1, and V2. Not all CPUs have free barrel shifting (e.g.
Apple M1 or Cortex-A72), so we don't use them by default.

Neoverse V1 and V2 do support SHA3 instructions, but they are only
implemented on 1/4 of Neon units, and are thus slower than a scalar
implementation.

Finally, since the Neon Keccak assembly from s2n-bignum is faster than
the Neon Keccak assembly from the OpenSSL implementation, the latter
is removed from keccak1600-armv8.pl, leaving only the scalar assembly
implementation for the core Keccak permutation.

Performance impact
------------------

* Apple M1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 71.5 MB/s  | 88.3 MB/s  | +16.8 MB/s | +23.5% |
|           | 256b  | 584.5 MB/s | 754.7 MB/s | +170.2 MB/s| +29.1% |
|           | 1350b | 633.8 MB/s | 815.2 MB/s | +181.4 MB/s| +28.6% |
|           | 8kb   | 694.4 MB/s | 872.4 MB/s | +178.0 MB/s| +25.6% |
|           | 16kb  | 696.9 MB/s | 864.8 MB/s | +167.9 MB/s| +24.1% |
| SHA3-256  | 16b   | 71.6 MB/s  | 88.6 MB/s  | +17.0 MB/s | +23.7% |
|           | 256b  | 600.8 MB/s | 759.0 MB/s | +158.2 MB/s| +26.3% |
|           | 1350b | 638.8 MB/s | 817.5 MB/s | +178.7 MB/s| +28.0% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s| +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s| +25.0% |
| SHA3-384  | 16b   | 71.9 MB/s  | 86.8 MB/s  | +14.9 MB/s | +20.7% |
|           | 256b  | 402.3 MB/s | 505.4 MB/s | +103.1 MB/s| +25.6% |
|           | 1350b | 493.1 MB/s | 636.0 MB/s | +142.9 MB/s| +29.0% |
|           | 8kb   | 507.3 MB/s | 639.7 MB/s | +132.4 MB/s| +26.1% |
|           | 16kb  | 507.2 MB/s | 626.2 MB/s | +119.0 MB/s| +23.5% |
| SHA3-512  | 16b   | 70.6 MB/s  | 89.2 MB/s  | +18.6 MB/s | +26.3% |
|           | 256b  | 305.7 MB/s | 390.8 MB/s | +85.1 MB/s | +27.8% |
|           | 1350b | 347.2 MB/s | 436.7 MB/s | +89.5 MB/s | +25.8% |
|           | 8kb   | 355.0 MB/s | 446.3 MB/s | +91.3 MB/s | +25.7% |
|           | 16kb  | 356.1 MB/s | 445.7 MB/s | +89.6 MB/s | +25.2% |
| SHAKE-128 | 16b   | 68.8 MB/s  | 87.4 MB/s  | +18.6 MB/s | +27.0% |
|           | 256b  | 572.2 MB/s | 747.5 MB/s | +175.3 MB/s| +30.6% |
|           | 1350b | 780.8 MB/s | 1016.4 MB/s| +235.6 MB/s| +30.2% |
|           | 8kb   | 932.8 MB/s | 1215.4 MB/s| +282.6 MB/s| +30.3% |
|           | 16kb  | 932.4 MB/s | 1215.9 MB/s| +283.5 MB/s| +30.4% |
| SHAKE-256 | 16b   | 69.0 MB/s  | 87.6 MB/s  | +18.6 MB/s | +27.0% |
|           | 256b  | 574.7 MB/s | 750.1 MB/s | +175.4 MB/s| +30.5% |
|           | 1350b | 629.4 MB/s | 817.0 MB/s | +187.6 MB/s| +29.8% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s| +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s| +25.0% |

* Neoverse-V2

| Algorithm | Size  |       Main |        New |       Gain |     % |
|:----------|:------|------------|------------|------------|------:|
| SHA3-224  | 16b   | 53.4 MB/s  | 56.9 MB/s  | +3.5 MB/s  | +6.7% |
|           | 256b  | 449.9 MB/s | 487.0 MB/s | +37.1 MB/s | +8.2% |
|           | 1350b | 500.0 MB/s | 541.5 MB/s | +41.5 MB/s | +8.3% |
|           | 8kb   | 537.9 MB/s | 585.3 MB/s | +47.4 MB/s | +8.8% |
|           | 16kb  | 530.7 MB/s | 577.5 MB/s | +46.8 MB/s | +8.8% |
| SHA3-256  | 16b   | 53.5 MB/s  | 57.2 MB/s  | +3.7 MB/s  | +7.0% |
|           | 256b  | 451.6 MB/s | 488.1 MB/s | +36.5 MB/s | +8.1% |
|           | 1350b | 500.1 MB/s | 542.0 MB/s | +41.9 MB/s | +8.4% |
|           | 8kb   | 503.0 MB/s | 546.9 MB/s | +43.9 MB/s | +8.7% |
|           | 16kb  | 500.2 MB/s | 544.9 MB/s | +44.7 MB/s | +8.9% |
| SHA3-384  | 16b   | 53.8 MB/s  | 57.7 MB/s  | +3.9 MB/s  | +7.2% |
|           | 256b  | 306.9 MB/s | 333.3 MB/s | +26.4 MB/s | +8.6% |
|           | 1350b | 386.6 MB/s | 420.5 MB/s | +33.9 MB/s | +8.8% |
|           | 8kb   | 389.9 MB/s | 424.5 MB/s | +34.6 MB/s | +8.9% |
|           | 16kb  | 384.9 MB/s | 420.1 MB/s | +35.2 MB/s | +9.1% |
| SHA3-512  | 16b   | 53.4 MB/s  | 57.8 MB/s  | +4.4 MB/s  | +8.3% |
|           | 256b  | 233.5 MB/s | 254.0 MB/s | +20.5 MB/s | +8.8% |
|           | 1350b | 266.7 MB/s | 290.2 MB/s | +23.5 MB/s | +8.8% |
|           | 8kb   | 271.9 MB/s | 295.8 MB/s | +23.9 MB/s | +8.8% |
|           | 16kb  | 268.7 MB/s | 292.7 MB/s | +24.0 MB/s | +8.9% |
| SHAKE-128 | 16b   | 49.6 MB/s  | 53.1 MB/s  | +3.5 MB/s  | +7.0% |
|           | 256b  | 432.9 MB/s | 468.0 MB/s | +35.1 MB/s | +8.1% |
|           | 1350b | 547.5 MB/s | 592.5 MB/s | +45.0 MB/s | +8.2% |
|           | 8kb   | 621.6 MB/s | 676.1 MB/s | +54.5 MB/s | +8.8% |
|           | 16kb  | 613.4 MB/s | 667.7 MB/s | +54.3 MB/s | +8.9% |
| SHAKE-256 | 16b   | 49.7 MB/s  | 53.2 MB/s  | +3.5 MB/s  | +7.2% |
|           | 256b  | 432.9 MB/s | 469.1 MB/s | +36.2 MB/s | +8.4% |
|           | 1350b | 494.6 MB/s | 537.9 MB/s | +43.3 MB/s | +8.8% |
|           | 8kb   | 502.3 MB/s | 546.6 MB/s | +44.3 MB/s | +8.8% |
|           | 16kb  | 499.6 MB/s | 545.2 MB/s | +45.6 MB/s | +9.1% |

* Neoverse-N1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 32.7 MB/s  | 36.5 MB/s  | +3.8 MB/s  | +11.7% |
|           | 256b  | 277.2 MB/s | 311.2 MB/s | +34.0 MB/s | +12.3% |
|           | 1350b | 309.5 MB/s | 347.2 MB/s | +37.7 MB/s | +12.2% |
|           | 8kb   | 334.4 MB/s | 375.8 MB/s | +41.4 MB/s | +12.4% |
|           | 16kb  | 331.1 MB/s | 372.5 MB/s | +41.4 MB/s | +12.5% |
| SHA3-256  | 16b   | 33.0 MB/s  | 36.8 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 279.4 MB/s | 312.5 MB/s | +33.1 MB/s | +11.9% |
|           | 1350b | 310.0 MB/s | 348.3 MB/s | +38.3 MB/s | +12.4% |
|           | 8kb   | 312.8 MB/s | 352.1 MB/s | +39.3 MB/s | +12.6% |
|           | 16kb  | 312.4 MB/s | 353.0 MB/s | +40.6 MB/s | +13.0% |
| SHA3-384  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 190.7 MB/s | 214.1 MB/s | +23.4 MB/s | +12.3% |
|           | 1350b | 240.2 MB/s | 269.9 MB/s | +29.7 MB/s | +12.4% |
|           | 8kb   | 242.7 MB/s | 273.2 MB/s | +30.5 MB/s | +12.6% |
|           | 16kb  | 240.4 MB/s | 271.7 MB/s | +31.3 MB/s | +13.0% |
| SHA3-512  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.6% |
|           | 256b  | 145.1 MB/s | 162.8 MB/s | +17.7 MB/s | +12.2% |
|           | 1350b | 165.7 MB/s | 186.2 MB/s | +20.5 MB/s | +12.3% |
|           | 8kb   | 169.1 MB/s | 190.0 MB/s | +20.9 MB/s | +12.4% |
|           | 16kb  | 167.5 MB/s | 189.2 MB/s | +21.7 MB/s | +13.0% |
| SHAKE-128 | 16b   | 30.3 MB/s  | 33.6 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 263.7 MB/s | 293.2 MB/s | +29.5 MB/s | +11.2% |
|           | 1350b | 338.2 MB/s | 379.4 MB/s | +41.2 MB/s | +12.2% |
|           | 8kb   | 387.2 MB/s | 435.4 MB/s | +48.2 MB/s | +12.5% |
|           | 16kb  | 383.6 MB/s | 432.9 MB/s | +49.3 MB/s | +12.9% |
| SHAKE-256 | 16b   | 30.5 MB/s  | 33.8 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 264.9 MB/s | 294.5 MB/s | +29.6 MB/s | +11.2% |
|           | 1350b | 306.5 MB/s | 344.1 MB/s | +37.6 MB/s | +12.3% |
|           | 8kb   | 312.0 MB/s | 351.5 MB/s | +39.5 MB/s | +12.7% |
|           | 16kb  | 312.1 MB/s | 352.7 MB/s | +40.6 MB/s | +13.0% |

* Neoverse-V1

| Algorithm | Size  |       Main |        New |       Gain |      % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 45.6 MB/s  | 49.3 MB/s  | +3.7 MB/s  |  +8.2% |
|           | 256b  | 382.6 MB/s | 419.4 MB/s | +36.8 MB/s |  +9.6% |
|           | 1350b | 422.5 MB/s | 464.0 MB/s | +41.5 MB/s |  +9.8% |
|           | 8kb   | 454.9 MB/s | 500.5 MB/s | +45.6 MB/s | +10.0% |
|           | 16kb  | 449.2 MB/s | 495.5 MB/s | +46.3 MB/s | +10.3% |
| SHA3-256  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  |  +8.5% |
|           | 256b  | 383.2 MB/s | 420.8 MB/s | +37.6 MB/s |  +9.8% |
|           | 1350b | 422.8 MB/s | 464.4 MB/s | +41.6 MB/s |  +9.8% |
|           | 8kb   | 425.8 MB/s | 467.7 MB/s | +41.9 MB/s |  +9.8% |
|           | 16kb  | 424.0 MB/s | 468.1 MB/s | +44.1 MB/s | +10.4% |
| SHA3-384  | 16b   | 45.7 MB/s  | 49.7 MB/s  | +4.0 MB/s  |  +8.7% |
|           | 256b  | 261.3 MB/s | 284.5 MB/s | +23.2 MB/s |  +8.9% |
|           | 1350b | 327.8 MB/s | 359.6 MB/s | +31.8 MB/s |  +9.7% |
|           | 8kb   | 330.5 MB/s | 362.7 MB/s | +32.2 MB/s |  +9.8% |
|           | 16kb  | 326.3 MB/s | 360.6 MB/s | +34.3 MB/s | +10.5% |
| SHA3-512  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  |  +8.3% |
|           | 256b  | 198.4 MB/s | 216.7 MB/s | +18.3 MB/s |  +9.2% |
|           | 1350b | 226.1 MB/s | 247.5 MB/s | +21.4 MB/s |  +9.5% |
|           | 8kb   | 230.2 MB/s | 252.0 MB/s | +21.8 MB/s |  +9.4% |
|           | 16kb  | 227.7 MB/s | 250.3 MB/s | +22.6 MB/s |  +9.9% |
| SHAKE-128 | 16b   | 42.1 MB/s  | 45.8 MB/s  | +3.7 MB/s  |  +8.9% |
|           | 256b  | 366.4 MB/s | 402.3 MB/s | +35.9 MB/s |  +9.8% |
|           | 1350b | 463.5 MB/s | 508.8 MB/s | +45.3 MB/s |  +9.8% |
|           | 8kb   | 525.7 MB/s | 580.0 MB/s | +54.3 MB/s | +10.3% |
|           | 16kb  | 519.4 MB/s | 574.4 MB/s | +55.0 MB/s | +10.6% |
| SHAKE-256 | 16b   | 42.3 MB/s  | 46.0 MB/s  | +3.7 MB/s  |  +8.8% |
|           | 256b  | 367.6 MB/s | 404.2 MB/s | +36.6 MB/s |  +9.9% |
|           | 1350b | 418.8 MB/s | 459.9 MB/s | +41.1 MB/s |  +9.8% |
|           | 8kb   | 425.1 MB/s | 466.9 MB/s | +41.8 MB/s |  +9.8% |
|           | 16kb  | 423.7 MB/s | 467.4 MB/s | +43.7 MB/s | +10.3% |

* Cortex-A72

| Algorithm | Size  |       Main |        New |       Gain |     % |
|:----------|:------|------------|------------|------------|------:|
| SHA3-224  | 16b   | 19.9 MB/s  | 19.6 MB/s  | -0.3 MB/s  | -1.2% |
|           | 256b  | 169.9 MB/s | 168.1 MB/s | -1.8 MB/s  | -1.0% |
|           | 1350b | 195.7 MB/s | 189.3 MB/s | -6.4 MB/s  | -3.3% |
|           | 8kb   | 211.7 MB/s | 204.8 MB/s | -6.9 MB/s  | -3.2% |
|           | 16kb  | 212.2 MB/s | 205.3 MB/s | -6.9 MB/s  | -3.2% |
| SHA3-256  | 16b   | 19.6 MB/s  | 19.7 MB/s  | +0.1 MB/s  | +0.6% |
|           | 256b  | 168.9 MB/s | 168.7 MB/s | -0.2 MB/s  | -0.1% |
|           | 1350b | 195.2 MB/s | 189.0 MB/s | -6.2 MB/s  | -3.2% |
|           | 8kb   | 198.6 MB/s | 191.8 MB/s | -6.8 MB/s  | -3.4% |
|           | 16kb  | 200.7 MB/s | 193.8 MB/s | -6.9 MB/s  | -3.4% |
| SHA3-384  | 16b   | 20.0 MB/s  | 19.8 MB/s  | -0.2 MB/s  | -0.9% |
|           | 256b  | 118.3 MB/s | 115.6 MB/s | -2.7 MB/s  | -2.3% |
|           | 1350b | 151.6 MB/s | 146.8 MB/s | -4.8 MB/s  | -3.2% |
|           | 8kb   | 154.2 MB/s | 148.9 MB/s | -5.3 MB/s  | -3.4% |
|           | 16kb  | 154.5 MB/s | 149.1 MB/s | -5.4 MB/s  | -3.5% |
| SHA3-512  | 16b   | 20.0 MB/s  | 19.7 MB/s  | -0.3 MB/s  | -1.5% |
|           | 256b  | 90.2 MB/s  | 87.8 MB/s  | -2.4 MB/s  | -2.6% |
|           | 1350b | 104.9 MB/s | 100.6 MB/s | -4.3 MB/s  | -4.1% |
|           | 8kb   | 107.4 MB/s | 102.7 MB/s | -4.7 MB/s  | -4.3% |
|           | 16kb  | 107.5 MB/s | 102.9 MB/s | -4.6 MB/s  | -4.3% |
| SHAKE-128 | 16b   | 16.8 MB/s  | 17.7 MB/s  | +0.9 MB/s  | +5.0% |
|           | 256b  | 157.2 MB/s | 159.2 MB/s | +2.0 MB/s  | +1.3% |
|           | 1350b | 211.4 MB/s | 206.0 MB/s | -5.4 MB/s  | -2.6% |
|           | 8kb   | 245.1 MB/s | 236.1 MB/s | -9.0 MB/s  | -3.7% |
|           | 16kb  | 245.9 MB/s | 237.6 MB/s | -8.3 MB/s  | -3.4% |
| SHAKE-256 | 16b   | 17.6 MB/s  | 17.8 MB/s  | +0.2 MB/s  | +1.3% |
|           | 256b  | 158.9 MB/s | 158.1 MB/s | -0.8 MB/s  | -0.5% |
|           | 1350b | 192.5 MB/s | 186.9 MB/s | -5.6 MB/s  | -3.0% |
|           | 8kb   | 198.0 MB/s | 191.1 MB/s | -6.9 MB/s  | -3.5% |
|           | 16kb  | 200.4 MB/s | 193.2 MB/s | -7.2 MB/s  | -3.6% |

Signed-off-by: Hanno Becker <[email protected]>
The `check_compiler` macro in the root CMakeLists.txt can be
used to try-compile a C file and set a C preprocessor directive
upon success.

While sufficient for the current uses from the root CMakeLists.txt,
it has some limitations:
- It cannot be called from CMakeLists.txt files in subdirectories
  without creating the `tests/compiler_features_tests/...` directory
  in the subdirectory of that CMake file.
- It does not allow setting a CMake variable indicating success/failure
  of compilation, for later reference. (The code uses the default 'RESULT',
  but that may be overwritten by other calls and is thus not suitable
  for later reference).
- It does not allow specifying additional CFLAGS for the attempted
  compilation.

This commit fixes those issues, in the following way:
- It allows check_compiler to be called from sub-CMakeLists.txt
  while still referring to tests/compiler_features_tests in the
  root AWS-LC directory.
- It always stores the result of the attempted compilation in a
  CMake variable _of the same name_ as the preprocessor define.
  In principle, this could be generalized, but it seems unnecessary,
  and there is already precedent for using the same names for
  preprocessor directives and CMake variables
  (e.g. MY_ASSEMBLER_IS_TOO_OLD_FOR_AVX).
- It interprets additional arguments to `check_compiler` as
  additional CFLAGS. This can be omitted, and hence existing
  calls to `check_compiler` need not be changed.

Signed-off-by: Hanno Becker <[email protected]>