Integrate formally verified AArch64 Keccak-x1 assembly from s2n-bignum/mlkem-native #2539
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #2539 +/- ##
=======================================
Coverage 78.72% 78.73%
=======================================
Files 645 645
Lines 110641 110644 +3
Branches 15648 15654 +6
=======================================
+ Hits 87105 87117 +12
+ Misses 22835 22828 -7
+ Partials 701 699 -2
This commit imports AArch64 assembly for two implementations of the Keccak-F1600 permutation from s2n-bignum.

The first implementation leverages the 'lazy rotation' technique described in [1] to accelerate scalar Keccak computations on AArch64 CPUs with free Barrel shifting (that is, where Barrel shifted instructions have the same performance characteristics as unshifted ones). Notable examples are Neoverse N1, V1 and V2. Notable non-examples are Cortex-A72 and Apple M1; on those CPUs, the existing scalar assembly from OpenSSL is faster.

This commit does not yet integrate the assembly into AWS-LC.

[1]: https://eprint.iacr.org/2022/1243 Hybrid scalar/vector implementations of Keccak and SPHINCS+ on AArch64

Signed-off-by: Hanno Becker <[email protected]>
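To illustrate the lazy rotation idea referenced in the commit message, here is a minimal C sketch. It only demonstrates the underlying principle: a lane's rotation can be recorded instead of performed, and applied later as part of the instruction that consumes the lane. On AArch64, logical instructions such as EOR and BIC accept a rotated second operand (`ror #n`), which is what makes deferring the rotation attractive on cores where that shifted operand is free. The names `lazy_lane`, `rho_lazy`, and `chi_lazy` are invented for this example and are not taken from s2n-bignum, whose implementation is hand-written assembly; whether a C compiler actually folds the rotations is not guaranteed.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by r bits (r is reduced modulo 64). */
static inline uint64_t rotl64(uint64_t x, unsigned r) {
  r &= 63;
  return r ? (x << r) | (x >> (64 - r)) : x;
}

/* Hypothetical helper type: a lane whose rotation has been deferred.
 * The logical value of the lane is rotl64(raw, pending). */
typedef struct {
  uint64_t raw;
  unsigned pending;
} lazy_lane;

/* Eager rho-style step: perform the rotation immediately. */
static inline uint64_t rho_eager(uint64_t lane, unsigned r) {
  return rotl64(lane, r);
}

/* Lazy rho-style step: only record the rotation amount. */
static inline lazy_lane rho_lazy(uint64_t lane, unsigned r) {
  lazy_lane l = {lane, r & 63};
  return l;
}

/* chi-style combination a ^ (~b & c) on already-rotated lanes. */
static inline uint64_t chi_eager(uint64_t a, uint64_t b, uint64_t c) {
  return a ^ (~b & c);
}

/* Same combination, but the pending rotations are applied at the point of
 * use, where an AArch64 assembler could express them as rotated operands. */
static inline uint64_t chi_lazy(lazy_lane a, lazy_lane b, lazy_lane c) {
  return rotl64(a.raw, a.pending) ^
         (~rotl64(b.raw, b.pending) & rotl64(c.raw, c.pending));
}
```

Both variants compute the same value; the difference is only where the rotation happens, which is the property the lazy rotation technique relies on.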
crypto/fipsmodule/CMakeLists.txt
Outdated
include(CheckCSourceCompiles)
set(CMAKE_REQUIRED_FLAGS_BACKUP "${CMAKE_REQUIRED_FLAGS}")
set(CMAKE_REQUIRED_FLAGS "-march=armv8.4-a+sha3")
check_c_source_compiles("
I think this works. However, the rest of the code implements compiler feature probes under https://github.com/aws/aws-lc/tree/main/tests/compiler_features_tests and uses check_compiler. This probe should do the same.
This required a bit of rework on check_compiler, but I think it's working now. Please take a look at fd56682.
Tested this locally and in a Docker image with gcc7 (no SHA3 support). The output is:
-- neon_sha3_check.c probe is negative, NOT enabling MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION:
-- Change Dir: /workspace/build/CMakeFiles/CMakeTmp
Run Build Command:"/usr/bin/make" "cmTC_128bc/fast"
/usr/bin/make -f CMakeFiles/cmTC_128bc.dir/build.make CMakeFiles/cmTC_128bc.dir/build
make[1]: Entering directory '/workspace/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o
/usr/bin/cc -Wredundant-decls -Wextra -Wunused -Wcomment -Wchar-subscripts -Wuninitialized -Wshadow -Wwrite-strings -Wformat-security -Wunused-result -Wno-overlength-strings -Wall -fvisibility=hidden -fno-common -Wno-c11-extensions -Wvla -Wtype-limits -Wno-unused-parameter -Werror -Wformat=2 -Wsign-compare -Wmissing-field-initializers -Wwrite-strings -Wno-free-nonheap-object -Wmissing-braces -Wimplicit-fallthrough -Wformat-signedness -Wmissing-prototypes -Wold-style-definition -Wstrict-prototypes -DAWS_LC_STDALIGN_AVAILABLE -DAWS_LC_BUILTIN_SWAP_SUPPORTED -Wshadow -D_XOPEN_SOURCE=700 -fPIE -Werror -march=armv8.4-a+sha3 -o CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o -c /workspace/tests/compiler_features_tests/neon_sha3_check.c
cc1: compiler_error: unknown value 'armv8.4-a+sha3' for -march
cc1: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a did you mean 'armv8.1-a'?
cc1: compiler_error: unrecognized command line option '-Wno-c11-extensions' [-Werror]
cc1: all warnings being treated as errors
CMakeFiles/cmTC_128bc.dir/build.make:65: recipe for target 'CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o' failed
make[1]: *** [CMakeFiles/cmTC_128bc.dir/neon_sha3_check.c.o] Error 1
make[1]: Leaving directory '/workspace/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_128bc/fast' failed
make: *** [cmTC_128bc/fast] Error 2
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Generating test executable mem_test.
-- Generating test executable mem_set_test.
-- Generating test executable dynamic_loading_test.
-- Generating test executable rwlock_static_init.
-- Installing: /workspace/build/tool-openssl/c_rehash
-- Installing: /workspace/build/tool-openssl/c_rehash_test
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/build
which seems right (check_compiler does print the error output).
@@ -83,6 +83,9 @@ void OPENSSL_cpuid_setup(void) {
// Check if the CPU model is Neoverse V1 or V2,
outdated comment
Removed.
crypto/fipsmodule/sha/keccak1600.c
Outdated
void KeccakF1600(uint64_t A[KECCAK1600_ROWS][KECCAK1600_ROWS]) {
#if defined(KECCAK1600_S2N_BIGNUM_ASM)
#if defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
Are there other ways to arrange this that remove one or two levels of #if/#ifdef? At this point we are three levels deep.
For other s2n-bignum integrations, we would typically stub out any missing functions.
I understand it's hard to always pass the s2n-bignum files into the build if the asm implementation uses the sha3 instructions themselves rather than their actual encodings.
I'm not sure how to get rid of the #defines altogether, but I'm happy to take concrete suggestions. For now, I think the following is already a bit better:
void KeccakF1600(uint64_t A[KECCAK1600_ROWS][KECCAK1600_ROWS]) {
#if defined(KECCAK1600_S2N_BIGNUM_ASM) && defined(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
  if (keccak_use_s2n_bignum_alt()) {
    sha3_keccak_f1600_alt((uint64_t *)A, iotas);
    return;
  }
#endif  // KECCAK1600_S2N_BIGNUM_ASM && MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION
#if defined(KECCAK1600_S2N_BIGNUM_ASM)
  if (keccak_use_s2n_bignum_main()) {
    sha3_keccak_f1600((uint64_t *)A, iotas);
    return;
  }
#endif  // KECCAK1600_S2N_BIGNUM_ASM
  KeccakF1600_hw((uint64_t *)A);
}
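For comparison, the stubbing approach mentioned in the review comment could look roughly like the sketch below. It is an illustration only and not the code in this PR: the macro `EXAMPLE_HAVE_FAST_ASM` and all `example_*` functions are hypothetical, and the fallback is a placeholder that is not a Keccak permutation. The idea is that when the assembly is unavailable, a stub with the same signature is defined together with an availability predicate that returns 0, so the dispatcher itself needs no preprocessor conditionals.

```c
#include <stdint.h>

/* Hypothetical feature macro standing in for the real build-time defines. */
#if defined(EXAMPLE_HAVE_FAST_ASM)
/* Would be provided by an assembly file in a real build. */
void example_fast_permute(uint64_t state[25]);
static inline int example_fast_available(void) { return 1; }
#else
/* Stub: same signature, never called because the predicate returns 0. */
static inline void example_fast_permute(uint64_t state[25]) { (void)state; }
static inline int example_fast_available(void) { return 0; }
#endif

/* Placeholder for the always-available fallback. This is NOT Keccak;
 * it only marks the state so the chosen path is observable. */
static inline void example_fallback_permute(uint64_t state[25]) {
  state[0] ^= 1;
}

/* Dispatcher without any #if: the conditionals live next to the stubs. */
static inline void example_permute(uint64_t state[25]) {
  if (example_fast_available()) {
    example_fast_permute(state);
    return;
  }
  example_fallback_permute(state);
}
```

Because the predicate is a compile-time constant in the stubbed configuration, a compiler can typically drop the dead branch, so the stub costs nothing at run time.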
// Scalar implementation from OpenSSL provided by keccak1600-armv8.pl
extern void KeccakF1600_hw(uint64_t state[25]);
OPENSSL_STATIC_ASSERT(KECCAK1600_ROWS * KECCAK1600_ROWS == 25, unexpected_array_size_for_A)
Done
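The purpose of the suggested assertion can be shown with a small standalone sketch. It uses C11 `_Static_assert` as a stand-in for AWS-LC's `OPENSSL_STATIC_ASSERT`, and `EXAMPLE_ROWS` and `example_flatten` are hypothetical names introduced here. The assertion guards the assumption that the two-dimensional state handed to `KeccakF1600` has exactly 25 lanes, which is what a function declared with `uint64_t state[25]` expects after the cast.

```c
#include <stdint.h>

/* Stand-in for KECCAK1600_ROWS; hypothetical name for this example. */
#define EXAMPLE_ROWS 5

/* Fails at compile time if the 5x5 state no longer has 25 lanes. */
_Static_assert(EXAMPLE_ROWS * EXAMPLE_ROWS == 25,
               "unexpected array size for A");

/* View the 2D state as a flat array of lanes, mirroring the cast used
 * when passing A to a function that takes uint64_t state[25]. */
static inline uint64_t *example_flatten(
    uint64_t A[EXAMPLE_ROWS][EXAMPLE_ROWS]) {
  return (uint64_t *)A;
}
```

With this layout, lane `A[y][x]` is found at flat index `y * EXAMPLE_ROWS + x`.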
Previously, a static dispatch would choose between the C implementation of Keccak-F1600 or the assembly implementations (one scalar, one SIMD) provided by OpenSSL. The C<->ASM interface was Keccak1600_Absorb and Keccak1600_Squeeze.

This commit lowers the C<->ASM interface to the core Keccak permutation itself; the Absorb/Squeeze assembly wrappers in keccak1600-armv8.pl are removed accordingly.

Moreover, the commit integrates the Keccak-F1600 implementations from s2n-bignum into the build and replaces the above static dispatch by a runtime dispatch based on CPU detection / CPU capabilities:

1. If ASM is disabled, we use the C implementation.
2. If ASM is enabled:
   - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum, leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
   - For Arm-based Apple CPUs, we use Neon Keccak assembly from s2n-bignum, leveraging the AArch64 SHA3 extension.
   - Otherwise, fall back to scalar Keccak implementation from OpenSSL, not using lazy rotations.

Lazy rotations improve performance by up to 10% on CPUs with free Barrel shifting, which includes Neoverse N1, V1, and V2. Not all CPUs have free Barrel shifting (e.g. Apple M1 or Cortex-A72), so we don't use it by default.

Neoverse V1 and V2 do support SHA3 instructions, but they are only implemented on 1/4 of Neon units, and are thus slower than a scalar implementation.

Finally, since the Neon Keccak assembly from s2n-bignum is faster than the Neon Keccak assembly from the OpenSSL implementation, the latter is removed from keccak1600-armv8.pl, leaving only the scalar assembly implementation for the core Keccak permutation.
Performance impact
------------------

* Apple M1

| Algorithm | Size  | Main       | New         | Gain        | %      |
|:----------|:------|------------|-------------|-------------|-------:|
| SHA3-224  | 16b   | 71.5 MB/s  | 88.3 MB/s   | +16.8 MB/s  | +23.5% |
|           | 256b  | 584.5 MB/s | 754.7 MB/s  | +170.2 MB/s | +29.1% |
|           | 1350b | 633.8 MB/s | 815.2 MB/s  | +181.4 MB/s | +28.6% |
|           | 8kb   | 694.4 MB/s | 872.4 MB/s  | +178.0 MB/s | +25.6% |
|           | 16kb  | 696.9 MB/s | 864.8 MB/s  | +167.9 MB/s | +24.1% |
| SHA3-256  | 16b   | 71.6 MB/s  | 88.6 MB/s   | +17.0 MB/s  | +23.7% |
|           | 256b  | 600.8 MB/s | 759.0 MB/s  | +158.2 MB/s | +26.3% |
|           | 1350b | 638.8 MB/s | 817.5 MB/s  | +178.7 MB/s | +28.0% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s  | +168.2 MB/s | +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s  | +164.9 MB/s | +25.0% |
| SHA3-384  | 16b   | 71.9 MB/s  | 86.8 MB/s   | +14.9 MB/s  | +20.7% |
|           | 256b  | 402.3 MB/s | 505.4 MB/s  | +103.1 MB/s | +25.6% |
|           | 1350b | 493.1 MB/s | 636.0 MB/s  | +142.9 MB/s | +29.0% |
|           | 8kb   | 507.3 MB/s | 639.7 MB/s  | +132.4 MB/s | +26.1% |
|           | 16kb  | 507.2 MB/s | 626.2 MB/s  | +119.0 MB/s | +23.5% |
| SHA3-512  | 16b   | 70.6 MB/s  | 89.2 MB/s   | +18.6 MB/s  | +26.3% |
|           | 256b  | 305.7 MB/s | 390.8 MB/s  | +85.1 MB/s  | +27.8% |
|           | 1350b | 347.2 MB/s | 436.7 MB/s  | +89.5 MB/s  | +25.8% |
|           | 8kb   | 355.0 MB/s | 446.3 MB/s  | +91.3 MB/s  | +25.7% |
|           | 16kb  | 356.1 MB/s | 445.7 MB/s  | +89.6 MB/s  | +25.2% |
| SHAKE-128 | 16b   | 68.8 MB/s  | 87.4 MB/s   | +18.6 MB/s  | +27.0% |
|           | 256b  | 572.2 MB/s | 747.5 MB/s  | +175.3 MB/s | +30.6% |
|           | 1350b | 780.8 MB/s | 1016.4 MB/s | +235.6 MB/s | +30.2% |
|           | 8kb   | 932.8 MB/s | 1215.4 MB/s | +282.6 MB/s | +30.3% |
|           | 16kb  | 932.4 MB/s | 1215.9 MB/s | +283.5 MB/s | +30.4% |
| SHAKE-256 | 16b   | 69.0 MB/s  | 87.6 MB/s   | +18.6 MB/s  | +27.0% |
|           | 256b  | 574.7 MB/s | 750.1 MB/s  | +175.4 MB/s | +30.5% |
|           | 1350b | 629.4 MB/s | 817.0 MB/s  | +187.6 MB/s | +29.8% |
|           | 8kb   | 652.3 MB/s | 820.5 MB/s  | +168.2 MB/s | +25.8% |
|           | 16kb  | 658.9 MB/s | 823.8 MB/s  | +164.9 MB/s | +25.0% |

* Neoverse-V2

| Algorithm | Size  | Main       | New        | Gain       | %     |
|:----------|:------|------------|------------|------------|------:|
| SHA3-224  | 16b   | 53.4 MB/s  | 56.9 MB/s  | +3.5 MB/s  | +6.7% |
|           | 256b  | 449.9 MB/s | 487.0 MB/s | +37.1 MB/s | +8.2% |
|           | 1350b | 500.0 MB/s | 541.5 MB/s | +41.5 MB/s | +8.3% |
|           | 8kb   | 537.9 MB/s | 585.3 MB/s | +47.4 MB/s | +8.8% |
|           | 16kb  | 530.7 MB/s | 577.5 MB/s | +46.8 MB/s | +8.8% |
| SHA3-256  | 16b   | 53.5 MB/s  | 57.2 MB/s  | +3.7 MB/s  | +7.0% |
|           | 256b  | 451.6 MB/s | 488.1 MB/s | +36.5 MB/s | +8.1% |
|           | 1350b | 500.1 MB/s | 542.0 MB/s | +41.9 MB/s | +8.4% |
|           | 8kb   | 503.0 MB/s | 546.9 MB/s | +43.9 MB/s | +8.7% |
|           | 16kb  | 500.2 MB/s | 544.9 MB/s | +44.7 MB/s | +8.9% |
| SHA3-384  | 16b   | 53.8 MB/s  | 57.7 MB/s  | +3.9 MB/s  | +7.2% |
|           | 256b  | 306.9 MB/s | 333.3 MB/s | +26.4 MB/s | +8.6% |
|           | 1350b | 386.6 MB/s | 420.5 MB/s | +33.9 MB/s | +8.8% |
|           | 8kb   | 389.9 MB/s | 424.5 MB/s | +34.6 MB/s | +8.9% |
|           | 16kb  | 384.9 MB/s | 420.1 MB/s | +35.2 MB/s | +9.1% |
| SHA3-512  | 16b   | 53.4 MB/s  | 57.8 MB/s  | +4.4 MB/s  | +8.3% |
|           | 256b  | 233.5 MB/s | 254.0 MB/s | +20.5 MB/s | +8.8% |
|           | 1350b | 266.7 MB/s | 290.2 MB/s | +23.5 MB/s | +8.8% |
|           | 8kb   | 271.9 MB/s | 295.8 MB/s | +23.9 MB/s | +8.8% |
|           | 16kb  | 268.7 MB/s | 292.7 MB/s | +24.0 MB/s | +8.9% |
| SHAKE-128 | 16b   | 49.6 MB/s  | 53.1 MB/s  | +3.5 MB/s  | +7.0% |
|           | 256b  | 432.9 MB/s | 468.0 MB/s | +35.1 MB/s | +8.1% |
|           | 1350b | 547.5 MB/s | 592.5 MB/s | +45.0 MB/s | +8.2% |
|           | 8kb   | 621.6 MB/s | 676.1 MB/s | +54.5 MB/s | +8.8% |
|           | 16kb  | 613.4 MB/s | 667.7 MB/s | +54.3 MB/s | +8.9% |
| SHAKE-256 | 16b   | 49.7 MB/s  | 53.2 MB/s  | +3.5 MB/s  | +7.2% |
|           | 256b  | 432.9 MB/s | 469.1 MB/s | +36.2 MB/s | +8.4% |
|           | 1350b | 494.6 MB/s | 537.9 MB/s | +43.3 MB/s | +8.8% |
|           | 8kb   | 502.3 MB/s | 546.6 MB/s | +44.3 MB/s | +8.8% |
|           | 16kb  | 499.6 MB/s | 545.2 MB/s | +45.6 MB/s | +9.1% |

* Neoverse-N1

| Algorithm | Size  | Main       | New        | Gain       | %      |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 32.7 MB/s  | 36.5 MB/s  | +3.8 MB/s  | +11.7% |
|           | 256b  | 277.2 MB/s | 311.2 MB/s | +34.0 MB/s | +12.3% |
|           | 1350b | 309.5 MB/s | 347.2 MB/s | +37.7 MB/s | +12.2% |
|           | 8kb   | 334.4 MB/s | 375.8 MB/s | +41.4 MB/s | +12.4% |
|           | 16kb  | 331.1 MB/s | 372.5 MB/s | +41.4 MB/s | +12.5% |
| SHA3-256  | 16b   | 33.0 MB/s  | 36.8 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 279.4 MB/s | 312.5 MB/s | +33.1 MB/s | +11.9% |
|           | 1350b | 310.0 MB/s | 348.3 MB/s | +38.3 MB/s | +12.4% |
|           | 8kb   | 312.8 MB/s | 352.1 MB/s | +39.3 MB/s | +12.6% |
|           | 16kb  | 312.4 MB/s | 353.0 MB/s | +40.6 MB/s | +13.0% |
| SHA3-384  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.5% |
|           | 256b  | 190.7 MB/s | 214.1 MB/s | +23.4 MB/s | +12.3% |
|           | 1350b | 240.2 MB/s | 269.9 MB/s | +29.7 MB/s | +12.4% |
|           | 8kb   | 242.7 MB/s | 273.2 MB/s | +30.5 MB/s | +12.6% |
|           | 16kb  | 240.4 MB/s | 271.7 MB/s | +31.3 MB/s | +13.0% |
| SHA3-512  | 16b   | 33.1 MB/s  | 36.9 MB/s  | +3.8 MB/s  | +11.6% |
|           | 256b  | 145.1 MB/s | 162.8 MB/s | +17.7 MB/s | +12.2% |
|           | 1350b | 165.7 MB/s | 186.2 MB/s | +20.5 MB/s | +12.3% |
|           | 8kb   | 169.1 MB/s | 190.0 MB/s | +20.9 MB/s | +12.4% |
|           | 16kb  | 167.5 MB/s | 189.2 MB/s | +21.7 MB/s | +13.0% |
| SHAKE-128 | 16b   | 30.3 MB/s  | 33.6 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 263.7 MB/s | 293.2 MB/s | +29.5 MB/s | +11.2% |
|           | 1350b | 338.2 MB/s | 379.4 MB/s | +41.2 MB/s | +12.2% |
|           | 8kb   | 387.2 MB/s | 435.4 MB/s | +48.2 MB/s | +12.5% |
|           | 16kb  | 383.6 MB/s | 432.9 MB/s | +49.3 MB/s | +12.9% |
| SHAKE-256 | 16b   | 30.5 MB/s  | 33.8 MB/s  | +3.3 MB/s  | +10.9% |
|           | 256b  | 264.9 MB/s | 294.5 MB/s | +29.6 MB/s | +11.2% |
|           | 1350b | 306.5 MB/s | 344.1 MB/s | +37.6 MB/s | +12.3% |
|           | 8kb   | 312.0 MB/s | 351.5 MB/s | +39.5 MB/s | +12.7% |
|           | 16kb  | 312.1 MB/s | 352.7 MB/s | +40.6 MB/s | +13.0% |

* Neoverse-V1

| Algorithm | Size  | Main       | New        | Gain       | %      |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224  | 16b   | 45.6 MB/s  | 49.3 MB/s  | +3.7 MB/s  | +8.2%  |
|           | 256b  | 382.6 MB/s | 419.4 MB/s | +36.8 MB/s | +9.6%  |
|           | 1350b | 422.5 MB/s | 464.0 MB/s | +41.5 MB/s | +9.8%  |
|           | 8kb   | 454.9 MB/s | 500.5 MB/s | +45.6 MB/s | +10.0% |
|           | 16kb  | 449.2 MB/s | 495.5 MB/s | +46.3 MB/s | +10.3% |
| SHA3-256  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  | +8.5%  |
|           | 256b  | 383.2 MB/s | 420.8 MB/s | +37.6 MB/s | +9.8%  |
|           | 1350b | 422.8 MB/s | 464.4 MB/s | +41.6 MB/s | +9.8%  |
|           | 8kb   | 425.8 MB/s | 467.7 MB/s | +41.9 MB/s | +9.8%  |
|           | 16kb  | 424.0 MB/s | 468.1 MB/s | +44.1 MB/s | +10.4% |
| SHA3-384  | 16b   | 45.7 MB/s  | 49.7 MB/s  | +4.0 MB/s  | +8.7%  |
|           | 256b  | 261.3 MB/s | 284.5 MB/s | +23.2 MB/s | +8.9%  |
|           | 1350b | 327.8 MB/s | 359.6 MB/s | +31.8 MB/s | +9.7%  |
|           | 8kb   | 330.5 MB/s | 362.7 MB/s | +32.2 MB/s | +9.8%  |
|           | 16kb  | 326.3 MB/s | 360.6 MB/s | +34.3 MB/s | +10.5% |
| SHA3-512  | 16b   | 45.7 MB/s  | 49.5 MB/s  | +3.8 MB/s  | +8.3%  |
|           | 256b  | 198.4 MB/s | 216.7 MB/s | +18.3 MB/s | +9.2%  |
|           | 1350b | 226.1 MB/s | 247.5 MB/s | +21.4 MB/s | +9.5%  |
|           | 8kb   | 230.2 MB/s | 252.0 MB/s | +21.8 MB/s | +9.4%  |
|           | 16kb  | 227.7 MB/s | 250.3 MB/s | +22.6 MB/s | +9.9%  |
| SHAKE-128 | 16b   | 42.1 MB/s  | 45.8 MB/s  | +3.7 MB/s  | +8.9%  |
|           | 256b  | 366.4 MB/s | 402.3 MB/s | +35.9 MB/s | +9.8%  |
|           | 1350b | 463.5 MB/s | 508.8 MB/s | +45.3 MB/s | +9.8%  |
|           | 8kb   | 525.7 MB/s | 580.0 MB/s | +54.3 MB/s | +10.3% |
|           | 16kb  | 519.4 MB/s | 574.4 MB/s | +55.0 MB/s | +10.6% |
| SHAKE-256 | 16b   | 42.3 MB/s  | 46.0 MB/s  | +3.7 MB/s  | +8.8%  |
|           | 256b  | 367.6 MB/s | 404.2 MB/s | +36.6 MB/s | +9.9%  |
|           | 1350b | 418.8 MB/s | 459.9 MB/s | +41.1 MB/s | +9.8%  |
|           | 8kb   | 425.1 MB/s | 466.9 MB/s | +41.8 MB/s | +9.8%  |
|           | 16kb  | 423.7 MB/s | 467.4 MB/s | +43.7 MB/s | +10.3% |

* Cortex-A72

| Algorithm | Size  | Main       | New        | Gain      | %     |
|:----------|:------|------------|------------|-----------|------:|
| SHA3-224  | 16b   | 19.9 MB/s  | 19.6 MB/s  | -0.3 MB/s | -1.2% |
|           | 256b  | 169.9 MB/s | 168.1 MB/s | -1.8 MB/s | -1.0% |
|           | 1350b | 195.7 MB/s | 189.3 MB/s | -6.4 MB/s | -3.3% |
|           | 8kb   | 211.7 MB/s | 204.8 MB/s | -6.9 MB/s | -3.2% |
|           | 16kb  | 212.2 MB/s | 205.3 MB/s | -6.9 MB/s | -3.2% |
| SHA3-256  | 16b   | 19.6 MB/s  | 19.7 MB/s  | +0.1 MB/s | +0.6% |
|           | 256b  | 168.9 MB/s | 168.7 MB/s | -0.2 MB/s | -0.1% |
|           | 1350b | 195.2 MB/s | 189.0 MB/s | -6.2 MB/s | -3.2% |
|           | 8kb   | 198.6 MB/s | 191.8 MB/s | -6.8 MB/s | -3.4% |
|           | 16kb  | 200.7 MB/s | 193.8 MB/s | -6.9 MB/s | -3.4% |
| SHA3-384  | 16b   | 20.0 MB/s  | 19.8 MB/s  | -0.2 MB/s | -0.9% |
|           | 256b  | 118.3 MB/s | 115.6 MB/s | -2.7 MB/s | -2.3% |
|           | 1350b | 151.6 MB/s | 146.8 MB/s | -4.8 MB/s | -3.2% |
|           | 8kb   | 154.2 MB/s | 148.9 MB/s | -5.3 MB/s | -3.4% |
|           | 16kb  | 154.5 MB/s | 149.1 MB/s | -5.4 MB/s | -3.5% |
| SHA3-512  | 16b   | 20.0 MB/s  | 19.7 MB/s  | -0.3 MB/s | -1.5% |
|           | 256b  | 90.2 MB/s  | 87.8 MB/s  | -2.4 MB/s | -2.6% |
|           | 1350b | 104.9 MB/s | 100.6 MB/s | -4.3 MB/s | -4.1% |
|           | 8kb   | 107.4 MB/s | 102.7 MB/s | -4.7 MB/s | -4.3% |
|           | 16kb  | 107.5 MB/s | 102.9 MB/s | -4.6 MB/s | -4.3% |
| SHAKE-128 | 16b   | 16.8 MB/s  | 17.7 MB/s  | +0.9 MB/s | +5.0% |
|           | 256b  | 157.2 MB/s | 159.2 MB/s | +2.0 MB/s | +1.3% |
|           | 1350b | 211.4 MB/s | 206.0 MB/s | -5.4 MB/s | -2.6% |
|           | 8kb   | 245.1 MB/s | 236.1 MB/s | -9.0 MB/s | -3.7% |
|           | 16kb  | 245.9 MB/s | 237.6 MB/s | -8.3 MB/s | -3.4% |
| SHAKE-256 | 16b   | 17.6 MB/s  | 17.8 MB/s  | +0.2 MB/s | +1.3% |
|           | 256b  | 158.9 MB/s | 158.1 MB/s | -0.8 MB/s | -0.5% |
|           | 1350b | 192.5 MB/s | 186.9 MB/s | -5.6 MB/s | -3.0% |
|           | 8kb   | 198.0 MB/s | 191.1 MB/s | -6.9 MB/s | -3.5% |
|           | 16kb  | 200.4 MB/s | 193.2 MB/s | -7.2 MB/s | -3.6% |

Signed-off-by: Hanno Becker <[email protected]>
The `check_compiler` macro in the root CMakeLists.txt can be used to try-compile a C file and set a C preprocessor directive upon success. While sufficient for the current uses from the root CMakeLists.txt, it has some limitations:

- It cannot be called from CMakeLists.txt files in subdirectories without creating the `tests/compiler_features_tests/...` directory in the subdirectory of that CMake file.
- It does not allow setting a CMake variable indicating success/failure of compilation, for later reference. (The code uses the default 'RESULT', but that may be overwritten by other calls and is thus not suitable for later reference.)
- It does not allow specifying additional CFLAGS for the attempted compilation.

This commit fixes those issues in the following way:

- It allows check_compiler to be called from sub-CMakeLists.txt while still referring to tests/compiler_features_tests in the root AWS-LC directory.
- It always stores the result of the attempted compilation in a CMake variable _of the same name_ as the preprocessor define. In principle, this could be generalized, but it seems unnecessary, and there is already precedent for using the same names for preprocessor directives and CMake variables (e.g. MY_ASSEMBLER_IS_TOO_OLD_FOR_AVX).
- It interprets additional arguments to `check_compiler` as additional CFLAGS. These can be omitted, and hence existing calls to `check_compiler` need not be changed.

Signed-off-by: Hanno Becker <[email protected]>
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.
NOTE: This PR integrates code that is not yet in s2n-bignum's main, see awslabs/s2n-bignum#238. This PR is ready for review, but ideally we wait with the merge until the Keccak code has been integrated into s2n-bignum, so that we can just re-import s2n-bignum using @torben-hansen's importer script.
Previously, a static dispatch would choose between the C implementation of Keccak-F1600 or the assembly implementations (one scalar, one SIMD) provided by OpenSSL. The C<->ASM interface was Keccak1600_Absorb and Keccak1600_Squeeze.
This PR lowers the C<->ASM interface to the core Keccak permutation itself; the Absorb/Squeeze assembly wrappers in keccak1600-armv8.pl are removed accordingly.
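Moreover, the PR integrates the Keccak-F1600 implementations from s2n-bignum/mlkem-native into the build and replaces the above static dispatch by a runtime dispatch based on CPU detection / CPU capabilities:
1. If ASM is disabled, we use the C implementation.
2. If ASM is enabled:
   - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum, leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
   - For Arm-based Apple CPUs, we use Neon Keccak assembly from s2n-bignum, leveraging the AArch64 SHA3 extension.
   - Otherwise, fall back to scalar Keccak implementation from OpenSSL, not using lazy rotations.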
Lazy rotations improve performance by up to 10% on CPUs with free Barrel shifting, which includes Neoverse N1, V1, and V2. Not all CPUs have free Barrel shifting (e.g. Apple M1 or Cortex-A72), so we don't use it by default.
Neoverse V1 and V2 do support SHA3 instructions, but they are only implemented on 1/4 of Neon units, and are thus slower than a scalar implementation.
Finally, while keccak1600-armv8.pl includes an implementation based on SHA3 instructions, this implementation was never used. It is now obsolete with the introduction of the verified SHA3-instruction-based implementation from s2n-bignum, and is removed from keccak1600-armv8.pl. This leaves only the scalar assembly implementation for the core Keccak permutation in keccak1600-armv8.pl.
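The dispatch policy described above can be summarized as a small decision function. The following C sketch is illustrative only: the enum, the struct, and `example_select_keccak_impl` are hypothetical names, and the real code performs these checks via AWS-LC's CPU capability detection and preprocessor guards rather than a struct of booleans.

```c
#include <stdbool.h>

/* Hypothetical labels for the implementations named in the description. */
typedef enum {
  EXAMPLE_KECCAK_C,               /* portable C implementation */
  EXAMPLE_KECCAK_S2N_SCALAR_LAZY, /* s2n-bignum scalar, lazy rotations */
  EXAMPLE_KECCAK_S2N_NEON_SHA3,   /* s2n-bignum Neon, SHA3 extension */
  EXAMPLE_KECCAK_OPENSSL_SCALAR   /* OpenSSL scalar assembly */
} example_keccak_impl;

/* Hypothetical summary of build configuration and detected CPU. */
typedef struct {
  bool asm_enabled;             /* assembly not disabled at build time */
  bool is_neoverse_n1_v1_v2;    /* CPU model is Neoverse N1, V1 or V2 */
  bool is_apple_arm;            /* Arm-based Apple CPU */
  bool assembler_supports_sha3; /* build-time probe for SHA3 succeeded */
} example_cpu_info;

/* Choose an implementation following the policy in the PR description. */
static inline example_keccak_impl example_select_keccak_impl(
    const example_cpu_info *cpu) {
  if (!cpu->asm_enabled) {
    return EXAMPLE_KECCAK_C;
  }
  if (cpu->is_neoverse_n1_v1_v2) {
    /* Free barrel shifting: lazy rotations pay off. */
    return EXAMPLE_KECCAK_S2N_SCALAR_LAZY;
  }
  if (cpu->is_apple_arm && cpu->assembler_supports_sha3) {
    return EXAMPLE_KECCAK_S2N_NEON_SHA3;
  }
  /* Everything else, e.g. Cortex-A72. */
  return EXAMPLE_KECCAK_OPENSSL_SCALAR;
}
```

Note that Neoverse V1 and V2 take the scalar path even though they support SHA3 instructions, matching the rationale given in the description.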
Performance impact
Apple M1
Neoverse-V2
Neoverse-N1
Neoverse-V1
Cortex-A72
Signed-off-by: Hanno Becker [email protected]