Commit 19a652e

Dispatch between C, s2n-bignum and OpenSSL Keccak implementations
Previously, a static dispatch chose between the C implementation of Keccak-F1600 and the assembly implementations (one scalar, one SIMD) provided by OpenSSL. The C<->ASM interface was Keccak1600_Absorb and Keccak1600_Squeeze.

This commit lowers the C<->ASM interface to the core Keccak permutation itself; the Absorb/Squeeze assembly wrappers in keccak1600-armv8.pl are removed accordingly.

Moreover, the commit integrates the Keccak-F1600 implementations from s2n-bignum into the build and replaces the above static dispatch with a runtime dispatch based on CPU detection / CPU capabilities:

1. If ASM is disabled, we use the C implementation.
2. If ASM is enabled:
   - For Neoverse N1, V1, V2, we use scalar Keccak assembly from s2n-bignum, leveraging lazy rotations from https://eprint.iacr.org/2022/1243.
   - For Arm-based Apple CPUs, we use Neon Keccak assembly from s2n-bignum, leveraging the AArch64 SHA3 extension.
   - Otherwise, fall back to the scalar Keccak implementation from OpenSSL, which does not use lazy rotations.

Lazy rotations improve performance by up to 10% on CPUs with free barrel shifting, which includes Neoverse N1, V1, and V2. Not all CPUs have free barrel shifting (e.g. Apple M1 or Cortex-A72), so we don't use it by default.

Neoverse V1 and V2 do support SHA3 instructions, but they are only implemented on 1/4 of the Neon units, and are thus slower than a scalar implementation.

Finally, since the Neon Keccak assembly from s2n-bignum is faster than the Neon Keccak assembly from the OpenSSL implementation, the latter is removed from keccak1600-armv8.pl, leaving only the scalar assembly implementation of the core Keccak permutation.
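The dispatch rules described in this message can be sketched in C roughly as follows. This is only an illustration of the decision order: the enum, the `cpu_info` struct, and `keccak_select_impl` are hypothetical names and not identifiers from this commit. Falling back to the OpenSSL scalar code on an Apple CPU when the SHA3 assembly was not compiled in is an assumption, based on that file being built conditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four Keccak-F1600 backends named in the
 * commit message. */
enum keccak_impl {
    KECCAK_IMPL_C,               /* portable C implementation */
    KECCAK_IMPL_S2N_SCALAR_LAZY, /* s2n-bignum scalar asm, lazy rotations */
    KECCAK_IMPL_S2N_NEON_SHA3,   /* s2n-bignum Neon asm, SHA3 extension */
    KECCAK_IMPL_OPENSSL_SCALAR   /* OpenSSL scalar asm, no lazy rotations */
};

/* Hypothetical snapshot of build options and detected CPU properties. */
struct cpu_info {
    bool asm_enabled;     /* false when assembly is disabled at build time */
    bool is_neoverse_n1;
    bool is_neoverse_v1;
    bool is_neoverse_v2;
    bool is_apple_arm;
    bool neon_sha3_built; /* assumption: SHA3 asm is only built if the
                             assembler supports the extension */
};

/* Apply the rules in the order given in the commit message. */
enum keccak_impl keccak_select_impl(const struct cpu_info *cpu) {
    assert(cpu != NULL);
    if (!cpu->asm_enabled) {
        return KECCAK_IMPL_C;
    }
    if (cpu->is_neoverse_n1 || cpu->is_neoverse_v1 || cpu->is_neoverse_v2) {
        /* Free barrel shifting makes the lazy-rotation scalar code fastest. */
        return KECCAK_IMPL_S2N_SCALAR_LAZY;
    }
    if (cpu->is_apple_arm && cpu->neon_sha3_built) {
        return KECCAK_IMPL_S2N_NEON_SHA3;
    }
    /* Default for all other CPUs, e.g. Cortex-A72. */
    return KECCAK_IMPL_OPENSSL_SCALAR;
}
```

Note that Neoverse V1 and V2 are checked before any SHA3 capability, which mirrors the observation that their SHA3 instructions are slower than the scalar path.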
Performance impact
------------------

* Apple M1

| Algorithm | Size | Main | New | Gain | % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224 | 16b | 71.5 MB/s | 88.3 MB/s | +16.8 MB/s | +23.5% |
| | 256b | 584.5 MB/s | 754.7 MB/s | +170.2 MB/s | +29.1% |
| | 1350b | 633.8 MB/s | 815.2 MB/s | +181.4 MB/s | +28.6% |
| | 8kb | 694.4 MB/s | 872.4 MB/s | +178.0 MB/s | +25.6% |
| | 16kb | 696.9 MB/s | 864.8 MB/s | +167.9 MB/s | +24.1% |
| SHA3-256 | 16b | 71.6 MB/s | 88.6 MB/s | +17.0 MB/s | +23.7% |
| | 256b | 600.8 MB/s | 759.0 MB/s | +158.2 MB/s | +26.3% |
| | 1350b | 638.8 MB/s | 817.5 MB/s | +178.7 MB/s | +28.0% |
| | 8kb | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s | +25.8% |
| | 16kb | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s | +25.0% |
| SHA3-384 | 16b | 71.9 MB/s | 86.8 MB/s | +14.9 MB/s | +20.7% |
| | 256b | 402.3 MB/s | 505.4 MB/s | +103.1 MB/s | +25.6% |
| | 1350b | 493.1 MB/s | 636.0 MB/s | +142.9 MB/s | +29.0% |
| | 8kb | 507.3 MB/s | 639.7 MB/s | +132.4 MB/s | +26.1% |
| | 16kb | 507.2 MB/s | 626.2 MB/s | +119.0 MB/s | +23.5% |
| SHA3-512 | 16b | 70.6 MB/s | 89.2 MB/s | +18.6 MB/s | +26.3% |
| | 256b | 305.7 MB/s | 390.8 MB/s | +85.1 MB/s | +27.8% |
| | 1350b | 347.2 MB/s | 436.7 MB/s | +89.5 MB/s | +25.8% |
| | 8kb | 355.0 MB/s | 446.3 MB/s | +91.3 MB/s | +25.7% |
| | 16kb | 356.1 MB/s | 445.7 MB/s | +89.6 MB/s | +25.2% |
| SHAKE-128 | 16b | 68.8 MB/s | 87.4 MB/s | +18.6 MB/s | +27.0% |
| | 256b | 572.2 MB/s | 747.5 MB/s | +175.3 MB/s | +30.6% |
| | 1350b | 780.8 MB/s | 1016.4 MB/s | +235.6 MB/s | +30.2% |
| | 8kb | 932.8 MB/s | 1215.4 MB/s | +282.6 MB/s | +30.3% |
| | 16kb | 932.4 MB/s | 1215.9 MB/s | +283.5 MB/s | +30.4% |
| SHAKE-256 | 16b | 69.0 MB/s | 87.6 MB/s | +18.6 MB/s | +27.0% |
| | 256b | 574.7 MB/s | 750.1 MB/s | +175.4 MB/s | +30.5% |
| | 1350b | 629.4 MB/s | 817.0 MB/s | +187.6 MB/s | +29.8% |
| | 8kb | 652.3 MB/s | 820.5 MB/s | +168.2 MB/s | +25.8% |
| | 16kb | 658.9 MB/s | 823.8 MB/s | +164.9 MB/s | +25.0% |

* Neoverse-V2

| Algorithm | Size | Main | New | Gain | % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224 | 16b | 53.4 MB/s | 56.9 MB/s | +3.5 MB/s | +6.7% |
| | 256b | 449.9 MB/s | 487.0 MB/s | +37.1 MB/s | +8.2% |
| | 1350b | 500.0 MB/s | 541.5 MB/s | +41.5 MB/s | +8.3% |
| | 8kb | 537.9 MB/s | 585.3 MB/s | +47.4 MB/s | +8.8% |
| | 16kb | 530.7 MB/s | 577.5 MB/s | +46.8 MB/s | +8.8% |
| SHA3-256 | 16b | 53.5 MB/s | 57.2 MB/s | +3.7 MB/s | +7.0% |
| | 256b | 451.6 MB/s | 488.1 MB/s | +36.5 MB/s | +8.1% |
| | 1350b | 500.1 MB/s | 542.0 MB/s | +41.9 MB/s | +8.4% |
| | 8kb | 503.0 MB/s | 546.9 MB/s | +43.9 MB/s | +8.7% |
| | 16kb | 500.2 MB/s | 544.9 MB/s | +44.7 MB/s | +8.9% |
| SHA3-384 | 16b | 53.8 MB/s | 57.7 MB/s | +3.9 MB/s | +7.2% |
| | 256b | 306.9 MB/s | 333.3 MB/s | +26.4 MB/s | +8.6% |
| | 1350b | 386.6 MB/s | 420.5 MB/s | +33.9 MB/s | +8.8% |
| | 8kb | 389.9 MB/s | 424.5 MB/s | +34.6 MB/s | +8.9% |
| | 16kb | 384.9 MB/s | 420.1 MB/s | +35.2 MB/s | +9.1% |
| SHA3-512 | 16b | 53.4 MB/s | 57.8 MB/s | +4.4 MB/s | +8.3% |
| | 256b | 233.5 MB/s | 254.0 MB/s | +20.5 MB/s | +8.8% |
| | 1350b | 266.7 MB/s | 290.2 MB/s | +23.5 MB/s | +8.8% |
| | 8kb | 271.9 MB/s | 295.8 MB/s | +23.9 MB/s | +8.8% |
| | 16kb | 268.7 MB/s | 292.7 MB/s | +24.0 MB/s | +8.9% |
| SHAKE-128 | 16b | 49.6 MB/s | 53.1 MB/s | +3.5 MB/s | +7.0% |
| | 256b | 432.9 MB/s | 468.0 MB/s | +35.1 MB/s | +8.1% |
| | 1350b | 547.5 MB/s | 592.5 MB/s | +45.0 MB/s | +8.2% |
| | 8kb | 621.6 MB/s | 676.1 MB/s | +54.5 MB/s | +8.8% |
| | 16kb | 613.4 MB/s | 667.7 MB/s | +54.3 MB/s | +8.9% |
| SHAKE-256 | 16b | 49.7 MB/s | 53.2 MB/s | +3.5 MB/s | +7.2% |
| | 256b | 432.9 MB/s | 469.1 MB/s | +36.2 MB/s | +8.4% |
| | 1350b | 494.6 MB/s | 537.9 MB/s | +43.3 MB/s | +8.8% |
| | 8kb | 502.3 MB/s | 546.6 MB/s | +44.3 MB/s | +8.8% |
| | 16kb | 499.6 MB/s | 545.2 MB/s | +45.6 MB/s | +9.1% |

* Neoverse-N1

| Algorithm | Size | Main | New | Gain | % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224 | 16b | 32.7 MB/s | 36.5 MB/s | +3.8 MB/s | +11.7% |
| | 256b | 277.2 MB/s | 311.2 MB/s | +34.0 MB/s | +12.3% |
| | 1350b | 309.5 MB/s | 347.2 MB/s | +37.7 MB/s | +12.2% |
| | 8kb | 334.4 MB/s | 375.8 MB/s | +41.4 MB/s | +12.4% |
| | 16kb | 331.1 MB/s | 372.5 MB/s | +41.4 MB/s | +12.5% |
| SHA3-256 | 16b | 33.0 MB/s | 36.8 MB/s | +3.8 MB/s | +11.5% |
| | 256b | 279.4 MB/s | 312.5 MB/s | +33.1 MB/s | +11.9% |
| | 1350b | 310.0 MB/s | 348.3 MB/s | +38.3 MB/s | +12.4% |
| | 8kb | 312.8 MB/s | 352.1 MB/s | +39.3 MB/s | +12.6% |
| | 16kb | 312.4 MB/s | 353.0 MB/s | +40.6 MB/s | +13.0% |
| SHA3-384 | 16b | 33.1 MB/s | 36.9 MB/s | +3.8 MB/s | +11.5% |
| | 256b | 190.7 MB/s | 214.1 MB/s | +23.4 MB/s | +12.3% |
| | 1350b | 240.2 MB/s | 269.9 MB/s | +29.7 MB/s | +12.4% |
| | 8kb | 242.7 MB/s | 273.2 MB/s | +30.5 MB/s | +12.6% |
| | 16kb | 240.4 MB/s | 271.7 MB/s | +31.3 MB/s | +13.0% |
| SHA3-512 | 16b | 33.1 MB/s | 36.9 MB/s | +3.8 MB/s | +11.6% |
| | 256b | 145.1 MB/s | 162.8 MB/s | +17.7 MB/s | +12.2% |
| | 1350b | 165.7 MB/s | 186.2 MB/s | +20.5 MB/s | +12.3% |
| | 8kb | 169.1 MB/s | 190.0 MB/s | +20.9 MB/s | +12.4% |
| | 16kb | 167.5 MB/s | 189.2 MB/s | +21.7 MB/s | +13.0% |
| SHAKE-128 | 16b | 30.3 MB/s | 33.6 MB/s | +3.3 MB/s | +10.9% |
| | 256b | 263.7 MB/s | 293.2 MB/s | +29.5 MB/s | +11.2% |
| | 1350b | 338.2 MB/s | 379.4 MB/s | +41.2 MB/s | +12.2% |
| | 8kb | 387.2 MB/s | 435.4 MB/s | +48.2 MB/s | +12.5% |
| | 16kb | 383.6 MB/s | 432.9 MB/s | +49.3 MB/s | +12.9% |
| SHAKE-256 | 16b | 30.5 MB/s | 33.8 MB/s | +3.3 MB/s | +10.9% |
| | 256b | 264.9 MB/s | 294.5 MB/s | +29.6 MB/s | +11.2% |
| | 1350b | 306.5 MB/s | 344.1 MB/s | +37.6 MB/s | +12.3% |
| | 8kb | 312.0 MB/s | 351.5 MB/s | +39.5 MB/s | +12.7% |
| | 16kb | 312.1 MB/s | 352.7 MB/s | +40.6 MB/s | +13.0% |

* Neoverse-V1

| Algorithm | Size | Main | New | Gain | % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224 | 16b | 45.6 MB/s | 49.3 MB/s | +3.7 MB/s | +8.2% |
| | 256b | 382.6 MB/s | 419.4 MB/s | +36.8 MB/s | +9.6% |
| | 1350b | 422.5 MB/s | 464.0 MB/s | +41.5 MB/s | +9.8% |
| | 8kb | 454.9 MB/s | 500.5 MB/s | +45.6 MB/s | +10.0% |
| | 16kb | 449.2 MB/s | 495.5 MB/s | +46.3 MB/s | +10.3% |
| SHA3-256 | 16b | 45.7 MB/s | 49.5 MB/s | +3.8 MB/s | +8.5% |
| | 256b | 383.2 MB/s | 420.8 MB/s | +37.6 MB/s | +9.8% |
| | 1350b | 422.8 MB/s | 464.4 MB/s | +41.6 MB/s | +9.8% |
| | 8kb | 425.8 MB/s | 467.7 MB/s | +41.9 MB/s | +9.8% |
| | 16kb | 424.0 MB/s | 468.1 MB/s | +44.1 MB/s | +10.4% |
| SHA3-384 | 16b | 45.7 MB/s | 49.7 MB/s | +4.0 MB/s | +8.7% |
| | 256b | 261.3 MB/s | 284.5 MB/s | +23.2 MB/s | +8.9% |
| | 1350b | 327.8 MB/s | 359.6 MB/s | +31.8 MB/s | +9.7% |
| | 8kb | 330.5 MB/s | 362.7 MB/s | +32.2 MB/s | +9.8% |
| | 16kb | 326.3 MB/s | 360.6 MB/s | +34.3 MB/s | +10.5% |
| SHA3-512 | 16b | 45.7 MB/s | 49.5 MB/s | +3.8 MB/s | +8.3% |
| | 256b | 198.4 MB/s | 216.7 MB/s | +18.3 MB/s | +9.2% |
| | 1350b | 226.1 MB/s | 247.5 MB/s | +21.4 MB/s | +9.5% |
| | 8kb | 230.2 MB/s | 252.0 MB/s | +21.8 MB/s | +9.4% |
| | 16kb | 227.7 MB/s | 250.3 MB/s | +22.6 MB/s | +9.9% |
| SHAKE-128 | 16b | 42.1 MB/s | 45.8 MB/s | +3.7 MB/s | +8.9% |
| | 256b | 366.4 MB/s | 402.3 MB/s | +35.9 MB/s | +9.8% |
| | 1350b | 463.5 MB/s | 508.8 MB/s | +45.3 MB/s | +9.8% |
| | 8kb | 525.7 MB/s | 580.0 MB/s | +54.3 MB/s | +10.3% |
| | 16kb | 519.4 MB/s | 574.4 MB/s | +55.0 MB/s | +10.6% |
| SHAKE-256 | 16b | 42.3 MB/s | 46.0 MB/s | +3.7 MB/s | +8.8% |
| | 256b | 367.6 MB/s | 404.2 MB/s | +36.6 MB/s | +9.9% |
| | 1350b | 418.8 MB/s | 459.9 MB/s | +41.1 MB/s | +9.8% |
| | 8kb | 425.1 MB/s | 466.9 MB/s | +41.8 MB/s | +9.8% |
| | 16kb | 423.7 MB/s | 467.4 MB/s | +43.7 MB/s | +10.3% |

* Cortex-A72

| Algorithm | Size | Main | New | Gain | % |
|:----------|:------|------------|------------|------------|-------:|
| SHA3-224 | 16b | 19.9 MB/s | 19.6 MB/s | -0.3 MB/s | -1.2% |
| | 256b | 169.9 MB/s | 168.1 MB/s | -1.8 MB/s | -1.0% |
| | 1350b | 195.7 MB/s | 189.3 MB/s | -6.4 MB/s | -3.3% |
| | 8kb | 211.7 MB/s | 204.8 MB/s | -6.9 MB/s | -3.2% |
| | 16kb | 212.2 MB/s | 205.3 MB/s | -6.9 MB/s | -3.2% |
| SHA3-256 | 16b | 19.6 MB/s | 19.7 MB/s | +0.1 MB/s | +0.6% |
| | 256b | 168.9 MB/s | 168.7 MB/s | -0.2 MB/s | -0.1% |
| | 1350b | 195.2 MB/s | 189.0 MB/s | -6.2 MB/s | -3.2% |
| | 8kb | 198.6 MB/s | 191.8 MB/s | -6.8 MB/s | -3.4% |
| | 16kb | 200.7 MB/s | 193.8 MB/s | -6.9 MB/s | -3.4% |
| SHA3-384 | 16b | 20.0 MB/s | 19.8 MB/s | -0.2 MB/s | -0.9% |
| | 256b | 118.3 MB/s | 115.6 MB/s | -2.7 MB/s | -2.3% |
| | 1350b | 151.6 MB/s | 146.8 MB/s | -4.8 MB/s | -3.2% |
| | 8kb | 154.2 MB/s | 148.9 MB/s | -5.3 MB/s | -3.4% |
| | 16kb | 154.5 MB/s | 149.1 MB/s | -5.4 MB/s | -3.5% |
| SHA3-512 | 16b | 20.0 MB/s | 19.7 MB/s | -0.3 MB/s | -1.5% |
| | 256b | 90.2 MB/s | 87.8 MB/s | -2.4 MB/s | -2.6% |
| | 1350b | 104.9 MB/s | 100.6 MB/s | -4.3 MB/s | -4.1% |
| | 8kb | 107.4 MB/s | 102.7 MB/s | -4.7 MB/s | -4.3% |
| | 16kb | 107.5 MB/s | 102.9 MB/s | -4.6 MB/s | -4.3% |
| SHAKE-128 | 16b | 16.8 MB/s | 17.7 MB/s | +0.9 MB/s | +5.0% |
| | 256b | 157.2 MB/s | 159.2 MB/s | +2.0 MB/s | +1.3% |
| | 1350b | 211.4 MB/s | 206.0 MB/s | -5.4 MB/s | -2.6% |
| | 8kb | 245.1 MB/s | 236.1 MB/s | -9.0 MB/s | -3.7% |
| | 16kb | 245.9 MB/s | 237.6 MB/s | -8.3 MB/s | -3.4% |
| SHAKE-256 | 16b | 17.6 MB/s | 17.8 MB/s | +0.2 MB/s | +1.3% |
| | 256b | 158.9 MB/s | 158.1 MB/s | -0.8 MB/s | -0.5% |
| | 1350b | 192.5 MB/s | 186.9 MB/s | -5.6 MB/s | -3.0% |
| | 8kb | 198.0 MB/s | 191.1 MB/s | -6.9 MB/s | -3.5% |
| | 16kb | 200.4 MB/s | 193.2 MB/s | -7.2 MB/s | -3.6% |

Signed-off-by: Hanno Becker <[email protected]>
1 parent 9d7d55f commit 19a652e

7 files changed: +156 -2656 lines changed

crypto/fipsmodule/CMakeLists.txt

Lines changed: 31 additions & 0 deletions
@@ -291,6 +291,37 @@ if((((ARCH STREQUAL "x86_64") AND NOT MY_ASSEMBLER_IS_TOO_OLD_FOR_512AVX) OR
       ${S2N_BIGNUM_DIR}/generic/bignum_copy_row_from_table_16.S
       ${S2N_BIGNUM_DIR}/generic/bignum_copy_row_from_table_32.S
     )
+
+    #
+    # Keccak assembly from s2n-bignum/mlkem-native
+    #
+
+    # Check if assembler supports SHA3 extension
+    include(CheckCSourceCompiles)
+    set(CMAKE_REQUIRED_FLAGS_BACKUP "${CMAKE_REQUIRED_FLAGS}")
+    set(CMAKE_REQUIRED_FLAGS "-march=armv8.4-a+sha3")
+    check_c_source_compiles("
+      int main(void) {
+        __asm__(\"eor3 v0.16b, v1.16b, v2.16b, v3.16b\");
+        __asm__(\"bcax v0.16b, v1.16b, v2.16b, v3.16b\");
+        __asm__(\"rax1 v0.2d, v1.2d, v2.2d \");
+        __asm__(\"xar v0.2d, v1.2d, v2.2d, #0x2a \");
+        return 0;
+      }
+    " MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
+    set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS_BACKUP}")
+
+    # Scalar Keccak-x1 assembly from s2n-bignum/mlkem-native
+    list(APPEND BCM_ASM_SOURCES
+      ${S2N_BIGNUM_DIR}/sha3/sha3_keccak_f1600.S
+    )
+
+    # SIMD Keccak-x1 assembly from s2n-bignum/mlkem-native, using SHA3 extension
+    if(MY_ASSEMBLER_SUPPORTS_NEON_SHA3_EXTENSION)
+      list(APPEND BCM_ASM_SOURCES ${S2N_BIGNUM_DIR}/sha3/sha3_keccak_f1600_alt.S)
+      set_source_files_properties(${S2N_BIGNUM_DIR}/sha3/sha3_keccak_f1600_alt.S
+        PROPERTIES COMPILE_FLAGS "-march=armv8.4-a+sha3")
+    endif()
   endif()
 
 endif()
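The CMake probe above only checks that the assembler accepts four SHA3-extension instructions: `eor3`, `bcax`, `rax1` and `xar`. As a rough guide to why these matter for Keccak, the sketch below models what each one computes on a single 64-bit lane in portable C. The `model_*` function names are hypothetical and used only for illustration. The real instructions operate on 128-bit Neon registers, and this is not code from s2n-bignum.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit lane left by n bits; n must be in [0, 63]. */
uint64_t model_rol64(uint64_t x, unsigned n) {
    assert(n < 64);
    return n == 0 ? x : (x << n) | (x >> (64 - n));
}

/* Rotate a 64-bit lane right by n bits; n must be in [0, 63]. */
uint64_t model_ror64(uint64_t x, unsigned n) {
    assert(n < 64);
    return n == 0 ? x : (x >> n) | (x << (64 - n));
}

/* EOR3: three-way XOR, useful for the column parities in the theta step. */
uint64_t model_eor3(uint64_t a, uint64_t b, uint64_t c) {
    return a ^ b ^ c;
}

/* BCAX: bit clear and XOR, a ^ (b & ~c). The chi step
 * A[x] ^ (~A[x+1] & A[x+2]) corresponds to model_bcax(A[x], A[x+2], A[x+1]). */
uint64_t model_bcax(uint64_t a, uint64_t b, uint64_t c) {
    return a ^ (b & ~c);
}

/* RAX1: rotate the second operand left by one and XOR, as in the theta
 * step's D[x] = C[x-1] ^ rol(C[x+1], 1). */
uint64_t model_rax1(uint64_t a, uint64_t b) {
    return a ^ model_rol64(b, 1);
}

/* XAR: XOR, then rotate right by an immediate, which fuses the theta
 * update with the rho rotation. */
uint64_t model_xar(uint64_t a, uint64_t b, unsigned imm) {
    return model_ror64(a ^ b, imm);
}
```

Each helper replaces two or three scalar operations with one instruction, which is where the Neon SHA3 path gets its advantage on CPUs that implement these instructions on all Neon units.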
