
Conversation

yaoyaoding (Member)

This PR adds an example of a fused kernel for the decoding stage of flash linear attention:

Sigmoid Gating Delta Rule Update Benchmark Results:

| name   | (B, T, H, K)    | (HV, V)  | latency (ms) |
|--------|-----------------|----------|--------------|
| torch  | (1, 1, 4, 128)  | (8, 128) | 1.164        |
| triton | (1, 1, 4, 128)  | (8, 128) | 0.010        |
| tilus  | (1, 1, 4, 128)  | (8, 128) | 0.006        |
| torch  | (1, 2, 4, 128)  | (8, 128) | 2.270        |
| triton | (1, 2, 4, 128)  | (8, 128) | 0.011        |
| tilus  | (1, 2, 4, 128)  | (8, 128) | 0.007        |
| torch  | (1, 4, 4, 128)  | (8, 128) | 4.475        |
| triton | (1, 4, 4, 128)  | (8, 128) | 0.013        |
| tilus  | (1, 4, 4, 128)  | (8, 128) | 0.009        |
| torch  | (1, 8, 4, 128)  | (8, 128) | 8.848        |
| triton | (1, 8, 4, 128)  | (8, 128) | 0.018        |
| tilus  | (1, 8, 4, 128)  | (8, 128) | 0.013        |
| torch  | (1, 16, 4, 128) | (8, 128) | 17.589       |
| triton | (1, 16, 4, 128) | (8, 128) | 0.029        |
| tilus  | (1, 16, 4, 128) | (8, 128) | 0.022        |
| torch  | (1, 32, 4, 128) | (8, 128) | 35.413       |
| triton | (1, 32, 4, 128) | (8, 128) | 0.051        |
| tilus  | (1, 32, 4, 128) | (8, 128) | 0.044        |

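For readers who want a rough sense of what the benchmarked kernel computes, below is a minimal, un-optimized PyTorch sketch of a gated delta-rule recurrence for the decoding path. Everything here is an assumption for illustration only: the function name `gated_delta_rule_decode`, the tensor layout, and the exact gating parameterization (a log-space decay `g` applied via `exp`, plus a per-step write strength `beta`) are not taken from the PR; the authoritative version is the fused example added by this PR.

```python
import torch


def gated_delta_rule_decode(q, k, v, beta, g, state=None):
    """Naive reference for a gated delta-rule recurrence (decoding path).

    Assumed shapes, mirroring the benchmark's (B, T, H, K) / (HV, V) layout:
        q, k:  [B, T, H, K]    query / key heads
        v:     [B, T, HV, V]   value heads (HV is a multiple of H)
        beta:  [B, T, HV]      per-step write strength in (0, 1)
        g:     [B, T, HV]      per-step log-space decay gate (<= 0)
        state: [B, HV, K, V]   recurrent memory, created lazily if None
    Returns (o, state) with o: [B, T, HV, V].
    """
    B, T, H, K = q.shape
    HV, V = v.shape[2], v.shape[3]
    reps = HV // H  # each q/k head is shared by several value heads
    if state is None:
        state = q.new_zeros(B, HV, K, V)
    o = q.new_empty(B, T, HV, V)
    for t in range(T):
        q_t = q[:, t].repeat_interleave(reps, dim=1)      # [B, HV, K]
        k_t = k[:, t].repeat_interleave(reps, dim=1)      # [B, HV, K]
        v_t = v[:, t]                                      # [B, HV, V]
        # decay the memory, then write the gated residual (delta rule)
        state = state * g[:, t].exp()[..., None, None]
        pred = torch.einsum('bhk,bhkv->bhv', k_t, state)   # memory's prediction for k_t
        delta = (v_t - pred) * beta[:, t, :, None]
        state = state + torch.einsum('bhk,bhv->bhkv', k_t, delta)
        o[:, t] = torch.einsum('bhk,bhkv->bhv', q_t, state)
    return o, state


if __name__ == "__main__":
    # Shapes from the first benchmark row: (B, T, H, K) = (1, 1, 4, 128), (HV, V) = (8, 128)
    B, T, H, K, HV, V = 1, 1, 4, 128, 8, 128
    q, k = torch.randn(B, T, H, K), torch.randn(B, T, H, K)
    v = torch.randn(B, T, HV, V)
    beta = torch.rand(B, T, HV)
    g = torch.nn.functional.logsigmoid(torch.randn(B, T, HV))
    o, state = gated_delta_rule_decode(q, k, v, beta, g)
    print(o.shape, state.shape)
```

The torch baseline in the table corresponds to this kind of step-by-step recurrence, which is why a fused kernel (triton or tilus) that keeps the state on-chip is orders of magnitude faster at these decode shapes.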
Signed-off-by: Yaoyao Ding <[email protected]>
@yaoyaoding changed the title from "[Example] Add the fused kernel for decoding stage of flash linear attention" to "[Example] Add the fused kernel for decoding of flash linear attention" on Sep 1, 2025
@yaoyaoding merged commit 28dd71f into main on Sep 1, 2025 (8 of 9 checks passed)
@yaoyaoding deleted the yaoyao/fla branch on September 3, 2025 at 02:34