[CombToSynth] Use parallel-prefix tree for unsigned comparisons #9048
This commit extends the unsigned comparison lowering to support multiple parallel-prefix architectures (Sklansky, Kogge-Stone, Brent-Kung) in addition to the existing ripple-carry implementation.

Previously, all comparisons used a ripple-carry style implementation that processed bits sequentially from LSB to MSB, resulting in O(n) depth for n-bit comparisons. This was a significant performance bottleneck for wide comparisons. The comparison lowering is now refactored to use the same parallel-prefix tree algorithms as the adder, reducing depth to O(log n) and matching the delay characteristics of parallel-prefix adders.

Small comparisons (less than 8 bits) continue to use ripple-carry by default since the overhead of parallel-prefix structures is not worthwhile, while larger comparisons use parallel-prefix trees for better performance. The architecture can be explicitly specified via the synth.test.arch attribute.

The comparison logic is formulated as a prefix computation where equal bits are computed as ~(a_i ^ b_i) and greater bits as ~a_i & b_i, with propagate and generate signals based on equality and greater-than conditions. Signed comparisons extract the sign bit and compare magnitudes separately using the unsigned comparison infrastructure.

Integration tests for all architectures verify logical equivalence via circt-lec.
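As a rough illustration of the formulation described above, the following Python sketch models an unsigned `a < b` comparison both as a ripple chain and as a Kogge-Stone prefix network, and counts the combine levels each one needs. It is a software model only, not the CombToSynth lowering itself: the function names (`combine`, `ult_ripple`, `ult_kogge_stone`) are hypothetical, and the mapping of equality to propagate and of `~a_i & b_i` to generate is an assumption read off the description.

```python
def bit_pairs(a, b, n):
    """Per-bit (generate, propagate) pairs, LSB first.

    generate  = ~a_i & b_i    (b wins at this bit, so a < b here)
    propagate = ~(a_i ^ b_i)  (bits are equal, defer to lower bits)
    """
    pairs = []
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        pairs.append(((1 - a_i) & b_i, 1 - (a_i ^ b_i)))
    return pairs


def combine(hi, lo):
    """Associative prefix operator; `hi` covers the more significant bits."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def ult_ripple(a, b, n):
    """Sequential LSB-to-MSB chain: one combine level per bit."""
    acc = (0, 1)  # identity element: "not less, equal so far"
    depth = 0
    for pair in bit_pairs(a, b, n):
        acc = combine(pair, acc)
        depth += 1
    return bool(acc[0]), depth


def ult_kogge_stone(a, b, n):
    """Kogge-Stone network: the stride doubles at every level."""
    nodes = bit_pairs(a, b, n)
    depth = 0
    stride = 1
    while stride < n:
        nodes = [
            combine(nodes[i], nodes[i - stride]) if i >= stride else nodes[i]
            for i in range(n)
        ]
        stride *= 2
        depth += 1
    # The most significant position now covers bits n-1 down to 0.
    return bool(nodes[n - 1][0]), depth


if __name__ == "__main__":
    print(ult_ripple(5, 9, 64))       # (True, 64)
    print(ult_kogge_stone(5, 9, 64))  # (True, 6)
```

Both functions return the same comparison result; only the number of levels differs (64 versus 6 for a 64-bit comparison in this model). The Kogge-Stone loop also computes prefixes for every bit position even though only the last one is read.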
A FileCheck test is missing from the non-integration tests; I'll add one for the parallel-prefix tree.
cowardsa
left a comment
Very neat - nice work and pleasing to see the longest path improvements - ideal how we can reuse the prefix computation.
To reduce overhead, could we pass a flag that stops the prefix computation from generating a lot of gates that then need to be removed by DCE? Namely, we are only interested in generating the carry-out and propagate-out. (However, the code already generates unused gates, so it's already far from optimal in terms of efficiency.)
Good points, but I'm not sure there is an uncomplicated way to prune the prefix computation, since it's hard to know beforehand which indices are actually used by the carry-out and propagate-out in the last stage. I think we could change the prefix computation for comparison to use a recursive function with memoization, which by nature lazily computes only the necessary prefixes, but that requires quite a few changes to the prefix tree generation functions we have now. So please let me stick with the current implementation for this PR.
Absolutely non-blocking for sure - just a thought for future improvements if we hit against performance issues.
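For reference, the lazy, memoized alternative discussed above might look roughly like the Python sketch below. It is only a model of the idea, not a proposal for the actual prefix tree generation code: `ult_lazy` and `span` are hypothetical names, and the balanced split is just one possible choice. Only the prefixes that feed the final result are ever built, so there is nothing left over for DCE to remove.

```python
from functools import lru_cache


def ult_lazy(a, b, n):
    """Unsigned a < b, building only the prefixes the final output needs.

    Returns (result, number_of_combine_nodes_created).
    """
    combines = 0

    @lru_cache(maxsize=None)
    def span(hi, lo):
        """(generate, propagate) for the bit range [lo, hi], computed on demand."""
        nonlocal combines
        if hi == lo:
            a_i = (a >> hi) & 1
            b_i = (b >> hi) & 1
            return ((1 - a_i) & b_i, 1 - (a_i ^ b_i))
        mid = (hi + lo + 1) // 2           # split into [mid, hi] and [lo, mid - 1]
        g_hi, p_hi = span(hi, mid)
        g_lo, p_lo = span(mid - 1, lo)
        combines += 1                      # counted once per distinct range
        return (g_hi | (p_hi & g_lo), p_hi & p_lo)

    less, _equal = span(n - 1, 0)
    return bool(less), combines


if __name__ == "__main__":
    print(ult_lazy(5, 9, 64))  # (True, 63)
```

With a single requested output the recursion is a plain balanced tree with n - 1 combine nodes; the memoization would only start to pay off if several overlapping ranges were requested from the same cache.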
This commit extends the unsigned comparison lowering to support multiple parallel-prefix architectures (Sklansky, Kogge-Stone, Brent-Kung) in addition to the existing ripple-carry implementation. No functional change in prefix-tree/adder lowering.
Before/after for a 64-bit unsigned comparison.