[AIGLongestPathAnalysis] Implement incremental longest path analysis for IR transformations #8890
Conversation
Force-pushed from e01ee81 to fd778a3.
Force-pushed from fd778a3 to 0f473d2.
cowardsa left a comment:
Thanks for this contribution Hideto - it looks really valuable for improving synthesis results! I also like the interface but wanted to clarify usage a little:
- Is it the responsibility of the pass author to call notifyOperation... whenever you make changes to the IR? Or is this automatically hooked up with existing MLIR infra?
- Does knowing precisely where changes have been made (due to IR transformations) mean we can only look at local changes to the delay paths?
P.S. If you hook up the comb delay model interface then others (myself included) can add additional or more precise models as needed - similar to the KnownBits Analysis.
Yes, it's the user's responsibility to notify the analysis of mutations, but MLIR's DialectConversion/PartialConversion/PatternRewriter automatically sends notifications to the listener as long as the listener is registered.
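For reference, a minimal sketch of that registration (the function name and surrounding setup are illustrative, not this PR's code; the setter-chaining GreedyRewriteConfig API follows the snippet quoted later in this review, and the driver entry point is spelled applyPatternsAndFoldGreedily in older MLIR versions):

```cpp
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

// Minimal sketch: register the incremental analysis as the rewrite listener so
// the greedy driver forwards notifyOperation* callbacks to it automatically.
LogicalResult
runWithIncrementalAnalysis(hw::HWModuleOp module, RewritePatternSet &&patterns,
                           IncrementalLongestPathAnalysis *analysis) {
  GreedyRewriteConfig config;
  // The analysis implements PatternRewriter::Listener, so registering it here
  // is all that is needed for it to observe IR mutations made by the driver.
  config.setListener(analysis).setUseTopDownTraversal(true);
  return applyPatternsGreedily(module, std::move(patterns), config);
}
```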
Yes, that's a current limitation: IncrementalLongestPathAnalysis currently only accepts a local scope (hw::HWModuleOp). For inter-module timing analysis I'm inclined to prefer something like:

```mlir
hw.module @Foo(in %x: i3) {
  // bit 0/1/2 arrive at time 1/5/10.
  %x_with_timing = synth.arrival_times %x [10, 5, 1] : i3
  %out = hw.instance @Bar(%x_with_timing) : (i3) -> i2
  %out_with_timing = synth.arrival_times %out [100, 20] : i2
}
```
… IR transformations

This commit introduces IncrementalLongestPathAnalysis to support lazy computation of delay information during IR mutations in synthesis passes: the analysis is updated incrementally, so delays can be computed lazily while the IR is being rewritten. A first use case is the DatapathToComb conversion pass.

Key changes include the new IncrementalLongestPathAnalysis class, which extends the existing longest path analysis with incremental update capabilities and implements the PatternRewriter::Listener interface to track IR mutations and maintain analysis validity. New getOrComputeDelay() and getOrComputePaths() methods enable on-demand delay computation, and the incremental analysis is integrated into DatapathCompressOpConversion for timing-aware compress operation lowering.

The longest path analysis is inherently complex, and use-def chains must be taken into account when rewriting the IR. A top-down rewriting approach is necessary to maintain analysis correctness, as bottom-up mutations can invalidate paths that depend on erased values. The incremental analysis validates that IR mutations don't break existing timing paths and provides early error detection when values used in critical paths are inappropriately modified.
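A hedged illustration of the on-demand query style described above, modeled on the getOrComputeMaxDelay call quoted later in this review (the helper name and surrounding code are placeholders, not the PR's implementation):

```cpp
// Illustrative only: query the incremental analysis on demand inside a
// rewrite pattern and back off if the delay cannot (yet) be computed for the
// requested value/bit, rather than assuming timing data is always available.
LogicalResult rewriteWithTiming(Operation *op, Value operand,
                                IncrementalLongestPathAnalysis *analysis,
                                PatternRewriter &rewriter) {
  FailureOr<int64_t> delay =
      analysis->getOrComputeMaxDelay(operand, /*bitPos=*/0);
  if (failed(delay))
    return rewriter.notifyMatchFailure(op, "delay not yet available");
  // *delay now holds the arrival time of bit 0 of `operand`; a real pattern
  // would use it to pick a timing-aware lowering before rewriting `op`.
  return success();
}
```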
Force-pushed from efa345e to da0e7d3.
@cowardsa LongestPathAnalysis.cpp requires more cleanup and testing, so it's not ready for review yet, but the analysis now returns a timing model (= logic depth in AIG) for most comb/hw operations, so it should be usable for datapath lowering. FYI, this is the change in AIG depth for your benchmark set with the current PR.
This is fantastic! Very pleasing to see this play out in the results as well. You need the top-down traversal so that everything the current operator depends on has already been lowered, ensuring that we can compute the timing information - is that correct? I actually like this a lot, as it also means that partial_product operators will be lowered before we reach the compressor trees that consume their outputs. Nice! I'm hoping to have a timing-driven compressor tree PR ready today or early next week - but will perhaps separate out the DatapathToComb changes to avoid conflicts :)
Exactly! It's necessary to topologically sort the IR beforehand because HWModule has a graph region, but after that, top-down lowering should ensure that each operand is lowered before the operation that uses it.
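A minimal sketch of that pre-pass, assuming MLIR's sortTopologically utility (the header location varies between MLIR versions, and the PR may well sort differently; the function name is hypothetical):

```cpp
#include "mlir/Analysis/TopologicalSortUtils.h"

// Sketch only: sort the hw.module's graph region so defs precede uses, letting
// the subsequent top-down rewrite see operand delays before each user.
void sortModuleForTopDownLowering(hw::HWModuleOp module) {
  // sortTopologically returns false if a cycle prevents a full sort; the
  // result is ignored here for brevity.
  (void)mlir::sortTopologically(module.getBodyBlock());
}
```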
Force-pushed from e4b2787 to e0d8429.
Force-pushed from d990563 to 831be2d.
…able timing-aware DatapathToComb optimization

This commit enhances the AIG longest path analysis infrastructure to support operations from the Comb and HW dialects, enabling comprehensive timing analysis across mixed-dialect circuits. The DatapathToComb pass has been updated to use timing-aware optimization strategies through the incremental longest path analysis framework.

The longest path analysis now handles comb.and, comb.or, comb.xor, and hw.constant operations alongside existing AIG operations, providing more accurate delay calculations for circuits that haven't been fully lowered to AIG. An Oracle class has been introduced to dynamically analyze unsupported operations by creating temporary modules and running synthesis pipelines to extract timing characteristics.

The DatapathToComb pass integrates IncrementalLongestPathAnalysis as a listener in its greedy rewrite driver, enabling timing-aware transformations. The compressor lowering logic now sorts addends by arrival time to minimize critical path delays during Wallace tree construction. Configuration options for timing-aware datapath optimizations have been added to the synthesis pipeline.

Test coverage has been expanded to verify longest path analysis behavior with mixed dialects and to ensure proper timing optimization during datapath-to-combinational lowering. Overall, this enables more sophisticated timing optimization strategies throughout the CIRCT synthesis pipeline by providing accurate delay information across mixed-dialect circuits.
Force-pushed from 831be2d to 0cf5b31.
cowardsa left a comment:
Incremental review - will continue but hit a weird GitHub interface bug...
```cpp
for (size_t j = 0; j < addends[0].size(); ++j) {
  SmallVector<std::pair<int64_t, Value>> delays;
  for (auto &addend : addends) {
    auto delay = analysis->getOrComputeMaxDelay(addend[j], 0);
    if (failed(delay))
      return rewriter.notifyMatchFailure(op,
                                         "Failed to get delay for input");
    delays.push_back(std::make_pair(*delay, addend[j]));
  }
  std::stable_sort(delays.begin(), delays.end(),
                   [](const std::pair<int64_t, Value> &a,
                      const std::pair<int64_t, Value> &b) {
                     return a.first < b.first;
                   });
  for (size_t i = 0; i < addends.size(); ++i)
    addends[i][j] = delays[i].second;
}
```
FYI - will pass the "delays" directly to the constructor in an upcoming PR - but current placeholder looks good
```cpp
// Set the listener to update timing information
// HACK: Setting max iterations to 2 to ensure that the patterns are one-shot,
// making sure the target datapath operations are replaced.
config.setMaxIterations(2).setListener(analysis).setUseTopDownTraversal(true);
```
Is it possible to add a check after applying the rewriter that no "illegal" operators from the Datapath dialect remain, to ensure we satisfy the conditions imposed by the conversion pass?
Added in runOnOperation 👍
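For reference, a hedged sketch of such a check inside runOnOperation (not necessarily the exact code added in the PR; the "datapath" namespace string is an assumption):

```cpp
// Sketch: after the greedy rewrite, fail the pass if any Datapath operation
// survived, since the conversion requires all of them to be replaced.
auto walkResult = getOperation().walk([&](Operation *op) {
  if (op->getDialect() && op->getDialect()->getNamespace() == "datapath") {
    op->emitError("datapath operation was not lowered");
    return WalkResult::interrupt();
  }
  return WalkResult::advance();
});
if (walkResult.wasInterrupted())
  return signalPassFailure();
```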
```cpp
//===----------------------------------------------------------------------===//
// OperationAnalyzer
//===----------------------------------------------------------------------===//

// OperationAnalyzer handles timing analysis for individual operations that
// haven't been converted to AIG yet. It creates isolated modules for each
// unique operation type/signature combination, runs the conversion pipeline
// (HW -> Comb -> AIG), and analyzes the resulting AIG representation.
//
// This is used as a fallback when the main analysis encounters operations
// that don't have direct AIG equivalents. The analyzer:
// 1. Creates a wrapper HW module for the operation
// 2. Runs the standard conversion pipeline to lower it to AIG
// 3. Analyzes the resulting AIG to compute timing paths
// 4. Caches results based on operation name and function signature
```
This is great! It really nicely captures any improvements to the lowering code itself, e.g. adding a new, faster adder architecture! This assumes uniform arrival times for all inputs - is that correct?
Yes, input arrival times are assumed to be uniform.
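To make step 2 of the analyzer flow concrete, a minimal sketch of running a pipeline over an isolated wrapper module; the passes scheduled below are placeholders (the real analyzer schedules the HW -> Comb -> AIG conversion passes) and the function name is hypothetical:

```cpp
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Sketch only: lower a throwaway wrapper module with a nested pass manager.
// Once this succeeds, the ordinary longest path analysis can be run on the
// lowered wrapper and the result cached by {operation name, function type}.
LogicalResult lowerWrapperModule(mlir::ModuleOp wrapperModule,
                                 mlir::MLIRContext *ctx) {
  mlir::PassManager pm(ctx);
  // Placeholder passes; substitute the actual HW -> Comb -> AIG pipeline.
  pm.addPass(mlir::createCSEPass());
  pm.addPass(mlir::createCanonicalizerPass());
  return pm.run(wrapperModule);
}
```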
```cpp
// Temporary module used to hold wrapper modules during analysis
// Each operation gets its own wrapper module created inside this parent
mlir::OwningOpRef<mlir::ModuleOp> moduleOp;
```
Non-blocking! Presumably the scope is a single operation, therefore no input limits are known, e.g. if I perform:
{4'd0, a} + {4'd0, b}
Does this get viewed as an 8-bit adder for the purpose of this local lowering? I think this is a perfectly reasonable over-approximation for guiding initial lowering decisions. If we envision a future iterative timing closure loop then we may need a more accurate delay model.
> Does this get viewed as an 8-bit adder for the purpose of this local lowering?

Yes, constants are not visible, so it will be viewed as an 8-bit adder. Maybe we can include the known bits as part of the analysis, e.g. cache[{opName, funcType, inputKnownBits}], which would make things more accurate.
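Purely as an illustration of that suggestion (all names here are hypothetical), the extended cache key might carry per-input known bits alongside the existing {opName, funcType} pair:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/Support/KnownBits.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/OperationSupport.h"

// Hypothetical cache-key extension: two structurally identical adds with
// different constant padding would no longer share an entry, so the
// {4'd0, a} + {4'd0, b} case could be modeled as a narrower add.
struct AnalyzerCacheKey {
  mlir::OperationName opName;
  mlir::FunctionType funcType;
  llvm::SmallVector<llvm::KnownBits, 2> inputKnownBits; // per-operand known bits
};
```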
cowardsa left a comment:
This is really great! I'll be interested to test the performance but the interface looks ideal and will track changes to the lowering code. The tests are also very nice - look forward to seeing how we might incorporate this across the synthesis pipeline :)
```cpp
auto opName = op->getName();
auto functionType = getFunctionTypeForOp(op);
auto key = std::make_pair(opName, functionType);
auto it = cache.find(key);
```
Nice!
```cpp
// Connect module inputs to cloned operation operands
// Handle type mismatches with bitcast operations
for (auto arg : hwModule.getBodyBlock()->getArguments()) {
  Value input = arg;
  auto idx = arg.getArgNumber();

  // Insert bitcast if input port type differs from operand type
  if (input.getType() != cloned->getOperand(idx).getType())
    input = builder.create<hw::BitcastOp>(
        op->getLoc(), cloned->getOperand(idx).getType(), input);

  cloned->setOperand(idx, input);
}
```
Q: Why do we get type mismatches? Is it always safe to cast between different bitwidths?
LongestPathAnalysis doesn't trace dataflow through array/struct types, so it's necessary to use integers for ports. The integers are bit-cast to the original type, and they always have the same bitwidth.
This commit introduces IncrementalLongestPathAnalysis to support lazy delay computation during IR mutations in synthesis passes, with the DatapathToComb conversion pass as the first use case.
The implementation adds the IncrementalLongestPathAnalysis class that extends the existing analysis with incremental capabilities through the PatternRewriter::Listener interface to track mutations and maintain validity. New getOrComputeDelay() and getOrComputePaths() methods enable on-demand computation, while an OperationAnalyzer class handles dynamic analysis of Comb and HW dialect operations through temporary module synthesis.
The analysis requires a top-down rewriting approach to maintain correctness, as bottom-up mutations can invalidate paths that depend on erased values. The incremental framework validates mutations to prevent breaking existing timing paths and provides early error detection for inappropriate modifications to critical path values.