@cliffburdick
Collaborator
Introduced a mutable bool prerun_done_ flag to 40 transform operators that allocate temporary tensors in their PreRun() method. This prevents duplicate memory allocations and executions when PreRun() is called multiple times on the same operator instance.

In an FFT convolution example, two more kernels than necessary were being launched inadvertently, because the assignment inside fft_impl caused a second PreRun() to be called after the initial binary operator.
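
A minimal sketch of the expression pattern that triggered the extra launches (tensor names and sizes are hypothetical; this is not the exact example from the repository):

```cpp
#include <matx.h>

int main() {
  matx::cudaExecutor exec{};

  auto a   = matx::make_tensor<cuda::std::complex<float>>({256});
  auto b   = matx::make_tensor<cuda::std::complex<float>>({256});
  auto out = matx::make_tensor<cuda::std::complex<float>>({256});

  // Evaluating this assignment walks the expression tree and calls PreRun()
  // on each transform. Previously, the assignment inside fft_impl triggered
  // a second PreRun() on the fft() operands of the binary '*', launching two
  // extra kernels; with prerun_done_, the repeated calls return early.
  (out = matx::ifft(matx::fft(a) * matx::fft(b))).run(exec);

  cudaDeviceSynchronize();
  return 0;
}
```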

The flag is checked at the start of PreRun() and triggers an early return if the method has already run. It is set to true after the temporary tensors are allocated but before Exec() is called.
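
A minimal sketch of the guard, assuming a simplified standalone operator; everything except the prerun_done_ flag itself (class name, AllocTempTensors/Exec helpers) is illustrative, not MatX's exact code:

```cpp
#include <utility>

template <typename OpA>
class TransformOpSketch {
public:
  explicit TransformOpSketch(const OpA &a) : a_(a) {}

  template <typename ShapeType, typename Executor>
  void PreRun(ShapeType &&shape, Executor &&ex) const {
    // Early return: temporaries were already allocated and the transform
    // already executed for this operator instance.
    if (prerun_done_) {
      return;
    }

    AllocTempTensors(std::forward<ShapeType>(shape)); // allocate temporaries
    prerun_done_ = true;  // set before Exec() so a nested PreRun() triggered
                          // during execution is also a no-op
    Exec(std::forward<Executor>(ex));
  }

private:
  template <typename ShapeType> void AllocTempTensors(ShapeType &&) const {}
  template <typename Executor>  void Exec(Executor &&) const {}

  OpA a_;
  // mutable because PreRun() is const on these operators.
  mutable bool prerun_done_ = false;
};
```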

Modified operators:

  • FFT operators: FFT2Op
  • Reduction ops: SumOp, ProdOp, MeanOp, VarOp, StddOp, MinOp, MaxOp, NormOp, MedianOp, PercentileOp, AllOp, AnyOp, ReduceOp, TraceOp
  • Linear algebra: MatMulOp, MatVecOp, OuterOp, TransposeMatrixOp, SolveOp, InvOp, DetOp, PinvOp, CholOp, CovOp, CGSolveOp
  • Signal processing: PWelchOp, ResamplePolyOp, ChannelizePolyOp, SoftmaxOp, NormalizeOp
  • Sparse ops: Sparse2DenseOp
  • Sorting: SortOp, ArgsortOp
  • Convolution/correlation: Conv1DOp, CorrOp, FilterOp
  • Other: HistOp, CumSumOp, AmbgFunOp

Operators not modified (don't allocate temps in PreRun): lu, unique, svd, qr, sparse2sparse, argmax, argmin, argminmax, find, find_idx, find_peaks, einsum, eig, dense2sparse

@copy-pr-bot

copy-pr-bot bot commented Oct 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cliffburdick
Collaborator Author

/build

@cliffburdick cliffburdick merged commit 711df5f into main Oct 10, 2025
1 check passed
@cliffburdick cliffburdick deleted the xform_prerun_single branch October 10, 2025 00:39