Releases: NVIDIA/MatX
v0.4.0
New Features
- slice optimization to use builtin tensor function when possible by @luitjens in #360
- Slice support for std::array shapes by @luitjens in #363 (see the slice sketch after this list)
- svd power iteration example, benchmark and unit tests. by @luitjens in #366
- matmul: support real/complex tensors by @kshitij12345 in #362
- Adding sign/index operators by @luitjens in #369
- optimized cast and conj op to return a tensor view when possible. by @luitjens in #371
- implement QR for small batched matrices. by @luitjens in #373
- Implement block power iteration (qr iterations) for svd by @luitjens in #375
- Added output iterator support for CUB sums, and converted all sum() by @cliffburdick in #380
- Removing inheritance from std::iterator by @cliffburdick in #381
- DLPack support by @cliffburdick in #392
- Adding ref-count for DLPack by @cliffburdick in #394
- updating cub optimization selection for >= 2.0 by @tylera-nvidia in #395
- Refactored make_tensor to allow lvalue init by @cliffburdick in #397
- Updated notebook documentation and refactored some code by @cliffburdick in #398
- Allow 0-stride dimensions for cublas input/output by @tbensonatl in #400
- 16-bit float reductions + updated softmax by @cliffburdick in #399
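For context on the slice-related changes above, here is a minimal sketch of taking a non-owning slice view. It is written against the current public MatX API; the exact `slice` signature, the `matxEnd` sentinel, and the behavior in v0.4.0 are assumptions and may differ.

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  auto t = make_tensor<float>({8, 8});
  (t = 1.0f).run(stream);

  // Take rows 2..5 and every column as a non-owning view (no copy);
  // matxEnd marks "up to the end" of that dimension
  auto v = slice(t, {2, 0}, {6, matxEnd});

  auto out = make_tensor<float>({4, 8});
  (out = v * 2.0f).run(stream);

  cudaStreamSynchronize(stream);
  return 0;
}
```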
Bug Fixes
- Fix Duplicate Print and remove member prints by @tylera-nvidia in #364
- cublasLT col major detection fix. by @luitjens in #368
- Fixes for 32b mode by @cliffburdick in #388
- Fixed a bogus maybe-uninitialized warning/error in release mode by @cliffburdick in #389
- Fixed issue with using const pointers by @cliffburdick in #393
- Generator Printing Patch by @tylera-nvidia in #370
New Contributors
- @kshitij12345 made their first contribution in #362
- @tbensonatl made their first contribution in #400
Full Changelog: v0.3.0...v0.4.0
v0.3.0
v0.3.0 is a major release with over 100 features and bug fixes. Releases will be cut more frequently after this one to support users who are not living at HEAD.
What's Changed
- Added squeeze operator by @cliffburdick in #163
- Change name of squeeze to flatten by @cliffburdick in #164
- Updated version of cuTENSOR and fixed paths by @cliffburdick in #166
- Added reduction example with einsum by @cliffburdick in #168
- Fixed bug with wrong type on argmin/max by @cliffburdick in #170
- Fixed missing return on operator() for sum by @cliffburdick in #171
- Fixed error with reduction with invalid indices. Only shows up on Jetson by @cliffburdick in #172
- Fixed bug with matmul use-after-free by @cliffburdick in #173
- Added test for batched GEMMs by @cliffburdick in #174
- Throw an exception if using SetVals on non-managed pointer by @cliffburdick in #176
- Added missing assert in release mode by @cliffburdick in #178
- Fixed einsum in release mode by @cliffburdick in #179
- Updates to docs by @cliffburdick in #180
- Added unit test for transpose and fixed bug with grid size by @cliffburdick in #181
- Fix grid dimensions for transpose. by @galv in #182
- Added missing include by @cliffburdick in #184
- Remove CUB from sum reduction while bug is being investigated by @cliffburdick in #186
- Fix for cub reductions by @luitjens in #187
- Reenable CUB tests by @cliffburdick in #188
- Fixing incorrect parameter to CUB sort for 2D tensors by @cliffburdick in #190
- Remove 4D restriction on Clone by @cliffburdick in #191
- Added support for N-D convolutions by @cliffburdick in #189
- Download RAPIDS.cmake only if it does not exist. by @cwharris in #192
- Fix 11.4 compilation issues by @cliffburdick in #195
- Improve FFT batching by @cliffburdick in #196
- Fixed argmax initialization value by @cliffburdick in #198
- Fix issue #199 by @pkestene in #200
- Fix type on concatenate by @cliffburdick in #201
- Fix documentation typo by @dagardner-nv in #202
- Missing host annotation on some generators by @cliffburdick in #203
- Fixed TotalSize on cub operators by @cliffburdick in #204
- Implementing remap operator. by @luitjens in #205
- Update reverse/shift APIs by @luitjens in #207
- batching conv1d across filters. by @luitjens in #208
- Added Print for operators by @cliffburdick in #211
- Complex div by @cliffburdick in #213
- Added lcollapse and rcollapse operator by @luitjens in #212
- Baseops by @luitjens in #214
- Only allow View() on contiguous tensors. by @luitjens in #215
- Remove caching on some CUB types temporarily by @cliffburdick in #216
- Fixed convolution mode SAME and added unit tests by @cliffburdick in #217
- Added convolution VALID support by @cliffburdick in #218
- Allow operators on cumsum by @cliffburdick in #219
- Using async allocation in median() by @cliffburdick in #220
- Various CUB fixes -- got rid of offset pointers (async allocation + copy), allowed operators on more types, and fixed caching on sort by @cliffburdick in #222
- Fixed memory leak on CUB cache bypass by @cliffburdick in #223
- Update to pipe type through for scalars on set operation by @tylera-nvidia in #225
- Added complex version of mean and variance by @cliffburdick in #227
- Fixed FFT batching for non-contiguous tensors by @cliffburdick in #228
- Added fmod operator by @cliffburdick in #230
- Fmod by @cliffburdick in #231
- Changing name to fmod by @cliffburdick in #232
- Cloneop by @luitjens in #233
- Making the shift parameter in shift an operator by @luitjens in #234
- Change sign of shift to match python/matlab. by @luitjens in #235
- Changing output operator type to by-value to allow temporary operators to be used as an output type. by @luitjens in #236
- Adding slice() operator. by @luitjens in #237
- Fix cuTensorNet workspace size by @leofang in #241
- adding permute operator by @luitjens in #239
- Cleaning up operators/transforms. by @luitjens in #243
- Rapids cmake no fetch by @cliffburdick in #245
- Cleanup of include directory by @luitjens in #246
- Fixed conv SAME mode by @cliffburdick in #248
- Use singleton on GIL interpreter by @cliffburdick in #249
- make owning a runtime parameter by @luitjens in #247
- Fixed bug with batched 1D convolution size by @cliffburdick in #250
- Adding 2d convolution tests by @luitjens in #251
- Properly initialize pybind object by @cliffburdick in #252
- Fixed sum() using wrong iterator type by @cliffburdick in #253
- g++11 fixes by @cliffburdick in #254
- Fixed size on conv and added benchmarks by @cliffburdick in #256
- Adding unit tests for collapse with remap by @luitjens in #255
- Collapse tests by @luitjens in #257
- adding madd function to improve convolution throughput by @luitjens in #258
- Conv opt by @luitjens in #259
- Fixed compiler errors in release mode by @cliffburdick in #261
- Add streaming make_tensor APIs. by @luitjens in #262
- adding random benchmark by @luitjens in #264
- remove deprecated APIs in make_tensor by @luitjens in #266
- Host unit tests by @luitjens in #267
- Fixed bug with FFT size shorter than length of tensor by @cliffburdick in #270
- removing unused pybind call made before pybind initialize by @tylera-nvidia in #271
- Fixed visualization tests by @cliffburdick in #275
- Fix cmake function check_python_libs. by @pkestene in #274
- Support CubSortSegmented by @tylera-nvidia in #272
- Executor cleanup. by @luitjens in #277
- Transpose operators changes by @luitjens in #278
- Remove Deprecated Shape and add metadata to Print by @tylera-nvidia in #280
- Update Documentation by @tylera-nvidia in #282
- NVTX Macros by @tylera-nvidia in #276
- Adding throw to file reading by @tylera-nvidia in #281
- Adding str() function to generators and operators by @luitjens in #283
- Added reshape op by @luitjens in #287
- 0D tensor printing was broken since they don't have a stride by @cliffburdick in #289
- Allow hermitian to take any rank by @cliffburdick in #292
- Hermitian nd by @cliffburdick in #293
- Fixed batched inverse by @cliffburdick in #294
- Added 4D matmul unit test and fixed batching bug by @cliffburdick in #297
- Fixing batched half precision complex GEMM by @cliffburdick in #298
- Rename simple_pipeline to simple_radar_pipeline for added clarity by @awthomp in #299
- Remove cuda::std::min/max by @cliffburdick in #301
- Fixed chained concatenations by @cliffburdick ...
v0.2.5
Minor fix on name collision: changed the MAX name so it does not collide with other libraries (#162).
Minor fix
Fixed an argmin initialization issue that sometimes gave wrong results.
v0.2.3
- Improved error messages
- Added support for the `einsum` function. Includes tensor contractions, GEMMs with transposed outputs, dot products, and trace (see the sketch after this list)
- Integrated cuTENSOR library
- Added real/imag/r2c operators
- Added `chirp` function
- Added file readers for .mat files
- Fixes to conv2, fft2
- Switched to CUB for certain reductions. Results in a 4x speedup in some cases
- Added `find()` and `find_idx()` functions
- Added `unique()` function
- Many CMake fixes to clean up transitive targets
- Added casting operators
- Added negate operator
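As a rough illustration of the einsum support mentioned above, the sketch below expresses a plain GEMM as a tensor contraction. The `cutensor::einsum` namespace and signature follow current MatX documentation and are an assumption for this release; cuTENSOR support must be enabled at build time.

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  // 2D operands for a plain GEMM expressed through einsum
  auto a = make_tensor<float>({8, 16});
  auto b = make_tensor<float>({16, 4});
  auto c = make_tensor<float>({8, 4});

  (a = 1.0f).run(stream);
  (b = 2.0f).run(stream);

  // "ij,jk->ik" contracts over the shared index j, i.e. a matrix multiply
  cutensor::einsum(c, "ij,jk->ik", stream, a, b);

  cudaStreamSynchronize(stream);
  return 0;
}
```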
N-D Tensors
Added support for N-D tensors for:
- Operators
- FFTs
- GEMMs
- Reductions
- Solver
- Tensor/operator accesses
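A minimal sketch of what N-D support enables, treating a rank-3 tensor as a batch of matrices. The `matmul` and `sum` free-function signatures are assumptions based on MatX documentation from this era and may differ in detail.

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  // Rank-3 tensors treated as a batch of 32 2D matrices
  auto a = make_tensor<float>({32, 64, 128});
  auto b = make_tensor<float>({32, 128, 16});
  auto c = make_tensor<float>({32, 64, 16});

  (a = 1.0f).run(stream);
  (b = 1.0f).run(stream);

  // Batched GEMM: the leading dimension acts as the batch
  matmul(c, a, b, stream);

  // Per-batch reduction: the output rank selects how many leading
  // dimensions are kept; the trailing dimensions are summed
  auto s = make_tensor<float>({32});
  sum(s, c, stream);

  cudaStreamSynchronize(stream);
  return 0;
}
```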
v0.2.1
Added unlimited concatenation of tensors
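A hedged sketch of concatenating several operators along one axis; the `concat(axis, ops...)` form follows the current MatX documentation and is assumed to apply to this release.

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  auto a = make_tensor<float>({4});
  auto b = make_tensor<float>({6});
  auto c = make_tensor<float>({3});
  auto out = make_tensor<float>({13});

  (a = 1.0f).run(stream);
  (b = 2.0f).run(stream);
  (c = 3.0f).run(stream);

  // concat(axis, ops...) lazily stitches any number of operators along an axis
  (out = concat(0, a, b, c)).run(stream);

  cudaStreamSynchronize(stream);
  return 0;
}
```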
Tensor class refactoring
This release makes major changes to the main tensor class to allow custom types for storage and descriptors. In addition, static tensor descriptors are now possible, enabling compile-time pointer arithmetic. As of this release it is no longer recommended to construct tensor_t objects directly; prefer the make_ variants of the functions instead.
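A minimal sketch of the recommended creation path, assuming the `make_tensor` form documented for this API (shape passed as a braced list, with storage and descriptor types deduced):

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  // Preferred since this release: make_tensor deduces the storage and
  // descriptor types instead of constructing tensor_t directly
  auto t = make_tensor<float>({16, 16});

  // Fill with a lazily evaluated expression
  (t = 2.0f).run(stream);

  cudaStreamSynchronize(stream);
  return 0;
}
```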
Other features of this release are:
- Refactored tensor class to use generic storage and descriptors
- Added comments on all make functions; fixed spectrogram examples
- Added concatenation operator
- Added static tensors
- Added const on all operator() where applicable
- Added more ways to create tensors
- Changed convolution example to use static tensor sizes
- Added documentation for the make_ functions
v0.1.1
- Added make_tensor helper functions
- Updated Black-Scholes example
- Moved host-specific defines into separate file
- Updated build system to better track libcuda++ and nvbench
- Improved release mode speed by turning off assertion checking
- Improved host operator creation time by storing intermediate variables
- Updated recursive filter example to error if not enough shared memory is available
v0.1.0
First public release of MatX. A brief list of supported features:
- Frontend API for cuBLAS, CUTLASS, cuFFT, cuSolver, cuRAND, and CUB
- All standard POD data types supported, as well as fp16/bf16 and complex
- Template expression trees to generate optimized device kernels (see the sketch after this list)
- Examples for both performance and accuracy
- Over 500 unit tests
- Benchmarks using nvbench
- Native CMake build system
- and more!
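To illustrate the expression-tree point above, here is a small hedged sketch: the right-hand side is composed lazily and only executes when run() is called, fusing the whole element-wise expression into one kernel. Operator and generator names follow current MatX documentation and are assumptions for v0.1.0.

```cpp
#include <matx.h>

using namespace matx;

int main() {
  cudaStream_t stream = 0;

  auto a = make_tensor<float>({1024});
  auto b = make_tensor<float>({1024});
  auto c = make_tensor<float>({1024});

  (a = 1.0f).run(stream);
  (b = 4.0f).run(stream);

  // The right-hand side builds an expression tree; nothing executes until
  // run() launches a single fused element-wise kernel on the stream
  (c = a * 2.0f + sqrt(b)).run(stream);

  cudaStreamSynchronize(stream);
  return 0;
}
```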