Releases: cupy/cupy
v9.4.0
This is the release note of v9.4.0. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
Compile with SASS (CUBIN) for CUDA versions >= 11.1 (#5097)
The NVRTC compile process now produces SASS (CUBIN files) instead of PTX so that kernels compiled with a new CUDA Toolkit version can be run with earlier CUDA Drivers. Check the CUDA Compatibility Guide and NVRTC Documentation for detailed information. We believe most users will not be affected by this change, but you can revert to the previous behavior by setting the CUPY_COMPILE_WITH_PTX=1 environment variable.
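As a minimal sketch of the fallback described above, the variable can also be set from Python instead of the shell. The helper name `request_ptx_compilation` is hypothetical (it is not a CuPy API), and setting the variable before CuPy is imported is an assumption made to stay safe regardless of when CuPy reads it.

```python
import os


def request_ptx_compilation(environ=os.environ):
    """Set CUPY_COMPILE_WITH_PTX=1 (hypothetical helper, not a CuPy API)."""
    # The variable name and value are the ones given in this release note.
    environ["CUPY_COMPILE_WITH_PTX"] = "1"
    return environ["CUPY_COMPILE_WITH_PTX"]


# Assumption: call this before `import cupy` so the setting is visible
# no matter when CuPy reads the environment.
request_ptx_compilation()
```

Setting the variable in the shell (for example in a job script) has the same effect and avoids any ordering concerns inside Python.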
Support for AMD ROCm 4.3
Support for ROCm 4.3 has been added in the latest release and binary wheels are provided as well. Note that there is currently an issue with ROCm 4.3 that prevents it from running in several environments. The current workaround is to set the LLVM_PATH variable to the llvm folder included in the ROCm 4.3 installation (e.g., export LLVM_PATH=/opt/rocm-4.3/llvm).
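The workaround can be sketched in Python as follows. `apply_rocm_llvm_workaround` is a hypothetical helper, the default root `/opt/rocm-4.3` is taken from the example path above, and calling it before importing CuPy is an assumption.

```python
import os


def apply_rocm_llvm_workaround(rocm_root="/opt/rocm-4.3", environ=os.environ):
    """Point LLVM_PATH at the llvm folder of a ROCm install (hypothetical helper).

    Returns the path that was set, or None when the folder does not exist.
    """
    llvm_dir = os.path.join(rocm_root, "llvm")
    if not os.path.isdir(llvm_dir):
        # Leave the environment untouched if ROCm lives somewhere else.
        return None
    environ["LLVM_PATH"] = llvm_dir
    return llvm_dir


# Assumption: run this before importing CuPy on a ROCm 4.3 machine.
apply_rocm_llvm_workaround()
```

The directory check keeps the helper harmless on machines where ROCm is not installed at the assumed location.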
Changes
Enhancements
- Compile with SASS for CUDA versions >= 11.1 (#5611)
- Allow to compile using PTX with an envvar (#5634)
- Add ncclAvg and ncclBfloat16 for NCCL (#5656)
- Fix version check for new ROCm version definition (#5661)
- Rest of version check fix for new ROCm version definition (#5670)
Bug Fixes
- Fix FFT convolve for shapes containing 1 (#5613)
- Fix the RTC call path for HIP (#5620)
- Fix compute capability check (#5646)
- Fix squareness checks (#5652)
- Fix unique for empty array (#5658)
Code Fixes
Documentation
- Update Sphinx to 4.1.2 (#5616)
- __array_function__ feature by default (#5653)
- Support ROCm v4.3 in document (#5674)
Tests
- Increase test timeout (#5615)
- Increase timeout for CUDA 11.4 tests (#5617)
- Add CI for ROCm 4.3 (#5632)
- Reload GPG key for ROCm 4.2 test (#5637)
- Fix cubic for_all_dtypes_combination tests (#5639)
- Add a workaround for ROCm 4.3.0 for testing (#5663)
- Fix skipTest in test_decomp_lu (#5672)
Others
- Bump version to v9.4.0 (#5680)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
v10.0.0b1
This is the release note of v10.0.0b1. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
CuPy now supports CUDA 11.4 (cupy-cuda114)
Along with the new CUDA toolkit version, support for NCCL 2.10.3 and cuDNN 8.2.2 libraries is added.
Compute capability 86 support for GPUs of the RTX 30X0 and AX000 series is also added.
Google Summer of Code
CuPy is participating in Google Summer of Code under the NumFOCUS organization.
Our student @povinsahu1909 is working hard to add support for sparse linear algebra solvers and to increase the compatibility of the new random number generation API.
Compile with SASS (CUBIN) for CUDA versions >= 11.1 (#5097)
The NVRTC compile process now produces SASS (CUBIN files) instead of PTX so that kernels compiled with a new CUDA Toolkit version can be run with earlier CUDA Drivers. Check the CUDA Compatibility Guide and NVRTC Documentation for detailed information.
Changes without compatibility
Support the new DLPack exchange protocol (#5306)
With the adoption of the new DLPack exchange protocol proposed in the Python array API standard, cupy.fromDlpack has been deprecated in favor of cupy.from_dlpack.
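A minimal migration sketch for the deprecation above. `from_dlpack_compat` is a hypothetical helper, not a CuPy API; it assumes `xp` is the imported `cupy` module and `array` is a CuPy ndarray (which exposes `toDlpack()`). Under the new protocol the array object itself is handed to `from_dlpack`, whereas the deprecated `fromDlpack` expects the capsule returned by `toDlpack()`.

```python
def from_dlpack_compat(xp, array):
    """Re-import `array` through DLPack, preferring the new protocol.

    Hypothetical helper: `xp` is the array module (assumed to be `cupy`),
    passed in explicitly so the function does not import CuPy itself.
    """
    if hasattr(xp, "from_dlpack"):
        # New protocol: pass the producer object, not a capsule.
        return xp.from_dlpack(array)
    # Older CuPy: export a capsule first, then import it.
    return xp.fromDlpack(array.toDlpack())
```

Typical usage would be `y = from_dlpack_compat(cupy, x)` for a CuPy array `x`; on releases that provide `cupy.from_dlpack`, the deprecated path is never taken.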
Known Issues
cupy-cuda102, cupy-cuda110 and cupy-cuda111 wheels are not available yet in PyPI. In the meantime, they can be downloaded from the Assets section below. See #4971 for detailed instructions.
Changes
New Features
- Texture memory 2D/3D affine transformations (#5171)
- Support the new DLPack exchange protocol (#5306)
- Add cupyx.scipy.sparse.linalg.lsmr (#5331)
- JIT: Support all atomic intrinsics (#5387)
- Expose _GUFunc through cupyx (#5408)
- Add geometric distribution to new Generator (#5443)
- Support Numba-like jit.gridsize() syntax in CuPy JIT (#5461)
- Support Numba-like jit.laneid() and jit.warpsize syntax in CuPy JIT (#5462)
- Add cupyx.scipy.sparse.linalg.cgs (#5524)
- Add hypergeometric distribution to new Generator (#5560)
Enhancements
- Compile with SASS for CUDA versions >= 11.1 (#5097)
- Support NCCL v2.9.9 (#5268)
- Support CUDA 11.4 and compute_86 (#5434)
- Update NumPy/SciPy pinning in setup.py (#5453)
- Make matrix_power support stacked matrices (#5458)
- Support hipSPARSE and fix streams not set in some generic APIs in cuSPARSE (#5472)
- Add cudaDeviceDisablePeerAccess wrapper (#5495)
- Support cuDNN v8.2.2 (#5516)
- Support NCCL v2.10.3: library installer and document (#5521)
Bug Fixes
- JIT: Fix supported dtype of atomic_add on HIP (#5383)
- Fix cupy.nanmedian's axis parameter to accept a sequence other than a tuple (#5389)
- Fix astype from boolean (#5410)
- Fix compatibility issues of ndarray.view (#5428)
- Fix types attribute of ufunc (#5448)
- Fix new DLPack protocol error messages and tests (#5449)
- texture_memory option in affine_transform not supported by HIP (#5464)
- Fix linalg.lstsq for empty matrix (#5467)
- Fix reshape (#5470)
- Fix random generator output not being raveled (#5478)
- Fix random integers (#5479)
- Fix availability tests in cuSOLVER and cuSPARSE (#5492)
- Add missing hipSPARSE include to builder (#5515)
- prune cuFFT static lib by major cc ver (#5531)
- Fix casts from bool in ufunc inputs (#5539)
- Access cudaMemoryType in the pointer attributes and fix for HIP (#5544)
- Fix casts in ufunc outputs (#5550)
- Code fix for {cu, roc}SOLVER (#5558)
- Fix CUDA API call on module initialization (#5561)
- Fix the RTC call path for HIP (#5569)
- Fix broadcast error messages (#5579)
Code Fixes
- Do not call cudnnGetVersion on import (#5326)
- JIT: Fix __call__() for built-in functions (#5361)
- Add HIP symbol redefinitions (#5362)
- Remove the data member use_32bit_indexing from CArray (#5376)
- Use dtype.name instead of dtype.char (#5444)
- Try to use -I in hipRTC (#5486)
- Hide modules from public APIs (#5522)
- consistent kernel names (#5551)
- Use the new macro __HIP_PLATFORM_AMD__ at build time (#5554)
Documentation
- Add upgrade guide for v10 (#5278)
- Update tag lines in package description and docs index (#5399)
- Fix typo in apply_along_axis (#5432)
- Fix indent of Returns section (#5433)
- Update user_guide/basic.rst device agnostic section (#5435)
- Support CUDA 11.4 on documents (#5447)
- Update install guide with new NumPy/SciPy versions (#5454)
- Use from_dlpack instead of fromDlpack (#5488)
- Use Sphinx 4.1.0 (#5489)
- Bump ReadTheDocs configuration to version 2 (#5491)
- Fix docs of eigh and eigvalsh (#5494)
- Add a lingering doc page for fromDlpack() (#5509)
- Document scipy.fft backend usage (#5514)
- Replaced the links for NumPy docs as per issue #3418 (#5548)
- Use Sphinx's envvar construct (#5570)
- Fix intersphinx for SciPy 1.7.1 docs (#5587)
Installation
Tests
- Add tests for num_to_num's optional parameters (#5337)
- Add script for ROCm CI on Jenkins (#5378)
- Skip unwrap tests for numpy<1.21 (#5384)
- Enable strict xfail in pytest (#5407)
- Remove xfail in windows jitify test (#5409)
- Fix preloading slow tests (#5440)
- Add script for CUDA 11.4 CI on FlexCI (#5457)
- Increase memory for CUDA 11.4 tests (#5477)
- Fix DLPack test for ROCm/HIP (#5485)
- Fix "Revert test decorators order" (#5498)
- Fix some tests for HIP (#5501)
- Fix FlexCI Linux tests (#5505)
- Add CUDA 11.4 for FlexCI helper script (#5528)
- Increase timeout for CUDA 11.4 tests (#5575)
- Update tests to install all requirements and add PATH (#5576)
- Add Cython to all requirements (#5577)
Others
- Notify conflict by mergify (#5371)
- Fix mergify to only comment when pull-request is open (#5439)
- Fix mergify condition (#5513)
- Add auto notify bot for hip label (#5538)
- Use pull_request_target instead for auto notify bot (#5541)
- Fix auto notify bot for issues (#5546)
- Disable Mergify's auto-merge (#5556)
- Bump version to v10.0.0b1 (#5595)
- Fix signal tests for scipy 1.7.0 (#5368)
- Fix numpy.unwrap for NumPy 1.21 (#5385)
- Fix signaltools medfilt for scipy>=1.7.0 (#5386)
- Fix deprecated numpy.typeDict utilization (#5388)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@12rambau @grlee77 @leofang @maxim-belkin @Palash-Vishnani @povinsahu1909 @the-lay
v9.3.0
This is the release note of v9.3.0. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
CuPy now supports CUDA 11.4 (cupy-cuda114)
Along with the new CUDA toolkit version, support for NCCL 2.10.3 and cuDNN 8.2.2 libraries is added.
Compute capability 86 support for GPUs of the RTX 30X0 and AX000 series is also added.
Known Issues
cupy-cuda102, cupy-cuda110 and cupy-cuda111 wheels are not available yet in PyPI. In the meantime, they can be downloaded from the Assets section below. See #4971 for detailed instructions.
Changes
Enhancements
- Support NCCL v2.9.9 (#5402)
- Update NumPy/SciPy pinning in setup.py (#5471)
- Support CUDA 11.4 and support compute_86 (#5519)
- Support cuDNN v8.2.2 (#5523)
- Make matrix_power support stacked matrices (#5525)
- Support NCCL v2.10.3: library installer and document (#5526)
Bug Fixes
- JIT: Fix supported dtype of atomic_add on HIP (#5405)
- Fix cupy.nanmedian's axis parameter to accept a sequence other than a tuple (#5416)
- Fix compatibility issues of ndarray.view (#5442)
- Fix types attribute of ufunc (#5455)
- Fix random integers (#5484)
- Fix random generator output not being raveled (#5487)
- Fix astype from boolean (#5490)
- Fix reshape (#5504)
- Fix linalg.lstsq for empty matrix (#5506)
- Add missing checks and _setStream() (#5507)
- Fix availability tests in cuSOLVER and cuSPARSE (#5534)
- prune cufft static lib by major cc ver (#5536)
- Fix casts from bool in ufunc inputs (#5549)
- Code fix for {cu, roc}SOLVER (#5566)
- Access cudaMemoryType in the pointer attributes and fix for HIP (#5571)
- Fix broadcast error messages (#5584)
- Fix casts in ufunc outputs (#5589)
- Fix broken build on CUDA 9.2 (#5598)
Code Fixes
- Remove the data member use_32bit_indexing from CArray (#5414)
- JIT: Fix __call__() for built-in functions (#5422)
- Do not call cudnnGetVersion on import (#5446)
- Add HIP symbol redefinitions (#5475)
- Try to use -I in hipRTC (#5502)
- Hide modules from public APIs (#5533)
- Use the new macro __HIP_PLATFORM_AMD__ at build time (#5565)
Documentation
- Update tag lines in package description and docs index (#5415)
- Fix typo in apply_along_axis (#5441)
- Fix indent of Returns section (#5452)
- Update user_guide/basic.rst device agnostic section (#5456)
- Update install guide with new NumPy/SciPy versions (#5465)
- Bump ReadTheDocs configuration to version 2 (#5497)
- Fix docs of eigh and eigvalsh (#5499)
- Use Sphinx 4.1.0 (#5500)
- Document scipy.fft backend usage (#5532)
- Support CUDA 11.4 on documents (#5535)
- Replaced the links for NumPy docs as per issue #3418 (#5553)
- Use Sphinx's envvar construct (#5586)
- Fix intersphinx for SciPy 1.7.1 docs (#5588)
Installation
Examples
Tests
- Skip unwrap tests for numpy<1.21 (#5412)
- Remove xfail in windows jitify test (#5418)
- Enable strict xfail in pytest (#5423)
- Add missing DLPack test for complex numbers (#5425)
- Fix unwrap tests for v9 (#5426)
- Fix preloading slow tests (#5445)
- Add script for ROCm CI on Jenkins (#5468)
- Add script for CUDA 11.4 CI on FlexCI (#5473)
- Increase memory for CUDA 11.4 tests (#5480)
- Fix "Revert test decorators order" (#5518)
- Fix FlexCI Linux tests (#5520)
- Add CUDA 11.4 for FlexCI helper script (#5543)
- Fix scipy requirement in tests (#5563)
- Fix some tests for HIP (#5578)
- Update tests to install all requirements and add PATH (#5581)
- Add Cython to all requirements (#5582)
Others
- Notify conflict by mergify (#5419)
- Fix mergify to only comment when pull-request is open (#5510)
- Fix mergify condition (#5517)
- Add auto notify bot for hip label (#5540)
- Use pull_request_target instead for auto notify bot (#5542)
- Fix auto notify bot for issues (#5547)
- Disable Mergify's auto-merge (#5562)
- Bump version to v9.3.0 (#5596)
- Fix deprecated numpy.typeDict utilization (#5403)
- Fix signal tests for SciPy 1.7.0 (#5413)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
v10.0.0a2
This is the release note of v10.0.0a2. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
- CuPy now supports CUDA 11.3 (cupy-cuda113) and AMD ROCm 4.2 (cupy-rocm-4-2) and binary wheels are now available on PyPI.
- The following Python syntax and new APIs can now be used in JIT target functions.
  - Calling len, min, max Python built-ins.
    - len(arr): Equivalent to arr.shape[0].
    - min(scalar1, scalar2, ...): Returns the minimum value of the inputs.
    - max(scalar1, scalar2, ...): Returns the maximum value of the inputs.
  - Accessing .ndim, .size attributes of ndarray.
  - Unpacking nested tuples.
    - (x, y), z = ...
  - jit.grid() API, similar to numba.cuda.grid.
    - x, y, z = cupyx.jit.grid(3) (x is equal to threadIdx.x + blockIdx.x * blockDim.x.)
  - Warp shuffle and sync functions.
    - cupyx.jit.shfl_down_sync(mask, var, val_id) (__shfl_down_sync(mask, var, val_id))
- cupyx.scipy.sparse.{coo,csr,csc}_matrix now provides the reshape method.
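To make the index formula quoted for jit.grid() concrete, here is a pure-Python model of the arithmetic. It does not use CuPy or a GPU; `grid_1d` and `covered_indices` are hypothetical names introduced only for illustration, and the ceiling-division launch size plus bounds check is a common convention assumed here rather than something stated in the release note.

```python
def grid_1d(thread_idx, block_idx, block_dim):
    """Global x index, following threadIdx.x + blockIdx.x * blockDim.x."""
    return thread_idx + block_idx * block_dim


def covered_indices(n, block_dim):
    """Element indices touched when enough blocks are launched for n elements."""
    num_blocks = (n + block_dim - 1) // block_dim  # ceiling division
    indices = []
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            x = grid_1d(thread_idx, block_idx, block_dim)
            if x < n:  # threads past the end of the array do nothing
                indices.append(x)
    return indices
```

For example, `covered_indices(10, 4)` launches three blocks of four threads; the last two threads of the final block fall outside the array and are skipped by the bounds check.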
Changes without compatibility
Drop CUDA 9.2 & NCCL 2.4 Support (#5214)
CUDA 9.2 and NCCL 2.4 are no longer supported in CuPy v10.
Changes in Stream behavior (#5251)
The same cupy.cuda.Stream instance can now safely be shared between multiple threads. To achieve this, CuPy v10 will not destroy the stream (i.e., call cudaStreamDestroy) if the stream is the current stream of any thread.
Known Issues
- cupy-cuda111 wheels only support CUDA 11.1.1 and will not work with CUDA 11.1.0 (#5313).
- cupy-cuda110 and cupy-cuda111 wheels are not available yet in PyPI. In the meantime, they can be downloaded from the Assets section below. See #4971 for detailed instructions.
Changes
New Features
- Add reshape method for COO, CSR and CSC matrices (#5301)
- Support len, min, max, .ndim, .size in jit (#5319)
- Support nested tuple unpack in CuPy JIT (#5332)
- Support Numba-like jit.grid() syntax in CuPy JIT (#5334)
- Support warp shuffle and sync functions in CuPy JIT (#5335)
Enhancements
- Do not use handles unless requested in cupy.show_config() (#5073)
- Fix to allow sharing a Stream instance between threads (#5251)
- Adding GUFunc order, dtype and casting kwarg support (#5260)
- Support nan, posinf, neginf in cupy.nan_to_num (#5295)
- Use independent version of hipFFT for ROCm 4.1 and later (#5318)
- Support cuTENSOR v1.3.1 (#5338)
- Support cuDNN v8.2.1 (#5357)
Performance Improvements
- Make cuTENSOR available in cupy.einsum (#5203)
Bug Fixes
- Fix check_availablity for cupy.cusolver (#5207)
- Fix MemoryAsync to keep a weakref to stream (#5264)
- Fix cuFFT callback for sm_61 etc (#5304)
- Fix cuDNN preloading (#5327)
- Fix large arrays assignment (#5330)
- Ensure source array is C-contiguous before copying to CUDAArray (#5342)
- Increase test coverage for Generalized Universal Functions (#5344)
- Remove unnecessary print (#5374)
Code Fixes
- Fix cub repository url (#5236)
- Code and comment fixes for stream (#5243)
- Use cdef instead of cpdef where appropriate (#5274)
Documentation
- Fix matmul docstring (#5174)
- Update list of wheels in README (#5267)
- Add user guide for FFT (#5272)
- Bump CuPy version in docs (#5277)
- Add user guide for streams & events (#5283)
- Fix deadlink to tutorial and reorder in README (#5287)
- Document ExternalStream (#5305)
- Add ROCm 4.2 support to install docs (#5354)
- user_guide/basic.rst: various improvements (#5356)
Installation
- Drop support for CUDA 9.2 & NCCL 2.4 (#5214)
- Add upper restrictions to NumPy/SciPy versions (#5225)
- Exclude Cython 3 from setup_requires (#5273)
Tests
- Fix threading memory pool tests (#5263)
- Temporarily remove the async pool test from TestAllocator (#5308)
- Fix Windows CI kernel cache (#5310)
- Tentatively skip unstable MemoryPoolAsync tests (#5350)
- Xfail random generator tests for HIP (#5355)
- Tentatively pin to SciPy 1.6 in Windows CI (#5366)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@anaruse @eternalphane @leofang @maxim-belkin @povinsahu1909
v9.2.0
This is the release note of v9.2.0. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
- CuPy now supports CUDA 11.3 (cupy-cuda113) and AMD ROCm 4.2 (cupy-rocm-4-2) and binary wheels are now available on PyPI.
Known Issues
- cupy-cuda111 wheels only support CUDA 11.1.1 and will not work with CUDA 11.1.0 (#5313).
- cupy-cuda110 and cupy-cuda111 wheels are not available yet in PyPI. In the meantime, they can be downloaded from the Assets section below. See #4971 for detailed instructions.
Changes
Enhancements
- Add CUDA 11.3 headers (#5232)
- Do not use handles unless requested in cupy.show_config() (#5285)
- Use independent version of hipFFT for ROCm 4.1 and later (#5351)
- Support cuTENSOR v1.3.1 (#5370)
- Support cuDNN v8.2.1 (#5372)
Bug Fixes
- MemoryAsyncPool: Use the "current" mempool instead of the "default" one (#5271)
- Fix MemoryAsync to keep a weakref to stream (#5307)
- Fix cuFFT callback for sm_61 etc (#5325)
- Fix large arrays assignment (#5333)
- Fix check_availablity for cupy.cusolver (#5336)
- Fix cuDNN preloading (#5365)
- Ensure source array is C-contiguous before copying to CUDAArray (#5375)
- Remove unnecessary print (#5377)
Code Fixes
Documentation
- Fix matmul docstring (#5281)
- Update list of wheels in README (#5284)
- Add user guide for FFT (#5286)
- Fix deadlink to tutorial and reorder in README (#5291)
- Add user guide for streams & events (#5302)
- Document ExternalStream (#5312)
- user_guide/basic.rst: various improvements (#5356)
- Add ROCm 4.2 support to install docs (#5360)
Installation
Tests
- Fix threading memory pool tests (#5289)
- Fix Windows CI kernel cache (#5317)
- Xfail random generator tests for HIP (#5359)
- Tentatively pin to SciPy 1.6 in Windows CI (#5369)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
v10.0.0a1
This is the release note of v10.0.0a1. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
CUDA 11.0 and 11.1 wheels not available yet in PyPI (#4971)
In the meantime, they can be downloaded from the Assets section below. See #4971 for the detailed instructions.
Changes without compatibility
Current stream is now managed per device (#5172)
CuPy now automatically manages the stream switching when changing a device, so the user is not responsible for changing the stream anymore.
This pull-request also includes a bug fix for #5143. Existing code that mixes with stream: blocks and stream.use() may get different results, as the stream set via the use() API will not be reactivated when exiting a stream context.
import cupy

s1 = cupy.cuda.Stream()
s2 = cupy.cuda.Stream()
s3 = cupy.cuda.Stream()

with s1:
    s2.use()
    with s3:
        pass
    cupy.cuda.get_current_stream()  # -> CuPy v10 returns `s1` instead of `s2`.
Make cupy.cuda.Device context manager interface thread safe (#5083)
Since the first versions of CuPy, using a single cupy.cuda.Device context manager object from multiple threads led to incorrect behavior when restoring the previous device. The correct device is now restored, so user code relying on the old incorrect behavior might need to be updated.
Deprecate cupyx.allow_synchronize and cupyx.DeviceSynchronized APIs (#5226)
These APIs, which were used for detecting when synchronization to a device was happening, have been deprecated since they don't provide reliable behavior.
Changes
Note: many of these PRs are backported to the v9 series and available since the release.
New Features
- CUDA 11.2: Add MemoryAsyncPool to support malloc_async (#4592)
- Add APIs for creating NumPy arrays backed by pinned memory (#4870)
- Support cuSPARSELt (#4883)
- Add gamma distributions to random API (#4905)
- Add random for uniform [0, 1) generation (#4906)
- Add poisson distribution to random API (#4927)
- Add SciPy compatible connected_components (#4940)
- Support shared memory in CuPy JIT (#4950)
- Add cupyx.scipy.sparse.kronsum() (#4968)
- Add hfft2, ihfft2, hfftn, and ihfftn to cupyx.scipy.fft (#4996)
- CuPy JIT: Print kernel code (#5017)
- Add cupyx.jit.atomic_add (#5169)
- CUDA 11.2/11.3: Support MemoryAsyncPool statistics and limits (#5177)
Enhancements
- Ability to pass structured data types by value as kernel parameters (#4829)
- Move the NVTX module to cupy_backends.cuda.libs (#4930)
- Disable CUB SpMV on CUDA 11.x (#4949)
- CuPy JIT: Readable compile error messages (#4991)
- Fix JIT test failures on ROCm (#4998)
- Mark cupyx.jit.rawkernel as experimental (#5005)
- HIP: add -ftz=true (#5007)
- Give gufunc a name (#5013)
- CuPy JIT: Use C++-like typing rule in 'cuda' mode (#5028)
- Add PCI Bus ID to show_config (#5037)
- Print cuSPARSELt version in show_config (#5054)
- Support custom getsource option in CuPy JIT (#5071)
- Make cupy.cuda.Device context manager interface thread safe (#5083)
- Add a new argument out to cupy.asnumpy() (#5155)
- Support cuSPARSELt v0.1.0 (#5158)
- Per device stream (#5172)
- cuTENSOR v1.3.0 for library installer (#5192)
- Add sum_labels to cupyx.scipy.ndimage.measure (#5200)
- Support NCCL v2.9.8 (#5201)
- Fix thrust compilation for ROCm 4.2.0 (#5209)
- Add NVCC path and Python version to show_config (#5215)
- Add CUDA 11.3 headers (#5218)
- Add libraries for CUDA 11.3 (#5219)
- Remove syncdetect APIs (#5226)
Bug Fixes
- Use THRUST_OPTIONAL_CPP11_CONSTEXPR (#5002)
- Use async memcpy in ndarray.copy (#5004)
- Fix DLPack lanes (#5045)
- Disable cuFFT plan cache on CUDA 11.1 (#5046)
- Support PTDS in CuPy memory pool (#5072)
- CuPy JIT: Fix range type (#5077)
- Fix poisson to support lam array (#5087)
- Adjust PATH when preloading to load cuDNN v8 correctly on Windows (#5103)
- Bugfix for typing rule of CuPy JIT (#5125)
- Fix TypeError in svds (#5140)
- Properly handle non-contiguous RHS in cupyx.scipy.sparse.linalg.spsolve (#5168)
- Fix integer scatter_add failure on Windows (#5173)
- MemoryAsyncPool: Use the "current" mempool instead of the "default" one (#5191)
- Fix matmul for input with relaxed strides (#5205)
- Add check_availability for cuTensor routines (#5206)
- Fix windows constexpr (#5233)
- Remove duplicated subtraction in cupy.random.Generator.integers (#5247)
Code Fixes
- Rename cupy.core submodule to cupy._core (#3820)
- Fix some internal cpdef functions to cdef in _kernel.pyx (#5084)
- Remove cupy.cupy (#5121)
- Cosmetic change in cuSPARSELt stub header (#5149)
- Cosmetic changes of CuPy JIT implementation (#5152)
Documentation
- Follow the latest NumPy/SciPy docs style (#4945)
- Fix docs: cupy-cuda112 now on PyPI (#4957)
- Update installation guide for Conda-Forge (#4985)
- CuPy JIT documentation (#5012)
- Document cupyx.time.repeat (#5015)
- Document cupy.cuda.runtime.getDeviceProperties (#5016)
- More documentation on the supported backends (#5019)
- Add links to Anaconda, Gitter, StackOverflow (#5020)
- Improve the documentation on interoperability (#5023)
- Document CFunctionAllocator and ManagedMemory (#5025)
- Fix code block in installation guide (#5033)
- Improve comments for memory and stream API usage (#5060)
- Point to the correct numpy random docs (#5088)
- Add user guide (#5093)
- Add ROCm limitations to docs (#5107)
- Reorganize API reference pages (#5108)
- Revise ROCm doc (#5122)
- Fix docs of scatter_add (#5129)
- Mention baseline API change in upgrade guide (#5131)
- Fix ROCm wheel install steps (#5133)
- Fix docstring in coo.py (#5139)
- Fix docs in stream.pyx (#5144)
- cuDNN v8.2 on documentation (#5148)
- Mention PTDS in ROCm Limitation (#5159)
- Use Sphinx 4 (#5188)
- cuTENSOR v1.3 on documentation (#5196)
- Fix cuSPARSELt not covered in docs (#5221)
- Add cupyx.scipy.ndimage.sum_labels to docs (#5223)
- Improve README (#5254)
- Update logo image (#5255)
- Tentatively remove CUDA 11.3 from support list (#5256)
Installation
- Fix Windows dll loading for Conda (#4974)
- Add warnings for duplicate installation (#5032)
- cuDNN v8.2.0 for library installer (#5146)
- Bump version to v10.0.0a1 (#5269)
Examples
- Fix cuSPARSELt example not to use internal function (#4995)
- Update examples for current version of CuPy (#4999)
Tests
- Refactor random tests (#4907)
- Tentatively pin CI to ROCm 4.0.1 (#4961)
- Fix cutensor import in the test (#4965)
- Make install_tests runnable without depending on current path (#4969)
- Avoid using pip install -e on Windows CI for performance (#4970)
- Update known base branches in flexCI config (#4973)
- Update list of known branches (#4982)
- Fix TestStream cleanup (#5042)
- Mark some memory tests as testing.slow (#5061)
- Fix stream usage on D2D copy test under HIP (#5091)
- Xfail tests for random distribution generator under HIP/ROCm (#5096)
- Adjust testing tolerance for hfftn for HIP/ROCm (#5099)
- Use current device in tests (#5127)
- Fix for updated FlexCI base image (#5164)
- Relax tolerance of cupyx.jit.atomic_add test (#5186)
- Test build for ROCm 4.0 and latest (#5224)
- Fix mergify configuration (#5248)
Others
- Use bot mode in automatic backport (#5051)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@anaruse @beingaryan @eternalphane @grlee77 @insertinterestingnamehere @keckj @leofang @povinsahu1909 @UmashankarTriforce
v9.1.0
This is the release note of v9.1.0. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
CUDA 11.0 and 11.1 wheels not available yet in PyPI (#4971)
In the meantime, they can be downloaded from the Assets section below. See #4971 for the detailed instructions.
Changes without compatibility
Make cupy.cuda.Device context manager interface thread safe (#5083)
Since the first versions of CuPy, using a single cupy.cuda.Device context manager object from multiple threads led to incorrect behavior when restoring the previous device. The correct device is now restored, so user code relying on the old incorrect behavior might need to be updated.
Changes
Enhancements
- Add cupyx.jit.atomic_add (#5181)
- Support custom getsource option in CuPy JIT (#5089)
- Fix JIT test failures on ROCm (#5101)
- Make cupy.cuda.Device context manager interface thread safe (#5147)
- Fix thrust compilation for ROCm 4.2.0 (#5212)
- Add sum_labels to cupyx.scipy.ndimage.measure (#5222)
- Support cuSPARSELt v0.1.0 (#5227)
- Fix Stream destructor not taking care of PTDS (#5228)
- NCCL v2.9.8 (#5229)
- Add NVCC path and Python version to show_config (#5230)
- cuTENSOR v1.3.0 for library installer (#5234)
- Add libraries for CUDA 11.3 (#5235)
Bug Fixes
- Fix DLPack lanes (#5094)
- Fix TypeError in svds (#5161)
- Fix integer scatter_add failure on Windows (#5178)
- Properly handle non-contiguous RHS in cupyx.scipy.sparse.linalg.spsolve (#5180)
- Fix poisson to support lam array (#5182)
- Fix matmul for input with relaxed strides (#5240)
- Add check_availability for cuTensor routines (#5244)
- Fix windows constexpr (#5250)
- Remove duplicated subtraction in cupy.random.Generator.integers (#5261)
Code Fixes
- Remove cupy.cupy (#5137)
- Cosmetic change in cuSPARSELt stub header (#5160)
- Cosmetic changes of CuPy JIT implementation (#5162)
Documentation
- Mention baseline API change in upgrade guide (#5132)
- Fix docstring in coo.py (#5141)
- Fix docs in stream.pyx (#5150)
- Fix docs of scatter_add (#5153)
- Fix ROCm wheel install steps (#5154)
- Mention PTDS in ROCm Limitation (#5166)
- Use Sphinx 4 (#5198)
- cuDNN v8.2 on documentation (#5217)
- Fix cuSPARSELt not covered in docs (#5231)
- cuTENSOR v1.3 on documentation (#5238)
- Add cupyx.scipy.ndimage.sum_labels to docs (#5245)
- Update logo image (#5257)
- Improve README (#5259)
Installation
Tests
- Use current device in tests (#5151)
- Fix stream usage on D2D copy test under HIP (#5157)
- Fix for updated FlexCI base image (#5167)
- Relax tolerance of cupyx.jit.atomic_add test (#5187)
- Test build for ROCm 4.0 and latest (#5239)
- Avoid using pip install -e on Windows CI for performance (#5242)
- Fix mergify configuration (#5249)
Others
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@anaruse @beingaryan @eternalphane @grlee77 @insertinterestingnamehere @leofang
v9.0.0
This is the release note of v9.0.0.
This release note only covers the changes since v9.0.0rc1 release. Read the blog for the details of new features introduced in CuPy v9!
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
NVIDIA cuSPARSELt
CuPy now integrates the Python binding for the cuSPARSELt library that accelerates sparse matrix multiplications on NVIDIA Ampere GPUs. We are planning to start using it in CuPy sparse APIs to transparently improve performance.
RAPIDS cuGraph
cupyx.scipy.sparse.csgraph is added to the API with support for the connected_components method. The support for cuGraph is optional and can be installed through conda-forge or by manually building CuPy. Currently, PyPI wheels do not have built-in support for cuGraph.
Add MemoryAsyncPool to support malloc_async (#5034)
By using cupy.cuda.set_allocator(cupy.cuda.MemoryAsyncPool().malloc), it is now possible to use the stream-ordered memory allocations introduced in CUDA 11.2.
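The allocator switch above can be wrapped in a small guard, sketched here. `enable_async_pool` is a hypothetical helper rather than a CuPy API: it takes the imported `cupy` module as an argument, assumes a CUDA (not ROCm) build, and uses the runtime version only as a rough check, since the driver must also support stream-ordered allocation.

```python
def enable_async_pool(cupy):
    """Use MemoryAsyncPool as the allocator when the CUDA runtime is 11.2+.

    Hypothetical helper: `cupy` is the imported CuPy module, passed in so
    the function itself has no import-time dependency on a GPU.
    Returns True if the allocator was switched, False otherwise.
    """
    # runtimeGetVersion() encodes the version as major * 1000 + minor * 10,
    # so CUDA 11.2 is reported as 11020.
    if cupy.cuda.runtime.runtimeGetVersion() < 11020:
        return False
    # Same call as in the release note above.
    cupy.cuda.set_allocator(cupy.cuda.MemoryAsyncPool().malloc)
    return True
```

Typical usage would be `import cupy; enable_async_pool(cupy)` at program start, before any device arrays are allocated.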
APIs for creating NumPy arrays backed by pinned memory (#5100)
By using cupyx.empty_pinned(), cupyx.empty_like_pinned(), cupyx.zeros_pinned(), and cupyx.zeros_like_pinned(), it is possible to obtain NumPy ndarrays with their storage located in pinned memory to improve performance of data movement.
CUDA 11.0 and 11.1 wheels not available yet in PyPI (#4971)
In the meantime, they can be downloaded from the Assets section below. See #4971 for the detailed instructions.
Changes
See here for the complete list of solved issues and merged PRs after the v9.0.0rc1 release. For all changes in the v9 series, please refer to the release notes of the pre-releases (alpha1, beta1, beta2, beta3, rc1).
New Features
- Support shared memory in CuPy JIT (#4977)
- Support cuSPARSELt (#4994)
- Add random for uniform [0, 1) generation (#5003)
- CUDA 11.2: Add MemoryAsyncPool to support malloc_async (#5034)
- Add poisson distribution to random API (#5036)
- CuPy JIT: Print kernel code (#5038)
- Add gamma distributions to random API (#5086)
- Add APIs for creating NumPy arrays backed by pinned memory (#5100)
- Add SciPy compatible connected_components (#5113)
Enhancements
- Disable CUB SpMV on CUDA 11.x (#4978)
- Move the NVTX module to cupy_backends.cuda.libs (#5014)
- HIP: add -ftz=true (#5035)
- CuPy JIT: Readable compile error messages (#5041)
- CuPy JIT: Use C++-like typing rule in 'cuda' mode (#5053)
- Mark cupyx.jit.rawkernel as experimental (#5057)
- Add PCI Bus ID to show_config (#5062)
- Print cuSPARSELt version in show_config (#5065)
- Give gufunc a name (#5085)
Bug Fixes
- Use THRUST_OPTIONAL_CPP11_CONSTEXPR (#5011)
- Disable cuFFT plan cache on CUDA 11.1 (#5068)
- Use async memcpy in ndarray.copy (#5078)
- CuPy JIT: Fix range type (#5081)
- Support PTDS in CuPy memory pool (#5082)
- Adjust PATH when preloading to load cuDNN v8 correctly on Windows (#5116)
Code Fixes
- Rename `cupy.core` submodule to `cupy._core` (#4987)
- Fix some internal `cpdef` functions to `cdef` in `_kernel.pyx` (#5098)
Documentation
- Fix docs: cupy-cuda112 now on PyPI (#4990)
- Update installation guide for Conda-Forge (#4993)
- Document `cupyx.time.repeat` (#5027)
- Document `cupy.cuda.runtime.getDeviceProperties` (#5029)
- Doc: Add links to Anaconda, Gitter, StackOverflow (#5030)
- More documentation on the supported backends (#5039)
- Fix code block in installation guide (#5043)
- Document `CFunctionAllocator` and `ManagedMemory` (#5059)
- Improve the documentation on interoperability (#5064)
- CuPy JIT documentation (#5076)
- Improve comments for memory and stream API usage (#5079)
- Add user guide (#5109)
- Reorganize API reference pages (#5114)
- Point to the correct numpy random docs (#5115)
- Follow the latest NumPy/SciPy docs style (#5118)
- Add ROCm limitations to docs (#5119)
- Revise ROCm doc (#5123)
Installation
- Fix Windows dll loading for Conda (#5106)
Examples
- Update examples for current version of CuPy (#5009)
- Fix cuSPARSELt example not to use internal function (#5066)
Tests
- Tentatively pin CI to ROCm 4.0.1 (#4976)
- Update known base branches in flexCI config (#4980)
- Fix `cutensor` import in the test (#4981)
- Update list of known branches (#4989)
- Make install_tests runnable without depending on current path (#4992)
- Fix `TestStream` cleanup (#5052)
- Mark some memory tests as `testing.slow` (#5063)
- Refactor random tests (#5102)
Others
- Use bot mode in automatic backport (#5058)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
v9.0.0rc1
This is the release note of v9.0.0rc1. See here for the complete list of solved issues and merged PRs.
We are planning to release the final v9.0.0 on April 22nd. Please start testing your workload with this release. See the Upgrade Guide for the list of possible breaking changes.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
CuPy JIT (#4774)
Raw kernels can now be created from Python functions using the new `@cupyx.jit.rawkernel` decorator.
import cupy
import numpy
from cupyx import jit

@jit.rawkernel()
def f(x, y, z, n):
    # Grid-stride loop: each thread handles elements tid, tid + ntid, ...
    tid = jit.threadIdx.x + jit.blockIdx.x * jit.blockDim.x
    ntid = jit.blockDim.x * jit.gridDim.x
    for i in range(tid, n, ntid):
        z[i] = x[i] + y[i]

n = numpy.uint32(1024)
x = cupy.arange(n)
y = cupy.arange(n)
z = cupy.empty((n,), dtype='l')
f[16, 16](x, y, z, n)  # launch with a grid of 16 blocks, 16 threads each
Support for Generalized Universal Functions (#4675)
We have added an interface to support Generalized Universal Functions, based on the one in Dask. Currently, it is used in `matmul` to ensure compatibility with NumPy's `__array_ufunc__` dispatching.
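The dispatch mechanism itself can be illustrated with NumPy alone (1.16 or later, where `numpy.matmul` is a generalized ufunc). This is only a sketch of the protocol and does not show CuPy's internal implementation; the `Recorder` class is hypothetical.

```python
import numpy as np


class Recorder:
    """Hypothetical array-like that records what NumPy dispatches to it."""

    def __init__(self):
        self.calls = []

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy hands over the ufunc object itself, so a library can look up
        # its own implementation by name (here we only record the call).
        self.calls.append((ufunc.__name__, method))
        return self


# matmul is a generalized ufunc: its signature describes core dimensions.
print(np.matmul.signature)

rec = Recorder()
result = np.matmul(np.ones((2, 3)), rec)
print(rec.calls)
```

An array library that implements `__array_ufunc__` needs a matching ufunc-like object for each NumPy ufunc it wants to handle, which is why `matmul` benefits from a generalized ufunc interface.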
cuTENSOR Support in Binary Packages (#4600)
cuTENSOR support is now enabled in wheel packages. To use cuTENSOR features, you will need to install the shared library using `python -m cupyx.tools.install_library --cuda 11.2 --library cutensor` after installing wheels.
New Sphinx Theme in Documentation (#4351)
Following NumPy, we have adopted the `pydata_sphinx_theme` in our documentation site starting from this release.
CUDA 11.0 and 11.1 wheels not available yet in PyPI (#4971)
In the meantime they can be downloaded from the Assets section below. See #4971 for the detailed instructions.
Changes without compatibility
`cupy.cuda.nccl` is hidden by default (#4919)
The NCCL wrapper is no longer imported in `cupy/cuda/__init__.py`, so it must be imported explicitly from `cupy.cuda.nccl`.
Drop NCCL & cuDNN shared libraries from wheels (#4850, #4932)
NCCL and cuDNN shared libraries are no longer bundled in any wheel. To activate features using NCCL / cuDNN in CuPy v9, you will need to install these libraries using the `python -m cupyx.tools.install_library` tool after installing CuPy wheels. See the Installation Guide for details.
Eliminating the default bundling of cuDNN and NCCL has further reduced the wheel size, by about 5x on average.
Deprecate `cupy.bool`, `cupy.int`, `cupy.float` and `cupy.complex` (#4790)
Following the NumPy 1.20 API, these aliases for the Python scalar types have been deprecated. `cupy.bool_`, `cupy.int_`, `cupy.float_` and `cupy.complex_` should be used instead when required.
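A short migration sketch, under the assumption that NumPy may stand in for CuPy when no GPU is available (the `xp` alias is only a convention used here). It covers `bool_` and `int_`, and uses the Python builtin `float` as a dtype, which is the type the deprecated `float` alias referred to.

```python
try:
    import cupy as xp
    xp.zeros(1)  # make sure a device is actually usable
except Exception:
    import numpy as xp  # stand-in so the snippet runs without a GPU

# Before (deprecated): xp.zeros(4, dtype=xp.bool), xp.zeros(4, dtype=xp.int)
mask = xp.zeros(4, dtype=xp.bool_)
counts = xp.zeros(4, dtype=xp.int_)
# The deprecated aliases pointed at the Python builtins, which still work as dtypes.
values = xp.zeros(4, dtype=float)

print(mask.dtype, counts.dtype, values.dtype)
```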
Docker image updated to CUDA 11.2 and Python 3.8
The official Docker image is now updated to use CUDA 11.2 and Python 3.8.
Changes
New Features
- LOBPCG solver - `cupyx.scipy.sparse.linalg.lobpcg` (#4281)
- Add diagonal and setdiag methods for COO sparse matrices (#4664)
- Support for Generalized Universal Functions (#4675)
- Support batched `pinv` (#4686)
- Add CuPy JIT Kernel definition (#4774)
- Add `cupy.random.Generator.standard_normal` (#4885)
- Support tuple in CuPy JIT (#4890)
- Add exponential distribution to random API (#4915)
- Support tuple indexing in CuPy JIT (#4939)
- Support `__syncthreads()` in CuPy JIT (#4941)
Enhancements
- Support `nvrtcGetSupportedArchs` (#4691)
- Update DLPack support (#4695)
- Bump cuDNN to v8.1.1 in library installer tool (#4780)
- Support `norm='forward'`/`'backward'` in `cupy.fft` functions (#4797)
- Fix for flake8 F541 (#4803)
- Complete build only when all of the essential modules are available (#4815)
- Support `norm='forward'`/`'backward'` in `cupyx.scipy.fft` functions (#4816)
- Support cuSparse functions for matrix conversion added in CUDA 11.2 (#4844)
- Add NCCL to library installer (#4848)
- Improve cuTENSOR installer (#4852)
- Support `cupy.ndarray` type `shift` in `cupy.roll` (#4884)
- Fix uniform random generation interval (#4894)
- Use NVCC `--threads` option when building CuPy (#4908)
- Bump headers to CUDA 11.2.2 (#4911)
- Update preload to look for `lib` directory to support cuTENSOR/NCCL (#4912)
- Move the NCCL module to `cupy_backends.cuda.libs` (#4919)
- Add `cupy/cuda/cutensor.py` (#4920)
Performance Improvements
Bug Fixes
- Fix dtypes in `cupy.linalg` (#4363)
- Fix: avoid redeclaring attributes (#4764)
- Windows: Fix compiler error for CUB block reduction kernels (#4771)
- Support int argument for Dirichlet shape (#4772)
- Windows: Fix `histogram` test failures (#4777)
- Windows: fix sparse matrix indexing type (#4778)
- Unify linux/windows `randint` with NumPy (#4808)
- Improve/fix csc/csr argmax/argmin (#4813)
- ROCm: Fix sorting bug (#4823)
- Fixed choice function for 0 samples from 0 candidates (#4830)
- Fix redeclaration of sparse warning classes (#4837)
- Fix cuFFT callback compilations - v2 (#4853)
- Solve `UnboundLocalError` on `copy_from_host_async` (#4900)
- Add `out` arg verifier in new random interface. (#4904)
- Fix compilation error due to invalid complex-to-real casting in `_SimpleReductionKernel` (#4909)
- Fix C++ compilation error (#4922)
- Fix cutensor import (#4933)
- Fix flaky `CUDAarray` tests (#4946)
- Declare `CArray._indexing()` only in CuPy JIT mode (#4951)
Code Fixes
- Rename submodules under `cupy.testing` package (#3868)
- Fix: code quality issues (#4587)
- Use newest versions of stylecheck packages (#4694)
- Clean-up sparse max/min argmax/argmin (#4860)
Documentation
- Use pydata_sphinx_theme in Sphinx (#4351)
- Remove `cupy-cuda112` support from documentation (#4761)
- Revert "Remove `cupy-cuda112` support from documentation" (#4785)
- Fix broken Stream docs (#4843)
- Reformat environment variables table (#4845)
- Revert memory back to reference (#4857)
- Update wheel list in README (#4910)
- Merge ROCm installation guide (#4928)
- Document that cuDNN and NCCL are no longer included (#4932)
- Update install docs (#4943)
Installation
- Support optional dependencies from Conda-Forge (#4873)
- Bump version to v9.0.0rc1 (#4953)
- Bump Docker image to use CUDA 11.2 (#4972)
Tests
- Show config on Windows CI (#4649)
- Windows: Fix test condition for CUB device kernels (#4776)
- Xfail some tests for `cupyx.scipy.statistics.correlation` under ROCm/HIP (#4781)
- Windows: fix vectorize tests (#4794)
- Windows: fix OOM errors in the CI (#4801)
- Windows: Fix `sepfir2d` tests (#4804)
- Windows: Fix cuTENSOR tests (#4806)
- Windows: Fix cuTENSOR tests (#4818)
- Remove AppVeyor configurations (#4836)
- Windows: Fix `test_poly1d_pow_scalar` (#4854)
- Fix for flake8 E741 (#4888)
- Windows: Skip failing cuDNN tests (#4893)
- Add names for workflows (#4913)
- Prioritize FlexCI daemon in Windows CI (#4916)
- Fix to work with scheduled FlexCI job (#4929)
- Change irfft tests tolerance (#4937)
- Xfail tests for ndarray indexing under HIP (#4653)
- Adjust tolerance of `TestPolyArithmeticDiffTypes` under HIP/ROCm (#4657)
- Xfail tests in polynomial roots (#4658)
- Xfail tests for manipulation dims under HIP/ROCm (#4662)
- Xfail `TestPolyfitParametersCombinations` when `deg == 0` under ROCm/HIP (#4758)
- Xfail `TestPolyfitCovMode` when `deg == 0` under ROCm/HIP (#4759)
- Xfail `TestInvh` under ROCm/HIP (#4760)
- ROCm: remove the need to set `HCC_AMDGPU_TARGET` at runtime (#4766)
- Assert `MT19937` not implemented in `hipRAND` (#4769)
- Xfail chi-squared test for some random functions under ROCm/HIP (#4770)
- Remove duplicated typedef in example when HIP (#4782)
- Xfail cuDNN version check test under ROCm/HIP (#4791)
- Remove solved xfail mark for msort (#4792)
- Fix to test checking HIP version (#4859)
- Xfail test on sparse handle under ROCm/HIP (#4861)
- Xfail some tests under ROCm/HIP (#4868)
- Xfail some conditions of ndimage filter under ROCm/HIP (#4877)
- Xfail some conditions of ndimage interpolation tests under ROCm/HIP (#4878)
- Xfail some conditions of ndimage measurements under ROCm/HIP (#4879)
- Xfail some conditions of signal tests under ROCm/HIP (#4880)
Others
- Add `CODEOWNERS` file (#4757)
- Add GitHub Actions workflow for automatic backport (#4812)
- Fix pytest opts for Windows CI (#4820)
- Use access token for automated backport (#4833)
- Fix automated backport workflow (#4835)
- Use pull_request_target trigger in backport automation (#4841)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@anaruse @aryamccarthy @grlee77 @leofang @mattvend @povinsahu1909 @venkywonka @viantirreau @withshubh
v8.6.0
This is the release note of v8.6.0. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Notes
Final release for v8.x series
We expect this version to be the final release for the v8.x series. Please start testing your workloads with the latest v9.x pre-release.
CUDA 11.0 and 11.1 wheels for Windows not available yet in PyPI (#4971)
In the meantime they can be downloaded from the Assets section below. See #4971 for the detailed instructions.
Changes
Enhancements
- Bump cuDNN to v8.1.1 in library installer tool (#4795)
- Update DLPack support (#4849)
- Bump headers to CUDA 11.2.2 (#4917)
Bug Fixes
- [v8] Fix `linalg.pinv` on empty matrices (#4783)
- Windows: Fix histogram test failures (#4784)
- Windows: fix sparse matrix indexing type (#4796)
- Support int argument for Dirichlet shape (#4798)
- Windows: Fix compiler error for CUB block reduction kernels (#4814)
- ROCm: Fix sorting bug (#4826)
- Unify linux/windows randint with NumPy (#4827)
- Fix dtypes in `cupy.linalg` (#4839)
- Fixed choice function for 0 samples from 0 candidates (#4851)
- Improve/fix `csc`/`csr` `argmax`/`argmin` (#4858)
- Fix cooperative kernel launch (#4887)
Code Fixes
Documentation
- Remove `cupy-cuda112` support from documentation (#4762)
- Revert "Remove `cupy-cuda112` support from documentation" (#4786)
- Reformat environment variables table (#4856)
Installation
- Bump version to v8.6.0 (#4954)
Tests
- Windows: Fix test condition for CUB device kernels (#4793)
- Windows: Fix cuTENSOR tests (#4818)
- Remove AppVeyor configurations (#4846)
- Windows: fix OOM errors in the CI (#4862)
- Fix raw kernel test (#4871)
- Windows: Fix `test_poly1d_pow_scalar` (#4889)
- Windows: Skip failing cuDNN tests (#4901)
- Add names for workflows (#4914)
- Show config on Windows CI (#4918)
- Prioritize FlexCI daemon in Windows CI (#4921)
- Fix to work with scheduled FlexCI job (#4931)
Others
- Add `CODEOWNERS` file (#4788)
- Fix pytest opts for Windows CI (#4822)
- Rename submodules under `cupy.testing` package (#4876)
Contributors
The CuPy Team would like to thank all those who contributed to this release!