Description
I get the following output (using verbosity=3) when running kmeans_cuda from Python on a certain input (attached below):
performing kmeans++...
kmeans++: dump 292 64 0x564e90a8e000
kmeans++: dev #0: 0x7fd5f5000000 0x7fd5f51ef600 0x7fd5f51fd5c0
step 1[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 2[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 3[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 4[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
internal bug inside kmeans_init_centroids: dist_sum is NaN
step 5[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
internal bug inside kmeans_init_centroids: dist_sum is NaN
internal bug in kmeans_init_centroids: j = 0
step 6[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
cudaMemcpyAsync( host_dists + offset, (*dists)[devi].get(), length * sizeof(float), cudaMemcpyDeviceToHost)
....../kmcuda/src/kmeans.cu:810 -> an illegal memory access was encountered
kmeans_cuda_plus_plus failed
kmeans_init_centroids() failed for yinyang groups: an illegal memory access was encountered
kmeans_cuda_yy failed: no error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: cudaMemcpy failed
There are 14641 vectors of dimension 64, and I am trying to get 292 clusters. I'm using the default yinyang_t=0.1. If I reduce it to yinyang_t=0.01, the function succeeds, with only a single "dist_sum is NaN" error at step 1.
This would have been fine if I could wrap the function call in try/except. Unfortunately, the first failure seems to leave some memory error behind: running the code again in the same Python session, even with yinyang_t=0.01, results in:
...../kmcuda/src/kmcuda.cc:151 -> an illegal memory access was encountered
After that, I have to restart Python before the library works again.
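One possible way to keep the main interpreter usable is to run each clustering attempt in a separate Python process, so that a broken CUDA context dies with the child instead of staying in the parent. The sketch below is only an illustration under assumptions: `run_isolated` and `CHILD_SCRIPT` are hypothetical names, the child script assumes that kmeans_cuda returns a (centroids, assignments) pair, and the pickle is assumed to hold the keyword arguments as in the reproduction code. It has not been verified against this particular failure.

```python
import subprocess
import sys

# Hypothetical child script (an assumption, not part of kmcuda): it loads the
# pickled keyword arguments, overrides yinyang_t, runs kmeans_cuda and saves
# the result. It assumes kmeans_cuda returns (centroids, assignments).
CHILD_SCRIPT = """
import pickle
import sys

import numpy as np
from libKMCUDA import kmeans_cuda

with open(sys.argv[1], "rb") as f:
    params = pickle.load(f)
params["yinyang_t"] = float(sys.argv[2])
centroids, assignments = kmeans_cuda(**params)
np.savez(sys.argv[3], centroids=centroids, assignments=assignments)
"""


def run_isolated(script, *args, timeout=None):
    """Run Python source in a fresh interpreter.

    Returns (returncode, stdout, stderr). A nonzero return code means the
    child raised an exception or crashed; the calling process is unaffected.
    """
    proc = subprocess.run(
        [sys.executable, "-c", script, *map(str, args)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

A call such as `run_isolated(CHILD_SCRIPT, "kmeans_input.pickle", 0.1, "result.npz")` could then be retried with a different yinyang_t whenever the return code is nonzero, and the result loaded with `numpy.load` on success. The cost is a fresh CUDA initialization per attempt.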
I'm using Ubuntu 20.04 with an RTX 2080 Ti, and compiled the library using CUDA_ARCH=75.
The errors can be reproduced using the attached file and the following code:
from libKMCUDA import kmeans_cuda
import pickle

# The pickle holds the keyword arguments that trigger the failure.
with open('kmeans_input.pickle', 'rb') as f:
    params = pickle.load(f)
kmeans_cuda(**params)
I tried to look at the code to figure out where the NaNs come from (my data has no NaNs in it), but couldn't find the source of the problem. I also couldn't find a way to handle this failure in a recoverable way.
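For anyone reproducing this, the claim that the input is clean can be checked with a short NumPy sketch like the one below. `describe_samples` is a hypothetical helper, not part of kmcuda. Besides NaN and Inf counts, it reports all-zero rows and duplicate rows, which are sometimes relevant to k-means++ style seeding. None of these is confirmed to be the cause of the NaNs here.

```python
import numpy as np


def describe_samples(samples):
    """Return basic sanity statistics for a 2-D sample matrix."""
    arr = np.asarray(samples)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_features)")
    finite = np.isfinite(arr)
    # Row norms in float64 so that the check itself does not overflow.
    norms = np.linalg.norm(arr.astype(np.float64), axis=1)
    return {
        "shape": arr.shape,
        "dtype": str(arr.dtype),
        "nan_count": int(np.isnan(arr).sum()),
        "inf_count": int(np.isinf(arr).sum()),
        "zero_rows": int((norms == 0).sum()),
        "unique_rows": int(np.unique(arr, axis=0).shape[0]),
        # Largest finite magnitude; NaN if there are no finite values at all.
        "abs_max": float(np.abs(arr[finite]).max()) if finite.any() else float("nan"),
    }
```

Assuming the sample matrix sits under a key such as `params["samples"]` in the pickled dictionary (an assumption about the attachment), `describe_samples(params["samples"])` gives a quick summary to post alongside the log.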
kmeans_input.zip