Unrecoverable cudaMemcpy error #112

@assaf127

I get the following output (with verbosity=3) when running kmeans_cuda from Python on a particular input (attached below):

performing kmeans++...
kmeans++: dump 292 64 0x564e90a8e000
kmeans++: dev #0: 0x7fd5f5000000 0x7fd5f51ef600 0x7fd5f51fd5c0
step 1[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 2[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 3[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 4[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN
step 5[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)

internal bug inside kmeans_init_centroids: dist_sum is NaN

internal bug in kmeans_init_centroids: j = 0
step 6[0] dev_dists: 0x7fd5f51fdc00 - 0x7fd5f51fdc40 (64)
cudaMemcpyAsync( host_dists + offset, (*dists)[devi].get(), length * sizeof(float), cudaMemcpyDeviceToHost)
....../kmcuda/src/kmeans.cu:810 -> an illegal memory access was encountered

kmeans_cuda_plus_plus failed
kmeans_init_centroids() failed for yinyang groups: an illegal memory access was encountered
kmeans_cuda_yy failed: no error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: cudaMemcpy failed

The input has 14641 vectors of dimension 64, and I am requesting 292 clusters. I'm using the default yinyang_t=0.1. If I reduce it to yinyang_t=0.01, the function succeeds, with only a single "dist_sum is NaN" error at step 1.
This would be fine if I could wrap the call in try-except and retry, but after the first failure there is probably some memory error left behind: running the code again in the same process with yinyang_t=0.01 results in:

...../kmcuda/src/kmcuda.cc:151 -> an illegal memory access was encountered

After that, I have to restart Python.
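
As a possible workaround for the restart problem, each attempt could run in its own interpreter, so that a failed call takes its process down with it and leaves the caller untouched. The following is only a sketch under assumptions: `WORKER`, `run_isolated`, and `kmeans_with_fallback` are hypothetical names, the pickle is assumed to hold a dict of keyword arguments (as in the reproduction code), and a failed `kmeans_cuda` call is assumed to end the child with a non-zero exit code (the uncaught RuntimeError shown above would do that).

```python
import subprocess
import sys

# Hypothetical worker script. It runs in a fresh interpreter, so any broken
# GPU state is discarded when that process exits.
WORKER = """
import pickle, sys
from libKMCUDA import kmeans_cuda

in_path, out_path, yinyang_t = sys.argv[1], sys.argv[2], float(sys.argv[3])
with open(in_path, "rb") as f:
    params = pickle.load(f)
params["yinyang_t"] = yinyang_t  # override whatever the pickle holds
result = kmeans_cuda(**params)
with open(out_path, "wb") as f:
    pickle.dump(result, f)
"""


def run_isolated(script, args=(), timeout=None):
    """Run `script` in a new Python process; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", script, *map(str, args)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def kmeans_with_fallback(in_path, out_path, thresholds=(0.1, 0.01)):
    """Try each yinyang_t in order; return the first one whose run exits
    cleanly, or None if every attempt fails."""
    for t in thresholds:
        code, _ = run_isolated(WORKER, (in_path, out_path, t))
        if code == 0:
            return t
    return None
```

Calling `kmeans_with_fallback('kmeans_input.pickle', 'kmeans_output.pickle')` would then leave the result in the output file (the output file name is made up for illustration). An exit code of 0 only means that no exception escaped; a run that merely logs "dist_sum is NaN" still counts as a success here.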

I'm using Ubuntu 20.04 with an RTX 2080 Ti, and compiled the library with CUDA_ARCH=75.
The errors can be reproduced with the attached file and the following code:

from libKMCUDA import kmeans_cuda
import pickle
with open('kmeans_input.pickle', 'rb') as f:
    params = pickle.load(f)
kmeans_cuda(**params)

I looked through the code to figure out where the NaNs come from (my data contains no NaNs), but couldn't find the source of the problem. I also couldn't find a way to recover from the failure without restarting Python.
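
For anyone triaging this, a quick sanity check of the input might look like the sketch below. `summarize_samples` is a hypothetical helper, not part of kmcuda. It counts non-finite values, reports the largest magnitude (float32 tops out near 3.4e38, so squaring values above roughly 1.8e19 overflows), and counts distinct rows, since an input with fewer distinct rows than requested clusters would be an unusual case for k-means++. Whether any of these relates to the NaNs reported here is not established.

```python
import math


def summarize_samples(samples):
    """Return basic sanity statistics for a 2-D collection of numbers."""
    # NumPy arrays expose tolist(); plain nested lists are used as-is.
    rows = samples.tolist() if hasattr(samples, "tolist") else list(samples)
    non_finite = 0
    max_abs = 0.0
    for row in rows:
        for value in row:
            if math.isfinite(value):
                max_abs = max(max_abs, abs(value))
            else:
                non_finite += 1  # counts NaN as well as +/-inf
    return {
        "rows": len(rows),
        "dims": len(rows[0]) if rows else 0,
        "non_finite": non_finite,
        "max_abs": max_abs,  # largest magnitude among finite values
        "distinct_rows": len({tuple(row) for row in rows}),
    }
```

With the reproduction code, this could be called as `summarize_samples(params['samples'])`, assuming the pickled dict stores the data under the key `samples`.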
kmeans_input.zip
