Remove caching for functions that take distance metrics as arguments#226

Merged
lmcinnes merged 2 commits into lmcinnes:master from jakobhansen-blai:valid-caching on Aug 24, 2023
Conversation

@jakobhansen-blai
Contributor

There are currently a number of functions with `cache=True` that take distance metric functions as arguments. If you observe the caching behavior, every query is a cache miss and a new copy of the code is cached each time the function is run (even with the same argument types). The reason for this is that Numba is overly conservative when typing the distance function, including its actual memory address in the type. This means that the type signatures for these functions (almost) never match between different Python invocations.
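The effect can be illustrated with a minimal pure-Python analogy (this is not Numba's internal machinery; `compile_for` and `compiled_cache` are hypothetical names): if the cache key incorporates the function object's identity, two textually identical metrics still produce different keys, so a fresh Python process always misses.

```python
# Sketch only: simulate a cache whose key embeds the function's memory
# address, standing in for Numba embedding the address in its type.
compiled_cache = {}

def compile_for(metric):
    key = ("process_candidates", id(metric))
    if key in compiled_cache:
        return compiled_cache[key], True    # cache hit
    compiled_cache[key] = f"compiled for {metric.__name__}"
    return compiled_cache[key], False       # cache miss

d1 = lambda x, y: abs(x - y)
d2 = lambda x, y: abs(x - y)  # identical code, different object

_, hit1 = compile_for(d1)
_, hit2 = compile_for(d2)
print(hit1, hit2)  # both misses, despite identical metric code
```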

You can check that the cache for these functions is missed every time by running something like

```python
import numpy as np
import pynndescent

X = np.random.randn(20, 10)
nnd = pynndescent.NNDescent(X, n_neighbors=5)

print(pynndescent.pynndescent_.process_candidates.stats)
```

There has been some discussion at the Numba project about this and related issues in numba/numba#6772 and numba/numba#6972. It seems that one workaround is to explicitly specify that the distance function argument is a FunctionType with a given signature. However, this prevents inlining of the distance function, which I expect would have significant runtime performance implications. It's possible that there is some other workaround involving producing these functions as closures when requested for a given distance metric, but that seems to also run into caching problems in some situations. Ultimately, I think it will be necessary to wait on the Numba project to resolve the problems with caching functions with inlined function arguments.
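The closure-based workaround mentioned above can be sketched in plain Python (a hypothetical illustration, not pynndescent's actual code; `make_process_candidates` and `_METRICS` are invented names): specializations are memoized on the metric's *name* rather than on the function object, so repeated requests within a process reuse one copy. As noted, this only sidesteps the in-process duplication; it does not fix Numba's on-disk cache in all situations.

```python
from functools import lru_cache

def euclidean(x, y):
    # Toy metric for the sketch.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

_METRICS = {"euclidean": euclidean}

@lru_cache(maxsize=None)
def make_process_candidates(metric_name):
    # Build one specialization per metric name; the name is a stable
    # cache key, unlike the function object's memory address.
    dist = _METRICS[metric_name]
    def process_candidates(points, query):
        # Rank candidate points by distance to the query.
        return sorted(points, key=lambda p: dist(p, query))
    return process_candidates

f1 = make_process_candidates("euclidean")
f2 = make_process_candidates("euclidean")
assert f1 is f2  # same specialization reused, no rebuild
```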

I wouldn't bother with proposing this change, except that it also seems that filling up the cache index causes significant slowdowns in compilation times. On my machine (macOS M1), in a fresh conda environment, compiling the necessary code to instantiate an NNDescent object tends to take around 20 seconds. As the number of cached copies of the code increases, this time slowly increases. But after sufficiently many copies (seems to be around 130), the compilation time abruptly increases to around 3 minutes. I suspect this is due to a bug somewhere in the Numba caching system, but have not been able to figure out why it is happening. In any event, since caching these functions is not currently doing any good and it seems to have detrimental effects in at least some situations, it makes sense to turn caching off.

@lmcinnes
Owner

Thanks for the detailed analysis. I'm not actually sure whether the lack of inlining will cause significant performance differences -- it is something I'll have to experiment with. On the other hand, this is definitely a solid workaround while numba figures out the right thing to do. Thanks for taking the time to figure out a good workaround.

@lmcinnes lmcinnes merged commit 82f9c5a into lmcinnes:master Aug 24, 2023