Remove caching for functions that take distance metrics as arguments#226

Merged
lmcinnes merged 2 commits into lmcinnes:master from jakobhansen-blai:valid-caching on Aug 24, 2023
Conversation

@jakobhansen-blai
Contributor

There are currently a number of functions with `cache=True` that take distance metric functions as arguments. If you observe the caching behavior, every query is a cache miss and a new copy of the code is cached each time the function is run (even with the same argument types). The reason for this is that Numba is overly conservative when typing the distance function, including its actual memory address in the type. This means that the type signatures for these functions (almost) never match between different Python invocations.
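The effect can be illustrated with a minimal pure-Python analogy (this is not Numba's internal machinery; `compile_for` and `compiled_cache` are hypothetical names): if the cache key incorporates the function object's identity, two textually identical metrics still produce different keys, so a fresh Python process always misses.

```python
# Sketch only: simulate a cache whose key embeds the function's memory
# address, standing in for Numba embedding the address in its type.
compiled_cache = {}

def compile_for(metric):
    key = ("process_candidates", id(metric))
    if key in compiled_cache:
        return compiled_cache[key], True    # cache hit
    compiled_cache[key] = f"compiled for {metric.__name__}"
    return compiled_cache[key], False       # cache miss

d1 = lambda x, y: abs(x - y)
d2 = lambda x, y: abs(x - y)  # identical code, different object

_, hit1 = compile_for(d1)
_, hit2 = compile_for(d2)
print(hit1, hit2)  # both misses, despite identical metric code
```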

You can check that the cache for these functions is missed every time by running something like

```python
import numpy as np
import pynndescent

X = np.random.randn(20, 10)
nnd = pynndescent.NNDescent(X, n_neighbors=5)

print(pynndescent.pynndescent_.process_candidates.stats)
```

There has been some discussion at the Numba project about this and related issues in numba/numba#6772 and numba/numba#6972. It seems that one workaround is to explicitly specify that the distance function argument is a FunctionType with a given signature. However, this prevents inlining of the distance function, which I expect would have significant runtime performance implications. It's possible that there is some other workaround involving producing these functions as closures when requested for a given distance metric, but that seems to also run into caching problems in some situations. Ultimately, I think it will be necessary to wait on the Numba project to resolve the problems with caching functions with inlined function arguments.
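The closure-based workaround mentioned above can be sketched in plain Python (a hypothetical illustration, not pynndescent's actual code; `make_process_candidates` and `_METRICS` are invented names): specializations are memoized on the metric's *name* rather than on the function object, so repeated requests within a process reuse one copy. As noted, this only sidesteps the in-process duplication; it does not fix Numba's on-disk cache in all situations.

```python
from functools import lru_cache

def euclidean(x, y):
    # Toy metric for the sketch.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

_METRICS = {"euclidean": euclidean}

@lru_cache(maxsize=None)
def make_process_candidates(metric_name):
    # Build one specialization per metric name; the name is a stable
    # cache key, unlike the function object's memory address.
    dist = _METRICS[metric_name]
    def process_candidates(points, query):
        # Rank candidate points by distance to the query.
        return sorted(points, key=lambda p: dist(p, query))
    return process_candidates

f1 = make_process_candidates("euclidean")
f2 = make_process_candidates("euclidean")
assert f1 is f2  # same specialization reused, no rebuild
```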

I wouldn't bother with proposing this change, except that it also seems that filling up the cache index causes significant slowdowns in compilation times. On my machine (macOS M1), in a fresh conda environment, compiling the necessary code to instantiate an NNDescent object tends to take around 20 seconds. As the number of cached copies of the code increases, this time slowly increases. But after sufficiently many copies (seems to be around 130), the compilation time abruptly increases to around 3 minutes. I suspect this is due to a bug somewhere in the Numba caching system, but have not been able to figure out why it is happening. In any event, since caching these functions is not currently doing any good and it seems to have detrimental effects in at least some situations, it makes sense to turn caching off.

@lmcinnes
Owner

Thanks for the detailed analysis. I'm not actually sure whether the lack of inlining will cause significant performance differences -- it is something I'll have to experiment with. On the other hand, this is definitely a solid workaround while numba figures out the right thing to do. Thanks for taking the time to figure out a good workaround.

@lmcinnes lmcinnes merged commit 82f9c5a into lmcinnes:master Aug 24, 2023