Remove caching for functions that take distance metrics as arguments#226
Merged
lmcinnes merged 2 commits into lmcinnes:master on Aug 24, 2023
Conversation
Owner
Thanks for the detailed analysis. I'm not sure whether the lack of inlining will actually cause significant performance differences -- that's something I'll have to experiment with. On the other hand, this is definitely a solid workaround while Numba figures out the right thing to do. Thanks for taking the time to put it together.
There are currently a number of functions with `cache=True` that take distance metric functions as arguments. If you observe the caching behavior, every query is a cache miss and a new copy of the code is cached each time the function is run (even with the same argument types). The reason is that Numba is overly conservative when typing the distance function, including its actual memory address in the type. As a result, the type signatures for these functions (almost) never match between different Python invocations. You can check that the cache for these functions is missed every time by running something like:
There has been some discussion at the Numba project about this and related issues in numba/numba#6772 and numba/numba#6972. It seems that one workaround is to explicitly specify that the distance function argument is a `FunctionType` with a given signature. However, this prevents inlining of the distance function, which I expect would have significant runtime performance implications. It's possible that there is some other workaround involving producing these functions as closures when requested for a given distance metric, but that seems to run into caching problems in some situations as well. Ultimately, I think it will be necessary to wait on the Numba project to resolve the problems with caching functions that take inlined function arguments.

I wouldn't bother proposing this change, except that filling up the cache index also seems to cause significant slowdowns in compilation times. On my machine (macOS M1), in a fresh conda environment, compiling the necessary code to instantiate an `NNDescent` object tends to take around 20 seconds. As the number of cached copies of the code increases, this time slowly increases. But after sufficiently many copies (around 130), the compilation time abruptly jumps to around 3 minutes. I suspect this is due to a bug somewhere in the Numba caching system, but I have not been able to figure out why it happens. In any event, since caching these functions currently does no good and appears to have detrimental effects in at least some situations, it makes sense to turn caching off.