@nvjonwong
Contributor

When making tensors with MATX_ASYNC_DEVICE_MEMORY, the stream is recorded in the memtracker. If the stream is destroyed before the tensor is deallocated, a segfault will occur because the stream is no longer valid.

Since there is no way to check whether a stream is valid after it has been destroyed, the proposed fix is to add a matxStreamDestroy that repoints any recorded references to the stream being destroyed at the null stream. This keeps the allocations alive and allows the memtracker to properly deallocate device_async allocations even if the original stream is gone.
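To make the proposal concrete, here is a minimal plain C++ model of the idea, not the code in this PR. `ToyTracker`, `StreamId`, `kNullStream`, and every method name are hypothetical stand-ins; streams are modeled as integer ids and no CUDA or MatX calls are made. It only shows why repointing recorded streams at the null stream before destruction lets a later free succeed.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-ins; none of these names are MatX or CUDA APIs.
using StreamId = std::uint64_t;
constexpr StreamId kNullStream = 0;  // models the default (null) stream

class ToyTracker {
 public:
  StreamId create_stream() {
    StreamId s = next_stream_++;
    live_streams_.insert(s);
    return s;
  }

  // Records which stream an async allocation was made on.
  int alloc_async(StreamId s) {
    assert(live_streams_.count(s) != 0 && "allocating on a dead stream");
    int id = next_alloc_++;
    alloc_stream_[id] = s;
    return id;
  }

  // Models the proposed wrapper: repoint recorded references, then destroy.
  void stream_destroy(StreamId s) {
    if (s == kNullStream) return;  // the null stream is never destroyed
    for (auto &kv : alloc_stream_) {
      if (kv.second == s) kv.second = kNullStream;
    }
    live_streams_.erase(s);
  }

  // Models destroying the stream directly, without telling the tracker.
  void raw_destroy(StreamId s) {
    if (s != kNullStream) live_streams_.erase(s);
  }

  // Returns false where real code would hand a dead stream to the free call.
  bool free_async(int id) {
    auto it = alloc_stream_.find(id);
    if (it == alloc_stream_.end()) return false;
    bool stream_ok = live_streams_.count(it->second) != 0;
    alloc_stream_.erase(it);
    return stream_ok;
  }

 private:
  std::unordered_set<StreamId> live_streams_{kNullStream};
  std::unordered_map<int, StreamId> alloc_stream_;  // allocation id -> stream
  StreamId next_stream_ = 1;
  int next_alloc_ = 0;
};
```

In this model, an allocation whose stream went through `stream_destroy` frees cleanly on the null stream, while one whose stream went through `raw_destroy` reports a dead stream at free time.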

@nvjonwong nvjonwong self-assigned this Feb 15, 2024
@luitjens
Collaborator

Why would we have a destroy API without a create API? Is this really a MatX issue or a calling code issue?

@nvjonwong
Contributor Author

> Why would we have a destroy api without a create api? Is this really a matx issue or a calling code issue?

It is not a calling code issue. See unit_test.

The issue is that we record the stream used for each device_async allocation. That stream may no longer exist (it may already have been destroyed) when the memory finally gets deallocated (say, by the memtracker at program scope).

@luitjens
Collaborator

But the problem here is that you are destroying the stream before the object goes out of scope. Another possible fix would be to provide you with a delete_tensor(t) option which frees the memory associated with it. This is one of the reasons I hate refcounting and deleting as the scope exits.
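A rough sketch of what an explicit delete_tensor(t) could look like, written as a plain C++ model rather than anything MatX provides. `ToyStream`, `FreeLog`, `ToyTensor`, and this `delete_tensor` are all hypothetical; a stream is reduced to a liveness flag, and "freeing" just logs whether the stream was still alive at that moment.

```cpp
#include <cassert>

// Hypothetical model: a stream is only a liveness flag here.
struct ToyStream {
  bool alive = true;
};

// Counts frees by whether the recorded stream was still alive.
struct FreeLog {
  int ok = 0;
  int on_dead_stream = 0;
};

class ToyTensor {
 public:
  ToyTensor(ToyStream &s, FreeLog &log) : stream_(&s), log_(&log) {
    assert(s.alive && "allocating on a dead stream");
  }
  ToyTensor(const ToyTensor &) = delete;
  ToyTensor &operator=(const ToyTensor &) = delete;

  // Scope exit still frees, but only if nothing freed earlier.
  ~ToyTensor() { release(); }

  // Idempotent explicit free; a second call is a no-op.
  void release() {
    if (released_) return;
    released_ = true;
    if (stream_->alive) {
      ++log_->ok;
    } else {
      ++log_->on_dead_stream;
    }
  }

 private:
  ToyStream *stream_;
  FreeLog *log_;
  bool released_ = false;
};

// Hypothetical free function mirroring the delete_tensor(t) idea.
inline void delete_tensor(ToyTensor &t) { t.release(); }
```

Calling `delete_tensor(t)` before marking the stream dead logs a clean free and makes the destructor a no-op; skipping it leaves the free to scope exit, where the stream may already be gone.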

@nvjonwong
Contributor Author

nvjonwong commented Feb 15, 2024

> But the problem here is that you are destroying the stream before the object goes out of scope. Another possible fix would be to provide you with a delete_tensor(t) option which frees the memory associated with it. This is one of the reasons I hate refcounting and deleting as the scope exits.

To provide more context, this unit_test is a poor analogy for the problem I initially found. The original issue is that when we do matx::fft(...).run(stream), we create a workspace on stream that the user is not privy to. Then I destroy the stream I created in my function at the end of my function scope (normal CUDA semantics). Since the fft workspace lives past the function scope, the recorded stream gets carried over to program scope. Currently, AFAIK, there's no reliable way to check whether the stream recorded in the memtracker is still alive.
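The lifetime mismatch described above can be modeled in a few lines of plain C++. This is only an illustration under stated assumptions: `ToyRuntime`, `WorkspaceCache`, `run_fft`, and `user_function` are hypothetical names, streams are integer ids, and nothing here reflects how MatX actually caches fft workspaces.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model; not MatX's cache or the CUDA runtime.
using StreamId = int;

struct ToyRuntime {
  std::set<StreamId> live;
  StreamId next = 1;
  StreamId create() {
    live.insert(next);
    return next++;
  }
  void destroy(StreamId s) { live.erase(s); }
};

// Library-side cache: each workspace remembers the stream it was made on.
struct WorkspaceCache {
  std::vector<StreamId> recorded;

  // Allocates a workspace the caller never sees and records its stream.
  void run_fft(const ToyRuntime &rt, StreamId s) {
    assert(rt.live.count(s) != 0 && "running on a dead stream");
    recorded.push_back(s);
  }

  // Counts workspaces whose recorded stream no longer exists.
  int dangling(const ToyRuntime &rt) const {
    int n = 0;
    for (StreamId s : recorded) {
      if (rt.live.count(s) == 0) ++n;
    }
    return n;
  }
};

// User code following the usual pattern: create, use, destroy.
inline void user_function(ToyRuntime &rt, WorkspaceCache &cache) {
  StreamId s = rt.create();
  cache.run_fft(rt, s);
  rt.destroy(s);  // the cache still holds s after this returns
}
```

After `user_function` returns, the cache holds one workspace whose recorded stream is gone, which is the state the memtracker would hit at program teardown.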

@cliffburdick
Collaborator

@nvjonwong can we close this in favor of #579?

@nvjonwong nvjonwong closed this Feb 16, 2024
@cliffburdick cliffburdick deleted the nvjonwong/streamdestroy branch November 19, 2024 03:59