[api] Fix memory leaks in TracerProvider.GetTracer API#4906
CodeBlanch merged 13 commits into open-telemetry:main
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##             main    #4906      +/-   ##
==========================================
+ Coverage   83.21%   83.51%   +0.29%
==========================================
  Files         295      295
  Lines       12294    12324      +30
==========================================
+ Hits        10231    10292      +61
+ Misses       2063     2032      -31
```
Flags with carried forward coverage won't be shown.
```
{
    if (this.tracers == null)
    {
        // Note: We check here for a race with Dispose and return a
```
I believe we need to set this.tracers = null inside the same lock. Else we could still run into a situation where some thread calling Dispose sets this.tracers to null after this if check and before the new entry is added to the dictionary. We would want to return a no-op tracer in that case, but we would end up returning a valid tracer.
I just checked it a couple times. I think it is good! Could be I'm not seeing something though. Can you write out a flow for me that you think is flawed? Here are a couple flows I'm imagining.
Case where Dispose runs in the middle of the writer and gets the lock...

- Writer thread reads `this.tracers` on Line 58. It is valid so it begins its work.
- Dispose thread sets `this.tracers` to `null`.
- Dispose thread takes the lock.
- Reader thread misses the cache and tries to take the lock. It has to wait.
- Dispose thread finishes its clean up and releases the lock.
- Writer thread gets the lock. Now it checks `this.tracers == null`. This will be `true` now and it will return a no-op instance.
Case where Dispose runs in the middle of the writer and waits on the lock...

- Writer thread reads `this.tracers` on Line 58. It is valid so it begins its work.
- Reader thread misses the cache and takes the lock. Inside the lock it checks `this.tracers == null`, which is `false`. It begins to do its work.
- Dispose thread sets `this.tracers` to `null`.
- Dispose thread tries to take the lock. It has to wait.
- Writer thread adds a new tracer to the cache and releases the lock. It doesn't care that `this.tracers` is now actually `null` because it is working on a local copy.
- Dispose thread gets the lock and makes all the tracers in the cache no-ops, including the one that was just added.
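The two flows above can be written out as code. The following is a hypothetical reconstruction of the pattern under discussion, not the exact code merged in this PR; the field names (`this.tracers`, `this.lockObject`) and the no-op `Tracer` construction are assumptions:

```csharp
// Hypothetical sketch of the GetTracer pattern discussed above.
public Tracer GetTracer(string name)
{
    // "Line 58": take a local copy of the cache reference.
    var tracers = this.tracers;
    if (tracers == null)
    {
        // Raced with Dispose before we started: hand back a no-op.
        return new Tracer(activitySource: null);
    }

    if (!tracers.TryGetValue(name, out var tracer))
    {
        lock (this.lockObject)
        {
            // Re-check under the lock. If Dispose won the race and has
            // already run, this.tracers is null and we return a no-op.
            if (this.tracers == null)
            {
                return new Tracer(activitySource: null);
            }

            if (!tracers.TryGetValue(name, out tracer))
            {
                tracer = new Tracer(new ActivitySource(name));
                tracers[name] = tracer;
                // Even if Dispose nulls this.tracers right now, it must
                // still wait for this lock; once it gets it, it will
                // no-op every tracer in the cache, including this one.
            }
        }
    }

    return tracer;
}
```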
For case 2,
> Writer thread adds a new tracer to the cache and releases the lock. It doesn't care that `this.tracers` is now actually `null` because it is working on a local copy.
I think this is more of a design choice. Yes, it doesn't care that `this.tracers` is now actually `null`, but it could care about it 😄.

I was thinking we could offer a stronger guarantee that we would never return a `Tracer` when `TracerProvider` is disposed or being disposed. We could avoid this limbo state where the `Dispose` method may or may not have marked the newly returned `Tracer` no-op while it's being used.
I merged the PR because I think what's there will work well enough. I'll circle back to this comment when I have a sec to see if I can simplify it or clean it up in a way that doesn't introduce a bunch of contention.
utpilla left a comment
Left a non-blocking comment: #4906 (comment)
Changes
`TracerProvider` now maintains a cache of the `Tracer`s it has issued. When disposed it will turn them into no-op instances and release their associated `ActivitySource`s.
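The dispose side of that cache might look roughly like this. This is an illustrative sketch, not the merged implementation; the mutable `ActivitySource` member on `Tracer` and the field names are assumptions:

```csharp
// Illustrative sketch of the dispose-time no-op conversion.
public void Dispose()
{
    lock (this.lockObject)
    {
        var tracers = this.tracers;
        if (tracers != null)
        {
            // Null the cache first so racing GetTracer calls fall back
            // to returning no-op instances.
            this.tracers = null;

            foreach (var tracer in tracers.Values)
            {
                // Release the ActivitySource so it drops out of the
                // static list of active sources, and turn the issued
                // Tracer into a no-op.
                tracer.ActivitySource?.Dispose();
                tracer.ActivitySource = null;
            }
        }
    }
}
```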
Consider the following simple application:
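The original snippet did not survive extraction; the following is a hypothetical sketch of the kind of application being described, repeatedly requesting a tracer from a `TracerProvider`:

```csharp
using System;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Hypothetical repro sketch: before this PR, every GetTracer call
// created a new Tracer with its own ActivitySource, and ActivitySource
// instances are tracked in a static list, so memory grows on every
// iteration and is never reclaimed.
using var tracerProvider = Sdk.CreateTracerProviderBuilder().Build();

while (true)
{
    var tracer = tracerProvider.GetTracer("MyTracer");
    GC.Collect();
    Console.WriteLine(GC.GetTotalMemory(forceFullCollection: true));
}
```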
Running that, we will see memory growing per iteration that is never released:
What's going on here?
Today we create a `Tracer` each time `GetTracer` is called, and each one is handed its own `ActivitySource`. Creating spurious `ActivitySource`s is dangerous because there is a static list of all active sources. `Tracer` does NOT implement `IDisposable`, so users aren't given a chance to do this correctly.

After the cache introduced on this PR the graph looks like this:
Merge requirement checklist
- `CHANGELOG.md` files updated for non-trivial changes