This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Fix crash about 'SkiaUnrefQueue::Drain' is called after 'IOManager' reset #32106

Merged
merged 6 commits into from
Mar 21, 2022

Conversation

ColdPaleLight
Member

@ColdPaleLight ColdPaleLight commented Mar 18, 2022

fix flutter/flutter#87895

Change the type of SkiaUnrefQueue's context_ from fml::WeakPtr<GrDirectContext> to sk_sp<GrDirectContext>.

We should ensure that the resource context is still valid when SkiaUnrefQueue::Drain is called; otherwise a crash may occur.
See flutter/flutter#87895

Without this patch, the new test crashes with the following stack trace:

Stack Trace
[----------] 1 test from ShellIOManagerTest
[ RUN      ] ShellIOManagerTest.ItDoesNotCrashThatSkiaUnrefQueueDrainAfterIOManagerReset
flutter: called back
[ERROR:flutter/fml/backtrace.cc(116)] Caught signal SIGSEGV during program execution.
Frame 0: 0x106a4de57 GrGLContextInfo::caps()
Frame 1: 0x106a4cea2 GrGLGpu::glCaps()
Frame 2: 0x106a8b4fb GrGLGpu::deleteSync()
Frame 3: 0x106ac2c3a GrGLSemaphore::~GrGLSemaphore()
Frame 4: 0x106ac2cd3 GrGLSemaphore::~GrGLSemaphore()
Frame 5: 0x106ac2d27 GrGLSemaphore::~GrGLSemaphore()
Frame 6: 0x106551c2a std::__1::default_delete<>::operator()()
Frame 7: 0x106551bca std::__1::unique_ptr<>::reset()
Frame 8: 0x106551b37 std::__1::unique_ptr<>::~unique_ptr()
Frame 9: 0x10654ea63 std::__1::unique_ptr<>::~unique_ptr()
Frame 10: 0x10654e85c GrBackendTextureImageGenerator::RefHelper::~RefHelper()
Frame 11: 0x10654eb23 GrBackendTextureImageGenerator::RefHelper::~RefHelper()
Frame 12: 0x10654f4eb SkNVRefCnt<>::unref()
Frame 13: 0x10654f439 GrBackendTextureImageGenerator::~GrBackendTextureImageGenerator()
Frame 14: 0x10654f543 GrBackendTextureImageGenerator::~GrBackendTextureImageGenerator()
Frame 15: 0x10654f597 GrBackendTextureImageGenerator::~GrBackendTextureImageGenerator()
Frame 16: 0x105e8832a std::__1::default_delete<>::operator()()
Frame 17: 0x105e8822a std::__1::unique_ptr<>::reset()
Frame 18: 0x105e88197 std::__1::unique_ptr<>::~unique_ptr()
Frame 19: 0x105e87763 std::__1::unique_ptr<>::~unique_ptr()
Frame 20: 0x1060f91f8 SharedGenerator::~SharedGenerator()
Frame 21: 0x1060f9193 SharedGenerator::~SharedGenerator()
Frame 22: 0x1060f913b SkNVRefCnt<>::unref()
Frame 23: 0x1060f909e SkSafeUnref<>()
Frame 24: 0x1060f903a sk_sp<>::~sk_sp()
Frame 25: 0x1060f4e93 sk_sp<>::~sk_sp()
Frame 26: 0x1060f8c50 SkImage_Lazy::~SkImage_Lazy()
Frame 27: 0x1060f7863 SkImage_Lazy::~SkImage_Lazy()
Frame 28: 0x1060f78b7 SkImage_Lazy::~SkImage_Lazy()
Frame 29: 0x105cf5a4d SkRefCntBase::internal_dispose()
Frame 30: 0x104e4452e SkRefCntBase::unref()
Frame 31: 0x106efb95f flutter::SkiaUnrefQueue::Drain()
Frame 32: 0x106f03a3b flutter::SkiaUnrefQueue::Unref()::$_0::operator()()
Frame 33: 0x106f0399b std::__1::__invoke<>()
Frame 34: 0x106f038fb std::__1::__invoke_void_return_wrapper<>::__call<>()
Frame 35: 0x106f038ab std::__1::__function::__alloc_func<>::operator()()
Frame 36: 0x106f01687 std::__1::__function::__func<>::operator()()
Frame 37: 0x104e58fb0 std::__1::__function::__value_func<>::operator()()
Frame 38: 0x104e58f43 std::__1::function<>::operator()()
Frame 39: 0x105636dc5 fml::MessageLoopImpl::FlushTasks()
Frame 40: 0x105636c58 fml::MessageLoopImpl::RunExpiredTasksNow()
Frame 41: 0x1056611b0 fml::MessageLoopDarwin::OnTimerFire()
Frame 42: 0x7fff205d6279 __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__
Frame 43: 0x7fff205d5d6d __CFRunLoopDoTimer
Frame 44: 0x7fff205d58ca __CFRunLoopDoTimers
Frame 45: 0x7fff205bc4a3 __CFRunLoopRun
Frame 46: 0x7fff205bb61c CFRunLoopRunSpecific
Frame 47: 0x105661671 fml::MessageLoopDarwin::Run()
Frame 48: 0x105636bc7 fml::MessageLoopImpl::DoRun()
Frame 49: 0x1056356cb fml::MessageLoop::Run()
Frame 50: 0x10565a3bc fml::Thread::Thread()::$_0::operator()()
Frame 51: 0x10565a26b std::__1::__invoke<>()
Frame 52: 0x10565a173 std::__1::__thread_execute<>()
Frame 53: 0x105659ad6 std::__1::__thread_proxy<>()
Frame 54: 0x7fff204c48fc _pthread_start
Frame 55: 0x7fff204c0443 thread_start

cc @dnfield

Pre-launch Checklist

  • I read the [Contributor Guide] and followed the process outlined there for submitting PRs.
  • I read the [Tree Hygiene] wiki page, which explains my responsibilities.
  • I read and followed the [Flutter Style Guide] and the [C++, Objective-C, Java style guides].
  • I listed at least one issue that this PR fixes in the description above.
  • I added new tests to check the change I am making or feature I am adding, or Hixie said the PR is test-exempt. See [testing the engine] for instructions on
    writing and running engine tests.
  • I updated/added relevant documentation (doc comments with ///).
  • I signed the [CLA].
  • All existing and new tests are passing.

@skia-gold

Gold has detected about 1 new digest(s) on patchset 1.
View them at https://flutter-engine-gold.skia.org/cl/github/32106

@ColdPaleLight ColdPaleLight requested a review from dnfield March 18, 2022 13:10
@skia-gold

Gold has detected about 1 new digest(s) on patchset 3.
View them at https://flutter-engine-gold.skia.org/cl/github/32106

dnfield
dnfield previously approved these changes Mar 18, 2022
Contributor

@dnfield dnfield left a comment

LGTM with nit.

@chinmaygarde FYI


@pragma('vm:entry-point')
void frameCallback(_Image, int) {
print('called back');
Contributor

Please remove the print

Contributor

(and consider adding a comment about why this method is empty and what test is using it)

Member Author

done

PostTaskSync(runners.GetIOTaskRunner(), [&]() {
// 'SkiaUnrefQueue.Drain' will be called after 'io_manager.reset()' in this
// test. If the resource context has already been destroyed by then, it will
// crash.
Contributor

This is surprising to me, but after tracing through it, it seems to make sense. Adding some notes here so I don't forget this later:

  • Drain() currently checks whether the weak pointer is still valid or not before trying to call anything on it
  • However, calling unref on the SkImage_Lazy ends up freeing a GrBackendTexture. That object seems to assume that something else is keeping the context alive. This seems like it might be a bad assumption on Skia's part, but in Skia's defense we're doing something pretty weird here by keeping GPU resident objects alive without keeping the GrDirectContext alive ourselves.
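
The two notes above can be illustrated with a small model. This is only a sketch under stated assumptions: std::shared_ptr and std::weak_ptr stand in for sk_sp and fml::WeakPtr, and FakeContext, FakeGpuObject, FakeUnrefQueue and CountUnsafeDestroys are hypothetical names, not engine or Skia types. It shows that a weak-pointer check inside Drain does not protect the queued objects, whose destructors run either way, while a strong reference keeps the context alive until the drain has finished.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for GrDirectContext; not the Skia type.
struct FakeContext {};

// Stand-in for a GPU-backed object whose destructor needs the context.
// The weak_ptr exists only so the sketch can observe, without undefined
// behavior, whether the context was already gone at destruction time.
class FakeGpuObject {
 public:
  FakeGpuObject(std::weak_ptr<FakeContext> context, int* unsafe_destroys)
      : context_(std::move(context)), unsafe_destroys_(unsafe_destroys) {
    assert(unsafe_destroys_ != nullptr);
  }
  ~FakeGpuObject() {
    // A real object would call into the freed context here.
    if (context_.expired()) ++*unsafe_destroys_;
  }

 private:
  std::weak_ptr<FakeContext> context_;
  int* unsafe_destroys_;
};

class FakeUnrefQueue {
 public:
  FakeUnrefQueue(const std::shared_ptr<FakeContext>& context, bool hold_strong)
      : weak_context_(context),
        strong_context_(hold_strong ? context
                                    : std::shared_ptr<FakeContext>()) {}

  void Unref(std::unique_ptr<FakeGpuObject> object) {
    objects_.push_back(std::move(object));
  }

  void Drain() {
    // Object destructors run here regardless of any weak-pointer check.
    objects_.clear();
    if (auto context = weak_context_.lock()) {
      // Calls on the context itself would go here, guarded by the check.
    }
  }

 private:
  std::weak_ptr<FakeContext> weak_context_;
  std::shared_ptr<FakeContext> strong_context_;
  std::vector<std::unique_ptr<FakeGpuObject>> objects_;
};

// Replays "owner resets, then the drain runs" and counts the objects that
// were destroyed after the context was gone.
int CountUnsafeDestroys(bool hold_strong) {
  int unsafe_destroys = 0;
  auto context = std::make_shared<FakeContext>();
  FakeUnrefQueue queue(context, hold_strong);
  queue.Unref(std::make_unique<FakeGpuObject>(context, &unsafe_destroys));
  context.reset();  // models io_manager.reset()
  queue.Drain();    // models the delayed drain task
  return unsafe_destroys;
}
```

In this model, CountUnsafeDestroys(false) reports one object destroyed after its context, and CountUnsafeDestroys(true) reports none.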

Member Author

done

@jason-simmons
Member

This PR changes the SkiaUnrefQueue to hold a strong reference to its GrDirectContext.

Does that create a risk that the resource GrDirectContext could be destructed on a thread other than the IO thread? Would that be safe?

SkiaUnrefQueue references are held by UI thread objects such as UIDartState or various UI thread lambdas. I don't think there is any guarantee about which thread will delete the SkiaUnrefQueue.
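
The concern above rests on a general property of reference counting: the object is destroyed on whichever thread drops the last reference. A minimal sketch of that property, assuming std::shared_ptr and std::thread as stand-ins for sk_sp and the engine's threads; ThreadBoundContext and DestroyedOnCreatingThread are hypothetical names:

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical context that records the thread its destructor ran on.
struct ThreadBoundContext {
  explicit ThreadBoundContext(std::thread::id* destroyed_on)
      : destroyed_on_(destroyed_on) {
    assert(destroyed_on_ != nullptr);
  }
  ~ThreadBoundContext() { *destroyed_on_ = std::this_thread::get_id(); }

  std::thread::id* destroyed_on_;
};

// Returns true if the context was destroyed on the thread that created it.
bool DestroyedOnCreatingThread() {
  std::thread::id destroyed_on;
  auto context = std::make_shared<ThreadBoundContext>(&destroyed_on);

  // Hand the only reference to another thread, which then drops it.
  std::thread other([held = std::move(context)]() mutable { held.reset(); });
  other.join();  // join() makes the write to destroyed_on visible here

  return destroyed_on == std::this_thread::get_id();
}
```

Here the destructor runs on the second thread, so the function returns false. That is the situation to avoid for a context that must be released on the IO thread.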

@dnfield @chinmaygarde

@dnfield dnfield dismissed their stale review March 18, 2022 18:08

Need to resolve @jason-simmons' comment

@dnfield
Contributor

dnfield commented Mar 18, 2022

Ahh @jason-simmons I forgot about #13237.

In that case it seems like we're missing a drain call somewhere.

@dnfield
Contributor

dnfield commented Mar 18, 2022

Here's what seems like it's happening:

  • SkiaUnrefQueue::Unref gets called, scheduling a task on the IO runner with an 8ms delay.
  • Before 8ms is up, the IO manager gets collected on the IO runner.

I'm thinking we could fix this by having another method on UnrefQueue to let it know the GrDirectContext is getting collected. That would be a signal to the queue to unref all of its objects immediately and then unref the context. I think this is safe because then Drain would just get 0 objects. WDYT?
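
A rough sketch of the notification idea proposed above, not the code that was merged. It assumes std::shared_ptr in place of sk_sp; FakeContext, NotifiableUnrefQueue, OnContextCollected and NotifyThenLateDrain are hypothetical names. Queued entries are modeled as callbacks that are told whether the context was still alive when they ran.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct FakeContext {};  // hypothetical stand-in for GrDirectContext

class NotifiableUnrefQueue {
 public:
  // Each entry releases one object and learns whether the context is alive.
  using Release = std::function<void(bool context_alive)>;

  explicit NotifiableUnrefQueue(std::shared_ptr<FakeContext> context)
      : context_(std::move(context)) {}

  void Unref(Release release) {
    assert(release);
    pending_.push_back(std::move(release));
  }

  void Drain() {
    std::vector<Release> pending;
    pending.swap(pending_);  // take the batch, leave the queue empty
    for (auto& release : pending) release(context_ != nullptr);
  }

  // Hypothetical hook: the owner calls this just before dropping the context.
  void OnContextCollected() {
    Drain();           // unref everything while the context is still valid
    context_.reset();  // then let go of the context
  }

 private:
  std::shared_ptr<FakeContext> context_;
  std::vector<Release> pending_;
};

struct DrainCounts {
  int with_context = 0;
  int without_context = 0;
};

DrainCounts NotifyThenLateDrain() {
  DrainCounts counts;
  NotifiableUnrefQueue queue(std::make_shared<FakeContext>());
  queue.Unref([&counts](bool context_alive) {
    if (context_alive) {
      ++counts.with_context;
    } else {
      ++counts.without_context;
    }
  });
  queue.OnContextCollected();  // the IO manager is going away
  queue.Drain();               // the delayed task fires later, finds nothing
  return counts;
}
```

The sketch only covers objects that were queued before the notification. Objects queued afterwards would still be released without a context, which is the gap the following comments discuss.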

@jason-simmons
Member

During Shell shutdown the ShellIOManager is deleted on the IO thread. The ShellIOManager dtor synchronously calls SkiaUnrefQueue::Drain.
Any remaining objects in the SkiaUnrefQueue will be deleted on the IO thread. After that the ShellIOManager dtor will drop its reference to the resource GrDirectContext.

At that point the assumption is that no further objects will be queued to the SkiaUnrefQueue. However, this crash implies that there may still be pending tasks that could queue more objects.

In particular, image decode tasks run on the ConcurrentTaskRunner. AFAICT there is nothing in the Shell shutdown sequence which ensures that ConcurrentTaskRunner tasks are finished before the ShellIOManager is deleted.

Without that there is a risk of a race like this:

  • [UI thread] Queues an image decode to the concurrent task runner.
  • [UI thread] Shell shutdown deletes the engine.
  • [Concurrent worker] Decode completes and queues the upload task to the IO thread.
  • [IO thread] Uploads the image to the GPU and queues the result task to the UI thread.
  • [IO thread] Shell shutdown deletes the ShellIOManager and the GrDirectContext
  • [UI thread] Result task runs and discovers that the isolate has been destroyed. The result task then discards the image's SkiaGPUObject, causing the object to be added to the SkiaUnrefQueue. The SkiaUnrefQueue also schedules a drain task.
  • [IO thread] The drain deletes the image object. But by this point the GrDirectContext has already been destroyed.

The ConcurrentTaskRunner is a challenge for Shell shutdown. It's a process-wide shared resource owned by the DartVM class. For safe shutdown, after the Shell deletes the engine it should wait for completion of any ConcurrentTaskRunner tasks queued on behalf of that engine.

If that is not possible, then maybe this might work?

  • Have the SkiaUnrefQueue hold a strong reference to the GrDirectContext (as is done in this PR)
  • In the SkiaUnrefQueue dtor, queue a task to the SkiaUnrefQueue's task runner.
    • Ownership of the SkiaUnrefQueue's objects and its reference to the GrDirectContext will be transferred to the task.
    • The task will unref all the objects and then unref the GrDirectContext
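
The handoff described in the list above might look roughly like the sketch below. It is an illustration under assumptions, not the merged implementation: FakeTaskRunner is a hypothetical single-threaded stand-in for fml::TaskRunner, std::shared_ptr stands in for sk_sp, and TrackedContext, TrackedObject, HandoffUnrefQueue and RunShutdown are invented names.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical task runner: tasks run only when RunAll() is called, which
// models "on the runner's own thread".
class FakeTaskRunner {
 public:
  void PostTask(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void RunAll() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

struct TrackedContext {
  explicit TrackedContext(bool* destroyed) : destroyed_(destroyed) {}
  ~TrackedContext() { *destroyed_ = true; }
  bool* destroyed_;
};

struct TrackedObject {
  TrackedObject(const bool* context_destroyed, int* unsafe_destroys)
      : context_destroyed_(context_destroyed), unsafe_destroys_(unsafe_destroys) {}
  ~TrackedObject() {
    if (*context_destroyed_) ++*unsafe_destroys_;  // context went away first
  }
  const bool* context_destroyed_;
  int* unsafe_destroys_;
};

class HandoffUnrefQueue {
 public:
  HandoffUnrefQueue(std::shared_ptr<FakeTaskRunner> runner,
                    std::shared_ptr<TrackedContext> context)
      : runner_(std::move(runner)), context_(std::move(context)) {
    assert(runner_ != nullptr);
  }

  // The destructor may run on any thread, so it only transfers ownership.
  ~HandoffUnrefQueue() {
    runner_->PostTask([objects = std::move(objects_),
                       context = std::move(context_)]() mutable {
      objects.clear();  // unref the objects first
      context.reset();  // then unref the context
    });
  }

  void Unref(std::shared_ptr<TrackedObject> object) {
    objects_.push_back(std::move(object));
  }

 private:
  std::shared_ptr<FakeTaskRunner> runner_;
  std::shared_ptr<TrackedContext> context_;
  std::vector<std::shared_ptr<TrackedObject>> objects_;
};

struct ShutdownResult {
  bool context_alive_after_queue_dtor = false;
  bool context_destroyed_by_task = false;
  int unsafe_destroys = 0;
};

ShutdownResult RunShutdown() {
  ShutdownResult result;
  bool context_destroyed = false;
  auto runner = std::make_shared<FakeTaskRunner>();
  auto context = std::make_shared<TrackedContext>(&context_destroyed);
  {
    HandoffUnrefQueue queue(runner, context);
    queue.Unref(std::make_shared<TrackedObject>(&context_destroyed,
                                                &result.unsafe_destroys));
    context.reset();  // the IO manager drops its reference
  }                   // queue destroyed here, possibly off the IO thread
  result.context_alive_after_queue_dtor = !context_destroyed;
  runner->RunAll();   // the posted task releases everything on the runner
  result.context_destroyed_by_task = context_destroyed;
  return result;
}
```

In this model the context survives the queue's destructor and is released only when the runner executes the posted task, after the objects that depend on it.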

@dnfield
Contributor

dnfield commented Mar 18, 2022

Jason is right - I was incorrect, we're guarding everything we need to with a Drain.

@jason-simmons I think what you're suggesting is replacing the DCHECK added in #13237 with a scheduled task that makes sure one final Drain is called and the GrDirectContext is released on the IO task runner (i.e. the task runner it's holding).

@ColdPaleLight
Member Author

Thanks for your suggestions! @jason-simmons @dnfield
I tweaked the code as suggested to make sure the 'GrDirectContext' is destroyed on the IO thread. Please take a look at it again, thanks!

@ColdPaleLight
Member Author

(Added a new test to verify that the resource context is destroyed on the task runner's thread: 44be8cd)

@ColdPaleLight ColdPaleLight requested a review from dnfield March 20, 2022 13:52
-  void Unref(SkRefCnt* object);
+  using ResourceContext = T;
+
+  void Unref(SkRefCnt* object) {
Contributor

Can we remove these back to their own implementation file?

Member Author

@ColdPaleLight ColdPaleLight Mar 21, 2022

I added a test, TEST_F(SkiaGpuObjectTest, UnrefResourceContextInTaskRunnerThread), in commit 44be8cd to make sure the context is released on the IO thread. To write this test I had to change UnrefQueue into a template class, so all the code had to move into the header file. I don't know how to keep my new test working if the code is moved back.

Contributor

Maybe my lack of C++ knowledge is showing, but shouldn't we be able to define the template methods in the cpp file?

Member Author

If the template is defined in the implementation file, the test file cannot include the definition, so the compiler cannot instantiate UnrefQueue<TestResourceContext> and the test code will not compile.

If we really need to hide the implementation, we can consider introducing a skia_gpu_object_impl.h and put the implementation code in that file. WDYT?
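
To make the constraint concrete, here is a simplified sketch of a header-style template that a test can instantiate with its own context type. UnrefQueueSketch, NotifyDrained and DrainOnceWithTestContext are hypothetical names, std::shared_ptr stands in for sk_sp, and the body given to TestResourceContext is invented for illustration. The instantiation compiles only because the full template definition is visible at the point of use.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Header-style template: the complete definition is visible to every
// translation unit that includes it, so each can instantiate its own T.
template <class T>
class UnrefQueueSketch {
 public:
  using ResourceContext = T;

  explicit UnrefQueueSketch(std::shared_ptr<ResourceContext> context)
      : context_(std::move(context)) {}

  void Unref(std::shared_ptr<void> object) {
    assert(object != nullptr);
    objects_.push_back(std::move(object));
  }

  void Drain() {
    objects_.clear();
    if (context_) context_->NotifyDrained();  // hypothetical hook on T
  }

 private:
  std::shared_ptr<ResourceContext> context_;
  std::vector<std::shared_ptr<void>> objects_;
};

// Test-only context type; production code never needs to know about it.
struct TestResourceContext {
  int drains = 0;
  void NotifyDrained() { ++drains; }
};

int DrainOnceWithTestContext() {
  auto context = std::make_shared<TestResourceContext>();
  UnrefQueueSketch<TestResourceContext> queue(context);
  queue.Unref(std::make_shared<int>(42));
  queue.Drain();
  return context->drains;
}
```

If the member definitions lived only in a .cc file, that file would need an explicit instantiation for every context type, including the test-only one, which is why the separate implementation header is floated as the alternative.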

Contributor

We can leave this as is in that case

Contributor

@dnfield dnfield left a comment

Please move more of the impl of SkiaUnrefQueue back to its own C++ file. Otherwise, this change LGTM if it LGT @jason-simmons

@chinmaygarde
Member

chinmaygarde commented Mar 21, 2022

We believe flutter/flutter#48062, #32106, flutter/flutter#50959, and flutter/flutter#100122 are related.

@dnfield dnfield added the waiting for tree to go green This PR is approved and tested, but waiting for the tree to be green to land. label Mar 21, 2022
@fluttergithubbot fluttergithubbot merged commit fab823f into flutter:main Mar 21, 2022
@zanderso
Member

zanderso commented Mar 22, 2022

Great work, @ColdPaleLight! Thanks for fixing this.

And thanks to @dnfield and @jason-simmons for the thorough review!

@ColdPaleLight
Member Author

Great work, @ColdPaleLight! Thanks for fixing this.

And thanks to @dnfield and @jason-simmons for the thorough review!

my pleasure.

@dodatw

dodatw commented Mar 24, 2022

Is there a plan to release this in a new version?

@zanderso
Member

This fix will be in the April beta release, and in the Q2 stable release. Unfortunately, I don't think we're able to cherry-pick it into the 2.10 stable.

@Sheng20152

Sheng20152 commented Jun 17, 2022

@ColdPaleLight
Is issue #105680 fixed by this change too?

@zanderso
Does Flutter SDK v3.0.2 have this fix? We still hit this crash in v2.10.5.

@zanderso
Member

@Sheng20152 Yes the patches mentioned above are in the 3.0 stable release.

thor-ola added a commit to olaparty/engine-builds that referenced this pull request Nov 26, 2022
-    fml::RefPtr<fml::TaskRunner> unref_queue_task_runner);
+    fml::RefPtr<fml::TaskRunner> unref_queue_task_runner,
+    fml::TimeDelta unref_queue_drain_delay =
+        fml::TimeDelta::FromMilliseconds(8));

@ColdPaleLight I'm wondering whether the number 8 here has any particular meaning?
I'm facing a problem while supporting another platform, caused by the delayed task posted in UnrefQueue#Unref in release mode (in debug mode, everything works as expected).
When I change 8 to 0, it works as expected in release mode too.

Member Author

@ColdPaleLight I'm wondering whether the number 8 here has any particular meaning? I'm facing a problem while supporting another platform, caused by the delayed task posted in UnrefQueue#Unref in release mode (in debug mode, everything works as expected). When I change 8 to 0, it works as expected in release mode too.

I guess you can find the answer in the PR #9486. My commit did not change the value of delay, it just moved it to another place.
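
For readers following the diff fragment quoted above: the delay became a defaulted parameter, so callers that pass nothing keep the 8 ms behavior while a test or a port can pass a different value. A minimal sketch of that pattern, assuming std::chrono in place of fml::TimeDelta; DrainSchedule is a hypothetical class, not the engine's:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical holder for the drain delay; the default mirrors the 8 ms
// value that appears in the quoted signature.
class DrainSchedule {
 public:
  explicit DrainSchedule(
      std::chrono::milliseconds drain_delay = std::chrono::milliseconds(8))
      : drain_delay_(drain_delay) {
    assert(drain_delay_.count() >= 0);  // a negative delay makes no sense
  }

  std::chrono::milliseconds drain_delay() const { return drain_delay_; }

 private:
  std::chrono::milliseconds drain_delay_;
};
```

Constructing DrainSchedule() keeps the default, and DrainSchedule(std::chrono::milliseconds(0)) models the zero-delay experiment described in the question.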

Labels
waiting for tree to go green This PR is approved and tested, but waiting for the tree to be green to land.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[engine] Multithread Bug when Flutter Exit(Cause Native Crash)
10 participants