Skip to content

Investigating test hangs#545

Closed
dunhor wants to merge 1 commit intomasterfrom
dunhor/test-failures
Closed

Investigating test hangs#545
dunhor wants to merge 1 commit intomasterfrom
dunhor/test-failures

Conversation

@dunhor
Copy link
Copy Markdown
Member

@dunhor dunhor commented Sep 5, 2025

Tests pretty reliably hang and time out when running in CI. Trying to track down the issue

@dunhor
Copy link
Copy Markdown
Member Author

dunhor commented Sep 5, 2025

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dunhor
Copy link
Copy Markdown
Member Author

dunhor commented Sep 9, 2025

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@dunhor
Copy link
Copy Markdown
Member Author

dunhor commented Sep 24, 2025

Finally tracked down the random crashes to an issue in Catch2. catchorg/Catch2#3031 to fix

@dunhor
Copy link
Copy Markdown
Member Author

dunhor commented Sep 28, 2025

Going to start documenting failures/hangs here so I don't lose track of them. Here's the latest hang:

===== RNG Seed is 3302575063 =====
0.000 s: UniqueEnvironmentStrings
0.000 s: ResultTests::NoOriginationByDefault
0.003 s: WilWintrustWrapperTest::VerifyWintrustDataAllocateAndFree
0.000 s: RegistryWatcherTests::VerifyCallbackFinishesBeforeFreed
0.297 s: RegistryWatcherTests::VerifyResetAfterDelete
0.001 s: RegistryWatcherTests::VerifyDelivery
0.001 s: WinRTTests::VectorToVectorTest
0.000 s: ResultTests::ProcessLocalStorage
0.000 s: WilWintrustWrapperTest::VerifyUniqueHCATADMINAllocateAndFree
0.000 s: ComTests::Test_Make
===== HANG IS HERE DURING THE BELOW TEST =====
WinRTTests::IterableRangeTest

@dunhor
Copy link
Copy Markdown
Member Author

dunhor commented Sep 28, 2025

Another hang:

===== RNG Seed is: 3725117151 =====
0.000 s: BasicRegistryTests::ExampleUsage
0.000 s: Read values
0.000 s: BasicRegistryTests::ExampleUsage
0.001 s: Write values
0.001 s: BasicRegistryTests::ExampleUsage
0.042 s: Helper functions
0.042 s: BasicRegistryTests::ExampleUsage
0.016 s: CppWinRTAuthoringTests::RegisterComServerThrowIsSafe
0.001 s: GetModuleFileNameTests::VerifyWithRealResultsAndShortInitialBufferLength
0.000 s: CppWinRTTests::ZStringViewFromHString
===== HANG IS HERE DURING THE BELOW TEST =====
Don't inhibit arrays (first section in WinRTTests::HStringComparison)

@dunhor dunhor force-pushed the dunhor/test-failures branch from 10baf45 to fd4f5c9 Compare October 3, 2025 05:22
}

// NOTE: Requires 's_lock' to be held
apartment_variable_storage* ensure_current_apartment_variables()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it requires the lock to be held should it take one of those WIL lock holder parameters to try and enforce that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes here were just to move internal functions to be protected instead of public.

{
_doneHangingHandle.SetEvent();
}
CWMO_DISPATCH_CALLS | CWMO_DISPATCH_WINDOW_MESSAGES, INFINITE, ARRAYSIZE(handles), handles, &index));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing a 10 second timeout to infinite isn't going to help with tests hanging :|

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not, but I'm trying to remove non-determinism as much as I can, especially since CI machines are unpredictable. I've added some "hang detection" logic to terminate tests after 10 min w/ callstacks, so INFINITE waits aren't that bad anymore.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I haven't (yet) turned all waits INFINITE, but maybe I should...

wil::shared_event doneHangingHandle;
doneHangingHandle.create();
// There is a race between cancellation and the underlying implementation getting called such that it's possible for the call
// to get marked as cancelled and then the marshalling code sees that the call is cancelled and never invoke the
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that is sneaky

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI - this draft is just to get CI runs more easily (yeah I could technically schedule on my branch, but I won't get runs on push). I'm likely to delete this PR & re-issue a new one once everything is ready

@dunhor dunhor force-pushed the dunhor/test-failures branch from af0d282 to 9fbe99c Compare October 3, 2025 21:04
@dunhor dunhor closed this Oct 3, 2025
@dunhor dunhor deleted the dunhor/test-failures branch October 6, 2025 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants