-
Notifications
You must be signed in to change notification settings - Fork 3k
Description
Description
The equeue library is used by the EventQueue objects in mbed-os. When we use these queue, and cancel events, we can corrupt the internal data structure of the equeue. In turn causing events to be lost or the queue to hang forever.
We've been experiencing this issues at low frequency in our complex application. But I've managed to track down the problem.
The problem is located in this piece of code:
https://github.com/ARMmbed/mbed-os/blob/master/events/equeue/equeue.c#L291
It's a bit complex to understand. The queue is kept in a linked list with siblings. Entries on the same time slot are kept as siblings of their "parent", and the main linked list is kept in order of timed events (first event is first in the list)
Next to that, each entry has a "ref" pointer, which points to the pointer which points to this entry. This ref is used to update whatever is pointing at you for insertion and removal.
So, what goes wrong? The above linked code tries to remove an entry from the linked list. However, it does so only correctly if the item is not a sibling. As soon as it is a sibling, it updates the "ref" with it's "next" pointer, however, the "next" pointer of a sibling isn't really valid (it's left at whatever it was before it became a sibling). As extra insult, it updates the "ref" of whatever that next is pointing at to point at it's own next (which, remember, isn't a really valid one)
In the end, it corrupts the ref from one entry, and possibly unlinks events from the list. With the corrupted ref, it's then possible to have entries that are canceled to stay in the list as well.
I've created the following "offline" test to simulate this:
https://gist.github.com/daid/3c9f20fb0abbf2d71043287f626ca2e9
This outputs:
Before:
--E:0036304C NEXT:00363078 SIB:00363020 REF:0028FE6C = 0028FE6C CB:0040259A 3
E:00363020 NEXT:00362FF4 SIB:00362FC8 REF:00363058 = 00363058 CB:0040259A 2
E:00362FC8 NEXT:00362FF4 SIB:00000000 REF:0036302C = 0036302C CB:0040259A 0
--E:00363078 NEXT:00362FF4 SIB:00000000 REF:00363054 = 00363054 CB:0040259A 4
--E:00362FF4 NEXT:00000000 SIB:00000000 REF:00363080 = 00363080 CB:0040259A 1
After:
--E:0036304C NEXT:00363078 SIB:00362FC8 REF:0028FE6C = 0028FE6C CB:0040259A 3
E:00362FC8 NEXT:00000000 SIB:00000000 REF:00363058 = 00363058 CB:0040259A 0
--E:00363078 NEXT:00362FF4 SIB:00000000 REF:00363054 = 00363054 CB:0040259A 4
--E:00362FF4 NEXT:00000000 SIB:00363020 REF:00362FD0!= 00363080 CB:00000000 1
E:00363020 NEXT:00000000 SIB:00000000 REF:00363058!= 00363000 CB:00000000 2
test_func: 00000000
test_func: 00000003
test_func: 00000004
This dumps the event queue before and after removal of entries.
The entries that start with a double dash are the main list, the other ones are the siblings. The final number at the end is the data number I use to identify the events.
I remove entry labeled 2 first, as this is a middle sibling. Which has the biggest problem with causing the corruption. This corrupts the ref of entry labeled 1, and then when I remove that, the entries 1 and 2 are back in the list. But without proper callback functions and timeouts. Most likely causing a crash at some point. Or even more corruption later on if they are tried to be inserted back into the list.
I've been trying to write up a patch, but the equeue_unqueue function does not have enough information to properly distinguish between a sibling or a main entry. So I'm looking for some advice, I have the following options:
- Remove the siblings. This makes the code a bit less efficient on insertion, but the code less complex, and uses less memory.
- Add knowledge about the siblings in the events, uses more memory. Makes the code even more complex.
- Walk the "tree" on removal to see if we are a sibling, makes the unqueue function not very CPU efficient, but it's only used when canceling an event. During dispatch this isn't used.
Currently, I'm mostly thinking about option 1, as this is becoming a critical bug in our system, and it is not like there are a 100 events in there, there are generally like ~10. So I'm willing to say that this sibling structure is unneeded complexity.
Implemented option 1, removal of the siblings: Ultimaker@2ffb08a
Issue request type
[ ] Question
[ ] Enhancement
[X] Bug