Comment 15 for bug 924416

David Henningsson (diwic) wrote: And now, something is wrong with the flist implementation...

Hi,

I've had my eye on a couple of bugs filed in Ubuntu where PulseAudio
crashes with SIGABRT somewhere inside pa_memblock_* or pa_memblockq_*,
often after several hours of playback. I've now received a core dump
from the bug reporter (thanks, Eric Casteleijn!) and have started
analysing it.

It turns out that the same memblockq list_item is used as part of two
memblockq chains; in essence, for one of the blocks, q->next->prev != q.
How could this be? Well, they seem to be allocated from a static flist
pool. I've never looked at flist before, but it claims to be "A
multiple-reader multiple-writer lock-free free list implementation".
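
Written out, the violated invariant is roughly the following; the
struct and field names below are generic, not necessarily the exact
memblockq ones:

struct list_item {
    struct list_item *prev, *next;
    /* payload ... */
};

/* For every item q in a healthy chain, the neighbours point back. */
static int chain_is_consistent(struct list_item *q) {
    return (!q->next || q->next->prev == q) &&
           (!q->prev || q->prev->next == q);
}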

I had a look at tests/flist-test.c and decided to rewrite the test case
to make it a bit tougher and to see if I could verify that it never
handed out the same block twice. The test seemed to pass at first, but
the ~10th time I ran it, the code hung. All 20 threads accessing the
flist seemed to have stalled. I attached to the process with gdb, and
found that l->stored, l->empty, and l->stored->next [2] all pointed to
the same element, so clearly something was wrong.
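
To give an idea of what I mean by a tougher test, here is a rough
sketch of the kind of check it does, not the actual flist-test.c patch;
the struct below is made up just for illustration:

#include <stdatomic.h>
#include <assert.h>

struct test_block {
    atomic_int in_use;
    /* payload ... */
};

/* Called right after popping b from the flist; aborts if some other
 * thread already holds the very same block. */
static void mark_popped(struct test_block *b) {
    assert(atomic_exchange(&b->in_use, 1) == 0);
}

/* Called right before pushing b back onto the flist. */
static void mark_pushed(struct test_block *b) {
    assert(atomic_exchange(&b->in_use, 0) == 1);
}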

Instead of trying to verify the algorithm, I went to Google to look for
a reference implementation to compare against, and quickly found [1].
And indeed, our flist looks just like the one in the section "Naive
lock-free stack which suffers from ABA problem" on that page. :-/
What's worse, there does not seem to be an easy fix.
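
For anyone who hasn't run into ABA before, here's a minimal sketch in
plain C11 atomics (not the actual flist code) of the naive pop and how
it goes wrong:

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    /* payload ... */
};

static _Atomic(struct node *) head;

/* Thread 1 reads head == A and next == B, then gets preempted.
 * Meanwhile other threads pop A, pop B, hand B out for use, and push
 * A back.  When thread 1 resumes, head is A again, so its CAS
 * succeeds -- but it installs the stale pointer B as the new head,
 * even though B is now owned by someone else. */
static struct node *naive_pop(void) {
    struct node *old, *next;

    do {
        old = atomic_load(&head);
        if (!old)
            return NULL;
        next = old->next;   /* may already be stale when we CAS below */
    } while (!atomic_compare_exchange_weak(&head, &old, next));

    return old;
}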

At this point, it seems likely that this is the root cause of the problem.
But I could use some input on how to proceed.

One trivial idea is just to add a mutex around the flist as an
immediate workaround. That means the implementation is no longer
lock-free, but futexes should be pretty fast, and the lock won't be
held for long anyway. So this might be good enough?
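
Roughly, I'm thinking of something like this (a generic sketch, not a
patch against the real pa_flist code):

#include <pthread.h>
#include <stddef.h>

struct cell { struct cell *next; };

struct locked_flist {
    pthread_mutex_t mutex;   /* init with PTHREAD_MUTEX_INITIALIZER */
    struct cell *head;
};

static void *locked_flist_pop(struct locked_flist *l) {
    struct cell *c;

    pthread_mutex_lock(&l->mutex);
    c = l->head;
    if (c)
        l->head = c->next;
    pthread_mutex_unlock(&l->mutex);

    return c;
}

static void locked_flist_push(struct locked_flist *l, struct cell *c) {
    pthread_mutex_lock(&l->mutex);
    c->next = l->head;
    l->head = c;
    pthread_mutex_unlock(&l->mutex);
}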

I'm also noticing that in most cases where we use the flist, it is as
an optimisation to avoid calling malloc, essentially pointer recycling.
If that is the case, and the objects are not usually allocated by one
thread and freed by another, one could have per-thread flists instead
of one flist shared by all threads.
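
Something along these lines, again just a sketch, assuming objects
really are freed on the thread that allocated them (__thread is the
GCC/Clang thread-local storage extension):

#include <stdlib.h>
#include <stddef.h>

struct cell { struct cell *next; };

static __thread struct cell *tl_free;   /* per-thread free list */

static void *tl_alloc(size_t size) {
    struct cell *c = tl_free;

    if (c) {
        tl_free = c->next;
        return c;
    }
    /* Nothing recycled yet; fall back to malloc. */
    return malloc(size < sizeof(struct cell) ? sizeof(struct cell) : size);
}

static void tl_release(void *p) {
    struct cell *c = p;

    /* Only valid if p was allocated on this same thread. */
    c->next = tl_free;
    tl_free = c;
}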

Any other ideas?

--
David Henningsson, Canonical Ltd.
http://launchpad.net/~diwic

[1] https://en.wikipedia.org/wiki/ABA_problem

[2] and l->empty->next, obviously