Deadlock when creating geometry in worker thread

Bug #1202448 reported by Matt Stine
54
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Panda3D
Fix Released
Undecided
rdb

Bug Description

I create complex geometry in a worker thread and attach it to the scene graph in my main thread. This works fine unless I attempt to enable the multithreaded render pipeline in my prc, including "/Draw", "Cull/Draw", and other models.

When the multithreaded render pipeline is enabled, my application hangs in a deadlock. The two threads fighting over mutices are the main thread (acquiring a lock inside Panda's do_frame()), and my worker thread (holding a lock inside the GeomVertexData() constructor).

I encountered this problem in C++ against Panda 1.8.1, and it seems to occur whenever a user generates custom geometry in a worker thread or whenever a model is loaded in a worker thread. I've attached a repro test case written in Python, just be sure to add "threading-model Cull/Draw" to your .prc.

Revision history for this message
Matt Stine (stine-matt) wrote :
Revision history for this message
rdb (rdb) wrote :

Thanks for reporting this bug. Incidentally, I ran into a very similar deadlock in renderFrame() today when threading-model is set to Cull/Draw.

Since the constructor itself doesn't really do much, one guess could be that the problem is in the PStat code, since the constructor does initialise some pstat collectors. Compiling without DO_PSTATS would reveal whether or not that is indeed the case. That's just a shot in the dark, though.

Unfortunately, I have too little experience debugging threading issues to help you out here; this is really David's field of expertise.

Changed in panda3d:
status: New → Confirmed
Revision history for this message
rdb (rdb) wrote :

Perhaps it's related to this bug?
https://bugs.launchpad.net/panda3d/+bug/1186845

Revision history for this message
Matt Stine (stine-matt) wrote :

Seems likely.

I haven't yet had a chance to compile without PStats, but the deadlock occurs inside GeomVertexData()'s constructor's initializer list, specifically inside the constructor of PipelineCyclerTrueImpl(), where _pipeline->add_cycler(this) is called. The _pipeline in this context seems to be the "render" pipeline, if that helps.

Any idea on a priority of an ETA for this fix? I realize there are many high priority tasks in Panda, but I'm just trying to get an idea if this is on anyone's radar. Thanks again.

Revision history for this message
David Rose (droklaunchpad) wrote :

This is definitely on my radar, and I hope to get a chance to investigate it soon. Unfortunately I've been overbooked the past couple of weeks and haven't had much chance to look at these and other issues.

Revision history for this message
rdb (rdb) wrote : Re: [Bug 1202448] Re: Deadlock when creating geometry in worker thread

Ugh, I stayed up all night for this but I couldn't figure it out. I know
the following; when the deadlock happens:
* GeomVertexData constructor initializes a PipelineCycler.
* The PipelineCycler constructor calls Pipeline::add_cycler() to add itself.
* However, at that time, the Pipeline mutex is being held by
Pipeline::cycle().
* Pipeline::cycle() deadlocks on the lock of one of the cyclers.

I have not been able to find out which cycler could be locked at that
point. It's certainly not the newly constructed cycler.
Any ideas?

On 8 August 2013 20:34, David Rose <email address hidden> wrote:

> This is definitely on my radar, and I hope to get a chance to
> investigate it soon. Unfortunately I've been overbooked the past couple
> of weeks and haven't had much chance to look at these and other issues.
>
> --
> You received this bug notification because you are subscribed to
> Panda3D.
> https://bugs.launchpad.net/bugs/1202448
>
> Title:
> Deadlock when creating geometry in worker thread
>
> Status in Panda3D:
> Confirmed
>
> Bug description:
> I create complex geometry in a worker thread and attach it to the
> scene graph in my main thread. This works fine unless I attempt to
> enable the multithreaded render pipeline in my prc, including "/Draw",
> "Cull/Draw", and other models.
>
> When the multithreaded render pipeline is enabled, my application
> hangs in a deadlock. The two threads fighting over mutices are the
> main thread (acquiring a lock inside Panda's do_frame()), and my
> worker thread (holding a lock inside the GeomVertexData()
> constructor).
>
> I encountered this problem in C++ against Panda 1.8.1, and it seems to
> occur whenever a user generates custom geometry in a worker thread or
> whenever a model is loaded in a worker thread. I've attached a repro
> test case written in Python, just be sure to add "threading-model
> Cull/Draw" to your .prc.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/panda3d/+bug/1202448/+subscriptions
>

Revision history for this message
David Rose (droklaunchpad) wrote :

Might it be a bounding volume? I've had a lot of problems with those in
the past, it's a bit of a conundrum because bounding volumes propagate up
the graph while everything else propagates down, and this means that
different nodes will need to grab the same lock at the same time. I've
made some efforts to work around this, but I could have missed a case.

But I have no idea if this is the particular deadlock cycle happening in
this case, sorry.

David

On Sat, Oct 11, 2014 at 7:25 PM, Reinier de Blois <email address hidden>
wrote:

> Ugh, I stayed up all night for this but I couldn't figure it out. I know
> the following; when the deadlock happens:
> * GeomVertexData constructor initializes a PipelineCycler.
> * The PipelineCycler constructor calls Pipeline::add_cycler() to add
> itself.
> * However, at that time, the Pipeline mutex is being held by
> Pipeline::cycle().
> * Pipeline::cycle() deadlocks on the lock of one of the cyclers.
>
> I have not been able to find out which cycler could be locked at that
> point. It's certainly not the newly constructed cycler.
> Any ideas?
>
> On 8 August 2013 20:34, David Rose <email address hidden> wrote:
>
>> This is definitely on my radar, and I hope to get a chance to
>> investigate it soon. Unfortunately I've been overbooked the past couple
>> of weeks and haven't had much chance to look at these and other issues.
>>
>> --
>> You received this bug notification because you are subscribed to
>> Panda3D.
>> https://bugs.launchpad.net/bugs/1202448
>>
>> Title:
>> Deadlock when creating geometry in worker thread
>>
>> Status in Panda3D:
>> Confirmed
>>
>> Bug description:
>> I create complex geometry in a worker thread and attach it to the
>> scene graph in my main thread. This works fine unless I attempt to
>> enable the multithreaded render pipeline in my prc, including "/Draw",
>> "Cull/Draw", and other models.
>>
>> When the multithreaded render pipeline is enabled, my application
>> hangs in a deadlock. The two threads fighting over mutices are the
>> main thread (acquiring a lock inside Panda's do_frame()), and my
>> worker thread (holding a lock inside the GeomVertexData()
>> constructor).
>>
>> I encountered this problem in C++ against Panda 1.8.1, and it seems to
>> occur whenever a user generates custom geometry in a worker thread or
>> whenever a model is loaded in a worker thread. I've attached a repro
>> test case written in Python, just be sure to add "threading-model
>> Cull/Draw" to your .prc.
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/panda3d/+bug/1202448/+subscriptions
>>
>
>

Revision history for this message
rdb (rdb) wrote :
Download full text (4.2 KiB)

Ah! I think I tracked down a deadlock.

I stripped the code to the minimum amount I need to trigger the error,
which seems to be:
    vdata = GeomVertexData('Data', GeomVertexFormat.getV3(), Geom.UHStatic)

This is the sequence causing the deadlock:
* GeomVertexData constructor first constructs CDStageWriter, which locks
its cycler and adds it to the Pipeline.
* In the meantime, Pipeline::cycle() gets to that GeomVertexData cycler and
tries to cycle it, but is waiting for it to unlock.
* The constructor creates new GeomVertexArrayData objects. Those try to
add their cyclers using add_cycler, but the Pipeline is still locked.

In this case the workaround seems to be to construct the
GeomVertexArrayData objects *before* locking the cycler. That resolves
*that* deadlock, though others remain.

The fundamental problem here seems to be that it's possible for code to try
and add a cycler to the Pipeline while another cycler is locked by the same
thread.
Might we perhaps solve this issue on a deeper level, by only calling
add_cycler when a CDStageWriter *destructs* rather than constructs, or by
never locking a cycler in a constructor?
Or, by re-engineering add_cycler such that it doesn't rely on the Pipeline
lock?

Something I'm also a bit curious about that would help me understand how
this is supposed to work: why does GeomVertexData need to create the arrays
on *all* the stages rather than just the current and upstream ones? Won't
they automatically propagate downstream anyway?

Thanks for any insights you might have!

On 12 October 2014 17:27, David Rose <email address hidden> wrote:

> Might it be a bounding volume? I've had a lot of problems with those in
> the past, it's a bit of a conundrum because bounding volumes propagate up
> the graph while everything else propagates down, and this means that
> different nodes will need to grab the same lock at the same time. I've
> made some efforts to work around this, but I could have missed a case.
>
> But I have no idea if this is the particular deadlock cycle happening in
> this case, sorry.
>
> David
>
> On Sat, Oct 11, 2014 at 7:25 PM, Reinier de Blois <email address hidden>
> wrote:
>
>> Ugh, I stayed up all night for this but I couldn't figure it out. I know
>> the following; when the deadlock happens:
>> * GeomVertexData constructor initializes a PipelineCycler.
>> * The PipelineCycler constructor calls Pipeline::add_cycler() to add
>> itself.
>> * However, at that time, the Pipeline mutex is being held by
>> Pipeline::cycle().
>> * Pipeline::cycle() deadlocks on the lock of one of the cyclers.
>>
>> I have not been able to find out which cycler could be locked at that
>> point. It's certainly not the newly constructed cycler.
>> Any ideas?
>>
>> On 8 August 2013 20:34, David Rose <email address hidden> wrote:
>>
>>> This is definitely on my radar, and I hope to get a chance to
>>> investigate it soon. Unfortunately I've been overbooked the past couple
>>> of weeks and haven't had much chance to look at these and other issues.
>>>
>>> --
>>> You received this bug notification because you are subscribed to
>>> Panda3D.
>>> https://bugs.launchpad.net/bugs/1202448
>>>
>>> Title:
>>> ...

Read more...

Revision history for this message
rdb (rdb) wrote :
Download full text (5.5 KiB)

David, I think I may have a general solution or two. I think we're close
to finally putting the nail in this bug's coffin.

The first idea is that we allow add_cycler and add_dirty_cycler to be
called while the Pipeline is cycling. This means that cycle() won't hold
the Pipeline lock while it's cycling. This *seems* to work OK since it
first copies the _dirty list into a prev_dirty list, so adding new dirty
cyclers doesn't affect it.
I tested this approach and it seemed to fix *all* of the deadlocks with
this code.
I just don't know how this will affect remove_cycler, which might remove a
cycler from the middle of the linked list. Maybe all we have to do is lock
the _prev and _next cycler of the one we're removing, or, alternatively, we
only let it mark the cycler as needing removal by cycle() later?

The second idea is that we require add_cycler to be called while the cycler
lock is *not* held, but instead *first* waits for the Pipeline lock to come
free and *then* locks the cycler. A bit more tricky to implement. Haven't
tested this.

What do you think?

On 13 October 2014 22:17, Reinier de Blois <email address hidden> wrote:

> Ah! I think I tracked down a deadlock.
>
> I stripped the code to the minimum amount I need to trigger the error,
> which seems to be:
> vdata = GeomVertexData('Data', GeomVertexFormat.getV3(), Geom.UHStatic)
>
> This is the sequence causing the deadlock:
> * GeomVertexData constructor first constructs CDStageWriter, which locks
> its cycler and adds it to the Pipeline.
> * In the meantime, Pipeline::cycle() gets to that GeomVertexData cycler
> and tries to cycle it, but is waiting for it to unlock.
> * The constructor creates new GeomVertexArrayData objects. Those try to
> add their cyclers using add_cycler, but the Pipeline is still locked.
>
> In this case the workaround seems to be to construct the
> GeomVertexArrayData objects *before* locking the cycler. That resolves
> *that* deadlock, though others remain.
>
> The fundamental problem here seems to be that it's possible for code to
> try and add a cycler to the Pipeline while another cycler is locked by the
> same thread.
> Might we perhaps solve this issue on a deeper level, by only calling
> add_cycler when a CDStageWriter *destructs* rather than constructs, or by
> never locking a cycler in a constructor?
> Or, by re-engineering add_cycler such that it doesn't rely on the Pipeline
> lock?
>
> Something I'm also a bit curious about that would help me understand how
> this is supposed to work: why does GeomVertexData need to create the arrays
> on *all* the stages rather than just the current and upstream ones? Won't
> they automatically propagate downstream anyway?
>
> Thanks for any insights you might have!
>
> On 12 October 2014 17:27, David Rose <email address hidden> wrote:
>
>> Might it be a bounding volume? I've had a lot of problems with those in
>> the past, it's a bit of a conundrum because bounding volumes propagate up
>> the graph while everything else propagates down, and this means that
>> different nodes will need to grab the same lock at the same time. I've
>> made some efforts to work around this, but I could have missed a case.
>...

Read more...

Revision history for this message
David Rose (droklaunchpad) wrote :
Download full text (6.6 KiB)

Hi Reinier,

Hmm, I'm worried about the first idea--leaving the Pipeline lock open
during cycle() is troublesome because most of the work done in cycle() is
managing the linked lists, which have to be done while the lock is held
because these are nonatomic operations. With the lock open, we risk race
conditions where two different threads try to manipulate the neighboring
nodes in the list and end up destroying the entire list.

One possible variation on the first idea is to let Pipeline::cycle() hold
the lock most of the time, but release just before each call to
cycler->cycle() (and grab it back afterwards). I think this would probably
work, but now we're paying the overhead of locking and unlocking many more
times per frame. That might be OK (we have to grab the individual cycler's
locks each time anyway).

Oops, got to go--I'll write some more on this later.

On Tue, Oct 14, 2014 at 7:30 AM, Reinier de Blois <email address hidden>
wrote:

> David, I think I may have a general solution or two. I think we're close
> to finally putting the nail in this bug's coffin.
>
> The first idea is that we allow add_cycler and add_dirty_cycler to be
> called while the Pipeline is cycling. This means that cycle() won't hold
> the Pipeline lock while it's cycling. This *seems* to work OK since it
> first copies the _dirty list into a prev_dirty list, so adding new dirty
> cyclers doesn't affect it.
> I tested this approach and it seemed to fix *all* of the deadlocks with
> this code.
> I just don't know how this will affect remove_cycler, which might remove a
> cycler from the middle of the linked list. Maybe all we have to do is lock
> the _prev and _next cycler of the one we're removing, or, alternatively, we
> only let it mark the cycler as needing removal by cycle() later?
>
> The second idea is that we require add_cycler to be called while the
> cycler lock is *not* held, but instead *first* waits for the Pipeline lock
> to come free and *then* locks the cycler. A bit more tricky to implement.
> Haven't tested this.
>
> What do you think?
>
> On 13 October 2014 22:17, Reinier de Blois <email address hidden> wrote:
>
>> Ah! I think I tracked down a deadlock.
>>
>> I stripped the code to the minimum amount I need to trigger the error,
>> which seems to be:
>> vdata = GeomVertexData('Data', GeomVertexFormat.getV3(),
>> Geom.UHStatic)
>>
>> This is the sequence causing the deadlock:
>> * GeomVertexData constructor first constructs CDStageWriter, which locks
>> its cycler and adds it to the Pipeline.
>> * In the meantime, Pipeline::cycle() gets to that GeomVertexData cycler
>> and tries to cycle it, but is waiting for it to unlock.
>> * The constructor creates new GeomVertexArrayData objects. Those try to
>> add their cyclers using add_cycler, but the Pipeline is still locked.
>>
>> In this case the workaround seems to be to construct the
>> GeomVertexArrayData objects *before* locking the cycler. That resolves
>> *that* deadlock, though others remain.
>>
>> The fundamental problem here seems to be that it's possible for code to
>> try and add a cycler to the Pipeline while another cycler is locked by the
>> same thread.
>> Might we perhaps ...

Read more...

Revision history for this message
rdb (rdb) wrote :
Download full text (8.4 KiB)

Hi David,

Hmm, the thing is, though, that add_dirty_cycler only prepends the dirty
cycler to the beginning of the _dirty linked list. Unless I'm
misunderstanding the code, cycle() steals the dirty list at the beginning
and reassigns _dirty to a new empty head node, so that cycle() won't
actually touch anything that add_dirty_cycler does beyond that point
(except for when it adds a cycler *back* to _dirty, but we can grab the
lock at that point without risk of deadlock).

In other words, it doesn't seem like add_dirty_cycler *can* actually mess
with the list that PipelineCycler is iterating over. So what I did in my
approach was for cycle() to grab the lock only momentarily while it's (1)
stealing the _dirty list and (2) adding a dirty cycler back to _dirty. In
both cases there is no risk that this particular deadlock will occur.

As for add_cycler, it only touches the _clean linked list, which cycle()
doesn't seem to touch directly at all, so there doesn't seem to be a reason
for that to lock the lists.

The problem is with remove_cycler, which *can* remove a cycler from the
middle of the list that cycle() is iterating over, which is why I suggested
either (1) letting remove_cycle lock the neighboring list nodes or (2)
letting remove_cycler add a flag to PipelineCycler to mark it for removal,
which cycler() will then do when it gets to it.

Of course, I could just be missing an important detail that makes this all
a load of nonsense. This all assumes that these three methods are the only
ones modifying the linked list.

Thanks for your feedback.

On 16 October 2014 03:28, David Rose <email address hidden> wrote:

> Hi Reinier,
>
> Hmm, I'm worried about the first idea--leaving the Pipeline lock open
> during cycle() is troublesome because most of the work done in cycle() is
> managing the linked lists, which have to be done while the lock is held
> because these are nonatomic operations. With the lock open, we risk race
> conditions where two different threads try to manipulate the neighboring
> nodes in the list and end up destroying the entire list.
>
> One possible variation on the first idea is to let Pipeline::cycle() hold
> the lock most of the time, but release just before each call to
> cycler->cycle() (and grab it back afterwards). I think this would probably
> work, but now we're paying the overhead of locking and unlocking many more
> times per frame. That might be OK (we have to grab the individual cycler's
> locks each time anyway).
>
> Oops, got to go--I'll write some more on this later.
>
> On Tue, Oct 14, 2014 at 7:30 AM, Reinier de Blois <email address hidden>
> wrote:
>
>> David, I think I may have a general solution or two. I think we're close
>> to finally putting the nail in this bug's coffin.
>>
>> The first idea is that we allow add_cycler and add_dirty_cycler to be
>> called while the Pipeline is cycling. This means that cycle() won't hold
>> the Pipeline lock while it's cycling. This *seems* to work OK since it
>> first copies the _dirty list into a prev_dirty list, so adding new dirty
>> cyclers doesn't affect it.
>> I tested this approach and it seemed to fix *all* of the deadlocks with
>> this code.
>> I jus...

Read more...

Moguri (mogurijin)
tags: added: threading
removed: thread
Revision history for this message
rdb (rdb) wrote :

I think this is fixed now. See https://github.com/panda3d/panda3d/issues/217 for more information.

Changed in panda3d:
assignee: nobody → rdb (rdb)
milestone: none → 1.10.0
status: Confirmed → Fix Committed
rdb (rdb)
Changed in panda3d:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.