Crash in queueRequest

Bug #1511553 reported by Michi Henning
This bug affects 2 people
Affects: thumbnailer (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Michi Henning

Bug Description

https://errors.ubuntu.com/problem/fa22f0b18ff6dc949c4cdf768f32121d64a83b8d

We hit this occasionally, but not very often. I suspect it is because of the way we shut down: we destroy the thumbnailer instance before the dbus interface so that we don't hit the race condition where one instance of the service still holds the database lock while a second instance of the service is activated by a new incoming request. But destroying the thumbnailer first means that, if a request arrives at just the right time, the dbus interface fires the request at the already-destroyed thumbnailer.

I recently hit a whole slew of issues in a test that destroyed the thumbnailer and dbus interface (in that order) while there were still requests sitting in the scheduler and thread pools, and fixed a bunch of them (see the request-cancellation branch). In effect, via the handler class that calls back into the thumbnailer and the closures that sit in the thread pools, we have created a circular dependency between the dbus interface and the thumbnailer, which makes it impossible to shut down either without causing problems.

I think the best solution would be to add a shutdown() method to the dbus interface that prevents new requests from firing. We also need some way to wait for all *received* (not just scheduled) requests to finish executing before physically shutting down.
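
To make the idea concrete, here is a minimal sketch of what such a gate could look like. The names (DBusInterface, queueRequest, requestFinished) follow the bug title, but the members are illustrative, not the actual thumbnailer code: a flag closes the gate, a counter tracks received requests, and shutdown() blocks until the counter drains.

    // Illustrative sketch only; not the real thumbnailer interface.
    #include <condition_variable>
    #include <mutex>

    class DBusInterface
    {
    public:
        bool queueRequest(/* request args */)
        {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (shutting_down_)
                    return false;  // caller turns this into a dbus error
                ++in_flight_;
            }
            // ... hand the request to the thumbnailer/scheduler ...
            return true;
        }

        void requestFinished()  // called when a request completes
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (--in_flight_ == 0)
                idle_.notify_all();
        }

        void shutdown()  // close the gate, wait for in-flight requests
        {
            std::unique_lock<std::mutex> lock(mutex_);
            shutting_down_ = true;
            idle_.wait(lock, [this] { return in_flight_ == 0; });
        }

    private:
        std::mutex mutex_;
        std::condition_variable idle_;
        bool shutting_down_ = false;
        int in_flight_ = 0;
    };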

Does DBus have a way to say "I'm in the process of shutting down. Send the request that just arrived and that I can't process anymore to a new instance of the service"? Ideally, I would like to "close the gate" on incoming requests such that they are transparently re-sent, and then wait for all currently executing requests to finish processing.


Revision history for this message
Michi Henning (michihenning) wrote :

There is an ancient thread about this here: https://bugs.freedesktop.org/show_bug.cgi?id=11454

Another related discussion: http://comments.gmane.org/gmane.comp.freedesktop.dbus/11549

Doesn't look like we can do what we need. It appears to be impossible to shut down without at least causing an error on the client side.

Changed in thumbnailer:
importance: Undecided → High
description: updated
Changed in thumbnailer (Ubuntu):
importance: Undecided → High
no longer affects: thumbnailer
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in thumbnailer (Ubuntu):
status: New → Confirmed
Revision history for this message
Pete Woods (pete-woods) wrote :

The way DBus handles this is very basic: it queues up requests to the name. So you'll have to release the name before you close the database. I'd also suggest making your service wait to acquire the database lock, up to a certain timeout, on startup. That way, the second instance of your process that gets spawned waits until the first instance has finished messing with the db.
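
A rough sketch of that shutdown order, assuming Qt's QDBusConnection (the bus name is made up for the example):

    #include <QDBusConnection>

    void shutdownService()
    {
        // 1. Release the bus name first. From here on, the bus queues
        //    requests for the name (or activates a new instance) instead
        //    of delivering them to us.
        QDBusConnection::sessionBus().unregisterService("com.example.Thumbnailer");

        // 2. Drain the requests that were already delivered to us.

        // 3. Only now close the database, so the successor instance
        //    finds the leveldb lock released.
    }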

Revision history for this message
Michi Henning (michihenning) wrote :

Thanks Pete! It's sort of a catch-22: if I shut down the dbus connection first, a new instance can be started while the old instance is still in the process of shutting down. And if I destroy the db instance first, requests can still be dispatched that now no longer have a valid target.

The problem with the db lock is that it's buried inside leveldb and, technically, we don't even know where that lock file is or how it works. It's a leveldb implementation detail. I'll have a look at adding a separate lock that we can maintain around the leveldb start-up/shut-down code.

The problem really is DBus here: there is no way to say to DBus "I'm in the process of shutting down, hold any requests for me until I tell you that I'm done shutting down and then send those requests to a new service instance."

Revision history for this message
Pete Woods (pete-woods) wrote :

Well, I'd also argue that leveldb is part of the problem if it can't say "try to acquire the lock up to some timeout". If it can only fail immediately with a "lock could not be acquired" type error, I guess it's not the end of the world to try to get the lock every n milliseconds up to the timeout instead.
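
For illustration, a poll-until-timeout wrapper around leveldb::DB::Open() could look like this (simplified; and, as the next comment notes, a lock failure is not distinguishable by status code, so this retries on any failure until the deadline):

    #include <leveldb/db.h>

    #include <chrono>
    #include <stdexcept>
    #include <string>
    #include <thread>

    leveldb::DB* open_with_timeout(std::string const& path,
                                   int timeout_ms, int interval_ms)
    {
        leveldb::Options options;
        options.create_if_missing = true;

        auto deadline = std::chrono::steady_clock::now()
                        + std::chrono::milliseconds(timeout_ms);
        for (;;)
        {
            leveldb::DB* db = nullptr;
            leveldb::Status s = leveldb::DB::Open(options, path, &db);
            if (s.ok())
                return db;
            if (std::chrono::steady_clock::now() >= deadline)
                throw std::runtime_error("DB::Open failed: " + s.ToString());
            std::this_thread::sleep_for(std::chrono::milliseconds(interval_ms));
        }
    }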

Revision history for this message
Michi Henning (michihenning) wrote :

Yes, fair point, although this functionality is really easy to layer on top, so it's arguably not leveldb's job. What is annoying is that, when the DB is locked, no unique error code is returned for the condition. Instead, the fact that the DB is locked gets lumped into the same status code that is used to indicate dozens of other problems. Sigh… When will people learn that calling code needs to be able to tell different error conditions apart without having to parse the text of the error string? :-(

I think I'll just use a lock file of our own around the initialisation/finalisation.
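
A minimal sketch of such a wrapper lock, using flock(2) in a poll loop (the path and timing are illustrative): acquire it before leveldb initialisation and release it after finalisation, so a newly activated instance simply waits instead of tripping over leveldb's internal LOCK file.

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    #include <chrono>
    #include <stdexcept>
    #include <thread>

    class InstanceLock
    {
    public:
        InstanceLock(char const* path, int timeout_ms)
            : fd_(open(path, O_CREAT | O_RDWR, 0600))
        {
            if (fd_ < 0)
                throw std::runtime_error("cannot open lock file");
            auto deadline = std::chrono::steady_clock::now()
                            + std::chrono::milliseconds(timeout_ms);
            // LOCK_NB makes flock() fail immediately, so poll until the
            // previous instance releases the lock or we hit the deadline.
            while (flock(fd_, LOCK_EX | LOCK_NB) != 0)
            {
                if (std::chrono::steady_clock::now() >= deadline)
                {
                    close(fd_);
                    throw std::runtime_error("timed out waiting for DB lock");
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            }
        }

        ~InstanceLock()
        {
            flock(fd_, LOCK_UN);  // also released automatically on close
            close(fd_);
        }

    private:
        int fd_;
    };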

Changed in thumbnailer (Ubuntu):
assignee: nobody → Michi Henning (michihenning)
status: Confirmed → In Progress
Changed in thumbnailer (Ubuntu):
status: In Progress → Fix Committed
Changed in thumbnailer (Ubuntu):
status: Fix Committed → Fix Released