Comment 2 for bug 1490595

Roman Podoliaka (rpodolyaka) wrote : Re: Floating ip assigning failed with Error: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None (HTTP 400)

(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') is a very typical error. It means that Galera is running in multi-master mode (in Galera terms all nodes are equal; multi-master here simply means you've got connections to different MySQL servers) and concurrent processes are trying to update the same row on two different nodes simultaneously. By design, Galera does not support write-intent InnoDB locks across nodes, so one or both of the transactions will fail with a bogus deadlock error. Effectively, this means that the application has to resolve the conflict of a concurrent update and retry the transaction. See [1] for details.

Nova (and other OpenStack projects) already provides @_retry_on_deadlock, a decorator to be applied to DB API methods so that the ones updating rows retry their transactions on a deadlock error. The problem with it is that you have to apply it to every single writer method to ensure all deadlocks are handled properly, which is both tedious and error-prone. EngineFacade [2] of oslo.db seems to be the right place to implement transparent retries on deadlocks, but we have to update all the OpenStack projects to use it first.

At the same time, in MOS we *intentionally* deploy Galera in active-backup mode (configured in HAProxy), so that all connections go to the same MySQL server at any given moment. The only possible reason you still see such deadlocks in this case is that HAProxy decided that the active MySQL server was down for some time and promoted a backup server to be active (and switched back again once the original server came back online). It looks like both MySQL servers were actually up, and the active one simply didn't respond to the health check in time. Because of connection pooling in OpenStack services, we ended up in a situation where there are connections to multiple Galera nodes, i.e. an effectively multi-master setup.

One way to avoid that would be to close all connections to the backup servers when the active server is back online [3], but that would also cause connection errors (and effectively abort all ongoing transactions handled by the backup MySQL nodes, as HAProxy does not care about application-level data here). So it's not really a good solution either.

[1] http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
[2] http://specs.openstack.org/openstack/oslo-specs/specs/kilo/make-enginefacade-a-facade.html
[3] http://comments.gmane.org/gmane.comp.web.haproxy/8707