Make stronger sticky sessions guarantees for galera cluster behind haproxy: Floating ip assigning failed with Error: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None (HTTP 400)

Bug #1490595 reported by Andrey Sledzinskiy
Affects (Status, Importance, Assigned to):
  Fuel for OpenStack: Fix Released, High, Bogdan Dobrelya
  8.0.x: Fix Released, High, Bogdan Dobrelya
  Mitaka: Fix Released, High, Bogdan Dobrelya

Bug Description

Floating ip assigning failed with Error: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None (HTTP 400)

Steps:
1. Create and deploy the following cluster: Ubuntu, HA, Neutron with tunneling, 3 controllers, 2 compute nodes
2. After deployment, open the Health Check tab and start the tests

Actual result: the "Check network connectivity from instance via floating IP" test failed with

BadRequest: Unable to associate floating ip 10.109.11.130 to fixed ip 10.0.7.4 for instance 75d081e4-9a62-4992-acfb-211041846a69. Error: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None (HTTP 400) (Request-ID: req-99c2bd02-7f99-436d-bc9d-ebb94ede742b)
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 698, in _assign_floating_ip_to_instance
    self.fail('Can not assign floating ip to instance')
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
    raise self.failureException(msg)
AssertionError: Can not assign floating ip to instance

The test passed on the second run.

{
    "build_id": "257",
    "build_number": "257",
    "release_versions": {
        "2015.1.0-7.0": {
            "VERSION": {
                "build_id": "257",
                "build_number": "257",
                "api": "1.0",
                "fuel-library_sha": "bc04a7092d92400c79e6ea6ede25e7b67c6a6355",
                "nailgun_sha": "3189ccfb8c1dac888e351f535b03bdbc9d392406",
                "feature_groups": [
                    "mirantis"
                ],
                "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
                "openstack_version": "2015.1.0-7.0",
                "fuel-agent_sha": "1e8f38bbb864ed99aa8fe862b6367e82afec3263",
                "production": "docker",
                "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
                "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c",
                "fuel-ostf_sha": "644db51186dc23c9b27e9b5486c120c8363dc87c",
                "release": "7.0",
                "fuelmain_sha": "0e54d68392b359bc122e5bbba9249c729eeaf579"
            }
        }
    },
    "auth_required": true,
    "api": "1.0",
    "fuel-library_sha": "bc04a7092d92400c79e6ea6ede25e7b67c6a6355",
    "nailgun_sha": "3189ccfb8c1dac888e351f535b03bdbc9d392406",
    "feature_groups": [
        "mirantis"
    ],
    "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
    "openstack_version": "2015.1.0-7.0",
    "fuel-agent_sha": "1e8f38bbb864ed99aa8fe862b6367e82afec3263",
    "production": "docker",
    "python-fuelclient_sha": "9643fa07f1290071511066804f962f62fe27b512",
    "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c",
    "fuel-ostf_sha": "644db51186dc23c9b27e9b5486c120c8363dc87c",
    "release": "7.0",
    "fuelmain_sha": "0e54d68392b359bc122e5bbba9249c729eeaf579"
}

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
tags: added: galera
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

(OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') is a very typical error, which means that Galera is running in multi-master mode (in Galera terms all nodes are equal; multi-master here simply means you've got connections to different MySQL servers) and concurrent processes are trying to update a single row on two different nodes simultaneously. Galera does not support write-intent InnoDB locks by design, so one or both of the transactions will fail with a bogus deadlock error, which effectively means that the application has to resolve the conflict of the concurrent update and retry the transaction. See [1] for details.

Nova (and other OpenStack projects) already provides a @_retry_on_deadlock decorator to be applied to DB API methods, so that methods updating rows retry their transactions on the deadlock error. The problem with it is that you have to apply it to every single writer method to ensure all deadlocks are handled properly, which is both tedious and error-prone. The EngineFacade [2] of oslo.db seems to be the right place to implement transparent retries on deadlocks, but we have to update all the OpenStack projects to use it first.

At the same time, in MOS we *intentionally* deploy Galera in active-backup mode (configured in HAProxy), so that all connections go to the same MySQL server at any moment in time. The only possible reason you still see such deadlocks in this case is that HAProxy thought the active MySQL server went down for some time and promoted a backup server to be active (and switched back again once the original server came online). It looks like both MySQL servers were up; it's just that the active one didn't respond in time to the health check. Due to connection pooling in OpenStack services we ran into a situation where you've got connections to multiple Galera nodes, i.e. an effectively enabled multi-master setup.

One way to avoid that would be to close all backup connections when the active server is back online [3], but that would also cause connection errors (and effectively abort all ongoing transactions handled by the backup MySQL nodes, as HAProxy does not care about application-level data here). So it's not really a good solution either.

[1] http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
[2] http://specs.openstack.org/openstack/oslo-specs/specs/kilo/make-enginefacade-a-facade.html
[3] http://comments.gmane.org/gmane.comp.web.haproxy/8707
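
To illustrate the active-backup mode described above: in HAProxy it is expressed simply by marking all but one server as backup, so only the current active node receives connections and the others take over on a failed health check. A minimal sketch, assuming hypothetical node names, addresses and a clustercheck-style listener on port 49000 (not the actual Fuel-generated config):

  listen mysqld
    bind 192.168.0.2:3306
    mode tcp
    option httpchk    # health check answered by an external clustercheck-style script
    option tcpka
    server node-1 192.168.0.3:3306 check port 49000 inter 15s fall 3 rise 3
    server node-2 192.168.0.4:3306 check port 49000 inter 15s fall 3 rise 3 backup
    server node-3 192.168.0.5:3306 check port 49000 inter 15s fall 3 rise 3 backup

If node-1 fails a few consecutive checks, HAProxy starts sending new connections to node-2; once node-1 passes its checks again it becomes active once more, which is exactly the flapping window in which pooled connections end up spread across two nodes.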

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Having said that, I think we should fix this on two sides:

1) on the OpenStack side we should make the services ready for Galera multi-master mode (i.e. retry all writes on a potential deadlock error). Ideally we should achieve this by updating all the projects to use EngineFacade and implementing the retries in oslo.db, rather than decorating every single DB API method

2) on the Fuel side we should figure out the best health check settings, so that HAProxy does not switch to one of the backup MySQL nodes when the active node is actually online (i.e. all existing connections remain open and operational)

Also, this does not seem to be of High importance to me, as this is an exceptional situation which should be rare.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

No longer fixing medium bugs in 7.0, moving to 8.0

Changed in fuel:
status: Triaged → Won't Fix
Changed in mos:
status: Triaged → Won't Fix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
status: Won't Fix → Triaged
no longer affects: fuel/8.0.x
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
removed: area-mos
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

The following changes would be useful in fuel-library:
* Sync the galeracheck script from upstream
* Turn on the available_when_donor option
* Adjust the haproxy check intervals (downinter 20s rise 4) so as not to rush the DOWN->UP shift (see the sketch below)
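
A rough sketch of how the third item could look on the galera server lines (node names, addresses and the check port are illustrative assumptions, not the agreed values):

  # downinter 20s + rise 4: a server marked DOWN needs roughly 80s of consecutive
  # passing checks before haproxy shifts it back to UP
  server node-1 192.168.0.3:3306 check port 49000 inter 15s downinter 20s rise 4 fall 3
  server node-2 192.168.0.4:3306 check port 49000 inter 15s downinter 20s rise 4 fall 3 backup
  server node-3 192.168.0.5:3306 check port 49000 inter 15s downinter 20s rise 4 fall 3 backup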

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

The available_when_donor option is already in place, so never mind that one ...

Another suggestion:
Emulate active/passive clustering with haproxy. I mean we should consider employing automatic failover without failback. As soon as the main MySQL server comes back, all the traffic currently fails back to it again, which is not acceptable for active/passive mode. Therefore a "stick table" has to be used to keep persistence based on the destination IP address (so that no failback is processed automatically).

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

+1 to the stronger sticky-session guarantees, as galera consistency claims (TI) depend heavily on that

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

AFAICT, we configure pacemaker resources not to fail back after a failover (resource stickiness), but it would be nice to have a dedicated test case to verify this behavior as well!

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Oh, and note that for MySQL, which is just a pacemaker clone resource, the stickiness depends on the HAProxy configuration anyway...

summary: - Floating ip assigning failed with Error: (OperationalError) (1213,
- 'Deadlock found when trying to get lock; try restarting transaction')
- None None (HTTP 400)
+ Make stronger sticky sessions guarantees for galera cluster behind
+ haproxy: Floating ip assigning failed with Error: (OperationalError)
+ (1213, 'Deadlock found when trying to get lock; try restarting
+ transaction') None None (HTTP 400)
Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

I meant an haproxy config with something like this:
  stick-table type ip size 1 nopurge peers ....
  stick on dst

no longer affects: fuel
no longer affects: fuel/7.0.x
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Timur, this affects fuel-library as it configures HAProxy

Changed in fuel:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 8.0
Dmitry Pyzhov (dpyzhov)
no longer affects: mos/7.0.x
no longer affects: mos/8.0.x
no longer affects: mos
tags: added: team-bugfix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
importance: Medium → High
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raised to High, as sticky sessions will reduce the number of deadlocks and hence improve the overall performance of transactions in an OpenStack environment under load

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/268173

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note that in the current reference architecture each haproxy instance runs isolated from the others, inside its own netns, and it always listens on the public and management VIPs, regardless of whether the node is currently hosting the VIPs or not (yet).

So we cannot use dst-based stick tables. The only option we have is to use source-based balancing; see http://blog.haproxy.com/2012/03/29/load-balancing-affinity-persistence-sticky-sessions-what-you-need-to-know/ for details.
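
A minimal sketch of what source-based balancing could look like for the galera frontend, assuming illustrative node names, addresses and check parameters rather than the actual fuel-library template:

  listen mysqld
    bind 192.168.0.2:3306
    mode tcp
    balance source          # route by a hash of the client source address
    hash-type consistent    # consistent hashing limits remapping when a server changes state
    option httpchk
    server node-1 192.168.0.3:3306 check port 49000 inter 20s fall 3 rise 3
    server node-2 192.168.0.4:3306 check port 49000 inter 20s fall 3 rise 3 backup
    server node-3 192.168.0.5:3306 check port 49000 inter 20s fall 3 rise 3 backup

With balance source, a given client keeps hitting the same backend for as long as its address-to-server mapping holds, which is the sticky-session guarantee discussed above.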

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/268173
Reason: not required

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/268195

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/268195
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=496c57db3b673465d929aafd313c2d09caac0f90
Submitter: Jenkins
Branch: master

commit 496c57db3b673465d929aafd313c2d09caac0f90
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jan 15 16:22:10 2016 +0100

    Use source based balance for the A/P galera

    Ensure sticky sessions based routing decisions
    for clients connecting to the galera cluster active
    node.
    See for details:
    http://blog.haproxy.com/2012/03/29/load-balancing-affinity-persistence-sticky-sessions-what-you-need-to-know/
    https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#hash-type

    Closes-bug: #1490595

    Change-Id: Ia12cffd38008a930a9d1cf1db45a0d6591cae5cc
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/272534

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/272534
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3b2e4b8c5408a87a8bb81f55cf56ae83fcf8bc19
Submitter: Jenkins
Branch: stable/8.0

commit 3b2e4b8c5408a87a8bb81f55cf56ae83fcf8bc19
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jan 15 16:22:10 2016 +0100

    Use source based balance for the A/P galera

    Ensure sticky sessions based routing decisions
    for clients connecting to the galera cluster active
    node.
    See for details:
    http://blog.haproxy.com/2012/03/29/load-balancing-affinity-persistence-sticky-sessions-what-you-need-to-know/
    https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#hash-type

    Closes-bug: #1490595

    Change-Id: Ia12cffd38008a930a9d1cf1db45a0d6591cae5cc
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 496c57db3b673465d929aafd313c2d09caac0f90)

tags: added: on-verification
Revision history for this message
Veronica Krayneva (vkrayneva) wrote :

Verified on ISO #521

tags: removed: on-verification
Maksym Strukov (unbelll)
tags: added: on-verification
Revision history for this message
Maksym Strukov (unbelll) wrote :

Verified as fixed in 9.0-404

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released