ebtables calls can race with libvirt

Bug #1316621 reported by Pavel Sedlák
This bug affects 4 people
Affects                    Status         Importance   Assigned to    Milestone
OpenStack Compute (nova)   Fix Released   Medium       Chet Burgess
neutron                    Fix Released   Medium       Kevin Benton

Bug Description

When using nova-network with libvirt, a request to associate a floating IP sometimes fails like this:

> http://192.168.1.12:8774/v2/258a4b20c77240bf9b386411430683fa/servers/a9e734e4-5310-4191-a7f0-78fca4b367e7/action
>
> BadRequest: Bad request
> Details: {'message': 'Error. Unable to associate floating ip', 'code': '400'}

The real issue is that the ebtables rootwrap call fails:
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ebtables -t nat -I PREROUTING --logical-in br100 -p ipv4 --ip-src 192.168.32.10 ! --ip-dst 192.168.32.0/22 -j redirect --redirect-target ACCEPT
Exit code: 255
Stdout: ''
Stderr: "Unable to update the kernel. Two possible causes:\n1. Multiple ebtables programs were executing simultaneously. The ebtables\n userspace tool doesn't by default support multiple ebtables programs running\n concurrently. The ebtables option --concurrent or a tool like flock can be\n used to support concurrent scripts that update the ebtables kernel tables.\n2. The kernel doesn't support a certain ebtables extension, consider\n recompiling your kernel or insmod the extension.\n.\n"

It happens roughly once per full tempest run, and not in every run, so missing kernel support and the other listed causes should not apply here.
Probably already mentioned in https://<email address hidden>/msg23422.html.

As that call in nova is synchronized (it runs under a lock), could it be that nova is actually racing with libvirt itself calling ebtables?
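
As an illustrative aside (not part of the original report): the flock approach the error message mentions amounts to wrapping every ebtables invocation in an exclusive file lock. A minimal sketch, assuming a hypothetical lock path; note that this only serializes callers that agree on the same lock file, which is exactly why an external caller such as libvirt can still race unless ebtables' own --concurrent locking is used.

    # Illustrative sketch only: serializing ebtables calls with an exclusive
    # file lock (the flock-style approach the error message mentions).
    # The lock path below is hypothetical.
    import fcntl
    import subprocess

    EBTABLES_LOCK_PATH = '/var/lock/ebtables.example'

    def run_ebtables(args):
        """Run ebtables while holding an exclusive lock on a shared lock file."""
        with open(EBTABLES_LOCK_PATH, 'w') as lock_file:
            fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
            try:
                return subprocess.call(['ebtables'] + list(args))
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)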

Revision history for this message
Pavel Sedlák (psedlak) wrote :

Happened with Havana on RHEL6 and Icehouse on RHEL 7.
As it's flaky I don't have much detailed info beyond common logs and package versions, though since it happens with both Havana and Icehouse on different kernel versions etc., those differences don't seem to be related anyway.

Attaching part of nova-network.log showing that locks were obtained and command failed.

Tracy Jones (tjones-i)
tags: added: libvirt
Solly Ross (sross-7)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Vish Ishaya (vishvananda) wrote :

Well, that is annoying. If it is that rare, perhaps doing a few retries is good enough. I'm not sure if there is an easy way to do a shared lock with kvm.

Revision history for this message
Michael Still (mikal) wrote :

We ignore the exit code on the delete we do before an insert of a rule, which leaves me thinking a retry would be hard to implement here. I guess we could change the delete to check the list of ebtables rules to make sure the entry exists, but I am unsure how expensive that would be.
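
As a rough illustration of the check described above (not the actual nova code; the helper names are made up): list the chain first and only issue the delete when the rule is actually present.

    # Sketch only: look up the chain before deleting, instead of issuing a
    # blind delete and ignoring its exit code.
    import subprocess

    def rule_present(table, chain, rule_fragment):
        """Return True if rule_fragment appears in the listed chain."""
        output = subprocess.check_output(['ebtables', '-t', table, '-L', chain])
        return rule_fragment in output.decode()

    def delete_rule_if_present(table, chain, rule_args, rule_fragment):
        if rule_present(table, chain, rule_fragment):
            subprocess.check_call(['ebtables', '-t', table, '-D', chain] + rule_args)

The listing itself is cheap, but this is still a check-then-act sequence, so it would not remove the underlying race with other ebtables callers.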

Michael Still (mikal)
Changed in nova:
assignee: nobody → Chet Burgess (cfb-n)
Revision history for this message
jazeltq (jazeltq-k) wrote :

This bug can be reproduced with rally; for example, you can run the boot-run-command-delete task.
The important point is that you should test your cloud under high load; then the bug will reproduce.
I used rally to test my cloud.
The rally configuration is:
{
    "VMTasks.boot_runcommand_delete": [
        {
            "args": {
                "flavor": {
                    "name": "m1.small"
                },
                "image": {
                    "name": "ubuntu-12-04-raw-rally-test"
                },
                "script": "/home/rally/rally/doc/samples/ec_script/ubuntu_ls_test.sh",
                "interpreter": "bash",
                "username": "root",
                "floating_network": "LTQ",
                "use_floatingip": true,
                "availability_zone": "dell420"
            },
            "runner": {
                "type": "constant",
                "times": 1000,
                "concurrency": 40,
                "timeout": 6000
            },
            "context": {
                "users": {
                    "tenants": 1,
                    "users_per_tenant": 1
                },
                "quotas": {
                    "nova": {
                        "instances": -1,
                        "cores": -1,
                        "ram": -1,
                        "fixed_ips": -1,
                        "floating_ips": -1
                    }
                }
            }
        }
    ]
}

The error rally reports is:
2014-08-15 13:55:53.602 29457 INFO rally.benchmark.runners.base [-] Task bd820c37-2eaf-49d6-99b8-7952d453197d | ITER: 747 END: Error <class 'novaclient.exceptions.BadRequest'>: Error. Unable to associate floating ip (HTTP 400) (Request-ID: req-fa6fa661-e41d-4235-9da7-74ba882dd3c8)

The nova-network.log at the same time shows:
2014-08-15 13:55:52.990 23291 DEBUG nova.network.linux_net [req-fa6fa661-e41d-4235-9da7-74ba882dd3c8 737b99c364a64253920c67313655e171 a8c948f6e70648608603b5079537c525] IPTablesManager.apply completed with success _apply /usr/lib/python2.7/dist-packages/nova/network/linux_net.py:451
2014-08-15 13:55:52.991 23291 DEBUG nova.openstack.common.lockutils [req-fa6fa661-e41d-4235-9da7-74ba882dd3c8 737b99c364a64253920c67313655e171 a8c948f6e70648608603b5079537c525] Released file lock "iptables" at /var/lock/nova/nova-iptables lock /usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py:208
2014-08-15 13:55:52.991 23291 DEBUG nova.openstack.common.lockutils [req-fa6fa661-e41d-4235-9da7-74ba882dd3c8 737b99c364a64253920c67313655e171 a8c948f6e70648608603b5079537c525] Got semaphore "ebtables" lock /usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py:166
2014-08-15 13:55:52.992 23291 DEBUG nova.openstack.common.lockutils [req-fa6fa661-e41d-4235-9da7-74ba882dd3c8 737b99c364a64253920c67313655e171 a8c948f6e70648608603b5079537c525] Attempting to grab file lock "ebtables" lock /usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py:176
2014-08-15 13:55:52.993 23291 DEBUG nova.openstack.common.lockutils [req-fa6fa661-e41d-4235-9da7-74ba882dd3c8 737b99c364a64253920c67313655e171 a8c948f6e70648608603b5079537c525] Got file lock "eb...

Revision history for this message
jazeltq (jazeltq-k) wrote :

The ebtables problem is also discussed here:
http://www.spinics.net/linux/fedora/libvirt-users/msg06645.html

Revision history for this message
Matthew Treinish (treinish) wrote :

Marking as high, because this has been seen more recently in bringing up multi-node gate tests.

tags: added: testing
Changed in nova:
importance: Medium → High
Revision history for this message
Daniel Berrange (berrange) wrote :

This patch to upstream libvirt adds use of --concurrent to ebtables and --wait to iptables/ip6tables.

https://www.redhat.com/archives/libvir-list/2014-November/msg00330.html

For this to help with the race condition we'd need to modify Nova to use the same args too.

Revision history for this message
Chet Burgess (cfb-n) wrote :

@berrange

That's excellent. I was going to propose a change to do just that now that I'm back from vacation. Since that's already done I can work on the other required pieces to make that work in nova.

Since this is currently hurting the gate, the current plan is the following:

1) Submit a quick fix that adds a simple retry to nova for ebtables. This should get the gate working smoothly again.

2) Add support for timing out long-running commands to oslo.concurrency.processutils. ebtables --concurrent will block forever until it gets the lock, so we need a way to reliably time it out after some period to prevent nova blocking on this forever (a rough sketch follows below).

3) Once we can time out an operation in processutils we can patch nova to use --concurrent.

I should have patch #1 up in the next day.
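
A minimal sketch of items 2 and 3 together, using Python 3's standard library rather than oslo.concurrency.processutils (which is what nova itself would use), just to show the shape of the change: pass --concurrent so ebtables takes its own lock, and bound the wait with a timeout so a stuck lock cannot block nova forever. The helper name and timeout value are illustrative.

    # Sketch only: ebtables --concurrent bounded by a timeout. Uses the
    # Python 3 standard library for brevity; nova would go through
    # oslo.concurrency.processutils instead.
    import subprocess

    def ebtables_concurrent(args, timeout=10):
        """Run ebtables with --concurrent, giving up after `timeout` seconds."""
        cmd = ['ebtables', '--concurrent'] + list(args)
        # raises subprocess.TimeoutExpired (and kills the child) if the
        # ebtables lock is never acquired within the timeout
        return subprocess.run(cmd, check=True, timeout=timeout)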

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136217

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Chet Burgess (cfb-n) → Brent Eagles (beagles)
Changed in nova:
assignee: Brent Eagles (beagles) → Chet Burgess (cfb-n)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/136217
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fb9b2058051b771732f4425c97651128c8060441
Submitter: Jenkins
Branch: master

commit fb9b2058051b771732f4425c97651128c8060441
Author: Chet Burgess <email address hidden>
Date: Thu Nov 20 18:29:15 2014 -0800

    Retry ebtables on race

    Calls to ebtables can race with libvirt and cause nova, or libvirt
    to fail to apply ebtables rules.

    The goal of this patch is to provide a simple fix to improve the
    stability of the gate.

    We now call ebtables in a simple loop that retries on failure.
    Long term we want to update nova to make use of the --concurrent
    flag in newer versions of ebtables. The --concurrent flag
    implements a lock to prevent multiple invocations of ebtables from
    racing. This will require a newer libvirt and the ability to
    timeout long running execs (--concurrent can block forever if it
    never gets the lock).

    A future patch is forthcoming to add support for --concurrent.

    DocImpact
    Add ebtables_exec_attempts option (default=3).

    Change-Id: I3e04782ac4678581462f9bee4bb10d5f3b223457
    Partial-Bug: #1316621

Revision history for this message
Chet Burgess (cfb-n) wrote :

Will someone with the proper permissions please change the importance back to Medium? We have a workaround for the gate now. I'm still tracking the long-term fix for K but the immediate symptoms have been addressed.

Chet Burgess (cfb-n)
Changed in nova:
importance: High → Medium
milestone: none → kilo-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/140514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/140514
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f418727f7de689a2387d3a7a2cc90ae9503c91e
Submitter: Jenkins
Branch: master

commit 4f418727f7de689a2387d3a7a2cc90ae9503c91e
Author: Chet Burgess <email address hidden>
Date: Tue Dec 9 14:51:40 2014 -0800

    Add backoff to ebtables retry

    We need a backoff between ebtables retries. In some tempest tests we
    have seen the retries complete in 100ms and still fail.

    We now sleep for ebtables_retry_interval * loop count seconds. With
    a default of 1.0 this means by default we sleep for 1.0s, 2.0s, and
    3.0s before we finally giving up.

    Change-Id: I0b9b664a592364bedd11124a1ec921d8ea011704
    Partial-Bug: #1316621
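
For illustration, a simplified sketch of the retry-with-linear-backoff behaviour the two commits above describe. The option names mirror the ones mentioned in the commit messages (ebtables_exec_attempts, ebtables_retry_interval), but the helper itself is hypothetical, not the actual nova implementation.

    # Simplified sketch of the behaviour described by the two commits above;
    # not the actual nova code.
    import subprocess
    import time

    ebtables_exec_attempts = 3     # default from the first commit
    ebtables_retry_interval = 1.0  # default from the second commit

    def ebtables_with_retry(args):
        last_error = None
        for attempt in range(1, ebtables_exec_attempts + 1):
            try:
                return subprocess.check_call(['ebtables'] + list(args))
            except subprocess.CalledProcessError as exc:
                last_error = exc
                # linear backoff: 1.0s, 2.0s, 3.0s with the defaults above
                time.sleep(ebtables_retry_interval * attempt)
        raise last_error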

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like this has merged, switching status to "Fix Committed"

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Matt Riedemann (mriedem) wrote :

The patch from danpb merged into upstream libvirt:

http://libvirt.org/git/?p=libvirt.git;a=commit;h=dc33e6e4a5a5d429198b2c63ff6b63729353e2cf

It's in version 1.2.11 which is way too new for what we're testing with in the gate.

Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/431773
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=486e2f4eb5a02c98958582e366a4d6081ea897e0
Submitter: Jenkins
Branch: master

commit 486e2f4eb5a02c98958582e366a4d6081ea897e0
Author: Kevin Benton <email address hidden>
Date: Thu Feb 9 15:10:20 2017 -0800

    Pass --concurrent flag to ebtables calls

    This flag will force ebtables to acquire a lock so we don't
    have to worry about ebtables errors occuring if something else
    on the system is trying to use ebtables as well.

    Closes-Bug: #1316621
    Change-Id: I695c01e015fdc201df8f23d9b48f9d3678240266

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b1

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/460916

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/460917

Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/460916
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=470833d36d5313d549c9905d9a36af8cfbcc3330
Submitter: Jenkins
Branch: stable/ocata

commit 470833d36d5313d549c9905d9a36af8cfbcc3330
Author: Kevin Benton <email address hidden>
Date: Thu Feb 9 15:10:20 2017 -0800

    Pass --concurrent flag to ebtables calls

    This flag will force ebtables to acquire a lock so we don't
    have to worry about ebtables errors occuring if something else
    on the system is trying to use ebtables as well.

    Closes-Bug: #1316621
    Change-Id: I695c01e015fdc201df8f23d9b48f9d3678240266
    (cherry picked from commit 486e2f4eb5a02c98958582e366a4d6081ea897e0)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/460917
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f6ae49b020859b7878a992d2bb158b2c912a5765
Submitter: Jenkins
Branch: stable/newton

commit f6ae49b020859b7878a992d2bb158b2c912a5765
Author: Kevin Benton <email address hidden>
Date: Thu Feb 9 15:10:20 2017 -0800

    Pass --concurrent flag to ebtables calls

    This flag will force ebtables to acquire a lock so we don't
    have to worry about ebtables errors occuring if something else
    on the system is trying to use ebtables as well.

    Closes-Bug: #1316621
    Change-Id: I695c01e015fdc201df8f23d9b48f9d3678240266
    (cherry picked from commit 486e2f4eb5a02c98958582e366a4d6081ea897e0)
    (cherry picked from commit 470833d36d5313d549c9905d9a36af8cfbcc3330)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.4.0

This issue was fixed in the openstack/neutron 9.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.2

This issue was fixed in the openstack/neutron 10.0.2 release.
