nova-network appears to lock up while associating floating ip's

Bug #1349590 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
High
Unassigned

Bug Description

Seeing oslo.messaging timeouts while assocating floating IPs in the API logs:

http://logs.openstack.org/08/110008/1/gate/gate-tempest-dsvm-postgres-full/04890fa/logs/screen-n-api.txt.gz?level=TRACE#_2014-07-28_16_18_36_915

2014-07-28 16:18:36.915 ERROR nova.api.openstack.compute.contrib.floating_ips [req-d0ee3c07-0ae3-49d2-bad7-c4874096cc52 FloatingIPsTestJSON-352614651 FloatingIPsTestJSON-1773137245] Error. Unable to associate floating ip
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips Traceback (most recent call last):
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/api/openstack/compute/contrib/floating_ips.py", line 258, in _add_floating_ip
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips fixed_address=fixed_address)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/network/api.py", line 44, in wrapped
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips return func(self, context, *args, **kwargs)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/network/base_api.py", line 60, in wrapper
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips res = f(self, context, *args, **kwargs)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/network/api.py", line 190, in associate_floating_ip
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips context, floating_address, fixed_address, affect_auto_assigned)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/utils.py", line 949, in wrapper
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips return func(*args, **kwargs)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/server.py", line 138, in inner
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips return func(*args, **kwargs)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/network/floating_ips.py", line 347, in associate_floating_ip
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips fixed_ip.instance_uuid)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/opt/stack/new/nova/nova/network/rpcapi.py", line 324, in _associate_floating_ip
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips interface=interface, instance_uuid=instance_uuid)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips retry=self.retry)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips timeout=timeout, retry=retry)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 404, in send
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips retry=retry)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 393, in _send
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips result = self._waiter.wait(msg_id, timeout)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 281, in wait
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips reply, ending = self._poll_connection(msg_id, timeout)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 231, in _poll_connection
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips % msg_id)
2014-07-28 16:18:36.915 18485 TRACE nova.api.openstack.compute.contrib.floating_ips MessagingTimeout: Timed out waiting for a reply to message ID b463ca53f2bb42f6992b5e262b494f78

And when you look in the nova-network logs, it appears there is a deadlock with iptables while playing with floating IPs:

http://logs.openstack.org/08/110008/1/gate/gate-tempest-dsvm-postgres-full/04890fa/logs/screen-n-net.txt.gz

The last message in the n-net logs is trying to grab a lock:

2014-07-28 16:17:37.011 DEBUG nova.openstack.common.lockutils [req-d0ee3c07-0ae3-49d2-bad7-c4874096cc52 FloatingIPsTestJSON-352614651 FloatingIPsTestJSON-1773137245] Attempting to grab external lock "iptables" external_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:230

From logstash:

message:"HTTP exception thrown: Error. Unable to associate floating ip" AND tags:"screen-n-api.txt"

There are 94 hits in 2 weeks, looks like it started on 7/19:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSFRUUCBleGNlcHRpb24gdGhyb3duOiBFcnJvci4gVW5hYmxlIHRvIGFzc29jaWF0ZSBmbG9hdGluZyBpcFwiIEFORCB0YWdzOlwic2NyZWVuLW4tYXBpLnR4dFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiJjdXN0b20iLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsiZnJvbSI6IjIwMTQtMDctMTRUMjE6NDM6MTIrMDA6MDAiLCJ0byI6IjIwMTQtMDctMjhUMjE6NDM6MTIrMDA6MDAiLCJ1c2VyX2ludGVydmFsIjoiMCJ9LCJzdGFtcCI6MTQwNjU4MzkyMjc5Nn0=

This is also showing up in the api logs:

2014-07-28 16:20:48.331 WARNING nova.compute.api [req-5c540168-bde5-45fe-926f-f24a8f242b72 ServersAdminTestJSON-253527515 ServersAdminTestJSON-735286156] [instance: 748ad6e1-7741-4ff4-be1b-66af4466e019] instance's host devstack-trusty-hpcloud-b5-1122950 is down, deleting from database

I'm not sure if it's related though, but looking at logstash:

message:"instance's host" AND message:"is down, deleting from database" AND tags:"screen-n-api.txt"

There are 548 hits in 2 weeks, which also appears to have started around 7/19:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiaW5zdGFuY2UncyBob3N0XCIgQU5EIG1lc3NhZ2U6XCJpcyBkb3duLCBkZWxldGluZyBmcm9tIGRhdGFiYXNlXCIgQU5EIHRhZ3M6XCJzY3JlZW4tbi1hcGkudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6ImN1c3RvbSIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJmcm9tIjoiMjAxNC0wNy0xNFQyMTozNTozNiswMDowMCIsInRvIjoiMjAxNC0wNy0yOFQyMTozNTozNiswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sInN0YW1wIjoxNDA2NTg0MDIwNjE5fQ==

Those are only on nova-network jobs though so most likely related.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

For the latter issue on 'instance's host is down, deleting from database', I'm wondering is this is related, it merged on 7/18:

https://github.com/openstack/nova/commit/d6ca1cc02ab58615a6bb1b337db34dd06525112e

There isn't much for nova commits that jumps out at me which merged on 7/19. This merged on 7/19 and touches nova/network/manager.py but it seems pretty harmless:

https://github.com/openstack/nova/commit/284e5ac022e20ac8150df7d3a7692bd1e2deadbf

Revision history for this message
Matt Riedemann (mriedem) wrote :

So the 7/19 thing is not correct, we only keep 10 days of logstash data so that's why it appears to show up then. :(

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like this has dropped down, 5 hits in 10 days:

message:"HTTP exception thrown: Error. Unable to associate floating ip" AND tags:"screen-n-api.txt"

It's all check queue right now.

Revision history for this message
Matt Riedemann (mriedem) wrote :

There are no hits after 9/10, and this merged on 9/12, so we should check the logs again if we have a more recent hit so we can see if the logs in nova-network make more sense:

https://review.openstack.org/#/c/120934/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Might be worth noting if this helps at all: https://review.openstack.org/#/c/115859/

Revision history for this message
Matt Riedemann (mriedem) wrote :

This might also be useful because looking through the nova-net logs on one of these fails you'll see the nw info dict rebuilt several times:

https://review.openstack.org/#/c/119523/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/121916

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/121916
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=404f7ca9ce96c0b321b8fdfc9d2f9329970fc051
Submitter: Jenkins
Branch: master

commit 404f7ca9ce96c0b321b8fdfc9d2f9329970fc051
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 16 09:01:57 2014 -0700

    Add more information to generic _add_floating_ip exception message

    When we fail in some unexpected way in the _add_floating_ip method it'd
    be good to have more information in the error message that's passed back
    to the caller so they don't have to dig into the logs for the variables
    involved, especially since the fixed IP might not be provided on the
    request.

    We're seeing messaging timeouts in the gate and it'd be nice to have the
    error information right up front so we can fingerprint on it in
    elastic-recheck.

    Related-Bug: #1349590

    Change-Id: I7c19c5c1bc5d41a922435d78aca0160c45130276

Revision history for this message
Matt Riedemann (mriedem) wrote :

This doesn't seem to be showing up anymore.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Joe Gordon (jogo) wrote :

Since we haven't seen this in a while marking the bug as closed. If this is seen again feel free to re-open.

Changed in nova:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.