Floating IP assignment to instance fails: 'FloatingIpNotFoundForAddress: Floating ip not found for address 10.109.8.131'

Bug #1529937 reported by Artem Panchenko
Affects              Status         Importance  Assigned to              Milestone
Fuel for OpenStack   Fix Committed  High        Jerzy Mikolajczak
  8.0.x              Won't Fix      High        Registry Administrators
  Mitaka             Fix Released   High        Jerzy Mikolajczak

Bug Description

Sometimes the OSTF test "Launch instance with file injection" fails at step #3, "Assign floating ip to instance":

fuel_health.test: DEBUG: Adding <Server: ost1_test-server-smoke-file_inj-1204451443> to shared resources of TestNovaNetwork
fuel_health.common.test_mixins: INFO: STEP:3, verify action: 'floating IP creation'
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/nmanager.py", line 733, in _create_floating_ip
    pool=floating_ips_pool[0].name)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/floating_ips.py", line 41, in create
    return self._create("/os-floating-ips", {'pool': pool}, "floating_ip")
  File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 169, in _create
    _resp, body = self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 449, in post
    return self._cs_request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 424, in _cs_request
    resp, body = self._time_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 397, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 391, in request
    raise exceptions.from_response(resp, body, url, method)
ClientException: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'nova.exception.FloatingIpNotFoundForAddress'> (HTTP 500) (Request-ID: req-6f1ed80b-d0c0-4ccc-b0c4-b1d29c20d3e5)

Here are the relevant Nova logs:

2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions [req-6f1ed80b-d0c0-4ccc-b0c4-b1d29c20d3e5 a5858f2179534119aeef4f844b3e97db e026785f12d94783bd5a75d9c820a1e6 - - -] Unexpected exception in API method
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/dist-packages/nova/api/openstack/extensions.py", line 478, in wrapped
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/dist-packages/nova/api/openstack/compute/floating_ips.py", line 131, in create
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions ip = self.network_api.get_floating_ip_by_address(context, address)
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 1393, in get_floating_ip_by_address
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions fip = self._get_floating_ip_by_address(client, address)
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 1467, in _get_floating_ip_by_address
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions raise exception.FloatingIpNotFoundForAddress(address=address)
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions FloatingIpNotFoundForAddress: Floating ip not found for address 10.109.8.131.
2015-12-29 01:55:03.940 26843 ERROR nova.api.openstack.extensions

And the Neutron logs:

2015-12-29 01:55:03.844 22363 DEBUG neutron.db.ipam_non_pluggable_backend [req-98c00f16-8c46-4a73-ae37-d2d8d4defacf a5858f2179534119aeef4f844b3e97db e026785f12d94783bd5a75d9c820a1e6 - - -] Allocated IP - 10.109.8.131 from 10.109.8.132 to 10.109.8.254 _try_generate_ip /usr/lib/python2.7/dist-packages/neutron/db/ipam_non_pluggable_backend.py:79

2015-12-29 01:55:03.845 22363 DEBUG neutron.db.db_base_plugin_common [req-98c00f16-8c46-4a73-ae37-d2d8d4defacf a5858f2179534119aeef4f844b3e97db e026785f12d94783bd5a75d9c820a1e6 - - -] Allocated IP 10.109.8.131 (e5301901-4822-4650-8cc7-0fb5183fcbda/21c9a552-4786-431c-8427-77d2f85d4832/252e3cdb-d316-44c3-8737-2c1b56cf3e32) _store_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_common.py:102

2015-12-29 01:55:03.935 13739 INFO neutron.wsgi [req-54353934-50b4-4c1c-90a8-84d3efe7fdd6 a5858f2179534119aeef4f844b3e97db e026785f12d94783bd5a75d9c820a1e6 - - -] 10.109.6.3 - - [29/Dec/2015 01:55:03] "GET /v2.0/floatingips.json?floating_ip_address=10.109.8.131 HTTP/1.1" 200 227 0.007953

Steps to reproduce:

            1. Create a cluster with Neutron VLAN segmentation
            2. Add 3 nodes with the controller role
            3. Add 1 node with the compute role
            4. Deploy the cluster
            5. Remove one controller and add a new controller
            6. Deploy changes
            7. Run OSTF

Expected result: 1 OSTF test fails, because some Nova services (from the removed controller) are down
Actual result: 2 health checks fail; a floating IP can't be assigned to the instance in the "Launch instance with file injection" test

Tags: area-mos
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
milestone: none → 9.0
Changed in fuel:
status: New → Confirmed
tags: added: area-nova
removed: area-mos nova
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I've checked Nova logs and see the following (http://paste.openstack.org/show/482869/):

1) Nova allocates a floating IP in Neutron - it receives a 201 and info about the newly created resource
2) right after that, Nova tries to get detailed info on the created floating IP by issuing a GET request and receives an *empty* list of floating IPs, as if the DB entry hadn't been created yet or had been filtered out for some reason
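
For illustration, the sequence boils down to two consecutive HTTP requests to neutron-server (reconstructed from the Nova and Neutron logs above; the POST body shown is an assumption based on the standard Neutron API):

POST /v2.0/floatingips.json {"floatingip": {"floating_network_id": "..."}}    -> 201 Created, returns the new floating IP
GET  /v2.0/floatingips.json?floating_ip_address=10.109.8.131                  -> 200 OK, but {"floatingips": []} (empty list)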

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Double-checked this in Neutron: floating IPs are created synchronously, in the sense that the 201 (Created) status is only returned once the floating IP entry has been *committed* to the DB. Given that Galera guarantees synchronous replication and simultaneous COMMIT on all nodes of the Galera cluster, I'm confused how we can see the result from #2, where there is no corresponding floating IP entry in the DB on the subsequent HTTP request.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Tentatively marking this as Incomplete and waiting for the next reproduction to analyse further. I haven't managed to reproduce this locally so far.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

So this was reproduced again. As before, there was an exception in the Nova logs:

http://paste.openstack.org/show/484699/

If you take a look at the code, Nova fails to get a floating IP from Neutron *right after* it was successfully created (Neutron returns 201 Created after *committing* data to a DB).

Here we can see two subsequent requests to neutron-server:

http://paste.openstack.org/show/484688/

I reverted the snapshot of the environment. This cluster had 3 controllers and 2 compute nodes.

The interesting fact is that, despite us using Galera in active/backup mode, we had sessions to two different MySQL nodes, as we can see in the HAProxy stats

http://paste.openstack.org/show/484698/

or in mysql shell:

http://paste.openstack.org/show/484701/

This happens when the active MySQL backend goes down from HAProxy's point of view, so HAProxy switches to a backup. When the active MySQL node comes back, HAProxy switches back to it, but it *does not* close the existing sessions to the backup server. Because OpenStack services use connection pools, we effectively end up using more than one Galera node, even though we tried to avoid exactly that in the first place.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Which brings me to the next point. If we want to make sure this INSERT followed by an immediate SELECT via two different connections to two different Galera nodes works "correctly", we'll need to tweak the wsrep_sync_wait variable ( http://mysql.rjweb.org/doc.php/galera#Critical_Reads)
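
For example (a minimal sketch, assuming a MariaDB/Percona Galera node and Neutron's floatingips table; this only illustrates the 'critical read' semantics and is not a proposed code change):

SET SESSION wsrep_sync_wait = 1;  -- bit 1: READ statements wait until this node has
                                  -- applied all cluster-wide writesets it knows about
SELECT * FROM floatingips WHERE floating_ip_address = '10.109.8.131';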

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This seems to be a corner case and I don't think we should do anything in 8.0, but we'll need to give it another try in 9.0.

For the time being, setting wsrep_sync_wait to 1 in my.cnf on the Galera nodes should be a sufficient workaround (although we sacrifice a bit of latency).
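
A minimal sketch of that workaround (assuming the standard [mysqld] section in my.cnf on each Galera node; the exact config file layout in a Fuel deployment may differ):

[mysqld]
# Reads wait until this node has caught up with the cluster,
# at the cost of extra latency on every read served by this node.
wsrep_sync_wait = 1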

Changed in fuel:
status: Incomplete → Confirmed
tags: added: release-notes
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

On second thought, it's probably not worth putting into the release notes:

- we don't set up Galera to run in multi-writer mode; it only ends up that way in a corner case, after a failover of a controller

- if that actually happens, we could see various issues; this one is a very specific case of a general problem, so it makes no sense to document this particular case alone

tags: removed: release-notes
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

version

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in fuel:
assignee: MOS Nova (mos-nova) → Roman Podoliaka (rpodolyaka)
Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Roman, could you elaborate on two things regarding https://bugs.launchpad.net/fuel/+bug/1529937/comments/6:
1) what isolation level did the transaction that issued the SELECT after the INSERT have? Was it SERIALIZABLE or REPEATABLE READ?
2) what SELECT modifier was used in the code invocation - was it "FOR UPDATE" or something else?

I'm asking because I've read about this interesting InnoDB vs. Galera transaction nuance: https://github.com/codership/galera/issues/336#issuecomment-138805890

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

@Bogdan,

These are transactions done by two different Python processes (nova-api and neutron-server, potentially running on different nodes) talking to two different Galera servers.

1) we do not set the transaction isolation level explicitly in OpenStack services - the default is used (which is REPEATABLE READ in MySQL, and we do not change it in our deployments)

2) it's a simple SELECT, as the transaction in neutron-server is a reader - it does not modify any rows

I believe it's simply a problem with 'causal reads', which we discuss here - http://lists.openstack.org/pipermail/openstack-dev/2016-April/093529.html

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We could tweak haproxy as Mike Bayer suggests here - http://lists.openstack.org/pipermail/openstack-dev/2016-April/093562.html - to make sure that we do not start routing connections back to the original active node when it goes back online.

It's not really a Nova or Neutron problem - all services are potentially affected here after a failover of an active MySQL node.

Bogdan, Puppet team, what do you think?

tags: added: area-mos
removed: area-nova need-info
Changed in fuel:
assignee: Roman Podoliaka (rpodolyaka) → MOS Puppet Team (mos-puppet)
Changed in fuel:
assignee: MOS Puppet Team (mos-puppet) → Jerzy Mikolajczak (jmikolajczak-b)
Revision history for this message
Jerzy Mikolajczak (jmikolajczak-b) wrote :

@Roman,

I think you are right, and this issue can affect other services as well.
I'm testing HAProxy with a stick table, which should prevent fallback to the active node.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/323797

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Jerzy Mikolajczak (jmikolajczak-b) wrote :

I'm posting an update here from https://review.openstack.org/323797

@Bogdan,

I'm guessing that you needed "dedicated IPs" for the peers sync option.
I propose not to use the peers option and to keep the mysqld stick table local.

So we have a 3-controller deployment, which results in an /etc/haproxy/conf.d/110-mysqld.cfg file like this:

server node-2 192.168.0.3:3307 check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3
server node-3 192.168.0.6:3307 backup check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3
server node-4 192.168.0.4:3307 backup check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3

node-2 is the current owner of vip__management

With my settings enabled, the mysqld stick table at node-2 looks like this:
# table: mysqld, type: ip, size:1, used:1
0x2951104: key=192.168.0.2 use=0 exp=0 server_id=1

and for the rest of the nodes it's empty, as they are backup nodes and don't receive traffic.

So what happens when node-2 goes down?
1) vip__management is moved to node-3
2) haproxy on node-3 detects the node-2 failure and routes traffic to node-3
3) the mysqld stick table at node-3 is updated and now looks like this:

# table: mysqld, type: ip, size:1, used:1
0x2951104: key=192.168.0.2 use=0 exp=0 server_id=2

So what happens when node-2 comes back?
1) vip__management stays on node-3
2) the mysqld stick table at node-3 still shows server_id=2, so traffic still goes to node-3

So what happens when node-3 goes down?
1) vip__management is moved to node-2
2) haproxy on node-2 detects the failure of node-3, but node-3 is only one of the backup nodes
3) the mysqld stick table at node-2 gets updated to:

# table: mysqld, type: ip, size:1, used:1
0x2951104: key=192.168.0.2 use=0 exp=0 server_id=1

I propose not to use:
- the nopurge option, which would prevent further updates of the one-row mysqld stick table
- the peers option, as there is no need to sync the mysqld table
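
For illustration, a minimal sketch of what the mysqld listen section could look like with the stick table enabled (the actual fuel-library change is in the review above; the bind address and port are assumptions based on the stick-table dumps):

listen mysqld
  bind 192.168.0.2:3306   # management VIP (assumed); backends listen on 3307
  mode tcp
  stick-table type ip size 1
  stick on dst
  server node-2 192.168.0.3:3307 check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3
  server node-3 192.168.0.6:3307 backup check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3
  server node-4 192.168.0.4:3307 backup check port 49000 inter 20s fastinter 2s downinter 2s rise 3 fall 3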

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/323797
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=fa8f1f7746bf639f73d09964e482bd7847180d1d
Submitter: Jenkins
Branch: master

commit fa8f1f7746bf639f73d09964e482bd7847180d1d
Author: Jerzy Mikolajczak <email address hidden>
Date: Wed Jun 1 12:23:17 2016 +0200

    Add fallback prevention to mysql in haproxy config

    Using sticky-table + stick on dst prevents fallback to active node in
    scenario where it comes back from being down. This may lead to sql
    connections from openstack services to different mysql nodes.

    This patch also removes hash-type and balance option as stickiness
    is being done with sticky-table.

    Change-Id: Ibc2fb4fa5a1938aec447694638c4c9e99edbba53
    Closes-bug: #1529937

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/325097

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/325097
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=8b65abe1c41bbe9983823fcfa3554a3dd8e033e4
Submitter: Jenkins
Branch: stable/mitaka

commit 8b65abe1c41bbe9983823fcfa3554a3dd8e033e4
Author: Jerzy Mikolajczak <email address hidden>
Date: Wed Jun 1 12:23:17 2016 +0200

    Add fallback prevention to mysql in haproxy config

    Using sticky-table + stick on dst prevents fallback to active node in
    scenario where it comes back from being down. This may lead to sql
    connections from openstack services to different mysql nodes.

    This patch also removes hash-type and balance option as stickiness
    is being done with sticky-table.

    Change-Id: Ibc2fb4fa5a1938aec447694638c4c9e99edbba53
    Closes-bug: #1529937
    (cherry picked from commit fa8f1f7746bf639f73d09964e482bd7847180d1d)

tags: added: on-verification
Revision history for this message
Sofiia Andriichenko (sandriichenko) wrote :

Verified on:
[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
481
cat /etc/fuel_build_number:
481
cat /etc/fuel_release:
9.0
cat /etc/fuel_openstack_version:
mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
fuel-release-9.0.0-1.mos6349.noarch
fuel-misc-9.0.0-1.mos8458.noarch
python-packetary-9.0.0-1.mos140.noarch
fuel-bootstrap-cli-9.0.0-1.mos285.noarch
fuel-migrate-9.0.0-1.mos8458.noarch
rubygem-astute-9.0.0-1.mos750.noarch
fuel-mirror-9.0.0-1.mos140.noarch
shotgun-9.0.0-1.mos90.noarch
fuel-openstack-metadata-9.0.0-1.mos8742.noarch
fuel-notify-9.0.0-1.mos8458.noarch
nailgun-mcagents-9.0.0-1.mos750.noarch
python-fuelclient-9.0.0-1.mos325.noarch
fuel-9.0.0-1.mos6349.noarch
fuel-utils-9.0.0-1.mos8458.noarch
fuel-setup-9.0.0-1.mos6349.noarch
fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
fuel-library9.0-9.0.0-1.mos8458.noarch
network-checker-9.0.0-1.mos74.x86_64
fuel-agent-9.0.0-1.mos285.noarch
fuel-ui-9.0.0-1.mos2717.noarch
fuel-ostf-9.0.0-1.mos936.noarch
fuelmenu-9.0.0-1.mos274.noarch
fuel-nailgun-9.0.0-1.mos8742.noarch

tags: removed: on-verification