Failed to uplink subnet to router after destroying one controller

Bug #1560097 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects             Status                    Importance  Assigned to       Milestone
Mirantis OpenStack  Status tracked in 10.0.x
  10.0.x            Fix Committed             High        Ivan Berezovskiy
  6.1.x             Fix Released              High        Sergii Rizvan
  7.0.x             Fix Released              High        Sergii Rizvan
  8.0.x             Fix Released              High        Sergii Rizvan
  9.x               Fix Released              High        Ivan Berezovskiy

Bug Description

iso - 9.0-90

Steps:
1. Create and deploy the following cluster: Neutron VLAN, default storage, 3 controllers, 2 computes, 1 cinder
2. After deployment, destroy one controller
3. Open the Health Check tab and wait until the HA suite passes (within 20 minutes)
4. Once it passes, run 'Check network connectivity from instance via floating IP'

Actual result - test failed

fuel_health.common.test_mixins: INFO: STEP:5, verify action: 'Uplink subnet to router'
neutronclient.client: DEBUG: REQ: curl -i https://public.fuel.local:9696/v2.0/routers/e594f584-168b-4ad8-a495-59a76b6f72f5/add_router_interface.json -X PUT -H "User-Agent: python-neutronclient" -H "X-Auth-Token: gAAAAABW73XHffPD-itGuIyKwKR6Me8oMWyNjkfGN0HKifv0Phd2K9ZJbe-kS8Z-YFWvUP6GVQAoNsICNwEICJVjFBqgDZSEtbSttywbSO2vD4wm_pjLJpgMFKGtauwwTtCbS-M0FiXnoJhQlPFpxmrYq0BboaHrVI1I1TxO_GOCmM88hu069dY" -d '{"subnet_id": "0de9368a-24a6-40f5-8038-b7b82bcdc951"}'
neutronclient.client: DEBUG: throwing ConnectionFailed :
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/fuel_health/common/test_mixins.py", line 177, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/fuel_health/neutronmanager.py", line 103, in uplink_subnet_to_router
    router["id"], {"subnet_id": subnet["id"]})
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 97, in with_params
    ret = self.function(instance, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 803, in add_interface_router
    body=body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 367, in put
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 335, in retry_request
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 286, in do_request
    resp, replybody = self.httpclient.do_request(action, method, body=body)
  File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 170, in do_request
    **kwargs)
  File "/usr/lib/python2.7/site-packages/neutronclient/client.py", line 106, in _cs_request
    raise exceptions.ConnectionFailed(reason=e)
ConnectionFailed: Connection to neutron failed:

Unfortunately, I couldn't find anything in the neutron logs related to this failure.
The issue did not reproduce on a second test run.

fuel version - http://paste.openstack.org/show/491307/

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Dina Belova (dbelova)
tags: added: area-neutron
Dina Belova (dbelova)
Changed in mos:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

It seems that problems connecting to AMQP in neutron caused the request in the test to time out. More info below:

The request to add the router interface was issued at 04:17:19:

2016-03-21 04:17:19 DEBUG (utils) REQ: curl -i https://public.fuel.local:9696/v2.0/routers/e594f584-168b-4ad8-a495-59a76b6f72f5/add_router_interface.json -X PUT -H "User-Agent: python-neutronclient" -H "X-Auth-Token: gAAAAABW73XHffPD-itGuIyKwKR6Me8oMWyNjkfGN0HKifv0Phd2K9ZJbe-kS8Z-YFWvUP6GVQAoNsICNwEICJVjFBqgDZSEtbSttywbSO2vD4wm_pjLJpgMFKGtauwwTtCbS-M0FiXnoJhQlPFpxmrYq0BboaHrVI1I1TxO_GOCmM88hu069dY" -d '{"subnet_id": "0de9368a-24a6-40f5-8038-b7b82bcdc951"}'

Failure at 04:17:39 (after 20 seconds):

2016-03-21 04:17:39 DEBUG (test_mixins) Traceback (most recent call last):
...
    raise exceptions.ConnectionFailed(reason=e)
ConnectionFailed: Connection to neutron failed:

As part of adding a router interface, the neutron server should notify the L3 agent.
Corresponding neutron logs (req-1372e3b7-6137-44b3-8bc4-09da1d0f8699 is the request for adding the router interface):

2016-03-21 04:17:20.741 21664 DEBUG oslo.messaging._drivers.impl_rabbit [req-1372e3b7-6137-44b3-8bc4-09da1d0f8699 9f6df653b10340029b51da845ba83d9a 6d3859f589ef4627bc14a14290652ab8 - - -] Connecting to AMQP server on 10.109.16.4:5673 __init__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:535

5 seconds later:

2016-03-21 04:17:25.764 21664 DEBUG oslo.messaging._drivers.impl_rabbit [req-1372e3b7-6137-44b3-8bc4-09da1d0f8699 9f6df653b10340029b51da845ba83d9a 6d3859f589ef4627bc14a14290652ab8 - - -] Received recoverable error from kombu: on_error /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:678
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 436, in _ensured
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 507, in __call__
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit self.revive(create_channel())
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 242, in channel
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit chan = self.transport.create_channel(self.connection)
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 741, in connection
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit self._connection = self._establish_connection()
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 696, in _establish_connection
2016-03-21 04:17:25.764 21664 ERROR oslo.messaging._drivers.impl_rabbit conn = self.transport.establish_connection()
2016-03-21 04:17:25.764 21664 ERROR os...
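
The intervals quoted in this comment can be re-derived from the log timestamps. The snippet below is only an illustrative sketch: the helper names `parse_ts` and `seconds_between` are hypothetical, and the timestamps are copied from the log lines above.

```python
from datetime import datetime


def parse_ts(text):
    """Parse a log timestamp with or without a fractional-seconds part."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt)


def seconds_between(start, end):
    """Return the elapsed time between two log timestamps, in seconds."""
    return (parse_ts(end) - parse_ts(start)).total_seconds()


# Request issued vs. ConnectionFailed raised in the test log.
request_to_failure = seconds_between("2016-03-21 04:17:19", "2016-03-21 04:17:39")

# "Connecting to AMQP server" vs. "Received recoverable error from kombu".
connect_to_error = seconds_between(
    "2016-03-21 04:17:20.741", "2016-03-21 04:17:25.764"
)

print(request_to_failure, connect_to_error)
```

The first value corresponds to the 20-second gap between request and failure, the second to the roughly 5 seconds between the connection attempt and the kombu error.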


Changed in mos:
assignee: Oleg Bondarev (obondarev) → MOS Oslo (mos-oslo)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

QA team, please confirm that a 20-second timeout is used for neutron operations performed through the neutron client.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Raising priority to High because slow reconnects seem to cause failures in various areas after failover.

Changed in mos:
importance: Medium → High
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

After looking into the logs, it is clear that the error is caused by a long reconnection time. Most of that time is spent sleeping for kombu_reconnect_delay seconds. Right now many components set this parameter to 5, and in a bad scenario reconnection takes 15 seconds, which is too long for the tests.

Puppet team, please remove kombu_reconnect_delay from the configs of every OpenStack component. The default of 1 second will then be used, which should be enough.
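
A rough way to see the effect of the delay is to add up the time spent sleeping between reconnect attempts. This is a minimal sketch, not oslo.messaging code: the function name is hypothetical, and the count of three failed attempts is an assumption chosen only because it matches the 15-second figure above for a delay of 5. The 20-second budget is the gap between request and failure observed earlier in this report.

```python
def reconnect_sleep_seconds(reconnect_delay, failed_attempts):
    """Total time spent sleeping between failed reconnect attempts."""
    return reconnect_delay * failed_attempts


CLIENT_BUDGET = 20           # seconds between request and failure in the test log
ASSUMED_FAILED_ATTEMPTS = 3  # assumption: gives 15 s when the delay is 5 s

for delay in (5, 1):
    slept = reconnect_sleep_seconds(delay, ASSUMED_FAILED_ATTEMPTS)
    remaining = CLIENT_BUDGET - slept
    print("delay=%ss: slept %ss, %ss left for the actual request"
          % (delay, slept, remaining))
```

Under these assumptions a delay of 5 leaves only 5 seconds of the budget for the request itself, while the default of 1 leaves 17.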

Changed in mos:
assignee: MOS Oslo (mos-oslo) → MOS Puppet Team (mos-puppet)
assignee: MOS Puppet Team (mos-puppet) → Ivan Berezovskiy (iberezovskiy)
no longer affects: mos/8.0.x
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Actually, the same issue affects 8.0.x, so the change is worth backporting there as well.

Revision history for this message
Ivan Berezovskiy (iberezovskiy) wrote :
Revision history for this message
Ivan Berezovskiy (iberezovskiy) wrote :

Both patches are merged

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Verified on ISO 285 for Mitaka.

Revision history for this message
OSCI Robot (oscirobot) wrote :

NOTE: Changeset is not merged, created temporary package repository.
RPM package fuel-library6.0 has been built for project openstack/fuel-library.
Files placed in repository:
fuel-ha-utils6.0-6.0.0-6212.2.gerrit316885.1.gite90f172.noarch.rpm
fuel-library6.0-6.0.0-6212.2.gerrit316885.1.gite90f172.noarch.rpm
Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-6.0-updates-stable-LP1560097/centos .

Revision history for this message
OSCI Robot (oscirobot) wrote :

NOTE: Changeset is not merged, created temporary package repository.
DEB package fuel-library has been built for project openstack/fuel-library.
Files placed in repository:
fuel-ha-utils6.0_6.0.0-6212.2.gerrit316885.1.gite90f172_all.deb
fuel-library6.0_6.0.0-6212.2.gerrit316885.1.gite90f172_all.deb
Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-6.0-updates-stable-LP1560097/ubuntu .

Revision history for this message
Sergii Rizvan (srizvan) wrote :

Steps to verify:
The bug is very hard to reproduce manually. To verify the fix, it is therefore enough to check the Neutron log and make sure that the reconnect interval is 1 second after deployment with the patched Puppet code:

<163>Jun 1 16:11:17 node-1 neutron-server 2016-06-01 16:11:17.041 20271 ERROR oslo_messaging._drivers.impl_rabbit [req-7d2971fb-8f19-4174-97a6-19dc63e83003 ] AMQP server on 192.168.0.4:5673 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.
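
The check can be scripted along these lines. This is a sketch only: the regular expression is written against the message format shown in the log line above, and the function name `retry_intervals` is hypothetical.

```python
import re

# Matches messages such as:
#   AMQP server on 192.168.0.4:5673 is unreachable: <reason>. Trying again in 1 seconds.
RETRY_RE = re.compile(
    r"AMQP server on (\S+) is unreachable: (.*?)\. Trying again in (\d+) seconds\."
)


def retry_intervals(lines):
    """Return (server, reason, delay_seconds) for every reconnect message."""
    found = []
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            server, reason, delay = match.groups()
            found.append((server, reason, int(delay)))
    return found


sample = [
    "<163>Jun 1 16:11:17 node-1 neutron-server 2016-06-01 16:11:17.041 20271 "
    "ERROR oslo_messaging._drivers.impl_rabbit "
    "[req-7d2971fb-8f19-4174-97a6-19dc63e83003 ] AMQP server on "
    "192.168.0.4:5673 is unreachable: [Errno 32] Broken pipe. "
    "Trying again in 1 seconds.",
]

for server, reason, delay in retry_intervals(sample):
    print(server, reason, delay)
```

In practice the lines would come from the Neutron log file on a controller rather than from an inline sample.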

Revision history for this message
Sergii Rizvan (srizvan) wrote :
Revision history for this message
OSCI Robot (oscirobot) wrote :

NOTE: Changeset is not merged, created temporary package repository.
RPM package fuel-library6.0 has been built for project openstack/fuel-library.
Files placed in repository:
fuel-ha-utils6.0-6.0.0-6212.2.gerrit316885.2.git11606ec.noarch.rpm
fuel-library6.0-6.0.0-6212.2.gerrit316885.2.git11606ec.noarch.rpm
Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-6.0-updates-stable-LP1560097/centos .

Revision history for this message
OSCI Robot (oscirobot) wrote :

NOTE: Changeset is not merged, created temporary package repository.
DEB package fuel-library has been built for project openstack/fuel-library.
Files placed in repository:
fuel-ha-utils6.0_6.0.0-6212.2.gerrit316885.2.git11606ec_all.deb
fuel-library6.0_6.0.0-6212.2.gerrit316885.2.git11606ec_all.deb
Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-6.0-updates-stable-LP1560097/ubuntu .

tags: added: on-verification
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

Used the scenario from the description.
All HA suite tests passed except 'Check state of haproxy backends on controllers', since one controller was destroyed.
Checked OSTF test 'Check network connectivity from instance via floating IP' - passed.

As per comment #14:
Checked the reconnection time; it is now 1 second:
server.log:2016-06-23 13:42:31.863 4940 ERROR oslo.messaging._drivers.impl_rabbit [req-f3eda16a-ef75-4c23-9973-11d11837386c - - - - -] AMQP server on 10.109.1.8:5673 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds.

Verified on MOS 8.0 build 570 + MU2 updates.

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Verified on 6.1 + MU7 updates

2016-07-25 10:36:47 ERR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.7.6:5673 is unreachable: timed out. Trying again in 1 seconds.

tags: removed: on-verification
tags: added: on-verification
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

Verified on MOS 7.0 + MU5 updates.

All HA suite tests passed.
Checked OSTF test 'Check network connectivity from instance via floating IP' - passed.

As per comment #14:
Checked the reconnection time; it is now 1 second:
/var/log/neutron/dhcp-agent.log:2016-08-12 08:21:00.025 18409 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.109.11.6:5673 is unreachable: timed out. Trying again in 1 seconds.

tags: removed: on-verification