server is in ERROR status after destroying node with active l3 agent for MOS 8.0

Bug #1580634 reported by Alexander Gromov
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

  Series   Status    Importance   Assigned to
  10.0.x   Invalid   High         Timur Nurlygayanov
  8.0.x    Invalid   High         Alexander Gromov
  9.x      Invalid   High         Timur Nurlygayanov

Bug Description

Environment:
3 controllers, 2 compute+cinder nodes; VLAN, VxLAN, or VxLAN + L2POP segmentation

Steps to reproduce:
1. Check which L3 agents are hosting router1:
    neutron l3-agent-list-hosting-router router1
2. If there is no agent on the primary controller, move router1 to it (see the Python sketch after these steps):
    neutron l3-agent-router-remove non_on_primary_agent_id router1
    neutron l3-agent-router-add on_primary_agent_id router1
3. Destroy the primary controller:
    virsh destroy <primary_controller>
4. Wait until all agents are reported alive in neutron agent-list
5. Check that all routers have been rescheduled away from the primary controller's agent:
    neutron router-list-on-l3-agent <on_primary_agent_id>
6. Boot vm3 in network1
7. ping 8.8.8.8 from vm3
8. ping between vm1 and vm3 by internal ip
9. ping between vm1 and vm2 by floating ip
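
For reference, a minimal Python sketch of steps 1-2 using python-neutronclient (the credentials, Keystone URL and primary-controller hostname below are placeholders, not values from this environment):

    from neutronclient.v2_0 import client

    # Placeholder credentials and endpoint; the real test takes these
    # from its own configuration.
    neutron = client.Client(username='admin', password='admin',
                            tenant_name='admin',
                            auth_url='http://<keystone>:5000/v2.0')

    primary = '<primary-controller-hostname>'  # placeholder

    router = neutron.list_routers(name='router1')['routers'][0]
    agents = neutron.list_l3_agent_hosting_routers(router['id'])['agents']

    if not any(agent['host'] == primary for agent in agents):
        # router1 is not hosted on the primary controller: move it there.
        on_primary = [a for a in neutron.list_agents(
                          binary='neutron-l3-agent')['agents']
                      if a['host'] == primary][0]
        neutron.remove_router_from_l3_agent(agents[0]['id'], router['id'])
        neutron.add_router_to_l3_agent(on_primary['id'],
                                       body={'router_id': router['id']})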

Expected results:
1) vm3 is successfully booted and is in ACTIVE status
2) pings are successful

Actual results:
Sometimes the server goes to ERROR status, and sometimes the pings fail; at least one of these checks fails on every run.

I used the 'test_shutdown_primary_controller_with_l3_agt' test from the https://github.com/Mirantis/mos-integration-tests repository to reproduce this problem.
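
The instance check in that test amounts to roughly the following (a sketch only, assuming an authenticated python-novaclient; the credentials, image, flavor and network IDs are placeholders, not the test's actual fixtures):

    import time

    from novaclient import client

    # Placeholder credentials and IDs.
    nova = client.Client('2', 'admin', 'admin', 'admin',
                         'http://<keystone>:5000/v2.0')
    image_id, flavor_id, net1_id = '<image-uuid>', '<flavor-id>', '<network1-uuid>'

    server = nova.servers.create(name='vm3', image=image_id, flavor=flavor_id,
                                 nics=[{'net-id': net1_id}])

    # Poll until the instance settles; hitting ERROR here is exactly what
    # this bug reports.
    for _ in range(60):
        status = nova.servers.get(server.id).status
        if status == 'ACTIVE':
            break
        if status == 'ERROR':
            raise RuntimeError('Instance %s is in ERROR status' % server.id)
        time.sleep(10)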

Reproducibility:
The server ends up in ERROR state in approximately 30% of runs; in the remaining runs the pings fail.

Tags: area-nova
Revision history for this message
Alexander Gromov (agromov) wrote :

Error message for instance:
"Build of instance c2f69199-f757-4f12-add0-4c4cef0bd01e was re-scheduled: Timed out waiting for a reply to message ID b4eee6c75c274a93803f526007752c26", "code": 500, "details": " File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1907, in _do_build_and_run_instance filter_properties) File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2060, in _build_and_run_instance instance_uuid=instance.uuid, reason=six.text_type(e))", "created": "2016-05-11T14:28:30Z"

Revision history for this message
Alexander Gromov (agromov) wrote :
tags: added: area-nova
Revision history for this message
Alexander Gromov (agromov) wrote :

Reproduced on CI with the following error:
InstanceError: Instance server03 is in ERROR status Build of instance 58ebb07e-f136-4368-8e82-1a557d8053d6 was re-scheduled: Binding failed for port 37f04a11-14ee-4a74-b837-5301a8b28873, please check neutron logs for more information. File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1907, in _do_build_and_run_instance filter_properties) File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2060, in _build_and_run_instance instance_uuid=instance.uuid, reason=six.text_type(e))

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, I've checked the diagnostic snapshot you provided, and it looks like you don't give the RabbitMQ cluster enough time to fully recover before trying to boot an instance. There is a connection error in the middle of the instance boot:

we start building an instance

2016-05-11T14:26:37.695624+00:00 debug: 2016-05-11 14:26:34.755 1792 DEBUG nova.compute.manager [req-f44df37d-43bb-4655-9f52-7289823dfbf7 - - - - -] [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] Skipping network cache update for instance because it is Building. _heal_instance_info_cache /usr/lib/python2.7/dist-packages/nova/compute/manager.py:5553

then there is a disconnect

2016-05-11T14:26:42.796898+00:00 err: 2016-05-11 14:26:39.854 1792 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.36.8:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.

connection is recovered after ~1.5 minutes:

2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.871 1792 WARNING nova.servicegroup.drivers.db [-] Lost connection to nova-conductor for reporting service status.
2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.872 1792 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 5.03 sec
2016-05-11T14:28:21.899409+00:00 info: 2016-05-11 14:28:18.957 1792 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.

the problem is that, by the time Nova reconnects to the RabbitMQ cluster, the RPC call has already timed out (the reconnect took more than 60 seconds)

2016-05-11 14:28:24.465 1792 ERROR nova.compute.manager [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] MessagingTimeout: Timed out waiting for a reply to message ID b4eee6c75c274a93803f526007752c26

so the instance goes to ERROR state.
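
(For context: the 60 seconds above is oslo.messaging's default rpc_response_timeout. It can be raised in nova.conf if slow RabbitMQ failover is expected, although that only papers over the failover time itself; a hedged example:)

    [DEFAULT]
    # Seconds to wait for a response from an RPC call (defaults to 60).
    rpc_response_timeout = 180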

I quickly checked the RabbitMQ logs, and it looks like, although only one controller node was stopped, the RabbitMQ services were shut down and restarted on the two remaining nodes as well:

2016-05-11T14:28:48.803260+00:00 notice: =INFO REPORT==== 11-May-2016::14:28:42 ===
2016-05-11T14:28:48.803260+00:00 notice: started TCP Listener on 10.109.36.5:5673

2016-05-11T14:27:13.560595+00:00 notice: =INFO REPORT==== 11-May-2016::14:27:01 ===
2016-05-11T14:27:13.560595+00:00 notice: started TCP Listener on 10.109.36.8:5673

so for some time RabbitMQ was not available at all, which explains the long reconnect in Nova.

Oslo team, is this expected? ^

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, is this reproduced in MOS 9.0 / 10.0?

Revision history for this message
Alexander Gromov (agromov) wrote :

Roman, I have checked it only on MOS 8.0.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Alexander, RabbitMQ failed over slowly because of bug https://bugs.launchpad.net/fuel/+bug/1543154. So either we need to port the fix for it from 9.0, or the test case should just wait for the RabbitMQ cluster to 'stabilize'. That should take less than 10 minutes.
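
For illustration, the "wait for the cluster to stabilize" step could be sketched in the test like this (assuming an authenticated python-novaclient; polling nova service states is used here only as a proxy for RabbitMQ having recovered, with the 10-minute cap mentioned above):

    import time

    from novaclient import client

    nova = client.Client('2', 'admin', 'admin', 'admin',
                         'http://<keystone>:5000/v2.0')  # placeholders
    destroyed_host = '<primary-controller-hostname>'     # placeholder

    deadline = time.time() + 600  # up to 10 minutes, as suggested above
    while time.time() < deadline:
        services = nova.services.list()
        # Services on the destroyed controller will stay down; ignore them.
        if all(s.state == 'up' for s in services if s.host != destroyed_host):
            break
        time.sleep(15)
    else:
        raise RuntimeError('Control plane did not recover within 10 minutes')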

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

So I'm going to check it on MOS 9.0, since we should already be running the same tests there.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We haven't checked it on MOS 9.0 yet, but we have an automated test for this scenario and just need to run it against a 9.0 build.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This test is green for MOS 9.0 builds, TR id 836109.
