server is in ERROR status after destroying node with active l3 agent for MOS 8.0

Bug #1580634 reported by Alexander Gromov
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

  Series   Status    Importance   Assigned to
  10.0.x   Invalid   High         Timur Nurlygayanov
  8.0.x    Invalid   High         Alexander Gromov
  9.x      Invalid   High         Timur Nurlygayanov

Bug Description

Environment:
3 controllers, 2 compute+cinder nodes; VLAN, VxLAN, or VxLAN + L2POP segmentation

Steps to reproduce:
1. Check which L3 agents are hosting router1:
    neutron l3-agent-list-hosting-router router1
2. If there is no agent on the primary controller, move router1 to it (see the Python sketch after these steps):
    neutron l3-agent-router-remove non_on_primary_agent_id router1
    neutron l3-agent-router-add on_primary_agent_id router1
3. Destroy the primary controller:
    virsh destroy <primary_controller>
4. Wait until all agents are reported alive in neutron agent-list
5. Check that all routers have been rescheduled away from the primary controller's agent:
    neutron router-list-on-l3-agent <on_primary_agent_id>
6. Boot vm3 in network1
7. ping 8.8.8.8 from vm3
8. ping between vm1 and vm3 by internal ip
9. ping between vm1 and vm2 by floating ip
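
For reference, a minimal Python sketch of steps 1-2 using python-neutronclient (the credentials, Keystone URL and primary-controller hostname below are placeholders, not values from this environment):

    from neutronclient.v2_0 import client

    # Placeholder credentials and endpoint; the real test takes these
    # from its own configuration.
    neutron = client.Client(username='admin', password='admin',
                            tenant_name='admin',
                            auth_url='http://<keystone>:5000/v2.0')

    primary = '<primary-controller-hostname>'  # placeholder

    router = neutron.list_routers(name='router1')['routers'][0]
    agents = neutron.list_l3_agent_hosting_routers(router['id'])['agents']

    if not any(agent['host'] == primary for agent in agents):
        # router1 is not hosted on the primary controller: move it there.
        on_primary = [a for a in neutron.list_agents(
                          binary='neutron-l3-agent')['agents']
                      if a['host'] == primary][0]
        neutron.remove_router_from_l3_agent(agents[0]['id'], router['id'])
        neutron.add_router_to_l3_agent(on_primary['id'],
                                       body={'router_id': router['id']})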

Expected results:
1) vm3 is successfully booted and is in ACTIVE status
2) pings are successful

Actual results:
Sometimes the server goes to ERROR status, and sometimes the pings fail; at least one of these checks fails on every run.

I used the 'test_shutdown_primary_controller_with_l3_agt' test from the https://github.com/Mirantis/mos-integration-tests repository to reproduce this problem.
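
The instance check in that test amounts to roughly the following (a sketch only, assuming an authenticated python-novaclient; the credentials, image, flavor and network IDs are placeholders, not the test's actual fixtures):

    import time

    from novaclient import client

    # Placeholder credentials and IDs.
    nova = client.Client('2', 'admin', 'admin', 'admin',
                         'http://<keystone>:5000/v2.0')
    image_id, flavor_id, net1_id = '<image-uuid>', '<flavor-id>', '<network1-uuid>'

    server = nova.servers.create(name='vm3', image=image_id, flavor=flavor_id,
                                 nics=[{'net-id': net1_id}])

    # Poll until the instance settles; hitting ERROR here is exactly what
    # this bug reports.
    for _ in range(60):
        status = nova.servers.get(server.id).status
        if status == 'ACTIVE':
            break
        if status == 'ERROR':
            raise RuntimeError('Instance %s is in ERROR status' % server.id)
        time.sleep(10)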

Reproducibility:
The server ends up in ERROR state in approximately 30% of runs; in the remaining runs the pings fail.

Tags: area-nova
Revision history for this message
Alexander Gromov (agromov) wrote :

Error message for instance:
"Build of instance c2f69199-f757-4f12-add0-4c4cef0bd01e was re-scheduled: Timed out waiting for a reply to message ID b4eee6c75c274a93803f526007752c26", "code": 500, "details": " File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1907, in _do_build_and_run_instance filter_properties) File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 2060, in _build_and_run_instance instance_uuid=instance.uuid, reason=six.text_type(e))", "created": "2016-05-11T14:28:30Z"

Revision history for this message
Alexander Gromov (agromov) wrote :
tags: added: area-nova
Revision history for this message
Alexander Gromov (agromov) wrote :

Reproduced on CI with the following error:
InstanceError: Instance server03 is in ERROR status Build of instance 58ebb07e-f136-4368-8e82-1a557d8053d6 was re-scheduled: Binding failed for port 37f04a11-14ee-4a74-b837-5301a8b28873, please check neutron logs for more information. File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1907, in _do_build_and_run_instance filter_properties) File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2060, in _build_and_run_instance instance_uuid=instance.uuid, reason=six.text_type(e))

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, I've checked the diagnostic snapshot you provided, and it looks like you don't give the RabbitMQ cluster enough time to fully recover before trying to boot an instance. There is a connection error in the middle of the instance boot:

we start building an instance

2016-05-11T14:26:37.695624+00:00 debug: 2016-05-11 14:26:34.755 1792 DEBUG nova.compute.manager [req-f44df37d-43bb-4655-9f52-7289823dfbf7 - - - - -] [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] Skipping network cache update for instance because it is Building. _heal_instance_info_cache /usr/lib/python2.7/dist-packages/nova/compute/manager.py:5553

then there is a disconnect

2016-05-11T14:26:42.796898+00:00 err: 2016-05-11 14:26:39.854 1792 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 10.109.36.8:5673 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.

connection is recovered after ~1.5 minutes:

2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.871 1792 WARNING nova.servicegroup.drivers.db [-] Lost connection to nova-conductor for reporting service status.
2016-05-11T14:28:21.814615+00:00 warning: 2016-05-11 14:28:18.872 1792 WARNING oslo.service.loopingcall [-] Function 'nova.servicegroup.drivers.db.DbDriver._report_state' run outlasted interval by 5.03 sec
2016-05-11T14:28:21.899409+00:00 info: 2016-05-11 14:28:18.957 1792 INFO nova.servicegroup.drivers.db [-] Recovered from being unable to report status.

the problem is that, by the time Nova reconnects to the RabbitMQ cluster, the RPC call has already timed out (the reconnect took more than 60 seconds)

2016-05-11 14:28:24.465 1792 ERROR nova.compute.manager [instance: c2f69199-f757-4f12-add0-4c4cef0bd01e] MessagingTimeout: Timed out waiting for a reply to message ID b4eee6c75c274a93803f526007752c26

so the instance goes to ERROR state.
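
(For context: the 60 seconds above is oslo.messaging's default rpc_response_timeout. It can be raised in nova.conf if slow RabbitMQ failover is expected, although that only papers over the failover time itself; a hedged example:)

    [DEFAULT]
    # Seconds to wait for a response from an RPC call (defaults to 60).
    rpc_response_timeout = 180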

I quickly checked the RabbitMQ logs, and it looks like, although only one controller node was stopped, the RabbitMQ services were shut down and restarted on the two remaining nodes as well:

2016-05-11T14:28:48.803260+00:00 notice: =INFO REPORT==== 11-May-2016::14:28:42 ===
2016-05-11T14:28:48.803260+00:00 notice: started TCP Listener on 10.109.36.5:5673

2016-05-11T14:27:13.560595+00:00 notice: =INFO REPORT==== 11-May-2016::14:27:01 ===
2016-05-11T14:27:13.560595+00:00 notice: started TCP Listener on 10.109.36.8:5673

so for some time RabbitMQ was not available at all, which explains the long reconnect in Nova.

Oslo team, is this expected? ^

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Alexander, is this reproduced in MOS 9.0 / 10.0?

Revision history for this message
Alexander Gromov (agromov) wrote :

Roman, I have checked it only on MOS 8.0.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Alexander, RabbitMQ failed over slowly because of bug https://bugs.launchpad.net/fuel/+bug/1543154. So either we need to port the fix for it from 9.0, or the test case should just wait for the RabbitMQ cluster to 'stabilize'. That should take less than 10 minutes.
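
For illustration, the "wait for the cluster to stabilize" step could be sketched in the test like this (assuming an authenticated python-novaclient; polling nova service states is used here only as a proxy for RabbitMQ having recovered, with the 10-minute cap mentioned above):

    import time

    from novaclient import client

    nova = client.Client('2', 'admin', 'admin', 'admin',
                         'http://<keystone>:5000/v2.0')  # placeholders
    destroyed_host = '<primary-controller-hostname>'     # placeholder

    deadline = time.time() + 600  # up to 10 minutes, as suggested above
    while time.time() < deadline:
        services = nova.services.list()
        # Services on the destroyed controller will stay down; ignore them.
        if all(s.state == 'up' for s in services if s.host != destroyed_host):
            break
        time.sleep(15)
    else:
        raise RuntimeError('Control plane did not recover within 10 minutes')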

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

So I'm going to check it on MOS 9.0, since we should already be running the same tests there.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

We haven't checked it on MOS 9.0 yet, but we have an automated test for this scenario and just need to run it against a 9.0 build.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

This test is green for MOS 9.0 builds, TR id 836109.
