Instance failed to spawn after controllers reboot

Bug #1515154 reported by Ksenia Svechnikova
Affects              Status        Importance  Assigned to          Milestone
Mirantis OpenStack   Fix Released  High        Dmitry Mescheryakov
7.0.x                Fix Released  High        Dmitry Mescheryakov
8.0.x                Fix Released  High        Dmitry Mescheryakov
9.x                  Fix Released  High        Dmitry Mescheryakov

Bug Description

7.0 MU1

Swarm test: https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/103/testReport/junit/(root)/auto_cic_maintenance_mode/auto_cic_maintenance_mode/

            1. Revert snapshot 3 ['controller', 'mongo'] + 2 ['compute', 'cinder']
            2. reboot --force a controller
            3. Wait until the controller switches into maintenance mode
            4. Exit maintenance mode
            5. Check that the controller becomes available
            6. Run OSTF and repeat for the other 2 nodes

For the 3rd node, the OSTF smoke tests did not pass:

[
 {
  "Create volume and boot instance from it (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 },
 {
  "Create volume and attach it to instance (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 },
 {
  "Check network connectivity from instance via floating IP (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 },
 {
  "Launch instance (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 },
 {
  "Launch instance with file injection (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 },
 {
  "Launch instance, create snapshot, launch instance from snapshot (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
 }
]

Errors in nova-compute, "NovaException: Unexpected vif_type=binding_failed":

https://paste.mirantis.net/show/1405/

Open vSwitch agents are marked as dead in neutron agent-list, just like other agent types:

root@node-5:~# neutron agent-list
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+
| id | agent_type | host | alive | admin_state_up | binary |
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+
| 122b0a93-0b84-42d2-adf1-0aed651d4344 | L3 agent | node-5.test.domain.local | xxx | True | neutron-l3-agent |
| 3b1e0177-3458-4a1e-b1aa-c32df12a9cd3 | DHCP agent | node-5.test.domain.local | xxx | True | neutron-dhcp-agent |
| 49b5dac6-75eb-492d-9a01-d39679ac6a75 | Metadata agent | node-5.test.domain.local | xxx | True | neutron-metadata-agent |
| 4b9259b6-87f8-423c-bfa3-ad984c4c4937 | L3 agent | node-4.test.domain.local | xxx | True | neutron-l3-agent |
| 577098e7-8565-4456-b2e8-1feeb133f983 | DHCP agent | node-4.test.domain.local | xxx | True | neutron-dhcp-agent |
| 70b2841c-1c31-4ef8-ab8a-35d63883c072 | L3 agent | node-1.test.domain.local | :-) | True | neutron-l3-agent |
| 992c731e-2f37-4498-be62-9b969b0aa4b9 | Open vSwitch agent | node-1.test.domain.local | :-) | True | neutron-openvswitch-agent |
| 99eabcd2-7192-4c0c-a717-62a49b4ff0f3 | Metadata agent | node-4.test.domain.local | xxx | True | neutron-metadata-agent |
| a066f867-1f44-4524-925c-3de92df5cc3f | DHCP agent | node-1.test.domain.local | :-) | True | neutron-dhcp-agent |
| c6028891-c029-47cd-9081-5d7644717d56 | Open vSwitch agent | node-5.test.domain.local | :-) | True | neutron-openvswitch-agent |
| ccc0bcb2-f1de-48bb-bd71-82bac0a519dc | Open vSwitch agent | node-3.test.domain.local | xxx | True | neutron-openvswitch-agent |
| d84f6b13-9b4a-4969-9219-29aeb13cf091 | Open vSwitch agent | node-2.test.domain.local | xxx | True | neutron-openvswitch-agent |
| f923b54d-fe16-43b2-9d71-9c0179f3e5b5 | Open vSwitch agent | node-4.test.domain.local | xxx | True | neutron-openvswitch-agent |
| fbd47cae-5990-4af4-863d-9c565cc029af | Metadata agent | node-1.test.domain.local | :-) | True | neutron-metadata-agent |
+--------------------------------------+--------------------+--------------------------+-------+----------------+---------------------------+

Workaround: kill neutron-openvswitch-agent on each compute node.
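
For reference, a minimal sketch (assuming python-neutronclient is installed; the credentials and auth URL below are placeholders, not values from this environment) of listing the dead agents from the table above before applying the workaround:

# Minimal sketch: list Neutron agents marked as dead ('alive' is False),
# mirroring the 'xxx' entries in the neutron agent-list output above.
# The credentials below are placeholders -- adjust to the environment.
from neutronclient.v2_0 import client

neutron = client.Client(username='admin',
                        password='admin',
                        tenant_name='admin',
                        auth_url='http://192.168.0.2:5000/v2.0')

for agent in neutron.list_agents()['agents']:
    if not agent['alive']:
        print(agent['agent_type'], agent['host'], agent['binary'])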

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
summary: - Instance failed to spawn after nodes reboot
+ Instance failed to spawn after controller's reboot
summary: - Instance failed to spawn after controller's reboot
+ Instance failed to spawn after controllers reboot
Changed in mos:
status: New → Confirmed
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

We can also see a crash report in the RabbitMQ logs during the test execution (https://bugs.launchpad.net/fuel/+bug/1513511), but the test first runs the HA tests and verifies RabbitMQ:

2015-11-10 12:06:21,821 - INFO fuel_web_client.py:243 -- OSTF test statuses are : {
 "Check if amount of tables in databases is the same on each node": "success",
 "RabbitMQ replication": "success",
 "Check pacemaker status": "success",
 "RabbitMQ availability": "success",
 "Check galera environment state": "success",
 "Check data replication over mysql": "success"
}

and only then starts the ['smoke', 'sanity'] tests.

In the reverted snapshot I saw that RabbitMQ was not stopped.

Changed in mos:
assignee: MOS Neutron (mos-neutron) → Andrey Epifanov (aepifanov)
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

On one hand this issue is a new one, as we did not see such behavior previously. On the other hand, it is a floating one, as it was not reproduced in today's swarm run:

https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/104/console test auto_cic_maintenance_mode passed

Since we have a workaround and the issue is floating, I suggest changing the priority to High.

Changed in mos:
importance: Critical → High
Changed in mos:
status: Confirmed → In Progress
Changed in mos:
assignee: Andrey Epifanov (aepifanov) → Dmitry Mescheryakov (dmitrymex)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We had a reproduction of the issue at 100-node scale today: many agents were marked as dead, and restarting the agents helped. In the log of an agent one can see that the last thing it did with RabbitMQ was a successful reconnect:

http://paste.openstack.org/show/481941/

But for some reason it stopped sending periodic heartbeats and hence was marked as dead. lsof output shows that connections to RabbitMQ are live:

root@node-413:~# lsof | grep 11262 | grep 5673
neutron-l 11262 neutron 10u IPv4 75449293 0t0 TCP node-413.domain.tld:53794->node-452.domain.tld:5673 (ESTABLISHED)
neutron-l 11262 neutron 11u IPv4 75460227 0t0 TCP node-413.domain.tld:54068->node-452.domain.tld:5673 (ESTABLISHED)
neutron-l 11262 neutron 18u IPv4 75461014 0t0 TCP node-413.domain.tld:54187->node-452.domain.tld:5673 (ESTABLISHED)

Here 11262 is the PID of an L3 agent.
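
This matches how agents report liveness: state reports are sent from a dedicated greenthread on a fixed interval, so if one report blocks inside the RPC layer, no further reports go out and the server marks the agent dead even though the TCP connections stay established. A simplified, hedged sketch of that pattern (illustrative only, not the actual Neutron agent code):

# Hedged sketch of the agent-side liveness reporting pattern (illustrative only):
# a looping call fires the report every report_interval seconds; if one iteration
# blocks forever inside the RPC call, the loop never runs again and the agent
# shows up as 'xxx' in neutron agent-list.
import time

from oslo_service import loopingcall

REPORT_INTERVAL = 5  # seconds; illustrative value


def send_state_report_over_rpc():
    # Stand-in for the RPC call to the Neutron server; in this bug the real
    # call blocked forever waiting on RabbitMQ.
    print('state report sent at %s' % time.ctime())


heartbeat = loopingcall.FixedIntervalLoopingCall(send_state_report_over_rpc)
heartbeat.start(interval=REPORT_INTERVAL)
heartbeat.wait()   # blocks; the loop keeps firing until stopped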

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

We should switch on the eventlet backdoor so that when the process hangs we are able to look at where it is stuck. We'll need the Neutron team's help with the reproduction.
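
In OpenStack services the backdoor is exposed through oslo.service's backdoor_port option; a minimal standalone sketch of the underlying eventlet facility it wraps (the port number and busy loop are arbitrary examples) looks like this:

# Standalone sketch of the eventlet backdoor facility: once running,
# `telnet 127.0.0.1 4444` drops into a Python shell inside the live process,
# where greenthread stacks can be inspected.
import time

import eventlet
from eventlet import backdoor

eventlet.monkey_patch()

# Arbitrary example port; oslo.service binds this via its backdoor_port option.
eventlet.spawn(backdoor.backdoor_server, eventlet.listen(('127.0.0.1', 4444)))


def busy_worker():
    while True:
        time.sleep(1)   # stand-in for the agent's normal work


eventlet.spawn(busy_worker)
time.sleep(3600)        # keep the process alive so the backdoor can be used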

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Current status: Ksenia Demina is currently trying to reproduce the issue with the backdoor enabled. The process is going slowly because of CI issues.

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The test was rerun more than 20 times, but the issue was not reproduced on 7.0 MU. Changing status to Invalid.

tags: added: area-oslo hit-hcf
removed: neutron
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue reappeared on 8.0. This time we were able to capture the current state of all eventlet threads of a hanging openvswitch agent. Attached are four independent snapshots of that state. It can be seen that only thread #0 and thread #12 changed their state between snapshots. It can also be seen that thread #2 is the report_state thread and that it is hanging while trying to send a message to RabbitMQ. Specifically, it hangs waiting for confirmation from RabbitMQ that the message was received and saved (publisher confirms are enabled).

So the issue appears when RabbitMQ fails to send a confirmation back to the client. This time it occurred due to https://github.com/rabbitmq/rabbitmq-server/issues/255 . We need to ensure that oslo.messaging times out when RabbitMQ fails to do its job.
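
The fix merged below ("Correctly set socket timeout for publishing") follows exactly that idea. A minimal, self-contained illustration of the failure mode in plain socket terms (not the oslo.messaging code itself):

# Plain-socket illustration of why a timeout is needed: waiting for a reply
# that never comes blocks forever unless the socket has a timeout, in which
# case the wait becomes a catchable socket.timeout and the caller can
# reconnect or retry instead of silently hanging.
import socket


def wait_for_confirm(sock, timeout=None):
    sock.settimeout(timeout)      # None means "block forever" (the old behaviour)
    try:
        return sock.recv(1)       # stand-in for waiting on a publisher confirm
    except socket.timeout:
        return None               # the caller can now recover instead of hanging


server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)

client = socket.create_connection(server.getsockname())
server.accept()                   # accept the connection but never send a reply

print(wait_for_confirm(client, timeout=2.0))   # prints None after ~2 seconds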

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/16605
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: aa57f058cd1037b7793db882c1c377b6dc3c0fc7
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Feb 2 13:35:11 2016

Correctly set socket timeout for publishing

Previously a wrong attribute was taken which didn't work

Closes-Bug: #1515154
Change-Id: I9dc2453be3047f993930c2b5c1e74720c8c3b26c

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Neutron team, can we enable heartbeats in Neutron in 9.0? I would prefer that solution.

Maksym Strukov (unbelll)
tags: added: on-verification
Revision history for this message
Maksym Strukov (unbelll) wrote :

Verified as fixed in 8.0-566

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The fix was merged into Mitaka as well: https://review.openstack.org/#/c/278275/
But still, it would be preferable for Neutron to support heartbeats.
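
For context, AMQP-level heartbeats let the client detect a broker that has stopped responding instead of waiting indefinitely (in oslo.messaging's rabbit driver this is controlled by the heartbeat_timeout_threshold option). A hedged kombu sketch of what that gives us, illustrative only and not the Neutron/oslo.messaging wiring, assuming a local RabbitMQ with default credentials:

# Hedged kombu sketch of AMQP heartbeats: with a heartbeat negotiated, a broker
# that stops responding is detected and an error is raised, rather than the
# connection silently hanging as in this bug.
import time

import kombu

conn = kombu.Connection('amqp://guest:guest@127.0.0.1:5672//', heartbeat=10)
conn.connect()

while True:
    # The application must tick this regularly (roughly twice per heartbeat
    # interval); a missed heartbeat raises a connection error.
    conn.heartbeat_check()
    time.sleep(5)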

Revision history for this message
Alexey Galkin (agalkin) wrote :

Verified as fixed in 9.0-220.

root@node-9:~# cat /var/log/neutron-all.log | grep "Failed to get socket attribute"

<167>Apr 22 15:26:01 node-9 neutron-server: 2016-04-22 15:26:01.288 23838 DEBUG oslo.messaging._drivers.impl_rabbit [req-b7e6eadc-eda7-4d4e-a5de-f18b982e74f3 - - - - -] Failed to get socket attribute: 'NoneType' object has no attribute 'sock' set_transport_socket_timeout /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:868

The error message has DEBUG level, which confirms that the "AttributeError" exception is caught and handled successfully.
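
In other words, the code path degrades gracefully when the transport has no real TCP socket. A hedged sketch of that defensive pattern (illustrative only; the attribute path and function name are not the actual oslo.messaging internals):

# Illustrative sketch of the defensive pattern the DEBUG line reflects:
# if the transport exposes no real TCP socket, the AttributeError is logged
# at DEBUG and publishing proceeds without a socket timeout instead of failing.
import logging

LOG = logging.getLogger(__name__)


def set_socket_timeout_if_possible(connection, timeout):
    try:
        sock = connection.sock        # hypothetical attribute path
    except AttributeError as exc:
        LOG.debug('Failed to get socket attribute: %s', exc)
        return False
    sock.settimeout(timeout)
    return True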

tags: removed: on-verification
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The issue probably exists in 7.0 as well, so we are going to backport the fix there.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Reproducing the original issue is pretty hard, but I can suggest a rather simple albeit hacky way to verify the fix. If anybody is interested, let me know.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/29040

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/29040
Submitter: Denis V. Meltsaykin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 19ef79278aab07bb2cc3b70ab8e7dbca0f648bb7
Author: Dmitry Mescheryakov <email address hidden>
Date: Mon May 22 15:31:42 2017

Correctly set socket timeout for publishing

Previously a wrong attribute was taken which didn't work

Closes-Bug: #1515154
Change-Id: I9dc2453be3047f993930c2b5c1e74720c8c3b26c
