Mirantis OpenStack

Instance failed to spawn after controllers reboot

Bug #1515154 reported by Ksenia Svechnikova on 2015-11-11

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 8.0
7.0.x	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 7.0-mu-8
8.0.x	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 8.0
9.x	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 9.0

Bug Description

7.0 MU1

Swarm test: https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/103/testReport/junit/(root)/auto_cic_maintenance_mode/auto_cic_maintenance_mode/

            1. Revert snapshot 3 ['controller', 'mongo'] +2 ['compute', 'cinder']
            2. reboot --force controller
            3. Wait until controller is switching in maintenance mode
            4. Exit maintenance mode
            5. Check the controller become available
            6. Run OSTF and repeat for other 2 nodes.

For the 3rd node OSTF smoke finally didn't pass

{
  "Create volume and boot instance from it (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
},
{
  "Create volume and attach it to instance (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
},
{
  "Check network connectivity from instance via floating IP (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
},
{
  "Launch instance (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
},
{
  "Launch instance with file injection (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
},
{
  "Launch instance, create snapshot, launch instance from snapshot (failure)": "Failed to get to expected status. In error state. Please refer to OpenStack logs for more details."
}
]

Errors in nova-compute "NovaException: Unexpected vif_type=binding_failed" :

https://paste.mirantis.net/show/1405/

Open vSwitch agents are marked as dead in neutron agent-list the same as others:

Workaround: Kill neutron-openvswitch-agent on each compute

Tags:

Revision history for this message

Ksenia Svechnikova (kdemina) wrote on 2015-11-11:

fail_error_auto_cic_maintenance_mode-fuel-snapshot-2015-11-10_12-24-34.tar.xz Edit (94.0 MiB, application/octet-stream)

summary:

- Instance failed to spawn after nodes reboot
+ Instance failed to spawn after controller's reboot

Ksenia Svechnikova (kdemina) on 2015-11-11

summary:

- Instance failed to spawn after controller's reboot
+ Instance failed to spawn after controllers reboot

slava valyavskiy (slava-val-al) on 2015-11-11

Changed in mos:
status:	New → Confirmed

Revision history for this message

Ksenia Svechnikova (kdemina) wrote on 2015-11-11:

Also we can see crush report in rabbit logs during the test execution(https://bugs.launchpad.net/fuel/+bug/1513511), but the test firstly run HA test and verify rabbit

2015-11-10 12:06:21,821 - INFO fuel_web_client.py:243 -- OSTF test statuses are : {
"Check if amount of tables in databases is the same on each node": "success",
"RabbitMQ replication": "success",
"Check pacemaker status": "success",
"RabbitMQ availability": "success",
"Check galera environment state": "success",
"Check data replication over mysql": "success"
}

and then start ['smoke', 'sanity']

In reverted snapshot I've seen that rabbitmq was not stopped

Andrey Epifanov (aepifanov) on 2015-11-12

Changed in mos:
assignee:	MOS Neutron (mos-neutron) → Andrey Epifanov (aepifanov)

Revision history for this message

Ksenia Svechnikova (kdemina) wrote on 2015-11-12:

One one hand this issue is exactly new one, as we didn't see such behavior previously during. On the other, it's floating one, as in today's swarm it was not reproduced

https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.cic_maintenance_mode/104/console test auto_cic_maintenance_mode passed

Due to the reason we have a workaround and the issue if floating, I can suggest to change the priority to High

Ksenia Svechnikova (kdemina) on 2015-11-12

Changed in mos:
importance:	Critical → High

Andrey Epifanov (aepifanov) on 2015-11-12

Changed in mos:
status:	Confirmed → In Progress

Alexander Ignatov (aignatov) on 2015-11-12

Changed in mos:
assignee:	Andrey Epifanov (aepifanov) → Dmitry Mescheryakov (dmitrymex)

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2015-12-15:

We had a reproduction of the issue on 100 nodes scale today: many agents were marked as dead, restarting agents helped. In the log of an agent one can see that the last thing it did with rabbitmq was successful reconnect:

http://paste.openstack.org/show/481941/

But for some reason it stopped sending periodic heartbeats and hence was marked as dead. lsof output shows that connections to RabbitMQ are live:

root@node-413:~# lsof | grep 11262 | grep 5673
neutron-l 11262 neutron 10u IPv4 75449293 0t0 TCP node-413.domain.tld:53794->node-452.domain.tld:5673 (ESTABLISHED)
neutron-l 11262 neutron 11u IPv4 75460227 0t0 TCP node-413.domain.tld:54068->node-452.domain.tld:5673 (ESTABLISHED)
neutron-l 11262 neutron 18u IPv4 75461014 0t0 TCP node-413.domain.tld:54187->node-452.domain.tld:5673 (ESTABLISHED)

here 11262 is pid of an l3 agent

Revision history for this message

Davanum Srinivas (DIMS) (dims-v) wrote on 2015-12-22:

We should switch on eventlet backdoor so when the process is hung we should be able to look at where it is stuck. We'll need neutron teams help with the reproduction

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2015-12-28:

Current status: Ksenia Demina right now tries to reproduce the issue with the backdoor enabled. The process goes slowly because of CI issues.

Revision history for this message

Ksenia Svechnikova (kdemina) wrote on 2016-01-11:

The test was rerun more that 20 times, but this issue was not reproduced on 7.0 MU. Change status to Invalid

Roman Podoliaka (rpodolyaka) on 2016-02-02

tags:

added: area-oslo hit-hcf
removed: neutron

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

The issue reappeared again on 8.0. This time we were able to get current state of all eventlet threads of a hanging openvswitch agent. Attached are four independent snapshots of state. It can be seen that only thread #0 and thread #12 changed their state. It also can be seen that thread #2 is report_state thread and it is hanging trying to send a message to RabbitMQ. Specifically, it hangs waiting for confirmation from RabbitMQ that message was received and saved (publisher confirms are enabled).

So, basically the issue appears when RabbitMQ forgets to send confirmation back to the client. This time the issue occurred due to https://github.com/rabbitmq/rabbitmq-server/issues/255 . Basically, we need to ensure that oslo.messaging times out when RabbitMQ forgets to do its job.

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

openvswitch.log.1 Edit (36.5 KiB, text/plain)

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

#10

openvswitch.log.2 Edit (36.4 KiB, text/plain)

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

#11

openvswitch.log.3 Edit (36.4 KiB, text/plain)

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-02:

#12

openvswitch.log.4 Edit (36.2 KiB, text/plain)

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-02-02: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

#13

Reviewed: https://review.fuel-infra.org/16605
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: aa57f058cd1037b7793db882c1c377b6dc3c0fc7
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Feb 2 13:35:11 2016

Correctly set socket timeout for publishing

Previously a wrong attribute was taken which didn't work

Closes-Bug: #1515154
Change-Id: I9dc2453be3047f993930c2b5c1e74720c8c3b26c

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-03:

#14

The fix https://review.fuel-infra.org/#/c/16605/ is merged

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-03:

#15

Neutron team, can we enable heartbeats in Neutron in 9.0? I would prefer that solution.

Maksym Strukov (unbelll) on 2016-02-15

tags:

added: on-verification

Revision history for this message

Maksym Strukov (unbelll) wrote on 2016-02-15:

#16

Verified as fixed in 8.0-566

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-02-29:

#17

The fix was merged into Mitaka as well: https://review.openstack.org/#/c/278275/
But still, it would be preferable for Neutron to support heartbeats.

Revision history for this message

Alexey Galkin (agalkin) wrote on 2016-04-22:

#18

Verified as fixed in 9.0-220.

root@node-9:~# cat /var/log/neutron-all.log | grep "Failed to get socket attribute"

<167>Apr 22 15:26:01 node-9 neutron-server: 2016-04-22 15:26:01.288 23838 DEBUG oslo.messaging._drivers.impl_rabbit [req-b7e6eadc-eda7-4d4e-a5de-f18b982e74f3 - - - - -] Failed to get socket attribute: 'NoneType' object has no attribute 'sock' set_transport_socket_timeout /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:868

Error message have a DEBUG status, it confirms that "AttributeError" exceptions are caught and successfully handled.

tags:

removed: on-verification

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-11-30:

#19

The issue probably exists in 7.0 as well, so we are going to backport the fix here.

Revision history for this message

Dmitry Mescheryakov (dmitrymex) wrote on 2016-11-30:

#20

Reproduction of the original issue is pretty hard and I can suggest a rather simple but hacky way to verify. If anybody is interested, let me know.

Revision history for this message

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2016-11-30: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

#21

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Dmitry Mescheryakov <email address hidden>
Review: https://review.fuel-infra.org/29040