OpenStack: cable pull test for cluster and MGT on active controller cause both controllers as not active

Bug #1840182 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Won't Fix
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
When I pulled the cluster and MGT cables together on the active controller (controller-1) and then put them back, the active controller (controller-1) got rebooted and controller-0 became active as expected. After some time, both controller-0 and controller-1 went into a state where neither of them was active. There was no recovery; they did not reboot themselves.
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-1:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

Bin did an analysis and found the following:

| 2019-08-14T16:04:12.516 | 1161 | service-group-scn | vim-services | active-failed | disabling-failed | vim(disabled, failed)
| 2019-08-14T16:04:12.711 | 1190 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
| 2019-08-14T16:04:13.795 | 1220 | service-group-scn | vim-services | disabling-failed | disabled-failed | vim(disabled, )
| 2019-08-14T16:04:15.311 | 1225 | service-scn | mtc-agent | disabling | disabled | disable success
| 2019-08-14T16:09:16.768 | 1282 | service-scn | mtc-agent | disabled | enabling-throttle | enabled-active state requested
| 2019-08-14T16:09:22.312 | 1366 | service-scn | mtc-agent | enabling-throttle | enabling | enabled-active state requested
| 2019-08-14T16:09:23.354 | 1373 | service-scn | mtc-agent | enabling | enabled-active | enable success

There was an issue with the VIM on controller-0:

2019-08-14T16:00:06.686 controller-0 VIM_Thread[333973] ERROR Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 161, in process_main
    init_complete = process_initialize()
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 79, in process_initialize
    if not nfvi.nfvi_initialize(config.CONF['nfvi']):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_module.py", line 80, in nfvi_initialize
    _task_worker_pools['compute'])
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_compute_module.py", line 464, in nfvi_compute_initialize
    _compute_plugin.initialize(config['config_file'])
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_plugin.py", line 94, in initialize
    self._plugin.obj.initialize(config_file)
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_compute_api.py", line 3393, in initialize
    'notifications.nfvi_nova_listener_queue')
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/openstack/rpc_listener.py", line 42, in __init__
    self._consumer = Consumer(self._connection, self._rpc_receive_queue)
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 407, in revive
    self.declare()
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 420, in declare
    queue.declare()
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare
    nowait=nowait,
  File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1165, in queue_declare
    spec.Queue.DeclareOk, returns_tuple=True,
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 93, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
    return self.blocking_read(timeout)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
    method_sig, payload, content,
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
    listener(*args)
  File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 293, in _on_close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible
2019-08-14T16:00:06.690 controller-0 VIM_Image-Worker-0_Thread[334031] INFO _thread.py.232 Thread Image-Worker-0: shutting down.
2019-08-14T16:00:06.691 controller-0 VIM_Image-Worker-0_Thread[334031] INFO _thread.py.236 Thread Image-Worker-0: shutdown.
2019-08-14T16:00:06.698 controller-0 VIM_Compute-Worker-0_Thread[334037] INFO _thread.py.232 Thread Compute-Worker-0: shutting down.
2019-08-14T16:00:06.699 controller-0 VIM_Compute-Worker-0_Thread[334037] INFO _thread.py.236 Thread Compute-Worker-0: shutdown.
2019-08-14T16:00:06.706 controller-0 VIM_Compute-Worker-1_Thread[334038] INFO _thread.py.232 Thread Compute-Worker-1: shutting down.
2019-08-14T16:00:06.707 controller-0 VIM_Compute-Worker-1_Thread[334038] INFO _thread.py.236 Thread Compute-Worker-1: shutdown.
2019-08-14T16:00:06.713 controller-0 VIM_BlockStorage-Worker-0_Thread[334034] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2019-08-14T16:00:06.714 controller-0 VIM_BlockStorage-Worker-0_Thread[334034] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.
2019-08-14T16:00:06.717 controller-0 VIM_Identity-Worker-0_Thread[334027] INFO _thread.py.232 Thread Identity-Worker-0: shutting down.
2019-08-14T16:00:06.717 controller-0 VIM_Identity-Worker-0_Thread[334027] INFO _thread.py.236 Thread Identity-Worker-0: shutdown.
2019-08-14T16:00:06.724 controller-0 VIM_Alarm_Thread[334024] INFO _thread.py.232 Thread Alarm: shutting down.
2019-08-14T16:00:06.724 controller-0 VIM_Alarm_Thread[334024] INFO _thread.py.236 Thread Alarm: shutdown.
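For illustration, here is a minimal kombu sketch of the pattern that fails in the traceback above (this is not the VIM code; the broker URL, credentials and host name are placeholders). Constructing a Consumer auto-declares its queues, and declaring a durable queue whose home node is down makes RabbitMQ answer 404 NOT_FOUND instead of recreating the queue.

# Minimal sketch of the failing pattern; connection details are placeholders.
from amqp.exceptions import NotFound
from kombu import Connection, Consumer, Queue

# Durable queues are pinned to a "home node" in the RabbitMQ cluster. If that
# node is down, re-declaring the queue fails rather than recreating it.
queue = Queue('notifications.nfvi_nova_listener_queue', durable=True)

with Connection('amqp://user:password@rabbit-host:5672/nova') as conn:
    try:
        # Consumer.__init__ -> revive() -> declare() -> Queue.declare(),
        # the call chain shown in the traceback above.
        consumer = Consumer(conn, queues=[queue])
    except NotFound as exc:
        # e.g. "Queue.declare: (404) NOT_FOUND - home node ... of durable
        # queue 'notifications.nfvi_nova_listener_queue' ... is down"
        print('queue declare failed: %s' % exc)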

The mtcAgent was restarted as part of failure recovery, but it did not send a node update to SM to indicate that controller-1 is unlocked-enabled. SM on controller-0 therefore does not know that controller-1 is enabled; it still has the controller-1 state as failed, and in this case the uncontrolled swact was not successful.

There are two issues. The first is obviously that the VIM had a problem going active after controller-0 took over as the active controller. The second is that an uncontrolled swact should not start before the standby controller is confirmed enabled.

controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-1:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:

-Service_Groups------------------------------------------------------------------------
oam-services disabled disabled
controller-services disabled disabled
cloud-services disabled disabled
patching-services active active
directory-services active active
web-services active active
storage-services active active
storage-monitoring-services active active
vim-services disabled disabled failed
---------------------------------------------------------------------------------------

-Services------------------------------------------------------------------------------
oam-ip disabled disabled
management-ip disabled disabled
drbd-pg disabled disabled
drbd-rabbit disabled disabled
drbd-platform disabled disabled
pg-fs disabled disabled
rabbit-fs disabled disabled
nfs-mgmt disabled disabled
platform-fs disabled disabled
postgres disabled disabled
rabbit disabled disabled
platform-export-fs disabled disabled
platform-nfs-ip disabled disabled
sysinv-inv disabled disabled
sysinv-conductor disabled disabled
mtc-agent disabled disabled
hw-mon disabled disabled
dnsmasq disabled disabled
fm-mgr disabled disabled
keystone disabled disabled
open-ldap enabled-active enabled-active
snmp disabled disabled
lighttpd enabled-active enabled-active
horizon enabled-active enabled-active
patch-alarm-manager enabled-active enabled-active
mgr-restful-plugin enabled-active enabled-active
ceph-manager enabled-active enabled-active
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
guest-agent disabled disabled
haproxy disabled disabled
pxeboot-ip disabled disabled
drbd-extension disabled disabled
extension-fs disabled disabled
extension-export-fs disabled disabled
etcd disabled disabled
drbd-etcd disabled disabled
etcd-fs disabled disabled
barbican-api disabled disabled
barbican-keystone-listener disabled disabled
barbican-worker disabled disabled
cluster-host-ip disabled disabled
docker-distribution disabled disabled
dockerdistribution-fs disabled disabled
drbd-dockerdistribution disabled disabled
helmrepository-fs disabled disabled
registry-token-server disabled disabled

controller-1:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:

-Service_Groups------------------------------------------------------------------------
oam-services standby standby
controller-services standby standby
cloud-services standby standby
patching-services standby standby
directory-services disabled disabled
web-services disabled disabled
storage-services disabled disabled
storage-monitoring-services standby standby
vim-services standby standby
---------------------------------------------------------------------------------------

-Services------------------------------------------------------------------------------
oam-ip enabled-standby disabled
management-ip enabled-standby disabled
drbd-pg enabled-standby enabled-standby
drbd-rabbit enabled-standby enabled-standby
drbd-platform enabled-standby enabled-standby
pg-fs enabled-standby disabled
rabbit-fs enabled-standby disabled
nfs-mgmt enabled-standby disabled
platform-fs enabled-standby disabled
postgres enabled-standby disabled
rabbit enabled-standby disabled
platform-export-fs enabled-standby disabled
platform-nfs-ip enabled-standby disabled
sysinv-inv enabled-standby disabled
sysinv-conductor enabled-standby disabled
mtc-agent enabled-standby disabled
hw-mon enabled-standby disabled
dnsmasq enabled-standby disabled
fm-mgr enabled-standby disabled
keystone enabled-standby disabled
open-ldap disabled disabled
snmp enabled-standby disabled
lighttpd disabled disabled
horizon disabled disabled
patch-alarm-manager enabled-standby disabled
mgr-restful-plugin disabled disabled
ceph-manager enabled-standby disabled
vim enabled-standby disabled
vim-api enabled-standby disabled
vim-webserver enabled-standby disabled
guest-agent enabled-standby disabled
haproxy enabled-standby disabled
pxeboot-ip enabled-standby disabled
drbd-extension enabled-standby enabled-standby
extension-fs enabled-standby disabled
extension-export-fs enabled-standby disabled
etcd enabled-standby disabled
drbd-etcd enabled-standby enabled-standby
etcd-fs enabled-standby disabled
barbican-api enabled-standby disabled
barbican-keystone-listener enabled-standby disabled
barbican-worker enabled-standby disabled
cluster-host-ip enabled-standby disabled
docker-distribution enabled-standby disabled
dockerdistribution-fs enabled-standby disabled
drbd-dockerdistribution enabled-standby enabled-standby
helmrepository-fs enabled-standby disabled
registry-token-server enabled-standby disabled
------------------------------------------------------------------------------
controller-1:~$

Severity
--------
Major
Steps to Reproduce
------------------
1. Make sure the system is installed and in good health, with no alarms.
2. Pull the cables provisioned with the management and cluster networks from the active controller (the above test was on controller-1).
3. Put the cables back as they were before.
4. Controller-1 reboots and controller-0 becomes active.
5. After some time controller-0 was not active and the admin credentials were not working, as per the description.
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

Expected Behavior
------------------
Controller-0 is active and the OpenStack admin credentials can be loaded.

Actual Behavior
----------------
As per the description, there is no active controller.
Reproducibility
---------------
Not sure; tested once. With the same load, controller-0 rebooted and became disabled-failed.
System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-08-12 21:00:17 -0400

Last Pass
---------
2019-02-10_20-18-00

Timestamp/Logs
--------------
2019-08-14T16:00:06

Test Activity
-------------
Regression test

Revision history for this message
Bin Qian (bqian20) wrote :

I looked at the logs again; it actually did not have the second issue of an uncontrolled swact starting before the other controller was enabled. It was a service failure-recovery on a single controller (because SM did not receive the message that controller-1 was enabled, SM on controller-0 had the controller-1 state as unlocked-failed), with a 5-minute debounce wait time.

| 2019-08-14T16:09:16.662 | 1252 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:09:24.369 | 1384 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:14:33.679 | 1485 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:14:41.673 | 1617 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:19:46.528 | 1718 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:19:54.490 | 1850 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:25:03.220 | 1951 | service-group-scn | vim-services | disabled | go-active |

A vim patch should fix the issue reported.

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Can you clarify "infra"? There is no infra interface anymore.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote : Re: cable pull test for cluster and MGT on active controller cause both controllers are not active

I mean the cluster network. I have removed "infra" from the Launchpad text.

summary: - cable pull test for infra and MGT on active controller cause both
+ cable pull test for cluster and MGT on active controller cause both
controllers are not active
description: updated
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Numan Waheed (nwaheed)
summary: cable pull test for cluster and MGT on active controller cause both
- controllers are not active
+ controllers as not active
Revision history for this message
Ghada Khalil (gkhalil) wrote : Re: cable pull test for cluster and MGT on active controller cause both controllers as not active

Marking as stx.3.0 / high priority -- failure scenario not handled properly by the system.

tags: added: stx.nfv
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

The VIM on controller-0 is failing to connect to the nova notifications queue:

NotFound: Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible

This error indicates that the queue exists, but cannot be connected to. The rabbitmq logs indicate that rabbitmq is shutting down on controller-0:

{"log":"2019-08-14 15:58:15.535 [info] \u003c0.472.0\u003e Autoheal request received from '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.535800761Z"}
{"log":"2019-08-14 15:58:15.536 [info] \u003c0.472.0\u003e Autoheal decision\n","stream":"stdout","time":"2019-08-14T15:58:15.536559958Z"}
{"log":" * Partitions: [['<email address hidden>'],['<email address hidden>']]\n","stream":"stdout","time":"2019-08-14T15:58:15.536586158Z"}
{"log":" * Winner: '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.536601564Z"}
{"log":" * Losers: ['<email address hidden>']\n","stream":"stdout","time":"2019-08-14T15:58:15.536613184Z"}
{"log":"2019-08-14 15:58:15.536 [warning] \u003c0.472.0\u003e Autoheal: we were selected to restart; winner is '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.537049161Z"}
{"log":"2019-08-14 15:58:15.536 [info] \u003c0.20917.1\u003e RabbitMQ is asked to stop...\n","stream":"stdout","time":"2019-08-14T15:58:15.537061637Z"}
{"log":"2019-08-14 15:58:15.585 [info] \u003c0.20917.1\u003e Stopping RabbitMQ applications and their dependencies in the following order:\n","stream":"stdout","time":"2019-08-14T15:58:15.585461935Z"}
{"log":" rabbitmq_management\n","stream":"stdout","time":"2019-08-14T15:58:15.585474435Z"}
{"log":" amqp_client\n","stream":"stdout","time":"2019-08-14T15:58:15.585478042Z"}
{"log":" rabbitmq_web_dispatch\n","stream":"stdout","time":"2019-08-14T15:58:15.585481157Z"}
{"log":" cowboy\n","stream":"stdout","time":"2019-08-14T15:58:15.585484303Z"}
{"log":" cowlib\n","stream":"stdout","time":"2019-08-14T15:58:15.585491628Z"}
{"log":" rabbitmq_peer_discovery_k8s\n","stream":"stdout","time":"2019-08-14T15:58:15.58549482Z"}
{"log":" rabbitmq_peer_discovery_common\n","stream":"stdout","time":"2019-08-14T15:58:15.585497784Z"}
{"log":" rabbitmq_management_agent\n","stream":"stdout","time":"2019-08-14T15:58:15.585500727Z"}
{"log":" rabbit\n","stream":"stdout","time":"2019-08-14T15:58:15.585503554Z"}
{"log":" mnesia\n","stream":"stdout","time":"2019-08-14T15:58:15.585506253Z"}
{"log":" rabbit_common\n","stream":"stdout","time":"2019-08-14T15:58:15.585508974Z"}
{"log":" sysmon_handler\n","stream":"stdout","time":"2019-08-14T15:58:15.58...


Revision history for this message
Dariush Eslimi (deslimi) wrote :

Can we reproduce this and keep the lab in that state so we can debug further?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1835807

Agree with Dariush that we need to know if this is still reproducible with OpenStack Train.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :
Changed in starlingx:
status: Incomplete → Won't Fix
Yang Liu (yliu12)
summary: - cable pull test for cluster and MGT on active controller cause both
- controllers as not active
+ OpenStack: cable pull test for cluster and MGT on active controller
+ cause both controllers as not active
Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was closed as Won't Fix a long time ago; removing the stx.retestneeded tag

tags: removed: stx.retestneeded