cable pull test for cluster and MGT on active controller cause both controllers as not active

Bug #1840182 reported by Anujeyan Manokeran on 2019-08-14
This bug affects 1 person
Affects: StarlingX
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
When I pulled the cluster and MGT cables together on the active controller (controller-1) and then reconnected them, the active controller (controller-1) rebooted and controller-0 became active, as expected. After some time, both controller-0 and controller-1 went into a state where neither of them was active, and they did not recover by rebooting themselves.
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-1:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

Bin did an analysis and found the following:

| 2019-08-14T16:04:12.516 | 1161 | service-group-scn | vim-services | active-failed | disabling-failed | vim(disabled, failed)
| 2019-08-14T16:04:12.711 | 1190 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
| 2019-08-14T16:04:13.795 | 1220 | service-group-scn | vim-services | disabling-failed | disabled-failed | vim(disabled, )
| 2019-08-14T16:04:15.311 | 1225 | service-scn | mtc-agent | disabling | disabled | disable success
| 2019-08-14T16:09:16.768 | 1282 | service-scn | mtc-agent | disabled | enabling-throttle | enabled-active state requested
| 2019-08-14T16:09:22.312 | 1366 | service-scn | mtc-agent | enabling-throttle | enabling | enabled-active state requested
| 2019-08-14T16:09:23.354 | 1373 | service-scn | mtc-agent | enabling | enabled-active | enable success

There was an issue with the VIM on controller-0:

2019-08-14T16:00:06.686 controller-0 VIM_Thread[333973] ERROR Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 161, in process_main
    init_complete = process_initialize()
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 79, in process_initialize
    if not nfvi.nfvi_initialize(config.CONF['nfvi']):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_module.py", line 80, in nfvi_initialize
    _task_worker_pools['compute'])
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_compute_module.py", line 464, in nfvi_compute_initialize
    _compute_plugin.initialize(config['config_file'])
  File "/usr/lib64/python2.7/site-packages/nfv_vim/nfvi/_nfvi_plugin.py", line 94, in initialize
    self._plugin.obj.initialize(config_file)
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_compute_api.py", line 3393, in initialize
    'notifications.nfvi_nova_listener_queue')
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/openstack/rpc_listener.py", line 42, in __init__
    self._consumer = Consumer(self._connection, self._rpc_receive_queue)
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 407, in revive
    self.declare()
  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 420, in declare
    queue.declare()
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare
    nowait=nowait,
  File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1165, in queue_declare
    spec.Queue.DeclareOk, returns_tuple=True,
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 93, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
    return self.blocking_read(timeout)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
    method_sig, payload, content,
  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
    listener(*args)
  File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 293, in _on_close
    reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible
2019-08-14T16:00:06.690 controller-0 VIM_Image-Worker-0_Thread[334031] INFO _thread.py.232 Thread Image-Worker-0: shutting down.
2019-08-14T16:00:06.691 controller-0 VIM_Image-Worker-0_Thread[334031] INFO _thread.py.236 Thread Image-Worker-0: shutdown.
2019-08-14T16:00:06.698 controller-0 VIM_Compute-Worker-0_Thread[334037] INFO _thread.py.232 Thread Compute-Worker-0: shutting down.
2019-08-14T16:00:06.699 controller-0 VIM_Compute-Worker-0_Thread[334037] INFO _thread.py.236 Thread Compute-Worker-0: shutdown.
2019-08-14T16:00:06.706 controller-0 VIM_Compute-Worker-1_Thread[334038] INFO _thread.py.232 Thread Compute-Worker-1: shutting down.
2019-08-14T16:00:06.707 controller-0 VIM_Compute-Worker-1_Thread[334038] INFO _thread.py.236 Thread Compute-Worker-1: shutdown.
2019-08-14T16:00:06.713 controller-0 VIM_BlockStorage-Worker-0_Thread[334034] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2019-08-14T16:00:06.714 controller-0 VIM_BlockStorage-Worker-0_Thread[334034] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.
2019-08-14T16:00:06.717 controller-0 VIM_Identity-Worker-0_Thread[334027] INFO _thread.py.232 Thread Identity-Worker-0: shutting down.
2019-08-14T16:00:06.717 controller-0 VIM_Identity-Worker-0_Thread[334027] INFO _thread.py.236 Thread Identity-Worker-0: shutdown.
2019-08-14T16:00:06.724 controller-0 VIM_Alarm_Thread[334024] INFO _thread.py.232 Thread Alarm: shutting down.
2019-08-14T16:00:06.724 controller-0 VIM_Alarm_Thread[334024] INFO _thread.py.236 Thread Alarm: shutdown.
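
For reference, the failing call path in the traceback above (rpc_listener.py creating a kombu Consumer, which auto-declares the durable queue) can be approximated with the minimal sketch below. The broker URL, exchange name and routing key are assumptions for illustration only; the point is that declaring a durable queue whose home node is down makes the broker answer with 404 NOT_FOUND, which py-amqp raises as NotFound and which aborts the VIM's process_initialize.

# Minimal sketch, not VIM code: approximates rpc_listener.py's queue setup.
# RABBIT_URL, the exchange and the routing key below are assumptions.
from amqp.exceptions import NotFound
from kombu import Connection, Consumer, Exchange, Queue

RABBIT_URL = 'amqp://guest:guest@rabbit-host:5672/nova'      # hypothetical
queue = Queue('notifications.nfvi_nova_listener_queue',
              exchange=Exchange('nova', type='topic'),       # assumed exchange
              routing_key='notifications.info',              # assumed routing key
              durable=True)

with Connection(RABBIT_URL) as connection:
    try:
        # Consumer() auto-declares the queue; this is the kombu call chain
        # (Consumer.__init__ -> revive -> declare -> Queue.declare) seen above.
        Consumer(connection, queues=[queue])
    except NotFound as exc:
        # The broker replies (404) NOT_FOUND when the durable queue's home
        # node is down or inaccessible, matching the VIM error in the logs.
        print('queue declare failed: %s' % exc)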

The mtcAgent was restarted as part of failure recovery and did not send a node update to SM to report that controller-1 was unlocked-enabled. SM on controller-0 therefore did not know that controller-1 was enabled;
it still had controller-1 in the failed state, and in this case the uncontrolled swact was not successful.

There are two issues. The first is that the VIM clearly had a problem going active after controller-0 took over as the active controller. The second is that an uncontrolled swact should not start before the standby controller is confirmed enabled.
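
As a purely hypothetical illustration of the second point (not SM's actual implementation, whose internals are not quoted in this report), the missing precondition amounts to a gate of roughly this shape:

# Hypothetical sketch only; names and fields are assumptions, not SM source.
def can_start_uncontrolled_swact(standby):
    # Without the node update from the restarted mtcAgent, SM on controller-0
    # still sees controller-1 as failed, so this gate would (correctly) block
    # the swact until the standby is confirmed unlocked-enabled.
    return standby.admin_state == 'unlocked' and standby.oper_state == 'enabled'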

controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-1:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:

-Service_Groups------------------------------------------------------------------------
oam-services disabled disabled
controller-services disabled disabled
cloud-services disabled disabled
patching-services active active
directory-services active active
web-services active active
storage-services active active
storage-monitoring-services active active
vim-services disabled disabled failed
---------------------------------------------------------------------------------------

-Services------------------------------------------------------------------------------
oam-ip disabled disabled
management-ip disabled disabled
drbd-pg disabled disabled
drbd-rabbit disabled disabled
drbd-platform disabled disabled
pg-fs disabled disabled
rabbit-fs disabled disabled
nfs-mgmt disabled disabled
platform-fs disabled disabled
postgres disabled disabled
rabbit disabled disabled
platform-export-fs disabled disabled
platform-nfs-ip disabled disabled
sysinv-inv disabled disabled
sysinv-conductor disabled disabled
mtc-agent disabled disabled
hw-mon disabled disabled
dnsmasq disabled disabled
fm-mgr disabled disabled
keystone disabled disabled
open-ldap enabled-active enabled-active
snmp disabled disabled
lighttpd enabled-active enabled-active
horizon enabled-active enabled-active
patch-alarm-manager enabled-active enabled-active
mgr-restful-plugin enabled-active enabled-active
ceph-manager enabled-active enabled-active
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
guest-agent disabled disabled
haproxy disabled disabled
pxeboot-ip disabled disabled
drbd-extension disabled disabled
extension-fs disabled disabled
extension-export-fs disabled disabled
etcd disabled disabled
drbd-etcd disabled disabled
etcd-fs disabled disabled
barbican-api disabled disabled
barbican-keystone-listener disabled disabled
barbican-worker disabled disabled
cluster-host-ip disabled disabled
docker-distribution disabled disabled
dockerdistribution-fs disabled disabled
drbd-dockerdistribution disabled disabled
helmrepository-fs disabled disabled
registry-token-server disabled disabled

controller-1:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:

-Service_Groups------------------------------------------------------------------------
oam-services standby standby
controller-services standby standby
cloud-services standby standby
patching-services standby standby
directory-services disabled disabled
web-services disabled disabled
storage-services disabled disabled
storage-monitoring-services standby standby
vim-services standby standby
---------------------------------------------------------------------------------------

-Services------------------------------------------------------------------------------
oam-ip enabled-standby disabled
management-ip enabled-standby disabled
drbd-pg enabled-standby enabled-standby
drbd-rabbit enabled-standby enabled-standby
drbd-platform enabled-standby enabled-standby
pg-fs enabled-standby disabled
rabbit-fs enabled-standby disabled
nfs-mgmt enabled-standby disabled
platform-fs enabled-standby disabled
postgres enabled-standby disabled
rabbit enabled-standby disabled
platform-export-fs enabled-standby disabled
platform-nfs-ip enabled-standby disabled
sysinv-inv enabled-standby disabled
sysinv-conductor enabled-standby disabled
mtc-agent enabled-standby disabled
hw-mon enabled-standby disabled
dnsmasq enabled-standby disabled
fm-mgr enabled-standby disabled
keystone enabled-standby disabled
open-ldap disabled disabled
snmp enabled-standby disabled
lighttpd disabled disabled
horizon disabled disabled
patch-alarm-manager enabled-standby disabled
mgr-restful-plugin disabled disabled
ceph-manager enabled-standby disabled
vim enabled-standby disabled
vim-api enabled-standby disabled
vim-webserver enabled-standby disabled
guest-agent enabled-standby disabled
haproxy enabled-standby disabled
pxeboot-ip enabled-standby disabled
drbd-extension enabled-standby enabled-standby
extension-fs enabled-standby disabled
extension-export-fs enabled-standby disabled
etcd enabled-standby disabled
drbd-etcd enabled-standby enabled-standby
etcd-fs enabled-standby disabled
barbican-api enabled-standby disabled
barbican-keystone-listener enabled-standby disabled
barbican-worker enabled-standby disabled
cluster-host-ip enabled-standby disabled
docker-distribution enabled-standby disabled
dockerdistribution-fs enabled-standby disabled
drbd-dockerdistribution enabled-standby enabled-standby
helmrepository-fs enabled-standby disabled
registry-token-server enabled-standby disabled
------------------------------------------------------------------------------
controller-1:~$

Severity
--------
Major
Steps to Reproduce
------------------
1. Make sure the system is installed and in good health, with no alarms.
2. Pull the cables provisioned for the management and cluster networks from the active controller (the test above was on controller-1).
3. Reconnect the cables as they were before.
4. Controller-1 reboots and controller-0 becomes active.
5. After some time controller-0 was not active and the admin credentials did not work, as per the description.
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

Expected Behavior
------------------
Controller-0 is active and the OpenStack admin credentials can be loaded.

Actual Behavior
----------------
As per the description, there is no active controller.
Reproducibility
---------------
Not sure; tested once with the same load, controller-0 rebooted and became disabled-failed.
System Configuration
--------------------
Storage system

Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-08-12 21:00:17 -0400

Last Pass
---------
2019-02-10_20-18-00

Timestamp/Logs
--------------
2019-08-14T16:00:06

Test Activity
-------------
Regression test

Anujeyan Manokeran (anujeyan) wrote :
Bin Qian (bqian20) wrote :

I looked at the logs again; it actually did not have the second issue of an uncontrolled swact starting before the other controller was enabled. It was a service failure-recovery on a single controller (since SM did not receive the message that controller-1 was enabled, SM on controller-0 had controller-1 in the unlocked-failed state), with a 5-minute debounce wait time.

| 2019-08-14T16:09:16.662 | 1252 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:09:24.369 | 1384 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:14:33.679 | 1485 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:14:41.673 | 1617 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:19:46.528 | 1718 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:19:54.490 | 1850 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:25:03.220 | 1951 | service-group-scn | vim-services | disabled | go-active |

A VIM patch should fix the reported issue.
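
As an illustrative sketch of the pattern visible in the log excerpt above (the vim-services group repeatedly going disabled -> go-active -> active, roughly five minutes apart), the failure-recovery behaviour Bin describes is essentially a debounced retry loop. Names and structure below are assumptions for illustration, not SM code:

import time

DEBOUNCE_SECONDS = 5 * 60  # matches the roughly five-minute gaps in the log above

def recover_service_group(go_active, stays_healthy):
    # Hypothetical sketch, not SM source: single-controller failure recovery
    # keeps retrying go-active, waiting out the debounce interval after each
    # failed recovery attempt.
    while True:
        go_active()                   # e.g. 16:09:16 disabled -> go-active -> active
        if stays_healthy():           # recovery sticks: stop retrying
            return
        time.sleep(DEBOUNCE_SECONDS)  # next attempts at ~16:14:33, ~16:19:46, ...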

Brent Rowsell (brent-rowsell) wrote :

Can you clarify "infra"? There is no infra interface anymore.

I meant the cluster network. I have removed "infra" from the Launchpad text.

summary: - cable pull test for infra and MGT on active controller cause both
+ cable pull test for cluster and MGT on active controller cause both
controllers are not active
description: updated
Numan Waheed (nwaheed) on 2019-08-16
tags: added: stx.retestneeded
Numan Waheed (nwaheed) on 2019-08-16
summary: cable pull test for cluster and MGT on active controller cause both
- controllers are not active
+ controllers as not active
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority -- failure scenario not handled properly by the system.

tags: added: stx.nfv
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bart Wensley (bartwensley)
Bart Wensley (bartwensley) wrote :

The VIM on controller-0 is failing to connect to the nova notifications queue:

NotFound: Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.nfvi_nova_listener_queue' in vhost 'nova' is down or inaccessible

This error indicates that the queue exists, but cannot be connected to. The rabbitmq logs indicate that rabbitmq is shutting down on controller-0:

{"log":"2019-08-14 15:58:15.535 [info] \u003c0.472.0\u003e Autoheal request received from '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.535800761Z"}
{"log":"2019-08-14 15:58:15.536 [info] \u003c0.472.0\u003e Autoheal decision\n","stream":"stdout","time":"2019-08-14T15:58:15.536559958Z"}
{"log":" * Partitions: [['<email address hidden>'],['<email address hidden>']]\n","stream":"stdout","time":"2019-08-14T15:58:15.536586158Z"}
{"log":" * Winner: '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.536601564Z"}
{"log":" * Losers: ['<email address hidden>']\n","stream":"stdout","time":"2019-08-14T15:58:15.536613184Z"}
{"log":"2019-08-14 15:58:15.536 [warning] \u003c0.472.0\u003e Autoheal: we were selected to restart; winner is '<email address hidden>'\n","stream":"stdout","time":"2019-08-14T15:58:15.537049161Z"}
{"log":"2019-08-14 15:58:15.536 [info] \u003c0.20917.1\u003e RabbitMQ is asked to stop...\n","stream":"stdout","time":"2019-08-14T15:58:15.537061637Z"}
{"log":"2019-08-14 15:58:15.585 [info] \u003c0.20917.1\u003e Stopping RabbitMQ applications and their dependencies in the following order:\n","stream":"stdout","time":"2019-08-14T15:58:15.585461935Z"}
{"log":" rabbitmq_management\n","stream":"stdout","time":"2019-08-14T15:58:15.585474435Z"}
{"log":" amqp_client\n","stream":"stdout","time":"2019-08-14T15:58:15.585478042Z"}
{"log":" rabbitmq_web_dispatch\n","stream":"stdout","time":"2019-08-14T15:58:15.585481157Z"}
{"log":" cowboy\n","stream":"stdout","time":"2019-08-14T15:58:15.585484303Z"}
{"log":" cowlib\n","stream":"stdout","time":"2019-08-14T15:58:15.585491628Z"}
{"log":" rabbitmq_peer_discovery_k8s\n","stream":"stdout","time":"2019-08-14T15:58:15.58549482Z"}
{"log":" rabbitmq_peer_discovery_common\n","stream":"stdout","time":"2019-08-14T15:58:15.585497784Z"}
{"log":" rabbitmq_management_agent\n","stream":"stdout","time":"2019-08-14T15:58:15.585500727Z"}
{"log":" rabbit\n","stream":"stdout","time":"2019-08-14T15:58:15.585503554Z"}
{"log":" mnesia\n","stream":"stdout","time":"2019-08-14T15:58:15.585506253Z"}
{"log":" rabbit_common\n","stream":"stdout","time":"2019-08-14T15:58:15.585508974Z"}
{"log":" sysmon_handler\n","stream":"stdout","time":"2019-08-14T15:58:15.58...

Dariush Eslimi (deslimi) wrote :

Can we reproduce this and keep the lab in that state so we can debug further?

Ghada Khalil (gkhalil) wrote :

This sounds similar to https://bugs.launchpad.net/starlingx/+bug/1835807

I agree with Dariush that we need to know whether this is still reproducible with OpenStack Train.

Changed in starlingx:
status: Triaged → Incomplete