OpenStack: cable pull test for cluster and MGT on active controller causes both controllers to be not active
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Won't Fix | High | Bart Wensley |
Bug Description
Brief Description
-----------------
When I pull the cluster and MGT cables together on the active controller, neither controller ends up active afterwards:
controller-0:~$ source /etc/platform/
Openstack Admin credentials can only be loaded from the active controller.
controller-1:~$ source /etc/platform/
Openstack Admin credentials can only be loaded from the active controller.
Bin did the analysis and found the following.
(log entries from 2019-08-14, truncated)
There was an issue with the VIM on controller-0. Its 2019-08-14 log shows:
Traceback (most recent call last):
File "/usr/lib64/
init_complete = process_
File "/usr/lib64/
if not nfvi.nfvi_
File "/usr/lib64/
_task_
File "/usr/lib64/
_compute_
File "/usr/lib64/
self.
File "/usr/lib64/
'notificati
File "/usr/lib64/
self._consumer = Consumer(
File "/usr/lib/
self.
File "/usr/lib/
self.declare()
File "/usr/lib/
queue.declare()
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
nowait=nowait,
File "/usr/lib/
spec.
File "/usr/lib/
self.
File "/usr/lib/
return self.blocking_
File "/usr/lib/
return self.on_
File "/usr/lib/
callback(
File "/usr/lib/
method_sig, payload, content,
File "/usr/lib/
listener(*args)
File "/usr/lib/
reply_code, reply_text, (class_id, method_id), ChannelError,
NotFound: Queue.declare: (404) NOT_FOUND - home node '<email address hidden>' of durable queue 'notifications.
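The NotFound above is standard RabbitMQ behavior: a durable queue has a single "home" broker node, and while that node is unavailable (here, after the cable pull and failover), any attempt to re-declare the queue is refused with 404 NOT_FOUND. A minimal sketch of how a kombu consumer hits this (illustrative only; the broker URL, exchange, and queue names are assumptions, not taken from the VIM code):

from kombu import Connection, Consumer, Exchange, Queue
from amqp.exceptions import NotFound

exchange = Exchange('nova', type='topic')
# A durable queue is pinned to its home RabbitMQ node; if that node is
# gone, re-declaring the queue fails with 404 NOT_FOUND as above.
queue = Queue('notifications.info', exchange,
              routing_key='notifications.info', durable=True)

with Connection('amqp://guest:guest@localhost//') as conn:
    channel = conn.channel()
    try:
        # With a channel supplied and auto_declare at its default (True),
        # kombu's Consumer declares its queues immediately, which is the
        # point where the VIM traceback above raised NotFound.
        consumer = Consumer(channel, queues=[queue],
                            callbacks=[lambda body, msg: msg.ack()])
    except NotFound as exc:
        print('queue declare failed: %s' % exc)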
(further log entries from 2019-08-14, truncated)
The mtcAgent was restarted as part of failure recovery, but it did not send a node update to SM to indicate that controller-1 was unlocked-enabled. SM on controller-0 therefore did not know that controller-1 was enabled.
It still had the controller-1 state as failed, and in this case the uncontrolled swact was not successful.
There are two issues: first, the VIM obviously had a problem going active after controller-0 took over as the active controller; second, an uncontrolled swact should not start before the standby controller is confirmed enabled.
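The second issue amounts to a missing precondition: an uncontrolled swact should only proceed once the standby controller is confirmed unlocked-enabled. A hypothetical sketch of such a guard in Python (the function and attribute names are illustrative, not SM's actual API):

def can_swact_to(standby_controller):
    # Hypothetical precondition, not SM's actual code: only allow a swact
    # if the standby controller has been confirmed unlocked-enabled,
    # e.g. via a node update from mtcAgent.
    return (standby_controller.admin_state == 'unlocked' and
            standby_controller.oper_state == 'enabled')

In the failure above, the unlocked-enabled update for controller-1 never reached SM, so from SM's point of view this precondition was never met.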
controller-0:~$ source /etc/platform/
Openstack Admin credentials can only be loaded from the active controller.
controller-1:~$ source /etc/platform/
Openstack Admin credentials can only be loaded from the active controller.
controller-0:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:
-Service_Groups-
oam-services disabled disabled
controller-services disabled disabled
cloud-services disabled disabled
patching-services active active
directory-services active active
web-services active active
storage-services active active
storage-
vim-services disabled disabled failed
-------
-Services-
oam-ip disabled disabled
management-ip disabled disabled
drbd-pg disabled disabled
drbd-rabbit disabled disabled
drbd-platform disabled disabled
pg-fs disabled disabled
rabbit-fs disabled disabled
nfs-mgmt disabled disabled
platform-fs disabled disabled
postgres disabled disabled
rabbit disabled disabled
platform-export-fs disabled disabled
platform-nfs-ip disabled disabled
sysinv-inv disabled disabled
sysinv-conductor disabled disabled
mtc-agent disabled disabled
hw-mon disabled disabled
dnsmasq disabled disabled
fm-mgr disabled disabled
keystone disabled disabled
open-ldap enabled-active enabled-active
snmp disabled disabled
lighttpd enabled-active enabled-active
horizon enabled-active enabled-active
patch-alarm-manager enabled-active enabled-active
mgr-restful-plugin enabled-active enabled-active
ceph-manager enabled-active enabled-active
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
guest-agent disabled disabled
haproxy disabled disabled
pxeboot-ip disabled disabled
drbd-extension disabled disabled
extension-fs disabled disabled
extension-export-fs disabled disabled
etcd disabled disabled
drbd-etcd disabled disabled
etcd-fs disabled disabled
barbican-api disabled disabled
barbican-
barbican-worker disabled disabled
cluster-host-ip disabled disabled
docker-distribution disabled disabled
dockerdistribut
drbd-dockerdist
helmrepository-fs disabled disabled
registry-
controller-1:~$ sudo sm-dump
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:
-Service_Groups-
oam-services standby standby
controller-services standby standby
cloud-services standby standby
patching-services standby standby
directory-services disabled disabled
web-services disabled disabled
storage-services disabled disabled
storage-
vim-services standby standby
-------
-Services-
oam-ip enabled-standby disabled
management-ip enabled-standby disabled
drbd-pg enabled-standby enabled-standby
drbd-rabbit enabled-standby enabled-standby
drbd-platform enabled-standby enabled-standby
pg-fs enabled-standby disabled
rabbit-fs enabled-standby disabled
nfs-mgmt enabled-standby disabled
platform-fs enabled-standby disabled
postgres enabled-standby disabled
rabbit enabled-standby disabled
platform-export-fs enabled-standby disabled
platform-nfs-ip enabled-standby disabled
sysinv-inv enabled-standby disabled
sysinv-conductor enabled-standby disabled
mtc-agent enabled-standby disabled
hw-mon enabled-standby disabled
dnsmasq enabled-standby disabled
fm-mgr enabled-standby disabled
keystone enabled-standby disabled
open-ldap disabled disabled
snmp enabled-standby disabled
lighttpd disabled disabled
horizon disabled disabled
patch-alarm-manager enabled-standby disabled
mgr-restful-plugin disabled disabled
ceph-manager enabled-standby disabled
vim enabled-standby disabled
vim-api enabled-standby disabled
vim-webserver enabled-standby disabled
guest-agent enabled-standby disabled
haproxy enabled-standby disabled
pxeboot-ip enabled-standby disabled
drbd-extension enabled-standby enabled-standby
extension-fs enabled-standby disabled
extension-export-fs enabled-standby disabled
etcd enabled-standby disabled
drbd-etcd enabled-standby enabled-standby
etcd-fs enabled-standby disabled
barbican-api enabled-standby disabled
barbican-
barbican-worker enabled-standby disabled
cluster-host-ip enabled-standby disabled
docker-distribution enabled-standby disabled
dockerdistribut
drbd-dockerdist
helmrepository-fs enabled-standby disabled
registry-
-------
controller-1:~$
Severity
--------
Major
Steps to Reproduce
------------------
1. Make sure the system is installed and in good health, with no alarms.
2. Pull the cables provisioned for the management and cluster networks from the active controller (in this test, controller-1).
3. Put the cables back as they were before.
4. Controller-1 reboots and controller-0 becomes active.
5. After some time, controller-0 is not active and the admin credentials do not work, as per the description:
controller-0:~$ source /etc/platform/
Openstack Admin credentials can only be loaded from the active controller.
Expected Behavior
------------------
Controller-0 becomes active and the OpenStack admin credentials can be loaded.
Actual Behavior
----------------
As per the description, there is no active controller.
Reproducibility
---------------
Not sure; tested once. With the same load, controller-0 rebooted and became disabled-failed.
System Configuration
--------------------
Storage system
Branch/Pull Time/Commit
-----------------------
BUILD_DATE= 2019-08-12 21:00:17 -0400
Last Pass
---------
2019-02-10_20-18-00
Timestamp/Logs
--------------
2019-08-14T16:00:06
Test Activity
-------------
Regression test
I looked at the logs again; it actually did not have the 2nd issue of an uncontrolled swact starting before the other controller was enabled. It was a service failure-recovery on a single controller (as SM did not receive the message that controller-1 was enabled, SM on controller-0 had the controller-1 state as unlocked-failed), with a 5-minute debounce wait time.
| 2019-08-14T16:09:16.662 | 1252 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:09:24.369 | 1384 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:14:33.679 | 1485 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:14:41.673 | 1617 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:19:46.528 | 1718 | service-group-scn | vim-services | disabled | go-active |
| 2019-08-14T16:19:54.490 | 1850 | service-group-scn | vim-services | go-active | active |
| 2019-08-14T16:25:03.220 | 1951 | service-group-scn | vim-services | disabled | go-active |
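The disabled / go-active / active cycle above repeats at roughly five-minute intervals, matching the 5-minute debounce wait mentioned in the comment. A minimal sketch of that debounce pattern (illustrative only, not SM's implementation):

import time

DEBOUNCE_SECS = 300  # the 5-minute debounce wait described above

class RecoveryDebounce(object):
    """Gate recovery retries to at most one attempt per debounce window."""

    def __init__(self, debounce_secs=DEBOUNCE_SECS):
        self._debounce_secs = debounce_secs
        self._last_attempt = None

    def ready(self):
        """Return True if enough time has passed to retry recovery."""
        now = time.monotonic()
        if (self._last_attempt is None or
                now - self._last_attempt >= self._debounce_secs):
            self._last_attempt = now
            return True
        return False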
A VIM patch should fix the reported issue.
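The content of that patch is not shown in this report; one plausible shape for it is to tolerate the 404 during startup and retry the queue declaration until RabbitMQ has recovered. A hypothetical sketch (the helper name and parameters are illustrative, not the actual fix):

import time

from amqp.exceptions import NotFound

def declare_with_retry(queue, retries=5, delay=2.0):
    # Hypothetical helper, not the actual StarlingX patch: retry declaring
    # a bound kombu Queue whose durable home node may still be recovering.
    for _ in range(retries):
        try:
            queue.declare()
            return True
        except NotFound:
            time.sleep(delay)  # home node still down; wait and retry
    return False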