After reboot active controller, controller in disabled/failed state

Bug #1829545 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tee Ngo
Milestone: (none)

Bug Description

Brief Description
-----------------
In a multi-node system, after rebooting the active controller, the controller remains in a disabled/failed state.

Severity
--------
Major

Steps to Reproduce
------------------
Reboot the active controller in a multi-node system and check its state with 'system host-list' (a minimal command sequence is sketched below).

TC-name: mtc/test_services_persists_over_reboot.py::test_system_persist_over_host_reboot[controller]
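
A minimal manual reproduction of the automated test flow (a sketch; assumes a standard multi-node lab with keystone admin credentials already sourced):

# Confirm all hosts start out unlocked/enabled/available
system host-list
# Force-reboot the active controller (controller-0 in the logs below)
sudo reboot -f
# From the newly active controller, poll the host list; the rebooted
# controller is expected to return to enabled/available but instead goes
# disabled/offline and then disabled/failed
system host-list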

Expected Behavior
------------------
After the reboot, the controller recovers and returns to the unlocked/enabled/available state.

Actual Behavior
----------------
The controller stays unlocked/disabled, first offline and then failed (see the host-list output under Timestamp/Logs).

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190517T013000Z

Last Pass
---------

Timestamp/Logs
--------------
[2019-05-17 09:13:47,567] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 09:13:49,161] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-17 09:22:40,335] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-17 09:22:40,335] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-17 09:26:13,928] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 09:26:15,485] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | offline |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-17 10:18:54,294] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 10:18:55,918] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Al Bailey (albailey1974) wrote :

There are two puppet errors:
puppet/2019-05-17-07-39-41_controller/puppet.log:2019-05-17T07:42:03.517 Error: 2019-05-17 07:42:03 +0000 kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 1 instead of one of [0]

puppet/2019-05-17-07-39-41_controller/puppet.log:2019-05-17T07:42:03.617 Error: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: change from notrun to 0 failed: kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 1 instead of one of [0]

Here are more logs related to the failed kubernetes command:

2019-05-17T07:41:45.660 Debug: 2019-05-17 07:41:45 +0000 Exec[configure master node](provider=posix): Executing 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-05-17T07:41:45.662 Debug: 2019-05-17 07:41:45 +0000 Executing: 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-05-17T07:42:03.488 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [init] Using Kubernetes version: v1.13.5
2019-05-17T07:42:03.491 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Running pre-flight checks
2019-05-17T07:42:03.493 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Pulling images required for setting up a Kubernetes cluster
2019-05-17T07:42:03.495 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] This might take a minute or two, depending on the speed of your internet connection
2019-05-17T07:42:03.498 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
2019-05-17T07:42:03.500 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: error execution phase preflight: [preflight] Some fatal errors occurred:
2019-05-17T07:42:03.502 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.13.5: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 108.177.111.82:443: connect: no route to host
2019-05-17T07:42:03.504 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: , error: exit status 1
2019-05-17T07:42:03.506 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.13.5: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 108.177.111.82:443: connect: no route to host
2019-05-17T07:42:03.508 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Ku...
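
To confirm that the failure is registry connectivity rather than kubeadm itself, the failing pull can be retried by hand (a sketch; the image tag is taken from the log above):

# Retry the image pull that kubeadm reported as failing
sudo docker pull k8s.gcr.io/kube-apiserver:v1.13.5
# Check basic reachability of the registry endpoint the docker daemon is contacting
curl -v https://k8s.gcr.io/v2/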

Revision history for this message
Al Bailey (albailey1974) wrote :

On the active controller:

fm alarm-list
+----------+----------------------------------------------------------------------------------------------+---------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------------------------+---------------------------+----------+-------------------+
| 200.004 | controller-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock | host=controller-0 | critical | 2019-05-17T09:23: |
| | and Unlock may be required if auto-recovery is unsuccessful. | | | 18.940531 |
| | | | | |
| 200.009 | controller-0 experienced a persistent critical 'Cluster-host Network' communication failure. | host=controller-0.network | critical | 2019-05-17T09:23: |
| | | =Cluster-host | | 15.911553 |
| | | | | |
| 200.005 | controller-0 experienced a persistent critical 'Management Network' communication failure. | host=controller-0.network | critical | 2019-05-17T09:23: |
| | | =Management | | 15.830535 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 | service_domain=controller | major | 2019-05-17T09:23: |
| | active member available | .service_group=directory- | | 00.346577 |
| | | services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller | major | 2019-05-17T09:23: |
| | member available | .service_group=web- | | 00.265557 |
| | | services | | ...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Tee to see if this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1828880

Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Ghada Khalil (gkhalil)
tags: added: stx.sanity
Ghada Khalil (gkhalil)
summary: - After reboot active controller, controller in disable/failed state
+ After reboot active controller, controller in disabled/failed state
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659865

Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority since this is a sanity issue

tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/659865
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=319b5602df48594c495489b900ddda73979b1e66
Submitter: Zuul
Branch: master

commit 319b5602df48594c495489b900ddda73979b1e66
Author: Tee Ngo <email address hidden>
Date: Fri May 17 14:23:43 2019 -0400

    Correct controller-0 mgmt mac following bootstrap

    This commit corrects the mgmt_mac of controller-0 as part
    of mgmt interface provisioning following Ansible bootstrap.

    Test:
      Bring up a standard system. Verify that after a force
      reboot of the active controller, the controller is able
      to recover successfully.

    Closes-Bug: #1828880
    Closes-Bug: #1829545
    Depends-On: I9ef9d30bbf8713c75206b338aefd53c3e77db0cb

    Change-Id: I3536202a396c47bc0cf8463505f6de48815fee02
    Signed-off-by: Tee Ngo <email address hidden>
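
One way to sanity-check the corrected record after bootstrap (a sketch; the management interface name below is an assumption, substitute the lab's actual NIC):

# Management MAC stored in the sysinv host record
system host-show controller-0 | grep mgmt_mac
# MAC reported by the provisioned management interface (interface name assumed)
ip link show enp0s8 | grep ether

After the fix the two values should match; a stale MAC left over from bootstrap is consistent with the management/cluster-host communication failures seen after the reboot.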

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue in recent loads.
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

Lab: WP_1_2
Load: 2019-05-18_06-36-50

Revision history for this message
Nimalini Rasa (nrasa) wrote :

The fix only works for system host-if-modify and doesn't work for system host-if-add. The issue is seen on a standard lab with LAG mgmt (wcp-35).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661354

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening as an additional fix is required. Review is already open as per above.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on:
WCP_99-103
2019-05-23_18-37-00

[2019-05-24 13:45:29,029] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-24 13:45:30,442] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-05-24 13:51:32,050] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-24 13:51:32,050] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-24 14:12:24,889] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-24 14:12:26,286] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/661354
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=370d591d920b0861e5a12468103492c027895ec5
Submitter: Zuul
Branch: master

commit 370d591d920b0861e5a12468103492c027895ec5
Author: Tee Ngo <email address hidden>
Date: Fri May 24 12:17:44 2019 -0400

    Correct controller-0 mgmt mac following bootstrap

    The previous commit 319b5602df48594c495489b900ddda73979b1e66
    did not address the special case where there is a LAG/AE interface
    on management network. This commit covers these cases.

    Test:
      Bring up a standard system with LAG on management. Verify that
      controller-0 is not in failed state after a force reboot.

    Closes-Bug: #1829545
    Change-Id: Ia8dd3bda79a59af4cba59686e7bb916c2b86d9bb
    Signed-off-by: Tee Ngo <email address hidden>
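
For the LAG case the same check applies against the bond (the bond name below is an assumption; a Linux bond normally reports the MAC of its first member):

# Management MAC stored in the sysinv host record
system host-show controller-0 | grep mgmt_mac
# MAC of the AE/bond interface carrying the management network (bond name assumed)
ip link show bond0 | grep ether
grep -i 'Permanent HW addr' /proc/net/bonding/bond0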

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue recently.

tags: removed: stx.retestneeded