After reboot active controller, controller in disabled/failed state

Bug #1829545 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Tee Ngo
Milestone: (none)

Bug Description

Brief Description
-----------------
In a multi-node system, after rebooting the active controller, the controller remains in a disabled/failed state.

Severity
--------
Major

Steps to Reproduce
------------------
Reboot the active controller in a multi-node system and check its state with 'system host-list' (a minimal command sequence is sketched below).

TC-name: mtc/test_services_persists_over_reboot.py::test_system_persist_over_host_reboot[controller]
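
A minimal manual reproduction of the automated test flow (a sketch; assumes a standard multi-node lab with keystone admin credentials already sourced):

# Confirm all hosts start out unlocked/enabled/available
system host-list
# Force-reboot the active controller (controller-0 in the logs below)
sudo reboot -f
# From the newly active controller, poll the host list; the rebooted
# controller is expected to return to enabled/available but instead goes
# disabled/offline and then disabled/failed
system host-list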

Expected Behavior
------------------
After the reboot, the controller recovers and returns to the unlocked/enabled/available state.

Actual Behavior
----------------
The controller stays unlocked/disabled, first offline and then failed (see the host-list output under Timestamp/Logs).

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190517T013000Z

Last Pass
---------

Timestamp/Logs
--------------
[2019-05-17 09:13:47,567] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 09:13:49,161] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-17 09:22:40,335] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-17 09:22:40,335] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-17 09:26:13,928] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 09:26:15,485] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | offline |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-17 10:18:54,294] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-17 10:18:55,918] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Al Bailey (albailey1974) wrote :

There are two puppet errors:
puppet/2019-05-17-07-39-41_controller/puppet.log:2019-05-17T07:42:03.517 Error: 2019-05-17 07:42:03 +0000 kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 1 instead of one of [0]

puppet/2019-05-17-07-39-41_controller/puppet.log:2019-05-17T07:42:03.617 Error: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: change from notrun to 0 failed: kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 1 instead of one of [0]

Here are more logs related to the failed kubernetes command:

2019-05-17T07:41:45.660 Debug: 2019-05-17 07:41:45 +0000 Exec[configure master node](provider=posix): Executing 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-05-17T07:41:45.662 Debug: 2019-05-17 07:41:45 +0000 Executing: 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-05-17T07:42:03.488 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [init] Using Kubernetes version: v1.13.5
2019-05-17T07:42:03.491 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Running pre-flight checks
2019-05-17T07:42:03.493 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Pulling images required for setting up a Kubernetes cluster
2019-05-17T07:42:03.495 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] This might take a minute or two, depending on the speed of your internet connection
2019-05-17T07:42:03.498 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
2019-05-17T07:42:03.500 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: error execution phase preflight: [preflight] Some fatal errors occurred:
2019-05-17T07:42:03.502 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.13.5: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 108.177.111.82:443: connect: no route to host
2019-05-17T07:42:03.504 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: , error: exit status 1
2019-05-17T07:42:03.506 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.13.5: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 108.177.111.82:443: connect: no route to host
2019-05-17T07:42:03.508 Notice: 2019-05-17 07:42:03 +0000 /Stage[main]/Platform::Ku...
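
To confirm that the failure is registry connectivity rather than kubeadm itself, the failing pull can be retried by hand (a sketch; the image tag is taken from the log above):

# Retry the image pull that kubeadm reported as failing
sudo docker pull k8s.gcr.io/kube-apiserver:v1.13.5
# Check basic reachability of the registry endpoint the docker daemon is contacting
curl -v https://k8s.gcr.io/v2/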

Revision history for this message
Al Bailey (albailey1974) wrote :

On the active controller:

fm alarm-list
+----------+----------------------------------------------------------------------------------------------+---------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------------------------+---------------------------+----------+-------------------+
| 200.004 | controller-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock | host=controller-0 | critical | 2019-05-17T09:23: |
| | and Unlock may be required if auto-recovery is unsuccessful. | | | 18.940531 |
| | | | | |
| 200.009 | controller-0 experienced a persistent critical 'Cluster-host Network' communication failure. | host=controller-0.network | critical | 2019-05-17T09:23: |
| | | =Cluster-host | | 15.911553 |
| | | | | |
| 200.005 | controller-0 experienced a persistent critical 'Management Network' communication failure. | host=controller-0.network | critical | 2019-05-17T09:23: |
| | | =Management | | 15.830535 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 | service_domain=controller | major | 2019-05-17T09:23: |
| | active member available | .service_group=directory- | | 00.346577 |
| | | services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active | service_domain=controller | major | 2019-05-17T09:23: |
| | member available | .service_group=web- | | 00.265557 |
| | | services | | ...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Tee to see if this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1828880

Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Ghada Khalil (gkhalil)
tags: added: stx.sanity
Ghada Khalil (gkhalil)
summary: - After reboot active controller, controller in disable/failed state
+ After reboot active controller, controller in disabled/failed state
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659865

Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority since this is a sanity issue

tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/659865
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=319b5602df48594c495489b900ddda73979b1e66
Submitter: Zuul
Branch: master

commit 319b5602df48594c495489b900ddda73979b1e66
Author: Tee Ngo <email address hidden>
Date: Fri May 17 14:23:43 2019 -0400

    Correct controller-0 mgmt mac following bootstrap

    This commit corrects the mgmt_mac of controller-0 as part
    of mgmt interface provisioning following Ansible bootstrap.

    Test:
      Bring up a standard system. Verify that after a force
      reboot of the active controller, the controller is able
      to recover successfully.

    Closes-Bug: #1828880
    Closes-Bug: #1829545
    Depends-On: I9ef9d30bbf8713c75206b338aefd53c3e77db0cb

    Change-Id: I3536202a396c47bc0cf8463505f6de48815fee02
    Signed-off-by: Tee Ngo <email address hidden>
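
One way to sanity-check the corrected record after bootstrap (a sketch; the management interface name below is an assumption, substitute the lab's actual NIC):

# Management MAC stored in the sysinv host record
system host-show controller-0 | grep mgmt_mac
# MAC reported by the provisioned management interface (interface name assumed)
ip link show enp0s8 | grep ether

After the fix the two values should match; a stale MAC left over from bootstrap is consistent with the management/cluster-host communication failures seen after the reboot.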

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue in recent loads.
Lab: WCP_99_103
Load: 2019-05-21_14-14-17

Lab: WP_1_2
Load: 2019-05-18_06-36-50

Revision history for this message
Nimalini Rasa (nrasa) wrote :

The fix only works for system host-if-modify and doesn't work for system host-if-add. The issue is seen on a standard lab with LAG mgmt (wcp-35).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661354

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Re-opening as an additional fix is required. Review is already open as per above.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on:
WCP_99-103
2019-05-23_18-37-00

[2019-05-24 13:45:29,029] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-24 13:45:30,442] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-05-24 13:51:32,050] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-24 13:51:32,050] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-05-24 14:12:24,889] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2019-05-24 14:12:26,286] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | failed |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | enabled | available |
| 5 | compute-2 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/661354
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=370d591d920b0861e5a12468103492c027895ec5
Submitter: Zuul
Branch: master

commit 370d591d920b0861e5a12468103492c027895ec5
Author: Tee Ngo <email address hidden>
Date: Fri May 24 12:17:44 2019 -0400

    Correct controller-0 mgmt mac following bootstrap

    The previous commit 319b5602df48594c495489b900ddda73979b1e66
    did not address the special case where there is a LAG/AE interface
    on management network. This commit covers these cases.

    Test:
      Bring up a standard system with LAG on management. Verify that
      controller-0 is not in failed state after a force reboot.

    Closes-Bug: #1829545
    Change-Id: Ia8dd3bda79a59af4cba59686e7bb916c2b86d9bb
    Signed-off-by: Tee Ngo <email address hidden>
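
For the LAG case the same check applies against the bond (the bond name below is an assumption; a Linux bond normally reports the MAC of its first member):

# Management MAC stored in the sysinv host record
system host-show controller-0 | grep mgmt_mac
# MAC of the AE/bond interface carrying the management network (bond name assumed)
ip link show bond0 | grep ether
grep -i 'Permanent HW addr' /proc/net/bonding/bond0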

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue recently.

tags: removed: stx.retestneeded