Controller failed to lock following a failover due to elastic pod failure to shutdown
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX |
Fix Released
|
High
|
Dan Voiculeasa |
Bug Description
Brief Description
-----------------
The controller node failed to lock after a failover.
Severity
--------
Major
Steps to Reproduce
------------------
1. Bring up a standard system
2. Apply StarlingX monitor application
3. Force a failover by issuing the command "sudo reboot" on the primary controller (e.g. controller-0).
4. After the standby controller-1 has taken over and controller-0 has recovered from the reboot and becomes unlocked-
5. Lock the now standby controller
Expected Behavior
------------------
The standby controller is in locked state within 3 minutes.
Actual Behavior
----------------
The standby controller (controller-1) failed to lock as elasticsearch pods are stuck in terminating. VIM eventually timed out and failed the lock. The controller-1 daemon.log kept spewing the following logs:
2019-11-
2019-11-
A follow up host-lock --force had be issued in order to lock controller-1. As soon as the system host-unlock was issued against controller-1, the monitor related pods on the primary controller-0 got into a bad state and Kibana dashboard became unreachable until controller-1 recovered from the unlock.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-elasticsear
monitor mon-filebeat-bppwv 1/2 CrashLoopBackOff 127 9d fd00:206:
monitor mon-kibana-
monitor mon-logstash-0 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-metricbeat-
monitor mon-metricbeat-
Reproducibility
---------------
Very frequent
System Configuration
-------
Standard system
Branch/Pull Time/Commit
-------
Oct. 24th load
Last Pass
---------
This scenario was not executed by me previously and I do not know whether it was part of a regression test suite.
Timestamp/Logs
--------------
Attached is the daemon.log of the controller node that consistently failed host-lock following a failover.
Test Activity
-------------
Ad-hoc testing
Changed in starlingx: | |
assignee: | Ovidiu Poncea (ovidiu.poncea) → Dan Voiculeasa (dvoicule) |
tags: | added: in-r-stx30 |
stx.3.0 / high priority - pods not shutting down preventing the host lock operation.