Controller failed to lock following a failover due to elastic pod failure to shutdown

Bug #1851287 reported by Tee Ngo on 2019-11-04
Affects: StarlingX
Status: In Progress
Importance: High
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
The controller node failed to lock after a failover.

Severity
--------
Major

Steps to Reproduce
------------------
1. Bring up a standard system
2. Apply StarlingX monitor application
3. Force a failover by issuing the command "sudo reboot" on the primary controller (e.g. controller-0).
4. After the standby controller-1 has taken over and controller-0 has recovered from the reboot and become unlocked-enabled-available, perform a host swact
5. Lock the now-standby controller (see the command sketch below)
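
For reference, a minimal command sequence for steps 3-5, assuming controller-0 starts out as the active controller (hostnames are illustrative):

controller-0:~$ sudo reboot                       # step 3: force a failover to controller-1
# (wait until controller-0 is unlocked-enabled-available again)
controller-1:~$ system host-swact controller-1    # step 4: swact activity back to controller-0
controller-0:~$ system host-lock controller-1     # step 5: lock the now-standby controller-1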

Expected Behavior
------------------
The standby controller is in locked state within 3 minutes.

Actual Behavior
----------------
The standby controller (controller-1) failed to lock because the elasticsearch pods were stuck in Terminating. VIM eventually timed out and failed the lock operation. The controller-1 daemon.log kept spewing the following logs:

2019-11-04T17:45:47.696 controller-1 dockerd[12252]: info time="2019-11-04T17:45:47.696483189Z" level=error msg="Error running exec 4e99d41ebaa999838e836cca8d6cdd36f4fdf5fab8bc55f0a3908a646ba19c96 in container: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"
2019-11-04T17:45:47.696 controller-1 kubelet[88547]: info W1104 17:45:47.696940 88547 prober.go:108] No ref for container "docker://cc09882af3071eaa3d82bd6e91cb78d2a7a52a895960edfb3a24706ed96f540f" (mon-elasticsearch-data-1_monitor(bf423e8c-0fc3-40a4-a662-e9f42d3f581f):elasticsearch)

A follow-up host-lock --force had to be issued in order to lock controller-1. As soon as system host-unlock was issued against controller-1, the monitor-related pods on the primary controller-0 got into a bad state and the Kibana dashboard became unreachable until controller-1 recovered from the unlock.
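
For reference, the forced lock and subsequent unlock were commands roughly of the form below (hostname per this setup):

controller-0:~$ system host-lock --force controller-1
controller-0:~$ system host-unlock controller-1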

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6bc668cd76-2xcrj 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-client-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-curator-1572890400-5bl5m 0/1 Error 0 17m fd00:206::8e22:765f:6121:eb55 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-7nbnr 0/1 Error 0 22m fd00:206::8e22:765f:6121:eb63 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-b2r5r 0/1 Error 0 8m51s fd00:206::8e22:765f:6121:eb7a controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-mwgf2 0/1 Error 0 15m fd00:206::8e22:765f:6121:eb42 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-q66mf 0/1 Error 0 12m fd00:206::8e22:765f:6121:eb79 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-tppf8 0/1 Error 0 20m fd00:206::8e22:765f:6121:eb6b controller-0 <none> <none>
monitor mon-elasticsearch-data-1 0/1 Pending 0 14m <none> <none> <none> <none>
monitor mon-elasticsearch-master-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-filebeat-bppwv 1/2 CrashLoopBackOff 127 9d fd00:206::a4ce:fec1:5423:e327 controller-1 <none> <none>
monitor mon-kibana-6cf57cfd5b-pg7cf 0/1 CrashLoopBackOff 16 67m fd00:206::8e22:765f:6121:eb59 controller-0 <none> <none>
monitor mon-logstash-0 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-metricbeat-7948cd594c-wwqlp 0/1 Error 18 67m fd00:206::8e22:765f:6121:eb6f controller-0 <none> <none>
monitor mon-metricbeat-tqzgn 0/1 Error 116 9d fd00:204::4 controller-1 <none> <none>
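
The listing above is kubectl output; a comparable view can be obtained with something like the following (exact flags are assumed, not taken from the original report):

controller-0:~$ kubectl get pods --all-namespaces -o wide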

Reproducibility
---------------
Very frequent

System Configuration
--------------------
Standard system

Branch/Pull Time/Commit
-----------------------
Oct. 24th load

Last Pass
---------
I had not executed this scenario previously, and I do not know whether it is part of a regression test suite.

Timestamp/Logs
--------------
Attached is the daemon.log of the controller node that consistently failed host-lock following a failover.

Test Activity
-------------
Ad-hoc testing

Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - pods not shutting down preventing the host lock operation.

summary: - Controller failed to lock following a failover
+ Controller failed to lock following a failover due to elastic pod
+ failure to shutdown
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
tags: added: stx.3.0 stx.containers
Kevin Smith (kevin.smith.wrs) wrote :

The data and master pods are blocked on I/O writing to their respective rbd volumes, according to the kernel logs. There is a ceph OSD bounce just before the block, which is suspicious. In some cases the situation resolves itself over time: the pods terminate and the lock completes. If not, a host-lock --force with the accompanying reboot will forcefully resolve the problem. The exact same symptoms can be reproduced if the volumes are allowed to go full (which the elasticsearch-curator is preventing). Unfortunately, the ability to reproduce on the lab that reported the problem has been lost.
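
As a side note, the I/O-blocked state described above typically shows up in the kernel logs as hung-task messages, which can be spotted with something like the following (illustrative, not taken from the attached logs):

controller-1:~$ dmesg -T | grep -i 'blocked for more than'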

I currently suspect the index shard re-sync between the data pods that occurs after node recovery. It is possible to disable routing (at least temporarily) before a pod starts up, which could help the situation, but without the ability to reproduce it will be hard to verify whether that helps.
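
For context, the temporary routing disable mentioned above would normally go through the Elasticsearch cluster settings API, along these lines (the service endpoint and port are assumptions based on the pod names in this report; this is only a sketch of the idea):

controller-0:~$ curl -s -X PUT 'http://mon-elasticsearch-client:9200/_cluster/settings' \
                     -H 'Content-Type: application/json' \
                     -d '{"transient": {"cluster.routing.allocation.enable": "primaries"}}'
# ... after the node has recovered, re-enable full shard allocation:
controller-0:~$ curl -s -X PUT 'http://mon-elasticsearch-client:9200/_cluster/settings' \
                     -H 'Content-Type: application/json' \
                     -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'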

Kevin Smith (kevin.smith.wrs) wrote :

Issue reproduced with a host-swact. No host-lock was performed after the swact, but the pods would surely get stuck terminating if one were issued. Logs attached. Ceph is continuously in the state below.

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

Bob Church (rchurch) wrote :

Looks like this is a ceph recovery issue. The OSDs appear to be stuck peering, which blocks cluster access.

controller-1:/var/log/ceph$ ceph health detail | head
HEALTH_WARN Reduced data availability: 64 pgs inactive, 64 pgs peering
PG_AVAILABILITY Reduced data availability: 64 pgs inactive, 64 pgs peering
    pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
    pg 1.1 is stuck peering for 6119.913881, current state peering, last acting [0,1]
    pg 1.2 is stuck peering for 6149.545071, current state peering, last acting [1,0]
    pg 1.3 is stuck peering for 6148.786716, current state peering, last acting [1,0]
    pg 1.4 is stuck peering for 6187.537512, current state peering, last acting [1,0]
    pg 1.5 is stuck peering for 6182.361878, current state peering, last acting [1,0]
    pg 1.6 is stuck peering for 6119.918444, current state peering, last acting [0,1]
    pg 1.7 is stuck peering for 6119.918502, current state peering, last acting [0,1]

Our current recovery logic will not detect this scenario. A restart of either OSD daemon will fix the issue.
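
For example, something along these lines would restart or bounce an OSD (service management details are an assumption; StarlingX may manage the OSD daemons differently):

controller-1:~$ sudo systemctl restart ceph-osd@1    # restart osd.1 on controller-1
# or mark the OSD down from a monitor node and let it rejoin and re-peer:
controller-1:~$ ceph osd down 1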

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

controller-1:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.87097 root storage-tier
-2 0.87097 chassis group-0
-4 0.43549 host controller-0
 0 ssd 0.43549 osd.0 up 1.00000 1.00000
-3 0.43549 host controller-1
 1 ssd 0.43549 osd.1 up 1.00000 1.00000

controller-1:~$ tail /var/log/ceph/ceph-osd.1.log
2019-11-13 15:33:52.302 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:53.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:54.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:55.279 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:56.267 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2723 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirecte...


Frank Miller (sensfan22) wrote :

Re-assigning to Ovidiu to determine who should prime this issue.

Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Ovidiu Poncea (ovidiu.poncea)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Dan Voiculeasa (dvoicule)
Frank Miller (sensfan22) wrote :

There are 2 recommended actions to address this issue:
1) Update the ceph recovery code to detect when OSDs are in a "stuck peering" state and restart the affected OSD (a rough detection sketch follows below).
2) Investigate whether a semantic check needs to be added to prevent locking a host with OSDs while ceph is in a "backfilling" state.
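
A rough CLI-level sketch of the detection in action 1, based on the health output above (the exact checks in the recovery code are still to be designed):

controller-1:~$ ceph health detail | grep -c 'stuck peering'    # non-zero while OSDs are up/in is the suspect state
controller-1:~$ ceph osd stat                                   # confirm the OSDs still report up/in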

Fix proposed to branch: master
Review: https://review.opendev.org/695917

Changed in starlingx:
status: Triaged → In Progress