Controller failed to lock following a failover due to elastic pod failure to shutdown

Bug #1851287 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
The controller node failed to lock after a failover.

Severity
--------
Major

Steps to Reproduce
------------------
1. Bring up a standard system
2. Apply StarlingX monitor application
3. Force a failover by issuing the command "sudo reboot" on the primary controller (e.g. controller-0).
4. After the standby controller-1 has taken over and controller-0 has recovered from the reboot and become unlocked-enabled-available, perform a host swact.
5. Lock the now-standby controller.
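
For reference, a minimal command sketch of steps 2-5 above, assuming the standard StarlingX system CLI and the host names used in this report (the monitor application name is an assumption):

# 2. apply the StarlingX monitor application (application name assumed)
system application-apply stx-monitor
# 3. force a failover by rebooting the active controller
ssh controller-0 sudo reboot
# 4. once controller-0 is unlocked-enabled-available again, swact away from controller-1
system host-swact controller-1
# 5. lock the now-standby controller
system host-lock controller-1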

Expected Behavior
------------------
The standby controller is in locked state within 3 minutes.

Actual Behavior
----------------
The standby controller (controller-1) failed to lock because the elasticsearch pods were stuck terminating. VIM eventually timed out and failed the lock operation. The controller-1 daemon.log kept spewing the following logs:

2019-11-04T17:45:47.696 controller-1 dockerd[12252]: info time="2019-11-04T17:45:47.696483189Z" level=error msg="Error running exec 4e99d41ebaa999838e836cca8d6cdd36f4fdf5fab8bc55f0a3908a646ba19c96 in container: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"
2019-11-04T17:45:47.696 controller-1 kubelet[88547]: info W1104 17:45:47.696940 88547 prober.go:108] No ref for container "docker://cc09882af3071eaa3d82bd6e91cb78d2a7a52a895960edfb3a24706ed96f540f" (mon-elasticsearch-data-1_monitor(bf423e8c-0fc3-40a4-a662-e9f42d3f581f):elasticsearch)

A follow-up host-lock --force had to be issued in order to lock controller-1. As soon as system host-unlock was issued against controller-1, the monitor-related pods on the primary controller-0 got into a bad state and the Kibana dashboard became unreachable until controller-1 recovered from the unlock.

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-6bc668cd76-2xcrj 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-client-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-elasticsearch-curator-1572890400-5bl5m 0/1 Error 0 17m fd00:206::8e22:765f:6121:eb55 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-7nbnr 0/1 Error 0 22m fd00:206::8e22:765f:6121:eb63 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-b2r5r 0/1 Error 0 8m51s fd00:206::8e22:765f:6121:eb7a controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-mwgf2 0/1 Error 0 15m fd00:206::8e22:765f:6121:eb42 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-q66mf 0/1 Error 0 12m fd00:206::8e22:765f:6121:eb79 controller-0 <none> <none>
monitor mon-elasticsearch-curator-1572890400-tppf8 0/1 Error 0 20m fd00:206::8e22:765f:6121:eb6b controller-0 <none> <none>
monitor mon-elasticsearch-data-1 0/1 Pending 0 14m <none> <none> <none> <none>
monitor mon-elasticsearch-master-1 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-filebeat-bppwv 1/2 CrashLoopBackOff 127 9d fd00:206::a4ce:fec1:5423:e327 controller-1 <none> <none>
monitor mon-kibana-6cf57cfd5b-pg7cf 0/1 CrashLoopBackOff 16 67m fd00:206::8e22:765f:6121:eb59 controller-0 <none> <none>
monitor mon-logstash-0 0/1 Pending 0 22m <none> <none> <none> <none>
monitor mon-metricbeat-7948cd594c-wwqlp 0/1 Error 18 67m fd00:206::8e22:765f:6121:eb6f controller-0 <none> <none>
monitor mon-metricbeat-tqzgn 0/1 Error 116 9d fd00:204::4 controller-1 <none> <none>
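
For reference, the listing above matches the output of a wide, all-namespaces pod query, and the recovery path described in the text corresponds to the following commands (host name as in this report):

# capture pod status across all namespaces, as shown above
kubectl get pods --all-namespaces -o wide
# force the lock after the normal lock times out, then unlock once ready
system host-lock --force controller-1
system host-unlock controller-1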

Reproducibility
---------------
Very frequent

System Configuration
--------------------
Standard system

Branch/Pull Time/Commit
-----------------------
Oct. 24th load

Last Pass
---------
This scenario was not executed by me previously and I do not know whether it was part of a regression test suite.

Timestamp/Logs
--------------
Attached is the daemon.log of the controller node that consistently failed host-lock following a failover.

Test Activity
-------------
Ad-hoc testing

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - pods not shutting down, preventing the host lock operation.

summary: - Controller failed to lock following a failover
+ Controller failed to lock following a failover due to elastic pod
+ failure to shutdown
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
importance: Undecided → Medium
status: New → Triaged
importance: Medium → High
tags: added: stx.3.0 stx.containers
Revision history for this message
Kevin Smith (kevin.smith.wrs) wrote :

According to kernel logs, the data and master pods are blocked on I/O while writing to their respective rbd volumes. There is a ceph OSD bounce just before the block, which is suspicious. In some cases the situation resolves itself over time: the pods terminate and the lock completes. If not, a host-lock --force with its accompanying reboot forcefully resolves the problem. The exact same symptoms can be reproduced if the volumes are allowed to fill up (which the elasticsearch-curator normally prevents). Unfortunately, the ability to reproduce on the lab that reported the problem has been lost.

I currently suspect the index shard re-sync between the data pods that occurs after node recovery. It is possible to disable shard routing (at least temporarily) before a pod starts up, which could help the situation, but without the ability to reproduce it will be hard to verify whether that helps.
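
A minimal sketch of temporarily disabling shard allocation through the standard Elasticsearch cluster settings API, assuming the elasticsearch client service is reachable at <es-host>:9200 (host and port are placeholders):

# disable shard allocation before the pod restarts (transient setting)
curl -s -X PUT http://<es-host>:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
# re-enable allocation once the node has rejoined the cluster
curl -s -X PUT http://<es-host>:9200/_cluster/settings \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'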

Revision history for this message
Kevin Smith (kevin.smith.wrs) wrote :

Issue reproduced with a host-swact. No host-lock was performed after the swact, but pods would surely get stuck terminating if one were issued. Logs attached. Ceph remained continuously in the state below.

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

Revision history for this message
Bob Church (rchurch) wrote :

Looks like this is a ceph recovery issue. The OSDs appear to be stuck peering, which blocks cluster access.

controller-1:/var/log/ceph$ ceph health detail | head
HEALTH_WARN Reduced data availability: 64 pgs inactive, 64 pgs peering
PG_AVAILABILITY Reduced data availability: 64 pgs inactive, 64 pgs peering
    pg 1.0 is stuck peering for 6124.369015, current state peering, last acting [1,0]
    pg 1.1 is stuck peering for 6119.913881, current state peering, last acting [0,1]
    pg 1.2 is stuck peering for 6149.545071, current state peering, last acting [1,0]
    pg 1.3 is stuck peering for 6148.786716, current state peering, last acting [1,0]
    pg 1.4 is stuck peering for 6187.537512, current state peering, last acting [1,0]
    pg 1.5 is stuck peering for 6182.361878, current state peering, last acting [1,0]
    pg 1.6 is stuck peering for 6119.918444, current state peering, last acting [0,1]
    pg 1.7 is stuck peering for 6119.918502, current state peering, last acting [0,1]

Our current recovery logic will not detect this scenario. A restart of either OSD daemon will fix the issue.
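
A sketch of the manual workaround, assuming the sysvinit-style ceph script used on these hosts (the exact restart mechanism may differ by load):

# restart the stuck OSD daemon on the affected controller, e.g. osd.1 on controller-1
sudo /etc/init.d/ceph restart osd.1
# on systemd-managed ceph deployments the equivalent would be:
# sudo systemctl restart ceph-osd@1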

controller-1:~$ ceph -s
  cluster:
    id: 427bf4e1-20f5-4c9a-a1f9-337796696e3a
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive, 64 pgs peering

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-1(active), standbys: controller-0
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 73.06 k objects, 283 GiB
    usage: 568 GiB used, 324 GiB / 892 GiB avail
    pgs: 100.000% pgs not active
             64 peering

controller-1:~$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.87097 root storage-tier
-2 0.87097 chassis group-0
-4 0.43549 host controller-0
 0 ssd 0.43549 osd.0 up 1.00000 1.00000
-3 0.43549 host controller-1
 1 ssd 0.43549 osd.1 up 1.00000 1.00000

controller-1:~$ tail /var/log/ceph/ceph-osd.1.log
2019-11-13 15:33:52.302 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:53.315 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:54.273 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:55.279 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2722 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirected e1739)
2019-11-13 15:33:56.267 7f2961c1f700 -1 osd.1 1740 get_health_metrics reporting 2723 slow ops, oldest is osd_op(client.2869821.0:28539 1.10 1.1d461e50 (undecoded) ondisk+write+known_if_redirecte...


Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Ovidiu to determine who should prime this issue.

Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Ovidiu Poncea (ovidiu.poncea)
Changed in starlingx:
assignee: Ovidiu Poncea (ovidiu.poncea) → Dan Voiculeasa (dvoicule)
Revision history for this message
Frank Miller (sensfan22) wrote :

There are 2 recommended actions to address this issue:
1) Update the ceph recovery code to detect if OSDs are in "stuck peering" state and restart the OSD.
2) Investigate if a semantic check needs to be added to prevent a lock of a host with OSDs if ceph is in "backfilling" state.
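
A rough shell sketch of the first action, approximating the detection logic (the shipped fix lives in ceph-init-wrapper; the OSD id selection and restart mechanism here are assumptions):

# count placement groups reported as stuck peering
stuck=$(ceph health detail 2>/dev/null | grep -c 'stuck peering')
if [ "$stuck" -gt 0 ]; then
    # restart the OSD daemons to force re-peering
    for osd_id in $(ceph osd ls); do
        /etc/init.d/ceph restart "osd.${osd_id}"
    done
fi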

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695917

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/696938

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/695917
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=11fd5d9cd48a1539b9c7a4ebc8aaad69ed24ae5b
Submitter: Zuul
Branch: master

commit 11fd5d9cd48a1539b9c7a4ebc8aaad69ed24ae5b
Author: Dan Voiculeasa <email address hidden>
Date: Thu Nov 21 15:01:36 2019 +0200

    ceph-init-wrapper: Detect stuck peering OSDs and restart them

    OSDs might become stuck peering.
    Recover from such state.

    Closes-bug: 1851287

    Change-Id: I2ef1a0e93d38c3d041ee0c5c1e66a4ac42785a68
    Signed-off-by: Dan Voiculeasa <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The next step is to cherry-pick this commit to the r/stx.3.0 branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/698222

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.3.0)

Reviewed: https://review.opendev.org/698222
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=0bf4b546df8c7fdec8cfc6cb6f71b9609ee54306
Submitter: Zuul
Branch: r/stx.3.0

commit 0bf4b546df8c7fdec8cfc6cb6f71b9609ee54306
Author: Dan Voiculeasa <email address hidden>
Date: Thu Nov 21 15:01:36 2019 +0200

    ceph-init-wrapper: Detect stuck peering OSDs and restart them

    OSDs might become stuck peering.
    Recover from such state.

    Closes-bug: 1851287

    Change-Id: I2ef1a0e93d38c3d041ee0c5c1e66a4ac42785a68
    Signed-off-by: Dan Voiculeasa <email address hidden>
    (cherry picked from commit 11fd5d9cd48a1539b9c7a4ebc8aaad69ed24ae5b)

Ghada Khalil (gkhalil)
tags: added: in-r-stx30
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/698561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/696938
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Submitter: Zuul
Branch: master

commit 37db1ee792d7ea1eee77b3b134f078a5fca5fdbe
Author: Dan Voiculeasa <email address hidden>
Date: Mon Dec 2 18:47:12 2019 +0200

    ceph: Add semantic check on host-lock to avoid data/service loss

    Avoid locking nodes that have OSDs in recovery state.
    If an OSD that fills others with newer data is stopped(host-lock), then
    ceph doesn't feed the old data to consumers, thus K8s pods get stuck.

    Parse `ceph health details` for PGs in `recovery_wait` or `recovering`
    state. Identify OSDs acting on those PGs. Identify which nodes own the
    OSDs. Deny the lock on those OSDs until ceph recovered.

    Mock `ceph health details` and `ceph osd tree` in tests with a simple
    AIO-DX configuration. Controller-0 with OSD.0. Controller-1 with OSD.1.

    Example `ceph health details` output:
    pg 1.0 is active+recovery_wait+degraded, acting [1,0]
    pg 1.1 is active+recovering+degraded, acting [1,0]

    Partial-Bug: 1851287
    Change-Id: Id644d1de5ba2a0bff51638fb9cb8a4d2732e7278
    Signed-off-by: Dan Voiculeasa <email address hidden>
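
A rough shell approximation of the check described in this commit, mapping recovering PGs to their acting OSDs and then to the hosts that own them (the real implementation is in sysinv; the parsing here is simplified):

# list PGs currently in recovery_wait or recovering, e.g.
#   pg 1.0 is active+recovery_wait+degraded, acting [1,0]
ceph health detail | grep -E 'recovery_wait|recovering' | while read -r line; do
    # pull the acting OSD set out of the trailing "[1,0]"
    acting=$(echo "$line" | sed -n 's/.*acting \[\(.*\)\].*/\1/p')
    for osd_id in $(echo "$acting" | tr ',' ' '); do
        # resolve which host owns this OSD; host-lock would be denied for that host
        ceph osd find "$osd_id" | grep '"host"'
    done
done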

Revision history for this message
Frank Miller (sensfan22) wrote :

As indicated above, the solution for this LP is addressed by 2 actions:
1) Update the ceph recovery code to detect if OSDs are in "stuck peering" state and restart the OSD.
2) Investigate if a semantic check needs to be added to prevent a lock of a host with OSDs if ceph is in "backfilling" state.

The 1st action is higher priority and the commits are merged into r/stx.3.0.
The 2nd action is a good preventative addition but is not required for r/stx.3.0. It only needs to merge into master for the stx.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/698561
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=9035cd1be8aa3138691c6c99219030dfbe77ebaf
Submitter: Zuul
Branch: f/centos8

commit 4aa661ce5666220d6beb2a3a3fac987cba4feb74
Author: Martin, Chen <email address hidden>
Date: Thu Nov 21 10:28:13 2019 +0800

    Build layering
    Rebase tarball for i40e Driver
    Rebase srpm for systemd 219-67.el7
    Rebase srpm for sudo
    Rebase srpm for ntp

    Depends-On: https://review.opendev.org/#/c/695061/
    Depends-On: https://review.opendev.org/#/c/695560/
    Depends-On: https://review.opendev.org/#/c/695637/
    Depends-On: https://review.opendev.org/#/c/695983/

    Story: 2006166
    Task: 37570

    Change-Id: I7f33e0fb1319df3421318c4927d2a5675a490273
    Signed-off-by: Martin, Chen <email address hidden>

commit 5d854355d873702b78ff6aa8c6fddc025c45be2d
Author: Jim Somerville <email address hidden>
Date: Mon Nov 25 16:07:17 2019 -0500

    Uprev ntp to version 4.2.6p5-29.el7

    This solves:
    ntp: Stack-based buffer overflow in ntpq and ntpdc allows
    denial of service or code execution (CVE-2018-12327)

    See the announcement link:

    https://lists.centos.org/pipermail/centos-cr-announce/2019-August/006016.html

    for more details.

    Here we refresh the meta patches and correct the crime of
    "name of patch file differs from git format-patch". We
    also clean up the commit short logs.

    Change-Id: I263465d85f06096296fdd478a302eb110ab1259c
    Closes-Bug: 1849197
    Depends-On: https://review.opendev.org/#/c/695983
    Signed-off-by: Jim Somerville <email address hidden>

commit 11fd5d9cd48a1539b9c7a4ebc8aaad69ed24ae5b
Author: Dan Voiculeasa <email address hidden>
Date: Thu Nov 21 15:01:36 2019 +0200

    ceph-init-wrapper: Detect stuck peering OSDs and restart them

    OSDs might become stuck peering.
    Recover from such state.

    Closes-bug: 1851287

    Change-Id: I2ef1a0e93d38c3d041ee0c5c1e66a4ac42785a68
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit f30cb74fef4b97721010ca9bc6a6b6dde03c4add
Author: Robin Lu <email address hidden>
Date: Fri Nov 22 11:01:27 2019 +0800

    Update sudo srpm patch for CVE bug

    To fix below CVE, we will use sudo-1.8.23-4.el7_7.1.src.rpm
    And we have to update some patches according to new srpm.
    https://lists.centos.org/pipermail/centos-announce/2019-October/023499.html

    CVE bug: CVE-2019-14287: sudo: can bypass certain policy blacklists

    Closes-Bug: 1852825
    Depends-On: https://review.opendev.org/#/c/695637/
    Change-Id: Ifc0a3423464fafce06cd504d9b427fc3433fb756
    Signed-off-by: Robin Lu <email address hidden>

commit 0231aba5cdcb96b15106591acfff280159050366
Author: Jim Somerville <email address hidden>
Date: Thu Nov 21 15:54:15 2019 -0500

    Uprev systemd to version 219-67.el7

    This solves:
    systemd: line splitting via fgets() allows for state injection
    during daemon-reexec (CVE-2018-15686)

    along with some other less critical issues. See the security
    announcement link:

   ...


tags: added: in-f-centos8