Distributed Cloud: host swact failed as the controller services group could not be disabled for the node to go standby

Bug #1866099 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Matt Peters

Bug Description

Brief Description
-----------------
Controller swact failed because SM could not disable the controller-services group for the node to go standby.

Severity
--------
Critical - following the swact failure, the system got into a stalemate: platform services and system commands were not available until SM issued a node reboot roughly 50 minutes later.

Steps to Reproduce
------------------
1. Configure a distributed cloud with the stx-monitor app applied on the SystemController
2. Lock the standby controller
3. Change platform memory allocation and wait for the platform-integ-apps reapply to complete
4. Unlock the standby controller
5. Perform host-swact
6. Lock the new standby controller to update its memory, then unlock it
7. Perform host-swact
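
For reference, the steps above map roughly to the StarlingX CLI commands below. This is a sketch only: it assumes controller-1 starts as the standby, the platform memory value is illustrative, and the exact host-memory-modify arguments may vary by release.

# 2. Lock the standby controller
system host-lock controller-1

# 3. Adjust the platform memory reservation on processor 0 (value is illustrative),
#    then wait for platform-integ-apps to finish re-applying
system host-memory-modify -m 9000 controller-1 0
system application-list        # wait until platform-integ-apps shows "applied"

# 4./5. Unlock the standby controller and swact away from the active controller
system host-unlock controller-1
system host-swact controller-0

# 6./7. Repeat the lock/modify/unlock cycle on the new standby, then swact again
system host-lock controller-0
system host-memory-modify -m 9000 controller-0 0
system host-unlock controller-0
system host-swact controller-1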

Expected Behavior
------------------
Host swact is successful

Actual Behavior
----------------
Swact failed at step 5, and it took close to 1 hour for the system to self-recover. Swact failed again at step 7.

Snippet of sm-customer.log:
| 2020-03-04T15:43:22.199 | 471 | node-scn | controller-1 | | swact | issued against host controller-1
| 2020-03-04T15:57:24.462 | 602 | service-scn | drbd-cephmon | enabled-go-standby | enabled-go-standby-failed | go-standby failed
| 2020-03-04T15:57:24.462 | 603 | service-group-scn | controller-services | disabling | disabling-failed | drbd-cephmon(enabled-go-standby, failed)
| 2020-03-04T15:57:26.564 | 630 | service-group-scn | controller-services | disabling-failed | go-active-failed | drbd-cephmon(enabled-go-standby, failed), drbd-dockerdistribution(enabled-go-standby, failed), drbd-etcd(enabled-go-standby, failed), drbd-patch-vault(enabled-go-standby, failed), drbd-extension(enabled-go-standby, failed), ...
| 2020-03-04T15:57:28.768 | 674 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=2915275) failed
| 2020-03-04T16:00:26.523 | 756 | service-group-scn | controller-services | go-active-failed | disabling-failed | drbd-cephmon(enabled-go-standby, ), drbd-dockerdistribution(enabled-go-standby, ), drbd-etcd(enabled-go-standby, ), drbd-patch-vault(enabled-go-standby, ), drbd-extension(enabled-go-standby, ), ...

first host-swact command issued against controller-0:
2020-03-04T12:44:31.000 controller-0 -sh: info HISTORY: PID=1483452 UID=42425 system host-swact controller-0

first node reboot initiated by sm against controller-1 to recover the system:
2020-03-04T13:35:32.000 controller-0 sm: debug time[160767.454] log<5782> INFO: sm[98802]: sm_node_api.cpp(910): ** Issuing a controlled reboot of the system **

second host-swact command issued against controller-1:
2020-03-04T15:43:04.000 controller-1 -sh: info HISTORY: PID=2344189 UID=42425 system host-swact controller-1

second node reboot initiated by sm against controller-0 to recover the system:
2020-03-04T16:34:04.000 controller-1 sm: debug time[14677.635] log<4013> INFO: sm[104039]: sm_node_api.cpp(910): ** Issuing a controlled reboot of the system **

Reproducibility
---------------
very reproducible

System Configuration
--------------------
IPv6 distributed cloud with stx-monitor

Branch/Pull Time/Commit
-----------------------
Feb 22 master code

Last Pass
---------
It is not certain whether host-swact was previously verified in the configuration described above.

Timestamp/Logs
--------------
See attached for more logs

Test Activity
-------------
Evaluation

Workaround
----------
None documented.

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Key kernel logs indicate this is a "device is held open by someone" type issue. Sample logs at the time of the failure:

2020-03-04T12:44:54.819 controller-0 kernel: err [157361.484752] block drbd8: State change failed: Device is held open by someone
2020-03-04T12:44:54.819 controller-0 kernel: err [157361.487387] block drbd6: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.986267] block drbd9: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987222] block drbd7: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987475] block drbd5: State change failed: Device is held open by someone
2020-03-04T12:45:09.006 controller-0 kernel: err [157375.714197] block drbd2: State change failed: Device is held open by someone
2020-03-04T12:45:09.472 controller-0 kernel: err [157376.178824] block drbd0: State change failed: Device is held open by someone
2020-03-04T12:45:10.991 controller-0 kernel: err [157377.694175] block drbd1: State change failed: Device is held open by someone
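
When this state is hit, the process holding a DRBD device open can usually be identified directly on the host with standard tools (a generic sketch; the device name is illustrative):

# List processes holding the filesystem on a DRBD device open
fuser -vm /dev/drbd0

# List all open files that reference drbd (as used later in this report)
lsof | grep drbd

# Check the current DRBD resource roles and connection states
cat /proc/drbd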

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - switch of activity failing for distributed cloud

tags: added: stx.distcloud
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to DC PL for further action

Revision history for this message
Tee Ngo (teewrs) wrote :

Adding the DRBD-related open files before the swact.

Revision history for this message
Tee Ngo (teewrs) wrote :

Adding the DRBD-related open files after the swact.

Revision history for this message
Tee Ngo (teewrs) wrote :

Attached are the two lsof dumps of drbd (lsof|grep drbd) before and after the swact on the active controller.

The after-swact file contains the following entries:
drbd 2603777 root cwd DIR 259,6 4096 2 /
drbd 2603777 root rtd DIR 259,6 4096 2 /
drbd 2603777 root txt REG 259,6 964672 265566 /usr/bin/bash
drbd 2603777 root mem REG 259,6 106075056 393152 /usr/lib/locale/locale-archive
drbd 2603777 root mem REG 259,6 2151672 265635 /usr/lib64/libc-2.17.so
drbd 2603777 root mem REG 259,6 19288 265641 /usr/lib64/libdl-2.17.so
drbd 2603777 root mem REG 259,6 174576 265971 /usr/lib64/libtinfo.so.5.9
drbd 2603777 root mem REG 259,6 163400 265628 /usr/lib64/ld-2.17.so
drbd 2603777 root mem REG 259,6 26254 265937 /usr/lib64/gconv/gconv-modules.cache
drbd 2603777 root 0r CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 1w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 2w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 9w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 255r REG 259,6 29552 655839 /usr/lib/ocf/resource.d/linbit/drbd
drbd 2604179 root cwd DIR 259,6 4096 2 /
drbd 2604179 root rtd DIR 259,6 4096 2 /
drbd 2604179 root txt REG 259,6 964672 265566 /usr/bin/bash
drbd 2604179 root mem REG 259,6 106075056 393152 /usr/lib/locale/locale-archive
drbd 2604179 root mem REG 259,6 2151672 265635 /usr/lib64/libc-2.17.so
drbd 2604179 root mem REG 259,6 19288 265641 /usr/lib64/libdl-2.17.so
drbd 2604179 root mem REG 259,6 174576 265971 /usr/lib64/libtinfo.so.5.9
drbd 2604179 root mem REG 259,6 163400 265628 /usr/lib64/ld-2.17.so
drbd 2604179 root mem REG 259,6 26254 265937 /usr/lib64/gconv/gconv-modules.cache

There are no similar entries on another SystemController (one without any subclouds or stx-monitor) where host-swact succeeded.

Revision history for this message
Tee Ngo (teewrs) wrote :

The owners of the bash shells are drbd demote processes, which were triggered by SM, e.g.:

Finding the process name of PID 2665471 (a bash shell):
controller-0:~$ ps -efww|grep 2665471
root 2665471 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4096011 2665471 0 19:39 ? 00:00:00 sleep 1
sysadmin 4096119 4080989 0 19:39 pts/13 00:00:00 grep --color=auto 2665471

Finding the process name of PID 2665613 (another bash shell):
controller-0:~$ ps -efww|grep 2665613
root 2665613 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4123722 2665613 0 19:40 ? 00:00:00 sleep 1
sysadmin 4123802 4080989 0 19:40 pts/13 00:00:00 grep --color=auto 2665613

Finding the parent of these drbd demote processes. By this time SM had issued a reboot:
controller-0:~$ ps -efww|grep 103911
root 103911 1 1 15:38 ? 00:04:20 /usr/bin/sm
root 103918 103911 0 15:38 ? 00:00:01 /usr/bin/sm
root 2665471 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665613 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665615 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665617 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665873 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670126 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670255 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670563 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114797 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114798 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114799 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114802 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114807 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114811 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114813 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114820 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4127812 103911 0 19:40 ? 00:00:00 reboot
root 4132507 103911 0 19:40 ? 00:00:00 /bin/bash /etc/init.d/ceph-init-wrapper stop
root 4132518 103911 4 19:40 ? 00:00:00 /usr/bin/python /etc/init.d/mgr-restful-plugin stop
root 4132537 103911 0 19:40 ? 00:00:00 /bin/sh /etc/init.d/openldap stop
root 4133148 103911 0 19:40 ? 00:00:00 [IPaddr2] <defunct>

Below is the stack trace of one of the drbd demote processes. Perhaps the content of /proc/drbd should be monitored for h...


Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was seen on IP-33-36. Logs are attached.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

The console logs below were seen on IP-33-36:

[97796.298366] block drbd8: State change failed: Device is held open by someone
[97796.298368] block drbd8: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.298369] block drbd8: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.302227] block drbd5: State change failed: Device is held open by someone
[97796.302228] block drbd5: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.302229] block drbd5: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303381] block drbd5: State change failed: Device is held open by someone
[97796.303383] block drbd5: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303384] block drbd5: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303927] block drbd1: State change failed: Device is held open by someone
[97796.303929] block drbd1: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303930] block drbd1: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.318116] block drbd2: State change failed: Device is held open by someone
[97796.318118] block drbd2: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.318119] block drbd2: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.324579] block drbd7: State change failed: Device is held open by someone
[97796.324581] block drbd7: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.324582] block drbd7: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329491] block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329493] block drbd0: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329634] block drbd0: State change failed: Device is held open by someone
[97796.329636] block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329637] block drbd0: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.336381] block drbd7: State change failed: Device is held open by someone
[97796.336383] block drbd7: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.336384] block drbd7: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.337276] block drbd2: State change failed: Device is held open by someone
[97796.337278] block drbd2: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.337279] block drbd2: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r-

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Al Bailey (albailey1974)
Revision history for this message
Al Bailey (albailey1974) wrote :

A container was running which had run the 'mount' command for the drbd filesystems.
That container was still running and was blocking the teardown of drbd.

The container was running something called metricbeat

crictl ps |grep 8afd0d02
8afd0d023f166 9d91ab49e8de2 32 hours ago Running metricbeat

crictl exec 8afd0d023f166 mount |grep drbd
/dev/drbd8 on /hostfs/var/lib/docker-distribution type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd9 on /hostfs/var/lib/ceph/mon type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd6 on /hostfs/opt/patch-vault type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd1 on /hostfs/var/lib/rabbitmq type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd5 on /hostfs/opt/extension type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd0 on /hostfs/var/lib/postgresql type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/opt/platform type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd7 on /hostfs/opt/etcd type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/www/pages/iso type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/www/pages/helm_charts type ext4 (rw,noatime,nodiratime,data=ordered)

We need to determine whether there is a way to protect the system from these types of actions, or else provide a way to shut down the container or force an unmount within it.
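
One way to confirm that a container's mount namespace is what still references the DRBD filesystems, and to release them manually, is sketched below. The container ID and path are illustrative (taken from the crictl output above), the JSON scrape for the PID is a rough approximation, and forcing an unmount inside a running container should be a last resort.

# Find the container's host-side PID from its container ID
PID=$(crictl inspect 8afd0d023f166 | grep -m1 '"pid":' | grep -o '[0-9]\+')

# List the DRBD mounts visible inside that container's mount namespace
nsenter --target "$PID" --mount findmnt | grep drbd

# Release one of the bind mounts from within the container's namespace
nsenter --target "$PID" --mount umount /hostfs/var/lib/docker-distribution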

Revision history for this message
Al Bailey (albailey1974) wrote :

Ghada, please assign to Kevin.

Changed in starlingx:
assignee: Al Bailey (albailey1974) → nobody
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Matt Peters (mpeters-wrs)
Revision history for this message
Matt Peters (mpeters-wrs) wrote :

The containers are not running a "mount" command from within the container. Any K8s Pod that requests a hostPath volume and specifies the rootfs will perform a *bind* mount of the rootfs and all of its sub-path mounts (which include the DRBD filesystems).

Kubernetes pod security policies can protect the system from application Pods performing this same operation:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems
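
As an illustration only (not the actual policy deployed on the system), a PodSecurityPolicy can restrict which volume types application Pods may request, so a Pod cannot bind-mount the host rootfs via hostPath in the first place. All names here are hypothetical.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrict-hostpath          # hypothetical policy name
spec:
  privileged: false
  volumes:                         # hostPath is deliberately omitted from this list
  - configMap
  - secret
  - emptyDir
  - persistentVolumeClaim
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny

A policy like this only takes effect once it is bound to the Pods' service accounts via RBAC.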

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitor-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/712756

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitor-armada-app (master)

Reviewed: https://review.opendev.org/712756
Committed: https://git.openstack.org/cgit/starlingx/monitor-armada-app/commit/?id=687fb8584f3a244fa73cd8cb5091f496e0316f16
Submitter: Zuul
Branch: master

commit 687fb8584f3a244fa73cd8cb5091f496e0316f16
Author: Matt Peters <email address hidden>
Date: Thu Mar 12 13:54:44 2020 -0400

    Propagate host mount changes to metricbeat container

    The metricbeat pod requests a hostPath volume which mounts the
    rootfs of the host into the container. This is required for
    metricbeat to read the filesystem usage from within the container.
    The rootfs mount is also bind mounting all subpath mounts within the
    container which includes the DRBD filesystems. When SM attempts
    to unmount the DRBD filesystems as part of the swact operation, the
    metricbeat container is still holding on to the mounted filesystems,
    preventing the DRBD devices from being shut down.

    To address this, the metricbeat container will specify the mount
    propagation so that any changes to the host filesystem are also
    reflected inside the container. This will ensure that when the host
    performs the unmount, the container will also release the mount
    reference.

    In the future, security policies should be added to ensure other
    application containers do not perform host mount operations without
    having the necessary privileges to access the host filesystem.

    Change-Id: I43c6dcbe6a438b357532263be5b04f122b587c70
    Closes-Bug: 1866099
    Signed-off-by: Matt Peters <email address hidden>
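
The shape of the fix is roughly as follows. This is a generic pod-spec sketch, not the literal monitor-armada-app Helm chart change; the container/volume names and mount path are illustrative.

    containers:
    - name: metricbeat
      volumeMounts:
      - name: rootfs                        # hostPath volume that mounts "/" from the host
        mountPath: /hostfs
        readOnly: true
        mountPropagation: HostToContainer   # host unmount events now propagate into the container
    volumes:
    - name: rootfs
      hostPath:
        path: /

With HostToContainer propagation, the unmount that SM performs on the host during a swact is reflected inside the metricbeat mount namespace, so the container no longer holds the DRBD devices open.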

Changed in starlingx:
status: In Progress → Fix Released