Distributed Cloud: host swact failed as the controller services group could not be disabled for the node to go standby

Bug #1866099 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Matt Peters

Bug Description

Brief Description
-----------------
Controller swact failed because SM could not disable the controller-services group for the node to go standby.

Severity
--------
Critical - following the swact failure, the system got into a stalemate: platform services and system commands were not available until SM issued a node reboot roughly 50 minutes later.

Steps to Reproduce
------------------
1. Configure a distributed cloud with the stx-monitor app applied on the SystemController
2. Lock the standby controller
3. Change platform memory allocation and wait for the platform-integ-apps reapply to complete
4. Unlock the standby controller
5. Perform host-swact
6. Lock the new standby controller to update its memory, then unlock it
7. Perform host-swact
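
For reference, the steps above map roughly to the StarlingX CLI commands below. This is a sketch only: it assumes controller-1 starts as the standby, the platform memory value is illustrative, and the exact host-memory-modify arguments may vary by release.

# 2. Lock the standby controller
system host-lock controller-1

# 3. Adjust the platform memory reservation on processor 0 (value is illustrative),
#    then wait for platform-integ-apps to finish re-applying
system host-memory-modify -m 9000 controller-1 0
system application-list        # wait until platform-integ-apps shows "applied"

# 4./5. Unlock the standby controller and swact away from the active controller
system host-unlock controller-1
system host-swact controller-0

# 6./7. Repeat the lock/modify/unlock cycle on the new standby, then swact again
system host-lock controller-0
system host-memory-modify -m 9000 controller-0 0
system host-unlock controller-0
system host-swact controller-1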

Expected Behavior
------------------
Host swact is successful

Actual Behavior
----------------
Swact failed at step 5, and it took close to 1 hour for the system to self-recover. Swact failed again at step 7.

Snippet of sm-customer.log:
| 2020-03-04T15:43:22.199 | 471 | node-scn | controller-1 | | swact | issued against host controller-1
| 2020-03-04T15:57:24.462 | 602 | service-scn | drbd-cephmon | enabled-go-standby | enabled-go-standby-failed | go-standby failed
| 2020-03-04T15:57:24.462 | 603 | service-group-scn | controller-services | disabling | disabling-failed | drbd-cephmon(enabled-go-standby, failed)
| 2020-03-04T15:57:26.564 | 630 | service-group-scn | controller-services | disabling-failed | go-active-failed | drbd-cephmon(enabled-go-standby, failed), drbd-dockerdistribution(enabled-go-standby, failed), drbd-etcd(enabled-go-standby, failed), drbd-patch-vault(enabled-go-standby, failed), drbd-extension(enabled-go-standby, failed), ...
| 2020-03-04T15:57:28.768 | 674 | service-scn | dcorch-patch-api-proxy | enabled-active | disabled | process (pid=2915275) failed
| 2020-03-04T16:00:26.523 | 756 | service-group-scn | controller-services | go-active-failed | disabling-failed | drbd-cephmon(enabled-go-standby, ), drbd-dockerdistribution(enabled-go-standby, ), drbd-etcd(enabled-go-standby, ), drbd-patch-vault(enabled-go-standby, ), drbd-extension(enabled-go-standby, ), ...

first host-swact command issued against controller-0:
2020-03-04T12:44:31.000 controller-0 -sh: info HISTORY: PID=1483452 UID=42425 system host-swact controller-0

first node reboot initiated by sm against controller-1 to recover the system:
2020-03-04T13:35:32.000 controller-0 sm: debug time[160767.454] log<5782> INFO: sm[98802]: sm_node_api.cpp(910): ** Issuing a controlled reboot of the system **

second host-swact command issued against controller-1:
2020-03-04T15:43:04.000 controller-1 -sh: info HISTORY: PID=2344189 UID=42425 system host-swact controller-1

second node reboot initiated by sm against controller-0 to recover the system:
2020-03-04T16:34:04.000 controller-1 sm: debug time[14677.635] log<4013> INFO: sm[104039]: sm_node_api.cpp(910): ** Issuing a controlled reboot of the system **

Reproducibility
---------------
very reproducible

System Configuration
--------------------
IPv6 distributed cloud with stx-monitor

Branch/Pull Time/Commit
-----------------------
Feb 22 master code

Last Pass
---------
It is not certain whether host-swact was previously verified in the configuration described above.

Timestamp/Logs
--------------
See attached for more logs

Test Activity
-------------
Evaluation

Workaround
----------
None documented.

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Key kernel logs indicate this is a "device is held open by someone" type issue. Sample logs at the time of the failure:

2020-03-04T12:44:54.819 controller-0 kernel: err [157361.484752] block drbd8: State change failed: Device is held open by someone
2020-03-04T12:44:54.819 controller-0 kernel: err [157361.487387] block drbd6: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.986267] block drbd9: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987222] block drbd7: State change failed: Device is held open by someone
2020-03-04T12:44:55.373 controller-0 kernel: err [157361.987475] block drbd5: State change failed: Device is held open by someone
2020-03-04T12:45:09.006 controller-0 kernel: err [157375.714197] block drbd2: State change failed: Device is held open by someone
2020-03-04T12:45:09.472 controller-0 kernel: err [157376.178824] block drbd0: State change failed: Device is held open by someone
2020-03-04T12:45:10.991 controller-0 kernel: err [157377.694175] block drbd1: State change failed: Device is held open by someone
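
When this state is hit, the process holding a DRBD device open can usually be identified directly on the host with standard tools (a generic sketch; the device name is illustrative):

# List processes holding the filesystem on a DRBD device open
fuser -vm /dev/drbd0

# List all open files that reference drbd (as used later in this report)
lsof | grep drbd

# Check the current DRBD resource roles and connection states
cat /proc/drbd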

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - switch of activity failing for distributed cloud

tags: added: stx.distcloud
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to DC PL for further action

Revision history for this message
Tee Ngo (teewrs) wrote :

Adding the DRBD-related open files before the swact.

Revision history for this message
Tee Ngo (teewrs) wrote :

Adding the DRBD-related open files after the swact.

Revision history for this message
Tee Ngo (teewrs) wrote :

Attached are the two lsof dumps of drbd (lsof|grep drbd) before and after the swact on the active controller.

The after-swact file contains the following entries:
drbd 2603777 root cwd DIR 259,6 4096 2 /
drbd 2603777 root rtd DIR 259,6 4096 2 /
drbd 2603777 root txt REG 259,6 964672 265566 /usr/bin/bash
drbd 2603777 root mem REG 259,6 106075056 393152 /usr/lib/locale/locale-archive
drbd 2603777 root mem REG 259,6 2151672 265635 /usr/lib64/libc-2.17.so
drbd 2603777 root mem REG 259,6 19288 265641 /usr/lib64/libdl-2.17.so
drbd 2603777 root mem REG 259,6 174576 265971 /usr/lib64/libtinfo.so.5.9
drbd 2603777 root mem REG 259,6 163400 265628 /usr/lib64/ld-2.17.so
drbd 2603777 root mem REG 259,6 26254 265937 /usr/lib64/gconv/gconv-modules.cache
drbd 2603777 root 0r CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 1w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 2w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 9w CHR 1,3 0t0 1028 /dev/null
drbd 2603777 root 255r REG 259,6 29552 655839 /usr/lib/ocf/resource.d/linbit/drbd
drbd 2604179 root cwd DIR 259,6 4096 2 /
drbd 2604179 root rtd DIR 259,6 4096 2 /
drbd 2604179 root txt REG 259,6 964672 265566 /usr/bin/bash
drbd 2604179 root mem REG 259,6 106075056 393152 /usr/lib/locale/locale-archive
drbd 2604179 root mem REG 259,6 2151672 265635 /usr/lib64/libc-2.17.so
drbd 2604179 root mem REG 259,6 19288 265641 /usr/lib64/libdl-2.17.so
drbd 2604179 root mem REG 259,6 174576 265971 /usr/lib64/libtinfo.so.5.9
drbd 2604179 root mem REG 259,6 163400 265628 /usr/lib64/ld-2.17.so
drbd 2604179 root mem REG 259,6 26254 265937 /usr/lib64/gconv/gconv-modules.cache

There are no similar entries on another SystemController (one without any subclouds or stx-monitor) where host-swact succeeded.

Revision history for this message
Tee Ngo (teewrs) wrote :

The owners of the bash shells are drbd demote processes, which were triggered by SM, e.g.:

Finding the process name of PID 2665471 (a bash shell):
controller-0:~$ ps -efww|grep 2665471
root 2665471 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4096011 2665471 0 19:39 ? 00:00:00 sleep 1
sysadmin 4096119 4080989 0 19:39 pts/13 00:00:00 grep --color=auto 2665471

Finding the process name of PID 2665613 (another bash shell):
controller-0:~$ ps -efww|grep 2665613
root 2665613 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4123722 2665613 0 19:40 ? 00:00:00 sleep 1
sysadmin 4123802 4080989 0 19:40 pts/13 00:00:00 grep --color=auto 2665613

Finding the parent of these drbd demote processes. By this time SM had issued a reboot:
controller-0:~$ ps -efww|grep 103911
root 103911 1 1 15:38 ? 00:04:20 /usr/bin/sm
root 103918 103911 0 15:38 ? 00:00:01 /usr/bin/sm
root 2665471 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665613 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665615 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665617 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2665873 103911 0 18:52 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670126 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670255 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 2670563 103911 0 18:53 ? 00:00:18 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114797 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114798 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114799 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114802 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114807 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114811 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114813 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4114820 103911 0 19:39 ? 00:00:00 /bin/bash /usr/lib/ocf/resource.d/linbit/drbd demote
root 4127812 103911 0 19:40 ? 00:00:00 reboot
root 4132507 103911 0 19:40 ? 00:00:00 /bin/bash /etc/init.d/ceph-init-wrapper stop
root 4132518 103911 4 19:40 ? 00:00:00 /usr/bin/python /etc/init.d/mgr-restful-plugin stop
root 4132537 103911 0 19:40 ? 00:00:00 /bin/sh /etc/init.d/openldap stop
root 4133148 103911 0 19:40 ? 00:00:00 [IPaddr2] <defunct>

Below is the stack trace of one of the drbd demote processes. Perhaps the content of /proc/drbd should be monitored for h...


Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was seen on IP-33-36. Logs are attached.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

The console logs below were seen on IP-33-36:

[97796.298366] block drbd8: State change failed: Device is held open by someone
[97796.298368] block drbd8: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.298369] block drbd8: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.302227] block drbd5: State change failed: Device is held open by someone
[97796.302228] block drbd5: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.302229] block drbd5: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303381] block drbd5: State change failed: Device is held open by someone
[97796.303383] block drbd5: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303384] block drbd5: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303927] block drbd1: State change failed: Device is held open by someone
[97796.303929] block drbd1: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.303930] block drbd1: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.318116] block drbd2: State change failed: Device is held open by someone
[97796.318118] block drbd2: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.318119] block drbd2: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.324579] block drbd7: State change failed: Device is held open by someone
[97796.324581] block drbd7: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.324582] block drbd7: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329491] block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329493] block drbd0: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329634] block drbd0: State change failed: Device is held open by someone
[97796.329636] block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.329637] block drbd0: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.336381] block drbd7: State change failed: Device is held open by someone
[97796.336383] block drbd7: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.336384] block drbd7: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r----- }
[97796.337276] block drbd2: State change failed: Device is held open by someone
[97796.337278] block drbd2: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
[97796.337279] block drbd2: wanted = { cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate r-

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Al Bailey (albailey1974)
Revision history for this message
Al Bailey (albailey1974) wrote :

A container was running which had run the 'mount' command for the drbd filesystems.
That container was still running and was blocking the teardown of drbd.

The container was running something called metricbeat

crictl ps |grep 8afd0d02
8afd0d023f166 9d91ab49e8de2 32 hours ago Running metricbeat

crictl exec 8afd0d023f166 mount |grep drbd
/dev/drbd8 on /hostfs/var/lib/docker-distribution type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd9 on /hostfs/var/lib/ceph/mon type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd6 on /hostfs/opt/patch-vault type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd1 on /hostfs/var/lib/rabbitmq type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd5 on /hostfs/opt/extension type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd0 on /hostfs/var/lib/postgresql type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/opt/platform type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd7 on /hostfs/opt/etcd type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/www/pages/iso type ext4 (rw,noatime,nodiratime,data=ordered)
/dev/drbd2 on /hostfs/www/pages/helm_charts type ext4 (rw,noatime,nodiratime,data=ordered)

We need to determine whether there is a way to protect the system from these types of actions, or else provide a way to shut down the container or force an unmount within it.
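
One way to confirm that a container's mount namespace is what still references the DRBD filesystems, and to release them manually, is sketched below. The container ID and path are illustrative (taken from the crictl output above), the JSON scrape for the PID is a rough approximation, and forcing an unmount inside a running container should be a last resort.

# Find the container's host-side PID from its container ID
PID=$(crictl inspect 8afd0d023f166 | grep -m1 '"pid":' | grep -o '[0-9]\+')

# List the DRBD mounts visible inside that container's mount namespace
nsenter --target "$PID" --mount findmnt | grep drbd

# Release one of the bind mounts from within the container's namespace
nsenter --target "$PID" --mount umount /hostfs/var/lib/docker-distribution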

Revision history for this message
Al Bailey (albailey1974) wrote :

Ghada, please assign to Kevin.

Changed in starlingx:
assignee: Al Bailey (albailey1974) → nobody
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kevin Smith (kevin.smith.wrs)
Changed in starlingx:
assignee: Kevin Smith (kevin.smith.wrs) → Matt Peters (mpeters-wrs)
Revision history for this message
Matt Peters (mpeters-wrs) wrote :

The containers are not running a "mount" command from within the container. Any K8s Pod that requests a hostPath volume and specifies the rootfs will perform a *bind* mount of the rootfs and all of its sub-path mounts (which include the DRBD filesystems).

Kubernetes pod security policies can protect the system from application Pods performing this same operation:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems
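
As an illustration only (not the actual policy deployed on the system), a PodSecurityPolicy can restrict which volume types application Pods may request, so a Pod cannot bind-mount the host rootfs via hostPath in the first place. All names here are hypothetical.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrict-hostpath          # hypothetical policy name
spec:
  privileged: false
  volumes:                         # hostPath is deliberately omitted from this list
  - configMap
  - secret
  - emptyDir
  - persistentVolumeClaim
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny

A policy like this only takes effect once it is bound to the Pods' service accounts via RBAC.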

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitor-armada-app (master)

Fix proposed to branch: master
Review: https://review.opendev.org/712756

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitor-armada-app (master)

Reviewed: https://review.opendev.org/712756
Committed: https://git.openstack.org/cgit/starlingx/monitor-armada-app/commit/?id=687fb8584f3a244fa73cd8cb5091f496e0316f16
Submitter: Zuul
Branch: master

commit 687fb8584f3a244fa73cd8cb5091f496e0316f16
Author: Matt Peters <email address hidden>
Date: Thu Mar 12 13:54:44 2020 -0400

    Propagate host mount changes to metricbeat container

    The metricbeat pod requests a hostPath volume which mounts the
    rootfs of the host into the container. This is required for
    metricbeat to read the filesystem usage from within the container.
    The rootfs mount is also bind mounting all subpath mounts within the
    container which includes the DRBD filesystems. When SM attempts
    to unmount the DRBD filesystems as part of the swact operation, the
    metricbeat container is still holding on to the mounted filesystems,
    preventing the DRBD devices from being shut down.

    To address this, the metricbeat container will specify the mount
    propagation so that any changes to the host filesystem are also
    reflected inside the container. This will ensure that when the host
    performs the unmount, the container will also release the mount
    reference.

    In the future, security policies should be added to ensure other
    application containers do not perform host mount operations without
    having the necessary privileges to access the host filesystem.

    Change-Id: I43c6dcbe6a438b357532263be5b04f122b587c70
    Closes-Bug: 1866099
    Signed-off-by: Matt Peters <email address hidden>
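
The shape of the fix is roughly as follows. This is a generic pod-spec sketch, not the literal monitor-armada-app Helm chart change; the container/volume names and mount path are illustrative.

    containers:
    - name: metricbeat
      volumeMounts:
      - name: rootfs                        # hostPath volume that mounts "/" from the host
        mountPath: /hostfs
        readOnly: true
        mountPropagation: HostToContainer   # host unmount events now propagate into the container
    volumes:
    - name: rootfs
      hostPath:
        path: /

With HostToContainer propagation, the unmount that SM performs on the host during a swact is reflected inside the metricbeat mount namespace, so the container no longer holds the DRBD devices open.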

Changed in starlingx:
status: In Progress → Fix Released