Repeatedly killing the hbsClient can, depending on the timing of the kills, result in a quorum failure that leads to a host watchdog reset.
This is true for the first and last cases but not for the Jan 28th case, for which there is insufficient detail to know exactly what failed.
Seeing long shutdown -> kernel restart times (5-6 minutes) in the Jan 21 and Feb 5 tarballs.
2020-02-05T11:20:19.656 controller-0 kernel: warning 2087.443446] systemd-coredump[522109]: Failed to get EXE.
2020-02-05T11:26:38.206 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
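As a rough check of the gap between the two log lines above, the timestamps can be subtracted directly. This is only a sketch: it assumes the two lines bracket the shutdown -> restart window and that both timestamps are in the same time zone. The helper name gap_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used at the start of the log lines quoted above.
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def gap_seconds(last_before: str, first_after: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    start = datetime.strptime(last_before, FMT)
    end = datetime.strptime(first_after, FMT)
    return (end - start).total_seconds()


# Last line logged before the restart and first systemd line after it.
gap = gap_seconds("2020-02-05T11:20:19.656", "2020-02-05T11:26:38.206")
print(f"{gap:.2f} s ({gap / 60:.1f} min)")  # 378.55 s (6.3 min)
```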
Also seeing mysql coredumps in each of the three tarballs.
From ALL_NODES_20200205.142940
80da3ffd21c57f9618a034ddd7217490 /var/lib/systemd/coredump/core.mysqld.999.b278d04360c14e28810f245ebc4a8220.164.1580899174000000.xz
From ALL_NODES_20200128.162403
2ff424b1979f82ff7e131400a9dface4 /var/lib/systemd/coredump/core.mysqld.999.e2941c43019a4683beaec0e7dcec876b.163.1580207578000000.xz
From ALL_NODES_20200121.162346
f58713f106b72a8fe0e05cf8ac5e328f /var/lib/systemd/coredump/core.mysqld.999.3b9805a854544adb97718e175c5c8565.165.1579602731000000.xz
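To line the coredumps up against the log timeline, the fields packed into each file name can be split out. This is a sketch that assumes the names follow the systemd-coredump layout core.COMM.UID.BOOTID.PID.TIMESTAMP with the timestamp in microseconds since the epoch; the function name parse_coredump_name is made up for illustration.

```python
from datetime import datetime, timezone


def parse_coredump_name(path: str) -> dict:
    """Split a systemd-coredump file name into its fields.

    Assumed layout: core.COMM.UID.BOOTID.PID.TIMESTAMP_US[.xz]
    """
    name = path.rsplit("/", 1)[-1]
    if name.endswith(".xz"):
        name = name[: -len(".xz")]
    prefix, rest = name.split(".", 1)
    if prefix != "core":
        raise ValueError(f"not a coredump file name: {path}")
    # Split from the right so a dot inside COMM does not shift the fields.
    comm, uid, boot_id, pid, ts_us = rest.rsplit(".", 4)
    return {
        "comm": comm,
        "uid": int(uid),
        "boot_id": boot_id,
        "pid": int(pid),
        "time_utc": datetime.fromtimestamp(int(ts_us) // 1_000_000, tz=timezone.utc),
    }


paths = [
    "/var/lib/systemd/coredump/core.mysqld.999.b278d04360c14e28810f245ebc4a8220.164.1580899174000000.xz",
    "/var/lib/systemd/coredump/core.mysqld.999.e2941c43019a4683beaec0e7dcec876b.163.1580207578000000.xz",
    "/var/lib/systemd/coredump/core.mysqld.999.3b9805a854544adb97718e175c5c8565.165.1579602731000000.xz",
]
for p in paths:
    info = parse_coredump_name(p)
    print(info["comm"], info["pid"], info["time_utc"].isoformat())
```

The printed times are UTC; whether the log timestamps above use the same zone is not known from the tarballs quoted here.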
Booked a Simplex lab (wc68) and caused a quorum failure that led to a reboot, but saw neither a long shutdown time nor the mysql coredump.
The root cause of this PV test case failure is a timeout due to long shutdown times.
The common thread here is the mysqld core dumps.
I suggest this is where the investigation should continue.
Note that had the PV test not timed out, for example if the timeout period had been extended or the shutdown had come in just under the threshold, this issue would have been reported as a mysqld core dump issue.