StarlingX

Bug #1860471
Comment #7

Comment 7 for bug 1860471

Revision history for this message

Eric MacDonald (rocksolidmtce) wrote on 2020-02-06:

Killing the hbsClient over and over can (based on timing of those repeated kills) result in a quorum failure that leads to a host watchdog reset.

This is true for the first and last case but not for the Jan 28th case for which there is insufficient details to know what exactly failed.

Seeing long shutdown -> kernel restart times (5-6 minute) in the Jan 21 and Feb 50 tarballs.

2020-02-05T11:20:19.656 controller-0 kernel: warning 2087.443446] systemd-coredump[522109]: Failed to get EXE.
2020-02-05T11:26:38.206 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)

Also seeing mysql coredumps in each of the three tarballs.

From ALL_NODES_20200205.142940
80da3ffd21c57f9618a034ddd7217490 /var/lib/systemd/coredump/core.mysqld.999.b278d04360c14e28810f245ebc4a8220.164.1580899174000000.xz

From ALL_NODES_20200128.162403
2ff424b1979f82ff7e131400a9dface4 /var/lib/systemd/coredump/core.mysqld.999.e2941c43019a4683beaec0e7dcec876b.163.1580207578000000.xz

From ALL_NODES_20200121.162346
f58713f106b72a8fe0e05cf8ac5e328f /var/lib/systemd/coredump/core.mysqld.999.3b9805a854544adb97718e175c5c8565.165.1579602731000000.xz

Booked a Simplex lab (wc68) and caused a quorum failure that lead to a reboot but did not see a long shutdown time nor the mysql coredump.

The root cause of this PV test case failure is a timeout due to long shutdown times.

The common thread here are the mysqld core dumps.
I suggest this is where the investigation continue.

Note that if the PV test were to have not timed out, say they extended the timeout period or it was just under the threshold, this issue would have been reported as a mysqld core dump issue.