After process hbsClient killed, fm cmd failed with "Unable to establish connection"

Bug #1860471 reported by Peng Peng
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
After the hbsClient process was killed, the fm cmd failed with "Unable to establish connection to http://192.168.204.1:5000/v3/auth/tokens"

Severity
--------
Major

Steps to Reproduce
------------------
kill hbsClient
run an fm cmd (a minimal sketch of this sequence is shown below)

TC-name: mtc/test_process_monitoring.py::test_process_monitoring[hbsClient]
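
A minimal sketch of the kill-and-check sequence, assuming the pid file at /var/run/hbsClient.pid used by the automation script in the logs below; sourcing credentials from /etc/platform/openrc is also an assumption:

# kill hbsClient via its pid file, give pmond time to react, then try the fm CLI
pid=$(cat /var/run/hbsClient.pid 2>/dev/null)
[ -n "$pid" ] && echo "<sysadmin-password>" | sudo -S kill -9 "$pid"
sleep 15
source /etc/platform/openrc        # assumption: admin credentials sourced from openrc
fm event-list -l 10 --nowrap --nopaging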

Expected Behavior
------------------
fm cmd works as usual

Actual Behavior
----------------
fm cmd is not working

Reproducibility
---------------
Unknown - first time this is seen in sanity, will monitor

System Configuration
--------------------
One node system

Lab-name: SM-4

Branch/Pull Time/Commit
-----------------------
master 20200121T000002Z

Last Pass
---------
20191210T000000Z

Timestamp/Logs
--------------
2020-01-21 11:05:03,439] 314 DEBUG MainThread ssh.send :: Send 'cat >/home/sysadmin//test_process.sh <<EOL
true; n=1; last_pid=''; pid=''; for((;n<4;)); do pid=\$(cat /var/run/hbsClient.pid 2>/dev/null); date;
                if [ "x\$pid" = "x" -o "\$pid" = "\$last_pid" ]; then echo "stale or empty PID:\$pid, last_pid=\$last_pid";
                sleep 0.5; continue; fi; echo "Li69nux*" | sudo -S kill -9 \$pid &>/dev/null;
                if [ \$? -eq 0 ]; then echo "OK \$n - \$pid killed"; ((n++)); last_pid=\$pid; pid=''; sleep 15;
                else sleep 0.5; fi; done; echo \$pid
EOL'

[2020-01-21 11:05:03,872] 314 DEBUG MainThread ssh.send :: Send 'nohup ./test_process.sh > ./results.txt 2>&1 &'

[2020-01-21 11:11:22,146] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.1:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne event-list -l 10 -q 'event_log_id=200.006;start=2020-01-21T10:41:48.295439' --nowrap --nopaging --uuid'
[2020-01-21 11:11:22,890] 436 DEBUG MainThread ssh.expect :: Output:
Must provide Keystone credentials or user-defined endpoint and token, error was: Unable to establish connection to http://192.168.204.1:5000/v3/auth/tokens: HTTPConnectionPool(host='192.168.204.1', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa042e60650>: Failed to establish a new connection: [Errno 111] Connection refused',))
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, can you monitor whether this is a reproducible issue or a one-off?
It appears there is an issue with keystone in this run, but that should have no relationship to killing hbsclient.

description: updated
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced again on
Lab: SM_4
Load: 20200127T000002Z

log @
https://files.starlingx.kube.cengn.ca/launchpad/1860471

Changed in starlingx:
status: Incomplete → Confirmed
tags: added: stx.retestneeded
Revision history for this message
Yang Liu (yliu12) wrote :

After killing the hbsClient process on a simplex system, the expectation is that the process recovers, controller-0 briefly goes degraded, and then becomes available again once the process has recovered.
The recovery happened in about 1 minute on the 20191210 load (test last pass).

Looking at the automation logs for the latest failure on the 20200127T000002Z load, the recovery did not happen, and the system/fm CLIs no longer worked after that. The system rebooted itself about 17 minutes after the process kills started, and eventually recovered.
[2020-02-05 11:10:49,562] >> process kill
[2020-02-05 11:27:16,407] >> system reboot
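
A rough sketch of timing the expected recovery, assuming pmond rewrites /var/run/hbsClient.pid with the new pid when it respawns the process:

# kill hbsClient once, then watch for a new pid to appear in the pid file
old=$(cat /var/run/hbsClient.pid)
sudo kill -9 "$old"
for i in $(seq 1 120); do
    new=$(cat /var/run/hbsClient.pid 2>/dev/null)
    if [ -n "$new" ] && [ "$new" != "$old" ]; then echo "hbsClient recovered as pid $new after ~${i}s"; break; fi
    sleep 1
done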

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue seen multiple times; needs further investigation

tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
tags: added: stx.4.0
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Starting investigation.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Killing the hbsClient over and over can (based on timing of those repeated kills) result in a quorum failure that leads to a host watchdog reset.

This is true for the first and last case, but not for the Jan 28th case, for which there are insufficient details to know exactly what failed.

Seeing long shutdown -> kernel restart times (5-6 minutes) in the Jan 21 and Feb 05 tarballs.

2020-02-05T11:20:19.656 controller-0 kernel: warning [ 2087.443446] systemd-coredump[522109]: Failed to get EXE.
2020-02-05T11:26:38.206 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)

Also seeing mysql coredumps in each of the three tarballs.

From ALL_NODES_20200205.142940
80da3ffd21c57f9618a034ddd7217490 /var/lib/systemd/coredump/core.mysqld.999.b278d04360c14e28810f245ebc4a8220.164.1580899174000000.xz

From ALL_NODES_20200128.162403
2ff424b1979f82ff7e131400a9dface4 /var/lib/systemd/coredump/core.mysqld.999.e2941c43019a4683beaec0e7dcec876b.163.1580207578000000.xz

From ALL_NODES_20200121.162346
f58713f106b72a8fe0e05cf8ac5e328f /var/lib/systemd/coredump/core.mysqld.999.3b9805a854544adb97718e175c5c8565.165.1579602731000000.xz
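
The md5/path pairs above can be regenerated on a node with something like the following sketch (the availability of coredumpctl on this load is an assumption):

# enumerate the systemd coredump files and checksum them, as listed in the collect tarballs
ls -l /var/lib/systemd/coredump/
md5sum /var/lib/systemd/coredump/core.mysqld.* 2>/dev/null
coredumpctl list mysqld 2>/dev/null    # assumption: coredumpctl is packaged with systemd 219 here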

Booked a Simplex lab (wc68) and caused a quorum failure that led to a reboot, but did not see a long shutdown time nor the mysql coredump.

The root cause of this PV test case failure is a timeout due to long shutdown times.

The common thread here is the mysqld core dumps.
I suggest this is where the investigation should continue.

Note that if the PV test had not timed out, say if the timeout period were extended or the shutdown were just under the threshold, this issue would have been reported as a mysqld core dump issue.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Without the coredump we typically see a 2-minute rather than a 6-minute gap between systemd shutdown and startup over a reset.

Try 1: wc68 20.01 2020-01-28_18-49-15 Formal
2020-02-06T16:00:09.336 controller-0 systemd[1]: info Stopping System Logger Daemon...
2020-02-06T16:01:52.631 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD

Try 2: wc68 20.01 2020-01-28_18-49-15 Formal
2020-02-06T16:53:48.593 controller-0 systemd[1]: info Stopping System Logger Daemon...
2020-02-06T16:55:31.642 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
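
A quick sketch for computing that shutdown -> startup gap from the systemd lines quoted above (the log file name is an assumption and depends on the collect bundle layout):

# measure the gap between the last "Stopping System Logger Daemon" and the next systemd startup banner
log=daemon.log    # assumption: name/path of the collected systemd log
t1=$(grep 'Stopping System Logger Daemon' "$log" | tail -1 | awk '{print $1}' | sed 's/T/ /; s/\.[0-9]*$//')
t2=$(grep 'systemd 219 running in system mode' "$log" | tail -1 | awk '{print $1}' | sed 's/T/ /; s/\.[0-9]*$//')
echo "shutdown -> startup gap: $(( $(date -d "$t2" +%s) - $(date -d "$t1" +%s) )) seconds"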

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The following LP also reports a mysqld coredump over an unlock, which is basically a reset.

core.mysqld coredump file generated after host-unlock
https://bugs.launchpad.net/starlingx/+bug/1860473

Occurrence is rare.
There is no service-affecting impact; the issue occurs over a reset.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_2
Load: 20200318T013004Z
log @
https://files.starlingx.kube.cengn.ca/launchpad/1860471

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

No core dump is indicated in collect/extra/coredump.info ('No core dumps').
However, kern.log reports this systemd-coredump segfault at failure time.

2020-03-18T18:51:13.033 controller-0 kernel: info [ 1756.503375] systemd-coredum[457049]: segfault at 0 ip 000055628594db58 sp 00007ffd6ecbf650 error 6 in systemd-coredump[556285941000+1e000]

2020-03-18T18:51:30.011 controller-0 systemd[1]: info Stopping System Logger Daemon...
2020-03-18T18:57:32.335 controller-0 systemd[1]: info systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)

Another case of a 6-minute shutdown.
Another case of a host watchdog reset due to the expected quorum failure.

2020-03-18T18:50:39.989 [85072.00015] localhost hostwd --- hostwHdlr.cpp ( 201) hostw_service_command :Error : *** PMON reports unrecoverable system - message 'PMON detected hbsClient failure' ***
2020-03-18T18:50:39.989 [85072.00016] localhost hostwd --- hostwHdlr.cpp ( 227) hostw_log_and_reboot :Error : *** Host Watchdog declaring system unhealthy ***
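
A rough way to confirm this watchdog-reset path in a collect bundle (the hostwd log path is an assumption; adjust to the actual bundle layout):

# look for the pmond quorum-failure report and the host watchdog's declaration
grep -E 'PMON detected hbsClient failure|Host Watchdog declaring system unhealthy' /var/log/hostwd.log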

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I think this is an invalid bug report.
PV is killing a quorum process over and over.
The host watchdog is detecting the failure and self-resetting as per design intent.
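
For reference, whether a monitored process is treated as part of the pmon quorum is driven by its pmon conf file; a sketch of how to check (the conf file location and field names are assumptions based on typical StarlingX mtce packaging):

# check whether hbsClient is registered as a quorum process with pmond
grep -iE 'quorum|severity|restarts' /etc/pmon.d/hbsClient.conf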

Changed in starlingx:
status: Confirmed → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP is closed as invalid; removing the stx.retestneeded tag

tags: removed: stx.retestneeded