Debian: Killing Process "sshd" does not trigger an event in "fm event-list"

Bug #1991400 reported by Manasses julio da silva junior
This bug affects 1 person
Affects    Status        Importance  Assigned to                     Milestone
StarlingX  Fix Released  Low         Manasses julio da silva junior

Bug Description

Brief Description
---------------------
The process "sshd" is running and monitored by StarlingX.

If we kill the sshd process, an event should be triggered and appear in "fm event-list". This event is not triggered.

Note that the sshd process does recover after being killed.

Severity
----------
Minor

Steps to Reproduce
----------------------
Log on to a Debian lab.

Use "ps -aux | grep sshd" to see the list of sshd processes running.
Use "kill -9 <pid>" to kill every process which showed up from the previous command.
Run "fm event-list"
Automated Test Case: test_process_monitoring[non_dc-sshd]
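
The manual steps above can be sketched as a small script. This is only an illustration, not the automated test case: the helper names `parse_pids` and `kill_all` are hypothetical, and the sketch assumes `ps -eo pid=,comm=` is available on the lab. It matches the command name exactly, so the `grep` process itself never shows up in the list.

```python
import os
import signal
import subprocess


def parse_pids(ps_output: str, name: str = "sshd") -> list[int]:
    """Return the PIDs whose command name equals `name`.

    Expects the two-column output of `ps -eo pid=,comm=`.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def kill_all(name: str = "sshd") -> list[int]:
    """Send SIGKILL (the equivalent of `kill -9`) to every matching process."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = parse_pids(out, name)
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
    return pids
```

Calling `kill_all()` with sufficient privileges and then running "fm event-list" corresponds to the last two manual steps. Note that killing every sshd process also drops any SSH session used to run the script.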

Expected Behavior
-------------------
The event list should contain an event of the form:

"controller-0 'sshd' process has failed. Auto recovery in progress."
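
A check for this event could look like the sketch below. It assumes the output of "fm event-list" has been captured as text and that the reason text appears on a single line; if the table wraps long text across rows, the match would need to be adapted. The function name `has_failure_event` is hypothetical.

```python
import re


def has_failure_event(event_list_output: str, host: str, process: str) -> bool:
    """Return True if any line reports that `process` failed on `host`."""
    pattern = re.compile(
        rf"{re.escape(host)} '{re.escape(process)}' process has failed"
    )
    return any(pattern.search(line) for line in event_list_output.splitlines())
```

For the failure reported here, `has_failure_event(output, "controller-0", "sshd")` would return False, since no such event is raised.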

Actual Behavior
-----------------
There is no such event in the logs.

Reproducibility
------------------
100% on Debian

Not seen on CentOS.

System Configuration
----------------------
Debian AIO-Simplex

Last Pass
-----------
Never on Debian.

Timestamp/Logs
----------------
[2022-06-23 15:59:40,226] 1312 INFO MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: Last PID of PMON process is:[1] 286992, process:sshd
[2022-06-23 15:59:40,226] 1381 INFO MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: Sleep for some time after each kill: 260
[2022-06-23 16:04:00,324] 1384 INFO MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: After process:sshd been killed 12 times, wait for controller-0 to reach: {'operational': 'enabled', 'availability': 'available'}
[2022-06-23 16:04:00,324] 4719 INFO MainThread system_helper.wait_for_host_values:: Waiting for controller-0 to reach state(s) - {'operational': 'enabled', 'availability': 'available'}
[2022-06-23 16:04:04,531] 4762 INFO MainThread system_helper.wait_for_host_values:: controller-0 operational has reached: enabled
[2022-06-23 16:04:04,531] 4762 INFO MainThread system_helper.wait_for_host_values:: controller-0 availability has reached: available
[2022-06-23 16:04:04,532] 4770 INFO MainThread system_helper.wait_for_host_values:: controller-0 is in state(s): {'operational': 'enabled', 'availability': 'available'}
[2022-06-23 16:04:07,594] 1196 WARNING MainThread test_process_monitoring.wait_for_pmon_process_events:: No matched event found at try:1, will sleep 3 seconds and retry
matched events:
[], host=controller-0, expected status={'operational': 'enabled', 'availability': 'available'}, expecting=True

(...)

[2022-06-23 16:06:14,773] 1196 WARNING MainThread test_process_monitoring.wait_for_pmon_process_events:: No matched event found at try:11, will sleep 3 seconds and retry
matched events:
[], host=controller-0, expected status={'operational': 'enabled', 'availability': 'available'}, expecting=True
[2022-06-23 16:06:17,776] 1207 INFO MainThread test_process_monitoring.wait_for_pmon_process_events:: No matched events:
[]
[2022-06-23 16:06:17,776] 1474 ERROR MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: No event/alarm raised for process:sshd, process_type:pmon, host:controller-0
[2022-06-23 16:06:17,776] 1488 ERROR MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: No event/alarm raised for process:sshd, process_type:pmon, host:controller-0, although host reached expected status:{'operational': 'enabled', 'availability': 'available'}
***Failure at test call: /home/croy/Dev/cgcs/CGCSAuto/testcases/stress/mtc/test_process_monitoring.py:1722: AssertionError: failed in killing process and verifying impact for process:sshd
FAILED

Alarms
--------
None

Test Activity
---------------
Automated Regression Analysis

Workaround
-----------
Don't look in the alarm event logs.

Manasses julio da silva junior (manassesjulio) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/859994
Committed: https://opendev.org/starlingx/config-files/commit/9b43fdabc633b0e22637991536e3be7f662de644
Submitter: "Zuul (22348)"
Branch: master

commit 9b43fdabc633b0e22637991536e3be7f662de644
Author: Manasses Julio <email address hidden>
Date: Fri Sep 30 11:08:27 2022 -0300

    Killing "sshd" does not trigger an alarm event

    The sshd service was being restarted automatically by systemd
    because the ssh.service file had "Restart=on-failure" set. This
    prevented pmon from properly monitoring the process, so the alarm
    was never raised.
    On CentOS, we confirmed that this line is commented out in the
    service file.
    In the openssh-config for Debian the line was also commented out,
    but in sshd.service, which is not the file used by the
    openssh-server package on Debian; that package uses ssh.service.
    This commit replaces the sshd.service file with ssh.service (based
    on the file from the Debian package openssh-server) and comments
    out "Restart=on-failure".

    Test Plan
    PASS: Build and install
    PASS: Unlock AIO-SX
    PASS: Verified that killing the sshd process 10 times would
    trigger the alarm

    Closes-bug: 1991400

    Signed-off-by: Manasses Julio <email address hidden>
    Change-Id: I477d5261ccf44ccb29c7b3ab0dbc9bd816a753cc
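
The root cause described in the commit message, an active "Restart=" line in the unit file, can be checked with a short script. This is a minimal sketch under stated assumptions: the function name `active_restart_policy` is hypothetical, and the parser only reads a single unit file's text, ignoring drop-in overrides and line continuations.

```python
from typing import Optional


def active_restart_policy(unit_text: str) -> Optional[str]:
    """Return the last uncommented Restart= value in [Service], or None."""
    section = None
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ('#' or ';' at line start).
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "Restart":
                value = val.strip()
    return value
```

With the fix applied, the text of ssh.service should yield `None`, which leaves the restart of sshd to pmon rather than systemd.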

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Manasses julio da silva junior (manassesjulio)
importance: Undecided → Low
Ghada Khalil (gkhalil)
tags: added: stx.8.0 stx.fault