‘collectd’ process failure detected due to missing pmon pidfile, noted after stx-monitor runtime manifest apply

Bug #1844724 reported by John Kung
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Persistent 200.006 alarm raised for "compute is degraded due to the failure of its 'collectd' process"

Severity
--------

Major: The alarm raised is a major degrade alarm; however, it is a false alarm since the process is actually running.

Steps to Reproduce
------------------
The issue was observed during the following steps; however, it should be generic to any pmon-initiated restart of collectd:

One method:
Apply the stx-monitor application to trigger a restart of collectd via its runtime manifest.
# create the stx-monitor-1.0-1.tgz from build
build-tools/build-helm-charts.sh --verbose --app stx-monitor

# upload and apply the app
system application-upload stx-monitor-1.0-1.tgz
system application-apply stx-monitor

# can also update metricbeat overrides and perform application-apply again.
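To sanity-check the apply and look for the symptom afterwards, something like the following can be run on the controller (the 200.006 filter is just an illustration; exact output formats may differ):

# wait for stx-monitor to reach the 'applied' state
system application-list

# check whether the 200.006 collectd process failure alarm has been raised
fm alarm-list | grep 200.006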

Expected Behavior
------------------
The process should be automatically restarted without a persistent degrade alarm.

Actual Behavior
----------------
The process is restarted; however, the pidfile is missing (or removed), which leads pmon to raise the degrade alarm.
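One way to observe the symptom on the affected worker node is to compare the running process against its pidfile (the pidfile path is the same one used in the soak scripts further down; the commands are illustrative):

# collectd is actually running ...
pgrep -x collectd

# ... yet its pidfile is missing
ls -l /var/run/collectd.pid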

Reproducibility
---------------
Intermittent; estimated at roughly 5% of restarts.

System Configuration
--------------------
Multi-node system, IPv6

Branch/Pull Time/Commit
-----------------------
2019-09-17

Timestamp/Logs
--------------
| 2019-09-19T18:42:53.122527 | set | 200.006 | compute-12 is degraded due to the failure of its 'collectd' process. Auto recovery of this major process is in progress. | host=compute-12.process=collectd | major |

2019-09-19T18:42:13.000 [35470.01243] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:14.000 [35470.01244] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:14.000 [35470.01245] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3279726)
2019-09-19T18:42:14.000 [35470.01246] compute-12 pmond mon pmonHdlr.cpp (1205) respawn_process : Info : collectd restart of running process
2019-09-19T18:42:14.000 [35470.01247] compute-12 pmond mon pmonHdlr.cpp ( 942) kill_running_process : Warn : collectd kill succeeded (3279726)
2019-09-19T18:42:14.000 [35470.01248] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280578)
2019-09-19T18:42:14.000 [35470.01249] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:15.525 [35470.01250] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280587)
2019-09-19T18:42:15.525 [35470.01251] compute-12 pmond mon pmonHdlr.cpp (1475) daemon_sigchld_hdlr : Info : collectd spawned in 1.524 secs
2019-09-19T18:42:18.500 [35470.01252] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:19.500 [35470.01253] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:19.500 [35470.01254] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3280587)
2019-09-19T18:42:19.500 [35470.01255] compute-12 pmond mon pmonHdlr.cpp (1003) process_running : Info : collectd process not running
2019-09-19T18:42:19.500 [35470.01256] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280947)
2019-09-19T18:42:19.500 [35470.01257] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:20.489 [35470.01258] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280965)
2019-09-19T18:42:52.918 [35470.01259] compute-12 pmond mon pmonHdlr.cpp (1063) _get_events : Warn : collectd Not Running
2019-09-19T18:42:52.923 [35470.01260] compute-12 pmond com nodeUtil.cpp (1840) get_system_state : Info : systemctl reports host as 'degraded'
2019-09-19T18:42:52.923 [35470.01261] compute-12 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : collectd failed (3280965) (p:0 a:0)
2019-09-19T18:42:52.923 [35470.01262] compute-12 pmond mon pmonHdlr.cpp (1568) manage_alarm : Warn : collectd Major Assert
2019-09-19T18:42:52.923 [35470.01263] compute-12 pmond mon pmonAlarm.cpp ( 251) pmonAlarm_major : Warn : compute-12 setting major 'collectd' process failure alarm (200.006.process=collectd)
2019-09-19T18:42:52.923 fmAPI.cpp(490): Enqueue raise alarm request: UUID (fad43169-7357-47c7-8943-8cfa3c0b4c77) alarm id (200.006) instant id (host=compute-12.process=collectd)

Test Activity
-------------
Developer Testing

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as low priority / not gating given the rate of occurrence is very low and there is no system impact other than a stale alarm. Still would be nice to investigate/resolve, but not gating.

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Actually, it's not a stale alarm; the process is failed and restarted over and over in an attempt to recover, while the process pidfile remains missing across each restart.

I think this is medium and gating.

I have a debug patch in the integration lab.

We need a reconfig soak to run to reproduce.
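For anyone trying to reproduce, the repeated restart/failure cycle shows up clearly in the pmond logs on the affected node. A rough way to watch it (the log path is an assumption; adjust to wherever pmond logs on your load):

# follow collectd spawn / failure events as pmond cycles the process
grep collectd /var/log/pmond.log | grep -E 'Spawn|failed|Not Running'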

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I'm unclear whether this is a pmond issue or a collectd issue, so I ran some 'pmon-restart' soaks over the last few days.

#!/bin/bash
# soak loop: repeatedly pmon-restart the process named in ${process}
# (e.g. export process=collectd before running)
while true
do
    if [ ! -e "/var/run/${process}.pid" ] ; then
        echo "Process not running ... exiting ..."
        exit 1
    else
        pmon-restart "${process}"
        sleep 30
    fi
done

export process=collectd ; pmon-restart of collectd ... 4 failures in 208 restarts
export process=acpid ; pmon-restart of acpid ... 0 failures in 303 restarts
export process=pci-irq-affinity-agent ; pmon-restart of pci-irq-affinity-agent ... 0 failures in 2061 restarts

Either the timing is different with collectd, or something specific to collectd is at play.

Going to make a small change to pmond to remove two suspect aspects and see if that makes a difference.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Soaking the collectd restart path with pmond out of the picture:

#!/bin/bash
# restart collectd directly via systemd (no pmond involvement) and
# verify that a fresh pidfile with a new PID appears after each restart
LOOPS=1000
old_pid=$(cat /var/run/collectd.pid)
for ((loop=0;loop<${LOOPS};loop++))
do
    date
    systemctl restart collectd
    sleep 5
    if [ ! -e /var/run/collectd.pid ] ; then
        echo "No PIDFILE"
        exit 1
    fi
    new_pid=$(cat /var/run/collectd.pid)
    if [ "$old_pid" == "$new_pid" ] ; then
        echo "Error: same PID"
        exit 2
    fi
    echo "New Pid $loop: $new_pid"
    old_pid=$new_pid
done

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

This is a pmond issue.

Each process config file can specify how long to wait (in seconds) after the process is started before expecting to see the pidfile and start monitoring that process.
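For reference, a sketch of the relevant part of such a config (e.g. /etc/pmon.d/collectd.conf); apart from the pidfile path and the startuptime option named in the fix below, the section header, field names and values are illustrative rather than copied from the real file:

[process]
process     = collectd
pidfile     = /var/run/collectd.pid
startuptime = 10   ; seconds to wait after a (re)start before expecting the pidfile
severity    = major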

The pmon-restart function is not honoring the startuptime wait before looking for the pidfile.

Update coded, testing in progress.
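Conceptually, the timing the fix needs to restore can be sketched at the shell level as follows (pmond itself is C++; this is only an illustration, and the two helper names are hypothetical placeholders):

# restart the monitored process, then honour its startuptime before judging it
kill_and_respawn_process "${process}"          # hypothetical: what pmon-restart triggers
sleep "${startuptime}"                         # wait the configured settle time
if [ ! -e "/var/run/${process}.pid" ] ; then   # only now look for the pidfile
    declare_process_failed "${process}"        # hypothetical: raise the 200.006 alarm
fi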

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687028

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/687028
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=8b757de303489715a0c8c63478fb30d601f60c5f
Submitter: Zuul
Branch: master

commit 8b757de303489715a0c8c63478fb30d601f60c5f
Author: Eric MacDonald <email address hidden>
Date: Mon Oct 7 09:14:33 2019 -0400

    Make pmon-restart honour startuptime config option

    Each monitored process's config file contains a startuptime
    label that specifies how many seconds it takes for that newly
    started process to stabilize and produce its pidfile.

    The pmon-restart feature needs to delay monitoring a
    newly restarted process for 'startuptime' seconds.

    Failing to do so can cause it to fail the restarted
    process too early if there is a delay in pidfile creation.

    Test Plan:

    PASS: Verify collectd pmon-restart function with soak ;
          > 5000+ collectd pmon-restarts.
    PASS: Verify pmond regression test suite (test-pmon-action.sh)
          > restart command ; graceful restart all monitored processes. (5 loops)
          > kill command ; kill and recover all monitored processes. (5 loops)

    Regression:

    PASS: Verify pmon-stop command/function
    PASS: Verify pmon-start command/function also honors the startuptime.
    PASS: Verify pmon-stop auto start after auto-start timeout
    PASS: Verify System Install
    PASS: Verify Patching (soak)

    Change-Id: I9fd7bba8e49fe4c28281539ab4930bdac370ef11
    Closes-Bug: #1844724
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released