‘collectd’ process failure detected due to missing pmon pidfile, noted after stx-monitor runtime manifest apply

Bug #1844724 reported by John Kung
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
Persistent 200.006 alarm raised for "compute is degraded due to the failure of its 'collectd' process"

Severity
--------

Major: The alarm raised is a major degrade alarm; however, it is a false alarm since the process is actually running.

Steps to Reproduce
------------------
The issue was observed during the following steps; however, it should be generic to any pmon-initiated restart of collectd:

One method:
Apply the stx-monitor application to trigger a restart of collectd via its runtime manifest.
# create the stx-monitor-1.0-1.tgz from build
build-tools/build-helm-charts.sh --verbose --app stx-monitor

# upload and apply the app
system application-upload stx-monitor-1.0-1.tgz
system application-apply stx-monitor

# can also update metricbeat overrides and perform application-apply again.
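To sanity-check the apply and look for the symptom afterwards, something like the following can be run on the controller (the 200.006 filter is just an illustration; exact output formats may differ):

# wait for stx-monitor to reach the 'applied' state
system application-list

# check whether the 200.006 collectd process failure alarm has been raised
fm alarm-list | grep 200.006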

Expected Behavior
------------------
The process should be automatically restarted without a persistent degrade alarm.

Actual Behavior
----------------
The process is restarted; however, the pidfile is missing (or removed), which leads pmon to raise the degrade alarm.
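One way to observe the symptom on the affected worker node is to compare the running process against its pidfile (the pidfile path is the same one used in the soak scripts further down; the commands are illustrative):

# collectd is actually running ...
pgrep -x collectd

# ... yet its pidfile is missing
ls -l /var/run/collectd.pid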

Reproducibility
---------------
Intermittent; estimated at roughly 5% of restarts.

System Configuration
--------------------
Multi-node system, IPv6

Branch/Pull Time/Commit
-----------------------
2019-09-17

Timestamp/Logs
--------------
| 2019-09-19T18:42:53.122527 | set | 200.006 | compute-12 is degraded due to the failure of its 'collectd' process. Auto recovery of this major process is in progress. | host=compute-12.process=collectd | major |

2019-09-19T18:42:13.000 [35470.01243] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:14.000 [35470.01244] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:14.000 [35470.01245] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3279726)
2019-09-19T18:42:14.000 [35470.01246] compute-12 pmond mon pmonHdlr.cpp (1205) respawn_process : Info : collectd restart of running process
2019-09-19T18:42:14.000 [35470.01247] compute-12 pmond mon pmonHdlr.cpp ( 942) kill_running_process : Warn : collectd kill succeeded (3279726)
2019-09-19T18:42:14.000 [35470.01248] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280578)
2019-09-19T18:42:14.000 [35470.01249] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:15.525 [35470.01250] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280587)
2019-09-19T18:42:15.525 [35470.01251] compute-12 pmond mon pmonHdlr.cpp (1475) daemon_sigchld_hdlr : Info : collectd spawned in 1.524 secs
2019-09-19T18:42:18.500 [35470.01252] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:19.500 [35470.01253] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:19.500 [35470.01254] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3280587)
2019-09-19T18:42:19.500 [35470.01255] compute-12 pmond mon pmonHdlr.cpp (1003) process_running : Info : collectd process not running
2019-09-19T18:42:19.500 [35470.01256] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280947)
2019-09-19T18:42:19.500 [35470.01257] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:20.489 [35470.01258] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280965)
2019-09-19T18:42:52.918 [35470.01259] compute-12 pmond mon pmonHdlr.cpp (1063) _get_events : Warn : collectd Not Running
2019-09-19T18:42:52.923 [35470.01260] compute-12 pmond com nodeUtil.cpp (1840) get_system_state : Info : systemctl reports host as 'degraded'
2019-09-19T18:42:52.923 [35470.01261] compute-12 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : collectd failed (3280965) (p:0 a:0)
2019-09-19T18:42:52.923 [35470.01262] compute-12 pmond mon pmonHdlr.cpp (1568) manage_alarm : Warn : collectd Major Assert
2019-09-19T18:42:52.923 [35470.01263] compute-12 pmond mon pmonAlarm.cpp ( 251) pmonAlarm_major : Warn : compute-12 setting major 'collectd' process failure alarm (200.006.process=collectd)
2019-09-19T18:42:52.923 fmAPI.cpp(490): Enqueue raise alarm request: UUID (fad43169-7357-47c7-8943-8cfa3c0b4c77) alarm id (200.006) instant id (host=compute-12.process=collectd)

Test Activity
-------------
Developer Testing

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as low priority / not gating given the rate of occurrence is very low and there is no system impact other than a stale alarm. Still would be nice to investigate/resolve, but not gating.

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Actually, it's not a stale alarm; the process is failed and restarted over and over in an attempt to recover, while the process pidfile remains missing across each restart.

I think this is medium and gating.

I have a debug patch in the integration lab.

We need a reconfig soak to run to reproduce.
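For anyone trying to reproduce, the repeated restart/failure cycle shows up clearly in the pmond logs on the affected node. A rough way to watch it (the log path is an assumption; adjust to wherever pmond logs on your load):

# follow collectd spawn / failure events as pmond cycles the process
grep collectd /var/log/pmond.log | grep -E 'Spawn|failed|Not Running'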

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I'm unclear whether this is a pmond issue or a collectd issue, so I ran some 'pmon-restart' soaks over the last few days.

#!/bin/bash
# soak loop: repeatedly pmon-restart the process named in ${process}
# (e.g. export process=collectd before running)
while true
do
    if [ ! -e "/var/run/${process}.pid" ] ; then
        echo "Process not running ... exiting ..."
        exit 1
    else
        pmon-restart "${process}"
        sleep 30
    fi
done

export process=collectd ; pmon-restart of collectd ... 4 failures in 208 restarts
export process=acpid ; pmon-restart of acpid ... 0 failures in 303 restarts
export process=pci-irq-affinity-agent ; pmon-restart of pci-irq-affinity-agent ... 0 failures in 2061 restarts

Either the timing is different with collectd, or something specific to collectd is at play.

Going to make a small change to pmond to remove two suspect aspects and see if that makes a difference.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Soaking the collectd restart path with pmond out of the picture:

#!/bin/bash
# restart collectd directly via systemd (no pmond involvement) and
# verify that a fresh pidfile with a new PID appears after each restart
LOOPS=1000
old_pid=$(cat /var/run/collectd.pid)
for ((loop=0;loop<${LOOPS};loop++))
do
    date
    systemctl restart collectd
    sleep 5
    if [ ! -e /var/run/collectd.pid ] ; then
        echo "No PIDFILE"
        exit 1
    fi
    new_pid=$(cat /var/run/collectd.pid)
    if [ "$old_pid" == "$new_pid" ] ; then
        echo "Error: same PID"
        exit 2
    fi
    echo "New Pid $loop: $new_pid"
    old_pid=$new_pid
done

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

This is a pmond issue.

Each process config file can specify how long to wait (in seconds) after the process is started before expecting to see the pidfile and start monitoring that process.
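For reference, a sketch of the relevant part of such a config (e.g. /etc/pmon.d/collectd.conf); apart from the pidfile path and the startuptime option named in the fix below, the section header, field names and values are illustrative rather than copied from the real file:

[process]
process     = collectd
pidfile     = /var/run/collectd.pid
startuptime = 10   ; seconds to wait after a (re)start before expecting the pidfile
severity    = major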

The pmon-restart function is not honoring the startuptime wait before looking for the pidfile.

Update coded, testing in progress.
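Conceptually, the timing the fix needs to restore can be sketched at the shell level as follows (pmond itself is C++; this is only an illustration, and the two helper names are hypothetical placeholders):

# restart the monitored process, then honour its startuptime before judging it
kill_and_respawn_process "${process}"          # hypothetical: what pmon-restart triggers
sleep "${startuptime}"                         # wait the configured settle time
if [ ! -e "/var/run/${process}.pid" ] ; then   # only now look for the pidfile
    declare_process_failed "${process}"        # hypothetical: raise the 200.006 alarm
fi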

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687028

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/687028
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=8b757de303489715a0c8c63478fb30d601f60c5f
Submitter: Zuul
Branch: master

commit 8b757de303489715a0c8c63478fb30d601f60c5f
Author: Eric MacDonald <email address hidden>
Date: Mon Oct 7 09:14:33 2019 -0400

    Make pmon-restart honour startuptime config option

    Each monitored process's config file contains a startuptime
    label that specifies how many seconds it takes for that newly
    started process to stabilize and produce its pidfile.

    The pmon-restart feature needs to delay monitoring a
    newly restarted process for 'startuptime' seconds.

    Failing to do so can cause it to fail the restarted
    process too early if there is a delay in pidfile creation.

    Test Plan:

    PASS: Verify collectd pmon-restart function with soak ;
          > 5000+ collectd pmon-restarts.
    PASS: Verify pmond regression test suite (test-pmon-action.sh)
          > restart command ; graceful restart all monitored processes. (5 loops)
          > kill command ; kill and recover all monitored processes. (5 loops)

    Regression:

    PASS: Verify pmon-stop command/function
    PASS: Verify pmon-start command/function also honors the startuptime.
    PASS: Verify pmon-stop auto start after auto-start timeout
    PASS: Verify System Install
    PASS: Verify Patching (soak)

    Change-Id: I9fd7bba8e49fe4c28281539ab4930bdac370ef11
    Closes-Bug: #1844724
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released