Brief Description
-----------------
Persistent 200.006 alarm raised for "compute is degraded due to the failure of its 'collectd' process"
Severity
--------
Major: The alarm raised is a major degrade alarm; however, it is a false alarm because the process is actually running.
Steps to Reproduce
------------------
The issue was observed during the following steps; however, it should be generic to any pmon-initiated restart of collectd.
One method:
Apply the stx-monitor application to trigger a restart of collectd via its runtime manifest.
# create the stx-monitor-1.0-1.tgz from build
build-tools/build-helm-charts.sh --verbose --app stx-monitor
# upload and apply the app
system application-upload stx-monitor-1.0-1.tgz
system application-apply stx-monitor
# can also update metricbeat overrides and perform application-apply again.
Expected Behavior
------------------
The process should be restarted automatically without a persistent degrade alarm.
Actual Behavior
----------------
The process is restarted; however, its pidfile is missing (or removed), which leads pmon to raise the degrade alarm.
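The condition described above (process alive, pidfile absent) can be checked by hand on the affected host. The following is a minimal sketch, not pmon's actual logic: the helper name `check_pidfile` is hypothetical, and the pidfile path passed to it is an assumption that should be taken from the collectd pmon configuration on the system under test.

```shell
#!/bin/sh
# Hypothetical helper (not part of pmon): classify a daemon's pidfile state.
#   $1 = pidfile path (assumed; take it from the process's pmon config)
#   $2 = process name as shown by ps
# Prints one of: ok, pidfile-stale, pidfile-missing, not-running.
check_pidfile() {
    pidfile=$1
    name=$2
    if [ -r "$pidfile" ]; then
        pid=$(cat "$pidfile")
        # kill -0 delivers no signal; it only tests that the PID exists.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            echo ok
        else
            echo pidfile-stale
        fi
    elif pgrep -x "$name" >/dev/null 2>&1; then
        # Process is alive but has no pidfile: the state reported in this bug.
        echo pidfile-missing
    else
        echo not-running
    fi
}
```

For example, `check_pidfile /var/run/collectd.pid collectd` (path assumed) would be expected to print `pidfile-missing` while the false alarm is present. Run it as root so that `kill -0` is permitted to probe the daemon's PID.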
Reproducibility
---------------
Intermittent; estimated at 5%
System Configuration
--------------------
Multi-node system, IPv6
Branch/Pull Time/Commit
-----------------------
2019-09-17
Timestamp/Logs
--------------
| 2019-09-19T18:42:53.122527 | set | 200.006 | compute-12 is degraded due to the failure of its 'collectd' process. Auto recovery of this major process is in progress. | host=compute-12.process=collectd | major |
2019-09-19T18:42:13.000 [35470.01243] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:14.000 [35470.01244] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:14.000 [35470.01245] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3279726)
2019-09-19T18:42:14.000 [35470.01246] compute-12 pmond mon pmonHdlr.cpp (1205) respawn_process : Info : collectd restart of running process
2019-09-19T18:42:14.000 [35470.01247] compute-12 pmond mon pmonHdlr.cpp ( 942) kill_running_process : Warn : collectd kill succeeded (3279726)
2019-09-19T18:42:14.000 [35470.01248] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280578)
2019-09-19T18:42:14.000 [35470.01249] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:15.525 [35470.01250] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280587)
2019-09-19T18:42:15.525 [35470.01251] compute-12 pmond mon pmonHdlr.cpp (1475) daemon_sigchld_hdlr : Info : collectd spawned in 1.524 secs
2019-09-19T18:42:18.500 [35470.01252] compute-12 pmond mon pmonMsg.cpp ( 701) pmon_service_inbox : Info : collectd process-restart ; by request
2019-09-19T18:42:19.500 [35470.01253] compute-12 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : collectd stability period (10 secs)
2019-09-19T18:42:19.500 [35470.01254] compute-12 pmond mon pmonHdlr.cpp (1103) unregister_process : Info : collectd unregistered (3280587)
2019-09-19T18:42:19.500 [35470.01255] compute-12 pmond mon pmonHdlr.cpp (1003) process_running : Info : collectd process not running
2019-09-19T18:42:19.500 [35470.01256] compute-12 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : collectd Spawn (3280947)
2019-09-19T18:42:19.500 [35470.01257] compute-12 pmond mon pmonFsm.cpp ( 577) pmon_passive_handler : Info : collectd Restarted
2019-09-19T18:42:20.489 [35470.01258] compute-12 pmond mon pmonHdlr.cpp (1138) register_process : Info : collectd Registered (3280965)
2019-09-19T18:42:52.918 [35470.01259] compute-12 pmond mon pmonHdlr.cpp (1063) _get_events : Warn : collectd Not Running
2019-09-19T18:42:52.923 [35470.01260] compute-12 pmond com nodeUtil.cpp (1840) get_system_state : Info : systemctl reports host as 'degraded'
2019-09-19T18:42:52.923 [35470.01261] compute-12 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : collectd failed (3280965) (p:0 a:0)
2019-09-19T18:42:52.923 [35470.01262] compute-12 pmond mon pmonHdlr.cpp (1568) manage_alarm : Warn : collectd Major Assert
2019-09-19T18:42:52.923 [35470.01263] compute-12 pmond mon pmonAlarm.cpp ( 251) pmonAlarm_major : Warn : compute-12 setting major 'collectd' process failure alarm (200.006.process=collectd)
2019-09-19T18:42:52.923 fmAPI.cpp(490): Enqueue raise alarm request: UUID (fad43169-7357-47c7-8943-8cfa3c0b4c77) alarm id (200.006) instant id (host=compute-12.process=collectd)
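
To see the restart sequence without the surrounding noise, the collectd lifecycle events can be filtered out of the pmond log. This is a rough sketch: the function name `collectd_events` is hypothetical, and the message strings it matches are copied from the excerpt above, so they may differ in other builds.

```shell
#!/bin/sh
# Hypothetical helper: print only collectd lifecycle events from a pmond log.
#   $1 = path to the pmond log file
# The patterns are taken verbatim from the messages in this report.
collectd_events() {
    grep -E 'collectd (process-restart|Spawn|Registered|Not Running|failed)' "$1"
}
```

Usage would look like `collectd_events /var/log/pmond.log`, where the log path is an assumption. In the excerpt above, this filter shows two restart requests about five seconds apart (18:42:13 and 18:42:18) followed by the "Not Running" event at 18:42:52.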
Test Activity
-------------
Developer Testing
Marking as low priority / not gating, given that the rate of occurrence is very low and there is no system impact other than a stale alarm. It would still be nice to investigate and resolve.