Comment 0 for bug 1827326

Gerry Kopec (gerry-kopec) wrote: 2830 patch-alarm-manager processes on active controller

Brief Description
-----------------
Due to networking issues, PV1 spent many hours with neither controller able to become active. controller-1 repeatedly attempted to go active but then reverted to standby. It appears that during this period a new patch-alarm-manager process was created on each attempt, but the old ones were never cleaned up on failure, leaving 2830 processes running.
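SM appears to have re-enabled patch-alarm-manager on every go-active attempt without the previous instance exiting. As a hypothetical guard (not how patch-alarm-manager or SM actually manage the process today), a daemon can refuse to run a second copy by holding an exclusive, non-blocking `flock` on a lock file for its whole lifetime. The function name, exception name, and lock path below are illustrative assumptions. This only stops duplicates from accumulating; it does not fix the missing cleanup on disable.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another process already holds the instance lock."""


def acquire_instance_lock(path):
    """Take an exclusive, non-blocking flock on `path` (hypothetical helper).

    Returns the open file object. Keep it open for the life of the process:
    closing it, or the process exiting, releases the lock.
    """
    f = open(path, "a+")
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another open file description already holds the lock.
        f.close()
        raise AlreadyRunning(path)
    # Record our PID for operators; the lock, not the PID, is authoritative.
    f.seek(0)
    f.truncate()
    f.write(str(os.getpid()))
    f.flush()
    return f
```

A daemon would call `acquire_instance_lock("/var/run/patch-alarm-manager.lock")` (hypothetical path) at startup and exit if `AlreadyRunning` is raised. Because the kernel drops a `flock` when the holder exits, a crashed instance does not leave a stale lock behind, unlike a bare PID file.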

Severity
--------
Minor

Steps to Reproduce
------------------
Not sure how to reproduce it myself.
Bin Qian described the networking problem:
Looks like it is a network issue. The message with uuid=09cd9b6b-f098-4468-a81e-f292ce0dd20d was received by controller-0 from an unknown lab over multicast IP 239.1.1.1. tcpdump shows that the message was sent from MAC 00:1e:67:68:0b:f0.
A similar message was received by controller-1.

2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653628> INFO: sm[88764]: sm_msg.c(367): Message instance (cfaf2360-1568-4103-812e-4a6415283268) changed for node (controller-1), now=09cd9b6b-f098-4468-a81e-f292ce0dd20d.
2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653629> INFO: sm[88764]: sm_msg.c(367): Message instance (09cd9b6b-f098-4468-a81e-f292ce0dd20d) changed for node (controller-1), now=cfaf2360-1568-4103-812e-4a6415283268.
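The two log lines show the message instance for controller-1 switching from one UUID to another and straight back, which fits Bin Qian's finding that a second sender was using the same node name. A minimal sketch for spotting this pattern in an SM log, assuming (not confirmed here) that a node's instance UUID normally changes only once per SM restart and never returns to an earlier value; the function name is hypothetical:

```python
import re
from collections import defaultdict

# Matches the sm_msg.c "Message instance ... changed" lines quoted above.
INSTANCE_CHANGE = re.compile(
    r"Message instance \((?P<old>[0-9a-f-]{36})\) changed for node "
    r"\((?P<node>[^)]+)\), now=(?P<new>[0-9a-f-]{36})"
)


def reverting_nodes(lines):
    """Return nodes whose message instance changed back to an earlier UUID.

    A reversion suggests two senders announcing the same node name.
    """
    history = defaultdict(list)  # node -> instance UUIDs in order seen
    flagged = set()
    for line in lines:
        m = INSTANCE_CHANGE.search(line)
        if not m:
            continue
        node, old, new = m.group("node"), m.group("old"), m.group("new")
        if not history[node]:
            history[node].append(old)
        if new in history[node]:
            flagged.add(node)
        history[node].append(new)
    return flagged
```

Feeding it the two lines above flags `controller-1`; a single change on its own is not flagged.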

Expected Behavior
------------------
There should only be 1 patch-alarm-manager process.

Actual Behavior
----------------
There were 2830 patch-alarm-manager processes.

Reproducibility
---------------
Not sure how likely this scenario is.

System Configuration
--------------------
Multi-node (2+10)

Branch/Pull Time/Commit
-----------------------
cengn load: 20190421T233001Z

Last Pass
---------
n/a

Timestamp/Logs
--------------
| 2019-04-23T09:34:16.923 | 557854 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:16.930 | 557871 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:21.940 | 557932 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:22.405 | 557966 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T09:34:22.947 | 558010 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:22.953 | 558041 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:24.443 | 558073 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:24.902 | 558077 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
<snip>
| 2019-04-23T16:14:50.394 | 965978 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:50.398 | 965984 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:14:55.433 | 966005 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:14:55.846 | 966013 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:14:58.981 | 966026 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:58.985 | 966028 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:15:01.492 | 966033 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:15:01.877 | 966034 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:15:03.999 | 966040 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
<snip>
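If each `enabling -> enabled-active` transition starts a new instance while the matching disable fails to stop the old one (a hypothesis consistent with this report, not confirmed), the number of enable successes in the full log should roughly track the leaked-process count. A minimal sketch for tallying transitions from the pipe-delimited service-scn rows, assuming every row has the column order shown above; the function name is hypothetical:

```python
from collections import Counter


def count_transitions(rows, service="patch-alarm-manager"):
    """Count (from_state, to_state) pairs for one service in service-scn rows.

    Assumed row layout, as in the excerpt above:
    | timestamp | seq | service-scn | service | from | to | reason
    Lines that do not match (e.g. "<snip>") are ignored.
    """
    counts = Counter()
    for row in rows:
        fields = [f.strip() for f in row.strip().strip("|").split("|")]
        if len(fields) < 6 or fields[2] != "service-scn" or fields[3] != service:
            continue
        counts[(fields[4], fields[5])] += 1
    return counts
```

Comparing `counts[("enabling", "enabled-active")]` over the whole log with the `ps` count below would show whether the two roughly agree.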

date; ps -ef | grep patch-alarm-manager | grep python
Thu May 2 05:05:15 UTC 2019
root 368 1 0 Apr23 ? 00:00:13 python /usr/bin/patch-alarm-manager start
root 457 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 468 1 0 Apr23 ? 00:00:14 python /usr/bin/patch-alarm-manager start
root 507 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 508 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 516 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 527 1 0 Apr23 ? 00:00:19 python /usr/bin/patch-alarm-manager start
<snip>
root 147355 1 0 Apr23 ? 00:00:20 python /usr/bin/patch-alarm-manager start
root 147357 1 0 Apr23 ? 00:00:11 python /usr/bin/patch-alarm-manager start
root 147378 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 147422 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 147427 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
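The `grep` pipeline above lists the instances but does not count them. A minimal sketch that extracts PID and start time from `ps -ef` output, assuming the standard eight columns (UID PID PPID C STIME TTY TIME CMD) with the command last; the function name is hypothetical. Every process shown above has PPID 1 and an Apr23 start time, matching the period of failed go-active attempts.

```python
def find_instances(ps_lines, marker="/usr/bin/patch-alarm-manager start"):
    """Return (pid, stime) for each `ps -ef` line whose command contains `marker`."""
    found = []
    for line in ps_lines:
        # Split into at most 8 fields so the command keeps its spaces.
        cols = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(cols) == 8 and marker in cols[7]:
            found.append((int(cols[1]), cols[4]))
    return found
```

`len(find_instances(output.splitlines()))` gives the instance count; the header line and the `grep` process itself are skipped because their command does not contain the marker.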

Test Activity
-------------
System Engineering