2830 patch-alarm-manager processes on active controller

Bug #1827326 reported by Gerry Kopec
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Don Penney
Milestone: (none listed)

Bug Description

Brief Description
-----------------
Due to networking issues, PV1 spent many hours with neither controller able to become active. controller-1 kept attempting to go active but would then revert to standby. It appears that during this period patch-alarm-manager processes kept being created but were never cleaned up on failure, leaving 2830 processes running.

Severity
--------
Minor

Steps to Reproduce
------------------
Not sure how to reproduce it myself.
Bin Qian described the networking problem:
Looks like it is a network issue. A message with uuid=09cd9b6b-f098-4468-a81e-f292ce0dd20d was received by controller-0 from an unknown lab over multicast IP 239.1.1.1. tcpdump shows that the message was sent from MAC 00:1e:67:68:0b:f0.
A similar message was received by controller-1.

2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653628> INFO: sm[88764]: sm_msg.c(367): Message instance (cfaf2360-1568-4103-812e-4a6415283268) changed for node (controller-1), now=09cd9b6b-f098-4468-a81e-f292ce0dd20d.
2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653629> INFO: sm[88764]: sm_msg.c(367): Message instance (09cd9b6b-f098-4468-a81e-f292ce0dd20d) changed for node (controller-1), now=cfaf2360-1568-4103-812e-4a6415283268.

Expected Behavior
------------------
There should only be 1 patch-alarm-manager process.

Actual Behavior
----------------
There were 2830 patch-alarm-manager processes.

Reproducibility
---------------
Not sure how likely this scenario is.

System Configuration
--------------------
Multi-node (2+10)

Branch/Pull Time/Commit
-----------------------
cengn load: 20190421T233001Z

Last Pass
---------
n/a

Timestamp/Logs
--------------
| 2019-04-23T09:34:16.923 | 557854 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:16.930 | 557871 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:21.940 | 557932 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:22.405 | 557966 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T09:34:22.947 | 558010 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:22.953 | 558041 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:24.443 | 558073 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:24.902 | 558077 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
<snip>
| 2019-04-23T16:14:50.394 | 965978 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:50.398 | 965984 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:14:55.433 | 966005 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:14:55.846 | 966013 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:14:58.981 | 966026 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:58.985 | 966028 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:15:01.492 | 966033 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:15:01.877 | 966034 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:15:03.999 | 966040 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
<snip>
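
Each "enable success" row above should correspond to one start request for the service, so counting those rows gives a rough upper bound on how many processes could have been leaked. A minimal sketch, assuming the pipe-delimited column layout shown in the excerpt; count_enables is a hypothetical helper name, not part of any StarlingX tool:

```shell
#!/bin/sh
# Count "enable success" transitions for one service in service-scn rows
# read from stdin. The column positions are assumed from the excerpt above:
# timestamp | id | type | service | from-state | to-state | reason
# (the leading "|" makes the timestamp field 2, so service is 5, reason 8).
count_enables() {
    service="$1"
    awk -F'|' -v svc="$service" '
        {
            # Strip surrounding whitespace from the fields being compared.
            name = $5; reason = $8
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", reason)
            if (name == svc && reason == "enable success") n++
        }
        END { print n + 0 }   # print 0 rather than an empty line
    '
}
```

Possible usage, with a hypothetical file name for a saved copy of the log: count_enables patch-alarm-manager < saved-scn.log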

date; ps -ef | grep patch-alarm-manager | grep python
Thu May 2 05:05:15 UTC 2019
root 368 1 0 Apr23 ? 00:00:13 python /usr/bin/patch-alarm-manager start
root 457 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 468 1 0 Apr23 ? 00:00:14 python /usr/bin/patch-alarm-manager start
root 507 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 508 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 516 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 527 1 0 Apr23 ? 00:00:19 python /usr/bin/patch-alarm-manager start
<snip>
root 147355 1 0 Apr23 ? 00:00:20 python /usr/bin/patch-alarm-manager start
root 147357 1 0 Apr23 ? 00:00:11 python /usr/bin/patch-alarm-manager start
root 147378 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 147422 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 147427 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
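
For a quick check of whether a controller is affected, the listing above can be reduced to a single count. This is only a sketch: count_instances is a hypothetical helper name, and the match string is taken from the command column of the ps output shown above.

```shell
#!/bin/sh
# Count patch-alarm-manager instances in `ps -ef` style output read from stdin.
# The "[p]ython" bracket form matches the same text as "python" but keeps a
# grep process visible in the ps output from matching its own command line.
count_instances() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function's exit status clean for callers using "set -e".
    grep -c '[p]ython /usr/bin/patch-alarm-manager start' || true
}
```

Possible usage: ps -ef | count_instances. Anything greater than 1 would indicate the leak described in this report.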

Can provide full logs on request.

Test Activity
-------------
System Engineering

Gerry Kopec (gerry-kopec) wrote :
summary: - 2830 patch-alarm-manager proceses on active controller
+ 2830 patch-alarm-manager processes on active controller
description: updated
Dariush Eslimi (deslimi) wrote :

Marking as stx.2.0 gating, as this is a resource leak and can prevent the system from functioning properly.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Don Penney (dpenney)
tags: added: stx.2.0 stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/657009

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657010

OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/657010
Committed: https://git.openstack.org/cgit/starlingx/update/commit/?id=69c9cb05687d34615f39ae0fdd5923d0a92b941f
Submitter: Zuul
Branch: master

commit 69c9cb05687d34615f39ae0fdd5923d0a92b941f
Author: Don Penney <email address hidden>
Date: Fri May 3 14:42:25 2019 -0400

    Fix bug in patch-alarm-manager start check

    The start function of the patch-alarm-manager init script
    checks for a valid pidfile to see if the process is
    already running. Unfortunately, the code has a couple
    of typos that cause the check to fail if the "start"
    is called when the process is already running.

    This commit fixes the typos.

    Change-Id: If46f03a5d042f949db9359d6ddd7f69790ccaf4f
    Closes-Bug: 1827326
    Signed-off-by: Don Penney <email address hidden>
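
For readers unfamiliar with this kind of check, here is a minimal sketch of a pidfile test that an init script's start function could run before launching a new process. It is not the code from the actual patch-alarm-manager init script; pidfile_is_live and the pidfile path in the usage comment are hypothetical.

```shell
#!/bin/sh
# Return 0 if the pidfile exists, holds a numeric PID, and that PID can be
# signalled; return non-zero otherwise. Sketch only, under the assumptions
# stated above.
pidfile_is_live() {
    pidfile="$1"
    [ -f "$pidfile" ] || return 1
    pid=$(cat "$pidfile" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    # Signal 0 performs the existence/permission check without sending a
    # signal. It also fails for processes the caller may not signal, so this
    # is most reliable when run as root, as init scripts normally are.
    kill -0 "$pid" 2>/dev/null
}

# Hypothetical use inside start():
#   if pidfile_is_live /var/run/example-daemon.pid; then
#       echo "already running"
#       return 0
#   fi
```

With a guard of this shape, a repeated "start" while the process is alive returns early instead of spawning another instance.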

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master)

Reviewed: https://review.opendev.org/657009
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=f3c9e6640e5704c6128e456e75a52a933557f3dc
Submitter: Zuul
Branch: master

commit f3c9e6640e5704c6128e456e75a52a933557f3dc
Author: Don Penney <email address hidden>
Date: Fri May 3 14:45:35 2019 -0400

    Fix bug in logmgmt start check

    The start function of the logmgmt init script
    checks for a valid pidfile to see if the process is
    already running. Unfortunately, the code has a couple
    of typos that cause the check to fail if the "start"
    is called when the process is already running.

    This commit fixes the typos.

    Change-Id: I5795d23cc9e41a18b62e35bf3df07817522efe52
    Related-Bug: 1827326
    Signed-off-by: Don Penney <email address hidden>

Ghada Khalil (gkhalil)
tags: added: stx.update
Gerry Kopec (gerry-kopec) wrote :

Verified this on build 2019-05-09_16-05-20.

controller-0:/home/wrsroot# ps -ef | grep patch-alarm-manager | grep python
root 1202491 1 0 May14 ? 00:00:22 python /usr/bin/patch-alarm-manager start

tags: removed: stx.retestneeded