2830 patch-alarm-manager processes on active controller

Bug #1827326 reported by Gerry Kopec
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Don Penney

Bug Description

Brief Description
-----------------
Due to networking issues, PV1 spent many hours with neither controller able to become active. controller-1 kept attempting to go active but then reverted to standby. During this period, patch-alarm-manager processes appear to have been created repeatedly and never cleaned up on failure, leaving 2830 processes running.

Severity
--------
Minor

Steps to Reproduce
------------------
Not sure how to reproduce it myself.
Bin Qian described the networking problem:
Looks like it is a network issue. A message with uuid=09cd9b6b-f098-4468-a81e-f292ce0dd20d was received by controller-0 from an unknown lab over multicast IP 239.1.1.1. tcpdump shows that the message was sent from MAC 00:1e:67:68:0b:f0.
A similar message was received by controller-1.

2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653628> INFO: sm[88764]: sm_msg.c(367): Message instance (cfaf2360-1568-4103-812e-4a6415283268) changed for node (controller-1), now=09cd9b6b-f098-4468-a81e-f292ce0dd20d.
2019-04-23T19:39:35.000 controller-0 sm: debug time[42573.674] log<653629> INFO: sm[88764]: sm_msg.c(367): Message instance (09cd9b6b-f098-4468-a81e-f292ce0dd20d) changed for node (controller-1), now=cfaf2360-1568-4103-812e-4a6415283268.

Expected Behavior
------------------
There should only be 1 patch-alarm-manager process.

Actual Behavior
----------------
There were 2830 patch-alarm-manager processes.

Reproducibility
---------------
Not sure how likely this scenario is.

System Configuration
--------------------
Multi-node (2+10)

Branch/Pull Time/Commit
-----------------------
cengn load: 20190421T233001Z

Last Pass
---------
n/a

Timestamp/Logs
--------------
| 2019-04-23T09:34:16.923 | 557854 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:16.930 | 557871 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:21.940 | 557932 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:22.405 | 557966 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T09:34:22.947 | 558010 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T09:34:22.953 | 558041 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T09:34:24.443 | 558073 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T09:34:24.902 | 558077 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
<snip>
| 2019-04-23T16:14:50.394 | 965978 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:50.398 | 965984 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:14:55.433 | 966005 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:14:55.846 | 966013 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:14:58.981 | 966026 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
| 2019-04-23T16:14:58.985 | 966028 | service-scn | patch-alarm-manager | disabling | disabled | disable success
| 2019-04-23T16:15:01.492 | 966033 | service-scn | patch-alarm-manager | disabled | enabling | enabled-active state requested
| 2019-04-23T16:15:01.877 | 966034 | service-scn | patch-alarm-manager | enabling | enabled-active | enable success
| 2019-04-23T16:15:03.999 | 966040 | service-scn | patch-alarm-manager | enabled-active | disabling | disable state requested
<snip>

date; ps -ef | grep patch-alarm-manager | grep python
Thu May 2 05:05:15 UTC 2019
root 368 1 0 Apr23 ? 00:00:13 python /usr/bin/patch-alarm-manager start
root 457 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 468 1 0 Apr23 ? 00:00:14 python /usr/bin/patch-alarm-manager start
root 507 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 508 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 516 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
root 527 1 0 Apr23 ? 00:00:19 python /usr/bin/patch-alarm-manager start
<snip>
root 147355 1 0 Apr23 ? 00:00:20 python /usr/bin/patch-alarm-manager start
root 147357 1 0 Apr23 ? 00:00:11 python /usr/bin/patch-alarm-manager start
root 147378 1 0 Apr23 ? 00:00:17 python /usr/bin/patch-alarm-manager start
root 147422 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start
root 147427 1 0 Apr23 ? 00:00:16 python /usr/bin/patch-alarm-manager start
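
The count of 2830 comes from the ps listing above. As a rough sketch only (not part of the original report), the same check could be scripted. It assumes the default `ps -ef` column layout (UID PID PPID C STIME TTY TIME CMD) and the command line shown in the listing; the function name count_daemon_processes is hypothetical.

```python
def count_daemon_processes(ps_output,
                           command="python /usr/bin/patch-alarm-manager"):
    """Count lines of `ps -ef` output whose CMD column starts with `command`.

    Assumes the default `ps -ef` layout: UID PID PPID C STIME TTY TIME CMD.
    """
    count = 0
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the CMD column keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].startswith(command):
            count += 1
    return count


if __name__ == "__main__":
    # Three lines copied from the listing above, plus a grep line that
    # must not be counted.
    sample = "\n".join([
        "root 368 1 0 Apr23 ? 00:00:13 python /usr/bin/patch-alarm-manager start",
        "root 457 1 0 Apr23 ? 00:00:12 python /usr/bin/patch-alarm-manager start",
        "root 468 1 0 Apr23 ? 00:00:14 python /usr/bin/patch-alarm-manager start",
        "root 999 1 0 05:05 pts/0 00:00:00 grep patch-alarm-manager",
    ])
    print(count_daemon_processes(sample))
```

Matching on the CMD column rather than on the whole line keeps the grep process itself out of the count, which is what the second `grep python` in the original command was doing.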

Can provide full logs on request.

Test Activity
-------------
System Engineering

Gerry Kopec (gerry-kopec) wrote :
summary: - 2830 patch-alarm-manager proceses on active controller
+ 2830 patch-alarm-manager processes on active controller
description: updated
Dariush Eslimi (deslimi) wrote :

Marking as stx.2.0 gating, as this is a resource leak and can prevent the system from functioning properly.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Don Penney (dpenney)
tags: added: stx.2.0 stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to integ (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/657009

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657010

OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/657010
Committed: https://git.openstack.org/cgit/starlingx/update/commit/?id=69c9cb05687d34615f39ae0fdd5923d0a92b941f
Submitter: Zuul
Branch: master

commit 69c9cb05687d34615f39ae0fdd5923d0a92b941f
Author: Don Penney <email address hidden>
Date: Fri May 3 14:42:25 2019 -0400

    Fix bug in patch-alarm-manager start check

    The start function of the patch-alarm-manager init script
    checks for a valid pidfile to see if the process is
    already running. Unfortunately, the code has a couple
    of typos that cause the check to fail if the "start"
    is called when the process is already running.

    This commit fixes the typos.

    Change-Id: If46f03a5d042f949db9359d6ddd7f69790ccaf4f
    Closes-Bug: 1827326
    Signed-off-by: Don Penney <email address hidden>
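
For illustration only: the commit message describes a start function that should refuse to launch a second copy when a valid pidfile already names a running process. The real init script is not reproduced in this report, so the following is a minimal sketch of that kind of check, written in Python rather than as the script itself. The function names pid_is_running and already_running are hypothetical, and the sketch assumes a Linux host.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists.

    Signal 0 performs the existence and permission checks without
    actually delivering a signal.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


def already_running(pidfile):
    """Return True only if the pidfile exists and names a live process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # Missing or corrupt pidfile: treat the daemon as not running.
        return False
    # Reject non-positive values; PID 0 would address the whole process group.
    return pid > 0 and pid_is_running(pid)
```

A start routine would call already_running() first and return early when it is true. If that check always reports "not running", every start request spawns another process, which matches the leak seen in this bug. Note that this sketch does not verify that the live PID actually belongs to the daemon; a more careful check would also compare the process command line.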

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to integ (master)

Reviewed: https://review.opendev.org/657009
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=f3c9e6640e5704c6128e456e75a52a933557f3dc
Submitter: Zuul
Branch: master

commit f3c9e6640e5704c6128e456e75a52a933557f3dc
Author: Don Penney <email address hidden>
Date: Fri May 3 14:45:35 2019 -0400

    Fix bug in logmgmt start check

    The start function of the logmgmt init script
    checks for a valid pidfile to see if the process is
    already running. Unfortunately, the code has a couple
    of typos that cause the check to fail if the "start"
    is called when the process is already running.

    This commit fixes the typos.

    Change-Id: I5795d23cc9e41a18b62e35bf3df07817522efe52
    Related-Bug: 1827326
    Signed-off-by: Don Penney <email address hidden>

Ghada Khalil (gkhalil)
tags: added: stx.update
Gerry Kopec (gerry-kopec) wrote :

Verified this on build 2019-05-09_16-05-20.

controller-0:/home/wrsroot# ps -ef | grep patch-alarm-manager | grep python
root 1202491 1 0 May14 ? 00:00:22 python /usr/bin/patch-alarm-manager start

tags: removed: stx.retestneeded