AIO: PCI-IRQ-Affinity-Agent repeated restarts due to pmon

Bug #1839525 reported by Jim Gauld
This bug affects 2 people
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        zhipeng liu

Bug Description

Brief Description
-----------------
On an AIO-DX system, PCI-IRQ-Affinity-Agent restarts intermittently before worker configuration is complete. This may leave the controller degraded. See the restart logs in /var/log/pmond.log.

2019-07-11T17:43:43.476 [111350.11006] controller-0 mtcAgent hbs nodeClass.cpp (5394) critical_process_failed :Error : controller-0 has critical 'kubelet' process failure
. .
2019-07-11T19:33:33.643 [102225.00518] controller-0 mtcAgent hbs nodeClass.cpp (5175) log_process_failure : Warn : controller-1 pmon: 'pci-irq-affinity-agent' process failed and is being auto recovered
2019-07-11T19:33:35.297 [102225.00519] controller-0 mtcAgent hbs nodeClass.cpp (5175) log_process_failure : Warn : controller-0 pmon: 'pci-irq-affinity-agent' process failed and is being auto recovered
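
To gauge how often the restarts occur, warnings like the ones above can be tallied per host and process. The following is a minimal sketch, not part of the original report: the regular expression is inferred only from the two log lines quoted here, and the function name count_recoveries is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the mtcAgent warnings quoted above (an assumption,
# not an official log format definition).
RECOVERY_RE = re.compile(
    r"(?P<host>\S+) pmon: '(?P<process>[^']+)' "
    r"process failed and is being auto recovered"
)


def count_recoveries(lines):
    """Return a Counter keyed by (host, process) for auto-recovery warnings."""
    counts = Counter()
    for line in lines:
        match = RECOVERY_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("process"))] += 1
    return counts


sample = [
    "2019-07-11T19:33:33.643 [102225.00518] controller-0 mtcAgent hbs "
    "nodeClass.cpp (5175) log_process_failure : Warn : controller-1 pmon: "
    "'pci-irq-affinity-agent' process failed and is being auto recovered",
    "2019-07-11T19:33:35.297 [102225.00519] controller-0 mtcAgent hbs "
    "nodeClass.cpp (5175) log_process_failure : Warn : controller-0 pmon: "
    "'pci-irq-affinity-agent' process failed and is being auto recovered",
]

for (host, process), n in sorted(count_recoveries(sample).items()):
    print(host, process, n)
```

On a live system the same function could be fed the lines of the relevant log file instead of the sample list.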

Eric MacDonald and I investigated this on R430_3_4 (July 10-15 timeframe) for kubelet.service and PCI-IRQ-Affinity-Agent. We determined that one line of pmon configuration was missing: the setting that keeps PMON from taking action until worker configuration is complete. This has since been corrected for kubelet.service, but is still missing for PCI-IRQ-Affinity-Agent.

I corrected the problem for kubelet.service by adding the following config setting in stx-config:

diff --git a/puppet-manifests/src/modules/platform/templates/kubelet-pmond-conf.erb b/puppet-manifests/src/modules/platform/templates/kubelet-pmond-conf.erb
index 5ad4466..ce6832d 100644
--- a/puppet-manifests/src/modules/platform/templates/kubelet-pmond-conf.erb
+++ b/puppet-manifests/src/modules/platform/templates/kubelet-pmond-conf.erb
@@ -13,3 +13,4 @@ restarts = 3 ; restarts before error assertion
 startuptime = 5 ; seconds to wait after process start
 interval = 5 ; number of seconds to wait between restarts
 debounce = 20 ; number of seconds to wait before degrade clear
+subfunction = last-config ; run monitor only after last config is run

I also tested the following one-line change in stx-integ, but it was not delivered with the kubelet affinity changes, so it is still missing.

diff --git a/utilities/pci-irq-affinity-agent/files/pci-irq-affinity-agent.conf b/utilities/pci-irq-affinity-agent/files/pci-irq-affinity-agent.conf
index 544cee0..a40a13c 100644
--- a/utilities/pci-irq-affinity-agent/files/pci-irq-affinity-agent.conf
+++ b/utilities/pci-irq-affinity-agent/files/pci-irq-affinity-agent.conf
@@ -7,3 +7,4 @@ severity = major ; minor, major, critical
 restarts = 3 ; restarts before error assertion
 interval = 5 ; number of seconds to wait between restarts
 debounce = 20 ; number of seconds to wait before degrade clear
+subfunction = last-config
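
A quick way to confirm whether a given pmon conf file carries the setting is to parse its "key = value ; comment" lines. This is a rough sketch that assumes only the line syntax visible in the diffs above; the function names parse_pmon_conf and waits_for_last_config are hypothetical, and section headers, if present, are simply skipped.

```python
def parse_pmon_conf(text):
    """Parse 'key = value ; comment' lines into a dict, skipping headers."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing ';' comments
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def waits_for_last_config(text):
    """True if the conf text contains 'subfunction = last-config'."""
    return parse_pmon_conf(text).get("subfunction") == "last-config"


# The three context lines from the diff above, before and after the fix.
before = (
    "restarts = 3 ; restarts before error assertion\n"
    "interval = 5 ; number of seconds to wait between restarts\n"
    "debounce = 20 ; number of seconds to wait before degrade clear\n"
)
after = before + "subfunction = last-config\n"

print(waits_for_last_config(before), waits_for_last_config(after))
```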

After this line is added, expect to see the following in /var/log/pmond.log:
2019-07-11T15:10:59.476 [104190.00093] controller-0 pmond mon pmonFsm.cpp ( 890) pmon_passive_handler : Warn : pci-irq-affinity-agent monitoring is waiting on /var/run/.worker_config_complete
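
The gating behaviour implied by that log line can be modelled as follows. This is only an illustrative sketch, not pmond's actual implementation: it assumes that a process with subfunction = last-config is left unmonitored until the flag file named in the log exists, and that a process without the setting is monitored immediately. The function name monitoring_enabled is hypothetical.

```python
import os
import tempfile

# Flag path taken from the pmond log line above.
WORKER_CONFIG_FLAG = "/var/run/.worker_config_complete"


def monitoring_enabled(settings, flag_path=WORKER_CONFIG_FLAG):
    """Return True when monitoring may start for a process with these settings."""
    if settings.get("subfunction") != "last-config":
        return True  # assumption: no gating requested, monitor right away
    return os.path.exists(flag_path)  # gated until worker config completes


# Demonstrate with a temporary flag file instead of touching /var/run.
with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, ".worker_config_complete")
    gated = {"subfunction": "last-config"}
    print("before flag:", monitoring_enabled(gated, flag))
    open(flag, "w").close()
    print("after flag:", monitoring_enabled(gated, flag))
```

Under this model, the restarts described in this bug correspond to the ungated case, where monitoring starts before worker configuration has finished.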

Severity
--------
Major: System/Feature is usable but degraded.

Revision history for this message
Frank Miller (sensfan22) wrote :

PCI Interrupt affinity handling was added as an SB in stx.2.0. Marking this as high priority/stx.2.0 gating because the process is restarting. Requesting that Zhipeng address this issue.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → zhipeng liu (zhipengs)
tags: added: stx.2.0 stx.distro.openstack
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675503

Changed in starlingx:
status: Triaged → In Progress
zhipeng liu (zhipengs) wrote :

Hi Jim,

I have tested the EB with your solution.
I can see the following in pmond.log:
2019-08-09T03:05:23.054 [96082.00089] controller-0 pmond mon pmonFsm.cpp ( 890) pmon_passive_handler : Warn : pci-irq-affinity-agent monitoring is waiting on /var/run/.worker_config_complete
2019-08-09T03:05:23.054 [96082.00090] controller-0 pmond mon pmonFsm.cpp ( 890) pmon_passive_handler : Warn : kubelet monitoring is waiting on /var/run/.worker_config_complete

Zhipeng

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/675503
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=40f57f314176f24538902b1da5b10566f3e6235f
Submitter: Zuul
Branch: master

commit 40f57f314176f24538902b1da5b10566f3e6235f
Author: zhipengl <email address hidden>
Date: Fri Aug 9 18:38:09 2019 +0800

    Fix PCI-IRQ-Affinity-Agent repeated restarts due to pmon.

    It was determined that we were missing one line of pmon
    configuration that makes PMON not take actions until
    worker configuration is complete.

    Test pass in AIO-DX
    Below log can be seen in pmond.log
    pci-irq-affinity-agent monitoring is waiting on
    /var/run/.worker_config_complete
    Before openstack application finished, no degrade or pci-irq-affinity-
    agent restart can be seen.

    Closes-bug: 1839525

    Change-Id: I3d28ea4afa2f7e65dcc9e48a7a46b4f80c574e3e
    Signed-off-by: zhipengl <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Zhipeng, Please cherry-pick to the stx.2.0 release branch.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676037

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0)

Reviewed: https://review.opendev.org/676037
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=6c4223712b303d8d537fef210189cfc8e9bd12c2
Submitter: Zuul
Branch: r/stx.2.0

commit 6c4223712b303d8d537fef210189cfc8e9bd12c2
Author: zhipengl <email address hidden>
Date: Fri Aug 9 18:38:09 2019 +0800

    Fix PCI-IRQ-Affinity-Agent repeated restarts due to pmon.

    It was determined that we were missing one line of pmon
    configuration that makes PMON not take actions until
    worker configuration is complete.

    Test pass in AIO-DX
    Below log can be seen in pmond.log
    pci-irq-affinity-agent monitoring is waiting on
    /var/run/.worker_config_complete
    Before openstack application finished, no degrade or pci-irq-affinity-
    agent restart can be seen.

    Closes-bug: 1839525

    Change-Id: I3d28ea4afa2f7e65dcc9e48a7a46b4f80c574e3e
    Signed-off-by: zhipengl <email address hidden>
    (cherry picked from commit 40f57f314176f24538902b1da5b10566f3e6235f)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20