All-in-one: pci-irq-affinity-agent fails to start - controller-0 stuck in degraded state after initial unlock

Bug #1828877 reported by Chris Winnicki
This bug affects 3 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: zhipeng liu

Bug Description

Brief Description
-----------------
pci-irq-affinity-agent fails to start - controller-0 stuck in degraded state after initial unlock

pmond reports the following (continuously):
/var/log/pmond.log (snippet)

2019-05-13T18:34:52.837 [93604.05081] controller-0 pmond mon pmonFsm.cpp ( 565) pmon_passive_handler : Info : pci-irq-affinity-agent stability period (20 secs)
2019-05-13T18:34:52.837 [93604.05082] controller-0 pmond mon pmonHdlr.cpp (1003) process_running : Info : pci-irq-affinity-agent process not running
2019-05-13T18:34:52.837 [93604.05083] controller-0 pmond mon pmonHdlr.cpp (1305) respawn_process : Info : pci-irq-affinity-agent Spawn (1200886)
2019-05-13T18:34:53.837 [93604.05084] controller-0 pmond mon pmonHdlr.cpp ( 897) want_degrade_clear : Warn : pci-irq-affinity-agent is still failed 'major' ; degrade assert
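
To gauge how often pmond is respawning the agent, the log lines above can be counted with a small shell helper. This is only a sketch: the function name `count_respawns` is hypothetical, and it assumes the `respawn_process : Info : <process> Spawn` line shape seen in the snippet above.

```shell
#!/bin/sh
# count_respawns LOGFILE PROCESS
# Hypothetical helper: prints how many pmond "Spawn" lines exist for PROCESS.
# Assumes the line layout shown in the pmond.log snippet; whitespace around
# the ':' separators is allowed to vary. PROCESS is used as part of an
# extended regular expression, so plain names such as
# pci-irq-affinity-agent are expected.
count_respawns() {
    logfile=$1
    process=$2
    # grep -c prints the number of matching lines (0 when none match).
    grep -c -E "respawn_process[[:space:]]*:[[:space:]]*Info[[:space:]]*:[[:space:]]*${process}[[:space:]]+Spawn" "$logfile"
}
```

Example use: `count_respawns /var/log/pmond.log pci-irq-affinity-agent`. A count that keeps growing between runs suggests the respawn loop is still active.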

controller-0 stuck in degraded state:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
+----+--------------+-------------+----------------+-------------+--------------+
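
When re-checking after a fix, the same table can be filtered for degraded hosts. The following is a minimal sketch, assuming the column order shown above (availability is the sixth column); `degraded_hosts` is a hypothetical helper name, not part of the `system` CLI.

```shell
#!/bin/sh
# degraded_hosts
# Hypothetical helper: reads `system host-list` table output on stdin and
# prints the hostname of every row whose availability column is "degraded".
# Assumes the six-column layout shown above.
degraded_hosts() {
    awk -F'|' '
        NF >= 7 {
            host = $3
            avail = $7
            # Trim the padding spaces around each cell.
            gsub(/^[ \t]+|[ \t]+$/, "", host)
            gsub(/^[ \t]+|[ \t]+$/, "", avail)
            if (avail == "degraded") print host
        }'
}
```

Example use: `system host-list | degraded_hosts`. Border lines and the header row are skipped because they either contain no `|` separators or do not carry the value `degraded`.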

(Alarm snippet)
[wrsroot@controller-0 ~(keystone_admin)]$ fm alarm-list
+-------+------------------------------------------------------------------------------------+--------------------------------------+----------+----------------+
| Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| ID | | | | |
+-------+------------------------------------------------------------------------------------+--------------------------------------+----------+----------------+
| 200. | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' | host=controller-0.process=pci-irq- | major | 2019-05-13T16: |
| 006 | process. Auto recovery of this major process is in progress. | affinity-agent | | 40:46.408005 |
| | | | | |
+-------+------------------------------------------------------------------------------------+--------------------------------------+----------+----------------+
[wrsroot@controller-0 ~(keystone_admin)]$ date
Mon May 13 18:43:31 UTC 2019

The issue is possibly caused by:
https://review.opendev.org/#/c/640264/

Severity
--------
Major: System cannot be fully installed

Steps to Reproduce
------------------
Install controller-0 in All-in-one duplex mode

Expected Behavior
------------------
controller-0 should not be in degraded state after initial unlock

Actual Behavior
----------------
pci-irq-affinity-agent process keeps failing
controller-0 never gets out of degraded state

Reproducibility
---------------
100% reproducible on build: 20190512T233000Z

System Configuration
--------------------
1+1 system (AIO-DX)
Internal lab name: cgcs-wildcat-69-70

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190512T233000Z"
JOB="STX_build_master_master"
<email address hidden>"

Last Pass
---------
20190508T233000Z

Timestamp/Logs
--------------
Attached

Test Activity
-------------
Lab install

Chris Winnicki (chriswinnicki) wrote :
description: updated
Ghada Khalil (gkhalil) wrote :

As per above, this issue is suspected to have been introduced by: https://review.opendev.org/#/c/640264/
Assigning to Zhipeng Liu to investigate.

Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
importance: Undecided → High
tags: added: stx.2.0
tags: added: stx.integ
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue introduced by recent commit and is causing the controller to be in a degraded state.

Ghada Khalil (gkhalil) wrote :

This was also reported in sanity report by Maria Perez Ibarra:
[Starlingx-discuss] [Containers] Sanity Test - ISO 20190513

Changed in starlingx:
status: New → Triaged
zhipeng liu (zhipengs) wrote :

Root cause found.
This agent should not be installed in AIO. I will filter it.
Patch is on the way.

Zhipeng

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/658950

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.metal
removed: stx.integ
summary: - pci-irq-affinity-agent fails to start - controller-0 stuck in degraded
- state after initial unlock
+ All-in-one: pci-irq-affinity-agent fails to start - controller-0 stuck
+ in degraded state after initial unlock
Ghada Khalil (gkhalil) wrote :

Zhipeng, my understanding is the pci-irq-affinity-agent listens to nova notifications to handle pci devices. AIO systems will have nova running when the openstack application is applied. This agent will need to be running to handle the notifications. When will this process be started on AIO systems?

Bin Qian (bqian20) wrote :

When applying stx-openstack, the following error occurred:
2019-05-14T13:07:00.362 Debug: 2019-05-14 13:07:00 +0000 Exec[restart-pciirqaffinity-service](provider=posix): Executing 'systemctl restart pci-irq-affinity-agent.service'
2019-05-14T13:07:00.364 Debug: 2019-05-14 13:07:00 +0000 Executing: 'systemctl restart pci-irq-affinity-agent.service'
2019-05-14T13:07:00.367 Notice: 2019-05-14 13:07:00 +0000 /Stage[post]/Platform::Pciirqaffinity::Reload/Exec[restart-pciirqaffinity-service]/returns: Job for pci-irq-affinity-agent.service failed because a configured resource limit was exceeded. See "systemctl status pci-irq-affinity-agent.service" and "journalctl -xe" for details.
2019-05-14T13:07:00.369 Error: 2019-05-14 13:07:00 +0000 systemctl restart pci-irq-affinity-agent.service returned 1 instead of one of [0]
2019-05-14T13:07:00.371 /usr/share/ruby/vendor_ruby/puppet/util/errors.rb:106:in `fail'

Ghada Khalil (gkhalil)
tags: added: stx.integ
removed: stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/659081

zhipeng liu (zhipengs) wrote :

The root cause is in the start script: the agent is started
only if the node type is worker.
On AIO, however, the node type is controller, so the agent
never starts. pmon then restarts it again and again, which
eventually degrades the controller.
In a multi-node setup, this agent is installed only on
worker nodes, so the start script does not need this node
type check.

In the pci-irq-affinity-agent bash script:
      if [ ${NODETYPE} = "worker" ] ; then  # this check blocks the agent from starting on AIO
            .....
            /bin/sh -c "${AFFINITYAGENT}"' >> /dev/null 2>&1 & echo $!' > ${daemon_pidfile}
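
The merged diff is not shown in this thread, so the following is only a sketch of one way a start script could decide whether to launch the agent on both worker nodes and AIO controllers. It assumes that an AIO controller advertises its worker role through a comma-separated subfunction value such as `controller,worker`; that assumption, the function name `should_start_agent`, and both parameters are hypothetical and do not come from the report.

```shell
#!/bin/sh
# should_start_agent NODETYPE SUBFUNCTION
# Hypothetical helper: returns 0 (true) when the node has worker
# functionality, either as a dedicated worker or as an AIO controller.
# ASSUMPTION: SUBFUNCTION is a comma-separated list, for example
# "controller,worker" on AIO; this is not confirmed by the bug thread.
should_start_agent() {
    nodetype=$1
    subfunction=$2
    # Dedicated worker node: always start the agent.
    [ "$nodetype" = "worker" ] && return 0
    # Any node type whose subfunction list contains "worker" (for example AIO).
    case ",${subfunction}," in
        *,worker,*) return 0 ;;
    esac
    return 1
}
```

Example use: `if should_start_agent "$NODETYPE" "$SUBFUNCTION"; then ... fi`. Quoting the variables also avoids the test failing with a syntax error when a value is empty, which the unquoted `${NODETYPE}` in the excerpt above would not.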

OpenStack Infra (hudson-openstack) wrote : Change abandoned on metal (master)

Change abandoned by zhipeng liu (<email address hidden>) on branch: master
Review: https://review.opendev.org/658950

zhipeng liu (zhipengs) wrote :

Patch is ready
https://review.opendev.org/#/c/659081/

Fix for pci-irq-affinity-agent failing to start in AIO

Ensure that pci-irq-affinity-agent is launched on worker nodes.
This includes AIO and standard configs.

Root cause is in this agent start script, it can be started
only if node type is worker. But for AIO, the node type is controller.
Then pmon will restart it again and again and cause controller degrade
in the end.

Below test pass
1) Pci-irq-affinity-agent started normally before openstack apply.
After openstack apply, related openstack config applied to
agent config file as expected.
2) Verified agent started normally in non-openstack worker node for
both AIO and multi-node.
No degrade in controller node.

Ghada Khalil (gkhalil)
tags: added: stx.sanity
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/659081
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=ce0cc60346a73372ca967f6311f8a92a924641d7
Submitter: Zuul
Branch: master

commit ce0cc60346a73372ca967f6311f8a92a924641d7
Author: zhipengl <email address hidden>
Date: Wed May 15 07:22:42 2019 +0800

    Fix for pci-irq-affinity-agent failing to start in AIO

    Ensure that pci-irq-affinity-agent is launched on worker nodes.
    This includes AIO and standard configs.

    Root cause is in this agent start script, it can be started
    only if node type is worker. But for AIO, the node type is controller.
    Then pmon will restart it again and again and cause controller degrade
    in the end.

    Below test for AIO pass
    1) Pci-irq-affinity-agent started normally before openstack apply.
    After openstack apply, related openstack config applied to
    agent config file as expected.
    2) Verified agent started normally in non-openstack worker node for
    both AIO and multi-node.
    No degrade in controller node.

    Change-Id: I73e9dff0358b7ed86bfaaadac834e19fe227892f
    Closes-Bug: #1828877
    Signed-off-by: zhipengl <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661875

Chris Winnicki (chriswinnicki) wrote :

Retest verdict: Passed

Build:
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190617T233000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="150"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-06-17 23:30:00 +0000"

tags: removed: stx.retestneeded