platform: compute nodes in failed state after unlocking

Bug #1828903 reported by Peng Peng
This bug affects 1 person
Affects     Status    Importance   Assigned to    Milestone
StarlingX   Invalid   High         zhipeng liu

Bug Description

Brief Description
-----------------

During lab setup period, compute nodes were in "failed" state after unlocking

Severity
--------
Critical

Steps to Reproduce
------------------

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Lab-name: WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-10_15-59-33

Last Pass
---------

Timestamp/Logs
--------------

[2019-05-13 22:26:28,996] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'
[2019-05-13 22:26:36,549] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------------------+
| Property            | Value                                      |
+---------------------+--------------------------------------------+
| action              | none                                       |
| administrative      | locked                                     |
| availability        | online                                     |
| bm_ip               | 128.224.64.189                             |
| bm_type             | bmc                                        |
| bm_username         | root                                       |
| boot_device         | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| capabilities        | {}                                         |
| config_applied      | 48d2b479-dd33-416b-b98f-8cb5a8a6d07c       |
| config_status       | Config out-of-date                         |
| config_target       | 4b849030-c267-4f26-b860-b910656738c8       |
| console             | ttyS0,115200                               |
| created_at          | 2019-05-13T21:59:06.033987+00:00           |
| hostname            | compute-0                                  |
| id                  | 3                                          |
| install_output      | text                                       |
| install_state       | completed                                  |
| install_state_info  | None                                       |
| invprovision        | unprovisioned                              |
| location            | {}                                         |
| mgmt_ip             | 192.168.204.30                             |
| mgmt_mac            | 90:e2:ba:b0:dd:e8                          |
| operational         | disabled                                   |
| personality         | worker                                     |
| reserved            | False                                      |
| rootfs_device       | /dev/disk/by-path/pci-0000:00:1f.2-ata-1.0 |
| serialid            | None                                       |
| software_load       | 19.05                                      |
| subfunctions        | worker,lowlatency                          |
| task                | Unlocking                                  |
| tboot               | false                                      |
| ttys_dcd            | None                                       |
| updated_at          | 2019-05-13T22:26:07.553006+00:00           |
| uptime              | 994                                        |
| uuid                | 03fe9d4c-7b9a-4deb-a4a1-96a9c449e8ab       |
| vim_progress_status | None                                       |
+---------------------+--------------------------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-05-13 22:26:36,652] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-2'

[2019-05-13 22:26:44,827] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-1'

[2019-05-13 22:49:46,636] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap'
[2019-05-13 22:49:48,053] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | controller-1 | controller  | unlocked       | enabled     | available    |
| 3  | compute-0    | worker      | unlocked       | disabled    | failed       |
| 4  | compute-1    | worker      | unlocked       | disabled    | failed       |
| 5  | compute-2    | worker      | unlocked       | disabled    | failed       |
+----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
lab setup

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Peng,
Are you specifying a vswitch type of ovs-dpdk?
Essentially running into this bug: https://bugs.launchpad.net/starlingx/+bug/1828264
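For reference, a quick way to check the configured vswitch type on the active controller (a sketch only; this assumes vswitch_type is reported in the system show output, which is where it normally appears):

# in a keystone_admin session on the active controller
system show | grep -i vswitch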

Changed in starlingx:
status: New → Incomplete
Revision history for this message
David Sullivan (dsullivanwr) wrote :

I don't think this is related to 1828264. 1828264 is a semantic check that blocks unlocking compute nodes when no label is present. In this case the nodes were unlocked by sysinv but then failed.

Seems to have failed the worker puppet apply:
2019-05-13T22:29:42.569 Error: 2019-05-13 22:29:42 +0000 Could not set 'present' on ensure: No such file or directory - /etc/pci_irq_affinity/config.ini20190513-12436-1ifbxbo.lock at /usr/share/puppet/modules/platform/manifests/pciirqaffinity.pp:23
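For anyone hitting the same symptom, a quick way to confirm this failure mode on the affected worker is to check whether the pci-irq-affinity-agent files were actually delivered, then look for the failing resource in the puppet logs. A rough sketch only; the package name pattern and log location are assumptions based on a typical StarlingX worker node:

# was the agent packaging delivered with the load?
rpm -qa | grep -i pci.irq
ls -l /etc/pci_irq_affinity/

# find the failing file resource in the puppet apply logs
sudo grep -rn pci_irq_affinity /var/log/puppet/ | tail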

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; high priority given that host unlocks are failing. Seems to be related to recent commits for the pci-irq-affinity-agent.

Changed in starlingx:
status: Incomplete → Triaged
tags: added: stx.2.0 stx.integ
Changed in starlingx:
importance: Undecided → High
assignee: nobody → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

From the logs, the pci-irq-affinity-agent related files are not installed. It seems the patch below is not included:
https://review.opendev.org/#/c/640264/
but its dependent patch is included:
https://review.opendev.org/#/c/654415/
So during the puppet apply, the related files could not be found.

Can someone help double-check this?

I see the build time is around my patch merge time.
SW_VERSION="19.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-05-10_15-59-33"
SRC_BUILD_ID="13"

JOB="TC_19.05_Build"
BUILD_BY="jenkins"
BUILD_NUMBER="13"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2019-05-10 16:00:18 -0400"

Revision history for this message
zhipeng liu (zhipengs) wrote :

I have checked the merge times of these two patches:
the stx-config patch merged before the build time (May 10, ~13:00);
the stx-integ patch merged after the build time (May 10, ~17:00).

So please retest with the latest build, thanks!
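When retesting on a newer load, the expected outcome is roughly the following (a sketch of the verification, reusing the commands already quoted in this report):

# on the active controller: all workers should now report unlocked/enabled/available
system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap

# on each worker: the file that puppet failed on should now be present
ls -l /etc/pci_irq_affinity/config.ini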

Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Incomplete until Peng re-tests with a newer build

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

We did not see this issue on
Lab: WCP_99_103
Load: 2019-05-15_18-01-07

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing as the reporter confirmed the issue is not reproducible with a newer build

Changed in starlingx:
status: Incomplete → Invalid
tags: removed: stx.retestneeded