SX controller takes too long to recover after host-unlock

Bug #1890323 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
30 minutes after SX host-unlock, the controller node had still not recovered.

Severity
--------
Major

Steps to Reproduce
------------------
SX host-unlock

TC-name: /networking/test_sriovdp.py::TestSriovMixed::()::test_sriovdp_mixed_add_vf_interface[1]

Expected Behavior
------------------
The controller node recovers in less than 5 minutes.

Actual Behavior
----------------
The controller node had not recovered after 30 minutes.
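The gap between expected and actual behavior can be expressed as a polling check with a deadline. The following is only a minimal sketch: `wait_for` and the `predicate` callable are hypothetical names, not part of the test framework used in this report, and the real recovery check (for example, whether `system show` authenticates) is left to the caller.

```python
import time


def wait_for(predicate, timeout_s, interval_s=10.0, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate succeeded before the deadline,
    False otherwise. `predicate` is a caller-supplied check
    (hypothetical here), e.g. "is the controller available again?".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

With the expectation stated above, a caller would use a timeout of 300 seconds, for example `wait_for(controller_recovered, timeout_s=300)`, where `controller_recovered` is whatever check the test harness provides.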

Reproducibility
---------------
Unknown - first time this is seen in sanity

System Configuration
--------------------
One node system

Lab-name: SM-3

Branch/Pull Time/Commit
-----------------------
2020-08-04_00-00-00

Last Pass
---------
2020-08-03_00-00-00

Timestamp/Logs
--------------
[2020-08-04 09:30:02,171] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2020-08-04 09:58:21,003] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show'
[2020-08-04 09:58:21,914] 436 DEBUG MainThread ssh.expect :: Output:
Authorization failed: Unable to establish connection to http://[abcd:204::1]:5000/v3/auth/tokens
controller-0:~$
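To make the delay explicit, the two timestamps above can be subtracted. This is a small sketch that assumes the log prefix always uses the `YYYY-MM-DD HH:MM:SS,mmm` layout seen in these lines; the helper name `elapsed` is made up for illustration.

```python
from datetime import datetime

# Layout of the timestamp inside the square brackets of each log line.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(start, end):
    """Return the time between two log timestamps as a timedelta."""
    return (datetime.strptime(end, LOG_TIME_FORMAT)
            - datetime.strptime(start, LOG_TIME_FORMAT))


unlock_sent = "2020-08-04 09:30:02,171"  # host-unlock command sent
show_sent = "2020-08-04 09:58:21,003"    # 'system show' still fails to authenticate
gap = elapsed(unlock_sent, show_sent)
print(gap)  # a little over 28 minutes
```

The result is well beyond the 5 minute recovery time given under Expected Behavior.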

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :

Issue was reproduced on
WCP_112
2020-08-16_22-54-19

[2020-08-17 07:29:18,063] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2020-08-17 07:57:34,118] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne show'
[2020-08-17 07:57:35,394] 436 DEBUG MainThread ssh.expect :: Output:
Authorization failed: Unable to establish connection to http://[abcd:204::1]:5000/v3/auth/tokens
controller-0:~$

It seems the controller rebooted twice after host-unlock.

A collect log has also been attached.

Ghada Khalil (gkhalil) wrote :

@Peng, did the original occurrence also involve a double reboot?

Ghada Khalil (gkhalil) wrote :

Also, are you still seeing these double reboots on these two systems or on any others?

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
Ghada Khalil (gkhalil)
tags: added: stx.metal

Austin Sun (sunausti) wrote :

Has this issue been seen recently?

Al Bailey (albailey1974) wrote :

I see a slow recovery on AIO-SX in VirtualBox, so I will submit a fix against this bug.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/846058

Changed in starlingx:
status: New → In Progress

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ha (master)

Change abandoned by "Al Bailey <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/846058
Reason: Will make a vbox specific change

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/846240

Al Bailey (albailey1974) wrote :

The puppet change (niceness -20) does not really work.
I will abandon that change, and resume the original investigation into trying to get SM to not spin indefinitely

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by "Al Bailey <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/846240
Reason: This does not fix the issue

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ha (master)

Change abandoned by "Al Bailey <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/846058
Reason: Adding an additional platform core resolves the issue

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/853206
Committed: https://opendev.org/starlingx/config/commit/1a29a9ea728ee411df8c60ac2041bce6fcba25e7
Submitter: "Zuul (22348)"
Branch: master

commit 1a29a9ea728ee411df8c60ac2041bce6fcba25e7
Author: Al Bailey <email address hidden>
Date: Mon Aug 15 19:40:22 2022 +0000

    Disable nohz_full in a virtual env

    In VirtualBox, after unlock, SM has all of its services
    in 'initial' state.

    The reason for this is that SM will not proceed unless
    it detects there are no timer delays.

    This is particularly noticeable for AIO-SX.

    By disabling nohz_full in VirtualBox, the timers are
    not delayed and SM is able to start up its services
    more quickly (5 seconds). Otherwise SM initialization
    on a 4 core system can range from 10 minutes to 10 hours.

    Test Plan:
      Build/Bootstrap/Unlock Debian AIO-SX on virtualbox.

    Closes-Bug: 1890323
    Signed-off-by: Al Bailey <email address hidden>
    Change-Id: I94226721d2ccd83a8b0caac09d1c745d4c908ae4
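For anyone checking whether a system is affected, the commit's reasoning suggests two quick observations: whether `nohz_full` is present on the kernel command line, and whether short sleeps overshoot noticeably. The sketch below is illustrative only. It is not the code from the merged change and not SM's own timer check; the function names are made up here, and it assumes the boot parameters can be read from `/proc/cmdline` as on a typical Linux host.

```python
import time


def nohz_full_setting(cmdline):
    """Return the value of nohz_full= from a kernel command line, or None.

    `cmdline` is the text of /proc/cmdline (passed in as a string so the
    function can be tried without a Linux host).
    """
    for token in cmdline.split():
        if token.startswith("nohz_full="):
            return token.split("=", 1)[1]
    return None


def timer_overshoot(interval_s=0.05):
    """Sleep for interval_s and return how much longer it took, in seconds.

    A rough stand-in for "is this process seeing timer delays?";
    what counts as too much delay is left to the reader.
    """
    start = time.monotonic()
    time.sleep(interval_s)
    return (time.monotonic() - start) - interval_s
```

On a running host, `nohz_full_setting(open("/proc/cmdline").read())` reports the current setting, and repeated calls to `timer_overshoot()` give a rough sense of whether timers are being delayed.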

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.8.0
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)