Controller can't unlock after lock on AIO-SX

Bug #1833472 reported by Hector
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Yan Chen

Bug Description

Brief Description
-----------------
On AIO-SX, after a couple of lock/unlock cycles on the controller, the second attempt leaves the system locked, and
trying to unlock prompts the timeout error 'Manifest apply timeout ; Unlock to retry'.

After rebooting controller-0 and running 'system application-apply', the host can be unlocked.
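
A minimal shell sketch of that recovery sequence (hedged: the application name stx-openstack and the standard credentials file are assumptions based on this report, not on the attached logs):

  sudo reboot                                 # reboot controller-0
  # after the node comes back online:
  source /etc/platform/openrc                 # reacquire keystone admin credentials
  system application-apply stx-openstack      # re-apply the failed application
  system host-unlock controller-0             # retry the unlock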

Please confirm whether this is the expected behavior or not, and advise how to proceed.

Severity
--------
Major

Steps to Reproduce
------------------
1. On AIO-SX, run lock and unlock a couple of times (see the shell sketch below).
2. On a subsequent unlock via the CLI, the following error occurs:
>> system host-unlock
Error:
Manifest apply timeout ; Unlock to retry

3. Checking the application status then shows an error:
>> system application-list reports stx-openstack with
Error status: 'apply-failed'.
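
A minimal shell sketch of the reproduction loop referenced in step 1 (hedged: assumes controller-0, an already sourced admin environment, and that fixed sleeps are long enough for the host to settle between operations):

  source /etc/platform/openrc
  for i in 1 2 3; do
      system host-lock controller-0
      sleep 120                               # wait for the host to reach locked/online
      system host-unlock controller-0
      sleep 600                               # wait for the reboot and the stx-openstack re-apply
      system application-list                 # check whether stx-openstack is applied or apply-failed
  done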

Expected Behavior
------------------
Lock/unlock cycles should work consistently.

Actual Behavior
----------------
The system stays in locked status and does not change until the workaround is applied.

Reproducibility
---------------
90%

System Configuration
--------------------
One node system (AIO-SX)

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-06-13 01:30:00 +0000"

Last Pass
---------
Don't Know

Timestamp/Logs
--------------
Logs Attached

Test Activity
-------------
Regression Testing

Revision history for this message
Hector (ramoseh) wrote :
Yan Chen (ychen2u)
Changed in starlingx:
assignee: nobody → Yan Chen (ychen2u)
Revision history for this message
Yan Chen (ychen2u) wrote :

When locking and unlocking again on a Simplex, the stx-openstack application is re-applied before the unlock.
If the re-apply fails, the simplex cannot be unlocked.
From the log, the re-apply appears to have failed due to a network error:

"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"}

So the behavior appears to be expected.
Please check why the connection failed at that time.
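
A quick check of that connection from the controller could look like the sketch below (hedged: 10.102.50.42:44134 is copied from the error above and appears to be a Helm Tiller gRPC endpoint; nc availability on the host is an assumption):

  nc -zvw5 10.102.50.42 44134 && echo reachable || echo unreachable
  # pure-bash fallback if nc is not installed:
  timeout 5 bash -c 'cat < /dev/null > /dev/tcp/10.102.50.42/44134' && echo reachable || echo unreachable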

Changed in starlingx:
status: New → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Yan, As the reporter points out, locking/unlocking operations should pass consistently, including the stx-openstack apply. So this is not expected behavior. Can you elaborate on where the "network unreachable" error is coming from and what connection is failing?

Changed in starlingx:
status: Invalid → Incomplete
Ada Cabrales (acabrale)
tags: added: stx.regression
tags: added: stx.retestneeded
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_3
Load: 20190704T013000Z

After host-unlock, the ssh connection was not dropped as expected, and the host remained in locked/online status.

log:
[2019-07-04 15:23:14,668] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-04 15:23:16,858] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+

[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-04 15:23:17,386] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-04 15:23:17,386] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-07-04 15:31:25,104] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-04 15:31:25,104] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-07-04 15:31:27,654] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+----------------------------------------------------------------------+
| Property | Value |
+---------------------+----------------------------------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Appears that the issue was also reproduced by Peng. Marking as stx.2.0 release gating.

Changed in starlingx:
status: Incomplete → Confirmed
tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/669278

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Yan Chen (ychen2u) wrote :

Would you please send me the detailed list of operations needed to reproduce this issue?
I cannot reproduce it by just doing "lock-unlock" for several rounds.
But I can see from the log that there are clearly other operations beyond this simple process.

From sysinv.log, at the last stage of the test the unlock operation was triggered twice, and the judgement of "whether the unlock_ready flag should be removed" was not aligned with the judgement of "whether the reapply should take place". -- Patch #669278.

The operation list as seen in sysinv.log:
 unlock
 openstack apply
 lock
 unlock
 hello-kitty apply --> please provide the hello-kitty application
 lock
 unlock
 reapply success --> please provide the operation to trigger this reapply
 reapply success --> please provide the operation to trigger this reapply
 lock
 unlock
 unlock --> a new unlock came in before the unlock reapply had finished

Changed in starlingx:
status: In Progress → Incomplete
Revision history for this message
Oscar Medina Perez (omedinap) wrote :

I reproduced the issue using the following build:

BUILD_DATE="2019-07-05 01:30:00 +0000"

Please see attached log file.

Revision history for this message
Yan Chen (ychen2u) wrote :

I think this log is about a different apply issue.

So I think it's better to separate this LP into several issues:
1. Application reapply failures found in the logs; we already have several LPs about this kind of issue.
2. When a reapply is in progress and another "unlock" comes in, the unlock_ready flag should not be removed (this situation can be seen in the first log).
3. When a reapply fails, the unlock operation should not be blocked (is that correct?).

Revision history for this message
Yan Chen (ychen2u) wrote :

For issue 3, I'm OK if we decide to block the unlock when the reapply fails. I can modify my patch to fix only issue 2, which is definitely an issue.

In the first log package, I found the following log entries; they occurred when a reapply was still in progress and the user, without waiting, triggered another unlock:

2019-07-04 15:46:54.708 112183 INFO sysinv.api.controllers.v1.host [-] Remove unlock ready flag ## here unlock ready flag is deleted.
….
2019-07-04 15:46:54.784 112183 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply ## here reapply is actually skipped.

Revision history for this message
Yan Chen (ychen2u) wrote :

For issue 1 "application reapply failures":
the first log stuck at osh-openstack-placement.
and the second log stuck at osh-openstack-nginx-ports-control.
I think they are 2 different issues, and there are already several reapply failure LPs submitted.
I will check if there's any duplication.

Changed in starlingx:
status: Incomplete → In Progress
Revision history for this message
Yan Chen (ychen2u) wrote :

Too many different reapply issues are mixed into this one LP.

1.
From the original log, we can see a network error:
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"
This caused the first reapply failure.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5271623/+files/sysinv.log

I'm not sure why the network failed there. But from the log, I can see that the unlock issued during the reapply process removed the unlock_ready flag again. This is a real issue, and I submitted https://review.opendev.org/669278 to fix it.

Even if this patch cannot fix the connection failure, I do suggest we retest after the patch is merged to see whether the network connection issue can be reproduced (but NOT the other reapply issues below).

2.
Reapply stuck at osh-openstack-placement.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275096/+files/ALL_NODES_20190704.180458.tar

3.
Reapply stuck at osh-openstack-nginx-ports-control.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275887/+files/sysinv.log

Revision history for this message
Yan Chen (ychen2u) wrote :

Update for the duplicated failures:
3.
Reapply stuck at osh-openstack-nginx-ports-control.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275887/+files/sysinv.log
---> Duplicate of https://bugs.launchpad.net/starlingx/+bug/1834070, which was fixed on July 8.

Revision history for this message
Oscar Medina Perez (omedinap) wrote :

With the ISO built on "2019-07-12 01:30:00 +0000", the issue is not seen. I have tried locking/unlocking a couple of times without any error.

Revision history for this message
Yan Chen (ychen2u) wrote :

Can I set this to Invalid, since no such network failures were found?
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"

Revision history for this message
Yan Chen (ychen2u) wrote :

As the network issue is not reproducible, I will set this issue to Invalid.

Changed in starlingx:
status: In Progress → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_3
Load: 20190721T233000Z

LP-1833472 happened before this failure

17:42 host-unlock failed
17:58 boot up VM failed

Changed in starlingx:
status: Invalid → Confirmed
Revision history for this message
Yang Liu (yliu12) wrote :

Note that an important step for this issue is to first launch multiple VMs on the simplex system.
For the steps that matched ALL_NODES_20190723.193210.tar:

Test Step 1: Boot 4 VMs with various storage settings (see the sketch after these steps)
- vm1 boot from cinder volume
- vm2 boot from cinder volume with ephemeral disk
- vm3 boot from glance image
- vm4 boot from glance image, then attach a cinder volume to vm after it's launched
Test Step 2: Lock vm host on simplex system
Test Step 3: Ensure VMs are in SHUTOFF state after the locked host comes online
Test Step 4: Unlock host on simplex system
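
A hedged OpenStack CLI sketch of the four storage variants in Test Step 1 (image, flavor, network, and volume names are placeholders, not taken from the logs):

  # vm1: boot from a cinder volume
  openstack volume create --image cirros --size 2 vol1
  openstack server create --flavor m1.small --network tenant-net --volume vol1 vm1
  # vm2: boot from a cinder volume, using a flavor that defines an ephemeral disk
  openstack volume create --image cirros --size 2 vol2
  openstack server create --flavor m1.small.ephemeral --network tenant-net --volume vol2 vm2
  # vm3: boot from a glance image
  openstack server create --flavor m1.small --network tenant-net --image cirros vm3
  # vm4: boot from a glance image, then attach a cinder volume after launch
  openstack server create --flavor m1.small --network tenant-net --image cirros vm4
  openstack volume create --size 2 vol4
  openstack server add volume vm4 vol4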

Observation in Test Step4:
- host-unlock cmd accepted but host never really unlocked/rebooted
- system host-show:
| task | Manifest apply timeout ; Unlock to retry |
- system application-list:
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applying | processing chart: osh-openstack-placement, overall completion: 58.0% |
+---------------------+--------------------------------+-------------------------------+--------------------+----------+----------------------------------------------------------------------+
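
One way to watch the re-apply while the unlock is pending is sketched below (hedged: the 30-second interval is arbitrary; the commands are the same ones already quoted in this thread):

  while true; do
      system host-show controller-0 | grep -E 'administrative|task'
      system application-list | grep stx-openstack
      sleep 30
  done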

Revision history for this message
Yan Chen (ychen2u) wrote :

I don't see the following network issue in the log, so this is a new reapply issue, again.
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"}

I suggest you submit a new bug. Don't mix everything up in one LP.

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Yan Chen (ychen2u) wrote :

This is not the original issue.
Please submit a new bug instead of using this LP.

Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was closed as invalid a long time ago; removing the stx.retestneeded tag

tags: removed: stx.retestneeded