Controller can't unlock after lock on AIO-SX

Bug #1833472 reported by Hector
This bug affects 2 people
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Yan Chen

Bug Description

Brief Description
-----------------
On AIO-SX, after a couple of lock/unlock cycles on the controller, the second attempt leaves the system locked, and
trying to unlock prompts the timeout error 'Manifest apply timeout ; Unlock to retry'.

After rebooting controller-0 and running 'system application-apply', the host can be unlocked.
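
A minimal shell sketch of that recovery sequence (hedged: the application name stx-openstack and the standard credentials file are assumptions based on this report, not on the attached logs):

  sudo reboot                                 # reboot controller-0
  # after the node comes back online:
  source /etc/platform/openrc                 # reacquire keystone admin credentials
  system application-apply stx-openstack      # re-apply the failed application
  system host-unlock controller-0             # retry the unlock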

Please confirm whether this is the expected behavior or not, and advise how to proceed.

Severity
--------
Major

Steps to Reproduce
------------------
1. On AIO-SX, run lock and unlock a couple of times (see the shell sketch below).
2. On a subsequent unlock via the CLI, the following error occurs:
>> system host-unlock
Error:
Manifest apply timeout ; Unlock to retry

3. Checking the application status then shows an error:
>> system application-list reports stx-openstack with
Error status: 'apply-failed'.
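
A minimal shell sketch of the reproduction loop referenced in step 1 (hedged: assumes controller-0, an already sourced admin environment, and that fixed sleeps are long enough for the host to settle between operations):

  source /etc/platform/openrc
  for i in 1 2 3; do
      system host-lock controller-0
      sleep 120                               # wait for the host to reach locked/online
      system host-unlock controller-0
      sleep 600                               # wait for the reboot and the stx-openstack re-apply
      system application-list                 # check whether stx-openstack is applied or apply-failed
  done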

Expected Behavior
------------------
Lock/unlock cycles should work consistently.

Actual Behavior
----------------
The system stays in locked status and does not change until the workaround is applied.

Reproducibility
---------------
90%

System Configuration
--------------------
One node system (AIO-SX)

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-06-13 01:30:00 +0000"

Last Pass
---------
Don't Know

Timestamp/Logs
--------------
Logs Attached

Test Activity
-------------
Regression Testing

Revision history for this message
Hector (ramoseh) wrote :
Yan Chen (ychen2u)
Changed in starlingx:
assignee: nobody → Yan Chen (ychen2u)
Revision history for this message
Yan Chen (ychen2u) wrote :

When locking and unlocking again on a Simplex, the stx-openstack application is re-applied before the unlock.
If the re-apply fails, the simplex cannot be unlocked.
From the log, the re-apply appears to have failed due to a network error:

"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"}

So the behavior appears to be expected.
Please check why the connection failed at that time.
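
A quick check of that connection from the controller could look like the sketch below (hedged: 10.102.50.42:44134 is copied from the error above and appears to be a Helm Tiller gRPC endpoint; nc availability on the host is an assumption):

  nc -zvw5 10.102.50.42 44134 && echo reachable || echo unreachable
  # pure-bash fallback if nc is not installed:
  timeout 5 bash -c 'cat < /dev/null > /dev/tcp/10.102.50.42/44134' && echo reachable || echo unreachable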

Changed in starlingx:
status: New → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Yan, As the reporter points out, locking/unlocking operations should pass consistently, including the stx-openstack apply. So this is not expected behavior. Can you elaborate on where the "network unreachable" error is coming from and what connection is failing?

Changed in starlingx:
status: Invalid → Incomplete
Ada Cabrales (acabrale)
tags: added: stx.regression
tags: added: stx.retestneeded
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_3
Load: 20190704T013000Z

After host-unlock, the ssh connection was not dropped as expected, and the host remained in locked/online status.

log:
[2019-07-04 15:23:14,668] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-04 15:23:16,858] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+--------------------+---------+-----------+

[sysadmin@controller-0 ~(keystone_admin)]$
[2019-07-04 15:23:17,386] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-04 15:23:17,386] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-07-04 15:31:25,104] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-04 15:31:25,104] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-07-04 15:31:27,654] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+----------------------------------------------------------------------+
| Property | Value |
+---------------------+----------------------------------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Appears that the issue was also reproduced by Peng. Marking as stx.2.0 release gating.

Changed in starlingx:
status: Incomplete → Confirmed
tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/669278

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Yan Chen (ychen2u) wrote :

Would you please send me the detailed list of operations needed to reproduce this issue?
I cannot reproduce it by just doing "lock-unlock" for several rounds.
But I can see from the log that there are clearly other operations beyond this simple process.

From sysinv.log, at the last stage of the test the unlock operation was triggered twice, and the judgement of "whether the unlock_ready flag should be removed" was not aligned with the judgement of "whether the reapply should take place". -- Patch #669278.

The operation list as seen in sysinv.log:
 unlock
 openstack apply
 lock
 unlock
 hello-kitty apply --> please provide the hello-kitty application
 lock
 unlock
 reapply success --> please provide the operation to trigger this reapply
 reapply success --> please provide the operation to trigger this reapply
 lock
 unlock
 unlock --> a new unlock came in before the unlock reapply had finished

Changed in starlingx:
status: In Progress → Incomplete
Revision history for this message
Oscar Medina Perez (omedinap) wrote :

I reproduced the issue using the following build:

BUILD_DATE="2019-07-05 01:30:00 +0000"

Please see attached log file.

Revision history for this message
Yan Chen (ychen2u) wrote :

I think this log is about a different apply issue.

So I think it's better to separate this LP into several issues:
1. Application reapply failures found in the logs; we already have several LPs about this kind of issue.
2. When a reapply is in progress and another "unlock" comes in, the unlock_ready flag should not be removed (this situation can be seen in the first log).
3. When a reapply fails, the unlock operation should not be blocked (is that correct?).

Revision history for this message
Yan Chen (ychen2u) wrote :

For issue 3, I'm OK if we decide to block the unlock when the reapply fails. I can modify my patch to fix only issue 2, which is definitely an issue.

In the first log package, I found the following log entries; they occurred when a reapply was still in progress and the user, without waiting, triggered another unlock:

2019-07-04 15:46:54.708 112183 INFO sysinv.api.controllers.v1.host [-] Remove unlock ready flag ## here unlock ready flag is deleted.
….
2019-07-04 15:46:54.784 112183 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply ## here reapply is actually skipped.

Revision history for this message
Yan Chen (ychen2u) wrote :

For issue 1 "application reapply failures":
the first log stuck at osh-openstack-placement.
and the second log stuck at osh-openstack-nginx-ports-control.
I think they are 2 different issues, and there are already several reapply failure LPs submitted.
I will check if there's any duplication.

Changed in starlingx:
status: Incomplete → In Progress
Revision history for this message
Yan Chen (ychen2u) wrote :

Too many different reapply issues are mixed into this one LP.

1.
From the original log, we can see a network error:
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"
This caused the first reapply failure.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5271623/+files/sysinv.log

I'm not sure why the network failed there. But from the log, I can see that the unlock issued during the reapply process removed the unlock_ready flag again. This is a real issue, and I submitted https://review.opendev.org/669278 to fix it.

Even if this patch cannot fix the connection failure, I do suggest we retest after the patch is merged to see whether the network connection issue can be reproduced (but NOT the other reapply issues below).

2.
Reapply stuck at osh-openstack-placement.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275096/+files/ALL_NODES_20190704.180458.tar

3.
Reapply stuck at osh-openstack-nginx-ports-control.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275887/+files/sysinv.log

Revision history for this message
Yan Chen (ychen2u) wrote :

Update for the duplicated failures:
3.
Reapply stuck at osh-openstack-nginx-ports-control.
Log: https://bugs.launchpad.net/starlingx/+bug/1833472/+attachment/5275887/+files/sysinv.log
---> Duplicate of https://bugs.launchpad.net/starlingx/+bug/1834070, which was fixed on July 8.

Revision history for this message
Oscar Medina Perez (omedinap) wrote :

With the ISO built on "2019-07-12 01:30:00 +0000", the issue is not seen. I have tried locking/unlocking a couple of times without any error.

Revision history for this message
Yan Chen (ychen2u) wrote :

Can I set this to Invalid, since no such network failures were found?
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"

Revision history for this message
Yan Chen (ychen2u) wrote :

As the network issue is not reproducible, I will set this issue to Invalid.

Changed in starlingx:
status: In Progress → Invalid
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_3
Load: 20190721T233000Z

LP-1833472 happened before this failure

17:42 host-unlock failed
17:58 boot up VM failed

Changed in starlingx:
status: Invalid → Confirmed
Revision history for this message
Yang Liu (yliu12) wrote :

Note that an important step for this issue is to first launch multiple VMs on the simplex system.
For the steps that matched ALL_NODES_20190723.193210.tar:

Test Step 1: Boot 4 VMs with various storage settings (see the sketch after these steps)
- vm1 boot from cinder volume
- vm2 boot from cinder volume with ephemeral disk
- vm3 boot from glance image
- vm4 boot from glance image, then attach a cinder volume to vm after it's launched
Test Step 2: Lock vm host on simplex system
Test Step 3: Ensure VMs are in SHUTOFF state after the locked host comes online
Test Step 4: Unlock host on simplex system
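
A hedged OpenStack CLI sketch of the four storage variants in Test Step 1 (image, flavor, network, and volume names are placeholders, not taken from the logs):

  # vm1: boot from a cinder volume
  openstack volume create --image cirros --size 2 vol1
  openstack server create --flavor m1.small --network tenant-net --volume vol1 vm1
  # vm2: boot from a cinder volume, using a flavor that defines an ephemeral disk
  openstack volume create --image cirros --size 2 vol2
  openstack server create --flavor m1.small.ephemeral --network tenant-net --volume vol2 vm2
  # vm3: boot from a glance image
  openstack server create --flavor m1.small --network tenant-net --image cirros vm3
  # vm4: boot from a glance image, then attach a cinder volume after launch
  openstack server create --flavor m1.small --network tenant-net --image cirros vm4
  openstack volume create --size 2 vol4
  openstack server add volume vm4 vol4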

Observation in Test Step4:
- host-unlock cmd accepted but host never really unlocked/rebooted
- system host-show:
| task | Manifest apply timeout ; Unlock to retry |
- system application-list:
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applying | processing chart: osh-openstack-placement, overall completion: 58.0% |
+---------------------+--------------------------------+-------------------------------+--------------------+----------+----------------------------------------------------------------------+
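
One way to watch the re-apply while the unlock is pending is sketched below (hedged: the 30-second interval is arbitrary; the commands are the same ones already quoted in this thread):

  while true; do
      system host-show controller-0 | grep -E 'administrative|task'
      system application-list | grep stx-openstack
      sleep 30
  done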

Revision history for this message
Yan Chen (ychen2u) wrote :

I don't see the following network issue in the log, so this is a new reapply issue, again.
"os_error":"Network is unreachable","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:10.102.50.42:44134"}

I suggest you submit a new bug. Don't mix everything up in one LP.

Changed in starlingx:
status: Confirmed → Incomplete
Revision history for this message
Yan Chen (ychen2u) wrote :

This is not the original issue.
Please submit a new bug instead of using this LP.

Changed in starlingx:
status: Incomplete → Invalid
Revision history for this message
Ghada Khalil (gkhalil) wrote :

LP was closed as invalid a long time ago; removing the stx.retestneeded tag

tags: removed: stx.retestneeded