stx-openstack unexpectedly becomes "not applied" after reapply failed

Bug #1836634 reported by Bin Qian
This bug affects 1 person
Affects: StarlingX | Status: Fix Released | Importance: High | Assigned to: Tee Ngo

Bug Description

Brief Description
-----------------
After stx-openstack is successfully applied, the application unexpectedly becomes "not applied" when a subsequent reapply of the application fails.
The dbmon service continues to run after the stx-openstack reapply fails, but it cannot reach the mariadb cluster because stx-openstack is no longer applied/running.

This issue is split from https://bugs.launchpad.net/starlingx/+bug/1836075

Severity
--------
Major

Steps to Reproduce
------------------
Refer to https://bugs.launchpad.net/starlingx/+bug/1836075
for the steps to reproduce.

Expected Behavior
-----------------
When stx-openstack is successfully applied and a subsequent reapply fails, the system is expected to:
1. keep stx-openstack "active" and running with the latest successful configuration
   1.1 the mariadb cluster will be running, and the dbmon service will be able to access and monitor its state
2. not reject subsequent reapply attempts

Actual Behavior
---------------
1. The stx-openstack application is still active but not running.
   1.1 dbmon repeatedly reports that it cannot access mariadb.
2. A subsequent reapply was rejected by sysinv: "stx-openstack system app is present but not applied, skipping re-apply"

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 20190708T233000Z

Last Pass
---------
Unknown

Timestamp/Logs
--------------
2019-07-10 08:10:05.824 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply completed.
...
2019-07-10 11:01:21.802 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply started.
2019-07-10 11:08:37.690 110608 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada stx-openstack-apply.log for details.
2019-07-10 11:08:37.696 110608 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-10 12:17:52.178 228896 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply

| 2019-07-10T08:11:06.772 | 290 | service-scn | dbmon | unknown | enabled-active | audit success
| 2019-07-10T11:05:46.432 | 295 | service-scn | dbmon | enabled-active | disabling | audit failed

Ghada Khalil (gkhalil)
tags: added: stx.containers
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Revision history for this message
Tee Ngo (teewrs) wrote :

The description of this Launchpad is confusing; the application did not unexpectedly become "not applied" after the reapply failed. There is no such status as "not applied".

Currently, the stx-openstack application automatic reapply is triggered when:
a) any host in the cluster is unlocked
b) any host in the cluster is deleted
in order to pick up chart override updates and/or redistribute pods.

If stx-openstack app does not exist or the app does not have an "applied" status, the reapply is skipped.
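The skip condition described above can be sketched as follows (a minimal illustration only; the function and field names are assumptions, not the actual sysinv code):

```python
# Hypothetical sketch of sysinv's auto-reapply decision on host
# unlock/delete. should_reapply() and the app dict shape are
# illustrative; the real sysinv implementation differs.

APP_APPLIED = "applied"

def should_reapply(app):
    """Auto-reapply proceeds only for an app in 'applied' status."""
    if app is None:
        return False  # app does not exist (never uploaded)
    if app["status"] != APP_APPLIED:
        # e.g. 'apply-failed' after a failed reapply: the trigger is
        # skipped, producing the "present but not applied" message
        return False
    return True

# After a failed reapply, status is 'apply-failed', so the next
# unlock/delete trigger is skipped:
assert should_reapply({"status": "applied"}) is True
assert should_reapply({"status": "apply-failed"}) is False
assert should_reapply(None) is False
```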

The expected behaviors are incorrect:

1. stx-openstack cannot be reapplied on both controllers. It can only be reapplied on the active controller.
2. After a reapply fails, the stx-openstack status is set to "apply-failed"; it does not remain "applied", and the service(s) belonging to the failed chart(s) would not be functional, nor would the services that depend on them.
3. The mariadb cluster would not be running if the osh-openstack-mariadb release deployment failed during the reapply for whatever reason. I'd like to know how to reproduce this error case.

The log "snippet" in this LP has a July 7th, 2019 timestamp whereas the logs of LP1829289 have the timestamp of May 15th, 2019. Furthermore, the 2 LPs show 2 different system configurations. I can't make a connection between these 2 LPs based on given info. It appeared to me that the log snippet in this LP was grabbed from a duplex lab used for sanity with a known failed test case.

Please provide meaningful info that aids the investigation.

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil)
description: updated
Revision history for this message
Dariush Eslimi (deslimi) wrote :

Assigning to Bin to clarify the bug description.

Changed in starlingx:
assignee: Tee Ngo (teewrs) → Bin Qian (bqian20)
Revision history for this message
Bart Wensley (bartwensley) wrote :

The application has the following state info:

[root@controller-0 ~(keystone_admin)]# system application-show stx-openstack
+---------------+----------------------------------+
| Property | Value |
+---------------+----------------------------------+
| active | True |
| app_version | 1.0-17 |
| created_at | 2019-07-16T13:55:07.849046+00:00 |
| manifest_file | stx-openstack.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | completed |
| status | applied |
| updated_at | 2019-07-16T15:50:09.556192+00:00 |
+---------------+----------------------------------+

As Tee says, if the application is re-applied, it is expected that the "status" becomes "apply-failed". However, the "active" property should always be "True" unless the application is removed.

So... is the originator complaining that the "active" property is becoming "False" after a re-apply fails? If so, that is a bug.

Bin Qian (bqian20)
description: updated
Revision history for this message
Bin Qian (bqian20) wrote :

Tee and Bart,
The "not applied" in the original description referred to sysinv rejecting the reapply with "stx-openstack system app is present but not applied, skipping re-apply".
After the reapply failed, "active" remained "True"; this is fine. But the stx-openstack application was not running (dbmon could not access mariadb); it should be running to match the application's active state.

Revision history for this message
Tee Ngo (teewrs) wrote :

The "active" field, and the conditions under which it is updated, do not match your expectation. "Active" means the application is applied, but it does not guarantee that it is fully operational. The status field ("applied" vs "apply-failed") indicates whether it is fully operational or not. "Inactive" means the application is registered with sysinv but is not applied.
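The distinction between the two fields can be summarized in a small sketch (illustrative only; the field names match the `system application-show` output above, but the logic is an assumption, not sysinv code):

```python
# Illustrative model of the two independent application fields:
# 'active' tracks whether the app is applied/installed at all;
# 'status' tracks the outcome of the most recent apply attempt.
# Not actual sysinv code.

def describe(active, status):
    if not active:
        return "registered with sysinv but not applied"
    if status == "applied":
        return "applied and expected to be fully operational"
    if status == "apply-failed":
        return "applied earlier, but the last (re)apply failed"
    return "apply in progress"

# A failed reapply leaves active=True but status='apply-failed':
assert describe(True, "apply-failed") == \
    "applied earlier, but the last (re)apply failed"
assert describe(False, "applied") == \
    "registered with sysinv but not applied"
```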

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Why was the active field added? We do not track the operational state of the application, only whether the armada manifest applied successfully or not.

Revision history for this message
Bart Wensley (bartwensley) wrote :

The active field was added so the VIM (and other components) would know if the stx-openstack application was installed. The VIM uses this to determine whether or not it should be managing the openstack services (e.g. nova/neutron). The VIM needs to continue managing these services, even if a re-apply is in progress (or has failed).

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 based on input from Dariush

tags: added: stx.2.0
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → High
assignee: Bin Qian (bqian20) → Tee Ngo (teewrs)
Revision history for this message
Tee Ngo (teewrs) wrote :

It appears, from correlating various logs in a lab that exhibited this issue, that the sequence of events leading to this failure is as follows:
1. stx-openstack was successfully applied on controller-0
2. controller-1 was locked/unlocked, which triggered a stx-openstack reapply on controller-0; the reapply was successful
3. a swact from controller-0 to controller-1 was successful
4. uploading and applying a test app were successful
5. removing the test app failed - known issue. From here things started to go haywire
6. a swact to controller-0 failed, followed by a lock/unlock of controller-0, which triggered a stx-openstack reapply that also failed. Reason: stuck processing the 'openstack-ceph-rgw' chart - known issue

Update:
This issue is no longer observed in the same lab with the latest load, which includes the following commits resolving the ceph-rgw reapply and non-system app (e.g. test app) removal issues.

commit 156a254ef851229825dce0b2e6d020e0d67d8347
Author: Robert Church <email address hidden>
Date: Wed Jul 17 14:14:45 2019 -0400

    Fix generic application deletions

commit 4a66dd3723554d7056bf7bae10f66bdbb099b4e3
Author: Shuicheng Lin <email address hidden>
Date: Fri Jul 12 13:03:49 2019 +0800

    fix armada stuck at processing osh-openstack-ceph-rgw chart

Revision history for this message
Tee Ngo (teewrs) wrote :

I was unable to confirm that this issue is no longer reproducible in the latest load due to a new issue: the sysinv conductor was stopped in the middle of an application reapply due to a failed audit of the Ceph mgr, which is being tracked via LP https://bugs.launchpad.net/starlingx/+bug/1836075

For now, I'm marking this LP as a side effect of LP1836075. Once the ceph mgr timeout issue has been addressed, if this issue still persists, please reopen the LP.

Revision history for this message
Bill Zvonar (billzvonar) wrote :

Hi Bin - the LP https://bugs.launchpad.net/starlingx/+bug/1836075 is Fix Released now.

Can you re-test this per Tee's comment above?

Thanks.

Revision history for this message
Bin Qian (bqian20) wrote :

Reproduced the issue.
Environment:
AIO-DX/direct connect lab, with the Aug 21 build.
2 CPUs, each with 10 physical cores and HT on, for a total of 40 vcores.

Provisioned 10 instances, with 1 vcpu and 512 MB of memory each.

Reproduction scenario:
Issue the commands:
system host-lock controller-0
system host-cpu-modify controller-0 -f platform -p0 4 -p1 4
system host-unlock controller-0

In sysinv.log, a log entry indicates the stx-openstack reapply failed:
2019-08-22 19:11:23.035 266150 INFO sysinv.conductor.kube_app [-] Resetting status of app stx-openstack from 'applying' to 'apply-failed'

After controller-0 is back up, swact to controller-0 and issue the commands:
system host-lock controller-1
system host-cpu-modify controller-1 -f platform -p0 4 -p1 4
system host-unlock controller-1
In sysinv.log on controller-0, a log entry indicates the stx-openstack reapply was skipped:
2019-08-22 19:51:10.448 267347 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply

After controller-1 is back up, swact to controller-1 and issue the commands:
system host-lock controller-0
system host-cpu-modify controller-0 -f platform -p0 2 -p1 2
system host-unlock controller-0
Again, in sysinv.log on controller-1, a log entry indicates the stx-openstack reapply was skipped:
2019-08-23 14:43:42.199 30712 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply

The All Hypervisors page on Horizon reports:
controller-0: 20 vcpus
controller-1: 32 vcpus
Neither number matches the final cpu allocation for its controller; both are the values from before the last round of system host-cpu-modify commands. However, sysinv shows the correct allocation resulting from the commands:

[sysadmin@controller-1 ~(keystone_admin)]$ system host-cpu-list controller-0
+--------------------------------------+-------+-----------+-------+--------+-------------------------------------------+-------------------+
| uuid | log_c | processor | phy_c | thread | processor_model | assigned_function |
| | ore | | ore | | | |
+--------------------------------------+-------+-----------+-------+--------+-------------------------------------------+-------------------+
| a5c2b8cf-56d4-40b8-8e77-19a40bd94586 | 0 | 0 | 0 | 0 | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | Platform |
| 3013883a-7b0e-4f55-89d0-a991aa64d44c | 1 | 0 | 1 | 0 | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | Platform |
| 50d4a4fa-f71e-4870-ae80-81e923094caf | 2 | 0 | 2 | 0 | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | vSwitch |
| dc9ef989-cab2-48ee-8d18-bc457bab19a6 | 3 | 0 | 3 | 0 | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | vSwitch |
| 4ea82525-010f-4541-a2e1-691cc0336ef1 | 4 | 0 | 4 | 0 | Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz | Applications |
| 0ad12460-ceea-4d9b-bfa5-191bcb7df6a2 | 5 | 0 ...

Revision history for this message
Bin Qian (bqian20) wrote :

controller-0 log

Revision history for this message
Bin Qian (bqian20) wrote :

log on controller-1

Revision history for this message
Bin Qian (bqian20) wrote :

All Hypervisors screenshot

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Bin, does the comment above belong to this bug, or to the one this is a dup of?
The reproduction steps and the reported issue are very different from what you are stating in your comment above.

Revision history for this message
Dariush Eslimi (deslimi) wrote :

Bin, I will leave this bug as a dup.
If there is a concern, please test your scenario with a build that includes the fix for https://bugs.launchpad.net/starlingx/+bug/1837750

Revision history for this message
Bin Qian (bqian20) wrote :

This LP was created because, while analyzing bug #1836075, 2 unexpected behaviors were discovered as a result of the stx-openstack reapply failure:

1. the stx-openstack application is not running properly
2. a subsequent reapply is rejected by sysinv

This LP is not a dup of bug #1836075. Bug #1836075 is about a ceph-mon failure causing the application reapply to fail.

Revision history for this message
Dariush Eslimi (deslimi) wrote :

As discussed if there is any concern after the fix for https://bugs.launchpad.net/starlingx/+bug/1837750 is merged, please open a new LP.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug was fixed by:
https://review.opendev.org/672708
https://review.opendev.org/672709
Merged on 2019-07-31

Marking as Fix Released

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → Fix Released