stx-openstack application stuck at processing chart: osh-openstack-ceph-rgw after unlock standby controller

Bug #1833609 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Lin Shuicheng

Bug Description

Brief Description
-----------------
After a lock/unlock of the standby controller, stx-openstack is reapplied, but the reapply is stuck at processing chart: osh-openstack-ceph-rgw.

Severity
--------
Major

Steps to Reproduce
------------------
lock/unlock the standby controller
check the stx-openstack application reapply status

TC-name: test_lock_unlock_host.py::test_lock_unlock_host[controller]

Expected Behavior
------------------
The stx-openstack application reapply completes and the application returns to the applied status.

Actual Behavior
----------------
The reapply remains in the applying state, stuck at processing chart: osh-openstack-ceph-rgw.
Reproducibility
---------------
Intermittent

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190620T013000Z

Last Pass
---------
Lab: WCP_113_121
Load: 20190612T013000Z

Timestamp/Logs
--------------
[2019-06-20 17:32:08,722] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-06-20 17:32:10,214] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------+-------------------------------+--------------------+---------------+----------------------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+--------------------+---------------+----------------------------------------------------------------------------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | upload-failed | Upload of application hello-kitty (1.0) failed: Command '['helm-upload', 'starlingx', u'/scratch/apps/hello- |
| | | | | | kitty/1.0/charts/hello-kitty.tgz']' returned non-zero exit status 1 |
| | | | | | |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack.yaml | applied | completed

[2019-06-20 17:32:10,672] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-06-20 17:38:11,919] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-06-20 17:38:13,488] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+-----------------------------------------------------------------------+
| Property | Value |
+---------------------+-----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |

[2019-06-20 17:38:19,192] 268 DEBUG MainThread ssh.send :: Send 'kubectl get nodes'
[2019-06-20 17:38:19,417] 387 DEBUG MainThread ssh.expect :: Output:
NAME STATUS ROLES AGE VERSION
compute-0 Ready <none> 165m v1.13.5
compute-1 Ready <none> 165m v1.13.5
compute-2 Ready <none> 165m v1.13.5
compute-3 Ready <none> 165m v1.13.5
compute-4 Ready <none> 165m v1.13.5
controller-0 Ready master 3h54m v1.13.5
controller-1 Ready master 3h7m v1.13.5
[sysadmin@controller-1 ~(keystone_admin)]$
[2019-06-20 17:38:19,417] 268 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-06-20 17:38:19,520] 387 DEBUG MainThread ssh.expect :: Output:
0
[sysadmin@controller-1 ~(keystone_admin)]$
[2019-06-20 17:38:19,521] 591 INFO MainThread kube_helper.wait_for_nodes_ready:: All nodes are ready: controller-0
[2019-06-20 17:38:19,521] 212 INFO MainThread container_helper.wait_for_apps_status:: Wait for ['stx-openstack'] application(s) to reach status: applied
[2019-06-20 17:38:19,521] 423 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-06-20 17:38:19,521] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-06-20 17:38:21,004] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | upload-failed | Upload of application hello-kitty (1.0) failed: Command '['helm-upload', 'starlingx', u'/scratch/apps/hello-kitty/1.0/charts |
| | | | | | /hello-kitty.tgz']' returned non-zero exit status 1 |
| | | | | | |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack. | applying | processing chart: osh-kube-system-ingress, overall completion: 4.0% |
| | versioned | | yaml | |

[2019-06-20 17:44:09,192] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-06-20 17:44:10,754] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+----------------+---------------+------------------------------------------------------------------------------------------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | upload-failed | Upload of application hello-kitty (1.0) failed: Command '['helm-upload', 'starlingx', u'/scratch/apps/hello-kitty/1.0/charts |
| | | | | | /hello-kitty.tgz']' returned non-zero exit status 1 |
| | | | | | |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack. | applying | processing chart: osh-openstack-ceph-rgw, overall completion: 44.0% |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
description: updated
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

I will take a look at the log.

Changed in starlingx:
assignee: nobody → Lin Shuicheng (shuicheng)
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Peng,
The "ALL_NODES" log file contains only the controller-0 log. Is the controller-1 log still available?
The issue is likely due to a failure on controller-1 that caused an automatic swact to controller-0, which led to the stuck reapply.

When the issue occurred, controller-0 had just been unlocked, so it should have been in standby. But sm.log shows an uncontrolled swact, which switched the active controller from controller-1 to controller-0.
A swact stops the sysinv process on controller-1 and starts it on controller-0. The application apply was running in controller-1's sysinv process, so the stuck state is expected once that process is killed.

But the attached log covers controller-0 only, so it is not clear what caused the swact.

The swact log is below:
2019-06-20T17:37:54.000 controller-0 sm: debug time[191.633] log<77> INFO: sm[92936]: sm_service_domain_neighbor_fsm.c(732): Neighbor (controller-1) received event (exchange-message) was in the exchange state and is now in the full.
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.562] log<78> ERROR: sm[92936]: sm_service_domain_waiting_state.c(236): Service domain (controller) neighbor (controller-0) not found.
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.562] log<79> INFO: sm[92936]: sm_service_domain_fsm.c(308): Set state waiting->leader
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.562] log<80> INFO: sm[92936]: sm_service_domain_fsm.c(493): Service Domain (controller) received event (wait-expired) was in the waiting state and is now in the leader state.
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.563] log<4> INFO: sm_alarm[92950]: sm_alarm_thread.c(1224): Managing alarms for domain (controller).
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.563] log<81> INFO: sm[92936]: sm_service_domain_filter.c(338): Uncontrolled swact start
2019-06-20T17:37:57.000 controller-0 sm: debug time[194.563] log<82> INFO: sm[92936]: sm_node_swact_monitor.cpp(29): Swact has started, host will be active

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_71_75
Load: 20190623T233000Z

Both controllers' logs are attached.

Timestamps:
[2019-06-24 10:40:10,079] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
 stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack.yaml | applied | completed

[2019-06-24 10:40:12,117] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-06-24 10:52:45,091] 268 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
stx-openstack | 1.0-16-centos-stable- | armada-manifest | stx-openstack. | applying | processing chart: osh-openstack-ceph-rgw, overall completion: 46.0%

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

From the log, controller-0 was unlocked at 10:40:12:
2019-06-24T10:40:12.000 controller-1 -sh: info HISTORY: PID=782269 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0

The application re-apply begins at 10:40:19:
2019-06-24 10:40:19.668 333801 INFO sysinv.api.controllers.v1.host [-] Reapplying the stx-openstack app

Processing of osh-openstack-ceph-rgw begins at 10:51:59:
2019-06-24 10:51:59.235 332533 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-ceph-rgw, overall completion: 46.0%

And here is the corresponding log from armada:
2019-06-24 10:51:58.243 36 DEBUG armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Starting to wait on: namespace=openstack, resource type=pod, label_selector=(release_group=osh-openstack-ceph-rgw), timeout=1800 _watch_resource_completions /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:362^[[00m
2019-06-24 10:52:34.524 36 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173^[[00m

Controller-0 becomes unlocked/available at 10:46:38:
2019-06-24T10:46:38.000 controller-1 sm: debug time[3887.884] log<913> INFO: sm[95837]: sm_main_event_handler.c(171): Set node (controller-0) requested, action=2, admin_state=unlocked, oper_state=enabled, avail_status=available, seqno=5.
2019-06-24T10:46:38.000 controller-1 sm: debug time[3887.895] log<914> INFO: sm[95837]: sm_failover.c(1129): controller-1 unlocked-enabled-available, controller-0 unlocked-enabled-available

Then controller-1 is force-rebooted at 10:58:22:
2019-06-24T10:58:22.000 controller-1 -sh: info HISTORY: PID=782269 UID=42425 sudo reboot -f

Further debugging is needed to determine why the apply is stuck at the pod-ready wait in osh-openstack-ceph-rgw.

Ghada Khalil (gkhalil)
tags: added: stx.sanity
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

This is an armada-related issue. The ceph_rgw chart defines only 3 jobs. After a swact plus a lock/unlock of the standby controller, the pods created for those jobs no longer exist, so there are no pods for ceph_rgw. Armada needs at least one pod to pass its resource wait, so it gets stuck because there is no pod.
Here is the armada log:
2019-06-24 10:51:58.074 36 DEBUG armada.handlers.wait [-] [chart=openstack-ceph-rgw]: Resolved `wait.resources` list: [{'type': 'job', 'required': False, 'labels': {'release_group': 'osh-openstack-ceph-rgw'}}, {'type': 'pod', 'labels': {'release_group': 'osh-openstack-ceph-rgw'}}] __init__ /usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py:87^[[00m
required is False for the job entry but True for the pod entry.
And in the _wait() function in wait.py, the code logic is:
        if (not found_resources) and not self.required:
            return None

The issue needs to be fixed in armada, or a workaround found in StarlingX.
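
For illustration, a minimal Python sketch of that per-resource decision (paraphrased from the logic quoted above; names and return values are illustrative, not the real wait.py implementation):

    # Paraphrased sketch of armada's per-resource wait decision; illustrative only.
    def wait_decision(found_resources, required):
        if not found_resources and not required:
            return "skip"          # job entry: required=False, so missing jobs are tolerated
        if not found_resources and required:
            return "keep waiting"  # pod entry: required=True, so an empty pod list
                                   # blocks until the wait timeout
        return "wait for the found resources to become ready"

    # State after the standby controller reboot: no osh-openstack-ceph-rgw pods left.
    print(wait_decision(found_resources=[], required=False))  # skip (job)
    print(wait_decision(found_resources=[], required=True))   # keep waiting (pod)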

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating - locking/unlocking a standby controller is a pretty standard operation. The application re-apply needs to work in this case.

tags: added: stx.2.0 stx.containers
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Here is a detailed explanation of the issue.
When the stx-openstack application is applied for the first time, the osh-openstack-ceph-rgw chart runs 3 jobs. They can be found in the job and pod lists below, all labelled with osh-openstack-ceph-rgw:
job:
ceph-ks-endpoints 1/1 31s 37m
ceph-ks-service 1/1 16s 37m
swift-ks-user 1/1 42s 37m
pod:
ceph-ks-endpoints-bcjts 0/3 Completed 0 36m
ceph-ks-service-wmsdc 0/1 Completed 0 36m
swift-ks-user-cdbc8 0/1 Completed 0 36m

After the swact, unlocking controller-0 reboots it, and the job pods that ran on controller-0 are destroyed.
Then, during the application re-apply triggered by the controller-0 unlock, armada waits on both jobs and pods to ensure the chart apply succeeded.
But since the jobs are already completed, they do not run again, so no new pods are created, and the old pods were destroyed by the controller-0 reboot. So armada gets stuck at the pod wait.
It is an armada issue, since armada assumes at least one pod is executed for every chart apply. That holds in most cases, but not for the re-apply of a chart that contains only jobs.
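
To illustrate what the armada wait sees in this state, a minimal sketch using the kubernetes Python client (namespace and label selector taken from the armada log earlier in this bug; assumes a valid kubeconfig on the active controller):

    # Minimal sketch: list the jobs and pods that armada's wait would match for
    # the ceph-rgw release after the standby controller reboot.
    from kubernetes import client, config

    config.load_kube_config()
    selector = "release_group=osh-openstack-ceph-rgw"

    jobs = client.BatchV1Api().list_namespaced_job("openstack", label_selector=selector)
    pods = client.CoreV1Api().list_namespaced_pod("openstack", label_selector=selector)

    # In the failure scenario described above, the 3 completed jobs are still
    # present, but their pods were destroyed with controller-0, so the pod list
    # is empty and armada's required pod wait can never complete.
    print(len(jobs.items), "jobs;", len(pods.items), "pods")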

It is hard to work around in StarlingX. I will report an issue to the armada project.

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

The issue has been reported to the armada project in the story below:
https://storyboard.openstack.org/#!/story/2006133
A mail has also been sent to the Armada and StarlingX discussion lists.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I am also experiencing this issue on a simplex lab (e.g. wcp_105).
BUILD_ID="20190703T013000Z"

unlock failing/stuck in reboot loop
         Starting Control drbd resources....
[  OK  ] Started Control drbd resources..
[ 237.371223] controller_config[13535]: [WARNING]
..
Failed to run the puppet manifest

Revision history for this message
Lin Shuicheng (shuicheng) wrote :

No response from the mailing list or storyboard yet.

A fix in armada has been proposed with the patch below:
https://review.opendev.org/669989

Revision history for this message
Bob Church (rchurch) wrote :

I took a look at this for a bit today. I would recommend that we remove the openstack-ceph-rgw chart_group from being loaded by the manifest for the time being since it's repeatedly blocking sanity/regressions.

I think we can do this as a partial-bug fix here.

I am starting to work on https://bugs.launchpad.net/starlingx/+bug/1833738 to make this chart optional anyway. Once we sort out a fix for this upstream we can apply it to the chart and/or manifest to re-enable with 1833738.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670445

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Per the comments in https://review.opendev.org/669989, we can fix the issue in StarlingX by explicitly declaring the resources that need to be waited on.
A patch has been proposed for review; I will abandon the change in armada.
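
For comparison with the resolved wait.resources list in the earlier armada DEBUG log, a hypothetical shape of that list after an explicit job-only declaration (illustrative only, not the exact content of the proposed StarlingX change):

    # Hypothetical resolved `wait.resources` once the chart declares its waits
    # explicitly: only the jobs are listed, so no default pod entry is added and
    # the wait completes as soon as the 3 jobs report success.
    wait_resources = [
        {'type': 'job', 'labels': {'release_group': 'osh-openstack-ceph-rgw'}},
    ]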

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/670445
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4a66dd3723554d7056bf7bae10f66bdbb099b4e3
Submitter: Zuul
Branch: master

commit 4a66dd3723554d7056bf7bae10f66bdbb099b4e3
Author: Shuicheng Lin <email address hidden>
Date: Fri Jul 12 13:03:49 2019 +0800

    fix armada stuck at processing osh-openstack-ceph-rgw chart

    The issue occur when do swact && lock/unlock standby controller.
    After swact && lock/unlock standby controller, the pod created for
    job doesn't exist any more. So there is no pod for ceph_rgw.
    While armada assumes there is at least 1 pod exist in default, and
    will wait for the pod to be up and ready. For chart without pod, we
    need explicitly declare the resources that need to be waited for in
    armada schema. This declare will override the default wait list in
    armada.
    Change the timeout value to 300s, which should be enough for the 3
    job to be ready.

    Closes-Bug: 1833609

    Change-Id: I5339406cf914cd54f45b3de5df7ff213e8845bfc
    Signed-off-by: Shuicheng Lin <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_63_66
Load: 20190717T013000Z

Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
Lin Shuicheng (shuicheng) wrote :

Hi Peng,
I checked the log; it is the same issue as before. The reason is that my fix has only been included in builds since 07/18, while your build is from 07/17.
You can find my fix in the 07/18 changelog below:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190718T013000Z/outputs/CHANGELOG.txt

Please re-check the issue with a build from 07/18 or later.
Thanks.

Changed in starlingx:
status: Confirmed → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced recently.

tags: removed: stx.retestneeded