Simplex: stx-application failed after lock/unlock controller

Bug #1816177 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Yan Chen

Bug Description

Brief Description
-----------------
On a simplex system, after a lock/unlock of the controller, it was noticed that the stx-openstack application went into apply-failed status.

Severity
--------
Major

Steps to Reproduce
------------------
1. lock/unlock controller on simplex system
2. check system application-list for stx-openstack status

Expected Behavior
------------------
2. stx-openstack is successfully reapplied after unlock

Actual Behavior
----------------
2. stx-openstack is in apply-failed status.
All the openstack pods were in Completed or Running state though. After manually reapplying stx-openstack, it reached applied status.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
f/stein as of 2019-02-15

Timestamp/Logs
--------------
[2019-02-14 15:52:17,947] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------+-----------------+------------------------+---------+-----------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+---------+-----------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | applied | completed |
+---------------+-----------------+------------------------+---------+-----------+
[wrsroot@controller-0 ~(keystone_admin)]$

[2019-02-14 15:56:30,632] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0

[2019-02-14 15:57:13,938] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-02-14 16:10:35,220] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| application | manifest name | manifest file | status | progress |
+---------------+-----------------+------------------------+--------------+------------------------------------------+
| stx-openstack | armada-manifest | manifest-no-tests.yaml | apply-failed | operation aborted, check logs for detail |
+---------------+-----------------+------------------------+--------------+------------------------------------------+
controller-0:~$

# sysinv logs:
2019-02-14 15:57:37.941 4849 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
2019-02-14 15:57:59.490 4849 ERROR sysinv.conductor.kube_app [-] Armada request apply for manifest /manifests/stx-openstack-manifest-no-tests.yaml failed: ('Connection aborted.', error(2, 'ENOENT'))
2019-02-14 15:57:59.507 4849 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.
2019-02-14 15:58:11.514 9073 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common Traceback (most recent call last):
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 564, in ensure
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return method(*args, **kwargs)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 644, in _consume
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return connection.drain_events(**kwargs)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return self.blocking_read(timeout)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common return self.on_inbound_frame(frame)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common callback(channel, method_sig, buf, None)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common method_sig, payload, content,
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common listener(*args)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 595, in _on_close
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common (class_id, method_id), ConnectionError)
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
2019-02-14 15:58:11.514 9073 TRACE sysinv.openstack.common.rpc.common
….

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; related to containers; specific to AIO-SX

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.containers
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Cindy as a good learning opportunity

Ran An (an.ran)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Ran An (an.ran)
Revision history for this message
Ran An (an.ran) wrote :

Hi Yang, I'm trying to reproduce this issue.
When I unlocked my controller after configuring it (without applying the application yet), all k8s_* containers failed to run, although they ran well before the unlock.

Did you hit the same situation? What are the results of the commands "kubectl get pods -n kube-system" and "kubectl get pods -n openstack"?

Here is the ISO I used to deploy (20190215T203255Z): http://mirror.starlingx.cengn.ca/mirror/starlingx/f/stein/centos/latest_build/outputs/iso/bootimage.iso
Attached are my logs for reference.

Revision history for this message
Ran An (an.ran) wrote :

To make my question clearer:

Did you hit the same situation?
What are the results of the commands "kubectl get pods -n kube-system" and "kubectl get pods -n openstack" in your env when the stx-openstack application went into apply-failed status?

Revision history for this message
Yang Liu (yliu12) wrote :

Hi Ran, I did not see the issue you encountered.

For me, the kube-system pods were in good state after initial unlock, stx-openstack application was then applied successfully.

In the reported issue, after another lock/unlock of the host, stx-openstack is in apply-failed state via system application-list; all pods are in good state though via kubectl get pods.

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-h7qw5 2/2 Running 4 27h
kube-system coredns-666d7dfb44-698fh 1/1 Running 2 27h
kube-system coredns-666d7dfb44-r9jvs 1/1 Running 2 27h
kube-system ingress-error-pages-f8d5ccccf-2nc8x 1/1 Running 1 26h
kube-system ingress-kxk99 1/1 Running 1 26h
kube-system kube-apiserver-controller-0 1/1 Running 2 27h
kube-system kube-controller-manager-controller-0 1/1 Running 4 27h
kube-system kube-proxy-428df 1/1 Running 2 27h
kube-system kube-scheduler-controller-0 1/1 Running 4 27h
kube-system tiller-deploy-86bf9bf7f-g7mdn 1/1 Running 3 27h
openstack aodh-alarms-cleaner-1550622900-htcdj 0/1 Completed 0 19h
openstack aodh-api-f76c7d8f5-lhgng 1/1 Running 1 26h
openstack aodh-db-init-t989k 0/1 Completed 0 26h
openstack aodh-db-sync-c75bm 0/1 Completed 0 26h
openstack aodh-evaluator-6cbb64dbd5-rgzsk 1/1 Running 1 26h
openstack aodh-ks-endpoints-22ccn 0/3 Completed 0 26h
openstack aodh-ks-service-975dt 0/1 Completed 0 26h
openstack aodh-ks-user-zvnng 0/1 Completed 0 26h
openstack aodh-listener-7bfc887f86-zbdq8 1/1 Running 1 26h
openstack aodh-notifier-6c6b86c659-xbxvw 1/1 Running 1 26h
openstack aodh-rabbit-init-zz6pm 0/1 Completed 0 26h
openstack ceilometer-central-79cc477fc5-m58pp 1/1 Running 1 26h
openstack ceilometer-compute-l59bq 1/1 Running 1 26h
openstack ceilometer-db-sync-mxkhg 0/1 Completed...

Revision history for this message
Ran An (an.ran) wrote :

Thanks Yang. I'll check my env and show it if there are any progress or questions.

Revision history for this message
Ran An (an.ran) wrote :

Hi Yang, sorry for the late reply. Could you help run the attached verify script in your env when the stx-openstack application goes into apply-failed status after unlocking the controller?

I finally reproduced this issue, and I found the openstack services run well even though "system application-list" shows "apply-failed". From the logs at #5, the pod status in your env is the same as in mine.

So I suspect the issue is that system inventory reports a wrong status for the application.

Ran An (an.ran)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ran An (an.ran) wrote :

I reproduced this issue with an image built on March 5th from the master branch (after stein was merged).
I found that the "armada_service" container failed to run the armada command during the application re-apply after the controller unlock, which caused the application apply to fail. Analyzing the container's log (below), it is an issue with docker re-issuing "docker run" against an existing container[1].

Further investigation is ongoing.

Error of container:
"OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/opt/platform/armada/19.01/admin.conf\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/697965f49d86c5127f6095ad586916906fd57da36156f4ee474c39bbe6703cfb/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/697965f49d86c5127f6095ad586916906fd57da36156f4ee474c39bbe6703cfb/merged/armada/.kube/config\\\\\\\" caused \\\\\\\"not a directory\\\\\\\"\\\"\": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type"

[1] https://stackoverflow.com/questions/51469309/recreating-docker-bind-mount
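The error class above ("Are you trying to mount a directory onto a file?") comes from a type mismatch between the bind-mount source and target. A minimal sketch of a pre-check, using temporary paths as illustrative stand-ins for the real armada paths:

```shell
#!/bin/sh
# Sketch: verify that a bind-mount source matches the expected target type
# before issuing "docker run". Paths here are illustrative stand-ins for
# /opt/platform/armada/19.01/admin.conf etc., not the real deployment paths.

check_mount_types() {
    src="$1"; dst_kind="$2"   # dst_kind: "file" or "dir"
    if [ -d "$src" ] && [ "$dst_kind" = "dir" ]; then
        echo "ok: directory onto directory"
    elif [ -f "$src" ] && [ "$dst_kind" = "file" ]; then
        echo "ok: file onto file"
    else
        echo "mismatch: docker would fail with 'not a directory'"
        return 1
    fi
}

tmp=$(mktemp -d)
touch "$tmp/admin.conf"                    # host side is a file
check_mount_types "$tmp/admin.conf" dir    # target is a directory -> mismatch
mkdir "$tmp/kubeconfig"
check_mount_types "$tmp/kubeconfig" dir    # directory onto directory -> ok
rm -rf "$tmp"
```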

Revision history for this message
Ran An (an.ran) wrote :

By binding a directory[1] instead of a file[2], the container error mentioned above is solved.

But several tests showed that the "armada_service" container created by the sysinv conductor would stay in "created" status for about 2 minutes during the reboot process after unlock, while it took a very short time once the system was ready (i.e. when we could run "system application-apply stx-openstack").

[1] mount /opt/platform/armada/19.01/kubeconfig (with a file "config", a copy of /etc/kubernetes/admin.conf, under it) to /armada/.kube
[2] mount /opt/platform/armada/19.01/admin.conf to /armada/.kube/config

Revision history for this message
Ran An (an.ran) wrote :

Following up on #8:

By changing the docker run parameters according to comment[1] from a docker maintainer, this issue could be fixed.

Here I use "--mount type=bind" instead of "-v" to bind files from the host into the container, and change "--restart=always" to "--restart=unless-stopped". The openstack application then succeeds after unlock. This has been tested on an env based on the ISO built on March 5th.

[1] https://github.com/docker/for-win/issues/1192#issuecomment-355939641
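For reference, a sketch of the flag change described above. The container name and image are placeholders; the real invocation lives in sysinv and is not quoted in this bug:

```shell
#!/bin/sh
# Sketch of the docker run flag change discussed in this comment.
# "armada-image" is a placeholder; this is not the actual sysinv command.

# Before: with -v, a missing host path is silently created as a directory
# at (re)start time, which can turn a file mount into the
# file-onto-directory error reported above.
old_args="-v /opt/platform/armada/19.01/admin.conf:/armada/.kube/config --restart=always"

# After: --mount type=bind refuses to start when the source is missing,
# and unless-stopped avoids restart races while the host is rebooting.
new_args="--mount type=bind,source=/opt/platform/armada/19.01/admin.conf,target=/armada/.kube/config --restart=unless-stopped"

echo "docker run -d --name armada_service $new_args armada-image"
```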

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/646031

Revision history for this message
Ran An (an.ran) wrote : Re: Containers: stx-application failed after lock/unlock controller

Verification on the latest code base failed.

The root cause is that the sm service terminated the docker registry during boot[1] while, at the same time, the sysinv conductor was downloading docker images from the local registry to reapply the application.

[1] Seems related to commit https://review.openstack.org/#/c/626354/, which re-implements the docker registry token.

Revision history for this message
Ran An (an.ran) wrote :

During the boot process, sm disables/enables a series of services once it is ready, including sysinv-inv (which triggers the application reapply), sysinv-conductor (which downloads images and starts the armada container), and docker-distribution (which provides the local docker registry).

The dependency between 'docker-distribution' and 'registry-token-server' added by patch[1] delayed the point at which the docker registry became ready, causing "sysinv conductor downloads images" and "registry disable/enable" to coincide, which made the application fail to reapply.

Proper scheduling of the sm service disable/enable sequence is required.

[1] https://review.openstack.org/#/c/626354/
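A common mitigation for this class of race is to poll the registry endpoint before pulling images. A hedged sketch only: the URL, retry count, and delay below are illustrative, and this is not what sysinv actually does:

```shell
#!/bin/sh
# Illustrative retry loop for the registry race described above: poll the
# local registry's /v2/ base endpoint until it answers before starting
# image downloads. All values here are placeholders, not StarlingX config.

wait_for_registry() {
    url="$1"
    tries="${2:-30}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -fsS "$url/v2/" >/dev/null 2>&1; then
            return 0                        # registry answered
        fi
        i=$((i + 1))
        [ "$i" -lt "$tries" ] && sleep 2    # back off between attempts
    done
    return 1                                # registry never came up
}

wait_for_registry "http://127.0.0.1:1" 1 || echo "registry not reachable"
```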

Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ran An (an.ran) wrote :

https://bugs.launchpad.net/starlingx/+bug/1822697 is solved, but verification on the latest code base still fails.

The logs show:
Currently the "host-unlock" action triggers an application reapply on the active controller and reboots the unlocked controller.
On a simplex system, these two processes run on the same node and impact each other.

After discussing with John (TL of sysinv), we need to delay the reboot until the sysinv conductor has finished reapplying the application.

A new patch is under investigation.

Ghada Khalil (gkhalil)
summary: - Containers: stx-application failed after lock/unlock controller
+ Simplex: stx-application failed after lock/unlock controller
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: SM_2
Load: 2019-05-23_18-37-00

[2019-05-24 13:49:35,074] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-05-24 13:49:37,241] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+---------------+---------+-----------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+---------------+---------+-----------+
| platform-integ-apps | 1.0-5 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-13-centos-stable-versioned | armada-manifest | manifest.yaml | applied | completed |
+---------------------+--------------------------------+-------------------------------+---------------+---------+-----------+

[2019-05-24 13:49:37,733] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-05-24 14:06:51,643] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-05-24 14:06:53,423] 387 DEBUG MainThread ssh.expect :: Output:
+---------------------+--------------------------------+-------------------------------+---------------+--------------+------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+---------------+--------------+------------------------------------------+
| platform-integ-apps | 1.0-5 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-13-centos-stable-versioned | armada-manifest | manifest.yaml | apply-failed | operation aborted, check logs for detail |
+---------------------+--------------------------------+-------------------------------+---------------+--------------+------------------------------------------+
controller-0:~$

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Raising the priority as this appears to be reported regularly from All-in-one Simplex testing in WR labs

Changed in starlingx:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by Ran An (<email address hidden>) on branch: master
Review: https://review.opendev.org/646031
Reason: this workaround does not work on the latest code base

Revision history for this message
Frank Miller (sensfan22) wrote :

Ran mentioned that this issue is now being worked on by Yan

Changed in starlingx:
assignee: Ran An (an.ran) → Yan Chen (ychen2u)
Revision history for this message
Yan Chen (ychen2u) wrote :

Paste Emails here:

From: Chen, Yan
Sent: Monday, June 3, 2019 20:57
To: 'Ngo, Tee' <email address hidden>; 'Kung, John' <email address hidden>; 'Bailey, Henry Albert (Al)' <email address hidden>
Cc: Hu, Yong <email address hidden>; An, Ran1 <email address hidden>; 'MacDonald, Eric' <email address hidden>; 'Qian, Bin' <email address hidden>; 'Smith, Tyler' <email address hidden>
Subject: RE: [StaringX- Container] reapply application after unlock controller issue in AIO

To verify my thoughts, I made a very simple script that deletes .unlock_ready before the unlock and touches it back when the reapply finishes.
The test result was good: the simplex rebooted several seconds after .unlock_ready came back, and after the reboot the stx-openstack status was "applied" again.

The problem is that .unlock_ready now lives under the /etc/platform folder, and I don't have a good way to delete it from the sysinv API.
I also think this should not be a permanent flag; it should only indicate that the system is ready for an unlock, and if there is any config change (like the lock/re-unlock in this LP), the flag should be removed and the "unlock-ready" status re-evaluated.
So can we move this file into the /var folder? Or is there a better solution from your point of view?

===== script to test unlock =====
# Precondition
# 1. The simplex system is deployed successfully.
# 2. Stx-openstack is applied.

source /etc/platform/openrc

sudo rm /etc/platform/.unlock_ready
system host-lock controller-0
sleep 3
system host-unlock controller-0

while true; do system application-list | grep stx-openstack | grep applied; if [ $? -eq 0 ]; then break; fi; sleep 10; done

sudo touch /etc/platform/.unlock_ready

# after this, system waited for several seconds and rebooted.

===== end of script =====

Yan

From: Chen, Yan
Sent: Friday, May 31, 2019 21:46
To: Ngo, Tee <email address hidden>; Kung, John <email address hidden>; Bailey, Henry Albert (Al) <email address hidden>
Cc: Hu, Yong <email address hidden>; An, Ran1 <email address hidden>; MacDonald, Eric <email address hidden>; Qian, Bin <email address hidden>; Smith, Tyler <email address hidden>
Subject: RE: [StaringX- Container] reapply application after unlock controller issue in AIO

“In any case, mtc behavior should not change. MTC boot command should be gated by .unlock_ready flag.”
I fully agree with you.
And yes, we need to handle these 2 tasks carefully.
Anyway we can work on this solution first. Thank you for your clarification.

Yan

From: Ngo, Tee [mailto:<email address hidden>]
Sent: Friday, May 31, 2019 13:06
To: Kung, John <email address hidden>; Chen, Yan <email address hidden>; Bailey, Henry Albert (Al) <email address hidden>
Cc: Hu, Yong <email address hidden>; An, Ran1 <email address hidden>; MacDonald, Eric <email address hidden>; Qian, Bin <email address hidden>; Smith, Tyler <email address hidden>
Subject: RE: [StaringX- Container] reapply application after unlock controller issue in AIO

Two places where the .unlock_ready flag is generated are
a) In sysinv conductor as part of configure_controller_host (called at unlock) for standard config
b) In sysinv agent after the mani...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/663286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/663286
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=884fbc92adfa3a2b5f6c2e8fc008b046e8f94dbe
Submitter: Zuul
Branch: master

commit 884fbc92adfa3a2b5f6c2e8fc008b046e8f94dbe
Author: chenyan <email address hidden>
Date: Wed Jun 5 17:14:44 2019 +0800

    remove .unlock_ready when unlocking simplex with openstack applied

    When unlocking a simplex system with openstack applied, the
    .unlock_ready flag should be removed, because the unlock operation
    need to re-apply the stx-openstack application before taking reboot.
    The flag will be re-generated when the re-apply finished.

    Closes-Bug: 1816177

    Change-Id: I81d12b31c8cb5ca860b476dc31c56a5f239b223d
    Signed-off-by: chenyan <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

Verification passed on AIO-SX system on master 20190713T013000Z.

tags: removed: stx.retestneeded