stx-openstack application apply aborted

Bug #1837792 reported by Ming Lei on 2019-07-24
This bug affects 2 people
Affects: StarlingX | Importance: Medium | Assigned to: Angie Wang

Bug Description

Brief Description
-----------------
Application apply aborted!.

Severity
--------
Major

Steps to Reproduce
------------------
1. attempted to apply stx-openstack

2. system application-list: Application apply aborted!.
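The two steps above can be sketched as follows, assuming the StarlingX "system" CLI on the active controller; the polling interval is an arbitrary choice, not from this report:

```shell
# Apply the application, then poll until it leaves the "applying"
# state and show the final status.
system application-apply stx-openstack
while system application-list | grep -q 'stx-openstack.*applying'; do
  sleep 30
done
system application-list
```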

Expected Behavior
------------------
application completed

Actual Behavior
----------------
see sysinv.log:
2019-07-23 21:43:03.445 100949 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada/stx-openstack-apply.log for details.
2019-07-23 21:43:03.446 100949 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-23 21:43:03.453 100949 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

Reproducibility
---------------
<Reproducible/Intermittent/Seen once>

System Configuration
--------------------
Storage system (IP_20_27)

Branch/Pull Time/Commit
-----------------------
stx master as of 20190721T013000Z

Last Pass
---------

Timestamp/Logs
--------------
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller metadata=self.metadata)
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1563918169.849385472","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
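When scanning armada logs like the ones above, the failing release name can be grepped out of the error line. A self-contained sketch (the sample line is copied from the log above):

```shell
# Extract "release <name> failed" from an armada/tiller error line.
sample='2019-07-23 21:42:49.849 23995 ERROR armada.handlers.tiller details = "release osh-openstack-libvirt failed: timed out waiting for the condition"'
echo "$sample" | grep -o 'release [a-z0-9-]* failed' | tail -1
```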

Test Activity
-------------

Ming Lei (mlei) wrote :
Anujeyan Manokeran (anujeyan) wrote :

Application apply aborted at 62% on regular system load "20190724T013000Z". Logs similar to the above:
2019-07-25 12:58:28.876 106112 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-placement, overall completion: 62.0%
2019-07-25 13:28:27.382 26231 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller [-] [chart=openstack-libvirt]: Error while installing release osh-openstack-libvirt: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
>
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller metadata=self.metadata)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller >
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller
2019-07-25 13:28:29.130 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: Helm getting release status for release=osh-openstack-libvirt, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531
2019-07-25 13:28:29.219 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: GetReleaseStatus= name: "osh-openstack-libvirt"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1564059509
    nanos: 4957...


Ghada Khalil (gkhalil) on 2019-07-25
tags: added: stx.containers
Frank Miller (sensfan22) wrote :

In order to debug this further, collect logs are required from the controllers as well as the computes. Please attach.

Anujeyan Manokeran (anujeyan) wrote :
Frank Miller (sensfan22) wrote :

From Anujeyan's logs the application apply started here:
2019-07-24 21:57:06.749 106112 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app stx-openstack

It timed out / failed here:
2019-07-24 22:27:18.158 106112 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada/stx-openstack-apply.log for details.
2019-07-24 22:27:18.159 106112 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-24 22:27:18.175 106112 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

According to Bob Church, this seems to be the key log indicating what is failing:

neutron-dhcp-agent-compute-0-9c041f23-s7rzv_openstack_init-266eabb307aafd458f62869249a54abaa6028b4567c44a1ad2895cf3b6a112e1.log:{"log":"Entrypoint WARNING: 2019/07/24 21:57:19 entrypoint.go:71: Resolving dependency Job neutron-db-sync in namespace openstack failed: Job Job neutron-db-sync in namespace openstack is not completed yet .\n","stream":"stdout","time":"2019-07-24T21:57:19.528744365Z"}
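The blocked init-container dependency in that log points at the neutron-db-sync Job; it can be inspected directly with standard kubectl commands (Kubernetes sets the job-name label on a Job's pods automatically):

```shell
# Inspect the Job the init container is waiting on.
kubectl -n openstack get job neutron-db-sync
kubectl -n openstack describe job neutron-db-sync
# The job pod's logs usually show why db-sync did not complete.
kubectl -n openstack logs -l job-name=neutron-db-sync --tail=50
```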

Ghada Khalil (gkhalil) wrote :

As per input from Frank (containers PL), marking as High / stx.2.0 gating as this has been seen multiple times.

tags: added: stx.2.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Angie Wang (angiewang)
Angie Wang (angiewang) wrote :

The application-apply aborted issue was reproduced on WCP-3_6.

Some of the db-sync pods failed, e.g. neutron-db-sync, placement-db-sync.
During the neutron and placement db sync, the mariadb ingress was reloaded unexpectedly (see the following logs), which caused the DB connections to terminate and the db syncs to fail.

mariadb-ingress-6ff964556d-8pg62
I0802 16:55:20.751467 38 controller.go:211] backend reload required
I0802 16:55:20.813445 38 controller.go:220] ingress backend successfully reloaded...

Restarting db-sync does not work because the db tables already exist (the alembic_version table had not yet been updated to the right migration version before the db connection was lost).

From the logs, I am unable to determine the root cause of the mariadb ingress reload. Usually, when the ingress reloads, there are related logs.

Note that this issue only happened on an initial installation of the system, with stx-openstack applied right after the system came up.

Found something interesting: the OSDs for both controllers of WCP-3_6 are on HDDs, which are slow disks.
Mariadb is on the OSDs, and there are lots of db reads and writes while applying nova, placement and neutron.
After changing the OSDs to SSDs in that lab, I was not able to reproduce the issue; apply works well.

But we may still want some chart changes that can handle a db-sync failure, e.g. drop the DB/tables and retry.
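A purely illustrative sketch of that recovery idea. The pod name mariadb-server-0 and the database name neutron are assumptions, not from this report, and dropping the database discards whatever is in it:

```shell
# Hypothetical manual recovery: drop the half-migrated database so a
# db-sync retry starts clean. Pod and DB names are assumptions; this
# destroys any data in that database.
kubectl -n openstack exec -it mariadb-server-0 -- \
  mysql -u root -p -e "DROP DATABASE IF EXISTS neutron; CREATE DATABASE neutron;"
```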

Frank Miller (sensfan22) wrote :

Lowered priority to medium and changed gate to stx.3.0 for the following reasons:
1) Issue only occurs if using SAS disks for OSDs, does not occur if using SSDs
2) Issue has a manual recipe workaround if a user must use SAS disks and requires stx-openstack application (issue only occurs if openstack applied)

tags: added: stx.3.0
removed: stx.2.0
Changed in starlingx:
importance: High → Medium
Angie Wang (angiewang) wrote :

The manual workaround is:
 - remove the stx-openstack
   system application-remove stx-openstack
 - re-apply stx-openstack
   system application-apply stx-openstack
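The two steps above, sketched with a wait in between, assuming the application returns to the "uploaded" state once the removal completes (the polling interval is an arbitrary choice):

```shell
# Remove, wait for the removal to finish, then re-apply.
system application-remove stx-openstack
until system application-list | grep -q 'stx-openstack.*uploaded'; do
  sleep 15
done
system application-apply stx-openstack
```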

Yang Liu (yliu12) on 2019-09-09
tags: added: stx.retestneeded
Mariano Ucha (marianoucha) wrote :

I think I'm having the same problem. I cannot apply stx-openstack. I tried the workaround but it failed again.

May I post logs here?

Angie Wang (angiewang) wrote :

Sure. You can post related logs here or you can run "collect all" (which collects logs and other useful information for each node) and upload the generated log tarfile.

A stx-openstack apply failure may be due to various reasons; if it's caused by a db-sync failure like the one I mentioned above, the manual workaround should fix it.

Mariano Ucha (marianoucha) wrote :

I tried to apply the workaround, Angie, but it failed again.

Angie Wang (angiewang) wrote :

Hi Mariano,

I took a look at the logs you provided. It's not due to the compute-kit services' (neutron, nova, placement) db-sync; they all succeeded.

The armada log shows an error which indicates something wrong with openvswitch:

2019-10-10 21:32:40.460 25301 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-openvswitch)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?

-bash-4.2$ grep -R "label" var/log/bash.log
2019-10-07T18:10:19.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 sriovdp=enabled
2019-10-07T18:13:03.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openstack-control-plane=enabled
2019-10-07T18:13:05.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openstack-compute-node=enabled
2019-10-07T18:13:12.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openvswitch=disabled
2019-10-07T18:13:18.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 sriov=enabled

It looks like you didn't enable the "openvswitch" label. By default, the vswitch runs in a container, which requires the
"openvswitch" label to be applied.

You can use the following command to verify you have openvswitch enabled.
[sysadmin@controller-0 ~(keystone_admin)]$ system helm-override-list stx-openstack --long
| openvswitch | [u'openstack'] | [True] |

And also verify your vswitch_type is none.
[sysadmin@controller-0 ~(keystone_admin)]$ system show
| vswitch_type | none |

You need to enable the openvswitch label (this requires a host lock/unlock) and then apply stx-openstack.
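A sketch of that fix, using controller-0 as in the bash.log excerpt above (on a multi-node system, repeat for each host that should run the containerized vswitch):

```shell
# Assigning the label requires the host to be locked; lock, assign,
# unlock, then re-apply the application once the host is available.
system host-lock controller-0
system host-label-assign controller-0 openvswitch=enabled
system host-unlock controller-0
system application-apply stx-openstack
```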

-Angie

Mariano Ucha (marianoucha) wrote :

Hi Angie! Thank you very much for you answer!

Yes, I disabled OVS because I would like to test SRIOV. Is it mandatory to enable it?

Angie Wang (angiewang) wrote :

Hi Mariano, no problem. Neutron needs a vswitch; you need to have OVS running either on the host or in a container.

If vswitch_type == none, the vswitch runs in a container (the openvswitch chart is enabled in stx-openstack). It's mandatory to enable the openvswitch label.
If vswitch_type == ovs-dpdk, the vswitch runs on the host (the openvswitch chart is disabled in stx-openstack). Virtual environments don't support ovs-dpdk; only bare metal supports it.

Mariano Ucha (marianoucha) wrote :

Thank you very much again Angie. I'll enable it.

Frank Miller (sensfan22) wrote :

Changing tag to stx.4.0 for same reasons as listed on Sept 6:

Lowered priority to medium and changed gate to stx.3.0 for the following reasons:
1) Issue only occurs if using SAS disks for OSDs, does not occur if using SSDs
2) Issue has a manual recipe workaround if a user must use SAS disks and requires stx-openstack application (issue only occurs if openstack applied)

tags: added: stx.4.0
removed: stx.3.0