Bug #1837792 “stx-openstack application apply aborted " : Bugs : StarlingX

Revision history for this message

Ming Lei (mlei) wrote on 2019-07-24:

#1

stx-openstack-apply.log Edit (167.2 KiB, text/plain)

Revision history for this message

Ming Lei (mlei) wrote on 2019-07-24:

#2

sysinv.log Edit (2.0 MiB, text/plain)

Revision history for this message

Anujeyan Manokeran (anujeyan) wrote on 2019-07-25:

#3

Download full text (7.6 KiB)

Application apply aborted at 62% on regular system load "20190724T013000Z"
. Similar to above logs
2019-07-25 12:58:28.876 106112 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-placement, overall completion: 62.0%
2019-07-25 13:28:27.382 26231 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller [-] [chart=openstack-libvirt]: Error while installing release osh-openstack-libvirt: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
>
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller metadata=self.metadata)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller >
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller
2019-07-25 13:28:29.130 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: Helm getting release status for release=osh-openstack-libvirt, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531
2019-07-25 13:28:29.219 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: GetReleaseStatus= name: "osh-openstack-libvirt"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1564059509
    nanos: 4957...

Application apply aborted at 62% on regular system load "20190724T013000Z"
 . Similar to above logs 
2019-07-25 12:58:28.876 106112 INFO sysinv.conductor.kube_app [-] processing chart: osh-openstack-placement, overall completion: 62.0%
2019-07-25 13:28:27.382 26231 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:173
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller [-] [chart=openstack-libvirt]: Error while installing release osh-openstack-libvirt: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
>
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller Traceback (most recent call last):
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller     metadata=self.metadata)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller     return _end_unary_response_blocking(state, call, False, None)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller     raise _Rendezvous(state, None, None, deadline)
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller      status = StatusCode.UNKNOWN
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller      details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller      debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller >
2019-07-25 13:28:29.128 26231 ERROR armada.handlers.tiller
2019-07-25 13:28:29.130 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: Helm getting release status for release=osh-openstack-libvirt, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:531
2019-07-25 13:28:29.219 26231 DEBUG armada.handlers.tiller [-] [chart=openstack-libvirt]: GetReleaseStatus= name: "osh-openstack-libvirt"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1564059509
    nanos: 495775
  }
  last_deployed {
    seconds: 1564059509
    nanos: 495775
  }
  Description: "Release \"osh-openstack-libvirt\" failed: timed out waiting for the condition"
}
namespace: "openstack"
 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:539
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada [-] Chart deploy [openstack-libvirt] failed: armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-libvirt - Tiller Message: b'Release "osh-openstack-libvirt" failed: timed out waiting for the condition'
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada Traceback (most recent call last):
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 465, in install_release
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     metadata=self.metadata)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     return _end_unary_response_blocking(state, call, False, None)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     raise _Rendezvous(state, None, None, deadline)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada      status = StatusCode.UNKNOWN
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada      details = "release osh-openstack-libvirt failed: timed out waiting for the condition"
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada      debug_error_string = "{"created":"@1564061309.128669855","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-libvirt failed: timed out waiting for the condition","grpc_status":2}"
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada >
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada Traceback (most recent call last):
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 223, in handle_result
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     result = get_result()
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     return self.__get_result()
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     raise self._exception
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     result = self.fn(*self.args, **self.kwargs)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 212, in deploy_chart
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     prefix, known_releases)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 234, in execute
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     timeout=timer)
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada   File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 478, in install_release
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada     raise ex.ReleaseException(release, status, 'Install')
2019-07-25 13:28:29.220 26231 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-libvirt - Tiller Message: b'Release "osh-openstack-libvirt" failed: timed out waiting for the condition'
2019-07-25 13:28:29.220 26231

Ghada Khalil (gkhalil) on 2019-07-25

tags:

added: stx.containers

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-07-26:

#4

In order to debug this further, collect logs are required from the controllers as well as the computes. Please attach.

Revision history for this message

Anujeyan Manokeran (anujeyan) wrote on 2019-07-26:

#5

Similar issue on standard lab logs. Edit (95.2 MiB, application/x-tar)

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-07-26:

#6

From Anujeyan's logs the application apply started here:
2019-07-24 21:57:06.749 106112 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app stx-openstack

It timed-out/faiuled here:
2019-07-24 22:27:18.158 106112 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada/stx-openstack-apply.log for details.
2019-07-24 22:27:18.159 106112 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-24 22:27:18.175 106112 ERROR sysinv.conductor.kube_app [-] Application apply aborted!.

According to Bob Church, this seems to be the key log indicating what is failing:

neutron-dhcp-agent-compute-0-9c041f23-s7rzv_openstack_init-266eabb307aafd458f62869249a54abaa6028b4567c44a1ad2895cf3b6a112e1.log:{"log":"Entrypoint WARNING: 2019/07/24 21:57:19 entrypoint.go:71: Resolving dependency Job neutron-db-sync in namespace openstack failed: Job Job neutron-db-sync in namespace openstack is not completed yet .\n","stream":"stdout","time":"2019-07-24T21:57:19.528744365Z"}

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-07-26:

#7

As per input from Frank (containers PL), marking as High / stx.2.0 gating as this has been seen multiple times.

tags:	added: stx.2.0
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged
assignee:	nobody → Angie Wang (angiewang)

Revision history for this message

Angie Wang (angiewang) wrote on 2019-08-16:

#8

Application-apply aborted issue was reproduced on WCP-3_6.

Some of the db-sync pods failed, ie.. neutron-db-sync, placement-db-sync...
During the neutron and placement db syncing, the mariadb ingress was reloaded unexpectedly (see the following logs) which caused the DB connection terminated and db sync failed.

mariadb-ingress-6ff964556d-8pg62
I0802 16:55:20.751467 38 controller.go:211] backend reload required
I0802 16:55:20.813445 38 controller.go:220] ingress backend successfully reloaded...

db-sync restart not working because the db table already exists(alembic_version table has not been updated to the right migration version yet before db connection lost).

From the log, unable to determine the root cause of mariadb ingress reloads. Usually, if ingress reloaded, there would have related logs.

Note that this issue only happened when the initial installation of the system and apply stx-openstack right after the system is up.

Found something interesting, the OSD of WCP-3_6 for both controllers are on HDD disk which is slow disk.
Mariadb is on OSD and there are lots of db reads and writes during applying nova, placement and neutron.
After changing OSD to SSD disk for that lab, I am not be able to reproduce that, apply works well.

But we may still want some chart changes that can handle the db sync failure. Ie... drop DB/tables and retry.

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-09-06:

#9

Lowered priority to medium and changed gate to stx.3.0 for the following reasons:
1) Issue only occurs if using SAS disks for OSDs, does not occur if using SSDs
2) Issue has a manual recipe workaround if a user must use SAS disks and requires stx-openstack application (issue only occurs if openstack applied)

tags:	added: stx.3.0 removed: stx.2.0
Changed in starlingx:
importance:	High → Medium

Revision history for this message

Angie Wang (angiewang) wrote on 2019-09-06:

#10

The manual workaround is:
- remove the stx-openstack
system application-remove stx-openstack
- re-apply stx-openstack
system application-apply stx-openstack

Yang Liu (yliu12) on 2019-09-09

tags:

added: stx.retestneeded

Revision history for this message

Mariano Ucha (marianoucha) wrote on 2019-10-10:

#11

I'm having the same problem I think. I cannot apply stx-openstack. I try the WorkAround but fail again

May I post logs here?

Revision history for this message

Angie Wang (angiewang) wrote on 2019-10-10:

#12

Sure. You can post related logs here or you can run "collect all" (which collects logs and other useful information for each node) and upload the generated log tarfile.

stx-openstack apply failure may due to various reasons, if it's caused by db-sync failure like I mentioned above, the manual workaround should fix it.

Revision history for this message

Mariano Ucha (marianoucha) wrote on 2019-10-11:

#13

ALL_NODES_20191010.212847.tar.gz Edit (53.0 MiB, application/x-tar)

I tryed to apply the Work Around Angie but it failed again.

Revision history for this message

Angie Wang (angiewang) wrote on 2019-10-11:

#14

Hi Mariano,

I took a look the logs you provided. It's not due to compute-kits services(neutron, nova, placement) db-sync, they all succeed.

The armada log shows the error which indicates something wrong with openvswitch ..

2019-10-10 21:32:40.460 25301 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-openvswitch)). None found! Are `wait.labels` correct? Does `wait.resources` need to exclude `type: pod`?

-bash-4.2$ grep -R "label" var/log/bash.log
2019-10-07T18:10:19.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 sriovdp=enabled
2019-10-07T18:13:03.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openstack-control-plane=enabled
2019-10-07T18:13:05.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openstack-compute-node=enabled
2019-10-07T18:13:12.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 openvswitch=disabled
2019-10-07T18:13:18.000 localhost -sh: info HISTORY: PID=19544 UID=42425 system host-label-assign controller-0 sriov=enabled

Looks like you didn't enable "openvswitch" label. By default, vswitch runs in container which requires the
"openvswitch" label to be applied.

You can use the following command to verify you have openvswitch enabled.
[sysadmin@controller-0 ~(keystone_admin)]$ system helm-override-list stx-openstack --long
| openvswitch | [u'openstack'] | [True] |

And also verify your vswitch_type is none.
[sysadmin@controller-0 ~(keystone_admin)]$ system show
| vswitch_type | none |

You need to enable the openvswitch label(require lock/unlock host) and then apply stx-openstack.

-Angie

Revision history for this message

Mariano Ucha (marianoucha) wrote on 2019-10-11:

#15

Hi Angie! Thank you very much for you answer!

Yes I disabled OVS because I would like to test SRIOV. Is it mandatory to enable it?

Revision history for this message

Angie Wang (angiewang) wrote on 2019-10-11:

#16

Hi Mariano, no problem. Neutron needs vswitch, you need to have ovs either running on host or container.

If vswitch_type == none, vswitch will run in container(openvswitch chart is enabled in stx-openstack). It's mandatory to enable openstack-vswitch label.
If vswitch_type == ovs-dpdk, vswitch runs on host(openvswitch chart is disabled in stx-openstack). Virtual environment doesn't support ovs-dpdk, only barematel supports it.

Revision history for this message

Mariano Ucha (marianoucha) wrote on 2019-10-11:

#17

Thank you very much again Angie. I'll enable it.

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-12-05:

#18

Changing tag to stx.4.0 for same reasons as listed on Sept 6:

Lowered priority to medium and changed gate to stx.3.0 for the following reasons:
1) Issue only occurs if using SAS disks for OSDs, does not occur if using SSDs
2) Issue has a manual recipe workaround if a user must use SAS disks and requires stx-openstack application (issue only occurs if openstack applied)

tags:

added: stx.4.0
removed: stx.3.0

Frank Miller (sensfan22) on 2020-05-25

tags:	removed: stx.containers
tags:	added: stx.distro.openstack

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2020-05-26:

#19

Lowering the priority -- Given the reasons noted above and the fact that this issue exists in previous releases, this will not hold up stx.4.0

Changed in starlingx:
importance:	Medium → Low
tags:	added: stx.app stx.contaienrs removed: stx.4.0
tags:	added: stx.apps stx.containers removed: stx.app stx.contaienrs

Revision history for this message

Thales Elero Cervi (tcervi) wrote on 2024-03-11:

#20

Closing it now due to inactivity.
Latest stx-openstack versions did not reproduce such issue, as per: https://<email address hidden>/thread/UFXM4FQSC66K5YGSYMZVHB4DE4A4JN6O/

Changed in starlingx:
status:	Triaged → Invalid

StarlingX

stx-openstack application apply aborted

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches