mistral MessagingTimeout correlates with containerized undercloud uptime

Bug #1789680 reported by John Fulton
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Dougal Matthews

Bug Description

Even the simplest Mistral actions fail with MessagingTimeout on a containerized Rocky RC1 undercloud with an uptime of approximately 48 hours, as reported by three independent users. For example:

(undercloud) [stack@undercloud ~]$ mistral run-action std.echo '{"output": "Hello Workflow!"}'
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 6e089e1b7fc749579f60f2bcbd52b71d
(undercloud) [stack@undercloud ~]$

Rebooting the undercloud works around the problem for two out of the three reports.

This affects overcloud deployment as 'openstack overcloud deploy ...' results in a similar MessagingTimeout error.

Revision history for this message
John Fulton (jfulton-org) wrote :

dtantsur: oh, something I missed in mistral-api initially: ERROR oslo.messaging._drivers.impl_rabbit [req-6399e1f7-19b8-4a8d-af99-bc431ae91366 66fbd53f21484b49b76674b8b020a313 fdaa0a59b31e4119ab7c80f1096b02cd - default default] [4166ca76-f412-4789-a948-bfea12b4538f] AMQP server on undercloud.internalapi.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104]

jtomasek: fultonj, dtantsur: <shardy> it seems that mistral tries to create the execution, then that "no threads" RPC INFO happens, then eventually the RPC times out

jtomasek: fultonj, dtantsur: my observation - this happens with any mistral action or workflow; the workflow itself actually executes without problem, but the response never arrives back and the command which initiated it times out

description: updated
description: updated
Revision history for this message
John Fulton (jfulton-org) wrote :

Hit this again. Rebooting undercloud as workaround.

+ openstack overcloud deploy --templates /home/stack/templates --libvirt-type qemu --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --block-storage-flavor oooq_blockstorage --swift-storage-flavor oooq_objectstorage --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/templates/environments/docker-ha.yaml -e /home/stack/containers-default-parameters.yaml -e /home/stack/templates/environments/network-isolation.yaml -e /home/stack/templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/low-memory-usage.yaml -e /home/stack/templates/environments/disable-telemetry.yaml --validation-warnings-fatal --compute-scale 1 --control-scale 1 --ceph-storage-scale 3 --ntp-server pool.ntp.org -e /home/stack/templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/local_fetch_dir.yaml
MessagingTimeout: Timed out waiting for a reply to message ID 30f53672ab2c4660be4be64b460349ee

real 2m5.174s
user 0m0.966s
sys 0m0.458s
+ status_code=1
+ exit 0
(undercloud) [stack@undercloud ~]$ openstack overcloud plan delete overcloud
Deleting plan overcloud...
MessagingTimeout: Timed out waiting for a reply to message ID e4cd14bad76e429f9bbed92afe5b007d
(undercloud) [stack@undercloud ~]$ uptime
 22:46:09 up 2 days, 22:07, 5 users, load average: 2.57, 3.01, 3.17
(undercloud) [stack@undercloud ~]$

Revision history for this message
John Fulton (jfulton-org) wrote :

May not be related, but I had to `virsh destroy undercloud` as the VM hung after `init 6` with the following on the console:

[stack@hamfast ~]$ virsh list
 Id Name State
----------------------------------------------------
 1 undercloud running

[stack@hamfast ~]$ virsh console 1
Connected to domain undercloud
Escape character is ^]
[ OK ] Unmounted /var/lib/docker/container...2fe7ad8802da7a92a51a2fe5541/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...dcfabdf183da72b297553c19/merged.
[ OK ] Unmounted /var/lib/docker/container...53e84bf7f04eee42189ecf7411e/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...b81153481b11867e98df1e0e/merged.
[ OK ] Unmounted /var/lib/docker/container...cb1c5e8cd0716df6d2bb8538171/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...e28d0ff3506b9f82d204f243/merged.
[ OK ] Unmounted /var/lib/docker/container...5115fdf5731d8a86437da680e20/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...44050b320a23de01eca91bdb/merged.

Toure Dunnon (toure)
Changed in tripleo:
assignee: nobody → Toure Dunnon (toure)
Revision history for this message
Toure Dunnon (toure) wrote :

Next time you hit this issue, I would like to see the output of "top -b -o +%MEM | head -n 22".

Revision history for this message
John Fulton (jfulton-org) wrote :

+ openstack overcloud deploy --templates /home/stack/templates --libvirt-type qemu --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --block-storage-flavor oooq_blockstorage --swift-storage-flavor oooq_objectstorage --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/templates/environments/docker-ha.yaml -e /home/stack/containers-default-parameters.yaml -e /home/stack/templates/environments/network-isolation.yaml -e /home/stack/templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/low-memory-usage.yaml -e /home/stack/templates/environments/disable-telemetry.yaml --validation-warnings-fatal --compute-scale 1 --control-scale 1 --ceph-storage-scale 3 --ntp-server pool.ntp.org -e /home/stack/templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/local_fetch_dir.yaml
MessagingTimeout: Timed out waiting for a reply to message ID 0a52722084334af3afcfa1ad11d0717d

real 2m7.305s
user 0m1.001s
sys 0m0.480s
+ status_code=1
+ exit 0
(undercloud) [stack@undercloud ~]$ top -b -o +%MEM | head -n 22
top - 16:20:57 up 2 days, 13:20, 5 users, load average: 2.16, 2.29, 2.52
Tasks: 534 total, 4 running, 529 sleeping, 0 stopped, 1 zombie
%Cpu(s): 40.1 us, 10.9 sy, 0.0 ni, 48.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 634208 free, 9351388 used, 3349196 buff/cache
KiB Swap: 4194300 total, 2004404 free, 2189896 used. 3461936 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 6.2 6.7 38:35.54 mysqld
 5469 42439 20 0 6546832 264640 3052 S 0.0 2.0 233:38.14 beam.smp
 2359 42430 20 0 521448 164964 3348 S 6.2 1.2 132:48.52 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.51 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.59 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.20 httpd
 3401 42436 20 0 932960 160584 3488 S 6.2 1.2 8:03.26 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:47.67 mistral-server
26938 42418 20 0 400096 139516 2608 S 0.0 1.0 16:21.50 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 110220 3368 R 25.0 0.8 129:29.92 nova-compute
14948 42425 20 0 686796 109280 7280 S 0.0 0.8 0:21.85 httpd
19842 42436 20 0 324740 109092 2396 S 0.0 0.8 21:34.07 nova-conductor
19843 42436 20 0 324288 108684 2396 S 0.0 0.8 21:31.44 nova-conductor
19802 42436 20 0 324476 108652 2400 S 0.0 0.8 21:32.99 nova-conductor
(undercloud) [stack@undercloud ~]$

Revision history for this message
John Fulton (jfulton-org) wrote :

(undercloud) [stack@undercloud ~]$ top -b -o +%MEM | head -n 22; mistral run-action std.echo '{"output": "
Hello Workflow!"}' ; top -b -o +%MEM | head -n 22
top - 16:22:42 up 2 days, 13:22, 5 users, load average: 2.85, 2.50, 2.57
Tasks: 535 total, 3 running, 531 sleeping, 0 stopped, 1 zombie
%Cpu(s): 18.5 us, 6.8 sy, 0.0 ni, 74.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 610300 free, 9364692 used, 3359800 buff/cache
KiB Swap: 4194300 total, 2004916 free, 2189384 used. 3448884 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 0.0 6.7 38:36.23 mysqld
 5469 42439 20 0 6565376 272136 3052 S 5.9 2.0 233:44.65 beam.smp
 2359 42430 20 0 521448 164964 3348 S 5.9 1.2 132:52.08 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.66 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.74 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.38 httpd
 3401 42436 20 0 932960 160584 3488 S 0.0 1.2 8:03.43 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:50.49 mistral-server
26938 42418 20 0 400096 139516 2608 S 5.9 1.0 16:23.61 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 110220 3368 S 0.0 0.8 129:33.44 nova-compute
14948 42425 20 0 686796 109280 7280 S 0.0 0.8 0:21.85 httpd
19842 42436 20 0 324740 109092 2396 S 0.0 0.8 21:36.38 nova-conductor
19843 42436 20 0 324288 108684 2396 S 0.0 0.8 21:33.70 nova-conductor
19802 42436 20 0 324476 108652 2400 S 11.8 0.8 21:35.32 nova-conductor
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 2457bf9391e947bca2dfc54dac05ae70
top - 16:24:44 up 2 days, 13:24, 5 users, load average: 2.29, 2.40, 2.52
Tasks: 538 total, 3 running, 534 sleeping, 0 stopped, 1 zombie
%Cpu(s): 36.4 us, 12.9 sy, 0.0 ni, 50.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 566944 free, 9407016 used, 3360832 buff/cache
KiB Swap: 4194300 total, 2004916 free, 2189384 used. 3406868 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 0.0 6.7 38:37.07 mysqld
 5469 42439 20 0 6539840 262232 3052 S 11.8 2.0 233:51.39 beam.smp
 2359 42430 20 0 521448 164964 3348 S 11.8 1.2 132:56.27 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.83 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.91 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.58 httpd
 3401 42436 20 0 932960 160584 3488 S 0.0 1.2 8:03.64 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:53.91 mistral-server
26938 42418 20 0 400096 139516 2608 S 0.0 1.0 16:25.95 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 11...


Revision history for this message
Toure Dunnon (toure) wrote :

It doesn't look like there is anything out of the ordinary with regard to total memory. I am still trying to reproduce this on my side.

Revision history for this message
Toure Dunnon (toure) wrote :

Just installed the latest upstream version and now I am waiting.

Revision history for this message
Toure Dunnon (toure) wrote :

I have had my system running for the last three days and the system is still responsive.

c1bc63161586 192.168.24.1:8787/tripleorocky/centos-binary-mistral-api:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_api
77d2ba06d67c 192.168.24.1:8787/tripleorocky/centos-binary-mistral-engine:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_engine
e383e8b9c665 192.168.24.1:8787/tripleorocky/centos-binary-mistral-event-engine:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_event_engine
daac90ed4eab 192.168.24.1:8787/tripleorocky/centos-binary-mistral-executor:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_executor

# mistral_executor
#
#
()[mistral@undercloud /]$ rpm -qa|grep mistral
openstack-mistral-common-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistral-lib-1.0.0-0.20180821152751.d1ccfd0.el7.noarch
python-mistral-7.0.1-0.20180907124420.2640c73.el7.noarch
openstack-mistral-executor-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistralclient-3.7.0-0.20180810140142.f0ee48f.el7.noarch
puppet-mistral-13.3.1-0.20180831192741.bb0e35e.el7.noarch

# mistral_api
#
#
()[mistral@undercloud /]$ rpm -qa|grep mistral
openstack-mistral-common-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistral-lib-1.0.0-0.20180821152751.d1ccfd0.el7.noarch
python-mistral-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistralclient-3.7.0-0.20180810140142.f0ee48f.el7.noarch
puppet-mistral-13.3.1-0.20180831192741.bb0e35e.el7.noarch
openstack-mistral-api-7.0.1-0.20180907124420.2640c73.el7.noarch

# zaqar
#
#
()[root@undercloud /]# rpm -qa|grep zaqar
python2-zaqarclient-1.10.0-0.20180810073833.1a50023.el7.noarch
openstack-zaqar-7.0.1-0.20180908030326.1b31c7e.el7.noarch
puppet-zaqar-13.3.1-0.20180831212815.00a7f19.el7.noarch

Revision history for this message
Toure Dunnon (toure) wrote :

John, could I get the package list from your install?

Revision history for this message
John Fulton (jfulton-org) wrote :

(undercloud) [stack@undercloud ~]$ for C in mistral_executor mistral_api zaqar ; do echo $C; docker exec -ti $C rpm -qa | egrep "mistral|zaqar"; done
mistral_executor
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
python2-mistral-lib-1.0.0-0.20180730234322.d1ccfd0.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
python-mistral-7.0.0-0.20180810091000.7b5bffe.el7.noarch
openstack-mistral-executor-7.0.0-0.20180810091000.7b5bffe.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
openstack-mistral-common-7.0.0-0.20180810091000.7b5bffe.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
mistral_api
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
python2-mistral-lib-1.0.0-0.20180730234322.d1ccfd0.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
python-mistral-7.0.0-0.20180810091000.7b5bffe.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
openstack-mistral-common-7.0.0-0.20180810091000.7b5bffe.el7.noarch
openstack-mistral-api-7.0.0-0.20180810091000.7b5bffe.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
zaqar
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
openstack-zaqar-7.0.0-0.20180809140531.5830528.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
(undercloud) [stack@undercloud ~]$

(undercloud) [stack@undercloud ~]$ mistral run-action std.echo '{"output": "Hello Workflow!"}'
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 245f32e07c8140c2be3a39cafdcc6a5e
(undercloud) [stack@undercloud ~]$ date
Tue Sep 18 11:07:28 UTC 2018
(undercloud) [stack@undercloud ~]$ docker ps | grep mistral
65540fb9638a docker.io/tripleomaster/centos-binary-mistral-api:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_api
75ae674fc3b5 docker.io/tripleomaster/centos-binary-mistral-engine:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_engine
c1d30cedaa84 docker.io/tripleomaster/centos-binary-mistral-event-engine:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_event_engine
d11063e58f82 docker.io/tripleomaster/centos-binary-mistral-executor:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_executor
(undercloud) [stack@undercloud ~]$ docker ps | grep zaqar
e7b1104186c2 docker.io/tripleomaster/centos-binary-zaqar:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours zaqar_websocket
a6834c2345ab docker.io/tripleomaster/centos-binary-zaqar:0743a561fd1021f651e5d4d9869041...


Changed in tripleo:
milestone: rocky-rc2 → stein-1
tags: added: rocky-backport-potential
Revision history for this message
Toure Dunnon (toure) wrote :

Just a quick update: I have found where the problem is taking place, which is in the mistral_api service. It seems as though our messaging queue starts missing replies to executions. I have watched incoming executions; they make it to the engine and executor and are completed, but the result never makes it back to the API service. I am now debugging the WSGI service to see if there are any leads.
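
For reference, a minimal sketch of the synchronous RPC call path that times out here, using the public oslo.messaging API (the target topic, method name, and arguments are illustrative placeholders, not Mistral's actual RPC interface):

import oslo_messaging as messaging
from oslo_config import cfg

# A synchronous RPC call publishes the request and then blocks on a
# per-client reply queue. The engine and executor can finish the work,
# but if the reply lands on a queue the API process is not consuming,
# the caller raises MessagingTimeout.
transport = messaging.get_rpc_transport(cfg.CONF)
target = messaging.Target(topic='mistral.engine')  # illustrative topic name
client = messaging.RPCClient(transport, target, timeout=60)

# call() waits for the reply; cast() would be fire-and-forget and would
# not hit this timeout.
result = client.call({}, 'start_action',  # illustrative method and arguments
                     name='std.echo',
                     action_input={'output': 'Hello Workflow!'})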

Revision history for this message
Brad P. Crochet (brad-9) wrote :

I don't think Mistral API is running under WSGI.

ps axfw | grep httpd | grep mistral

Yields nothing.

Revision history for this message
Brad P. Crochet (brad-9) wrote :

Investigating https://review.openstack.org/#/c/557487/ as a possible fix.

Revision history for this message
Thomas Herve (therve) wrote :

That review is definitely not a fix as we don't use the kombu server.

Revision history for this message
Thomas Herve (therve) wrote :

https://review.openstack.org/605633 is an attempt at a fix.

Revision history for this message
Rabi Mishra (rabi) wrote :

Ah, I was also looking at the same thing today. I think I got it to work in my environment with a similar change.

diff --git a/mistral/api/service.py b/mistral/api/service.py
index 3d2aeb90..8d957711 100644
--- a/mistral/api/service.py
+++ b/mistral/api/service.py
@@ -49,6 +49,7 @@ class WSGIService(service.ServiceBase):
         # properly (e.g. message routing for synchronous calls may be based on
         # generated queue names).
         rpc_clients.cleanup()
+        rpc_base.cleanup()

         self.server.start()
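
As a hedged sketch of why this matters (not Mistral's actual modules): oslo.messaging transports and their reply queues are per-process, and the API service forks WSGI workers, so any transport cached in the parent process before the fork has to be dropped. Otherwise replies can be published to a reply queue that no surviving process consumes, which surfaces exactly as this MessagingTimeout.

import oslo_messaging as messaging
from oslo_config import cfg

_TRANSPORT = None

def get_transport():
    # Lazily create and cache a process-wide RPC transport.
    global _TRANSPORT
    if _TRANSPORT is None:
        _TRANSPORT = messaging.get_rpc_transport(cfg.CONF)
    return _TRANSPORT

def cleanup():
    # Called in the parent right before forking the API workers so each
    # child builds its own connection and reply queue after the fork.
    global _TRANSPORT
    if _TRANSPORT is not None:
        _TRANSPORT.cleanup()
        _TRANSPORT = None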

Toure Dunnon (toure)
Changed in tripleo:
status: Triaged → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 5.2.6

This issue was fixed in the openstack/mistral 5.2.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 6.0.5

This issue was fixed in the openstack/mistral 6.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 7.0.3

This issue was fixed in the openstack/mistral 7.0.3 release.

wes hayutin (weshayutin)
Changed in tripleo:
status: Fix Committed → Triaged
Revision history for this message
wes hayutin (weshayutin) wrote :

Let's keep this open until it's in the current builds.

Revision history for this message
wes hayutin (weshayutin) wrote :
tags: added: alert
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think that may correlate with the SIGHUP processing upon receiving the signal from logrotate.
Mistral engine and Heat engine have shown that they do not tolerate that signal well in my dev env, when I was testing the logrotation fixes [0]. So we fell back to a signal-less approach [1], and it was merged just today. I think things should now get fixed automagically.

[0] https://review.openstack.org/#/c/589213/
[1] https://review.openstack.org/#/c/607491/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Apparently logrotate has nothing to do with it. The failed job shows the logrotate container was not even deployed there: http://logs.openstack.org/89/608589/5/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/317ee0f/logs/undercloud/var/log/containers/

Revision history for this message
Thomas Herve (therve) wrote :

Can we open another bug? The latest errors are WebSocket timeouts, which have nothing to do with the MessagingTimeout fixed in this bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/609941

Changed in tripleo:
assignee: Toure Dunnon (toure) → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Toure Dunnon (toure)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> It seems that the workflow takes about 10 minutes to run, but we time out at 6 (https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/deployment.py#L52)

> About 5 minutes out of that is spent by skopeo inspect. So I suspect it's due to https://review.openstack.org/#/c/604664/

The root cause seems to be https://bugs.launchpad.net/tripleo/+bug/1797525

Changed in tripleo:
assignee: Toure Dunnon (toure) → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/609746
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=b4053ad111af873a4adbbbfeb3bf1553f71cfd8e
Submitter: Zuul
Branch: master

commit b4053ad111af873a4adbbbfeb3bf1553f71cfd8e
Author: Dougal Matthews <email address hidden>
Date: Thu Oct 11 16:36:12 2018 +0100

    Retry uploading messages to Swift up to 5 times

    This should hopefully handle short, intermittent issues uploading to
    Swift. We currently use the same retry policy when sending Zaqar
    messages.

    Related-Bug: #1789680
    Change-Id: Ibee6ba188585f80f0f7d136c81146096cb4432c2
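
The retry pattern that commit describes, as a minimal hedged sketch (not the actual tripleo-common code; the helper name, backoff policy, and the assumption that `swift` is a swiftclient Connection are illustrative):

import time

from swiftclient.exceptions import ClientException

MAX_RETRIES = 5

def upload_message(swift, container, name, payload):
    # Retry the Swift upload a bounded number of times so a short,
    # intermittent failure does not fail the whole workflow.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            swift.put_object(container, name, contents=payload)
            return
        except ClientException:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(attempt)  # simple linear backoff between attempts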

Revision history for this message
Thomas Herve (therve) wrote :

With https://review.openstack.org/#/c/609586/ in, deploy_plan goes back to about 5 minutes. That's barely under the 6-minute barrier; maybe it's worth bumping it anyway?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Good idea @Thomas! Where does it live?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/609993

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Dougal Matthews (d0ugal)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.openstack.org/609993
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=9f716d8f66640c8ff89929843884dfb9638a96bb
Submitter: Zuul
Branch: master

commit 9f716d8f66640c8ff89929843884dfb9638a96bb
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
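
The client-side pattern this commit adjusts, as a hedged sketch (the helper name and message format are illustrative, not the real tripleoclient API):

import time

DEPLOY_PLAN_TIMEOUT = 10 * 60  # seconds; was 6 * 60 before this change

def wait_for_workflow_complete(get_next_message, timeout=DEPLOY_PLAN_TIMEOUT):
    # Block on workflow progress messages until a terminal status arrives
    # or the deadline passes. If the deploy_plan workflow legitimately
    # takes ~10 minutes, a 6-minute bound fails the client even though
    # the workflow itself succeeds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        message = get_next_message(timeout=deadline - time.monotonic())
        if message and message.get('status') in ('SUCCESS', 'FAILED'):
            return message
    raise RuntimeError('Timed out waiting for the deploy_plan workflow')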

Thomas Herve (therve)
tags: removed: alert
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941
Reason: we have https://review.openstack.org/#/c/609586/ merged and hopefully no longer need the revert

Revision history for this message
Marios Andreou (marios-b) wrote :

Folks, does that mean that with the two reviews at https://review.openstack.org/609746 and https://review.openstack.org/609993 (the timeout increase and the Swift upload retry) we can close this for now?

Changed in tripleo:
status: In Progress → Fix Committed
wes hayutin (weshayutin)
Changed in tripleo:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/613623

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/rocky)

Reviewed: https://review.openstack.org/613623
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=69f8b31201273e587a69ebe5ea5e0ca5af809b68
Submitter: Zuul
Branch: stable/rocky

commit 69f8b31201273e587a69ebe5ea5e0ca5af809b68
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
    (cherry picked from commit 9f716d8f66640c8ff89929843884dfb9638a96bb)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/615866

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/queens)

Reviewed: https://review.openstack.org/615866
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=ca1fd76fdd551edd4892a6d87b65369aa179634a
Submitter: Zuul
Branch: stable/queens

commit ca1fd76fdd551edd4892a6d87b65369aa179634a
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
    (cherry picked from commit 9f716d8f66640c8ff89929843884dfb9638a96bb)
    (cherry picked from commit 69f8b31201273e587a69ebe5ea5e0ca5af809b68)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 8.0.0.0b1

This issue was fixed in the openstack/mistral 8.0.0.0b1 development milestone.
