mistral MessagingTimeout correlates with containerized undercloud uptime

Bug #1789680 reported by John Fulton
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Dougal Matthews

Bug Description

Even the simplest Mistral actions fail with MessagingTimeout on a containerized Rocky RC1 undercloud with an uptime of approximately 48 hours, as reported by three independent users. For example:

(undercloud) [stack@undercloud ~]$ mistral run-action std.echo '{"output": "Hello Workflow!"}'
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 6e089e1b7fc749579f60f2bcbd52b71d
(undercloud) [stack@undercloud ~]$

Rebooting the undercloud works around the problem for two out of the three reports.

This affects overcloud deployment as 'openstack overcloud deploy ...' results in a similar MessagingTimeout error.

Revision history for this message
John Fulton (jfulton-org) wrote :

dtantsur: oh, something I missed in mistral-api initially: ERROR oslo.messaging._drivers.impl_rabbit [req-6399e1f7-19b8-4a8d-af99-bc431ae91366 66fbd53f21484b49b76674b8b020a313 fdaa0a59b31e4119ab7c80f1096b02cd - default default] [4166ca76-f412-4789-a948-bfea12b4538f] AMQP server on undercloud.internalapi.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104]

jtomasek: fultonj, dtantsur: <shardy> it seems that mistral tries to create the execution, then that "no threads" RPC INFO happens, then eventually the RPC times out

jtomasek: fultonj, dtantsur: my observation - this happens with any mistral action or workflow; the workflow itself actually executes without problem, but the response never arrives back and the command which initiated it times out

description: updated
description: updated
Revision history for this message
John Fulton (jfulton-org) wrote :

Hit this again. Rebooting undercloud as workaround.

+ openstack overcloud deploy --templates /home/stack/templates --libvirt-type qemu --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --block-storage-flavor oooq_blockstorage --swift-storage-flavor oooq_objectstorage --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/templates/environments/docker-ha.yaml -e /home/stack/containers-default-parameters.yaml -e /home/stack/templates/environments/network-isolation.yaml -e /home/stack/templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/low-memory-usage.yaml -e /home/stack/templates/environments/disable-telemetry.yaml --validation-warnings-fatal --compute-scale 1 --control-scale 1 --ceph-storage-scale 3 --ntp-server pool.ntp.org -e /home/stack/templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/local_fetch_dir.yaml
MessagingTimeout: Timed out waiting for a reply to message ID 30f53672ab2c4660be4be64b460349ee

real 2m5.174s
user 0m0.966s
sys 0m0.458s
+ status_code=1
+ exit 0
(undercloud) [stack@undercloud ~]$ openstack overcloud plan delete overcloud
Deleting plan overcloud...
MessagingTimeout: Timed out waiting for a reply to message ID e4cd14bad76e429f9bbed92afe5b007d
(undercloud) [stack@undercloud ~]$ uptime
 22:46:09 up 2 days, 22:07, 5 users, load average: 2.57, 3.01, 3.17
(undercloud) [stack@undercloud ~]$

Revision history for this message
John Fulton (jfulton-org) wrote :

May not be related, but I had to `virsh destroy undercloud` as the VM hung after `init 6` with the following on the console:

[stack@hamfast ~]$ virsh list
 Id Name State
----------------------------------------------------
 1 undercloud running

[stack@hamfast ~]$ virsh console 1
Connected to domain undercloud
Escape character is ^]
[ OK ] Unmounted /var/lib/docker/container...2fe7ad8802da7a92a51a2fe5541/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...dcfabdf183da72b297553c19/merged.
[ OK ] Unmounted /var/lib/docker/container...53e84bf7f04eee42189ecf7411e/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...b81153481b11867e98df1e0e/merged.
[ OK ] Unmounted /var/lib/docker/container...cb1c5e8cd0716df6d2bb8538171/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...e28d0ff3506b9f82d204f243/merged.
[ OK ] Unmounted /var/lib/docker/container...5115fdf5731d8a86437da680e20/shm.
[ OK ] Unmounted /var/lib/docker/overlay2/...44050b320a23de01eca91bdb/merged.

Toure Dunnon (toure)
Changed in tripleo:
assignee: nobody → Toure Dunnon (toure)
Revision history for this message
Toure Dunnon (toure) wrote :

Next time you hit this issue, I would like to see the output of "top -b -o +%MEM | head -n 22".

Revision history for this message
John Fulton (jfulton-org) wrote :

+ openstack overcloud deploy --templates /home/stack/templates --libvirt-type qemu --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --block-storage-flavor oooq_blockstorage --swift-storage-flavor oooq_objectstorage --timeout 90 -e /home/stack/cloud-names.yaml -e /home/stack/templates/environments/docker-ha.yaml -e /home/stack/containers-default-parameters.yaml -e /home/stack/templates/environments/network-isolation.yaml -e /home/stack/templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /home/stack/templates/environments/low-memory-usage.yaml -e /home/stack/templates/environments/disable-telemetry.yaml --validation-warnings-fatal --compute-scale 1 --control-scale 1 --ceph-storage-scale 3 --ntp-server pool.ntp.org -e /home/stack/templates/environments/ceph-ansible/ceph-ansible.yaml -e /home/stack/local_fetch_dir.yaml
MessagingTimeout: Timed out waiting for a reply to message ID 0a52722084334af3afcfa1ad11d0717d

real 2m7.305s
user 0m1.001s
sys 0m0.480s
+ status_code=1
+ exit 0
(undercloud) [stack@undercloud ~]$ top -b -o +%MEM | head -n 22
top - 16:20:57 up 2 days, 13:20, 5 users, load average: 2.16, 2.29, 2.52
Tasks: 534 total, 4 running, 529 sleeping, 0 stopped, 1 zombie
%Cpu(s): 40.1 us, 10.9 sy, 0.0 ni, 48.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 634208 free, 9351388 used, 3349196 buff/cache
KiB Swap: 4194300 total, 2004404 free, 2189896 used. 3461936 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 6.2 6.7 38:35.54 mysqld
 5469 42439 20 0 6546832 264640 3052 S 0.0 2.0 233:38.14 beam.smp
 2359 42430 20 0 521448 164964 3348 S 6.2 1.2 132:48.52 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.51 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.59 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.20 httpd
 3401 42436 20 0 932960 160584 3488 S 6.2 1.2 8:03.26 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:47.67 mistral-server
26938 42418 20 0 400096 139516 2608 S 0.0 1.0 16:21.50 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 110220 3368 R 25.0 0.8 129:29.92 nova-compute
14948 42425 20 0 686796 109280 7280 S 0.0 0.8 0:21.85 httpd
19842 42436 20 0 324740 109092 2396 S 0.0 0.8 21:34.07 nova-conductor
19843 42436 20 0 324288 108684 2396 S 0.0 0.8 21:31.44 nova-conductor
19802 42436 20 0 324476 108652 2400 S 0.0 0.8 21:32.99 nova-conductor
(undercloud) [stack@undercloud ~]$

Revision history for this message
John Fulton (jfulton-org) wrote :

(undercloud) [stack@undercloud ~]$ top -b -o +%MEM | head -n 22; mistral run-action std.echo '{"output": "
Hello Workflow!"}' ; top -b -o +%MEM | head -n 22
top - 16:22:42 up 2 days, 13:22, 5 users, load average: 2.85, 2.50, 2.57
Tasks: 535 total, 3 running, 531 sleeping, 0 stopped, 1 zombie
%Cpu(s): 18.5 us, 6.8 sy, 0.0 ni, 74.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 610300 free, 9364692 used, 3359800 buff/cache
KiB Swap: 4194300 total, 2004916 free, 2189384 used. 3448884 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 0.0 6.7 38:36.23 mysqld
 5469 42439 20 0 6565376 272136 3052 S 5.9 2.0 233:44.65 beam.smp
 2359 42430 20 0 521448 164964 3348 S 5.9 1.2 132:52.08 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.66 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.74 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.38 httpd
 3401 42436 20 0 932960 160584 3488 S 0.0 1.2 8:03.43 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:50.49 mistral-server
26938 42418 20 0 400096 139516 2608 S 5.9 1.0 16:23.61 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 110220 3368 S 0.0 0.8 129:33.44 nova-compute
14948 42425 20 0 686796 109280 7280 S 0.0 0.8 0:21.85 httpd
19842 42436 20 0 324740 109092 2396 S 0.0 0.8 21:36.38 nova-conductor
19843 42436 20 0 324288 108684 2396 S 0.0 0.8 21:33.70 nova-conductor
19802 42436 20 0 324476 108652 2400 S 11.8 0.8 21:35.32 nova-conductor
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 2457bf9391e947bca2dfc54dac05ae70
top - 16:24:44 up 2 days, 13:24, 5 users, load average: 2.29, 2.40, 2.52
Tasks: 538 total, 3 running, 534 sleeping, 0 stopped, 1 zombie
%Cpu(s): 36.4 us, 12.9 sy, 0.0 ni, 50.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13334792 total, 566944 free, 9407016 used, 3360832 buff/cache
KiB Swap: 4194300 total, 2004916 free, 2189384 used. 3406868 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3499 42434 20 0 4870812 892892 4932 S 0.0 6.7 38:37.07 mysqld
 5469 42439 20 0 6539840 262232 3052 S 11.8 2.0 233:51.39 beam.smp
 2359 42430 20 0 521448 164964 3348 S 11.8 1.2 132:56.27 mistral-server
 3400 42436 20 0 933728 161840 3492 S 0.0 1.2 7:30.83 httpd
 3402 42436 20 0 933728 161460 3496 S 0.0 1.2 7:36.91 httpd
 3399 42436 20 0 933216 161372 3488 S 0.0 1.2 8:04.58 httpd
 3401 42436 20 0 932960 160584 3488 S 0.0 1.2 8:03.64 httpd
 3160 42430 20 0 514324 159544 3536 S 0.0 1.2 109:53.91 mistral-server
26938 42418 20 0 400096 139516 2608 S 0.0 1.0 16:25.95 heat-engine
14946 42425 20 0 687820 110312 7280 S 0.0 0.8 0:23.08 httpd
 4199 root 20 0 342312 11...


Revision history for this message
Toure Dunnon (toure) wrote :

It doesn't look like there is anything out of the ordinary with regard to total memory. I am still trying to reproduce this on my side.

Revision history for this message
Toure Dunnon (toure) wrote :

Just installed the latest upstream version and now I am waiting.

Revision history for this message
Toure Dunnon (toure) wrote :

I have had my system running for the last three days and the system is still responsive.

c1bc63161586 192.168.24.1:8787/tripleorocky/centos-binary-mistral-api:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_api
77d2ba06d67c 192.168.24.1:8787/tripleorocky/centos-binary-mistral-engine:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_engine
e383e8b9c665 192.168.24.1:8787/tripleorocky/centos-binary-mistral-event-engine:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_event_engine
daac90ed4eab 192.168.24.1:8787/tripleorocky/centos-binary-mistral-executor:current-tripleo-rdo "kolla_start" 3 days ago Up 3 days (healthy) mistral_executor

# mistral_executor
#
#
()[mistral@undercloud /]$ rpm -qa|grep mistral
openstack-mistral-common-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistral-lib-1.0.0-0.20180821152751.d1ccfd0.el7.noarch
python-mistral-7.0.1-0.20180907124420.2640c73.el7.noarch
openstack-mistral-executor-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistralclient-3.7.0-0.20180810140142.f0ee48f.el7.noarch
puppet-mistral-13.3.1-0.20180831192741.bb0e35e.el7.noarch

# mistral_api
#
#
()[mistral@undercloud /]$ rpm -qa|grep mistral
openstack-mistral-common-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistral-lib-1.0.0-0.20180821152751.d1ccfd0.el7.noarch
python-mistral-7.0.1-0.20180907124420.2640c73.el7.noarch
python2-mistralclient-3.7.0-0.20180810140142.f0ee48f.el7.noarch
puppet-mistral-13.3.1-0.20180831192741.bb0e35e.el7.noarch
openstack-mistral-api-7.0.1-0.20180907124420.2640c73.el7.noarch

# zaqar
#
#
()[root@undercloud /]# rpm -qa|grep zaqar
python2-zaqarclient-1.10.0-0.20180810073833.1a50023.el7.noarch
openstack-zaqar-7.0.1-0.20180908030326.1b31c7e.el7.noarch
puppet-zaqar-13.3.1-0.20180831212815.00a7f19.el7.noarch

Revision history for this message
Toure Dunnon (toure) wrote :

John, could I get the package list from your install?

Revision history for this message
John Fulton (jfulton-org) wrote :

(undercloud) [stack@undercloud ~]$ for C in mistral_executor mistral_api zaqar ; do echo $C; docker exec -ti $C rpm -qa | egrep "mistral|zaqar"; done
mistral_executor
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
python2-mistral-lib-1.0.0-0.20180730234322.d1ccfd0.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
python-mistral-7.0.0-0.20180810091000.7b5bffe.el7.noarch
openstack-mistral-executor-7.0.0-0.20180810091000.7b5bffe.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
openstack-mistral-common-7.0.0-0.20180810091000.7b5bffe.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
mistral_api
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
python2-mistral-lib-1.0.0-0.20180730234322.d1ccfd0.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
python-mistral-7.0.0-0.20180810091000.7b5bffe.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
openstack-mistral-common-7.0.0-0.20180810091000.7b5bffe.el7.noarch
openstack-mistral-api-7.0.0-0.20180810091000.7b5bffe.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
zaqar
python2-zaqarclient-1.10.0-0.20180806142547.1a50023.el7.noarch
openstack-zaqar-7.0.0-0.20180809140531.5830528.el7.noarch
puppet-mistral-13.2.0-0.20180811003807.8e79ad2.el7.noarch
puppet-zaqar-13.2.0-0.20180725005042.7f6bf39.el7.noarch
python2-mistralclient-3.7.0-0.20180806115446.f0ee48f.el7.noarch
(undercloud) [stack@undercloud ~]$

(undercloud) [stack@undercloud ~]$ mistral run-action std.echo '{"output": "Hello Workflow!"}'
ERROR (app) MessagingTimeout: Timed out waiting for a reply to message ID 245f32e07c8140c2be3a39cafdcc6a5e
(undercloud) [stack@undercloud ~]$ date
Tue Sep 18 11:07:28 UTC 2018
(undercloud) [stack@undercloud ~]$ docker ps | grep mistral
65540fb9638a docker.io/tripleomaster/centos-binary-mistral-api:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_api
75ae674fc3b5 docker.io/tripleomaster/centos-binary-mistral-engine:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_engine
c1d30cedaa84 docker.io/tripleomaster/centos-binary-mistral-event-engine:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_event_engine
d11063e58f82 docker.io/tripleomaster/centos-binary-mistral-executor:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours (healthy) mistral_executor
(undercloud) [stack@undercloud ~]$ docker ps | grep zaqar
e7b1104186c2 docker.io/tripleomaster/centos-binary-zaqar:0743a561fd1021f651e5d4d98690415bb3a6674f_2a6352ec "kolla_start" 3 weeks ago Up 23 hours zaqar_websocket
a6834c2345ab docker.io/tripleomaster/centos-binary-zaqar:0743a561fd1021f651e5d4d9869041...


Changed in tripleo:
milestone: rocky-rc2 → stein-1
tags: added: rocky-backport-potential
Revision history for this message
Toure Dunnon (toure) wrote :

Just a quick update: I have found where the problem is taking place, which is in the mistral_api service. It seems as though our messaging queue starts missing replies to executions. I have watched incoming executions; they make it to the engine and executor and are completed, but the result never makes it back to the API service. I am now debugging the WSGI service to see if there are any leads.
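
For reference, a minimal sketch of the synchronous RPC call path that times out here, using the public oslo.messaging API (the target topic, method name, and arguments are illustrative placeholders, not Mistral's actual RPC interface):

import oslo_messaging as messaging
from oslo_config import cfg

# A synchronous RPC call publishes the request and then blocks on a
# per-client reply queue. The engine and executor can finish the work,
# but if the reply lands on a queue the API process is not consuming,
# the caller raises MessagingTimeout.
transport = messaging.get_rpc_transport(cfg.CONF)
target = messaging.Target(topic='mistral.engine')  # illustrative topic name
client = messaging.RPCClient(transport, target, timeout=60)

# call() waits for the reply; cast() would be fire-and-forget and would
# not hit this timeout.
result = client.call({}, 'start_action',  # illustrative method and arguments
                     name='std.echo',
                     action_input={'output': 'Hello Workflow!'})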

Revision history for this message
Brad P. Crochet (brad-9) wrote :

I don't think Mistral API is running under WSGI.

ps axfw | grep httpd | grep mistral

Yields nothing.

Revision history for this message
Brad P. Crochet (brad-9) wrote :

Investigating https://review.openstack.org/#/c/557487/ as a possible fix.

Revision history for this message
Thomas Herve (therve) wrote :

That review is definitely not a fix as we don't use the kombu server.

Revision history for this message
Thomas Herve (therve) wrote :

https://review.openstack.org/605633 is an attempt at a fix.

Revision history for this message
Rabi Mishra (rabi) wrote :

Ah, I was also looking at the same thing today. I think I got it to work in my environment with a similar change.

diff --git a/mistral/api/service.py b/mistral/api/service.py
index 3d2aeb90..8d957711 100644
--- a/mistral/api/service.py
+++ b/mistral/api/service.py
@@ -49,6 +49,7 @@ class WSGIService(service.ServiceBase):
         # properly (e.g. message routing for synchronous calls may be based on
         # generated queue names).
         rpc_clients.cleanup()
+        rpc_base.cleanup()

         self.server.start()
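
As a hedged sketch of why this matters (not Mistral's actual modules): oslo.messaging transports and their reply queues are per-process, and the API service forks WSGI workers, so any transport cached in the parent process before the fork has to be dropped. Otherwise replies can be published to a reply queue that no surviving process consumes, which surfaces exactly as this MessagingTimeout.

import oslo_messaging as messaging
from oslo_config import cfg

_TRANSPORT = None

def get_transport():
    # Lazily create and cache a process-wide RPC transport.
    global _TRANSPORT
    if _TRANSPORT is None:
        _TRANSPORT = messaging.get_rpc_transport(cfg.CONF)
    return _TRANSPORT

def cleanup():
    # Called in the parent right before forking the API workers so each
    # child builds its own connection and reply queue after the fork.
    global _TRANSPORT
    if _TRANSPORT is not None:
        _TRANSPORT.cleanup()
        _TRANSPORT = None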

Toure Dunnon (toure)
Changed in tripleo:
status: Triaged → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 5.2.6

This issue was fixed in the openstack/mistral 5.2.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 6.0.5

This issue was fixed in the openstack/mistral 6.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 7.0.3

This issue was fixed in the openstack/mistral 7.0.3 release.

wes hayutin (weshayutin)
Changed in tripleo:
status: Fix Committed → Triaged
Revision history for this message
wes hayutin (weshayutin) wrote :

Let's keep this open until it's in the current builds.

Revision history for this message
wes hayutin (weshayutin) wrote :
tags: added: alert
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think that may correlate with the SIGHUP processing upon receiving the signal from logrotate.
Mistral engine and Heat engine have shown that they do not tolerate that signal well in my dev env, when I was testing the logrotation fixes [0]. So we fell back to a signal-less approach [1], and it was merged just today. I think things should now get fixed automagically.

[0] https://review.openstack.org/#/c/589213/
[1] https://review.openstack.org/#/c/607491/

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Apparently logrotate has nothing to do with it. The failed job shows the logrotate container was not even deployed there: http://logs.openstack.org/89/608589/5/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/317ee0f/logs/undercloud/var/log/containers/

Revision history for this message
Thomas Herve (therve) wrote :

Can we open another bug? The latest errors are WebSocket timeouts, which have nothing to do with the MessagingTimeout fixed in this bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/609941

Changed in tripleo:
assignee: Toure Dunnon (toure) → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Toure Dunnon (toure)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

> It seems that the workflow takes about 10 minutes to run, but we time out at 6 (https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/deployment.py#L52)

> About 5 minutes out of that is spent by skopeo inspect. So I suspect it's due to https://review.openstack.org/#/c/604664/

The root cause seems to be https://bugs.launchpad.net/tripleo/+bug/1797525

Changed in tripleo:
assignee: Toure Dunnon (toure) → Bogdan Dobrelya (bogdando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/609746
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=b4053ad111af873a4adbbbfeb3bf1553f71cfd8e
Submitter: Zuul
Branch: master

commit b4053ad111af873a4adbbbfeb3bf1553f71cfd8e
Author: Dougal Matthews <email address hidden>
Date: Thu Oct 11 16:36:12 2018 +0100

    Retry uploading messages to Swift up to 5 times

    This should hopefully handle short, intermittent issues uploading to
    Swift. We currently use the same retry policy when sending Zaqar
    messages.

    Related-Bug: #1789680
    Change-Id: Ibee6ba188585f80f0f7d136c81146096cb4432c2
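
The retry pattern that commit describes, as a minimal hedged sketch (not the actual tripleo-common code; the helper name, backoff policy, and the assumption that `swift` is a swiftclient Connection are illustrative):

import time

from swiftclient.exceptions import ClientException

MAX_RETRIES = 5

def upload_message(swift, container, name, payload):
    # Retry the Swift upload a bounded number of times so a short,
    # intermittent failure does not fail the whole workflow.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            swift.put_object(container, name, contents=payload)
            return
        except ClientException:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(attempt)  # simple linear backoff between attempts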

Revision history for this message
Thomas Herve (therve) wrote :

With https://review.openstack.org/#/c/609586/ in, deploy_plan goes back to about 5 minutes. That's barely under the 6-minute barrier; maybe it's worth bumping it anyway?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Good idea @Thomas! Where does it live?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/609993

Changed in tripleo:
assignee: Bogdan Dobrelya (bogdando) → Dougal Matthews (d0ugal)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (master)

Reviewed: https://review.openstack.org/609993
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=9f716d8f66640c8ff89929843884dfb9638a96bb
Submitter: Zuul
Branch: master

commit 9f716d8f66640c8ff89929843884dfb9638a96bb
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
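
The client-side pattern this commit adjusts, as a hedged sketch (the helper name and message format are illustrative, not the real tripleoclient API):

import time

DEPLOY_PLAN_TIMEOUT = 10 * 60  # seconds; was 6 * 60 before this change

def wait_for_workflow_complete(get_next_message, timeout=DEPLOY_PLAN_TIMEOUT):
    # Block on workflow progress messages until a terminal status arrives
    # or the deadline passes. If the deploy_plan workflow legitimately
    # takes ~10 minutes, a 6-minute bound fails the client even though
    # the workflow itself succeeds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        message = get_next_message(timeout=deadline - time.monotonic())
        if message and message.get('status') in ('SUCCESS', 'FAILED'):
            return message
    raise RuntimeError('Timed out waiting for the deploy_plan workflow')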

Thomas Herve (therve)
tags: removed: alert
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941
Reason: we have https://review.openstack.org/#/c/609586/ merged and hopefully no longer need the revert

Revision history for this message
Marios Andreou (marios-b) wrote :

Folks, does that mean that with the two reviews at https://review.openstack.org/609746 and https://review.openstack.org/609993 (the timeout increase and the Swift upload retry) we can close this for now?

Changed in tripleo:
status: In Progress → Fix Committed
wes hayutin (weshayutin)
Changed in tripleo:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/613623

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/rocky)

Reviewed: https://review.openstack.org/613623
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=69f8b31201273e587a69ebe5ea5e0ca5af809b68
Submitter: Zuul
Branch: stable/rocky

commit 69f8b31201273e587a69ebe5ea5e0ca5af809b68
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
    (cherry picked from commit 9f716d8f66640c8ff89929843884dfb9638a96bb)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/609941

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to python-tripleoclient (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/615866

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to python-tripleoclient (stable/queens)

Reviewed: https://review.openstack.org/615866
Committed: https://git.openstack.org/cgit/openstack/python-tripleoclient/commit/?id=ca1fd76fdd551edd4892a6d87b65369aa179634a
Submitter: Zuul
Branch: stable/queens

commit ca1fd76fdd551edd4892a6d87b65369aa179634a
Author: Dougal Matthews <email address hidden>
Date: Fri Oct 12 11:00:50 2018 +0100

    Increase the deploy_plan timeout in tripleoclient

    The time for the deploy workflow to complete has been creeping up, and
    it is getting close to the 6 minute timeout in the client. This bumps
    the timeout to 10 minutes.

    Change-Id: Iadb7cc9ba3b62a0221109b1dacf2d764944f691a
    Related-Bug: #1789680
    (cherry picked from commit 9f716d8f66640c8ff89929843884dfb9638a96bb)
    (cherry picked from commit 69f8b31201273e587a69ebe5ea5e0ca5af809b68)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 8.0.0.0b1

This issue was fixed in the openstack/mistral 8.0.0.0b1 development milestone.
