Comment 17 for bug 1461562

Dennis Dmitriev (ddmitriev) wrote :

Bug https://bugs.launchpad.net/fuel/+bug/1491375 has been reproduced many times, on almost half of the 7.0.swarm jobs.

For example, CI job:
https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.ha_destructive_ceph_neutron/45/console

##### rabbitmq log:

=ERROR REPORT==== 3-Sep-2015::04:25:24 ===
closing AMQP connection <0.345.0> (10.109.0.2:40620 -> 10.109.0.2:5672):
{heartbeat_timeout,running}

=INFO REPORT==== 3-Sep-2015::04:25:31 ===
accepting AMQP connection <0.2438.0> (10.109.0.2:44064 -> 10.109.0.2:5672)
################

This particular occurrence was most likely caused by the huge size of a message that was cast from Astute to Nailgun.
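RabbitMQ logged `{heartbeat_timeout,running}`, i.e. the broker gave up on the connection because it stopped hearing from the client. A simplified, hypothetical model of that check follows. AMQP 0-9-1 says a peer that detects no incoming traffic for two heartbeat intervals or longer should close the connection. The model assumes that while the client's I/O loop is stuck (for example, busy with a multi-megabyte message) it sends no frames at all. The helper names and the 2 s interval are illustrative assumptions, not values taken from the Astute or Fuel configuration.

```python
from typing import List, Optional


def client_frames(interval: float, block_start: float, block_len: float, end: float) -> List[float]:
    """Hypothetical client: sends a heartbeat every `interval` seconds, except
    during [block_start, block_start + block_len), when its loop is assumed stuck."""
    times = []
    k = 0
    while k * interval < end:
        t = k * interval
        if not (block_start <= t < block_start + block_len):
            times.append(t)
        k += 1
    return times


def first_timeout(frame_times: List[float], interval: float, end: float) -> Optional[float]:
    """Simplified broker-side check (AMQP 0-9-1 rule): close the connection once no
    traffic has arrived for two heartbeat intervals. Returns the close time, or
    None if the connection survives until `end`."""
    limit = 2 * interval
    last = 0.0
    for t in sorted(frame_times) + [end]:
        if t - last >= limit:
            return last + limit
        last = t
    return None


if __name__ == "__main__":
    hb = 2.0  # hypothetical heartbeat interval, seconds
    print(first_timeout(client_frames(hb, 10, 6, 30), hb, 30))  # loop stuck for 6 s -> 12.0
    print(first_timeout(client_frames(hb, 0, 0, 30), hb, 30))   # never stuck -> None
```

With these assumed numbers, a stall longer than two intervals is enough for the broker to drop an otherwise healthy connection, which matches the shape of the log above. The real intervals and stall duration in the CI job are not known from this comment.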

Scenario of the issue:

1) Deploy cluster

2) When the wait timeout was reached in the system test (while cluster deployment was still in progress!), the system test started a task to create a diagnostic snapshot (time: 04:20)

3) The diagnostic snapshot was successfully created, and then a message of more than 4 MB was cast to Nailgun (see the attachment) (time: 04:25:20)

4) RabbitMQ closed the connection (time: 04:25:24)

5) Astute, which was still processing the deployment, reconnected to RabbitMQ and continued deploying:
## astute.log
2015-09-03T04:25:26 warning: [679] Trying to reconnect to message broker. Retry 5 sec later...
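The timestamps quoted above can be lined up to check the sequence. The sketch below parses them in their original formats and prints offsets from the cast in step 3. It assumes both logs use the same clock and time zone, and the 04:25:20 cast time comes from the narrative, not from a quoted log line. Astute's "Retry 5 sec later" at 04:25:26 points to 04:25:31, the time RabbitMQ accepted connection <0.2438.0>; that is consistent with, but does not prove, the new connection being Astute's.

```python
from datetime import datetime, timedelta

# Timestamp format used by the RabbitMQ log, e.g. "3-Sep-2015::04:25:24".
RABBIT_FMT = "%d-%b-%Y::%H:%M:%S"

# Times quoted in this comment. Assumption: both logs share one clock/time zone.
EVENTS = {
    "snapshot message cast to Nailgun": datetime(2015, 9, 3, 4, 25, 20),  # step 3, no log line quoted
    "RabbitMQ closes connection <0.345.0>": datetime.strptime("3-Sep-2015::04:25:24", RABBIT_FMT),
    "Astute logs reconnect warning": datetime.fromisoformat("2015-09-03T04:25:26"),
    "RabbitMQ accepts connection <0.2438.0>": datetime.strptime("3-Sep-2015::04:25:31", RABBIT_FMT),
}


def timeline(events):
    """Return (event, seconds since the earliest event) pairs in chronological order."""
    ordered = sorted(events.items(), key=lambda item: item[1])
    start = ordered[0][1]
    return [(name, (ts - start).total_seconds()) for name, ts in ordered]


if __name__ == "__main__":
    for name, offset in timeline(EVENTS):
        print(f"+{offset:>4.0f} s  {name}")
    # Astute says it retries 5 s after the warning; compare with the accepted connection.
    retry_at = EVENTS["Astute logs reconnect warning"] + timedelta(seconds=5)
    print("retry matches accept:", retry_at == EVENTS["RabbitMQ accepts connection <0.2438.0>"])
```

The offsets come out as 0, 4, 6 and 11 seconds: the broker dropped the connection about 4 s after the large cast, and Astute started reconnecting 2 s later.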