Test 'Launch instance, create snapshot, launch instance from snapshot' failed - snapshot is deleted right after the create command

Bug #1648318 reported by Dmitry Belyaninov
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Linux

Bug Description

Sometimes OSTF test [0] fails. According to the log, the snapshot is deleted right after the create command: [1]

There is a similar, already fixed issue [2], but this one seems to have a different root cause. (A sketch of the polling loop the test performs follows the references below.)

[0] Launch instance, create snapshot, launch instance from snapshot
[1]
fuel_health.test: DEBUG: Waiting for <Image: ost1_test-snapshot-135233190> to get to ACTIVE status. Currently in saving status

fuel_health.test: DEBUG: Sleeping for 10 seconds

fuel_health.test: DEBUG: Waiting for <Image: ost1_test-snapshot-135233190> to get to ACTIVE status. Currently in deleted status

[2] https://bugs.launchpad.net/fuel/+bug/1610261
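
For context, the failing check waits for the snapshot image roughly like the sketch below. This is an illustrative rewrite, not the actual fuel_health code; the client object, helper name, and timings are assumed:

    import time

    def wait_for_image_status(glance, image_id, target='active',
                              timeout=300, interval=10):
        # Poll the image until it reaches the target status. The
        # failure in [1] shows up here: the image goes from 'saving'
        # straight to 'deleted' instead of reaching 'active'.
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = glance.images.get(image_id).status
            if status == target:
                return
            if status in ('deleted', 'killed'):
                raise RuntimeError('image %s reached terminal status %s '
                                   'while waiting for %s'
                                   % (image_id, status, target))
            time.sleep(interval)
        raise RuntimeError('timed out waiting for image %s' % image_id)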

Failed jobs:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.network_templates/144/testReport/(root)/deploy_ceph_net_tmpl/deploy_ceph_net_tmpl/
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/145/testReport/(root)/ceph_partitions_repetitive_cold_restart/ceph_partitions_repetitive_cold_restart/

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Bogdan, can you check job 144?

=ERROR REPORT==== 6-Dec-2016::05:34:51 ===
Error on AMQP connection <0.14876.0> (10.200.239.1:44558 -> 10.200.239.3:5673, vhost: '/', user: 'nova', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

atop logs on the moment of failure:
http://paste.openstack.org/show/591767/

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
status: New → Confirmed
tags: added: rabbitmq
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The logs show that OSTF failed because network partitions disrupted the RabbitMQ cluster. It was self-recovering in the middle of the test, which is what failed it. But I cannot tell from the given logs whether it ever fully self-recovered or not.

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → MOS Oslo (mos-oslo)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks like a duplicate of bug 1620649; the same error message prevented nodes from joining the cluster.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The RabbitMQ restart is a red herring - it happened only in job #144 and did not happen in #145. The restart was caused by a network interface failure (see below) and luckily it did not lead to any problems.

In both cases the root cause was a network interface restart on a node during the Glance image upload (the corresponding errors can be found in nova-compute.log). Here is how it looks in haproxy.log:

job #144, node-1-10.109.5.5/var/log/haproxy.log
<134>Dec 6 05:34:28 node-1 haproxy[22661]: 10.109.6.5:47206 [06/Dec/2016:05:34:12.244] glance-api glance-api/node-2 1/0/1/-1/16236 -1 0 - - SD-- 110/0/0/0/0 0/0 "PUT /v1/images/37ffedaf-007a-40ba-a29d-79cf29775e79 HTTP/1.1"

job #145, node-1-10.109.15.4/var/log/haproxy.log
<134>Dec 7 01:59:29 node-1 haproxy[8232]: 10.109.16.5:36666 [07/Dec/2016:01:59:14.753] glance-api glance-api/node-4 0/0/2/-1/14957 -1 0 - - SD-- 125/0/0/0/0 0/0 "PUT /v1/images/0c0247d8-20bb-47b9-a6e9-a9797b926bac HTTP/1.1"

Note that in both cases the termination state is "SD--", which means the connection was unexpectedly dropped: the first character "S" indicates the server aborted the session, and the second character "D" indicates it happened during the data phase (see http://www.haproxy.org/download/1.5/doc/configuration.txt, "Session state at disconnection").
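
A minimal triage sketch for finding such lines, with the field offsets assumed from haproxy's default HTTP log format as in the excerpts above:

    import sys

    # Print haproxy log lines whose termination state begins with
    # "SD" (server aborted the session during the data phase).
    for line in sys.stdin:
        fields = line.split()
        try:
            i = next(n for n, f in enumerate(fields)
                     if f.startswith('haproxy['))
            term_state = fields[i + 10]  # e.g. "SD--"
        except (StopIteration, IndexError):
            continue
        if term_state.startswith('SD'):
            print(line.rstrip())

Run it as "python haproxy_sd.py < /var/log/haproxy.log" on the affected node.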

The root cause in both cases was an unexpected network interface failure:

job #144, node-1-10.109.5.5/var/log/messages
<6>Dec 6 05:34:24 node-1 kernel: [ 3404.185648] NETDEV WATCHDOG: enp0s6 (e1000): transmit queue 0 timed out

job #145, node-4-10.109.15.5/var/log/messages
<6>Dec 7 01:59:21 node-1 kernel: [ 741.062232] NETDEV WATCHDOG: enp0s5 (e1000): transmit queue 0 timed out

During these job runs the interface restarts occurred only during image upload, which suggests that our networking stack cannot sustain high throughput. Googling "NETDEV WATCHDOG: transmit queue 0 timed out" shows that we are not alone and that there are suggestions on how to fix it depending on kernel version, driver, etc.
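
For whoever picks this up, a quick way to correlate these events with the haproxy disconnect times is to pull the watchdog lines out of /var/log/messages with their timestamps; a minimal sketch, with the syslog layout assumed from the excerpts above:

    import re
    import sys

    # Extract NETDEV WATCHDOG events (interface, driver, queue) with
    # their syslog timestamps, e.g. from "<6>Dec 6 05:34:24 node-1
    # kernel: ... NETDEV WATCHDOG: enp0s6 (e1000): transmit queue 0
    # timed out".
    WATCHDOG = re.compile(r'NETDEV WATCHDOG: (\S+) \(([^)]+)\): '
                          r'transmit queue (\d+) timed out')

    for line in sys.stdin:
        m = WATCHDOG.search(line)
        if m:
            iface, driver, queue = m.groups()
            # Timestamp is the first three fields; strip the "<6>"
            # syslog priority prefix.
            stamp = ' '.join(line.split()[:3]).lstrip('<0123456789>')
            print('%s iface=%s driver=%s queue=%s'
                  % (stamp, iface, driver, queue))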

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Linux team, it seems like this is your call; see my comment above about the network interface restart. By the way, the severity of the issue is not clear, as so far it happens only on CI VMs. But it would be nice to fix CI anyway.

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → MOS Linux (mos-linux)
tags: added: swarm-fail
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Need more time to investigate and fix the issue; retargeting to 9.3.

Changed in fuel:
milestone: 9.2 → 9.3
tags: added: move-to-9.3
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

One more repro here:
https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.mixed_os_components/40/

see node-1-10.109.15.4/var/log/messages:
<6>Jan 29 19:11:15 node-1 kernel: [ 5026.455659] NETDEV WATCHDOG: enp0s5 (e1000): transmit queue 0 timed out

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Hasn't been reproduced for 2+ months, moving to Invalid.

Changed in fuel:
status: Confirmed → Invalid