Test 'Launch instance, create snapshot, launch instance from snapshot' failed - snapshot is deleted right after the create command

Bug #1648318 reported by Dmitry Belyaninov
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Linux

Bug Description

Sometimes OSTF test [0] fails. According to the log, the snapshot is deleted right after the create command: [1]

There is a similar, already fixed issue [2], but this one seems to have a different root cause. (A sketch of the polling loop the test performs follows the references below.)

[0] Launch instance, create snapshot, launch instance from snapshot
[1]
fuel_health.test: DEBUG: Waiting for <Image: ost1_test-snapshot-135233190> to get to ACTIVE status. Currently in saving status

fuel_health.test: DEBUG: Sleeping for 10 seconds

fuel_health.test: DEBUG: Waiting for <Image: ost1_test-snapshot-135233190> to get to ACTIVE status. Currently in deleted status

[2] https://bugs.launchpad.net/fuel/+bug/1610261
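
For context, the failing check waits for the snapshot image roughly like the sketch below. This is an illustrative rewrite, not the actual fuel_health code; the client object, helper name, and timings are assumed:

    import time

    def wait_for_image_status(glance, image_id, target='active',
                              timeout=300, interval=10):
        # Poll the image until it reaches the target status. The
        # failure in [1] shows up here: the image goes from 'saving'
        # straight to 'deleted' instead of reaching 'active'.
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = glance.images.get(image_id).status
            if status == target:
                return
            if status in ('deleted', 'killed'):
                raise RuntimeError('image %s reached terminal status %s '
                                   'while waiting for %s'
                                   % (image_id, status, target))
            time.sleep(interval)
        raise RuntimeError('timed out waiting for image %s' % image_id)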

Failed jobs:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.network_templates/144/testReport/(root)/deploy_ceph_net_tmpl/deploy_ceph_net_tmpl/
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/145/testReport/(root)/ceph_partitions_repetitive_cold_restart/ceph_partitions_repetitive_cold_restart/

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Bogdan, can you check job 144?

=ERROR REPORT==== 6-Dec-2016::05:34:51 ===
Error on AMQP connection <0.14876.0> (10.200.239.1:44558 -> 10.200.239.3:5673, vhost: '/', user: 'nova', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

atop logs on the moment of failure:
http://paste.openstack.org/show/591767/

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
status: New → Confirmed
tags: added: rabbitmq
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The logs show that OSTF failed because network partitions disrupted the RabbitMQ cluster. It was self-recovering in the middle of the test, which is what failed it. But I cannot tell from the given logs whether it ever fully self-recovered or not.

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → MOS Oslo (mos-oslo)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Looks like a duplicate of bug 1620649; the same error message prevented nodes from joining the cluster.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The RabbitMQ restart is a red herring - it happened only in job #144 and did not happen in #145. The restart was caused by a network interface failure (see below) and luckily it did not lead to any problems.

In both cases the root cause was a network interface restart on a node during the Glance image upload (the corresponding errors can be found in nova-compute.log). Here is how it looks in haproxy.log:

job #144, node-1-10.109.5.5/var/log/haproxy.log
<134>Dec 6 05:34:28 node-1 haproxy[22661]: 10.109.6.5:47206 [06/Dec/2016:05:34:12.244] glance-api glance-api/node-2 1/0/1/-1/16236 -1 0 - - SD-- 110/0/0/0/0 0/0 "PUT /v1/images/37ffedaf-007a-40ba-a29d-79cf29775e79 HTTP/1.1"

job #145, node-1-10.109.15.4/var/log/haproxy.log
<134>Dec 7 01:59:29 node-1 haproxy[8232]: 10.109.16.5:36666 [07/Dec/2016:01:59:14.753] glance-api glance-api/node-4 0/0/2/-1/14957 -1 0 - - SD-- 125/0/0/0/0 0/0 "PUT /v1/images/0c0247d8-20bb-47b9-a6e9-a9797b926bac HTTP/1.1"

Note that in both cases the termination state is "SD--", which means the connection was unexpectedly dropped: the first character "S" indicates the server aborted the session, and the second character "D" indicates it happened during the data phase (see http://www.haproxy.org/download/1.5/doc/configuration.txt, "Session state at disconnection").
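
A minimal triage sketch for finding such lines, with the field offsets assumed from haproxy's default HTTP log format as in the excerpts above:

    import sys

    # Print haproxy log lines whose termination state begins with
    # "SD" (server aborted the session during the data phase).
    for line in sys.stdin:
        fields = line.split()
        try:
            i = next(n for n, f in enumerate(fields)
                     if f.startswith('haproxy['))
            term_state = fields[i + 10]  # e.g. "SD--"
        except (StopIteration, IndexError):
            continue
        if term_state.startswith('SD'):
            print(line.rstrip())

Run it as "python haproxy_sd.py < /var/log/haproxy.log" on the affected node.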

The root cause in both cases was an unexpected network interface failure:

job #144, node-1-10.109.5.5/var/log/messages
<6>Dec 6 05:34:24 node-1 kernel: [ 3404.185648] NETDEV WATCHDOG: enp0s6 (e1000): transmit queue 0 timed out

job #145, node-4-10.109.15.5/var/log/messages
<6>Dec 7 01:59:21 node-1 kernel: [ 741.062232] NETDEV WATCHDOG: enp0s5 (e1000): transmit queue 0 timed out

During these job runs the interface restarts occurred only during image upload, which suggests that our networking stack cannot sustain high throughput. Googling "NETDEV WATCHDOG: transmit queue 0 timed out" shows that we are not alone and that there are suggestions on how to fix it depending on kernel version, driver, etc.
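
For whoever picks this up, a quick way to correlate these events with the haproxy disconnect times is to pull the watchdog lines out of /var/log/messages with their timestamps; a minimal sketch, with the syslog layout assumed from the excerpts above:

    import re
    import sys

    # Extract NETDEV WATCHDOG events (interface, driver, queue) with
    # their syslog timestamps, e.g. from "<6>Dec 6 05:34:24 node-1
    # kernel: ... NETDEV WATCHDOG: enp0s6 (e1000): transmit queue 0
    # timed out".
    WATCHDOG = re.compile(r'NETDEV WATCHDOG: (\S+) \(([^)]+)\): '
                          r'transmit queue (\d+) timed out')

    for line in sys.stdin:
        m = WATCHDOG.search(line)
        if m:
            iface, driver, queue = m.groups()
            # Timestamp is the first three fields; strip the "<6>"
            # syslog priority prefix.
            stamp = ' '.join(line.split()[:3]).lstrip('<0123456789>')
            print('%s iface=%s driver=%s queue=%s'
                  % (stamp, iface, driver, queue))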

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Linux team, it seems like this is your call; see my comment above about the network interface restart. By the way, the severity of the issue is not clear, as so far it happens only on CI VMs. But it would be nice to fix CI anyway.

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → MOS Linux (mos-linux)
tags: added: swarm-fail
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Need more time to investigate and fix the issue; retargeting to 9.3.

Changed in fuel:
milestone: 9.2 → 9.3
tags: added: move-to-9.3
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

One more repro here:
https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.mixed_os_components/40/

see node-1-10.109.15.4/var/log/messages:
<6>Jan 29 19:11:15 node-1 kernel: [ 5026.455659] NETDEV WATCHDOG: enp0s5 (e1000): transmit queue 0 timed out

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

Hasn't been reproduced for 2+ months, moving to Invalid.

Changed in fuel:
status: Confirmed → Invalid