VM gets stuck in deleting state during bulk delete

Bug #1551988 reported by Andrii Petrenko
This bug affects 1 person
Affects             Status   Importance  Assigned to    Milestone
Mirantis OpenStack  Invalid  High        MOS Nova
  6.1.x             Invalid  High        Denis Puchkin

Bug Description

A VM gets stuck during bulk deletion. The customer is able to reproduce this 99% of the time: if we try to delete 12 VMs simultaneously, at least one of them gets stuck. The log file from nova-compute is attached.

<182>Mar 1 20:06:28 host-049 nova-compute Instance destroyed successfully.
<182>Mar 1 20:06:28 host-049 nova-compute During sync_power_state the instance has a pending task (powering-off). Skip.
<182>Mar 1 20:06:29 host-049 nova-compute Task possibly preempted: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting
<179>Mar 1 20:06:29 host-049 nova-compute Exception during message handling: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting
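A minimal reproduction sketch (not from the report), assuming python-novaclient's legacy positional v2 constructor and placeholder credentials, instance names, and endpoint; it deletes a batch of test VMs concurrently and then polls for instances that never disappear:

    # Hypothetical reproduction helper; credentials, auth URL and the
    # "bulk-test-" name prefix are assumptions, not from the bug report.
    import time
    from concurrent.futures import ThreadPoolExecutor

    from novaclient import client as nova_client

    # Assumed legacy signature: (version, username, api_key, project, auth_url)
    nova = nova_client.Client('2', 'admin', 'secret', 'admin',
                              'http://controller:5000/v2.0')

    servers = [s for s in nova.servers.list()
               if s.name.startswith('bulk-test-')]

    # Fire all DELETEs at once, mimicking the customer's bulk delete.
    with ThreadPoolExecutor(max_workers=max(len(servers), 1)) as pool:
        list(pool.map(lambda s: nova.servers.delete(s), servers))

    # Anything still listed after ten minutes is considered "stuck".
    remaining = []
    deadline = time.time() + 600
    while time.time() < deadline:
        remaining = [s for s in nova.servers.list()
                     if s.name.startswith('bulk-test-')]
        if not remaining:
            break
        time.sleep(15)

    for s in remaining:
        print('still present: %s task_state=%s'
              % (s.id, getattr(s, 'OS-EXT-STS:task_state', None)))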

Revision history for this message
Andrii Petrenko (aplsms) wrote :
tags: added: customer-found support
Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
Dina Belova (dbelova)
Changed in mos:
milestone: none → 6.1-updates
importance: Undecided → High
status: New → Confirmed
tags: added: area-nova
Revision history for this message
Andrew Laski (alaski) wrote :

Looking at the nova-compute.log attached I see that there is a failure for "UnexpectedDeletingTaskStateError: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting". That indicates that an attempt was made to stop the instance while it was deleting, which failed.

I do not see any errors around the delete in the logs so it's not clear what may be blocking that. But there is a failure around trying to stop an instance while deleting it, which is expected. You can't stop an instance that's being deleted.
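For context, a schematic sketch (not Nova's actual code) of the compare-and-swap guard that produces this error: the stop path tries to save the instance expecting task_state 'powering-off', and if a concurrent delete has already flipped it to 'deleting', the save is rejected.

    # Illustrative only: mimics the behaviour of Nova's
    # instance.save(expected_task_state=...) guard; names are simplified.
    class UnexpectedDeletingTaskStateError(Exception):
        pass

    class FakeInstance(object):
        def __init__(self, task_state=None):
            self.task_state = task_state

        def save(self, expected_task_state=None):
            # Reject the update if another request changed task_state meanwhile.
            if (expected_task_state is not None
                    and self.task_state not in expected_task_state):
                raise UnexpectedDeletingTaskStateError(
                    "Unexpected task state: expecting %r but the actual "
                    "state is %s" % (expected_task_state, self.task_state))
            # ...persist the new state here...

    inst = FakeInstance(task_state='powering-off')   # stop request in flight
    inst.task_state = 'deleting'                     # concurrent DELETE wins the race
    try:
        inst.save(expected_task_state=['powering-off'])  # the stop path's update
    except UnexpectedDeletingTaskStateError as exc:
        print(exc)  # matches the message seen in the log excerpt above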

Revision history for this message
Andrii Petrenko (aplsms) wrote :

Dear colleagues,

Could you please check whether this bug, https://bugs.launchpad.net/mos/+bug/1371723, is related to this case?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Andrii, that issue must have been fixed prior to 6.1. If you are still seeing reconnects, these must be for other reasons.

I sent you an email; could you please at least share the logs privately? I'm mostly interested in tracing the DELETE request (by its request id in the logs) from nova-api to the particular nova-compute to see why it wasn't completed.
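A throwaway tracing helper, as a sketch: it assumes the logs live under /var/log/nova/ and uses a placeholder request id; it simply collects every log line that mentions the id so the DELETE can be followed from nova-api down to the target nova-compute.

    # Hypothetical helper; path and request id below are placeholders.
    import glob

    REQ_ID = 'req-00000000-0000-0000-0000-000000000000'  # substitute the real id

    for path in sorted(glob.glob('/var/log/nova/*.log')):
        with open(path) as fh:
            for line in fh:
                if REQ_ID in line:
                    print('%s: %s' % (path, line.rstrip()))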

Revision history for this message
Andrii Petrenko (aplsms) wrote :

Thank you, Roman

1. The customer has MOS 6.1 MU4.
2. It happens every time they delete more than 10-12 VMs simultaneously.
3. A link to the logs was provided via Slack direct message. Please let me know if you need any extra logs.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I've taken a look at the logs Andrii provided and see the following (there is no sensitive data, so it must be ok to post snippets here):

- there were 5 DELETEs triggered: http://paste.openstack.org/show/489164/

- for all requests except req-dea0d041-aef1-458e-a04e-2697caec2155 I found corresponding entries in the nova-compute logs which denote successful instance termination

- for this specific request (req-dea0d041-aef1-458e-a04e-2697caec2155) there are no entries in the nova-compute logs (nor for the corresponding instance id)

- there are many heartbeat timeouts in oslo.messaging, which cause reconnects to RabbitMQ: http://paste.openstack.org/show/489168/

^ the latter is interesting: from the logs we can see that nova-compute got "stuck" for a couple of minutes during or right after deleting an instance - the periodic task that should run every minute was skipped for 3 minutes. It's not clear from the logs what nova-compute did during this time, but it looks like it was stuck and didn't send heartbeats to RabbitMQ.
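A rough sketch of how such a silent window can be spotted: report gaps longer than 60 seconds between consecutive nova-compute log lines. The timestamp format is inferred from the snippets above, the year is assumed (syslog-style timestamps omit it), and the file name is a placeholder.

    # Hypothetical gap finder; log file name, year and threshold are assumptions.
    import re
    from datetime import datetime

    PATTERN = re.compile(r'([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2})')
    THRESHOLD = 60  # seconds of silence worth reporting

    prev = None
    with open('nova-compute.log') as fh:
        for line in fh:
            m = PATTERN.search(line)
            if not m:
                continue
            ts = datetime.strptime('2016 ' + m.group(1), '%Y %b %d %H:%M:%S')
            if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
                print('gap of %ds before: %s'
                      % (int((ts - prev).total_seconds()), line.rstrip()))
            prev = ts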

Changed in mos:
milestone: 6.1-updates → 6.1-mu-6
Revision history for this message
Denis Puchkin (dpuchkin) wrote :

Incomplete for 6.1 because there are no clear steps to reproduce.

Andrii, does the customer still have problems with bulk deleting, or have you found a workaround?

Revision history for this message
Andrii Petrenko (aplsms) wrote :

I have no update from the customer about this issue right now. We are working to reduce the load on the controller. I will let you know in case I have any update. The next sync-up is on Wednesday, May 4.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Removed from 6.1-mu-6 scope as Incomplete

Changed in mos:
milestone: 6.1-mu-6 → 6.1-updates
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Setting to Invalid as no additional information was provided

Changed in mos:
status: Confirmed → Invalid