VM gets stuck in deleting state during bulk delete

Bug #1551988 reported by Andrii Petrenko
This bug affects 1 person
Affects             Status   Importance  Assigned to    Milestone
Mirantis OpenStack  Invalid  High        MOS Nova
  6.1.x             Invalid  High        Denis Puchkin

Bug Description

A VM gets stuck during bulk deletion. The customer is able to reproduce this 99% of the time: if we try to delete 12 VMs simultaneously, at least one of them gets stuck. The log file from nova-compute is attached.

<182>Mar 1 20:06:28 host-049 nova-compute Instance destroyed successfully.
<182>Mar 1 20:06:28 host-049 nova-compute During sync_power_state the instance has a pending task (powering-off). Skip.
<182>Mar 1 20:06:29 host-049 nova-compute Task possibly preempted: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting
<179>Mar 1 20:06:29 host-049 nova-compute Exception during message handling: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting
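A minimal reproduction sketch (not from the report), assuming python-novaclient's legacy positional v2 constructor and placeholder credentials, instance names, and endpoint; it deletes a batch of test VMs concurrently and then polls for instances that never disappear:

    # Hypothetical reproduction helper; credentials, auth URL and the
    # "bulk-test-" name prefix are assumptions, not from the bug report.
    import time
    from concurrent.futures import ThreadPoolExecutor

    from novaclient import client as nova_client

    # Assumed legacy signature: (version, username, api_key, project, auth_url)
    nova = nova_client.Client('2', 'admin', 'secret', 'admin',
                              'http://controller:5000/v2.0')

    servers = [s for s in nova.servers.list()
               if s.name.startswith('bulk-test-')]

    # Fire all DELETEs at once, mimicking the customer's bulk delete.
    with ThreadPoolExecutor(max_workers=max(len(servers), 1)) as pool:
        list(pool.map(lambda s: nova.servers.delete(s), servers))

    # Anything still listed after ten minutes is considered "stuck".
    remaining = []
    deadline = time.time() + 600
    while time.time() < deadline:
        remaining = [s for s in nova.servers.list()
                     if s.name.startswith('bulk-test-')]
        if not remaining:
            break
        time.sleep(15)

    for s in remaining:
        print('still present: %s task_state=%s'
              % (s.id, getattr(s, 'OS-EXT-STS:task_state', None)))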

Revision history for this message
Andrii Petrenko (aplsms) wrote :
tags: added: customer-found support
Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
Dina Belova (dbelova)
Changed in mos:
milestone: none → 6.1-updates
importance: Undecided → High
status: New → Confirmed
tags: added: area-nova
Revision history for this message
Andrew Laski (alaski) wrote :

Looking at the nova-compute.log attached I see that there is a failure for "UnexpectedDeletingTaskStateError: Unexpected task state: expecting [u'powering-off'] but the actual state is deleting". That indicates that an attempt was made to stop the instance while it was deleting, which failed.

I do not see any errors around the delete in the logs so it's not clear what may be blocking that. But there is a failure around trying to stop an instance while deleting it, which is expected. You can't stop an instance that's being deleted.
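For context, a schematic sketch (not Nova's actual code) of the compare-and-swap guard that produces this error: the stop path tries to save the instance expecting task_state 'powering-off', and if a concurrent delete has already flipped it to 'deleting', the save is rejected.

    # Illustrative only: mimics the behaviour of Nova's
    # instance.save(expected_task_state=...) guard; names are simplified.
    class UnexpectedDeletingTaskStateError(Exception):
        pass

    class FakeInstance(object):
        def __init__(self, task_state=None):
            self.task_state = task_state

        def save(self, expected_task_state=None):
            # Reject the update if another request changed task_state meanwhile.
            if (expected_task_state is not None
                    and self.task_state not in expected_task_state):
                raise UnexpectedDeletingTaskStateError(
                    "Unexpected task state: expecting %r but the actual "
                    "state is %s" % (expected_task_state, self.task_state))
            # ...persist the new state here...

    inst = FakeInstance(task_state='powering-off')   # stop request in flight
    inst.task_state = 'deleting'                     # concurrent DELETE wins the race
    try:
        inst.save(expected_task_state=['powering-off'])  # the stop path's update
    except UnexpectedDeletingTaskStateError as exc:
        print(exc)  # matches the message seen in the log excerpt above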

Revision history for this message
Andrii Petrenko (aplsms) wrote :

Dear colleagues,

Could you please check whether this bug, https://bugs.launchpad.net/mos/+bug/1371723, is related to this case?

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Andrii, that issue must have been fixed prior to 6.1. If you are still seeing reconnects, these must be for other reasons.

I sent you an email; could you please at least share the logs privately? I'm mostly interested in tracing the DELETE request (by its request id in the logs) from nova-api to the particular nova-compute to see why it wasn't completed.
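A throwaway tracing helper, as a sketch: it assumes the logs live under /var/log/nova/ and uses a placeholder request id; it simply collects every log line that mentions the id so the DELETE can be followed from nova-api down to the target nova-compute.

    # Hypothetical helper; path and request id below are placeholders.
    import glob

    REQ_ID = 'req-00000000-0000-0000-0000-000000000000'  # substitute the real id

    for path in sorted(glob.glob('/var/log/nova/*.log')):
        with open(path) as fh:
            for line in fh:
                if REQ_ID in line:
                    print('%s: %s' % (path, line.rstrip()))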

Revision history for this message
Andrii Petrenko (aplsms) wrote :

Thank you, Roman

1. The customer has MOS 6.1 MU4.
2. It happens every time they delete more than 10-12 VMs simultaneously.
3. A link to the logs was provided via Slack direct message. Please let me know if you need any extra logs.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I've taken a look at the logs Andrii provided and see the following (there is no sensitive data, so it must be ok to post snippets here):

- there were 5 DELETEs triggered: http://paste.openstack.org/show/489164/

- for all requests except req-dea0d041-aef1-458e-a04e-2697caec2155 I found corresponding entries in the nova-compute logs which denote successful instance termination

- for this specific request (req-dea0d041-aef1-458e-a04e-2697caec2155) there are no entries in the nova-compute logs (nor for the corresponding instance id)

- there are many heartbeat timeouts in oslo.messaging, which cause reconnects to RabbitMQ: http://paste.openstack.org/show/489168/

^ the latter is interesting: from the logs we can see that nova-compute got "stuck" for a couple of minutes during or right after deleting an instance - the periodic task that should run every minute was skipped for 3 minutes. It's not clear from the logs what nova-compute did during this time, but it looks like it was stuck and didn't send heartbeats to RabbitMQ.
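A rough sketch of how such a silent window can be spotted: report gaps longer than 60 seconds between consecutive nova-compute log lines. The timestamp format is inferred from the snippets above, the year is assumed (syslog-style timestamps omit it), and the file name is a placeholder.

    # Hypothetical gap finder; log file name, year and threshold are assumptions.
    import re
    from datetime import datetime

    PATTERN = re.compile(r'([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2})')
    THRESHOLD = 60  # seconds of silence worth reporting

    prev = None
    with open('nova-compute.log') as fh:
        for line in fh:
            m = PATTERN.search(line)
            if not m:
                continue
            ts = datetime.strptime('2016 ' + m.group(1), '%Y %b %d %H:%M:%S')
            if prev is not None and (ts - prev).total_seconds() > THRESHOLD:
                print('gap of %ds before: %s'
                      % (int((ts - prev).total_seconds()), line.rstrip()))
            prev = ts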

Changed in mos:
milestone: 6.1-updates → 6.1-mu-6
Revision history for this message
Denis Puchkin (dpuchkin) wrote :

Incomplete for 6.1 because there are no clear steps to reproduce.

Andrii, does the customer still have problems with bulk deleting, or have you found a workaround?

Revision history for this message
Andrii Petrenko (aplsms) wrote :

I have no update from the customer about this issue right now. We are working to reduce the load on the controller. I will let you know in case I have any update. The next sync-up is on Wednesday, May 4.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Removed from 6.1-mu-6 scope as Incomplete

Changed in mos:
milestone: 6.1-mu-6 → 6.1-updates
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Setting to Invalid as no additional information was provided

Changed in mos:
status: Confirmed → Invalid