Cleaning up deleted instances leaks resources

Bug #1714247 reported by Lucian Petrut
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

When the '_cleanup_running_deleted_instances' nova-compute manager periodic task cleans up an instance that still exists on the host despite having been deleted from the DB, the corresponding network info is not properly retrieved. For this reason, vif ports will not be cleaned up.

In this situation there may also be stale volume connections. Those will be leaked as well, since os-brick attempts to flush the now-inaccessible devices, which fails. As per a recent os-brick change, a 'force' flag must be set in order to ignore flush errors.

Log: http://paste.openstack.org/raw/620048/
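An illustrative sketch of the vif-leak mechanism described above (this is not nova code; all class and function names here are made up for the example). The cleanup path reads the instance's network info from a cached copy kept with the instance record; once the instance is gone from the DB, that cache yields no VIF entries, so nothing gets unplugged:

```python
class FakeInfoCache:
    """Stands in for nova's per-instance network info cache."""
    def __init__(self, vifs):
        self.network_info = list(vifs)

class FakeInstance:
    def __init__(self, deleted, info_cache):
        self.deleted = deleted
        self.info_cache = info_cache

def get_network_info(instance):
    # Mirrors the behaviour described in the bug: an instance deleted
    # from the DB has no usable info cache, so an empty list comes back.
    if instance.deleted or instance.info_cache is None:
        return []
    return instance.info_cache.network_info

def unplug_vifs(instance):
    vifs = get_network_info(instance)
    for vif in vifs:
        print("unplugging %s" % vif)
    return vifs

live = FakeInstance(False, FakeInfoCache(["ovs-port-1"]))
gone = FakeInstance(True, FakeInfoCache([]))  # deleted from the DB
assert unplug_vifs(live) == ["ovs-port-1"]
assert unplug_vifs(gone) == []  # leaked: the port is never unplugged
```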

Tags: compute
Revision history for this message
Matt Riedemann (mriedem) wrote :

Isn't this fixed by https://review.openstack.org/#/c/486955/ so it's a duplicate of bug 1705683?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Well, maybe this is something else. Can you be more specific? Are you talking about a periodic task in the compute manager?

tags: added: compute
Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

Indeed, I was talking about the compute manager periodic task, sorry for being vague. Updating the description.

description: updated
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

Can you be more specific about where we're not passing through network info or terminating block device connections? Looking at the periodic task code, when it shuts down an instance to delete it locally, we call here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L6764-L6766

And the block_device_info and network_info are passed through here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2278

And volume connections should be terminated here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2307

So what's missing? Is there a stacktrace in the nova-compute logs when this fails?

Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

The network info fetched here [1] will be empty as it relies on the info cache [2], which won't contain vif details for deleted instances, as far as I can see.

There won't be any trace in the logs; it's just that those vifs will not be unplugged. It's really easy to reproduce, all you have to do is:
1. boot an instance, while using ovs ports
2. kill the nova compute service and wait for it to be reported as 'down'.
3. destroy the instance
4. bring the nova compute service back up. It will destroy the instance, but the ports will not be unplugged. If iSCSI volumes were attached to that instance, the corresponding iSCSI sessions will be leaked as well.

As for the volume connections, the BDMs are properly fetched. But if the nova compute service comes back up after the instances have been deleted from the DB and the volumes disconnected on the Cinder side, we end up having stale iSCSI sessions (of course, only if using an iSCSI backend). The issue is that os-brick attempts to flush an inaccessible device, which fails. Until recently it would not error out; it moved on and removed the iSCSI session anyway. But as per a recent change [3], a 'force' flag is now required in order to ignore flush exceptions.

[1] https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2264
[2] https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/objects/instance.py#L1169-L1172
[3] https://github.com/openstack/os-brick/commit/400ca5d6db818b966065001571e59198c6966e2f
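A minimal sketch of the flush/force behaviour described above (this is not os-brick code; the names and signatures are invented for illustration). Without a 'force'-style flag, the flush failure propagates and the session teardown is never reached:

```python
class FlushError(Exception):
    pass

def flush_device(device_accessible):
    # Flushing an inaccessible device fails -- the failure mode hit when
    # the volume was already disconnected on the Cinder side.
    if not device_accessible:
        raise FlushError("cannot flush: device is gone")

def disconnect_volume(device_accessible, force=False):
    """Return True once the (fake) iSCSI session has been torn down."""
    try:
        flush_device(device_accessible)
    except FlushError:
        if not force:
            raise  # teardown below is never reached -> session leaked
    # ...remove the iSCSI session here...
    return True

assert disconnect_volume(True) is True               # healthy device
assert disconnect_volume(False, force=True) is True  # flush error ignored
leaked = False
try:
    disconnect_volume(False)  # no force: exception propagates
except FlushError:
    leaked = True
assert leaked
```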

Changed in nova:
assignee: nobody → Xuanzhou Perry Dong (oss-xzdong)
Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Tested in the latest master branch using devstack:

1. vif is unplugged

See logs in: http://paste.openstack.org/show/623286/

2. no stale iscsi session

See logs in: http://paste.openstack.org/show/623288/

Hi, Lucian,

Could you check the logs to see if you do things differently?

BR/Perry

Changed in nova:
status: New → Invalid
Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

This happens when cleaning up instances that were deleted while the nova-compute service was down. I've checked your paste and I think that the missing step is exactly that: stopping the nova-compute service before deleting the instance and turning it back on afterwards.

Thanks for looking into this.

Lucian

Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Thanks for the response. I have stopped and started the nova-compute service. The restart of the nova-compute service is shown in the log (I am not sure why the stopping of the service is not shown; probably I should use "raw").

stack@devstack01:~/devstack$ systemctl start <email address hidden>
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start '<email address hidden>'.
Authenticating as: stack,,, (stack)
Password:
==== AUTHENTICATION COMPLETE ===

BR/Perry

Changed in nova:
status: Invalid → Incomplete
Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Attaching a script log as a file so that the log won't get truncated.

Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

I've added a cmd log that describes the issue and helps recreate it. I tried to keep it as simple as possible.

Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Lucian, thanks. I can reproduce the issue now.

Changed in nova:
status: Incomplete → Confirmed
Matt Riedemann (mriedem)
Changed in nova:
assignee: Xuanzhou Perry Dong (oss-xzdong) → nobody