Cleaning up deleted instances leaks resources

Bug #1714247 reported by Lucian Petrut
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

When the '_cleanup_running_deleted_instances' nova-compute manager periodic task cleans up an instance that still exists on the host despite having been deleted from the DB, the corresponding network info is not properly retrieved. For this reason, vif ports will not be cleaned up.

In this situation there may also be stale volume connections. Those will be leaked as well, since os-brick attempts to flush the now-inaccessible devices, which fails. As per a recent os-brick change, a 'force' flag must be set in order to ignore flush errors.

Log: http://paste.openstack.org/raw/620048/
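An illustrative sketch of the vif-leak mechanism described above (this is not nova code; all class and function names here are made up for the example). The cleanup path reads the instance's network info from a cached copy kept with the instance record; once the instance is gone from the DB, that cache yields no VIF entries, so nothing gets unplugged:

```python
class FakeInfoCache:
    """Stands in for nova's per-instance network info cache."""
    def __init__(self, vifs):
        self.network_info = list(vifs)

class FakeInstance:
    def __init__(self, deleted, info_cache):
        self.deleted = deleted
        self.info_cache = info_cache

def get_network_info(instance):
    # Mirrors the behaviour described in the bug: an instance deleted
    # from the DB has no usable info cache, so an empty list comes back.
    if instance.deleted or instance.info_cache is None:
        return []
    return instance.info_cache.network_info

def unplug_vifs(instance):
    vifs = get_network_info(instance)
    for vif in vifs:
        print("unplugging %s" % vif)
    return vifs

live = FakeInstance(False, FakeInfoCache(["ovs-port-1"]))
gone = FakeInstance(True, FakeInfoCache([]))  # deleted from the DB
assert unplug_vifs(live) == ["ovs-port-1"]
assert unplug_vifs(gone) == []  # leaked: the port is never unplugged
```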

Tags: compute
Revision history for this message
Matt Riedemann (mriedem) wrote :

Isn't this fixed by https://review.openstack.org/#/c/486955/ so it's a duplicate of bug 1705683?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Well, maybe this is something else. Can you be more specific? Are you talking about a periodic task in the compute manager?

tags: added: compute
Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

Indeed, I was talking about the compute manager periodic task, sorry for being vague. Updating the description.

description: updated
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

Can you be more specific about where we're not passing through network info or terminating block device connections? Looking at the periodic task code, when it shuts down an instance to delete it locally, we call here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L6764-L6766

And the block_device_info and network_info are passed through here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2278

And volume connections should be terminated here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2307

So what's missing? Is there a stacktrace in the nova-compute logs when this fails?

Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

The network info fetched here [1] will be empty as it relies on the info cache [2], which won't contain vif details for deleted instances, as far as I can see.

There won't be any trace in the logs; it's just that those vifs will not be unplugged. It's really easy to reproduce, all you have to do is:
1. boot an instance, while using ovs ports
2. kill the nova compute service and wait for it to be reported as 'down'.
3. destroy the instance
4. bring the nova compute service back up. It will destroy the instance, but the ports will not be unplugged. If iSCSI volumes were attached to that instance, the corresponding iSCSI sessions will be leaked as well.

As for the volume connections, the BDMs are properly fetched. But if the nova compute service comes back up after the instances have been deleted from the DB and the volumes disconnected on the Cinder side, we end up having stale iSCSI sessions (of course, only if using an iSCSI backend). The issue is that os-brick attempts to flush an inaccessible device, which fails. Until recently it would not error out; it moved on and removed the iSCSI session anyway. But as per a recent change [3], a 'force' flag is now required in order to ignore flush exceptions.

[1] https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L2264
[2] https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/objects/instance.py#L1169-L1172
[3] https://github.com/openstack/os-brick/commit/400ca5d6db818b966065001571e59198c6966e2f
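A minimal sketch of the flush/force behaviour described above (this is not os-brick code; the names and signatures are invented for illustration). Without a 'force'-style flag, the flush failure propagates and the session teardown is never reached:

```python
class FlushError(Exception):
    pass

def flush_device(device_accessible):
    # Flushing an inaccessible device fails -- the failure mode hit when
    # the volume was already disconnected on the Cinder side.
    if not device_accessible:
        raise FlushError("cannot flush: device is gone")

def disconnect_volume(device_accessible, force=False):
    """Return True once the (fake) iSCSI session has been torn down."""
    try:
        flush_device(device_accessible)
    except FlushError:
        if not force:
            raise  # teardown below is never reached -> session leaked
    # ...remove the iSCSI session here...
    return True

assert disconnect_volume(True) is True               # healthy device
assert disconnect_volume(False, force=True) is True  # flush error ignored
leaked = False
try:
    disconnect_volume(False)  # no force: exception propagates
except FlushError:
    leaked = True
assert leaked
```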

Changed in nova:
assignee: nobody → Xuanzhou Perry Dong (oss-xzdong)
Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Tested in the latest master branch using devstack:

1. vif is unplugged

See logs in: http://paste.openstack.org/show/623286/

2. no stale iscsi session

See logs in: http://paste.openstack.org/show/623288/

Hi, Lucian,

Could you check the logs to see if you do things differently?

BR/Perry

Changed in nova:
status: New → Invalid
Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

This happens when cleaning up instances that were deleted while the nova-compute service was down. I've checked your paste and I think that the missing step is exactly that: stopping the nova-compute service before deleting the instance and turning it back on afterwards.

Thanks for looking into this.

Lucian

Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Thanks for the response. I have stopped and started the nova-compute service. The restart of the nova-compute service is shown in the log (I am not sure why the stopping of the service is not shown; probably I should use "raw").

stack@devstack01:~/devstack$ systemctl start <email address hidden>
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start '<email address hidden>'.
Authenticating as: stack,,, (stack)
Password:
==== AUTHENTICATION COMPLETE ===

BR/Perry

Changed in nova:
status: Invalid → Incomplete
Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Attaching a script log as a file so that the log won't get truncated.

Revision history for this message
Lucian Petrut (petrutlucian94) wrote :

I've added a cmd log that describes the issue and helps recreate it. I tried to keep it as simple as possible.

Revision history for this message
Xuanzhou Perry Dong (oss-xzdong) wrote :

Lucian, thanks. I can reproduce the issue now.

Changed in nova:
status: Incomplete → Confirmed
Matt Riedemann (mriedem)
Changed in nova:
assignee: Xuanzhou Perry Dong (oss-xzdong) → nobody