nova orphan instances

Bug #1820802 reported by Yongli He
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Wishlist
Assigned to: Yongli He

Bug Description

Description
===========
Under some corner conditions, instances can become orphans: the guest is still running on the host, but Nova is no longer aware of it.

Steps to reproduce
==================

1) Suppose nova-compute goes down for some reason and, during the downtime, the user deletes the server through the API, so its records are deleted from the DB. nova-compute then comes back up. The guest VM is still running on this compute node and consuming resources.

2) Once a live migration begins, it runs to completion under libvirt's control. If something goes wrong in the underlying infrastructure, e.g. RabbitMQ is dead or the network is badly congested, nova may fail to delete the instance on the source compute node, or may try to roll back and fail. The same instance then exists on both the source and destination compute nodes. The copy on the source host is a duplicate, and it is an orphan instance for the source compute node.
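
The two cases above can be told apart from the compute host's side by comparing what the hypervisor reports with what the database records. The following is a minimal sketch under stated assumptions: the inputs (a set of guest UUIDs running on the host, and a mapping from instance UUID to the host the database believes owns it) and the function name are hypothetical and are not taken from nova.

```python
from typing import Dict, Set


def find_orphans(running_on_host: Set[str],
                 db_host_by_uuid: Dict[str, str],
                 this_host: str) -> Dict[str, str]:
    """Return {uuid: reason} for guests this host runs but nova does not track here.

    All names are hypothetical and for illustration only.
    """
    orphans = {}
    for uuid in sorted(running_on_host):
        owner = db_host_by_uuid.get(uuid)
        if owner is None:
            # Case 1: the record was deleted while nova-compute was down.
            orphans[uuid] = "no database record"
        elif owner != this_host:
            # Case 2: live migration completed, but the source copy remains.
            orphans[uuid] = "database places instance on " + owner
    return orphans
```

A guest whose database record names this host is not reported; only guests with no record, or with a record pointing at another host, are treated as orphans.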

Expected result
===============
There should be no orphan instances.

Actual result
=============
Some instances are outside of Nova's management.

Environment
===========
Reproducing these conditions is not easy. Refer to the discussion at the Stein meetup:
https://etherpad.openstack.org/p/nova-ptg-stein L931

Fix
=====

The proposal is to add a periodic task with a configurable action to take when an orphan instance is found. The suggested actions are:
* reap the instance.
* stop the instance.
* log the messages only. [default]

The interval of the periodic task should be configurable.
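
The action selection described above could look roughly like the sketch below. It is not the proposed patch: the enum, the function name, and the `stop`/`destroy` callbacks are hypothetical stand-ins for whatever the virt driver would provide, and the scheduling itself (running this at the configured interval) is left to the caller.

```python
import enum
import logging
from typing import Callable, Iterable, List

LOG = logging.getLogger(__name__)


class OrphanAction(enum.Enum):
    # Hypothetical option values mirroring the three suggested actions.
    LOG = "log"    # default: report only, never touch the guest
    STOP = "stop"
    REAP = "reap"


def handle_orphans(orphans: Iterable[str],
                   action: OrphanAction = OrphanAction.LOG,
                   stop: Callable[[str], None] = lambda uuid: None,
                   destroy: Callable[[str], None] = lambda uuid: None) -> List[str]:
    """Apply the configured action to each orphan and return the UUIDs seen."""
    handled = []
    for uuid in orphans:
        # Every orphan is logged, whatever the action.
        LOG.warning("Orphan instance %s detected; action=%s", uuid, action.value)
        if action is OrphanAction.STOP:
            stop(uuid)
        elif action is OrphanAction.REAP:
            destroy(uuid)
        handled.append(uuid)
    return handled
```

With the default action nothing is stopped or destroyed, so an operator has to opt in explicitly before any guest is touched.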

This was previously proposed as a blueprint but qualifies better as a bug. Refer to:

https://blueprints.launchpad.net/nova/+spec/periodic-orphan-instances-delete

Changed in nova:
assignee: nobody → Yongli He (yongli-he)
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: compute starlingx
melanie witt (melwitt) wrote :

Note that having nova be able to clean up instances it doesn't know about (nova-manage db archive_deleted_rows on deleted!=0 instances) is a fundamental shift from how things have historically been done. Today, nova will not touch instances that do not have records in the database. Scenario: ops engineer creates a libvirt domain on a compute host out-of-band from nova in order to do some testing -- nova-compute will not touch it today. If we make a change to be able to reap them, nova-compute could destroy that testing libvirt domain.

I'm not saying it's necessarily bad, but it's different than what we've been doing. For this reason (and the fact that it will introduce config options), I think it should have a release note.

Also noting that the scenario I describe above seems alleviated by the fact that the default config in the proposed change would be to only log messages if orphans are detected. This way, an operator has to opt-in to having nova-compute destroy instances it doesn't know about and thus engineers in their org should know they can't do out-of-band tests like the one described in the scenario.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/648912

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/648913

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Yongli He (<email address hidden>) on branch: master
Review: https://review.openstack.org/648913
Reason: duplicated with https://review.openstack.org/#/c/627765/

Matt Riedemann (mriedem) wrote :

Marking this wishlist because as Mel noted this is a new feature really, and not something nova has historically dealt with, but I've heard from several operators that run lots of live migrations that they need this kind of cleanup routine. It could be implemented external to nova, but it's nice to have one tool that handles it natively.

Changed in nova:
importance: Undecided → Medium
importance: Medium → Wishlist
Changed in nova:
assignee: Yongli He (yongli-he) → Eric Fried (efried)
Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Yongli He (yongli-he)
Changed in nova:
assignee: Yongli He (yongli-he) → Eric Fried (efried)
Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Yongli He (yongli-he)