libvirt driver leaves interface residue after failed start

Bug #1648840 reported by Dan Smith
This bug affects 4 people
Affects                    Status          Importance   Assigned to   Milestone
OpenStack Compute (nova)   Fix Released    Medium       Dan Smith
Newton                     Fix Committed   Medium       Lee Yarwood

Bug Description

When the libvirt driver fails to start a VM for any reason other than a neutron plug timeout, it leaves the interfaces created by vif plugging on the system. If a subsequent delete is performed and completes successfully, they are removed. However, when connectivity problems prevent a normal delete, a local delete is performed at the API level and the interfaces remain.

In at least one real-world situation I have observed, a script was creating test instances that failed and left residue. After the residue interface count reached about 6,000 on the system, VM creates started failing with "Argument list too long" as libvirt choked on enumerating the interfaces it had left behind.

Tags: libvirt
Dan Smith (danms)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Dan Smith (danms)
Changed in nova:
status: Confirmed → In Progress
Matt Riedemann (mriedem)
tags: added: libvirt
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/408806
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5e7f765266e0b94807e019b645c8be89770e7428
Submitter: Jenkins
Branch: master

commit 5e7f765266e0b94807e019b645c8be89770e7428
Author: Dan Smith <email address hidden>
Date: Thu Dec 8 12:25:37 2016 -0800

    Cleanup after any failed libvirt spawn

    When we go to spawn a libvirt domain, we catch a few types of exceptions
    and perform cleanup before failing the operation. For some reason, we
    don't do this universally, which means that we leave things like network
    devices laying around (from plug_vifs()). If a delete comes later, it
    should clean those things up. However, if a subsequent failure prevents
    that, and especially if we do a local delete at the API, we'll leak those
    interfaces.

    As seen in at least one real-world situation, this can cause us to leak
    interfaces until we have tens of thousands of them on the system, which
    then causes secondary failures.

    Since we run the cleanup() routine for certain failures, it certainly
    seems appropriate to run it always and not leave residue until a
    successful delete is performed.

    Closes-Bug: #1648840
    Change-Id: Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1
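The idea in the commit message (run cleanup on every spawn failure, not only on selected exception types) can be illustrated with a minimal sketch. This is not nova's code: `FakeDriver`, its `plugged` set, and `_start_domain()` are hypothetical stand-ins, and only the names `plug_vifs()` and `cleanup()` come from the message above.

```python
class FakeDriver:
    """Hypothetical stand-in for a virt driver; not nova's actual class."""

    def __init__(self, fail_with=None):
        # Exception instance to raise when starting the domain, or None.
        self.fail_with = fail_with
        # Instances whose interfaces are currently plugged on the "host".
        self.plugged = set()

    def plug_vifs(self, instance):
        # Assumption: plugging creates host-side interfaces for the instance.
        self.plugged.add(instance)

    def cleanup(self, instance):
        # Assumption: cleanup removes whatever plug_vifs() created.
        self.plugged.discard(instance)

    def _start_domain(self, instance):
        # Simulate a domain start that may fail for an arbitrary reason.
        if self.fail_with is not None:
            raise self.fail_with

    def spawn(self, instance):
        self.plug_vifs(instance)
        try:
            self._start_domain(instance)
        except Exception:
            # Clean up on any failure, then re-raise the original error
            # so the caller still sees why the spawn failed.
            self.cleanup(instance)
            raise
```

In this sketch a failed `spawn()` leaves `plugged` empty even if no delete ever follows, while a successful `spawn()` keeps the interface in place.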

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/409706

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/409706
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ee759f62b41a8afd7ac88b4e4a20d31d7959c12f
Submitter: Jenkins
Branch: stable/newton

commit ee759f62b41a8afd7ac88b4e4a20d31d7959c12f
Author: Dan Smith <email address hidden>
Date: Thu Dec 8 12:25:37 2016 -0800

    Cleanup after any failed libvirt spawn

    When we go to spawn a libvirt domain, we catch a few types of exceptions
    and perform cleanup before failing the operation. For some reason, we
    don't do this universally, which means that we leave things like network
    devices laying around (from plug_vifs()). If a delete comes later, it
    should clean those things up. However, if a subsequent failure prevents
    that, and especially if we do a local delete at the API, we'll leak those
    interfaces.

    As seen in at least one real-world situation, this can cause us to leak
    interfaces until we have tens of thousands of them on the system, which
    then causes secondary failures.

    Since we run the cleanup() routine for certain failures, it certainly
    seems appropriate to run it always and not leave residue until a
    successful delete is performed.

    Closes-Bug: #1648840
    Change-Id: Iab5afdf1b5b8d107ea0e5895c24d50712e7dc7b1
    (cherry picked from commit 5e7f765266e0b94807e019b645c8be89770e7428)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b2

This issue was fixed in the openstack/nova 15.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.3

This issue was fixed in the openstack/nova 14.0.3 release.
