libvirt virt driver does not wait for network-vif-plugged event during hard reboot

Bug #1946729 reported by Balazs Gibizer
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Balazs Gibizer
Milestone: (none listed)

Bug Description

The libvirt virt driver has logic during spawn to create the domain in libvirt, pause it, and resume it only after the network-vif-plugged events have been received from neutron for the ports of the instance being spawned. This is in place to avoid starting the guest OS before the networking backend has finished setting up the networking for the ports. Without this, a guest might start and request an IP via DHCP before the networking setup is finished and therefore might not get an IP at all.
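
The spawn sequence described above can be sketched roughly as follows. This is only an illustration, not nova's code: `FakeDomain`, `spawn`, and the `plugged_events` mapping are hypothetical names, and `threading.Event` stands in for the external event delivered by neutron.

```python
import threading


class FakeDomain:
    """Hypothetical stand-in for a libvirt domain; only tracks a state."""

    def __init__(self):
        self.state = "undefined"

    def create_paused(self):
        self.state = "paused"

    def resume(self):
        self.state = "running"


def spawn(domain, port_ids, plugged_events, timeout=5.0):
    """Create the domain paused, then resume it only after every port
    has received its network-vif-plugged event (or fail on timeout)."""
    domain.create_paused()
    for port_id in port_ids:
        if not plugged_events[port_id].wait(timeout):
            raise TimeoutError(f"no network-vif-plugged for {port_id}")
    domain.resume()


ports = ["port-a", "port-b"]
events = {port: threading.Event() for port in ports}

# Simulate the networking backend reporting the ports a little later.
for delay, port in zip((0.05, 0.1), ports):
    threading.Timer(delay, events[port].set).start()

dom = FakeDomain()
spawn(dom, ports, events)
print(dom.state)
```

If an event never arrives, the sketch leaves the domain paused and raises, so the guest never gets a chance to send a DHCP request into an unwired port.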

In case of hard reboot (and start, as that is a hard reboot too) nova cleans up the instance from the hypervisor (except the local disk), including unplugging the vifs of the instance. Then nova recreates everything, including re-plugging the vifs. This is intentional, as hard reboot is considered to be an operation that is capable of recovering instances in bad / inconsistent states. However, during the hard reboot nova does not wait for the network-vif-plugged events before it lets the domain start running. In a mass instance startup scenario (e.g. after a compute host recovery) a potentially large number of vif unplug/plug requests hit the networking backend, and processing these replugs takes time. Because nova does not wait for the network-vif-plugged event, the guest OS can send its DHCP request well before the networking backend has caught up with the unplug/plug requests. This leads to connectivity issues in the guest.
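
A toy model of the mass startup case may help to see why the missing wait matters. It assumes, purely for illustration, that the backend handles replugs one after another at a fixed cost, that every guest sends a single DHCP request a fixed time after its domain starts, and that DHCP retries are ignored. The function name and all numbers are made up.

```python
def count_guests_without_network(n_instances, replug_cost, guest_boot_time,
                                 wait_for_plug):
    """Count guests whose DHCP request is sent before their port is wired."""
    failures = 0
    for i in range(n_instances):
        # Replugs are processed serially, so the i-th instance's port
        # is ready only after (i + 1) * replug_cost time units.
        port_ready_at = (i + 1) * replug_cost
        # Without the wait, nova lets the domain run immediately.
        domain_started_at = port_ready_at if wait_for_plug else 0.0
        dhcp_request_at = domain_started_at + guest_boot_time
        if dhcp_request_at < port_ready_at:
            failures += 1
    return failures


no_wait = count_guests_without_network(50, 1.0, 10.0, wait_for_plug=False)
with_wait = count_guests_without_network(50, 1.0, 10.0, wait_for_plug=True)
print(no_wait, with_wait)  # prints "40 0"
```

In this model the first ten guests happen to boot slowly enough for the backend to catch up, while the remaining forty send their DHCP request too early unless nova waits for the event.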

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Balazs Gibizer (balazs-gibizer)
tags: added: compute libvirt reboot
Vivekanandan Narasimhan (vivekanandan-narasimhan) wrote :

First, thanks a lot for raising this bug.

We kindly request both vif_plugged and vif_unplugged handshakes between nova and the networking backend. This would enable closer collaboration and make it easier to troubleshoot which part of the VM start activity failed during a mass start of VMs on a typical batch of 50 compute hosts.

Balazs Gibizer (balazs-gibizer) wrote :

@Vivek: Right now nova does not know whether the networking backend sends both plug and unplug events (like ovs), sends only plug (like networking-odl, see [1]), or does not send any plug-time events at all (I think like ovn).

The currently proposed temporary workaround fix [2] adds a conditional wait for plug. I assume that if nova waits for the plug, then nova can be sure that both unplug and plug happened in the backend, as nova issued the vif.unplug _before_ the vif.plug.

Therefore I don't think it is worth the additional config flag for unplug and the resulting complexity of waiting for both events conditionally.

What happens in your case with the currently proposed fix [2] applied and the wait_for_vif_plugged_event_during_hard_reboot config flag set to True?

[1] https://github.com/openstack/networking-odl/blob/master/networking_odl/ml2/port_status_update.py#L89-L90

[2] https://review.opendev.org/c/openstack/nova/+/813419
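
The ordering argument in this comment can be illustrated with a small sketch. It rests on the same assumption the comment makes, namely that the backend handles the requests for a port in the order nova issued them; `FifoBackend` and its methods are hypothetical names, not a neutron or nova API.

```python
from collections import deque


class FifoBackend:
    """Hypothetical backend that processes requests strictly in order."""

    def __init__(self):
        self.pending = deque()
        self.done = []

    def request(self, op, port_id):
        self.pending.append((op, port_id))

    def process_until(self, op, port_id):
        """Process queued requests until the given one has completed."""
        while self.pending:
            item = self.pending.popleft()
            self.done.append(item)
            if item == (op, port_id):
                return True
        return False


backend = FifoBackend()
backend.request("unplug", "port-a")  # hard reboot unplugs first
backend.request("plug", "port-a")    # then plugs again

# Waiting for the plug to complete implies the unplug completed too.
plug_seen = backend.process_until("plug", "port-a")
print(plug_seen, backend.done)
```

If a backend does not preserve this ordering, waiting for the plug event alone would not give the same guarantee.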

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/813419
Committed: https://opendev.org/openstack/nova/commit/68c970ea9915a95f9828239006559b84e4ba2581
Submitter: "Zuul (22348)"
Branch: master

commit 68c970ea9915a95f9828239006559b84e4ba2581
Author: Balazs Gibizer <email address hidden>
Date: Mon Oct 11 14:41:37 2021 +0200

    Add a WA flag waiting for vif-plugged event during reboot

    The libvirt driver power on and hard reboot destroys the domain first
    and unplugs the vifs then recreate the domain and replug the vifs.
    However nova does not wait for the network-vif-plugged event before
    unpause the domain. This can cause that the domain starts running and
    requesting IP via DHCP before the networking backend finished plugging
    the vifs.

    So this patch adds a workaround config option to nova to wait for
    network-vif-plugged events during hard reboot the same way as nova waits
    for this event during new instance spawn.

    This logic cannot be enabled unconditionally as not all neutron
    networking backend sending plug time events to wait for. Also the logic
    needs to be vnic_type dependent as ml2/ovs and the in tree sriov backend
    often deployed together on the same compute. While ml2/ovs sends plug
    time event the sriov backend does not send it reliably. So the
    configuration is not just a boolean flag but a list of vnic_types
    instead. This way the waiting for the plug time event for a vif that is
    handled by ml2/ovs is possible while the instance has other vifs handled
    by the sriov backend where no event can be expected.

    Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
    Closes-Bug: #1946729
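
The vnic_type-dependent wait described in the commit message above could look roughly like this. It is a sketch only: the dict layout of a vif and the function name `events_to_wait_for` are assumptions for illustration, and the real change lives in nova/virt/libvirt/driver.py.

```python
def events_to_wait_for(vifs, wait_vnic_types):
    """Return the (event name, port id) pairs to wait for during hard
    reboot, limited to vifs whose vnic_type is in the configured list."""
    return [
        ("network-vif-plugged", vif["id"])
        for vif in vifs
        if vif["vnic_type"] in wait_vnic_types
    ]


# One port handled by ml2/ovs (vnic_type "normal") and one handled by
# the sriov backend (vnic_type "direct") on the same instance.
vifs = [
    {"id": "port-ovs", "vnic_type": "normal"},
    {"id": "port-sriov", "vnic_type": "direct"},
]

# Only wait for the ovs port; no event is expected for the sriov one.
selected = events_to_wait_for(vifs, {"normal"})
print(selected)
```

With an empty list of vnic_types nothing is waited for, which matches the behavior before the workaround was added.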

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/818515

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/818519

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/818559

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/818564

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/818598

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/818601

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/818604

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/818605

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/818515
Committed: https://opendev.org/openstack/nova/commit/0c41bfb8c5c60f1cc930ae432e6be460ee2e97ac
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 0c41bfb8c5c60f1cc930ae432e6be460ee2e97ac
Author: Balazs Gibizer <email address hidden>
Date: Mon Oct 11 14:41:37 2021 +0200

    Add a WA flag waiting for vif-plugged event during reboot

    The libvirt driver power on and hard reboot destroys the domain first
    and unplugs the vifs then recreate the domain and replug the vifs.
    However nova does not wait for the network-vif-plugged event before
    unpause the domain. This can cause that the domain starts running and
    requesting IP via DHCP before the networking backend finished plugging
    the vifs.

    So this patch adds a workaround config option to nova to wait for
    network-vif-plugged events during hard reboot the same way as nova waits
    for this event during new instance spawn.

    This logic cannot be enabled unconditionally as not all neutron
    networking backend sending plug time events to wait for. Also the logic
    needs to be vnic_type dependent as ml2/ovs and the in tree sriov backend
    often deployed together on the same compute. While ml2/ovs sends plug
    time event the sriov backend does not send it reliably. So the
    configuration is not just a boolean flag but a list of vnic_types
    instead. This way the waiting for the plug time event for a vif that is
    handled by ml2/ovs is possible while the instance has other vifs handled
    by the sriov backend where no event can be expected.

    Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
    Closes-Bug: #1946729
    (cherry picked from commit 68c970ea9915a95f9828239006559b84e4ba2581)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/818519
Committed: https://opendev.org/openstack/nova/commit/89c4ff5f7b45f1a5bed8b6b9b4586fceaa391bfb
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 89c4ff5f7b45f1a5bed8b6b9b4586fceaa391bfb
Author: Balazs Gibizer <email address hidden>
Date: Mon Oct 11 14:41:37 2021 +0200

    Add a WA flag waiting for vif-plugged event during reboot

    The libvirt driver power on and hard reboot destroys the domain first
    and unplugs the vifs then recreate the domain and replug the vifs.
    However nova does not wait for the network-vif-plugged event before
    unpause the domain. This can cause that the domain starts running and
    requesting IP via DHCP before the networking backend finished plugging
    the vifs.

    So this patch adds a workaround config option to nova to wait for
    network-vif-plugged events during hard reboot the same way as nova waits
    for this event during new instance spawn.

    This logic cannot be enabled unconditionally as not all neutron
    networking backend sending plug time events to wait for. Also the logic
    needs to be vnic_type dependent as ml2/ovs and the in tree sriov backend
    often deployed together on the same compute. While ml2/ovs sends plug
    time event the sriov backend does not send it reliably. So the
    configuration is not just a boolean flag but a list of vnic_types
    instead. This way the waiting for the plug time event for a vif that is
    handled by ml2/ovs is possible while the instance has other vifs handled
    by the sriov backend where no event can be expected.

    Conflicts:
          nova/conf/workarounds.py due to
          I2da867f2734b590a884b1fe1200c402cbf7e9e1c is not in stable/wallaby

    Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
    Closes-Bug: #1946729
    (cherry picked from commit 68c970ea9915a95f9828239006559b84e4ba2581)
    (cherry picked from commit 0c41bfb8c5c60f1cc930ae432e6be460ee2e97ac)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/818559
Committed: https://opendev.org/openstack/nova/commit/c531fdcc192afb5af628ac567cb0ff8aa3eab052
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit c531fdcc192afb5af628ac567cb0ff8aa3eab052
Author: Balazs Gibizer <email address hidden>
Date: Mon Oct 11 14:41:37 2021 +0200

    Add a WA flag waiting for vif-plugged event during reboot

    The libvirt driver power on and hard reboot destroys the domain first
    and unplugs the vifs then recreate the domain and replug the vifs.
    However nova does not wait for the network-vif-plugged event before
    unpause the domain. This can cause that the domain starts running and
    requesting IP via DHCP before the networking backend finished plugging
    the vifs.

    So this patch adds a workaround config option to nova to wait for
    network-vif-plugged events during hard reboot the same way as nova waits
    for this event during new instance spawn.

    This logic cannot be enabled unconditionally as not all neutron
    networking backend sending plug time events to wait for. Also the logic
    needs to be vnic_type dependent as ml2/ovs and the in tree sriov backend
    often deployed together on the same compute. While ml2/ovs sends plug
    time event the sriov backend does not send it reliably. So the
    configuration is not just a boolean flag but a list of vnic_types
    instead. This way the waiting for the plug time event for a vif that is
    handled by ml2/ovs is possible while the instance has other vifs handled
    by the sriov backend where no event can be expected.

    Conflicts:
          nova/virt/libvirt/driver.py both
          I73305e82da5d8da548961b801a8e75fb0e8c4cf1 and
          I0b93bdc12cdce591c7e642ab8830e92445467b9a are not in
          stable/victoria

    The stable/victoria specific changes:

    * The list of accepted vnic_type-s are adapted to what is supported by
      neutron on this release. So vdpa, accelerator-direct, and
      accelerator-direct-physical are removed as they are only added in
      stable/wallaby

    Change-Id: Ie904d1513b5cf76d6d5f6877545e8eb378dd5499
    Closes-Bug: #1946729
    (cherry picked from commit 68c970ea9915a95f9828239006559b84e4ba2581)
    (cherry picked from commit 0c41bfb8c5c60f1cc930ae432e6be460ee2e97ac)
    (cherry picked from commit 89c4ff5f7b45f1a5bed8b6b9b4586fceaa391bfb)

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.4.0

This issue was fixed in the openstack/nova 22.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.2.0

This issue was fixed in the openstack/nova 23.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.1.0

This issue was fixed in the openstack/nova 24.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/pike)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/pike
Review: https://review.opendev.org/c/openstack/nova/+/813437
Reason: stable/pike has transitioned to End of Life for nova, open patches need to be abandoned in order to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/818605
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/818604
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/818601
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/818598
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/818564
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.
