plug_vhostuser may fail due to device not found error when setting mtu

Bug #1533876 reported by sean mooney on 2016-01-13
This bug affects 3 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Medium       sean mooney
Liberty                    Undecided    Ihar Hrachyshka
Mitaka                     Undecided    Matt Riedemann

Bug Description

Setting the mtu of a vhost-user port with the ip command will cause vms to fail
to boot with a device not found error, as vhost-user ports are not represented as
kernel netdevs.

this bug is present in stable/kilo, stable/liberty and master. i would like to ask that it be fixed in master
and, if accepted, backported.

when using vhost-user with ovs-dpdk, the vhost-user port is plugged into ovs by nova in two non-atomic steps: a call to linux_net.create_ovs_vif_port to add an ovs port, followed by a second call to linux_net.ovs_set_vhostuser_port_type to update the port type.

https://github.com/openstack/nova/blob/1bf6a8760f0ef226dba927b62d4354e248b984de/nova/virt/libvirt/vif.py#L652-L655

the reuse of create_ovs_vif_port has the unintended consequence of introducing an error where
the ip tool is invoked to try to set the mtu on the userspace vhost-user interface, which does not exist
as a kernel netdev.
https://github.com/openstack/nova/blob/1bf6a8760f0ef226dba927b62d4354e248b984de/nova/network/linux_net.py#L1379

this results in the call to set_device_mtu throwing an exception as the ip command exits with code 1
https://github.com/openstack/nova/blob/1bf6a8760f0ef226dba927b62d4354e248b984de/nova/network/linux_net.py#L1340-L1342

as a result the second function call to ovs_set_vhostuser_port_type is never made and the vm fails to boot.
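
The failure sequence described above can be sketched roughly as follows. This is a simplified stand-in, not the nova code: the function names create_ovs_vif_port and ovs_set_vhostuser_port_type come from the report, but their bodies, the ProcessExecutionError class and the calls list are assumptions made only for illustration.

```python
class ProcessExecutionError(Exception):
    """Hypothetical stand-in for the error raised when a command exits non-zero."""


calls = []  # records which plugging steps actually ran


def create_ovs_vif_port(bridge, dev, mtu):
    # Step 1: add the port to the bridge (simulated).
    calls.append("add-port")
    if mtu:
        # A vhost-user port has no kernel netdev, so "ip link set" exits 1.
        calls.append("set-mtu")
        raise ProcessExecutionError('Cannot find device "%s"' % dev)


def ovs_set_vhostuser_port_type(dev):
    # Step 2: set the interface type (simulated).
    calls.append("set-type")


def plug_vhostuser(bridge, dev, mtu):
    create_ovs_vif_port(bridge, dev, mtu)
    # Never reached when the call above raises.
    ovs_set_vhostuser_port_type(dev)


try:
    plug_vhostuser("br-int", "vhu272ad051-d5", mtu=1500)
except ProcessExecutionError as exc:
    print("plug failed:", exc)

print(calls)
```

In this sketch the port is added but the "set-type" step never appears in calls, which mirrors the half-configured port described in the report.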

to resolve this issue i would like to introduce a new function in linux_net.py,
create_ovs_vhostuser_port, which will create the vhostuser port as an atomic action
and will not set the mtu, similar to the implementation in the os-vif vhost-user driver

https://github.com/jaypipes/vif_plug_vhostuser/blob/8ac30ce32b3e0bae5d2d8f1edc9d64ac2871608e/vif_plug_vhostuser/linux_net.py#L34-L46
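
A minimal sketch of what an atomic create could look like, assuming a single ovs-vsctl invocation that adds the port and sets its type together. The helper name build_vhostuser_port_cmd, its parameters and the external-ids keys are illustrative assumptions, not the function proposed in this report or the os-vif code linked above.

```python
def build_vhostuser_port_cmd(bridge, dev, iface_id, mac, instance_id):
    """Build one ovs-vsctl command that adds the port and sets its type.

    Hypothetical helper: because the type is set in the same transaction
    as add-port, there is no window where the port exists without a type,
    and no MTU is set with the ip tool.
    """
    return [
        "ovs-vsctl",
        "--", "--if-exists", "del-port", dev,
        "--", "add-port", bridge, dev,
        "--", "set", "Interface", dev,
        "external-ids:iface-id=%s" % iface_id,      # assumed key
        "external-ids:attached-mac=%s" % mac,       # assumed key
        "external-ids:vm-uuid=%s" % instance_id,    # assumed key
        "type=dpdkvhostuser",
    ]


cmd = build_vhostuser_port_cmd(
    "br-int", "vhu272ad051-d5", "port-uuid", "fa:16:3e:00:00:01", "vm-uuid")
print(" ".join(cmd))
```

The sketch only builds the argument list; running it would need ovs-vsctl and root privileges on a host with ovs-dpdk.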

an alternative solution would be to add "1" to the return code check here https://github.com/openstack/nova/blob/master/nova/network/linux_net.py#L1339 or catch the exception here https://github.com/openstack/nova/blob/1bf6a8760f0ef226dba927b62d4354e248b984de/nova/virt/libvirt/vif.py#L652
however neither solves the underlying cause.
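
For illustration, the first alternative (tolerating exit code 1) might look like the sketch below. The execute helper and its check_exit_code parameter are assumptions written for this example, and a small Python subprocess stands in for the failing ip command so the snippet runs without a vhost-user port.

```python
import subprocess
import sys


def execute(cmd, check_exit_code=(0,)):
    """Run cmd and raise unless its exit status is in check_exit_code."""
    result = subprocess.run(cmd)
    if result.returncode not in check_exit_code:
        raise RuntimeError("%r exited with %d" % (cmd, result.returncode))
    return result.returncode


# Stand-in for "ip link set <vhost-user port> mtu N", which exits with 1.
fake_ip = [sys.executable, "-c", "import sys; sys.exit(1)"]

# Tolerating exit code 1 lets plugging continue ...
print(execute(fake_ip, check_exit_code=(0, 1)))
```

The drawback is that a missing device would then also be ignored for ordinary kernel ports, and the ip tool is still being asked to configure a device it cannot see.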

this was observed with kilo openstack on ubuntu 14.04 with ovs-dpdk deployed with puppet/fuel.

sean mooney (sean-k-mooney) wrote :

i will try to submit a patch to fix this in the next day or two.

Changed in nova:
assignee: nobody → sean mooney (sean-k-mooney)
description: updated

@sean mooney:

Bug skimming result
===================
Are you still actively working on this bug? I have trouble
understanding how to reproduce the issue you described. Would you please
* either push a change to gerrit (with unit tests)
* or explain the issue so that others can reproduce it? A template
  can be found at [1].

Also, the links to the code you provided won't reliably point to the
pieces of code you mean, and so fail to document the issue properly.
Please use a stable link with a commit id like:

    https://github.com/openstack/nova/blob/4dbc6abef99b3da41a6089f322d32676f44cd1f6/nova/virt/libvirt/vif.py#L637-L640

If you have questions, I'm available in the IRC channel #openstack-nova
under the name "markus_z".

References
==========
[1] https://wiki.openstack.org/wiki/Nova/BugsTeam/BugReportTemplate

sean mooney (sean-k-mooney) wrote :

hi yes i should have a patch for it today.

just fixing unit tests now

Fix proposed to branch: master
Review: https://review.openstack.org/271444

Changed in nova:
status: New → In Progress
description: updated
sean mooney (sean-k-mooney) wrote :

@markus_z

if you have an ubuntu 14.04 system and run the following command it will result in an
exit code of 1

stack@silpixa00390506:~/devstack$ ip link set eth50 mtu 1000
Cannot find device "eth50"
stack@silpixa00390506:~/devstack$ echo $?
1

as the vhost-user ports are not represented in the kernel network stack, setting the
mtu on a vhost-user port will always cause the binding to fail with any version of the
ip link / iproute2 package that returns 1 when a device is not found.

to replicate the nova boot issue you would need to deploy ovs with dpdk, though that is not necessary
as the root cause is the unhandled return code caused by the incorrect call to set the mtu.
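
As a rough way to see the difference without deploying ovs-dpdk: on Linux, kernel network devices show up as entries under /sys/class/net, while a vhost-user port does not. The helper below is a hypothetical check written for this explanation (the name is_kernel_netdev and the sysfs_root parameter are assumptions); it is not what the submitted patch does.

```python
import os


def is_kernel_netdev(name, sysfs_root="/sys/class/net"):
    """Return True if the kernel exposes a network device with this name.

    Assumes a Linux host; sysfs_root is a parameter only so the check
    can be exercised against a scratch directory.
    """
    return os.path.exists(os.path.join(sysfs_root, name))


# A vhost-user port name such as this one is not expected to be listed,
# which is why "ip link set <port> mtu N" reports that it cannot find the device.
print(is_kernel_netdev("vhu272ad051-d5"))
```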

the patch i have submitted will remove the errant call to ip link to set the mtu.
it also makes the creation of the vhost-user port atomic, which will silence an ovs error message that is currently emitted
every time a vhost-user port is created due to the type not being set as part of the create.

ideally this change should be backported to kilo and liberty, as the non-atomic create and the mtu issue have been present since support was introduced in kilo.

Vladimir Eremin (yottatsa) wrote :

I've confirmed this bug on mitaka. The only workaround is to unset the network_device_mtu variable, which is not really a good idea.

David Edery (david-edery) wrote :

what we see is that when _set_device_mtu is called on the vhostuser device the device becomes non-functional:
[root@compute-0-0 ~]# ovs-vsctl show
f954b20d-8b91-4357-a6fa-df707f0bd1e6
    Bridge br-int
        fail_mode: secure
        Port "vhu272ad051-d5"
            Interface "vhu272ad051-d5"
                error: "could not open network device vhu272ad051-d5 (No such device)"

And its correspondent socket (/var/run/openvswitch) doesn't exist (==the call itself changed the state of the port for some reason)
Removing the call (temporarily for debugging) solved the issue.

David Edery (david-edery) wrote :

typo: s/correspondent/corresponding/g

sean mooney (sean-k-mooney) wrote :

what is annoying me about this is that i do not see the bug on my local devstack deployment, but we were able to see it on a puppet install of the same codebase. in any case the current logic is incorrect, so yes i hope this can be fixed.

i will aim to rebase and resubmit by monday.
ideally i would like to also backport this to liberty and kilo.

sean mooney (sean-k-mooney) wrote :

recently
https://review.openstack.org/#/c/283847/6/nova/virt/libvirt/vif.py
https://review.openstack.org/#/c/284407/3/neutron/plugins/ml2/config.py

have merged, which causes this bug to happen frequently in our ci (we suspect that passing builds have not been rebased on the neutron change)

i will rebase the patch shortly and leave a comment on both commits pointing to this bug

Matt Riedemann (mriedem) on 2016-03-01
tags: added: kilo-backport-potential liberty-backport-potential
Changed in nova:
importance: Undecided → Medium
tags: added: network
Matt Riedemann (mriedem) on 2016-03-11
tags: added: mitaka-rc-potential
Matt Riedemann (mriedem) wrote :

Considering there are backports proposed for this bug, it doesn't appear to be a regression in mitaka and is a latent issue, so I've removed the mitaka-rc-potential tag.

tags: removed: mitaka-rc-potential
sean mooney (sean-k-mooney) wrote :

i have updated the bug description, as the issue is not version specific: it should fail with any released version of the ip tool.

this latent bug was masked by the previous default value of 0 for the mtu in the nova config, which caused the ip link command to not be called. as the default value was used in the upstream ci and in our third party ci, the issue was not seen initially.
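
The masking effect can be sketched as below, assuming the mtu setter only invokes the ip tool when the configured value is non-zero. The function body and the executed list are simplified assumptions for illustration, not the nova implementation.

```python
executed = []  # commands that would have been run


def set_device_mtu(dev, mtu):
    # With the old default of 0 the guard is false and ip is never invoked,
    # so the missing kernel netdev goes unnoticed.
    if mtu:
        executed.append(["ip", "link", "set", dev, "mtu", str(mtu)])


set_device_mtu("vhu272ad051-d5", 0)     # old default: nothing is run
set_device_mtu("vhu272ad051-d5", 1500)  # non-zero value: ip link is attempted

print(executed)
```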

description: updated
description: updated
tags: added: mitaka-rc-potential
Matt Riedemann (mriedem) on 2016-03-18
tags: removed: mitaka-rc-potential
Matt Riedemann (mriedem) on 2016-03-22
tags: added: mitaka-backport-potential
tags: added: mitaka-rc-potential

Reviewed: https://review.openstack.org/271444
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=adf7ba61dd73fe4bfffa20295be9a4b1006a1fe6
Submitter: Jenkins
Branch: master

commit adf7ba61dd73fe4bfffa20295be9a4b1006a1fe6
Author: Sean Mooney <email address hidden>
Date: Fri Jan 22 17:00:36 2016 +0000

    stop setting mtu when plugging vhost-user ports

    vhost-user is a userspace protocol to establish connectivity
    between a virto-net frontend typically qemu and a
    userspace virtio backend such as ovs with dpdk.

    vhost-user interfaces exist only in userspace from the host perspective
    and are not represented in the linux networking stack as kernel netdevs.
    As a result attempting to set the mtu on a vhost-user interface
    using ifconfig or ip link will fail with a device not found error.

    - this change removes a call to _set_device_mtu when plugging
      vhost-user interfaces.
    - this change prevents the device not found error from occurring
      which stopped vms booting with vhost-user interfaces
      due to an uncaught exception resulting in a failure to set the
      interface type in ovs.
    - this change make creating vhost-user interface
      an atomic action.

    This latent bug is only triggered when the mtu value is set to a
    value other than 0 which was the default proir to mitaka.

    Change-Id: I2e17723d5052d57cd1557bd8a173c06ea0dcb2d4
    Closes-Bug: #1533876

Changed in nova:
status: In Progress → Fix Released
no longer affects: nova/kilo

Change abandoned by sean mooney (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/289374
Reason: i'll abandon this for now and we can reopen later if this issue starts to be hit more widely in kilo. most commercial openstack releases have moved to liberty, so i would not expect many new kilo deployments with ovs and dpdk.

Reviewed: https://review.openstack.org/294649
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c7eb823fe73e3db5dca48df5879db18cbab5bd8d
Submitter: Jenkins
Branch: stable/mitaka

commit c7eb823fe73e3db5dca48df5879db18cbab5bd8d
Author: Sean Mooney <email address hidden>
Date: Fri Jan 22 17:00:36 2016 +0000

    stop setting mtu when plugging vhost-user ports

    vhost-user is a userspace protocol to establish connectivity
    between a virto-net frontend typically qemu and a
    userspace virtio backend such as ovs with dpdk.

    vhost-user interfaces exist only in userspace from the host perspective
    and are not represented in the linux networking stack as kernel netdevs.
    As a result attempting to set the mtu on a vhost-user interface
    using ifconfig or ip link will fail with a device not found error.

    - this change removes a call to _set_device_mtu when plugging
      vhost-user interfaces.
    - this change prevents the device not found error from occurring
      which stopped vms booting with vhost-user interfaces
      due to an uncaught exception resulting in a failure to set the
      interface type in ovs.
    - this change make creating vhost-user interface
      an atomic action.

    This latent bug is only triggered when the mtu value is set to a
    value other than 0 which was the default proir to mitaka.

    Change-Id: I2e17723d5052d57cd1557bd8a173c06ea0dcb2d4
    Closes-Bug: #1533876
    (cherry picked from commit adf7ba61dd73fe4bfffa20295be9a4b1006a1fe6)

Matt Riedemann (mriedem) on 2016-03-24
tags: removed: mitaka-rc-potential

This issue was fixed in the openstack/nova 13.0.0.0rc2 release candidate.


Reviewed: https://review.openstack.org/302578
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a8ebbebd4ee0c3bb1452ea32f92e1588a6b35067
Submitter: Jenkins
Branch: master

commit 7105f888ee1f52d2a462fc0ece3130dc0d3d49f5
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Mar 31 06:28:06 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ibe5d4d38834fbcb99c0332d3375659a21d94154e

commit 5de98cb2de2eca3d061488c55f96e6f7c9bc56a8
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Mar 30 06:41:25 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ia46d661560b1141c1c1522c9477c510d28a0d0e7

commit a9d55427b6e8d2472088e3d40a8a5151ce408283
Author: Moshe Levi <email address hidden>
Date: Wed Mar 23 10:59:04 2016 +0200

    Fix detach SR-IOV when using LibvirtConfigGuestHostdevPCI

    This patch fixes an issue which was introduced by this
    change If3edc1965c01a077eb61984a442e0d778d870d75.
    Usually the vif config is of type LibvirtConfigGuestInterface,
    but some vif use LibvirtConfigGuestHostdevPCI config
    (e.g. the ib_hostdev). The difference is that
    LibvirtConfigGuestInterface keeps the pci address in source_dev
    while LibvirtConfigGuestHostdevPCI has domain, bus, slot and
    function, instead of relying on the vif config type we can take the
    pci address for the neutron port.

    Closes-Bug: #1560860

    Change-Id: I62a7ff16f1c9c5da923451520fbeeabb5cc0c5c6
    (cherry picked from commit f15d9a9693b19393fcde84cf4bc6f044d39ffdca)

commit 5b6ee702df7ad901f68bec2ed8d43b66aa6d98c1
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Mar 29 06:37:30 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Iad0e42a18bd3a7dcf216b4df17b9893e13382efe

commit 29042e06f7e570bd13607b62b997a6ae21db80c5
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Mar 28 06:34:19 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: If159133a2e32c6ef53ba104751a3eb054a95b733

commit 3e9819dab8249ec9993b0b9874e80a78f2ed1754
Author: Matt Riedemann <email address hidden>
Date: Sun Mar 27 19:31:32 2016 -0400

    Update cells blacklist regex for test_server_basic_ops

    Tempest change 9bee3b92f1559cb604c8bd74dcca57805a85a97a
    renamed a test in our blacklist so update the filter to
    handle the old and new name.

    The Tempest team is hesitant to revert the change so we
    should handle it ourselves and eventually move to using
    test uuids for our blacklist, but there might need to
    be work in devstack-gate for that fi...

Reviewed: https://review.openstack.org/289370
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=98464d54d0fcdba452191bc0291d59957c9cdae6
Submitter: Jenkins
Branch: stable/liberty

commit 98464d54d0fcdba452191bc0291d59957c9cdae6
Author: Sean Mooney <email address hidden>
Date: Fri Jan 22 17:00:36 2016 +0000

    stop setting mtu when plugging vhost-user ports

    vhost-user is a userspace protocol to establish connectivity
    between a virto-net frontend typically qemu and a
    userspace virtio backend such as ovs with dpdk.

    vhost-user interfaces exist only in userspace from the host perspective
    and are not represented in the linux networking stack as kernel netdevs.
    As a result attempting to set the mtu on a vhost-user interface
    using ifconfig or ip link will fail with a device not found error.

    - this change removes a call to _set_device_mtu when plugging
      vhost-user interfaces.
    - this change prevents the device not found error from occurring
      which stopped vms booting with vhost-user interfaces
      due to an uncaught exception resulting in a failure to set the
      interface type in ovs.
    - this change make creating vhost-user interface
      an atomic action.

    This latent bug is only triggered when the mtu value is set to a
    value other than 0 which was the default proir to mitaka.

    Conflicts:
     nova/network/model.py
     nova/tests/unit/virt/libvirt/test_vif.py
     nova/virt/libvirt/vif.py

    Change-Id: I2e17723d5052d57cd1557bd8a173c06ea0dcb2d4
    Closes-Bug: #1533876
    (cherry picked from commit adf7ba61dd73fe4bfffa20295be9a4b1006a1fe6)

This issue was fixed in the openstack/nova 14.0.0.0b1 development milestone.

This issue was fixed in the openstack/nova 12.0.4 release.
