live migration of a vm using the single port binding work flow is broken in train as a result of the introduction of sriov live migration

Bug #1888395 reported by Sergey Galas' on 2020-07-21
This bug affects 7 people
Affects                      Importance   Assigned to
OpenStack Compute (nova)     High         sean mooney
  Train                      High         sean mooney
  Ussuri                     High         Billy Olsen
Ubuntu Cloud Archive         Undecided    Unassigned
  Train                      Undecided    Unassigned
  Ussuri                     Undecided    Unassigned
  Victoria                   Undecided    Unassigned
networking-opencontrail      Undecided    Unassigned
nova (Ubuntu)                Undecided    Unassigned
  Focal                      High         Unassigned
  Groovy                     Undecided    Unassigned

Bug Description

[Impact]

Live migration of instances in an environment that uses neutron backends that do not support multiple port bindings will fail with error 'NotImplemented', effectively rendering live-migration inoperable in these environments.

This is fixed by first checking to ensure the backend supports the multiple port bindings before providing the port bindings.

[Test Plan]

1. Deploy a Train/Ussuri OpenStack cloud with at least 2 compute nodes using an SDN that does not support multiple port bindings (e.g. opencontrail).

2. Attempt to perform a live migration of an instance.

3. Observe that the live migration will fail without this fix due to the trace below (NotImplementedError: Cannot load 'vif_type' in the base class), and should succeed with this fix.

[Where problems could occur]

This affects the live migration code, so likely problems would arise in this area. Specifically, the check introduced is guarding information provided for instances using SR-IOV indirect migration.

Regressions would likely occur in the form of live migration errors around features that rely on multiple port bindings (e.g. SR-IOV) and not the more generic/common use case. Errors may be seen in standard network providers that are included with distro packaging, but may also be seen in scenarios where proprietary SDNs are used.

[Original Description]
It was working in Queens but fails in Train. nova-compute at the target aborts with the exception:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped
    function_name, call_dict, binary, tb)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1372, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 219, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 207, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7007, in pre_live_migration
    bdm.save()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6972, in pre_live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9190, in pre_live_migration
    instance, network_info, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9071, in _pre_live_migration_plug_vifs
    vif_plug_nw_info.append(migrate_vif.get_dest_vif())
  File "/usr/lib/python2.7/site-packages/nova/objects/migrate_data.py", line 90, in get_dest_vif
    vif['type'] = self.vif_type
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 67, in getter
    self.obj_load_attr(name)
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 603, in obj_load_attr
    _("Cannot load '%s' in the base class") % attrname)
NotImplementedError: Cannot load 'vif_type' in the base class

steps to reproduce:
- train centos 7 based deployment: 1 controller, 2 computes, libvirt + qemu-kvm, ceph shared storage, neutron with contrail vrouter virtual network;
- create and start a vm;
- live migrate it between computes.

expected result: vm migrates successfully.

rpm -qa | grep nova:

python2-novaclient-15.1.1-1.el7.noarch
openstack-nova-common-20.3.0-1.el7.noarch
python2-nova-20.3.0-1.el7.noarch
openstack-nova-compute-20.3.0-1.el7.noarch

Changed in nova:
assignee: nobody → Sergey Galas' (shrike742)
Changed in nova:
assignee: Sergey Galas' (shrike742) → Kirill Egorov (kegorov-progmaticlab)
status: New → In Progress
sean mooney (sean-k-mooney) wrote :

as requested in https://review.opendev.org/#/c/742180/4/nova/objects/migrate_data.py@97

can you please provide additional logs and reproduction steps.

specifically the nova compute server logs from the source and dest compute nodes + the conductor logs and ideally the neutron server logs for this instance?

Changed in nova:
status: In Progress → Incomplete
sean mooney (sean-k-mooney) wrote :

this does not appear to be a nova bug.

i am still waiting for the reporter to clarify what network backend driver they are using
but i suspect it's the networking_opencontrail ml2 driver

https://opendev.org/x/networking-opencontrail/src/branch/master/networking_opencontrail/ml2/mech_driver.py#L35

the opencontrail ml2 driver does not implement supported_extensions
so the default implementation which returns all extensions is used
https://github.com/openstack/neutron-lib/blob/96e1d028b84419d187f085b587e672447df00ae3/neutron_lib/plugins/ml2/api.py#L458-L471

as a result support for the 'binding-extended' extension
https://github.com/openstack/neutron-lib/blob/master/neutron_lib/api/definitions/portbindings_extended.py
is likely incorrectly being reported

can you provide the output of "openstack extension list | grep binding" to confirm

the only extension that should be listed is

| Port Binding | binding | Expose port bindings of a virtual port to external application

the following one should not be present on a deployment using networking-opencontrail since it does not support it

| Port Bindings Extended | binding-extended | Expose port bindings of a virtual port to external application

but based on the behaviour we are seeing it seems this is not the case.

can you confirm?
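
To illustrate the suspected mechanism, here is a minimal sketch in Python. The class and function names (`BaseMechDriver`, `OpenContrailLikeDriver`, `FilteringDriver`, `advertised_extensions`) are hypothetical stand-ins and are not the real neutron-lib or networking-opencontrail code; only the method name `supported_extensions` and the `binding` / `binding-extended` aliases are taken from the comment above. It assumes the base class hands back every extension it is given unless a driver overrides the method.

```python
# Hypothetical stand-ins for illustration only; not the real neutron-lib API.


class BaseMechDriver:
    def supported_extensions(self, extensions):
        # Assumed default behaviour: return every extension unchanged.
        return extensions


class OpenContrailLikeDriver(BaseMechDriver):
    # No override, so the pass-through default above is inherited.
    pass


class FilteringDriver(BaseMechDriver):
    # A driver that knows it cannot do multiple port bindings.
    UNSUPPORTED = {"binding-extended"}

    def supported_extensions(self, extensions):
        return {ext for ext in extensions if ext not in self.UNSUPPORTED}


def advertised_extensions(driver, candidate_aliases):
    """Return the set of extension aliases the driver lets through."""
    return set(driver.supported_extensions(set(candidate_aliases)))


ALL_ALIASES = {"binding", "binding-extended"}
inherited = advertised_extensions(OpenContrailLikeDriver(), ALL_ALIASES)
filtered = advertised_extensions(FilteringDriver(), ALL_ALIASES)

print(sorted(inherited))
print(sorted(filtered))
```

Under these assumptions, the driver that inherits the default still advertises `binding-extended`, while the driver that filters it out advertises only `binding`.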

tags: added: live-migration network
tags: added: neutron
removed: network
sean mooney (sean-k-mooney) wrote :

moving this to triaged and setting this to high
the regression was introduced in train by
https://opendev.org/openstack/nova/commit/fd8fdc934530fb49497bc6deaa72adfa51c8783a
specifically
https://github.com/openstack/nova/blob/b8ca3ce31ca15ddaa18512271c2de76835f908bb/nova/compute/manager.py#L7654-L7656

adding

  migrate_data.vifs = \
                migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
                    instance.get_network_info())

unconditionally activates the code path that requires multiple port bindings,
because when support for multiple port bindings was added in rocky it used migrate_data.vifs as a sentinel
for the new workflow.

i.e. if it is populated the new migration workflow should be used.

  migrate_data.vifs = \
                migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
                    instance.get_network_info())

should be

if self.network_api.supports_port_binding_extension(ctxt):
    migrate_data.vifs = migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(instance.get_network_info())

this bug prevents live migration with any neutron backend that does not support multiple port bindings from train on so i am setting this to high.
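
The sentinel behaviour described above can be sketched in a few lines of Python. Everything here is a simplified, hypothetical stand-in (`SkeletonVIF`, `MigrateData`, `prepare_source`, `plug_on_destination`); it is not the nova code, and it only models a backend without the multiple port bindings extension, where nothing ever fills in `vif_type` on the skeleton VIFs.

```python
# Hypothetical stand-ins for illustration only; not the real nova objects.


class SkeletonVIF:
    """A VIF entry whose destination fields were never filled in."""

    def __init__(self, port_id):
        self.port_id = port_id
        self._fields = {}

    @property
    def vif_type(self):
        # Mimics the error in the traceback: the field was never set.
        if "vif_type" not in self._fields:
            raise NotImplementedError(
                "Cannot load 'vif_type' in the base class")
        return self._fields["vif_type"]

    def get_dest_vif(self):
        return {"id": self.port_id, "type": self.vif_type}


class MigrateData:
    def __init__(self):
        # Sentinel: None means "use the single port binding workflow".
        self.vifs = None


def prepare_source(port_ids, supports_binding_extended, guarded):
    data = MigrateData()
    # Unguarded mirrors the regression: vifs is always populated.
    if not guarded or supports_binding_extended:
        data.vifs = [SkeletonVIF(port_id) for port_id in port_ids]
    return data


def plug_on_destination(data):
    if data.vifs is None:
        return "single-port-binding workflow"
    for vif in data.vifs:
        vif.get_dest_vif()  # needs vif_type from a destination binding
    return "multiple-port-binding workflow"


try:
    plug_on_destination(prepare_source(["port-1"], False, guarded=False))
    unguarded_result = "no error"
except NotImplementedError as exc:
    unguarded_result = str(exc)

guarded_result = plug_on_destination(
    prepare_source(["port-1"], False, guarded=True))

print(unguarded_result)
print(guarded_result)
```

In this sketch the unguarded path ends in the same `NotImplementedError` message as the traceback in the description, while the guarded path leaves `vifs` unset and falls back to the single port binding workflow.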

Changed in nova:
importance: Undecided → High
status: Incomplete → Triaged
Changed in nova:
assignee: Kirill Egorov (kegorov-progmaticlab) → sean mooney (sean-k-mooney)
status: Triaged → In Progress

Related fix proposed to branch: master
Review: https://review.opendev.org/750217

Reviewed: https://review.opendev.org/747454
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=71bc6fc9b89535679252ffe5a737eddad60e4102
Submitter: Zuul
Branch: master

commit 71bc6fc9b89535679252ffe5a737eddad60e4102
Author: Sean Mooney <email address hidden>
Date: Fri Aug 21 17:17:50 2020 +0000

    add functional regression test for bug #1888395

    This change adds a functional regression test that
    asserts the broken behavior when trying to live migrate
    with a neutron backend that does not support multiple port
    bindings.

    Change-Id: I470a016d35afe69809321bd67359f466c3feb90a
    Partial-Bug: #1888395

Reviewed: https://review.opendev.org/742180
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8f3be6b3c5af91d215b4a0cecb9be098e8d8799
Submitter: Zuul
Branch: master

commit b8f3be6b3c5af91d215b4a0cecb9be098e8d8799
Author: root <email address hidden>
Date: Sat Jul 18 00:32:54 2020 -0400

    Set migrate_data.vifs only when using multiple port bindings

    In the rocky cycle nova was enhanced to support the multiple
    port binding live migration workflow when neutron supports
    the binding-extended API extension.
    When the migration_data object was extended to support
    multiple port bindings, populating the vifs field was used
    as a sentinel to indicate that the new workflow should
    be used.

    In the train release
    I734cc01dce13f9e75a16639faf890ddb1661b7eb
    (SR-IOV Live migration indirect port support)
    broke the semantics of the migrate_data object by
    unconditionally populating the vifs field

    This change restores the rocky semantics, which are depended
    on by several parts of the code base, by only conditionally
    populating vifs if neutron supports multiple port bindings.

    Co-Authored-By: Sean Mooney <email address hidden>
    Change-Id: Ia00277ac8a68a635db85f9e0ce2c6d8df396e0d8
    Closes-Bug: #1888395

Changed in nova:
status: In Progress → Fix Released

@Sean, is it possible to cherry-pick this fix into Train and into Ussuri?

sean mooney (sean-k-mooney) wrote :

yes we just have not done it yet but we should

Change abandoned by Xav Paice (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/759151
Reason: need to re-do this change with the correct tags for the cherry-pick.

I've just fixed up the merge conflicts for the cherry-pick, but this will need the original author to review as well as the usual thorough testing for a backport.

Vern Hart (vern) wrote :

Subscribing field critical as this is blocking a train deployment.


I upgraded nova-compute nodes to test live migration patch
 - HTTPS_PROXY=http://172.31.254.9:8080/ sudo -E apt-add-repository ppa:billy-olsen/lp1888395-train
 - sudo apt install nova-api-metadata nova-common nova-compute nova-compute-kvm nova-compute-libvirt python3-nova
 - sudo systemctl restart nova-*

Created VM and tried to live-migrate it
 - openstack server create --image auto-sync/ubuntu-xenial-16.04-amd64-server-20200922-disk1.img --network VN_Red --flavor g1t1.small --boot-from-volume 3 --availability-zone zone1 vern1
 - openstack server migrate vern1 --live-migration
 - nova.exception.InternalError: Failure running os_vif plugin unplug method: Failed to unplug VIF VIFGeneric(active=True,address=02:5b:d4:23:c6:b1,has_traffic_filtering=True,id=5bd423c6-b16d-4678-a05a-3ab94af82d4a,network=Network(e593a2d7-f9e0-4038-957a-1378977bc314),plugin='vrouter',port_profile=<?>,preserve_on_delete=False,vif_name='tap5bd423c6-b1'). Got error: Error during the call to vrouter-port-control: ('vrouter-port-control', '--oper=delete', '--uuid=5bd423c6-b16d-4678-a05a-3ab94af82d4a')

Upgraded nova-cloud-controller units as well
 - HTTPS_PROXY=http://172.31.254.9:8080/ sudo -E apt-add-repository ppa:billy-olsen/lp1888395-train
 - sudo apt install nova-api-os-compute nova-common nova-conductor nova-scheduler nova-spiceproxy python3-nova
 - sudo systemctl restart nova-* apache2

Retrying the above create and migrate commands
 - Same error.

Full trace from `openstack server show vern1 -f value -c fault | sed 's/\\n/\n/g;s/\\'\''/'\''/g'`:

{'code': 500, 'created': '2020-10-27T06:16:17Z', 'message': "Failure running os_vif plugin unplug method: Failed to unplug VIF VIFGeneric(active=True,address=02:26:27:d1:5a:36,has_traffic_filtering=True,id=2627d15a-3604-4a11-80d7-12b92ed9cea6,network=Network(e593a2d7-f9e0-4038-957a-1378977bc314),plugin='vrouter',po", 'details': 'Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/os_vif/__init__.py", line 110, in unplug
    plugin.unplug(vif, instance_info)
  File "/usr/lib/python3/dist-packages/vif_plug_vrouter/vrouter.py", line 301, in unplug
    self._vrouter_port_delete(instance_info, vif)
  File "/usr/lib/python3/dist-packages/vif_plug_vrouter/vrouter.py", line 293, in _vrouter_port_delete
    vhostuser_socket, vhostuser_mode)
  File "/usr/lib/python3/dist-packages/oslo_privsep/priv_context.py", line 245, in _wrap
    return self.channel.remote_call(name, args, kwargs)
  File "/usr/lib/python3/dist-packages/oslo_privsep/daemon.py", line 204, in remote_call
    raise exc_type(*result[2])
vif_plug_vrouter.exception.VrouterPortControlError: Error during the call to vrouter-port-control: ('vrouter-port-control', '--oper=delete', '--uuid=2627d15a-3604-4a11-80d7-12b92ed9cea6')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/vif.py", line 820, in _unplug_os_vif
    os_vif.unplug(vif, instance_info)
  File "/usr/lib/python3/dist-packages/os_vif/__init__.py", line 115, in unplug
    raise os_vif.exception.UnplugException(vif=vif, err=err)
os_v...


sean mooney (sean-k-mooney) wrote :

the error you are getting is from the vrouter plugin
did you also update that? from what i can see you just updated nova.

the vrouter os vif plugin is a third party plugin that is not part of os-vif or vrouter and you will need to install the correct version of both.

https://launchpad.net/~billy-olsen/+archive/ubuntu/lp1888395-train only contains nova so it will not be sufficient to just enable that to test this.

vrouter-port-control is a command line client that is provided by the vrouter product so the failures that you are seeing are not part of openstack but of vrouter and the vrouter integration.

this is the code that is failing
https://github.com/tungstenfabric/tf-nova-vif-driver/blob/master/vif_plug_vrouter/vrouter.py#L100-L135

i would guess that the version of contrail you have deployed uses a different set of command line arguments or is otherwise incompatible or broken independently of the nova issue tracked by this bug.

summary: - shared live migration of a vm with a vif is broken in train
+ live migration of a vm using the single port binding work flow is broken
+ in train as a result of the introduction of sriov live migration
Vern Hart (vern) wrote :

I agree that the vrouter-port-control command error is a separate issue. Billy's train backport of Sean's fix did indeed resolve the "Cannot load vif_type" error we were seeing on live migration.

Just for reference, the contrail related vrouter-port-control issue we were seeing turned out to be an apparmor permission issue. As a workaround we turned off enforcing apparmor rules.

Adam Vinsh (adam-vinsh) wrote :

Hello team,
Anything we can do to help with merging the Train cherry pick?

Adam Vinsh (adam-vinsh) wrote :

I'll add that we are using the nsx-t neutron plugin and hit this bug in live migration after our train upgrade.

Changed in nova (Ubuntu Groovy):
status: New → Fix Released
Changed in nova (Ubuntu Focal):
status: New → Triaged
importance: Undecided → High
Corey Bryant (corey.bryant) wrote :

Billy/Xav, Is there any chance these patches are appropriate to land in upstream stable/train?

Billy Olsen (billy-olsen) wrote :

Corey - they are, and I've been working on them here and there to get them into shape. The Train version might slightly differ in the functional tests that are being added, but the actual code change is the same.

description: updated
sean mooney (sean-k-mooney) wrote :

i thought that canonical did not reuse upstream project bugs for tracking the change in the ubuntu cloud archive?

the convention previously was to file a different bug for the cloud archive that referenced the upstream bug, no?

using the same bug for upstream and downstream kind of makes it hard to set tags properly.
it's not the end of the world but in general we do not add downstream nova/cloud archive versions as part of upstream triage.

you have kept the original bug description i guess so that is ok, but rewriting the bug description to follow the downstream bug template could cause confusion in some cases so that might be better to keep in a comment. that said it's probably more clear what the error is now than it was before.

sean mooney (sean-k-mooney) wrote :

by the way i also want to see this backported to train upstream so any reviews etc. that ye can provide to make that happen more quickly would be great :)

Corey Bryant (corey.bryant) wrote :

@sean I don't think we have a standard of not re-using upstream bugs for distro. I've thought about this frequently but up to this point nobody complained, to me at least, so I've just continued on. I personally like to re-use the upstream bug since it's nice to have everything in one place and it's fewer bugs to track from my pov. I can see that it could be annoying/noisy from an upstream pov though.

Out of curiosity, I just searched through Nova bugs that are fix-committed or fix-released that also have the 'verification-done' tag (one of the tags used for SRU processing into older Ubuntu releases), and found that there are a reasonable number of upstream bugs that are also used for Ubuntu & Cloud Archive tracking: https://bugs.launchpad.net/nova/+bugs?search=Search&field.status:list=FIXCOMMITTED&field.status:list=FIXRELEASED&field.tag=verification-done

This issue was fixed in the openstack/nova 21.2.0 release.

Robie Basak (racb) wrote :

I see that this code change already exists in Ubuntu Hirsute, so I'm setting that task Fix Released.

> rewriting the bug description to follow the downstream bug template could cause confusion in some cases so that might be better to keep in a comment

FWIW, from the Ubuntu SRU team perspective I think it'd be absolutely fine for the information to be in a comment rather than the bug description if upstream would prefer that.

Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal

Hello Sergey, or anyone else affected,

Accepted nova into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:21.1.2-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Łukasz Zemczak (sil2100) wrote :

Hello Sergey, or anyone else affected,

Accepted nova into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:21.2.0-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.
