Hi Sean,

Thanks a lot for the thorough review and evaluation of the bug, I appreciate it! It took me a while to find the time to parse it all and get you a proper response.

> 1 if the sriov nic agent is used for standard sriov vnic types (direct,
> direct-physical, macvtap) nics __must not__ be in __switchdev__ mode,
> __must__ be in __legacy__ mode

I'm not sure exactly what you mean here, but there is only one agent (openvswitch-agent) on the compute and network nodes. That agent uses the configuration in [2] and is not configured for SR-IOV; the switchdev/hardware-offloading configuration is done in Open vSwitch.

> 2 vdpa support in nova currently does not support any move operations,
> vdpa support in nova requires the nic to be in switchdev mode.

I don't believe we are using this.

> 3 hardware offloaded ovs uses the ml2/ovs or ml2/ovn mechanism
> drivers and does not use the sriov nic agent.

Right, this is how we are doing it.

> 4 we do not support using the sriov nic agent and ovs hardware offload
> on the same physical nic. when using the sriov-nic-agent the nic must be
> in legacy mode and when using hardware offload it must be in switchdev
> mode. live migration from a host using the sriov nic agent to hardware
> offloaded ovs was not in scope.

The migration is between 2 switchdev hosts with ML2/OVS.

> this is caused by trying to move other vms that have a neutron sriov
> port with shelve and unshelve
> https://bugs.launchpad.net/nova/+bug/1851545

The bug above might be one of the possible problems related to this message, but if you follow the logs [3] you will see that here it is happening because:

1 - During pre_live_migration, the neutron port is attached on the destination host [4]
2 - pre_live_migration fails on the destination host and triggers an exception on the source host [5] [6]->[7]
3 - The rollback is triggered and tries to re-attach the port to the source host, but the QEMU instance still holds the PCI address [8], so the PCI error message is triggered

> what i suspect has happened here is the live migration fails in the
> migration phase after pre_live_migrate

As I mentioned above, the failure was in the pre_live_migration function (I caused it in my env, but it happened for some reason at the customer site).

> unless you have correctly configured network manager in the guest to
> retrigger on the hotplug of the interface the guest won't have network
> connectivity restored until it reboots and the on boot network
> configuration scripts run.

So, these are standard Ubuntu images and are for sure configured for hotplug, given that they don't lose connectivity when the migration works.

So, it seems that the bug here is that rollback_live_migration_at_source() is called both when the migration fails in pre_live_migration() and when it fails in live_migration, but for the case where something fails in pre_live_migration this shouldn't be done (see the sketch below).

Now I'm curious whether this attempt/error to re-attach the device at the same address is what is making the instance lose connectivity; I'll test that. Please let me know your thoughts on my suspicion above.
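For reference, here is a minimal, self-contained Python sketch of the flow described in steps 1-3 above and of the change I am suggesting. The names pre_live_migration(), rollback_live_migration_at_source() and the exception classes only mirror the calls referenced in the logs; the bodies, signatures and sample values are hypothetical simplifications, not Nova's actual code:

# Hypothetical, simplified sketch -- names mirror the Nova calls referenced
# in the logs ([4]-[8]); the bodies are illustrative only.

class PreLiveMigrationError(Exception):
    """Destination-side setup failed (step 2 above)."""


class PciDeviceBusyError(Exception):
    """The PCI address is still claimed by the running QEMU guest (step 3)."""


def pre_live_migration(dest_host, port):
    # Step 1: the neutron port is attached on the destination host [4].
    print(f"attaching port {port} on {dest_host}")
    # Step 2: something fails on the destination and the exception
    # propagates back to the source compute [5][6]->[7].
    raise PreLiveMigrationError(f"failed to plug {port} on {dest_host}")


def rollback_live_migration_at_source(source_host, port, pci_addr):
    # Step 3: the rollback tries to re-attach the port on the source, but the
    # QEMU process never released the PCI address, so the attach fails [8].
    raise PciDeviceBusyError(
        f"{pci_addr} is still in use by the guest on {source_host}")


def live_migrate(source_host, dest_host, port, pci_addr):
    try:
        pre_live_migration(dest_host, port)
    except PreLiveMigrationError:
        # Current behaviour: the same rollback path runs whether the failure
        # happened in pre_live_migration() or later, during live_migration.
        # My suggestion is to skip the source-side re-attach in this case,
        # since the device was never detached from the source guest.
        try:
            rollback_live_migration_at_source(source_host, port, pci_addr)
        except PciDeviceBusyError as exc:
            print(f"rollback hits the PCI error from the logs: {exc}")


if __name__ == "__main__":
    # Hypothetical host/port/PCI values, just to exercise the flow.
    live_migrate("compute-0", "compute-1", "port-29e7d319", "0000:3b:02.1")

The point of the sketch is just that, when the failure happens in pre_live_migration(), the port was never detached from the source guest, so the source-side re-attach in the rollback is what produces the PCI address error seen in [8].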
Erlon
_____________________
[1] neutron.conf: https://gist.github.com/sombrafam/ca6ba9224629a69e48e571b5e45f2040
[2] openvswitch_agent.ini: https://gist.github.com/sombrafam/feab8c8f7a389d9c92e89f35a629abb0
[3] Detailed migration error logs of 29e7d319 from compute 0 -> compute 1: https://gist.githubusercontent.com/sombrafam/6edfc04fc45631621c73054909df510d/raw/838f9d6f4139fc4c52c8b22d5008a61d45dca0f6/migration%2520log
[4] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L130
[5] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L179
[6] https://github.com/openstack/nova/blob/stable/ussuri/nova/virt/libvirt/driver.py#L9585
[7] https://github.com/openstack/nova/blob/stable/ussuri/nova/compute/manager.py#L8157
[8] https://gist.github.com/sombrafam/6edfc04fc45631621c73054909df510d#file-migration-log-L266