macvtap: possible race of interfaces scan/deletion at migration

Bug #1817548 reported by git.user
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Expired
High
Unassigned

Bug Description

A few days ago I stumbled upon a failed migration of VM with macvtap interface. Nova "joyed" me with error "Unsupported VIF type binding_failed". A short investigation lead me to the real problem: macvtap agent crushed

2019-02-21 18:40:38.636 54364 ERROR neutron.agent.linux.utils [req-362a8541-643e-4355-b3f7-a02fb9358b66 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "macvtap1" does not exist.

2019-02-21 18:40:38.637 54364 ERROR oslo_service.service [req-362a8541-643e-4355-b3f7-a02fb9358b66 - - - - -] Error starting thread.: ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "macvtap1" does not exist.
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service Traceback (most recent call last):
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/oslo_service/service.py", line 729, in run_service
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service service.start()
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158, in wrapper
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service result = f(*args, **kwargs)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 111, in start
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service self.daemon_loop()
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158, in wrapper
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service result = f(*args, **kwargs)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 447, in daemon_loop
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service device_info = self.scan_devices(previous=device_info, sync=sync)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158, in wrapper
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service result = f(*args, **kwargs)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 377, in scan_devices
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service current_devices = self.mgr.get_all_devices()
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/macvtap/agent/macvtap_neutron_agent.py", line 138, in get_all_devices
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service mac = ip_lib.get_device_mac(device_name)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 950, in get_device_mac
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service return IPDevice(device_name, namespace=namespace).link.address
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 541, in address
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service return self.attributes.get('link/ether')
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 565, in attributes
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service return self._parse_line(self._run(['o'], ('show', self.name)))
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 360, in _run
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service return self._parent._run(options, self.COMMAND, args)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 88, in _run
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service return self._execute(options, command, args)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 102, in _execute
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service log_fail_as_error=self.log_fail_as_error)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 151, in execute
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service raise ProcessExecutionError(msg, returncode=returncode)
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service ProcessExecutionError: Exit code: 1; Stdin: ; Stdout: ; Stderr: Device "macvtap1" does not exist.
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service
2019-02-21 18:40:38.637 54364 ERROR oslo_service.service
2019-02-21 18:40:38.640 54364 INFO neutron.plugins.ml2.drivers.agent._common_agent [-] Stopping Macvtap agent agent.

I have digg a bit and it looks like there's a race between interface deletion and a periodic scan in daemon_loop. For now get_all_devices (neutron/plugins/ml2/drivers/macvtap/agent/macvtap_neutron_agent.py +138) does'n check interface existence before get_device_mac:

    def get_all_devices(self):
...............
        for device_name in all_device_names:
            if device_name.startswith(constants.MACVTAP_DEVICE_PREFIX):
                if ip_lib.device_exists(device_name):
                    mac = ip_lib.get_device_mac(device_name)
                    self.mac_device_name_mappings[mac] = device_name
                    devices.add(mac)

It's quite possible that device will be deleted while we're looping over (in my case with ~10% probability).
An obvious 'solution' is to check device on every loop but it doesn't really solve an issue though reduce race window so that I able to migrate VM for about 100 times before crush.
I'll attache this patch anyway.

Revision history for this message
git.user (git.user) wrote :
Changed in neutron:
importance: Undecided → High
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I have some questions related to this issue.
1) Are you migrating or live-migrating the VM?
2) Is this error 100% reproducible? When you migrate the VM, does it always happen?
3) Did you happen to know if this device "macvtap1" is created after the agent error?
4) Did you see Nova compute agent (in debug mode) creating this interface during the migration?
5) Are you aware of [1]?
"""
Instance migration requires the same values for the physical_interface_mapping configuration option on each compute node. For more information, see https://bugs.launchpad.net/neutron/+bug/1550400.
"""

Regards.

[1]https://docs.openstack.org/neutron/pike/admin/config-macvtap.html

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

Let's close it as INCOMPLETE for that point but feel free to re-open it with more details if needed.

The patch proposed sounds more like a hack than a real fix and the bug as been reported long time ago without back from reported.

I would point the CI issue that we have suffered few days ago, may be related.

  https://bugs.launchpad.net/neutron/+bug/2012510

Changed in neutron:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.