linuxbridge-agent missed updated device sometimes

Bug #1781129 reported by Yuki Nishiwaki
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Medium
Yuki Nishiwaki

Bug Description

* Version:
master branch head as of 11 July 2018
https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055

* Summary:
When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.

---
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] raise exception.VirtualInterfaceCreateException()
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
---

* Reproducing:
Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step

---
let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.

0. Previous device info is like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set(),
                        'removed': set(),
                        'timestamps': {"tapA": 1}}

1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
-> current_devices is "tapA"

__ disappeared tapA due to rebuilding VM __

2. Get timestamp(interface index in the case of linuxbridge-agent) https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
-> current timestamp is {"tapA": None}. this is because we failed to get interface information here

3. Check device locally changed or not
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
-> locally "tapA" is detected as locally changed device. because timestamp information is change from before (1 != None)

4. Generate device_info like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set("tapA"),
                        'removed': set(),
                        'timestamps': {"tapA": None}}

5. Process linuxbridge-agent interface plugging logic for tapA, but checking device existence failed because there is no such a device. here note even if check for device existence failed, this function won't raise exception and re-sync won't happen https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531

-- appeared tapA again due to rebuilding VM --
-- next scan_device iteration start --

6. Get current_devices
-> current_devices is "tapA"

7. Get timestamp
-> current timestamp is {"tapA":2}.

8. Check device locally changed or not
-> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369

9. Generate device_info like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set(),
                        'removed': set(),
                        'timestamps': {"tapA": 2}}

next iteration is expected to detect device updated but didn't in this case.
So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/581648

Changed in neutron:
assignee: nobody → Yuki Nishiwaki (uckey-1067)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/582084

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Yuki Nishiwaki (<email address hidden>) on branch: master
Review: https://review.openstack.org/582084
Reason: I unintentionally created new patch set when I try to update https://review.openstack.org/#/c/581648/ .
plz see above original one

Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/581648
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Submitter: Zuul
Branch: master

commit b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Author: Yuki Nishiwaki <email address hidden>
Date: Wed Jul 25 16:05:30 2018 +0900

    Dont use dict.get() to know certain key is in dict

    In CommonAgentLoop class, there is logic to detect tap device is changed
    locally or not by comparing timestamp with previous.
    Sometimes timestamp value could be None depending on the timing (see bug/1781129)

    But current _get_devices_locally_modified logic can not detect local
    change from None to something because _get_devices_locally_modified
    function don't always compare if previous timestamp value was None.

    In order not to miss updated device always, better not to use dict.get() to
    know previous iteration have timestamp or not.

    Change-Id: Ib0361ad5c281f88558e8e048cfec588b9f9b1de4
    Closes-Bug: #1781129

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b3

This issue was fixed in the openstack/neutron 13.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers