[ml2/ovs]Empty binding_levels=[] cause ovs-agent skipped to process port
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | Opinion | Medium | LIU Yulong | |
Bug Description
In our production environment we noticed some VM boot failures that follow this sequence:
1. A port is created for nova (port revision_number 0->1).
2. Nova boots the VM with --nic port-id.
3. Nova schedules the VM to a host and plugs the port.
4. Nova updates the port device_owner (port revision_number 1->2).
5. Nova updates the port host (port revision_number 2->3).
(Yes, nova calls update_port twice!)
6. Before call real _bind_port_
7. Neutron-server tries to bind this port.
8. The neutron-ovs-agent rpc_loop tries to get the port details.
9. The neutron-ovs-agent info cache RPC gets empty binding_levels=[] and skips processing the port.
10. Neutron-server finishes binding the port and sends the info cache update,
and now the port revision_number is still 3, while binding_
11. Neutron-ovs-agent gets the new info cache notification, but the revision_number has not changed, so the cache is not updated.
The port will never be processed again.
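The sequence above can be illustrated with a toy model. This is only a sketch, not neutron code: `PortInfoCache`, `should_process`, and the dictionary layout are hypothetical names invented here, and the sketch assumes the agent-side cache drops any notification whose revision_number is not higher than the cached one.

```python
class PortInfoCache:
    """Hypothetical stand-in for an agent-side port cache (not neutron's class)."""

    def __init__(self):
        self._ports = {}

    def update(self, port):
        """Store the port unless its revision_number is not newer than the cached one."""
        cached = self._ports.get(port["id"])
        if cached is not None and port["revision_number"] <= cached["revision_number"]:
            return False  # assumed behaviour: same revision means "nothing changed"
        self._ports[port["id"]] = dict(port)
        return True

    def get(self, port_id):
        return self._ports[port_id]


def should_process(port):
    """Assumed agent rule: a port with empty binding_levels is skipped."""
    return bool(port["binding_levels"])


cache = PortInfoCache()

# Steps 8-9: the agent fetches the port before the bind has finished.
unbound = {"id": "port0", "revision_number": 3, "binding_levels": []}
# Step 10: the bind is done, but revision_number is still 3.
bound = {
    "id": "port0",
    "revision_number": 3,
    "binding_levels": [{"driver": "openvswitch", "level": 0}],
}

first_accepted = cache.update(unbound)
second_accepted = cache.update(bound)
processed = should_process(cache.get("port0"))
```

Under these assumptions the second notification is dropped, so the cached copy keeps binding_levels=[] and the port stays skipped.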
summary: - Empty binding_levels=[] cause ovs-agent skipped to process port
+ [ml2/ovs]Empty binding_levels=[] cause ovs-agent skipped to process port
description: updated
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
Hi Yulong,
Thanks for the report! I guess this is a bug that occurs infrequently. How frequent is it? Did you observe this on master or on some other version? Do you have a method that reproduces it at will? The cause sounds timing dependent, so maybe inserting a sleep() at a critical place would help?
It seems to me a possible fix would be to ensure the port revision_number gets bumped when the binding_levels change from empty to something (between steps 6 and 10). Do you have an idea why the revision_number is not bumped between steps 6 and 10? In my environment (where this bug does not occur), if I boot a VM with --nic port-id=port0, then port0's revision_number is 4 when everything is done.
Or do you propose a different fix? Do you want to take this bug?
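The fix suggested in this comment (bumping revision_number when binding_levels changes) could look roughly like the sketch below. It is not a patch against neutron: `finish_binding` and `is_newer` are hypothetical helpers, and the port is modelled as a plain dictionary.

```python
def finish_binding(port, binding_levels):
    """Return a copy of the port with the bind result applied.

    Hypothetical helper: revision_number is bumped only when
    binding_levels actually changes.
    """
    updated = dict(port, binding_levels=list(binding_levels))
    if updated["binding_levels"] != port["binding_levels"]:
        updated["revision_number"] = port["revision_number"] + 1
    return updated


def is_newer(cached, incoming):
    """Assumed agent-side check: accept only a higher revision_number."""
    return incoming["revision_number"] > cached["revision_number"]


# What the agent cached while the bind was still in progress.
cached = {"id": "port0", "revision_number": 3, "binding_levels": []}

# The server finishes the bind and bumps the revision.
bound = finish_binding(cached, [{"driver": "openvswitch", "level": 0}])

# Applying the same levels again leaves the revision unchanged.
rebound = finish_binding(bound, bound["binding_levels"])
```

With the bump in place, the notification sent after the bind carries a higher revision_number than the cached copy, so a cache that compares revisions would accept it.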