VM can't be accessed, openflow rules are removed and neutron-ovs-agent is always reporting errors after rabbitmq is in split-brain state

Bug #1873265 reported by Yi Yang
This bug affects 2 people
Affects          Status       Importance   Assigned to   Milestone
neutron          Incomplete   Undecided    Unassigned
oslo.messaging   New          Undecided    Unassigned

Bug Description

After the RabbitMQ cluster entered a split-brain state because of a network issue, the neutron OVS agents on the compute nodes keep reporting errors and some OpenFlow rules are removed, so VMs can't be accessed normally.

Here is some of the important error output:

2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ede4a709807e43dc90df4469ca9d88f3
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 328, in _report_state
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/agent/rpc.py", line 97, in report_state
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = self.waiters.get(msg_id, timeout=timeout)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 'to message ID %s' % msg_id)
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID ede4a709807e43dc90df4469ca9d88f3
2020-04-16 08:35:05.578 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2020-04-16 08:35:05.586 21720 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 30.01 sec
2020-04-16 08:35:44.208 21720 ERROR neutron.common.rpc [req-fe5b94c8-c725-4e77-b47b-1b2c0a1e1583 - - - - -] Timeout in RPC method update_device_list. Waiting for 40 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID c1bb1fe4c2fe42259dd3726e7ff9e48a
2020-04-16 08:35:44.216 21720 WARNING neutron.common.rpc [req-fe5b94c8-c725-4e77-b47b-1b2c0a1e1583 - - - - -] Increasing timeout for update_device_list calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID c1bb1fe4c2fe42259dd3726e7ff9e48a
2020-04-16 08:36:05.413 21720 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : e9e7407c173a438096eb1fad7e22d9fb
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID cc669410f7a345649ca8c653f8ccf939
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 328, in _report_state
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/agent/rpc.py", line 97, in report_state
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = self.waiters.get(msg_id, timeout=timeout)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 'to message ID %s' % msg_id)
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID cc669410f7a345649ca8c653f8ccf939
2020-04-16 08:36:05.592 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2020-04-16 08:36:05.600 21720 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 30.01 sec
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fe5b94c8-c725-4e77-b47b-1b2c0a1e1583 - - - - -] Error while processing VIF ports: MessagingTimeout: Timed out waiting for a reply to message ID c1bb1fe4c2fe42259dd3726e7ff9e48a
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2236, in rpc_loop
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent port_info, provisioning_needed)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/osprofiler/profiler.py", line 159, in wrapper
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent result = f(*args, **kwargs)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1802, in process_network_ports
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent failed_devices['added'] |= self._bind_devices(need_binding_devices)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 938, in _bind_devices
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.conf.host, agent_restarted=agent_restarted)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/agent/rpc.py", line 170, in update_device_list
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent agent_restarted=agent_restarted)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 173, in call
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent time.sleep(wait)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.force_reraise()
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent six.reraise(self.type_, self.value, self.tb)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 150, in call
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return self._original_context.call(ctxt, method, **kwargs)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=self.retry)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent retry=retry)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = self.waiters.get(msg_id, timeout=timeout)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 'to message ID %s' % msg_id)
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID c1bb1fe4c2fe42259dd3726e7ff9e48a
2020-04-16 08:36:24.558 21720 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2020-04-16 08:36:24.568 21720 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fe5b94c8-c725-4e77-b47b-1b2c0a1e1583 - - - - -] Agent rpc_loop - iteration:0 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 10, 'removed': 0}}. Elapsed:137.142 loop_count_and_wait /var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1944
2020-04-16 08:36:24.575 21720 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-fe5b94c8-c725-4e77-b47b-1b2c0a1e1583 - - - - -] Loop iteration exceeded interval (2 vs. 137.142431974)! loop_count_and_wait /var/lib/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1951
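
(The warnings above point at the rpc_response_timeout option; a minimal neutron configuration sketch follows, with 120 seconds as an illustrative value only, since raising the timeout can only mask, not fix, an overloaded or broken messaging layer:)

    [DEFAULT]
    # Default is 60 seconds; applies to agent-to-server RPC calls.
    rpc_response_timeout = 120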

Tags: rocky
Revision history for this message
Yi Yang (yangyi01) wrote :

neutron-ovs-agent log file

summary: VM can't be accessed, openflow rules are removed and neutron-ovs-agent
- is always reporting errors and after rabbitmq is in split-brain state
+ is always reporting errors after rabbitmq is in split-brain state
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Yi:

Apart from the log, can you elaborate a bit more on what happened in your system?

Can you provide the version you are using?

Are you using a modified OVS agent? I don't see "pod_health_probe_method_ignore_errors" in the OVS agent RPC calls.

Regards.

Revision history for this message
Yi Yang (yangyi01) wrote :

@Rodolfo Alonso This is a Rocky build from 2019/09/01; we use openstack-helm to run neutron-ovs-agent as a Kubernetes pod.

In my environment, the RabbitMQ cluster entered a split-brain state because of a physical network loop. Once the cluster is split-brained, the RabbitMQ nodes fall out of message sync with one another, so neutron-ovs-agent sometimes cannot read the full Neutron network data and therefore cannot install the full set of flows into the OVS bridges. The result is that VMs can't be accessed normally because the correct flows aren't there. neutron-ovs-agent is also restarted by Kubernetes because its liveness probe fails.

I don't understand why a neutron-ovs-agent restart triggers a complete re-installation of all flows; this obviously interrupts data-plane forwarding on the OVS bridges. Ideally it should only install new or modified flows.

In our environment, the OVS bridges now only have the default flows installed; many flows are missing.

Revision history for this message
Yi Yang (yangyi01) wrote :

@Rodolfo Alonso Per my check, openstack-helm/neutron/templates/bin/_health-probe.py.tpl issues the pod_health_probe_method_ignore_errors RPC. It implements the Kubernetes liveness probe for the pod: neutron-ovs-agent is considered healthy as long as it consumes the RPC message, even though the method itself doesn't exist on the agent. So this is expected behaviour and not a problem.
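
(For context, a minimal, hypothetical sketch of that probe logic, using oslo.messaging's public RPC client API; the topic name, context, and configuration handling below are assumptions and not the actual openstack-helm template:)

    # Hypothetical liveness-probe sketch: send an RPC the agent does not
    # implement and check whether anything consumes it.
    import socket
    import sys

    from oslo_config import cfg
    import oslo_messaging


    def agent_consumes_rpc(timeout=30):
        cfg.CONF(sys.argv[1:])  # loads transport_url etc. from --config-file
        transport = oslo_messaging.get_rpc_transport(cfg.CONF)
        # Assumed topic/server; the real template derives them per agent type.
        target = oslo_messaging.Target(topic='q-agent-notifier',
                                       server=socket.gethostname())
        client = oslo_messaging.RPCClient(transport, target, timeout=timeout)
        try:
            # The method is deliberately non-existent: any reply, even an
            # error reply, proves the agent is reading its RPC queue.
            client.call({}, 'pod_health_probe_method_ignore_errors')
        except oslo_messaging.MessagingTimeout:
            return False  # nothing consumed the message: report unhealthy
        except Exception:
            return True   # the agent answered with an error: still healthy
        return True


    if __name__ == '__main__':
        sys.exit(0 if agent_consumes_rpc() else 1)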

Revision history for this message
Oleg Bondarev (obondarev) wrote :

@Yi Yang, I guess the reason for this issue is that the agent is not tolerant to this kind of RabbitMQ instability, where some messages are delivered and some are not (as you said, "sometimes neutron-ovs-agent can't read full neutron network data"). This leads to serious inconsistency between the data the agent receives and the server's view, and as a result to inconsistent flows and, more generally, to undefined behavior.
Indeed, ideally the agent should not break the data plane. That holds when RabbitMQ is completely down, but when it is only partially functional I'm not sure how the agent could properly detect it.

Revision history for this message
Yi Yang (yangyi01) wrote :

@Oleg Bondarev Only one node in the RabbitMQ cluster was disconnected; the other two RabbitMQ nodes were still working. I don't understand why RPCs are still routed to the disconnected RabbitMQ node; can you help explain that?

By the way, the other two RabbitMQ nodes were not restarted by the split brain, so they could handle RPCs as before.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

@Yi Yang, sorry, I'm afraid my RabbitMQ knowledge isn't enough to answer your question; it's probably better to clarify this with the oslo.messaging team.
All I know is that clustered RabbitMQ has a long history of issues, and some distributions are even moving from a clustered RabbitMQ to a stand-alone instance per service or group of services.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

I'm inclined to agree with Oleg here. If 2 of the 3 RabbitMQ servers are working fine and the agent is still connected to the broken one, I think this is something for oslo.messaging. Neutron just uses the oslo.messaging library to connect to RabbitMQ.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Yi Yang (yangyi01) wrote :

@Slawek Kaplonski, do you know which component decides which RabbitMQ node in the cluster an RPC call is routed to: oslo.messaging or the RabbitMQ server?

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi Yi Yang,
Sorry, but I don't know that for sure. My RabbitMQ knowledge is very limited.

Revision history for this message
Ken Giusti (kgiusti) wrote :

I'll be the first to admit that my knowledge of RabbitMQ clustering is not very deep. However, I don't believe there is a way for oslo.messaging to even detect that RabbitMQ is in split brain, much less recover from it. In fact, I'd venture to guess that even the RabbitMQ node itself isn't aware it's in split brain, else it would initiate some sort of cluster recovery, I would imagine (again, not much experience in this area).

oslo.messaging's fault recovery is limited to connection loss and failover (via configuration of the transport URL). It really has no further insight into RabbitMQ's inner state, including the health of the cluster.

Revision history for this message
Yi Yang (yangyi01) wrote :

Per the statement in https://docs.openstack.org/oslo.messaging/latest/reference/transport.html:

"""
You may include multiple different network locations separated by commas. The client will connect to any of the available locations and will automatically fail over to another should the connection fail.
"""

So it looks like oslo.messaging will fail over to another node if it fails to connect to one node. So I assume the oslo.messaging _send function can always deliver an RPC call successfully unless all the RabbitMQ nodes are unreachable.
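
(For reference, that failover behaviour is driven by the transport_url option, e.g. in neutron.conf; the hosts and credentials below are placeholders:)

    [DEFAULT]
    # The client connects to any of the listed brokers and fails over to
    # the next host in the list if the connection drops.
    transport_url = rabbit://openstack:secret@rmq-0:5672,openstack:secret@rmq-1:5672,openstack:secret@rmq-2:5672/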

Yes, RabbitMQ will auto-heal a split brain if it is configured appropriately (it does so by restarting nodes).
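
(For reference, a minimal sketch of the RabbitMQ side of that, assuming the rabbitmq.conf key from the partition-handling documentation; pause_minority is the other common policy:)

    # rabbitmq.conf (sketch): resolve partitions automatically by restarting
    # the nodes in the losing partition once connectivity returns.
    cluster_partition_handling = autoheal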

From the neutron-ovs-agent log we can see many error messages like the one below, so I think RabbitMQ has been in an abnormal state; otherwise, why would there be duplicate messages at all, even if the agent skips them?

2020-04-16 08:14:55.260 19871 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(7aa9087c4ad047d08af3a1d5ff409834). Skipping it.
