broken pipe errors cause neutron metadata agent to fail

Bug #1855919 reported by Albert Braden
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Unassigned

Bug Description

After we increased our cluster to 200 compute nodes, we started seeing "broken pipe" errors in neutron-metadata-agent.log on the controllers. Restarting neutron reduces the errors, but they then build back up until the log is mostly errors, the neutron metadata service fails, and VMs cannot boot. Another symptom is that unacked RMQ messages accumulate in the q-plugin queue. This is the first error we see; it occurs as the server is starting:

2019-12-10 10:56:01.942 1838536 INFO eventlet.wsgi.server [-] (1838536) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:01.943 1838538 INFO eventlet.wsgi.server [-] (1838538) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:01.945 1838539 INFO eventlet.wsgi.server [-] (1838539) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 32] Broken pipe

2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] 10.195.74.25,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 0 time: 19.0296111
2019-12-10 10:56:25.059 1838516 INFO eventlet.wsgi.server [-] 10.195.74.28,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.2840948
2019-12-10 10:56:25.181 1838529 INFO eventlet.wsgi.server [-] 10.195.74.68,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.2695429
2019-12-10 10:56:25.259 1838518 INFO eventlet.wsgi.server [-] 10.195.74.28,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.1980510

Then we see "call queues" warnings; the warning threshold doubles each time it is exceeded (10 → 20 → 40):

2019-12-10 10:56:31.414 1838515 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20

Next we see RPC timeout errors:

2019-12-10 10:57:02.043 1838520 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20
2019-12-10 10:57:02.059 1838534 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 37 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
2019-12-10 10:57:02.059 1838534 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
2019-12-10 10:57:02.285 1838521 INFO eventlet.wsgi.server [-] 10.195.74.27,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7959940

2019-12-10 10:57:16.215 1838531 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 21, greater than warning threshold: 20. There could be a leak. Increasing threshold to: 40

2019-12-10 10:57:17.339 1838539 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20

2019-12-10 10:57:24.838 1838524 INFO eventlet.wsgi.server [-] 10.195.73.242,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.6842020
2019-12-10 10:57:24.882 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:24.883 1838524 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:24.887 1838525 INFO eventlet.wsgi.server [-] 10.195.74.26,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.9827850
2019-12-10 10:57:24.903 1838518 INFO eventlet.wsgi.server [-] 10.195.74.43,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5630379
2019-12-10 10:57:25.045 1838529 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 21 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID b38361bf9906482b8b24c5b534a6652b
2019-12-10 10:57:25.046 1838529 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID b38361bf9906482b8b24c5b534a6652b
2019-12-10 10:57:25.055 1838537 INFO eventlet.wsgi.server [-] 10.195.73.247,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7542410
2019-12-10 10:57:25.119 1838523 INFO eventlet.wsgi.server [-] 10.195.74.2,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7057869
2019-12-10 10:57:25.185 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 47 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID f1a268d937f94def97bd238916715744
2019-12-10 10:57:25.261 1838529 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 26 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 93c31cf4f5d34bd1a5ba90165e89cb79
2019-12-10 10:57:25.284 1838536 INFO eventlet.wsgi.server [-] 10.195.73.207,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.4315739
2019-12-10 10:57:25.319 1838520 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 50 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 37c1b168536e4c70b522c330209b11ec
2019-12-10 10:57:25.319 1838520 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 37c1b168536e4c70b522c330209b11ec
2019-12-10 10:57:25.374 1838530 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 30 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID fb837fc73c664209bfbada0fb32886ad
2019-12-10 10:57:25.375 1838530 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID fb837fc73c664209bfbada0fb32886ad
2019-12-10 10:57:25.388 1838526 INFO eventlet.wsgi.server [-] 10.195.65.7,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5798080
2019-12-10 10:57:25.446 1838520 INFO eventlet.wsgi.server [-] 10.195.74.104,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.6868739
2019-12-10 10:57:25.448 1838528 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.7513518
2019-12-10 10:57:25.452 1838519 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 21, greater than warning threshold: 20. There could be a leak. Increasing threshold to: 40
2019-12-10 10:57:25.504 1838535 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 15 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7b677a7d40274b0ea22510dcf3865cf6
2019-12-10 10:57:25.505 1838535 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 7b677a7d40274b0ea22510dcf3865cf6
2019-12-10 10:57:25.609 1838539 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 20 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 378f11ce14334be38ffaa95ec3fc26f2
2019-12-10 10:57:25.610 1838539 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 378f11ce14334be38ffaa95ec3fc26f2
2019-12-10 10:57:25.661 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 0c911a0ac95f42209cfa8b265d4d5c3d
2019-12-10 10:57:25.787 1838525 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7191069
2019-12-10 10:57:25.831 1838522 INFO eventlet.wsgi.server [-] 10.195.64.185,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.5980189
2019-12-10 10:57:25.837 1838532 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 51 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.837 1838532 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.903 1838536 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.904 1838536 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.914 1838521 INFO eventlet.wsgi.server [-] 10.195.74.44,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.8841410
2019-12-10 10:57:25.936 1838524 INFO eventlet.wsgi.server [-] 10.195.73.231,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3305228
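The timeout messages themselves suggest raising rpc_response_timeout. As a sketch (the value below is illustrative; raising it buys headroom for an overloaded neutron-server or RabbitMQ cluster but does not fix the overload), the option lives in neutron.conf on both the server and the agents:

```ini
# /etc/neutron/neutron.conf (neutron-server and agents)
[DEFAULT]
# Default is 60 seconds; 120 here is an illustrative value only.
rpc_response_timeout = 120
```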

A minute or two after starting the server we get more errors. At this point VMs are unable to build. If I try to pull metadata from a VM, I get a 503 or 504, and nothing is logged in neutron-metadata-agent.log; haproxy logs the 503/504 response.

albertb@...:~ $ curl -s http://169.254.169.254/2009-04-04/meta-data/hostname
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

Now the log is almost all errors:

2019-12-10 10:57:27.666 1838530 INFO eventlet.wsgi.server [-] 10.195.73.174,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.7101889
2019-12-10 10:57:27.719 1838537 INFO eventlet.wsgi.server [-] 10.195.65.6,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3497119
2019-12-10 10:57:27.720 1838525 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 60 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.720 1838525 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.741 1838523 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.7329929
2019-12-10 10:57:27.820 1838525 INFO eventlet.wsgi.server [-] 10.195.73.206,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 1.4146030
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent [-] Unexpected error.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent Traceback (most recent call last):
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 89, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent instance_id, tenant_id = self._get_instance_and_tenant_id(req)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 162, in _get_instance_and_tenant_id
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ports = self._get_ports(remote_address, network_id, router_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 155, in _get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._get_ports_for_remote_address(remote_address, networks)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/cache_utils.py", line 116, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.func(target_self, *args, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 137, in _get_ports_for_remote_address
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ip_address=remote_address)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 106, in _get_ports_from_server
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.plugin_rpc.get_ports(self.context, filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 72, in get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return cctxt.call(context, 'get_ports', filters=filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 173, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent time.sleep(wait)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent self.force_reraise()
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent six.reraise(self.type_, self.value, self.tb)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 150, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._original_context.call(ctxt, method, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=self.retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout, retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent message = self.waiters.get(msg_id, timeout=timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent 'to message ID %s' % msg_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent
2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 32] Broken pipe

2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] 10.195.73.248,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 500 len: 0 time: 63.0060959
2019-12-10 10:57:27.873 1838528 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 41 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7d73358e2fe841e4a4b818395e2e5b2d
2019-12-10 10:57:27.877 1838524 INFO eventlet.wsgi.server [-] 10.195.73.238,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.4559531
2019-12-10 10:57:27.921 1838538 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 6 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.921 1838538 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.967 1838520 INFO eventlet.wsgi.server [-] 10.195.74.29,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.4040241
2019-12-10 10:57:28.006 1838517 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.6681471
2019-12-10 10:57:28.026 1838522 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3529530
2019-12-10 10:57:28.058 1838519 INFO eventlet.wsgi.server [-] 10.195.74.121,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5390451

To reproduce this issue:

Build an OpenStack cluster on Rocky with 3 controllers (48 CPUs each, Intel(R) Xeon(R) Gold 5118 @ 2.30GHz, 92 GB RAM) and add 200 compute nodes.

This bug seems severe to us. It is ruining our production cluster and we cannot build VMs.

Revision history for this message
Albert Braden (ozzzo) wrote :

This looks similar to bug 1853071, so we installed the patch, but it didn't seem to make a difference:

https://bugs.launchpad.net/neutron/+bug/1853071

https://review.opendev.org/#/c/697405/

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Albert:

I don't think this is related to DB locks but to the MQ you are using. Did you check the status of the MQ? What is the status of the Neutron server? Do you see any error in the server logs?

If you are running RabbitMQ, please use https://www.rabbitmq.com/cli.html to check the status of the MQ.

Regards.
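The kind of checks being suggested might look like the following (a sketch; queue names assume Neutron defaults, and rabbitmqctl must run on a RabbitMQ cluster node):

```shell
rabbitmqctl cluster_status   # all nodes listed in running_nodes? partitions empty?
rabbitmqctl status           # memory/disk alarms, file descriptor usage
# Queues with no consumers, or with a large unacknowledged backlog,
# are the usual red flags:
rabbitmqctl list_queues name messages_unacknowledged consumers \
  | awk 'NR > 1 && ($2 > 1000 || $3 == 0) {print $1}'
```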

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

Can you provide neutron-server logs from the same time frame? Errors like "Timeout in RPC method get_ports" can indicate a problem on the neutron-server side or in rabbitmq itself, so please also check rabbitmq status.
Another thing to check is the status of your database and the overall load on the controller/db nodes; another potential cause is simply an overloaded control plane.

I'm marking this as Incomplete for now. Please switch it back to New when you provide the additional data.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Albert Braden (ozzzo) wrote :

neutron-server logs show a few warnings while the server starts but no errors:

2019-12-10 10:55:59.961 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension router-service-type not supported by any of loaded plugins
2019-12-10 10:55:59.961 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: port-security-groups-filtering
2019-12-10 10:55:59.962 1838396 WARNING neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Did not find expected name "Security_groups_port_filtering_lib" in /usr/lib/python2.7/dist-packages/neutron/extensions/security_groups_port_filtering_lib.py
2019-12-10 10:55:59.964 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: security-group
2019-12-10 10:55:59.965 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension segment not supported by any of loaded plugins
2019-12-10 10:55:59.966 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension segments-peer-subnet-host-routes not supported by any of loaded plugins
2019-12-10 10:55:59.966 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: service-type
2019-12-10 10:55:59.967 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: sorting
2019-12-10 10:55:59.967 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension standard-attr-segment not supported by any of loaded plugins
2019-12-10 10:55:59.968 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: standard-attr-description
2019-12-10 10:55:59.968 1838396 WARNING neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Did not find expected name "Stdattrs_common" in /usr/lib/python2.7/dist-packages/neutron/extensions/stdattrs_common.py
2019-12-10 10:55:59.969 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension subnet-segmentid-writable not supported by any of loaded plugins
2019-12-10 10:55:59.969 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: subnet-service-types
2019-12-10 10:55:59.970 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: subnet_allocation
2019-12-10 10:55:59.971 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: standard-attr-tag
2019-12-10 10:55:59.971 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Loaded extension: standard-attr-timestamp
2019-12-10 10:55:59.972 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension trunk not supported by any of loaded plugins
2019-12-10 10:55:59.972 1838396 INFO neutron.api.extensions [req-1ec38ea9-9081-47d7-be2b-22f076080cab - - - - -] Extension trunk-details not supported by any of loaded plug...

Revision history for this message
Albert Braden (ozzzo) wrote :

I also see some RPC errors in neutron-linuxbridge-agent.log:

2019-12-10 10:57:02.562 1838419 INFO neutron.agent.securitygroups_rpc [req-4e5730f4-0207-4662-a738-eb4d13031eb4 5b288691527245bda715ab7744a193e9 deea2d8541f741eda6fb0d242d16bb23 - - -] Security group member updated [u'7272e9f5-d2fa-4b90-9ab3-aa850f0a9676']
2019-12-10 10:57:02.641 1838419 ERROR neutron.common.rpc [req-1d88bbfd-ff27-4760-aba5-550b02849e6b - - - - -] Timeout in RPC method security_group_info_for_devices. Waiting for 27 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID e790ca44021d42a093bcb5f2d910fb78
2019-12-10 10:57:02.641 1838419 WARNING neutron.common.rpc [req-1d88bbfd-ff27-4760-aba5-550b02849e6b - - - - -] Increasing timeout for security_group_info_for_devices calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID e790ca44021d42a093bcb5f2d910fb78
2019-12-10 10:57:03.000 1838419 INFO neutron.agent.securitygroups_rpc [req-aae08091-6a85-4c7b-a163-96d0f4addb4f 5b288691527245bda715ab7744a193e9 deea2d8541f741eda6fb0d242d16bb23 - - -] Security group member updated [u'7272e9f5-d2fa-4b90-9ab3-aa850f0a9676']

2019-12-10 10:57:15.026 1838419 INFO neutron.agent.securitygroups_rpc [req-32e97ef9-379c-443c-9fef-3dd495f338ef 5b288691527245bda715ab7744a193e9 deea2d8541f741eda6fb0d242d16bb23 - - -] Security group member updated [u'7272e9f5-d2fa-4b90-9ab3-aa850f0a9676']
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent [req-1d88bbfd-ff27-4760-aba5-550b02849e6b - - - - -] Error in agent loop. Devices info: {'current': set(['tapf365fa90-08', 'tapb947ec93-64', 'tap704b42f4-6a', 'tapf380ae4c-c2']), 'timestamps': {'tapb947ec93-64': 9, 'tap704b42f4-6a': 10, 'tapf365fa90-08': 11, 'tapf380ae4c-c2': 12}, 'removed': set([]), 'added': set(['tapf365fa90-08', 'tapb947ec93-64', 'tap704b42f4-6a', 'tapf380ae4c-c2']), 'updated': set([])}: MessagingTimeout: Timed out waiting for a reply to message ID e790ca44021d42a093bcb5f2d910fb78
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent Traceback (most recent call last):
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 458, in daemon_loop
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent sync = self.process_network_devices(device_info)
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent File "/usr/lib/python2.7/dist-packages/osprofiler/profiler.py", line 158, in wrapper
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent result = f(*args, **kwargs)
2019-12-10 10:57:30.119 1838419 ERROR neutron.plugins.ml2.drivers.agent._common_agent File "/usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/agent/_common_agent.py", line 202, in process_network_devices
2...

Revision history for this message
Albert Braden (ozzzo) wrote :

I didn't check RMQ status yesterday after restarting Neutron services, but I can check it now. After running for a day the neutron-metadata-agent.log is almost all errors and VMs cannot build. Here's the RMQ status and cluster_status:

root@us01odc-p02-ctrl1:/var/log/nova# rabbitmqctl status
Status of node 'rabbit@us01odc-p02-ctrl1'
[{pid,1926926},
 {running_applications,
     [{rabbit,"RabbitMQ","3.6.10"},
      {mnesia,"MNESIA CXC 138 12","4.15.3"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.3.0"},
      {ssl,"Erlang/OTP SSL application","8.2.3"},
      {public_key,"Public key infrastructure","1.5.2"},
      {asn1,"The Erlang ASN1 compiler version 5.0.4","5.0.4"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.6.10"},
      {compiler,"ERTS CXC 138 10","7.1.4"},
      {xmerl,"XML parser","1.3.16"},
      {crypto,"CRYPTO","4.2"},
      {os_mon,"CPO CXC 138 46","2.4.4"},
      {syntax_tools,"Syntax tools","2.1.4"},
      {sasl,"SASL CXC 138 11","3.1.1"},
      {stdlib,"ERTS CXC 138 10","3.4.3"},
      {kernel,"ERTS CXC 138 10","5.4.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:48:48] [ds:48:48:10] [async-threads:768] [kernel-poll:true]\n"},
 {memory,
     [{total,2678094784},
      {connection_readers,109660624},
      {connection_writers,10235008},
      {connection_channels,262634328},
      {connection_other,220976648},
      {queue_procs,333485128},
      {queue_slave_procs,377952},
      {plugins,0},
      {other_proc,67189592},
      {mnesia,19216376},
      {metrics,14410456},
      {mgmt_db,0},
      {msg_index,1846528},
      {other_ets,23565736},
      {binary,1545825448},
      {code,21401226},
      {atom,900041},
      {other_system,60777341}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,39982369996},
 {disk_free_limit,50000000},
 {disk_free,82705231872},
 {file_descriptors,
     [{total_limit,65436},
      {total_used,4068},
      {sockets_limit,58890},
      {sockets_used,4066}]},
 {processes,[{limit,1048576},{used,65120}]},
 {run_queue,0},
 {uptime,62151},
 {kernel,{net_ticktime,60}}]
root@us01odc-p02-ctrl1:/var/log/nova# rabbitmqctl cluster_status
Cluster status of node 'rabbit@us01odc-p02-ctrl1'
[{nodes,[{disc,['rabbit@us01odc-p02-ctrl1','rabbit@us01odc-p02-ctrl2',
                'rabbit@us01odc-p02-ctrl3']}]},
 {running_nodes,['rabbit@us01odc-p02-ctrl2','rabbit@us01odc-p02-ctrl3',
                 'rabbit@us01odc-p02-ctrl1']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{'rabbit@us01odc-p02-ctrl2',[]},
          {'rabbit@us01odc-p02-ctrl3',[]},
          {'rabbit@us01odc-p02-ctrl1',[]}]}]

This is what neutron-metadata-agent.log looks like now:

2019-12-11 10:01:52.393 1932065 INFO eventlet.wsgi.server [-] 10.195.65.93,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 0 time: 349.4078939
2019-12-11 10:01:52.395 1932068 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi....
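The "[Errno 32] Broken pipe" in these tracebacks is the metadata agent finally writing a response to a client that already gave up (note the 349-second response time above) and closed its socket. A minimal sketch of that failure mode, using a plain socket pair instead of eventlet:

```python
import errno
import socket

# A connected socket pair stands in for the metadata proxy (server side)
# and the VM's cloud-init HTTP client (client side).
server, client = socket.socketpair()

# The client times out waiting for a reply and hangs up.
client.close()

# When the server eventually writes its response, the kernel reports
# EPIPE -- the same "[Errno 32] Broken pipe" seen in the agent log.
payload = b"HTTP/1.0 200 OK\r\n\r\ninstance-id"
try:
    server.sendall(payload)
    # A second write guarantees the error even on transports where the
    # first write after a peer close can still succeed.
    server.sendall(payload)
    err = None
except BrokenPipeError as e:
    err = e.errno  # errno.EPIPE == 32

server.close()
```

Python ignores SIGPIPE by default, so the failed write surfaces as a `BrokenPipeError` exception rather than killing the process, which is why the agent logs the traceback and keeps running.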

Revision history for this message
Albert Braden (ozzzo) wrote :

Here's the MySQL status:

root@us01odc-p02-ctrl1:/var/log/nova# mysql -e "show status"
+-----------------------------------+-----------+
| Variable_name                     | Value     |
+-----------------------------------+-----------+
| Aborted_clients                   | 14089     |
| Aborted_connects                  | 4         |
| Access_denied_errors              | 0         |
| Acl_column_grants                 | 0         |
| Acl_database_grants               | 16        |
| Acl_function_grants               | 0         |
| Acl_procedure_grants              | 0         |
| Acl_package_spec_grants           | 0         |
| Acl_package_body_grants           | 0         |
| Acl_proxy_users                   | 1         |
| Acl_role_grants                   | 0         |
| Acl_roles                         | 0         |
| Acl_table_grants                  | 0         |
| Acl_users                         | 20        |
| Aria_pagecache_blocks_not_flushed | 0         |
| Aria_pagecache_blocks_unused      | 15706     |
| Aria_pagecache_blocks_used        | 389       |
| Aria_pagecache_read_requests      | 103205350 |
| Aria_pagecache_reads              | 30418596  |
| Aria_pagecache_write_requests     | 61739452  |
| Aria_pagecache_writes             | 61739452  |
| Aria_transaction_log_syncs        | 22785     |
| Binlog_commits                    | 0         |
| Binlog_group_commits              | 0 ...     |

Revision history for this message
Albert Braden (ozzzo) wrote :

Load is 13 on the controllers, which should be fine with 48 CPUs. We have nova, neutron and RMQ all running on the 3 controllers; MySQL runs on ctrl1.

root@us01odc-p02-ctrl1:/var/log/nova# uptime
 10:41:08 up 7 days, 23:29, 3 users, load average: 12.66, 13.12, 13.51
root@us01odc-p02-ctrl1:/var/log/nova# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
Stepping:              4
CPU MHz:               2669.000
BogoMIPS:              4600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              16896K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp pku ospke md_clear flush_l1d

Changed in neutron:
status: Incomplete → New
Changed in neutron:
importance: Undecided → High
Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

have you tried to bump the number of rpc workers?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Albert:

Reviewing the agent and the server logs (unfortunately not from the same time window), the server looks healthy, replying to API calls in a very short time. However, the agent seems to have problems with its RPC calls, never receiving a reply. I can't dig further into this.

About the DB, I took a look at Innodb_row_lock_time_max (562ms) and Innodb_row_lock_time_avg (1ms). Both parameters look perfectly OK.

Can you check the connectivity from the faulty agent to the MQ? I think that is where your problem lies. IMO this looks like a configuration problem rather than a code one. Of course, the information I have is limited, so please don't take this as the ultimate truth.

Regards.

Revision history for this message
Albert Braden (ozzzo) wrote :

Setting rpc_workers to 8 in /etc/neutron/neutron.conf fixed the problem.

How could I have found this information without opening a bug? Is there a document that I should have read?
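For reference, the fix described above amounts to a one-line config change (the value 8 comes from this comment; tune it to your deployment), followed by a restart of neutron-server:

```ini
# /etc/neutron/neutron.conf on each controller
[DEFAULT]
# Default was 1 in this release; raised so RPC replies to the agents
# are no longer serialized behind a single worker process.
rpc_workers = 8
```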

Revision history for this message
Albert Braden (ozzzo) wrote :

It appears that the information should be added to:

https://docs.openstack.org/neutron/rocky/configuration/neutron.html

This section:

rpc_workers

Type: integer
Default: 1
Number of RPC worker processes for service.

Should include this information:

If your neutron-metadata-agent.log is full of "error: [Errno 32] Broken pipe" errors, then increase this setting.
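The link between too few RPC workers and broken pipes can be sketched with a toy queueing model (illustrative only, not Neutron code): when requests arrive faster than the workers can drain them, the tail of the queue waits far longer than any client will tolerate, the client hangs up, and the eventual reply hits a dead socket.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def worst_wait(workers, requests=8, service_time=0.05):
    """Completion time of the slowest of `requests` jobs when handled
    by `workers` concurrent workers."""
    start = time.monotonic()

    def handle(_):
        time.sleep(service_time)  # stand-in for the per-reply DB/RPC work
        return time.monotonic() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return max(pool.map(handle, range(requests)))

one_worker = worst_wait(workers=1)    # requests fully serialized
many_workers = worst_wait(workers=8)  # requests handled in parallel
```

With one worker the slowest request waits roughly requests × service_time; with eight it waits about one service_time. Once that wait crosses the client's timeout, the difference is a 200 response versus a broken pipe.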

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

Albert,

glad to hear it fixed the problem.

i myself found it via "git log". (later versions have a larger default for rpc_workers.)

it's documented here. (for master)
https://docs.openstack.org/neutron/train/admin/config-wsgi.html#neutron-worker-processes

i don't know if it's documented for rocky.

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

lower priority as there's a workaround. of course, documentation improvements are welcome.

Changed in neutron:
importance: High → Low
status: New → Opinion
Chason Chan (chen-xing)
Changed in neutron:
assignee: nobody → Chason Chan (chen-xing)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699372

Changed in neutron:
status: Opinion → In Progress
Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Chason Chan (chen-xing) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/699372
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/869770

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/869770
Committed: https://opendev.org/openstack/neutron/commit/ed048638f4ef6a1799a1dcbd8819627b3fa8e75e
Submitter: "Zuul (22348)"
Branch: master

commit ed048638f4ef6a1799a1dcbd8819627b3fa8e75e
Author: Brian Haley <email address hidden>
Date: Tue Jan 10 16:40:29 2023 -0500

    Add text to WSGI config guide for debugging

    Add "broken pipe" example error message to WSGI guide in
    rpc_workers section, could help diagnose agent issues.

    Trivialfix

    Change-Id: I6e95614bce42124f125be4c23fff0d923ea9ebcc
    Closes-bug: #1855919

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1

This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.
