fs022 pike reporting AMQP timeouts for nova-compute/neutron ovs agent.
errors on controller side:
https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-controller-0/var/log/extra/errors.txt.gz#_2018-03-27_06_45_18_768
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource Traceback (most recent call last):
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 849, in _action_recorder
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 957, in _do_action
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 351, in wrapper
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource step = next(subtask)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 908, in action_handler_task
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource done = check(handler_data)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 869, in check_create_complete
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 238, in _check_active
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-03-27 06:45:18.768 ERROR /var/log/containers/heat/heat-engine.log: 6 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: No valid host was found. , Code: 500"
errors on compute side:
https://logs.rdoproject.org/openstack-periodic-24hr/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022-pike/d9e1506/overcloud-novacompute-0/var/log/extra/errors.txt.gz#_2018-03-27_06_42_50_630
2018-03-27 06:41:13.072 ERROR /var/log/extra/docker/containers/neutron_ovs_agent/log/neutron/neutron-openvswitch-agent.log: 23260 ERROR neutron.common.rpc [req-392cd54f-005c-4d4f-a350-af1cde36bf4b - - - - -] Timeout in RPC method tunnel_sync. Waiting for 53 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID e495637cb37148db9a4ca34b61db11de
2018-03-27 06:42:50.630 ERROR /var/log/extra/docker/containers/nova_libvirt/log/nova/nova-compute.log: 7 ERROR oslo.messaging._drivers.impl_rabbit [req-1d44cbaa-5d4e-4f88-928e-c80cb50711fe - - - - -] [ad70f6b4-1e81-49df-baee-1b35303fd4cf] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 110] Connection timed out. Trying again in 1 seconds. Client port: 43950: error: [Errno 110] Connection timed out
This looks like a problem on the client side, somewhere in oslo.messaging/ kombu/py- amqp.
You can see the client src ports being used from the compute log:
$ curl -s https:/ /logs.rdoprojec t.org/openstack -periodic- 24hr/periodic- tripleo- ci-centos- 7-ovb-1ctlr_ 1comp-featurese t022-pike/ d9e1506/ overcloud- novacompute- 0/var/log/ extra/errors. txt.gz# _2018-03- 27_06_42_ 50_630 | egrep -o 'port: [0-9]+' | sort | uniq
port: 43950
port: 43960
port: 44038
port: 44048
port: 44060
And then you can see in the rabbit log that those same client connections are not handshaking properly for some reason:
$ curl -s https:/ /logs.rdoprojec t.org/openstack -periodic- 24hr/periodic- tripleo- ci-centos- 7-ovb-1ctlr_ 1comp-featurese t022-pike/ d9e1506<email address hidden> | grep -B2 handshake 2018::06: 43:00 === 24.14:43950 -> 192.168. 24.17:5672) : timeout, frame_header} 2018::06: 44:45 === 24.14:43960 -> 192.168. 24.17:5672) : timeout, frame_header} 2018::06: 47:30 === 24.14:44038 -> 192.168. 24.17:5672) : timeout, frame_header} 2018::06: 49:14 === 24.14:44048 -> 192.168. 24.17:5672) : timeout, frame_header} 2018::06: 50:58 === 24.14:44060 -> 192.168. 24.17:5672) : timeout, frame_header} 2018::06: 52:43 === 24.14:44130 -> 192.168. 24.17:5672) : timeout, frame_header}
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.7859.0> (192.168.
{handshake_
--
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.8794.0> (192.168.
{handshake_
--
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.10299.0> (192.168.
{handshake_
--
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.11187.0> (192.168.
{handshake_
--
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.12067.0> (192.168.
{handshake_
--
=ERROR REPORT==== 27-Mar-
closing AMQP connection <0.12963.0> (192.168.
{handshake_
So for some reason the client knows something is wrong, and reconnects with a new TCP connection with a new source port, but doesn't perform an AMQP handshake after the connection is established.