Numerous {handshake_timeout,handshake} in RabbitMQ log

Bug #1362155 reported by Dmitry Teselkin
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Fix Committed
Medium
MOS Oslo
5.1.x
Won't Fix
Medium
Dmitry Teselkin

Bug Description

Diagnostic snapshot: http://172.18.10.94:8888/fuel-snapshot-2014-08-27_13-03-26.tgz

{"build_id": "2014-07-17_11-18-10", "mirantis": "yes", "build_number": "135", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f", "nailgun_sha": "1d08d6f80b6514085dd8c0af4d437ef5d37e2802", "production": "docker", "api": "1.0", "fuelmain_sha": "c8e13df4c7de3ce3504c2bcb6d51a165b9aae0b6", "astute_sha": "9a74b788be9a7c5682f1c52a892df36e4766ce3f", "release": "5.0.1", "fuellib_sha": "e8c2bb726be6b78c3a34f75c84337a3a5662bb35"}

Problem: unable to open noVNC console for any instance.

When I click the 'Console' link in dropdown menu (under 'More' button), blank white window in browser appears. No items of openstack dashboard is displayed on that screen, and it is in 'loading' state.

If I press 'Stop' and then try to open the link to the noVNC console againg, nothing happens on the screen, but new error messages in nova-all.log file appears.

Sometimes I managed to open noVNC console after a few 'Refresh' actions.

RabbitMQ log contains constantly appearing error messages like shown below:

================================================================================
=INFO REPORT==== 27-Aug-2014::11:43:11 ===
accepting AMQP connection <0.27627.0> (192.168.0.2:54186 -> 192.168.0.2:5672)

=INFO REPORT==== 27-Aug-2014::11:43:17 ===
accepting AMQP connection <0.27743.0> (192.168.0.2:54188 -> 192.168.0.2:5672)

=ERROR REPORT==== 27-Aug-2014::11:43:21 ===
closing AMQP connection <0.27627.0> (192.168.0.2:54186 -> 192.168.0.2:5672):
{handshake_timeout,handshake}

=ERROR REPORT==== 27-Aug-2014::11:43:27 ===
closing AMQP connection <0.27743.0> (192.168.0.2:54188 -> 192.168.0.2:5672):
{handshake_timeout,handshake}

=INFO REPORT==== 27-Aug-2014::11:43:52 ===
accepting AMQP connection <0.27918.0> (192.168.0.2:54189 -> 192.168.0.2:5672)

=ERROR REPORT==== 27-Aug-2014::11:44:02 ===
closing AMQP connection <0.27918.0> (192.168.0.2:54189 -> 192.168.0.2:5672):
{handshake_timeout,handshake}
================================================================================

In the same time there are trace messages in OpenStack logs, for example, /var/log/nova-all.log

================================================================================
<182>Aug 27 10:41:38 node-1 nova-urllib3.connectionpool INFO: Starting new HTTP connection (1): 192.168.0.2
<182>Aug 27 10:41:38 node-1 nova-nova.osapi_compute.wsgi.server INFO: 172.16.40.136 "GET /v2/631fa1bc73f84530b78c3efcfa38ebfd/servers/0a64f095-58d3-43f4-b81d-39e0864fa0b7 HTTP/1.1" status: 200 len: 2038 time: 0.1334660
<182>Aug 27 10:41:39 node-1 nova-nova.osapi_compute.wsgi.server INFO: 172.16.40.136 "GET /v2/631fa1bc73f84530b78c3efcfa38ebfd/servers/0a64f095-58d3-43f4-b81d-39e0864fa0b7/os-volume_attachments HTTP/1.1" status: 200 len: 197 time: 0.0401199
<182>Aug 27 10:41:39 node-1 nova-nova.osapi_compute.wsgi.server INFO: 172.16.40.136 "GET /v2/631fa1bc73f84530b78c3efcfa38ebfd/flavors/2 HTTP/1.1" status: 200 len: 600 time: 0.0119381
<179>Aug 27 10:41:39 node-1 nova-oslo.messaging._drivers.impl_rabbit ERROR: Failed to publish message to topic 'compute.node-3.vm.mirantis.net': [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 728, in _publish
    publisher = cls(self.conf, self.channel, topic, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 384, in __init__
    **options)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 331, in __init__
    self.reconnect(channel)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in reconnect
    routing_key=self.routing_key)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 84, in __init__
    self.revive(self._channel)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
    self.declare()
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 104, in declare
    self.exchange.declare()
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 166, in declare
    nowait=nowait, passive=passive,
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 613, in exchange_declare
    self._send_method((40, 10), args)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/usr/lib/python2.6/site-packages/amqp/transport.py", line 177, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
_send_method((40, 10), args)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/usr/lib/python2.6/site-packages/amqp/transport.py", line 177, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
<182>Aug 27 10:41:39 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.2:5672
<182>Aug 27 10:41:39 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...
<182>Aug 27 10:41:44 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 192.168.0.2:5672
<180>Aug 27 10:41:44 node-1 nova-nova.consoleauth.manager AUDIT: Received Token: 14eac98c-280b-40ae-aee0-2a658cf25060, {'instance_uuid': u'0a64f095-58d3-43f4-b81d-39e0864fa0b7', 'internal_access_path': None, 'last_activity_at': 1409136104.2732379, 'console_type': u'novnc', 'host': u'192.168.0.4', 'token': u'14eac98c-280b-40ae-aee0-2a658cf25060', 'port': u'5903'}
<179>Aug 27 10:41:44 node-1 nova-oslo.messaging._drivers.impl_rabbit ERROR: Failed to publish message to topic 'reply_96632cd2e06642d4ad0d860256d55c66': [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 728, in _publish
    publisher = cls(self.conf, self.channel, topic, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 365, in __init__
    type='direct', **options)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 331, in __init__
    self.reconnect(channel)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in reconnect
    routing_key=self.routing_key)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 84, in __init__
    self.revive(self._channel)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
    self.declare()
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 104, in declare
    self.exchange.declare()
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 166, in declare
    nowait=nowait, passive=passive,
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 613, in exchange_declare
    self._send_method((40, 10), args)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/usr/lib/python2.6/site-packages/amqp/transport.py", line 177, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
<182>Aug 27 10:41:44 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.2:5672
<182>Aug 27 10:41:44 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...
================================================================================

Looks like the bug https://bugs.launchpad.net/ceilometer/+bug/1337715

Update: After a few hours this env lost the ability to launch new instances.

tags: added: ceilometer
Changed in mos:
milestone: none → 5.1
description: updated
description: updated
tags: added: messaging rabbitmq
removed: ceilometer
Revision history for this message
Serg Melikyan (smelikyan) wrote :

I suggest to reproduce this issue in release version of 5.0.1 and retest on 5.1. Issue observer on pre-release version of 5.0.1

Changed in mos:
status: New → Incomplete
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Dmitry, please reproduce the issue on 5.1 or 5.0.2

Changed in mos:
assignee: nobody → Dmitry Teselkin (teselkin-d)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

 messages like:

=ERROR REPORT====
closing AMQP connection
{handshake_timeout,handshake}

generates because of HAproxy service continuously checks rabbit nodes.

Changed in mos:
status: Incomplete → Won't Fix
importance: Undecided → Medium
Revision history for this message
Denis Kostryukov (dkostryukov) wrote :

Looks like "ceilometer-agent-notification" is the root of cause of {handshake_timeout,handshake} errors in rabbitmq logs.

Changed in mos:
milestone: 5.1 → 6.0
status: Won't Fix → Incomplete
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Status updated:
1) we need to track it both in release series it was found (5.1.x), and releases series that are our current focus (6.0.x).
2) marked as Iincomplete in 6.0 based on the comments above, looks like we need to clarify the impact and reproducibility of this bug.
3) marked as Won't Fix for 5.1: unless this turns out High or Critical priority, we're not going to backport a fix to 5.1.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Some update from related zendesk ticket #3234:
* Looks like "ceilometer-agent-notification" is the root of cause of {handshake_timeout,handshake} errors in rabbitmq logs.
* after "ceilometer-agent-notification" service down, the errors doesn't appear

Changed in mos:
status: Incomplete → Confirmed
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Seems like an issue in oslo.messaging notification logic, hence assigning to MOS Oslo team

Changed in mos:
assignee: Dmitry Teselkin (teselkin-d) → MOS Oslo (mos-oslo)
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Please check same upstream bug:
https://bugs.launchpad.net/ceilometer/+bug/1337715

The interesting thing - we reproduced this issue against MOS 5.1 with Oslo messaging 1.4.1 - python-oslo-messaging-1.4.1-fuel6.0.mira18.noarch.rpm, while we updated only this packages and direct requirements,
without updating/installing only the followingthe following components:
python-oslo-config-1.4.0-fuel6.0.mira22.noarch.rpm
python-oslo-db-1.0.1-fuel6.0.mira21.noarch.rpm
python-oslo-vmware-0.6.0-fuel6.0.mira19.noarch.rpm

This updated python-oslo-messaging-1.4.1-fuel6.0.mira18 already includes latest fix to drivers/impl_rabbit.py, so it at least re-uses existing RabbitMQ queues.

MOS Oslo team suggests the issue may be somewhere in Ceilometer.

MOS Oslo team, could you please confirm?

Revision history for this message
Ilya Pekelny (i159) wrote : Re: [Bug 1362155] Re: Numerous {handshake_timeout, handshake} in RabbitMQ log
Download full text (11.2 KiB)

Yes, the last messaging version has the fix for the re-connection issue.

Sorry for late response, I hope it doesn't affect any more.

On Tue, Nov 18, 2014 at 10:19 AM, Miroslav Anashkin <email address hidden>
wrote:

> Please check same upstream bug:
> https://bugs.launchpad.net/ceilometer/+bug/1337715
>
> The interesting thing - we reproduced this issue against MOS 5.1 with Oslo
> messaging 1.4.1 - python-oslo-messaging-1.4.1-fuel6.0.mira18.noarch.rpm,
> while we updated only this packages and direct requirements,
> without updating/installing only the followingthe following components:
> python-oslo-config-1.4.0-fuel6.0.mira22.noarch.rpm
> python-oslo-db-1.0.1-fuel6.0.mira21.noarch.rpm
> python-oslo-vmware-0.6.0-fuel6.0.mira19.noarch.rpm
>
> This updated python-oslo-messaging-1.4.1-fuel6.0.mira18 already includes
> latest fix to drivers/impl_rabbit.py, so it at least re-uses existing
> RabbitMQ queues.
>
> MOS Oslo team suggests the issue may be somewhere in Ceilometer.
>
> MOS Oslo team, could you please confirm?
>
> --
> You received this bug notification because you are a member of MOS Oslo,
> which is a bug assignee.
> https://bugs.launchpad.net/bugs/1362155
>
> Title:
> Numerous {handshake_timeout,handshake} in RabbitMQ log
>
> Status in Mirantis OpenStack:
> Confirmed
> Status in Mirantis OpenStack 5.1.x series:
> Won't Fix
>
> Bug description:
> Diagnostic snapshot: http://172.18.10.94:8888/fuel-
> snapshot-2014-08-27_13-03-26.tgz
>
> {"build_id": "2014-07-17_11-18-10", "mirantis": "yes", "build_number":
> "135", "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
> "nailgun_sha": "1d08d6f80b6514085dd8c0af4d437ef5d37e2802",
> "production": "docker", "api": "1.0", "fuelmain_sha":
> "c8e13df4c7de3ce3504c2bcb6d51a165b9aae0b6", "astute_sha":
> "9a74b788be9a7c5682f1c52a892df36e4766ce3f", "release": "5.0.1",
> "fuellib_sha": "e8c2bb726be6b78c3a34f75c84337a3a5662bb35"}
>
> Problem: unable to open noVNC console for any instance.
>
> When I click the 'Console' link in dropdown menu (under 'More'
> button), blank white window in browser appears. No items of openstack
> dashboard is displayed on that screen, and it is in 'loading' state.
>
> If I press 'Stop' and then try to open the link to the noVNC console
> againg, nothing happens on the screen, but new error messages in nova-
> all.log file appears.
>
> Sometimes I managed to open noVNC console after a few 'Refresh'
> actions.
>
> RabbitMQ log contains constantly appearing error messages like shown
> below:
>
>
> ================================================================================
> =INFO REPORT==== 27-Aug-2014::11:43:11 ===
> accepting AMQP connection <0.27627.0> (192.168.0.2:54186 ->
> 192.168.0.2:5672)
>
> =INFO REPORT==== 27-Aug-2014::11:43:17 ===
> accepting AMQP connection <0.27743.0> (192.168.0.2:54188 ->
> 192.168.0.2:5672)
>
> =ERROR REPORT==== 27-Aug-2014::11:43:21 ===
> closing AMQP connection <0.27627.0> (192.168.0.2:54186 ->
> 192.168.0.2:5672):
> {handshake_timeout,handshake}
>
> =ERROR REPORT==== 27-Aug-2014::11:43:27 ===
> closing AMQP connection <0.27743.0> (192.168.0.2:54188 ->
> 192.168.0....

Revision history for this message
Ilya Pekelny (i159) wrote :
Changed in mos:
status: Confirmed → Fix Committed
Revision history for this message
dreamhu (dreamhu) wrote :

the fixed url is not found!

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

dreamhu, the patch is hidden from the public. But essentially it points to heartbeats implementation we had at that time (Icehouse, MOS 5.x). You may find current implementation in upstream OpenStack https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L186-L194 , see parameters heartbeat_timeout_threshold and heartbeat_rate

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.