Rabbit is unavailable

Bug #1342719 reported by Sergey Murashov
This bug affects 1 person
Affects             Status     Importance  Assigned to              Milestone
Fuel for OpenStack  Invalid    Critical    Sergey Otpuschennikov
5.0.x               Confirmed  Critical    Registry Administrators

Bug Description

Steps to reproduce:
ISO: http://jenkins-product.srt.mirantis.net:8080/job/fuel_iso_with_gerrit_commits_docker/413/
1. Install OS (simple mode, 1 Controller + 1 Compute + 1 Ceph, GRE, Murano, CentOS) on VirtualBox

Actual result:
Installation is successful, but the nova logs show:
<182>Jul 16 13:06:03 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.1:5672
<182>Jul 16 13:06:03 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...
<182>Jul 16 13:06:08 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 192.168.0.1:5672
<179>Jul 16 13:06:09 node-1 nova-oslo.messaging._drivers.impl_rabbit ERROR: Failed to publish message to topic 'reply_bfc1926896534a839eb4f0414831ab8e': [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 728, in _publish
    publisher = cls(self.conf, self.channel, topic, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 365, in __init__
    type='direct', **options)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 331, in __init__
    self.reconnect(channel)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in reconnect
    routing_key=self.routing_key)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 84, in __init__
    self.revive(self._channel)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
    self.declare()
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 104, in declare
    self.exchange.declare()
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 166, in declare
    nowait=nowait, passive=passive,
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 613, in exchange_declare
    self._send_method((40, 10), args)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/usr/lib/python2.6/site-packages/amqp/transport.py", line 177, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
<182>Jul 16 13:06:09 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.1:5672
<182>Jul 16 13:06:09 node-1 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...

And the following logs in murano.log:
.py:697
2014-07-16 13:09:42.058 4668 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697
2014-07-16 13:09:43.013 4668 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697
2014-07-16 13:09:43.059 4668 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697
2014-07-16 13:09:43.723 4690 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697
2014-07-16 13:09:44.015 4668 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697
2014-07-16 13:09:44.060 4668 DEBUG oslo.messaging._drivers.impl_rabbit [-] Timed out waiting for RPC response: timed out _error_callback /usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py:697

In cinder logs:
<158>Jul 16 13:10:36 node-1 cinder-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.1:5672
<158>Jul 16 13:10:36 node-1 cinder-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...
<158>Jul 16 13:10:41 node-1 cinder-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 192.168.0.1:5672
<156>Jul 16 13:10:41 node-1 cinder-cinder.context WARNING: Arguments dropped when creating context: {'user': None, 'tenant': None, 'user_identity': u'- - - - -'}
<155>Jul 16 13:10:41 node-1 cinder-oslo.messaging._drivers.impl_rabbit ERROR: Failed to publish message to topic 'None': [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 729, in _publish
    publisher.send(msg, timeout)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 349, in send
    self.producer.publish(msg)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 168, in publish
    routing_key, mandatory, immediate, exchange, declare)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 184, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 2122, in _basic_publish
    self._send_method((60, 40), args, msg)
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 56, in _send_method
    self.channel_id, method_sig, args, content,
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 221, in write_method
    write_frame(1, channel, payload)
  File "/usr/lib/python2.6/site-packages/amqp/transport.py", line 177, in write_frame
    frame_type, channel, size, payload, 0xce,
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 309, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.6/site-packages/eventlet/greenio.py", line 295, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
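
The numeric prefixes on the syslog lines above (`<182>`, `<179>`, `<158>`, `<156>`, `<155>`) look like standard syslog priority values. Assuming the usual RFC 3164 encoding (facility * 8 + severity), a minimal sketch for decoding them follows; `decode_pri` is a hypothetical helper name, not part of any tool mentioned in this report.

```python
# Severity names in numeric order 0..7, following the syslog PRI convention.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def decode_pri(pri):
    """Split a syslog PRI value into (facility number, severity name)."""
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


# PRI values taken from the nova and cinder excerpts in this report.
for pri in (182, 179, 158, 156, 155):
    print(pri, decode_pri(pri))
```

Under that assumption, the nova lines decode to facility 22 (local6) and the cinder lines to facility 19 (local3), with severities that match the INFO, WARNING and ERROR labels in the messages themselves.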

Tags: heartbeat
Revision history for this message
Sergey Murashov (smurashov) wrote :
Changed in fuel:
importance: Undecided → Critical
assignee: nobody → Vladimir Sharshov (vsharshov)
milestone: none → 5.0.1
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The OpenStack cloud doesn't work: we can't start VMs.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

We should increase the default heartbeat timeout for RabbitMQ from 5 to 60 seconds (the RabbitMQ default is 580 seconds).
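
To put the proposed values in numbers, here is a minimal sketch. It assumes the behaviour described in the RabbitMQ heartbeat documentation: heartbeat frames are sent roughly every timeout / 2 seconds, and a peer is considered unreachable after two missed heartbeats. The function names are hypothetical, and the sketch neither reproduces the oslo.messaging patch nor establishes the root cause of this bug.

```python
def heartbeat_send_interval(timeout_s):
    """Approximate interval between heartbeat frames for a given timeout."""
    return timeout_s / 2.0


def survives_stall(timeout_s, stall_s):
    """Return True if a peer that is silent for stall_s seconds is still
    considered alive, assuming it is dropped after two missed heartbeats."""
    return stall_s < 2 * heartbeat_send_interval(timeout_s)


# Compare the values mentioned in the comment above: 5, 60 and 580 seconds.
for timeout in (5, 60, 580):
    print(timeout, heartbeat_send_interval(timeout), survives_stall(timeout, 10))
```

With these assumptions, a 10-second pause in a busy service would exceed the margin of a 5-second timeout but not of a 60-second one.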

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Fuel OSCI Team (fuel-osci)
status: New → In Progress
Roman Vyalov (r0mikiam)
Changed in fuel:
assignee: Fuel OSCI Team (fuel-osci) → Sergey Otpuschennikov (sotpuschennikov)
Revision history for this message
Sergey Murashov (smurashov) wrote :

Reproduced on KVM (HA, 3 Controllers, 1 Compute, 3 Ceph, VLAN, CentOS, Murano).

Revision history for this message
OSCI Robot (oscirobot) wrote :

Package python-oslo.messaging has been built from changeset: http://gerrit.mirantis.com/18356
RPM Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-5.0.1-stable-18356/centos
You can build an ISO with this package:
make iso EXTRA_RPM_REPOS="osci-testing,http://osci-obs.vm.mirantis.net:82/centos-fuel-5.0.1-stable-18356/centos"

Revision history for this message
OSCI Robot (oscirobot) wrote :

Package python-oslo.messaging has been built from changeset: http://gerrit.mirantis.com/18357
DEB Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.0.1-stable-18357/ubuntu
You can build an ISO with this package:
make iso EXTRA_DEB_REPOS="http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.0.1-stable-18357/ubuntu /"

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #130
"build_id": "2014-07-16_00-31-14",
"mirantis": "yes",
"build_number": "130",
"ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha": "1d08d6f80b6514085dd8c0af4d437ef5d37e2802",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "13e7eb64352a90edf1265f672d87fce10fee5093",
"astute_sha": "9a74b788be9a7c5682f1c52a892df36e4766ce3f",
"release": "5.0.1",
"fuellib_sha": "2d1e1369c13bc9771e9473086cb064d257a21fc2"

1. Create new environment (CentOS, HA mode)
2. Choose GRE segmentation
3. Choose Ceilometer
4. Add controller, compute, mongo
5. Start deployment. It was successful
6. Start OSTF tests. Instances go into the error state, and Horizon reports: "No valid host not found."

In the logs (controller node-9, compute node-10):

<182>Jul 16 13:48:34 node-9 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 192.168.0.2:5673
<182>Jul 16 13:48:34 node-9 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnect for 5.0 seconds...
<182>Jul 16 13:48:39 node-9 nova-oslo.messaging._drivers.impl_rabbit INFO: Connected to AMQP server on 192.168.0.2:5673
<179>Jul 16 13:49:20 node-9 nova-oslo.messaging._drivers.impl_rabbit ERROR: Failed to publish message to topic 'reply_9b46bc4764094efcab664805d52bee30': Socket closed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
    return method(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 728, in _publish
    publisher = cls(self.conf, self.channel, topic, **kwargs)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 365, in __init__
    type='direct', **options)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 331, in __init__
    self.reconnect(channel)
  File "/usr/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in reconnect
    routing_key=self.routing_key)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 84, in __init__
    self.revive(self._channel)
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 218, in revive
    self.declare()
  File "/usr/lib/python2.6/site-packages/kombu/messaging.py", line 104, in declare
    self.exchange.declare()
  File "/usr/lib/python2.6/site-packages/kombu/entity.py", line 166, in declare
    nowait=nowait, passive=passive,
  File "/usr/lib/python2.6/site-packages/amqp/channel.py", line 620, in exchange_declare
    (40, 11), # Channel.exchange_declare_ok
  File "/usr/lib/python2.6/site-packages/amqp/abstract_channel.py", line 67, in wait
    self.channel_id, allowed_methods)
  File "/usr/lib/python2.6/site-packages/amqp/connection.py", line 237, in _wait_method
    self.method_reader.read_method()
  File "/usr/lib/python2.6/site-packages/amqp/method_framing.py", line 189, in read_method
    raise m
IOError: Socket closed
<182>Jul 16 13:49:20 node-9 nova-oslo.messaging._drivers.impl_rabbit INFO: Reconnecting to AMQP server on 127.0.0.1:5673
<182>Jul 16 13:49:20 node-9 nova-oslo.messaging._drivers.impl_rabbit INFO: Delaying reconnec...


Revision history for this message
Anastasia Palkina (apalkina) wrote :
Revision history for this message
OSCI Robot (oscirobot) wrote :

Package python-oslo.messaging has been built from changeset: http://gerrit.mirantis.com/18356
RPM Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-5.0.1-stable/centos
You can build an ISO with this package:
make iso EXTRA_RPM_REPOS="osci-testing,http://osci-obs.vm.mirantis.net:82/centos-fuel-5.0.1-stable/centos"

Revision history for this message
OSCI Robot (oscirobot) wrote :

Package python-oslo.messaging has been built from changeset: http://gerrit.mirantis.com/18357
DEB Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.0.1-stable/ubuntu
You can build an ISO with this package:
make iso EXTRA_DEB_REPOS="http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.0.1-stable/ubuntu /"

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #134
"build_id": "2014-07-17_00-31-14",
"mirantis": "yes",
"build_number": "134",
"ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
"nailgun_sha": "1d08d6f80b6514085dd8c0af4d437ef5d37e2802",
"production": "docker",
"api": "1.0",
"fuelmain_sha": "069686abb90f458f67cfcb4018cacc19971e4b4d",
"astute_sha": "9a74b788be9a7c5682f1c52a892df36e4766ce3f",
"release": "5.0.1",
"fuellib_sha": "2d1e1369c13bc9771e9473086cb064d257a21fc2"

1. Create new environment (Ubuntu, HA mode)
2. Choose GRE segmentation
3. Choose Ceph for images
4. Choose Sahara
5. Add 3 controllers, compute, cinder, 3 ceph
6. Untag the storage and management networks and move them to other interfaces
7. Start deployment. It was successful
8. But there are errors on compute in nova-compute.log (node-4):

2014-07-17 11:03:23.916 27747 INFO oslo.messaging._drivers.impl_rabbit [req-3d94560f-a050-4d62-90ea-de4b2cdc8ea2 ] Reconnecting to AMQP server on localhost:5672
2014-07-17 11:03:23.916 27747 INFO oslo.messaging._drivers.impl_rabbit [req-3d94560f-a050-4d62-90ea-de4b2cdc8ea2 ] Delaying reconnect for 1.0 seconds...
2014-07-17 11:03:24.944 27747 ERROR oslo.messaging._drivers.impl_rabbit [req-3d94560f-a050-4d62-90ea-de4b2cdc8ea2 ] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 19 seconds.

2014-07-17 12:57:39.909 6797 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to publish message to topic 'conductor': [Errno 32] Broken pipe
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 632, in ensure
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit return method(*args, **kwargs)
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 728, in _publish
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit publisher = cls(self.conf, self.channel, topic, **kwargs)
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 384, in __init__
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit **options)
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 331, in __init__
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit self.reconnect(channel)
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 339, in reconnect
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit routing_key=self.routing_key)
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/dist-packages/kombu/messaging.py", line 84, in __init__
2014-07-17 12:57:39.909 6797 TRACE oslo.messaging._drivers.imp...
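
Several distinct failure strings appear in the excerpts in this report (connection reset, socket closed, broken pipe, connection refused). As a rough triage aid, the following sketch counts them in a list of log lines. The pattern set is taken only from the messages quoted here and is an assumption about what is worth counting; `classify` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Patterns copied from messages quoted in this bug report.
PATTERNS = {
    "reconnect": re.compile(r"Reconnecting to AMQP server on \S+"),
    "reset": re.compile(r"\[Errno 104\] Connection reset by peer"),
    "socket_closed": re.compile(r"Socket closed"),
    "broken_pipe": re.compile(r"\[Errno 32\] Broken pipe"),
    "refused": re.compile(r"\[Errno 111\] ECONNREFUSED"),
}


def classify(lines):
    """Count how many lines match each known failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "INFO: Reconnecting to AMQP server on 192.168.0.1:5672",
    "ERROR: Failed to publish message to topic 'None': [Errno 104] Connection reset by peer",
    "ERROR: Failed to publish message to topic 'conductor': [Errno 32] Broken pipe",
    "IOError: Socket closed",
]
print(classify(sample))
```

Running it over the full nova, cinder and murano logs from each node would show which failure mode dominates on which host.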


Revision history for this message
Anastasia Palkina (apalkina) wrote :
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Sergey Murashov (smurashov) wrote :

Not reproduced on ISO #135.

Revision history for this message
Ryan Moe (rmoe) wrote :

What was the root cause of this issue and why does adding support for RabbitMQ heartbeats fix it? While I agree that the patch to oslo messaging is useful in the general sense I think knowing exactly WHY this broke and WHY this patch fixed it will help our understanding of HA issues in the future.

Changed in fuel:
milestone: 5.0.1 → 5.1
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

For 5.1, this problem is being addressed under https://bugs.launchpad.net/mos/+bug/1341656. This bug is specifically about the impact on 5.0.1.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

We've run out of time to deal with this in the 5.0 time frame. Please revert the oslo.messaging patches created for this bug from the package for 5.0.1 and make sure the vanilla package gets synced into the mirrors.

Revision history for this message
Roman Vyalov (r0mikiam) wrote :
tags: added: heartbeat