Mirantis OpenStack

RPC clients cannot find a reply queue after restart of the last RabbitMQ server in the cluster

Bug #1463802 reported by Artem Panchenko on 2015-06-10

This bug affects 5 people

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 7.0
5.1.x	Fix Released	High	Alexey Khivin	Mirantis OpenStack 5.1.1-mu-1
6.0.x	Fix Released	High	Alexander Nevenchannyy	Mirantis OpenStack 6.0-mu-4
6.1.x	Fix Released	High	Alexey Khivin	Mirantis OpenStack 6.1-mu-1
7.0.x	Fix Released	High	Dmitry Mescheryakov	Mirantis OpenStack 7.0
8.0.x	Fix Released	High	MOS Oslo	Mirantis OpenStack 8.0

Bug Description

Steps to reproduce:

1. Deploy MOS environment in HA mode with several controllers
2. Shut down one of the controllers, either gracefully or not
3. Wait for MySQL, RabbitMQ and OpenStack to failover (several minutes)
4. Try to use an OpenStack API that invokes internal messaging via RabbitMQ. For instance, view the console log of an instance, or create instances, networks, volumes, etc.

Some requests sent in step 4 might fail with a timeout. The logs of the affected service then contain the following message:
"Queue not found:Basic.consume: (404) NOT_FOUND - no queue 'reply_f7cac1a2428d414bb8b9e0a612"

Conditions for reproduction:

The issue occurs rather infrequently, though we don't have exact data. So far only 4 reproductions have been reported during the last 2-3 weeks.

User impact:

While the issue exists, some requests to OpenStack (those processed on the affected controller/service) might fail. The issue does not go away on its own; it needs a manual fix (see workaround below).

Workaround:

The workaround is to restart the affected service. After restart, the service will immediately become operational.
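
To find out which services need a restart, their logs can be scanned for the error message quoted above. The following is only a minimal sketch: the function name and the sample log line are made up for illustration, and the pattern assumes the message format shown in this report.

```python
import re

# Matches the 404 reported for a missing queue and captures the queue name.
# The closing quote is optional because log lines may be truncated.
MISSING_QUEUE_RE = re.compile(r"NOT_FOUND - no queue '(reply_[0-9a-f]+)'?")


def missing_reply_queues(log_lines):
    """Return the distinct reply queue names reported as missing, in order."""
    seen = []
    for line in log_lines:
        match = MISSING_QUEUE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Illustrative log lines, not copied from the attached snapshots.
sample = [
    "ERROR Queue not found:Basic.consume: (404) NOT_FOUND - no queue "
    "'reply_f7cac1a2428d414bb8b9e0a61291a468' in vhost '/'",
    "INFO unrelated line",
]
print(missing_reply_queues(sample))
```

A service whose log keeps producing the same queue name long after the failover is a likely candidate for the restart.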

Current plan:

We are planning to fix the issue in updates for 6.1. Right now we are reproducing it with additional logging enabled to understand the root cause.

Detailed analysis by Roman Podoliaka
==========================================================================

Reply queues created by oslo.messaging are not durable (i.e. they are gone after a restart of the last RabbitMQ server in the cluster). The problem is that after a successful RabbitMQ failover, OpenStack services reconnect correctly, but RPC calls stay broken until we restart the affected service: the reply queue is not recreated, which means no reply can be received for a given call, and the call eventually fails with TimeoutError.
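
The effect of a non-durable queue can be illustrated with a toy model. This is an in-memory sketch only: ToyBroker and its methods are hypothetical and have nothing to do with the real RabbitMQ or oslo.messaging APIs.

```python
class ToyBroker:
    """In-memory stand-in for a broker; not a RabbitMQ client."""

    def __init__(self):
        self.queues = {}  # name -> (durable flag, list of messages)

    def declare(self, name, durable=False):
        self.queues.setdefault(name, (durable, []))

    def restart(self):
        # Only durable queues survive; their messages are dropped for simplicity.
        self.queues = {n: (d, []) for n, (d, _) in self.queues.items() if d}

    def publish(self, name, message):
        if name not in self.queues:
            return False  # the reply has nowhere to go
        self.queues[name][1].append(message)
        return True


broker = ToyBroker()
broker.declare("reply_abc")                # non-durable, like the reply queues above
broker.declare("durable_q", durable=True)
broker.restart()

print(sorted(broker.queues))                  # ['durable_q']
print(broker.publish("reply_abc", "result"))  # False
```

Unless the client declares "reply_abc" again after the restart, every reply sent to it is lost and the caller can only wait for its timeout.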

As can be seen in the output of the commands below, this particular nova-conductor reply queue first migrated from one RabbitMQ node to another, then saw the death of another mirror, and was gone after the RabbitMQ server on node-16 was restarted. The nova-conductor RPC client nevertheless kept trying to consume messages from it.

rabbitmqctl list_queues: http://xsnippet.org/360736/raw/

root@node-16:~# grep reply_f7cac1a2428d414bb8b9e0a61291a468 -P3 /<email address hidden>: http://xsnippet.org/360737/raw/

This wouldn't be a problem if a new reply queue were created for new RPC calls, but currently this makes the RPC client unusable until we restart the whole process.

Note: description of the original error in nova-conductor has been put below.

Initial description by Artem Panchenko
==========================================================================

Fuel version info (6.1 build #521 RC1): http://paste.openstack.org/show/277715/

After the primary controller is shut down, OSTF tests that create Nova instances fail, because all newly booted instances end up in ERROR state:

http://paste.openstack.org/show/281028/

Here is a part of nova-conductor.log (node-16):

http://paste.openstack.org/show/281014/

RabbitMQ cluster status looks good:

[root@fuel-lab-cz5558 ~]# runc 2 rabbitmqctl cluster_status
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
node-16.mirantis.com
Cluster status of node 'rabbit@node-16' ...
[{nodes,[{disc,['rabbit@node-16','rabbit@node-7']}]},
{running_nodes,['rabbit@node-7','rabbit@node-16']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.
node-7.mirantis.com
Cluster status of node 'rabbit@node-7' ...
[{nodes,[{disc,['rabbit@node-16','rabbit@node-7']}]},
{running_nodes,['rabbit@node-16','rabbit@node-7']},
{cluster_name,<<"<email address hidden>">>},
{partitions,[]}]
...done.

Here is AMQP queues info:

http://paste.openstack.org/show/281029/

Steps to reproduce:

1. Create environment: Ubuntu, NeutronGRE, Ceph, Sahara, Ceilometer
2. Add 1 controller, 2 controller+ceph, 1 compute and 3 mongo nodes
3. Deploy changes.
4. Run OSTF
5. Shutdown primary controller (gracefully using `poweroff` command)
6. Run OSTF

Expected result:

- all tests passed except 'Check that required services are running'

Actual:

- all tests which create Nova instances fail

Also, I didn't find out why, but all API requests to Nova take a long time. For example, a simple `nova list` command takes 17 seconds:

http://paste.openstack.org/show/281031/

Diagnostic snapshot (environment ID - 2, nodes: 5,16,7,6,11,13,14): https://drive.google.com/file/d/0BzaZINLQ8-xkNDFrX2RKRS1GOWs/view?usp=sharing

Roman Podoliaka (rpodolyaka) on 2015-06-10

Changed in mos:
assignee:	nobody → MOS Oslo (mos-oslo)
milestone:	none → 6.1
importance:	Undecided → High
no longer affects:	fuel
summary:	- Nova can't boot instances after primary controller graceful shutdown - 'MessagingTimeout: Timed out waiting for a reply to message ID xxx' + RPC clients do not recreate a reply queue after restart of the last + RabbitMQ server in the cluster
description:	updated
description:	updated
Changed in mos:
status:	New → Confirmed

Roman Podoliaka (rpodolyaka) on 2015-06-10

description:	updated
description:	updated

Oleksii Zamiatin (ozamiatin) wrote on 2015-06-10: Re: RPC clients do not recreate a reply queue after restart of the last RabbitMQ server in the cluster

This issue can be resolved using the 'amqp_durable_queues' option, which makes RabbitMQ keep queues on the server. After a controller restart the messages are gone, but the queue persists under the same name. The option can be set to True in the Nova config without modifying oslo.messaging.
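
A quick way to check whether a given config already has the option switched on is sketched below. The config text is a hypothetical fragment, and placing the option in the [DEFAULT] section is an assumption; check where your release expects it.

```python
import configparser

# Hypothetical nova.conf fragment; the [DEFAULT] section is an assumption.
NOVA_CONF = """
[DEFAULT]
amqp_durable_queues = True
"""


def durable_queues_enabled(conf_text):
    """Return True if amqp_durable_queues is switched on in [DEFAULT]."""
    # Interpolation is disabled so that '%' characters in other options
    # do not break parsing.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.getboolean("DEFAULT", "amqp_durable_queues", fallback=False)


print(durable_queues_enabled(NOVA_CONF))      # True
print(durable_queues_enabled("[DEFAULT]\n"))  # False
```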

Roman Podoliaka (rpodolyaka) wrote on 2015-06-11:

Tried to reproduce this locally, but didn't manage to do that. All queues are redeclared correctly after detection of disconnect. It's not clear why this didn't happen on Artem's environment.

Perhaps, we should lower the priority of the bug, if we can't reproduce this again easily.

Alexey Khivin (akhivin) wrote on 2015-06-11:

@Roman You can see this message about the lost queue every time the RabbitMQ cluster is rebooted during an RPC call.
If a client has started an RPC call and is awaiting the answer, and the queue disappears during the call, the client will wait for the reply until the timeout; if the server replies within that period, the server recreates the reply queue.

So the message "queue not found" is not an error.

The reply queue will be recreated by the client when the next RPC call starts.

So the real issue in this case is that the reply message was lost or the server did not reply at all.

To avoid continuous service interruption, Nova should handle the Timeout exception raised by oslo.messaging, and we should investigate why the reply message may be lost.

Alexey Khivin (akhivin) wrote on 2015-06-11:

So, there are no big sense to recreate the queue in this particular case

Alexey Khivin (akhivin) wrote on 2015-06-11:

Sorry)
it makes little sense to recreate the queue in this particular case

Leontii Istomin (listomin) wrote on 2015-06-11:

Deployed the following configuration:
Baremetal,Ubuntu,IBP, Neutron-vlan,Ceph-all,Nova-debug,nova-quotas,6.1_521
Controllers:3 Computes+ceph:3

1. performed part of OSTF tests "Functional tests. Duration 3 min - 14 min".
result: Success.
2. changed nova.conf file on each node:
amqp_durable_queues=True
3. restarted all nova services on each node
result: all nova-compute services were marked as "XXX" in the output of the "nova-manage service list" command. From nova-all log on computes: http://paste.openstack.org/show/283664/
4. stop rabbitmq cluster:
crm resource stop master_p_rabbitmq-server
5. cleanup mnesia on each controller node:
rm -rf /var/lib/rabbitmq/mnesia/rabbit@node-2/*
6. start rabbitmq cluster:
crm resource start master_p_rabbitmq-server
result: nova-computes came back online.
7. performed part of OSTF tests "Functional tests. Duration 3 min - 14 min".
result: Success.
8. restart 1st controller (node-2) with the "reboot" command and wait for it to come back online
results: http://paste.openstack.org/show/283787/
9. restart 2nd controller (node-5) with the "reboot" command
results: http://paste.openstack.org/show/283788/
10. restart 3rd controller (node-6) with the "reboot" command
results: http://paste.openstack.org/show/283790/
11. performed part of OSTF tests "Functional tests. Duration 3 min - 14 min".
results: the following tests failed
Check create, update and delete image actions using Glance v2
   1. Send request to create image
Create volume and boot instance from it
   2. Wait for volume status to become "available".
Create volume and attach it to instance
   5. Wait for "Active" status
Create security group
   1. Create a security group, check if it was created correctly.
Launch instance
   1. Create a new security group (if it doesn`t exist yet).

Eugene Bogdanov (ebogdanov) on 2015-06-11

tags:

added: 6.1rc2

Igor Marnat (imarnat) wrote on 2015-06-11:

@Roman: what about suggestion to Nova to process timeouts from comment #3? Does it make sense to implement this functionality, if it's missing?

Roman Podoliaka (rpodolyaka) wrote on 2015-06-12:

@Alexey

As I tried to describe in the bug description, the point here is: for some reason the reply queue *has not been* recreated on the disconnect, which caused failure of *all* subsequent RPC calls (i.e. the queue hasn't been recreated on the next call either).

>>> This massage about the lost queue you can see each time when you rebooted RabbitMQ cluster during the rpc call.

Well, what we see in our local environments and in the oslo.messaging code: queues and exchanges are redeclared on reconnect (as neither is durable, so they do not survive a RabbitMQ server restart).

>>> If client have started an rpc-call and awating the anwer In the case if queue disappeared during rpc-call, client will wait reply until timeout and if server replies in this period then server recreates reply queue

I'm aware of that, but this particular queue *has been lost for hours* by the moment we saw "queue not found" error in nova-conductor logs. So the cause of TimeoutError is that the reply queue has never been recreated.

>>> reply queue will be recreated by client at the moment of new rpc-call start

Technically, it's recreated on reconnect https://github.com/openstack/oslo.messaging/blob/stable/juno/oslo/messaging/_drivers/impl_rabbit.py#L157-L162

What we see when testing this on local environment - all the queues are recreated after RabbitMQ restart even without RPC calls. I'm not sure why that didn't happen on the bug reporter's environment - probably we are hitting some edge case here.

>>> So, the real issue in this case is the lost reply-message or server not replied at all. To avoid continous service interruption Nova should handle Timeout exception raised by oslo.messaging and we should investigate why reply message may be lost

I respectfully disagree with you here: as you can see in the logs, this particular reply queue has been missing for hours (the first message in the RabbitMQ logs happened on "9-Jun-2015::21:47:51" and the nova-conductor error happened on "2015-06-10 08:22:55.914"). For some reason oslo.messaging never redeclared the queue, so we just missed the reply.
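
For reference, the gap between the two timestamps quoted above can be computed as follows. This sketch assumes both logs use the same time zone, which the report does not state.

```python
from datetime import datetime

# Timestamps quoted above; assumed to be in the same time zone.
rabbit_first_message = datetime.strptime("9-Jun-2015::21:47:51", "%d-%b-%Y::%H:%M:%S")
conductor_error = datetime.strptime("2015-06-10 08:22:55.914", "%Y-%m-%d %H:%M:%S.%f")

gap = conductor_error - rabbit_first_message
print(gap)                                   # 10:35:04.914000
print(round(gap.total_seconds() / 3600, 2))  # 10.58 (hours)
```

That is far longer than any RPC timeout, so a reply that was merely late cannot explain the error.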

Handling of Timeout errors is a totally different question and it won't help here as we haven't provided a queue to receive the reply from.

Roman Podoliaka (rpodolyaka) wrote on 2015-06-12:

@Igor

Short answer, no it won't, as the root cause of the problem is that oslo.messaging does not recreate a reply queue, so we are going to get Timeout error again and again for each subsequent RPC call.

Long answer, handling of Timeout errors *might* help us to mitigate short RabbitMQ server disruptions, if we missed some requests/replies. But it's a much bigger question. The way we look at MQ in oslo.messaging is just a layer to implement simple RPC protocol upon, treating all local/remote process calls in the very same way.

The problem with that is that, if you wanted to handle Timeout errors gracefully, you would end up retrying all possible RPC calls in your code. And not all of those calls are idempotent (i.e. can be safely retried). So we could do that in oslo.messaging, but the consequences might be even worse than without retries.
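
The risk of blind retries can be shown with a toy example. Everything here (ToyServer, call, call_with_retry) is hypothetical and only models the situation where the server did the work but the reply got lost.

```python
class CallTimeout(Exception):
    """Raised by the toy client when no reply arrives."""


class ToyServer:
    def __init__(self):
        self.instances_booted = 0

    def boot_instance(self):
        # Non-idempotent: every execution has a side effect.
        self.instances_booted += 1
        return "ok"


def call(server, lose_reply):
    """Execute the request on the server; optionally lose the reply."""
    result = server.boot_instance()
    if lose_reply:
        raise CallTimeout()
    return result


def call_with_retry(server, lost_replies):
    """Blindly retry on timeout; the first `lost_replies` replies are lost."""
    attempt = 0
    while True:
        try:
            return call(server, lose_reply=attempt < lost_replies)
        except CallTimeout:
            attempt += 1


server = ToyServer()
call_with_retry(server, lost_replies=1)
print(server.instances_booted)  # 2: one user request, two side effects
```

The caller sees a single successful call, while the server has executed the operation twice.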

There is nothing specific in Nova here, it's how all OpenStack projects work with RPC right now.

Bogdan Dobrelya (bogdando) wrote on 2015-06-15:

#10

The logs snapshot contains no logs from nodes, ./remote is empty

Bogdan Dobrelya (bogdando) wrote on 2015-06-15:

#11

As far as I can see from the logs snapshot, reply_15faea725ae24ce2ab5884632fb10848 is present in the output of the rabbitmq-report commands, so the queue wasn't lost.

summary:

- RPC clients do not recreate a reply queue after restart of the last
- RabbitMQ server in the cluster
+ RPC clients cannot find a reply queue after restart of the last RabbitMQ
+ server in the cluster

Bogdan Dobrelya (bogdando) wrote on 2015-06-15:

#12

I renamed the bug, as it seems the queues cannot be found although they exist.

Roman Podoliaka (rpodolyaka) wrote on 2015-06-15:

#13

Bogdan, not sure where you took that queue name from, but I meant "reply_f7cac1a2428d414bb8b9e0a61291a468".

As you can see:

1) http://xsnippet.org/360736/raw/
2) http://xsnippet.org/360737/raw/
3) http://paste.openstack.org/show/281014/

it *does not* exist at the time nova-conductor tries to consume messages from it.
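
Such a check can be scripted against saved `rabbitmqctl list_queues` output. The sketch below assumes the usual layout of a banner line, one queue per line with the name in the first column, and a "...done." footer; the sample text is illustrative and not taken from the linked snippets.

```python
def queue_names(list_queues_output):
    """Extract queue names from `rabbitmqctl list_queues` style text."""
    names = set()
    for line in list_queues_output.splitlines():
        line = line.strip()
        # Skip blank lines and the banner/footer printed around the listing.
        if not line or line.startswith("Listing queues") or line.startswith("...done"):
            continue
        names.add(line.split()[0])
    return names


# Illustrative output with made-up queue names.
sample_output = """Listing queues ...
conductor\t0
reply_00000000000000000000000000000000\t0
...done.
"""

wanted = "reply_f7cac1a2428d414bb8b9e0a61291a468"
print(wanted in queue_names(sample_output))  # False
```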

What we know is that it didn't survive a restart of the last RabbitMQ server (as it's not *durable*). What we don't know is why the oslo.messaging reconnect detection layer hasn't recreated the queue after the RabbitMQ failover (which is what we see it do in the synthetic tests).

Alexey Khivin (akhivin) wrote on 2015-06-15:

#14

Possibly, related bug
https://bugs.launchpad.net/mos/+bug/1465300

Roman Podoliaka (rpodolyaka) wrote on 2015-06-15:

#15

Upon request from Eugene Bogdanov, I'll provide a summary here.

User impact:

Sometimes after successful failover of RabbitMQ server, OpenStack services may start to fail RPC calls with the following messages in logs: "Queue not found:Basic.consume: (404) NOT_FOUND - no queue 'reply_f7cac1a2428d414bb8b9e0a612". This happens when one or more controllers are brought down, but not frequently. The workaround is to restart the affected service. After restart, the service will immediately become operational.

What we've done so far:

- analyzed existing logs of the reproductions (Artem Panchenko's and Leontii Istomin's environments)
- did audit of oslo.messaging code to find out if failover is handled correctly (it must be: queues are explicitly redeclared on reconnect)
- tried to reproduce the issue 'synthetically' without MOS on a RabbitMQ cluster (as well as one-node RabbitMQ server), without any luck - oslo.messaging performs as expected and redeclares all the queues used
- tried to reproduce the issue on a MOS installation with oslo.messaging debug logs enabled - unfortunately, the whole cluster went into a weird state with both Galera and RabbitMQ clusters failing to start
- reproduced the issue once on a small MOS installation (3 controllers + 2 computes), but with oslo.messaging debug logs disabled - we are sure that we see what we see (a reply queue being missing and not recreated on the next RPC call, leaving the affected service unable to make RPC calls until it's restarted or a reconnect is triggered), despite what Alexey/Bogdan suggested in the comments
- when we managed to reproduce the issue once, we tried to trigger a reconnect (gdb -p $PID; call close($FD)) - oslo.messaging correctly reconnected and redeclared all the queues

The plan is to get a simple repro with oslo.messaging debug logs enabled (ideally, without MOS at all, just plain oslo.messaging) and find out why queue redeclare code path may not be executed properly on reconnect after failover.

Possible workarounds:

1) restart of affected services

2) make sure reply queues are durable and thus survive a RabbitMQ restart (so that even if oslo.messaging fails to redeclare the queues explicitly, they are persisted in RabbitMQ itself). <-- the problem with that is that we are not fixing the root cause of the issue

Alexey Khivin (akhivin) wrote on 2015-06-15:

#16

https://bugs.launchpad.net/mos/+bug/1465300 could be the root cause of this bug because it relates to wrong handling of timeout exception

Roman Podoliaka (rpodolyaka) wrote on 2015-06-16:

#17

Alexey, I suggest we try to emulate the problem with raising of that exception, but we haven't seen any NameError ones in the logs, so IMO, that's a different issue.

Alexey Khivin (akhivin) wrote on 2015-06-16:

#18

Roman,
the possible effects are not limited to NameError.
It is possible that the timeout exception is handled as a _previous_ exception.

Eugene Bogdanov (ebogdanov) wrote on 2015-06-16:

#19

Removing 6.1rc2 tag as it's not a blocker for release.

tags:

removed: 6.1rc2

Roman Podoliaka (rpodolyaka) on 2015-06-16

Changed in mos:
assignee:	MOS Oslo (mos-oslo) → Victor Sergeyev (vsergeyev)

Roman Podoliaka (rpodolyaka) on 2015-06-16

description:

updated

Roman Podoliaka (rpodolyaka) on 2015-06-16

description:

updated

Alexey Khivin (akhivin) wrote on 2015-06-16:

#20

yet another possible reason
https://bugs.launchpad.net/mos/+bug/1465757

Dmitry Mescheryakov (dmitrymex) on 2015-06-16

description:

updated

Leontii Istomin (listomin) wrote on 2015-06-16:

#21

reproduced with 6.1-521.
We've deployed the following configuration there:
Baremetal,Centos,IBP,HA, Neutron-gre,Ceph-all,Nova-debug,Nova-quotas, 6.1-521+ https://review.openstack.org/#/c/190137/
Controllers:3 Computes:47
Then we performed the following tests on the env:
Shaker: from 15 22:06:00 to 16 02:23:05
light rally: from 16 02:23:05 to 16 02:53:16
full rally: from 16 02:53:16 to 16 14:53:13
We ran into a RabbitMQ issue https://bugs.launchpad.net/fuel/+bug/1463433. Some nodes of the RabbitMQ cluster were down, but the cluster as a whole was alive:
http://paste.openstack.org/show/295708/

Then we found that Nova couldn't reconnect to RabbitMQ:
http://paste.openstack.org/show/295700/

Diagnostic Snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-06-16_16-09-08.tar.xz

Bogdan Dobrelya (bogdando) wrote on 2015-06-17:

#22

Related https://bugs.launchpad.net/fuel/+bug/1463433/comments/29

Bogdan Dobrelya (bogdando) wrote on 2015-06-17:

#23

Roman, you're right, the queue reply_f7cac1a2428d414bb8b9e0a61291a468 no longer exists, but the exchange is present in the command outputs in the logs snapshot. So it seems that the exchange was recreated OK, while the queue was not.

Vitaly Sedelnik (vsedelnik) on 2015-06-17

Changed in mos:
milestone:	6.1 → 6.1-updates

Viktor Serhieiev (vsergeyev) wrote on 2015-06-18:

#24

Addressed by https://review.fuel-infra.org/#/c/7925/

Changed in mos:
assignee:	Victor Sergeyev (vsergeyev) → Alex Khivin (akhivin)

Alexey Khivin (akhivin) on 2015-06-19

Changed in mos:
status:	Confirmed → In Progress

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-06-23: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

#25

Reviewed: https://review.fuel-infra.org/7925
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: e9a91becba156715706504a0bfb1cf587b5343ad
Author: Alex Khivin <email address hidden>
Date: Mon Jun 22 11:25:26 2015

Improve "Queue not found" exception handling

In the case queue was disapeared during reconnection process, "Queue not
found" exception may break consumption in other queues, thus rpc subsystem
may got stuck.

Added a new method _try_consume() to consume queue. If queue is not found,
this method reconnect queue (this supposes to re-create a lost queue),
and try consume it one more time.

Change-Id: I41ffe7aacbae1ac176e0063a20cdd256cef69127
Closes-bug: #1465757
Closes-bug: #1463802
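
The idea described in the commit message can be sketched as follows. This is not the merged patch: ToyChannel, QueueNotFound and try_consume are hypothetical stand-ins that only show the "redeclare and retry once" pattern.

```python
class QueueNotFound(Exception):
    """Toy stand-in for the broker's 404 NOT_FOUND error."""


class ToyChannel:
    def __init__(self):
        self.queues = set()

    def declare(self, name):
        self.queues.add(name)

    def consume(self, name):
        if name not in self.queues:
            raise QueueNotFound(name)
        return "consuming " + name


def try_consume(channel, name):
    """Consume from a queue; if it is missing, redeclare it and retry once."""
    try:
        return channel.consume(name)
    except QueueNotFound:
        channel.declare(name)         # recreate the lost queue
        return channel.consume(name)  # a second failure propagates to the caller


channel = ToyChannel()                    # fresh broker state: the reply queue is gone
print(try_consume(channel, "reply_abc"))  # consuming reply_abc
```

Handling the error per queue also keeps one missing queue from aborting consumption on the other queues.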

Alexey Khivin (akhivin) on 2015-06-23

Changed in mos:
status:	In Progress → Fix Committed

Eugene Nikanorov (enikanorov) on 2015-06-24

tags:

added: customer-found

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-06-24: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

#26

Fix proposed to branch: openstack-ci/fuel-6.0-updates/2014.2
Change author: Alex Khivin <email address hidden>
Review: https://review.fuel-infra.org/8461

Vitaly Sedelnik (vsedelnik) on 2015-06-25

tags:

added: 6.1-mu-1

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-06-25: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.0-updates/2014.2)

#27

Reviewed: https://review.fuel-infra.org/8461
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.0-updates/2014.2

Commit: 0a4b32d018392506b10635bdffda419bb0ccec05
Author: Alex Khivin <email address hidden>
Date: Wed Jun 24 22:57:37 2015

Improve "Queue not found" exception handling

In the case queue was disapeared during reconnection process, "Queue not
found" exception may break consumption in other queues, thus rpc subsystem
may got stuck.

Added a new method _try_consume() to consume queue. If queue is not found,
this method reconnect queue (this supposes to re-create a lost queue),
and try consume it one more time.

Change-Id: I41ffe7aacbae1ac176e0063a20cdd256cef69127
Closes-bug: #1465757
Closes-bug: #1463802
(cherry picked from commit e9a91becba156715706504a0bfb1cf587b5343ad)

Vitaly Sedelnik (vsedelnik) on 2015-07-06

Changed in mos:
milestone:	6.1-updates → 6.1-mu-1

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-06: Related fix proposed to patching-tests (stable/6.1)

#28

Related fix proposed to branch: stable/6.1
Change author: Alex Khivin <email address hidden>
Review: https://review.fuel-infra.org/9076

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-07: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

#29

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Vladimir Kuklin <email address hidden>
Review: https://review.fuel-infra.org/9122

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-08: Related fix merged to patching-tests (stable/6.1)

#30

Reviewed: https://review.fuel-infra.org/9076
Submitter: Vitaly Sedelnik <email address hidden>
Branch: stable/6.1

Commit: c9a755144a9343ba7fc1befd2b17fb8f21620a27
Author: Alex Khivin <email address hidden>
Date: Tue Jul 7 14:55:39 2015

RPC clients cannot find a reply queue after restart of the last RabbitMQ server in the cluster

Change-Id: I63f30f5b7af22783d13722f05c7005efbcbd0e00
Related-Bug: #1463802

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-08: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-6.1/2014.2)

#31

Reviewed: https://review.fuel-infra.org/9122
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-6.1/2014.2

Commit: a33faa3bce92dfe45bf902c8cf68b1246617ffc2
Author: Vladimir Kuklin <email address hidden>
Date: Tue Jul 7 16:51:33 2015

Empty commit to publish already merged fix

This commit just makes sure that previous commit
gets into correct repository residing on our
OBS host as it was merged between HCF and actual
6.1 GA

Closes-bug: #1463802

Change-Id: I85e7b1553168e360e9d903fc80c56489d6ca8897

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-10: Related fix proposed to patching-tests (stable/6.1)

#32

Related fix proposed to branch: stable/6.1
Change author: Alex Khivin <email address hidden>
Review: https://review.fuel-infra.org/9226

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-10: Related fix merged to patching-tests (stable/6.1)

#33

Reviewed: https://review.fuel-infra.org/9226
Submitter: Vitaly Sedelnik <email address hidden>
Branch: stable/6.1

Commit: 82b62598b622439ca001de549349b92aba3dcf21
Author: Alex Khivin <email address hidden>
Date: Fri Jul 10 13:07:19 2015

Text amendment

Change-Id: I2326cb24102ac4f68ab60672d2266b8d9d489558
Related-Bug: #1463802

Wang Yanbin (wangyanbin) wrote on 2015-07-13:

#34

In my evaluation environment, I also hit this bug. There is only 1 controller node and 1 compute+ceph node.
How can I get this bug fix for the oslo.messaging code? I didn't find the related commit in https://github.com/openstack/oslo.messaging .
When I try to access https://review.fuel-infra.org/8461 , it says "Code Review - Error The page you requested was not found, or you do not have permission to view this page."
Thanks!

Alexey Khivin (akhivin) wrote on 2015-07-13:

#35

Hello Wang
This patch has been implemented for MOS 6.1 and has been included in the MOS updates. You may ask MOS support for updates with the patch included.

It seems there is no patch for upstream OpenStack yet.

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-14: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-5.1.1-updates/2014.1.1)

#36

Fix proposed to branch: openstack-ci/fuel-5.1.1-updates/2014.1.1
Change author: Alex Khivin <email address hidden>
Review: https://review.fuel-infra.org/9335

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-14: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-5.1-updates/2014.1.1)

#37

Fix proposed to branch: openstack-ci/fuel-5.1-updates/2014.1.1
Change author: Alex Khivin <email address hidden>
Review: https://review.fuel-infra.org/9337

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-15: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-5.1.1-updates/2014.1.1)

#38

Reviewed: https://review.fuel-infra.org/9335
Submitter: Vitaly Sedelnik <email address hidden>
Branch: openstack-ci/fuel-5.1.1-updates/2014.1.1

Commit: 022c6c28dd73e7a76270b76c4cbe895c9750e7cb
Author: Alex Khivin <email address hidden>
Date: Tue Jul 14 15:54:34 2015

Improve "Queue not found" exception handling

In the case queue was disapeared during reconnection process, "Queue not
found" exception may break consumption in other queues, thus rpc subsystem
may got stuck.

Added a new method _try_consume() to consume queue. If queue is not found,
this method reconnect queue (this supposes to re-create a lost queue),
and try consume it one more time.

Closes-bug: #1465757
Closes-bug: #1463802

Change-Id: I41ffe7aacbae1ac176e0063a20cdd256cef69127

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-21: Related fix proposed to patching-tests (stable/6.1)

#39

Related fix proposed to branch: stable/6.1
Change author: Vitaly Sedelnik <email address hidden>
Review: https://review.fuel-infra.org/9705

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-07-21: Related fix merged to patching-tests (stable/6.1)

#40

Reviewed: https://review.fuel-infra.org/9705
Submitter: Vitaly Sedelnik <email address hidden>
Branch: stable/6.1

Commit: 335a96cc09e90640abb9351dbe372e3d7cc2210e
Author: Vitaly Sedelnik <email address hidden>
Date: Tue Jul 21 13:41:41 2015

Provide specific commands to restart HA services

Replace generic text to restart all HA services with exact
sequence of commands.

Related-Bug: #1463802

Change-Id: Ie1cf5344288bcacffc79ae0218e3a7bc77b82a9f

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-08-13: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

#41

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Alexey Khivin <email address hidden>
Review: https://review.fuel-infra.org/10420

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-08-14: Fix merged to openstack/oslo.messaging (openstack-ci/fuel-7.0/2015.1.0)

#42

Reviewed: https://review.fuel-infra.org/10420
Submitter: Oleksii Zamiatin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 750fcc8f96761b512226d696271c9d67db56240b
Author: Alexey Khivin <email address hidden>
Date: Thu Aug 13 17:55:01 2015

Improve "Queue not found" exception handling

In the case queue was disapeared during reconnection process, "Queue not
found" exception may break consumption in other queues, thus rpc subsystem
may got stuck.

Added a new method _try_consume() to consume queue. If queue is not found,
this method reconnect queue (this supposes to re-create a lost queue),
and try consume it one more time.

Also fixed a strange test in test_utils.py

Change-Id: I41ffe7aacbae1ac176e0063a20cdd256cef69127
Closes-bug: #1465757
Closes-bug: #1463802
Closes-bug: #1415932

Dmitry Mescheryakov (dmitrymex) wrote on 2015-08-14:

#43

There is a chance that the fix will not get into Liberty by the time we consume it, hence assigning bug to 8.0 in order not to forget porting it.

Roman Rufanov (rrufanov) on 2015-09-17

tags:

added: support

weiguo sun (wsun2) wrote on 2015-10-23:

#44

Hi, is there a proposed fix for this bug to upstream openstack main trunk? --wg

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-10-29: Fix proposed to openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

#45

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Alexey Khivin <email address hidden>
Review: https://review.fuel-infra.org/13463

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2015-11-27: Change abandoned on openstack/oslo.messaging (openstack-ci/fuel-8.0/liberty)

#46

Change abandoned by Dmitry Mescheryakov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13463
Reason: It seems like the bug, fixed by this CR, is also fixed in upstream with a different approach: https://review.openstack.org/#/c/195688/2

The upstream fix is present in 8.0 and so we are not going to merge current change. We will reconsider it only if the issue reoccurs.

Dmitry Mescheryakov (dmitrymex) wrote on 2015-11-27:

#47

We expect that this upstream change has fixed the issue https://review.openstack.org/#/c/195688

Maksym Strukov (unbelll) on 2016-02-16

tags:

added: on-verification

Maksym Strukov (unbelll) wrote on 2016-02-16:

#48

Verified as fixed in 8.0-566

tags:

removed: on-verification
