oslo.messaging

Activity log for bug #1581148

Date	Who	What changed	Old value	New value	Message
2016-05-12 17:31:29	Kirill Bespalov	bug			added bug
2016-05-12 17:31:37	Kirill Bespalov	oslo.messaging: assignee		Kirill Bespalov (k-besplv)
2016-05-12 18:03:28	OpenStack Infra	oslo.messaging: status	New	In Progress
2016-05-18 14:18:45	Kirill Bespalov	description	Version: 9.0 Steps to reproduce: 1. Deploy a MOS environment. 2. Run some tests on it (the exact cause is not yet known). Expected results: All logs are clean. Actual results: In the log of one of the OpenStack components you find many exceptions like NotFound: Basic.consume: (404) NOT_FOUND - no queue 'reply_4b5920a6600d4d779c61c1a82dd7b81a' in vhost '/' (full stack trace from neutron-server logs - http://paste.openstack.org/show/494399/) This indicates that the process lost a queue it was listening on, and the situation does not resolve by itself. Losing a queue means the server stops processing messages from it, which may be critical to its operation (depending on the queue). In the RabbitMQ logs on node-61, grep finds the following entries (only the earliest few are shown): http://paste.openstack.org/show/494589/ Note the pattern: the first two queue.declare operations timed out, and then basic.consume fails in an endless loop. It seems that RabbitMQ failed to create the queue, possibly due to overload, and oslo.messaging did not notice. Unfortunately the relevant neutron-server logs had already been rotated, so it is not clear what happened in oslo.messaging at the time of the queue declaration.	Version: 9.0 Steps to reproduce: 1. Deploy a MOS environment. 2. Run some tests on it (the exact cause is not yet known). Expected results: All logs are clean. Actual results: In the log of one of the OpenStack components you find many exceptions like NotFound: Basic.consume: (404) NOT_FOUND - no queue 'reply_4b5920a6600d4d779c61c1a82dd7b81a' in vhost '/' (full stack trace from neutron-server logs - http://paste.openstack.org/show/494399/) It happens due to the following HA race condition: (1) A cluster consists of two nodes, A and B. (2) The queue 'abc' is hosted on node A. (3) A consumer, due to a reconnection, declares the queue on node B (not the node hosting it). (4) Node A goes down and loses the queue 'abc'. (5) Node B deletes the queue metadata (because the home node is down) and does not send basic.cancel to the consumers, because at this time they are not yet declared. (6) The consumer tries to declare itself on the missing queue and receives a 404. Losing a queue means the server stops processing messages from it, which may be critical to its operation (depending on the queue).
2016-05-20 14:51:43	OpenStack Infra	oslo.messaging: status	In Progress	Fix Released
2016-06-06 18:02:33	OpenStack Infra	tags		in-feature-amqp-dispatch-router
2016-07-20 21:53:23	Nobuto Murata	bug			added subscriber Nobuto Murata
2016-08-30 16:21:14	OpenStack Infra	tags	in-feature-amqp-dispatch-router	in-feature-amqp-dispatch-router in-stable-mitaka
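
The race described in the 2016-05-18 description change can be sketched as a toy simulation. This is only an illustration under stated assumptions: FakeCluster, run_race and consume_with_redeclare are hypothetical names invented here, no real RabbitMQ or oslo.messaging API is used, and the redeclare-and-retry recovery shown is one plausible reaction to the 404, not necessarily the fix that was released.

```python
class NotFound(Exception):
    """Stands in for the broker's 404 NOT_FOUND reply (hypothetical)."""


class FakeCluster:
    """Toy model of a two-node broker cluster; not a real client API."""

    def __init__(self):
        self.queues = {}     # queue name -> home node
        self.consumers = {}  # queue name -> set of consumer tags
        self.cancelled = []  # consumer tags that were sent basic.cancel

    def queue_declare(self, name, node):
        # Declaring an existing queue is a no-op: its home node is unchanged.
        self.queues.setdefault(name, node)

    def basic_consume(self, name, tag):
        if name not in self.queues:
            raise NotFound(f"(404) NOT_FOUND - no queue '{name}' in vhost '/'")
        self.consumers.setdefault(name, set()).add(tag)

    def node_down(self, node):
        # Queues homed on the failed node are dropped. Only consumers that
        # are already registered can be notified with basic.cancel.
        lost = [q for q, home in self.queues.items() if home == node]
        for name in lost:
            del self.queues[name]
            self.cancelled.extend(sorted(self.consumers.pop(name, set())))


def run_race():
    """Replay steps (2)-(6); return (cancelled tags, error text or None)."""
    cluster = FakeCluster()
    cluster.queue_declare("abc", node="A")  # (2) queue lives on node A
    cluster.queue_declare("abc", node="B")  # (3) redeclare via node B
    cluster.node_down("A")                  # (4)+(5) queue and metadata gone
    try:
        cluster.basic_consume("abc", "c1")  # (6) consumer registers too late
    except NotFound as exc:
        return cluster.cancelled, str(exc)
    return cluster.cancelled, None


def consume_with_redeclare(cluster, name, tag, node, attempts=3):
    """Assumed recovery: on a 404, declare the queue again and retry."""
    for _ in range(attempts):
        try:
            cluster.basic_consume(name, tag)
            return True
        except NotFound:
            cluster.queue_declare(name, node)
    return False
```

In this model run_race() ends with no basic.cancel delivered and a 404 on basic.consume, which matches the endless NotFound loop in the report when the consumer simply retries basic.consume without redeclaring the queue.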