Activity log for bug #1581148

Date Who What changed Old value New value Message
2016-05-12 17:31:29 Kirill Bespalov bug added bug
2016-05-12 17:31:37 Kirill Bespalov oslo.messaging: assignee Kirill Bespalov (k-besplv)
2016-05-12 18:03:28 OpenStack Infra oslo.messaging: status New In Progress
2016-05-18 14:18:45 Kirill Bespalov description
Old value:
Version: 9.0
Steps to reproduce:
1. Deploy a MOS environment.
2. Run some tests on it (the exact cause is not yet known).
Expected results: All logs are clean.
Actual results: In the log of one of the OpenStack components you find many exceptions like
NotFound: Basic.consume: (404) NOT_FOUND - no queue 'reply_4b5920a6600d4d779c61c1a82dd7b81a' in vhost '/'
(full stack trace from neutron-server logs - http://paste.openstack.org/show/494399/)
This indicates that the process lost a queue it was listening on, and the situation does not resolve by itself. Losing a queue means the server stops processing messages from it, which might be crucial to its work (depending on the queue). In the rabbit logs on node-61, grep finds the following entries (only the earliest few are shown): http://paste.openstack.org/show/494589/ Note the pattern: the first two queue.declare operations timed out, and then basic.consume fails in an endless loop. It seems that RabbitMQ failed to create the queue, possibly because of overload, and oslo.messaging did not notice. Unfortunately the relevant neutron-server logs had already been rotated, so it is not clear what happened in oslo.messaging at the time of the queue declaration.
New value:
Version: 9.0
Steps to reproduce:
1. Deploy a MOS environment.
2. Run some tests on it (the exact cause is not yet known).
Expected results: All logs are clean.
Actual results: In the log of one of the OpenStack components you find many exceptions like
NotFound: Basic.consume: (404) NOT_FOUND - no queue 'reply_4b5920a6600d4d779c61c1a82dd7b81a' in vhost '/'
(full stack trace from neutron-server logs - http://paste.openstack.org/show/494399/)
This happens due to the following HA race condition:
(1) A cluster consists of two nodes: A and B.
(2) The queue 'abc' is hosted on node A.
(3) A consumer, after reconnecting, declares the queue on node B (but has not yet registered itself as a consumer).
(4) Node A goes down and the queue 'abc' is lost.
(5) Node B deletes the queue metadata (because the home node is down) and does not send basic.cancel to the consumers, because at that moment they are not yet registered.
(6) The consumer tries to register itself on the missing queue and receives a 404.
Losing a queue means the server stops processing messages from it, which might be crucial to its work (depending on the queue).
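The race in steps (1) to (6) of the description can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: FakeBroker, NotFound, drop_queue and consume_with_redeclare are hypothetical names invented for illustration, not RabbitMQ, kombu or oslo.messaging APIs, and the redeclare-on-404 reaction shown is one plausible client-side response, not necessarily the change that was released for this bug.

```python
class NotFound(Exception):
    """Stand-in for the broker's 404 NOT_FOUND channel error (hypothetical)."""


class FakeBroker:
    """Hypothetical in-memory broker used only to model the race."""

    def __init__(self):
        self.queues = set()
        self.consumers = {}

    def queue_declare(self, name):
        # Declaring an existing queue is a no-op, as with a set.
        self.queues.add(name)

    def drop_queue(self, name):
        # Models step (5): the queue metadata is deleted and no
        # basic.cancel is sent, so the client is not told.
        self.queues.discard(name)
        self.consumers.pop(name, None)

    def basic_consume(self, name, callback):
        # Models step (6): consuming from a missing queue yields a 404.
        if name not in self.queues:
            raise NotFound("no queue '%s' in vhost '/'" % name)
        self.consumers[name] = callback


def consume_with_redeclare(broker, name, callback, max_attempts=2):
    """Try to consume; on a 404, redeclare the queue and retry.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            broker.basic_consume(name, callback)
            return attempt
        except NotFound:
            if attempt == max_attempts:
                raise
            broker.queue_declare(name)


broker = FakeBroker()
broker.queue_declare("reply_abc")   # step (3): queue declared after reconnect
broker.drop_queue("reply_abc")      # steps (4)-(5): queue silently lost
attempts = consume_with_redeclare(broker, "reply_abc", print)
print(attempts)
```

Without the redeclare step, every retry of basic_consume would raise the same 404, which matches the endless loop of NotFound errors reported in the logs.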
2016-05-20 14:51:43 OpenStack Infra oslo.messaging: status In Progress Fix Released
2016-06-06 18:02:33 OpenStack Infra tags in-feature-amqp-dispatch-router
2016-07-20 21:53:23 Nobuto Murata bug added subscriber Nobuto Murata
2016-08-30 16:21:14 OpenStack Infra tags in-feature-amqp-dispatch-router in-feature-amqp-dispatch-router in-stable-mitaka