rabbitmqctl list_channels timeout causes RabbitMQ restart

Bug #1566816 reported by Dmitry Stepanenko on 2016-04-06
This bug affects 1 person
Affects              Status                     Importance   Assigned to      Milestone
Mirantis OpenStack   Status tracked in 10.0.x
  10.0.x                                        High         Alexey Galkin
  9.x                                           High         Alexey Galkin

Bug Description

Detailed bug description:

BVT_2 build #197 failed. The logs (nova-all.log, neutron-all.log, neutron/server.log) contain several messages saying that the AMQP server is unreachable, and "lrmd.log" contains several messages like this:

2016-04-03T14:22:01.180966+00:00 info: INFO: p_rabbitmq-server[16942]: get_status(): failed with code 69. Command output: Error: unable to connect to node 'rabbit@messaging-node-2': nodedown

There are also several messages saying that 'rabbitmqctl list_channels' timed out:
2016-04-03T14:07:13.201907+00:00 err: ERROR: p_rabbitmq-server[3919]: get_monitor(): 'rabbitmqctl list_channels' timed out, per-node explanation:

RabbitMQ was restarted because of these timeouts and as a result the test failed.
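The restart comes from the Pacemaker resource agent's monitor giving up after repeated `rabbitmqctl` timeouts. The sketch below is a simplified, hypothetical illustration of that consecutive-timeout logic, not the actual p_rabbitmq-server agent code: the function name `monitor_list_channels`, the variables `RABBITMQCTL_TIMEOUT`, `MAX_TIMEOUTS` and `STATE_FILE`, and their defaults are assumptions, and the exact threshold comparison used by the real agent may differ. It relies on GNU `timeout`, which exits with 124 when it kills the wrapped command, and treats any other failure (such as code 69, seen in the get_status() line above when the node is down) as fatal.

```shell
#!/bin/sh
# Hypothetical, simplified sketch of a monitor check that tolerates a limited
# number of consecutive 'rabbitmqctl list_channels' timeouts.
# NOT the real p_rabbitmq-server agent; names and defaults are assumptions.

RABBITMQCTL_TIMEOUT=${RABBITMQCTL_TIMEOUT:-30}         # seconds per rabbitmqctl call (assumed)
MAX_TIMEOUTS=${MAX_TIMEOUTS:-3}                       # plays the role of max_rabbitmqctl_timeouts
STATE_FILE=${STATE_FILE:-/tmp/rabbitmqctl_timeouts}   # keeps the counter between monitor runs

# Returns 0 if the node looks healthy, 1 if the resource should be treated as failed.
monitor_list_channels() {
    count=$(cat "$STATE_FILE" 2>/dev/null || echo 0)

    timeout "$RABBITMQCTL_TIMEOUT" rabbitmqctl list_channels >/dev/null 2>&1
    rc=$?

    case $rc in
        0)
            # Success: reset the consecutive-timeout counter.
            echo 0 > "$STATE_FILE"
            return 0
            ;;
        124)
            # GNU timeout killed the command: count it, fail only past the limit.
            count=$((count + 1))
            echo "$count" > "$STATE_FILE"
            if [ "$count" -gt "$MAX_TIMEOUTS" ]; then
                echo "list_channels timed out $count times in a row (max $MAX_TIMEOUTS)" >&2
                return 1
            fi
            return 0
            ;;
        *)
            # Any other error (e.g. 69 when the node is down) fails immediately.
            echo "rabbitmqctl failed with code $rc" >&2
            return 1
            ;;
    esac
}
```

With a limit like this, a few slow `list_channels` calls are absorbed, but a longer stall pushes the counter past the limit and the resource is declared failed, which is what triggers the restart.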

Expected result:
RabbitMQ works without interruption and the test passes.

Actual result:
RabbitMQ restarts and the test fails.

Reproducibility:

Seen once, in build #197 of the BVT_2 task.

See the attachment for further details.

Dina Belova (dbelova) on 2016-04-06
Changed in mos:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.0
tags: added: area-oslo

(This check was performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data for the bug you are describing:

version

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info

The issue was found only once, but we should still check the log files to make sure there is no bug here.

Dmitry Mescheryakov (dmitrymex) wrote :

QA team, please reproduce the issue and provide us with an environment where 'rabbitmqctl list_channels' times out. Here are the suggested steps:

1. After deploying the environment and before running the tests, execute the following command on one of the controllers:
 crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value 1000000
   If RabbitMQ restarts, wait for it to come back up.

2. Run the tests.
3. After running the tests, examine the end of /var/log/node-X.domain.tld/lrmd.log for each controller. If you see lines like the following:
2016-04-03T14:17:16.020640+00:00 err: ERROR: p_rabbitmq-server[20677]: get_monitor(): 'rabbitmqctl list_channels' timed out 5 of max. 3 time(s) in a row and is not responding. The resource is failed.

then you have reproduced the issue.
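Steps 1 and 3 of the procedure above can be scripted roughly as follows. This is a sketch under assumptions: the helper names `raise_timeout_limit` and `check_lrmd_logs` are hypothetical, only the last 1000 lines of each log are scanned as "the end" of the file, and the match uses just the stable prefix of the message quoted in step 3, because the counter values in that line (and possibly the rest of it) depend on the configured max_rabbitmqctl_timeouts. `crm_resource --get-parameter` is used only to read the value back.

```shell
#!/bin/sh
# Hypothetical helpers for the reproduction procedure.

# Step 1 (on one controller): let the monitor tolerate practically unlimited
# rabbitmqctl timeouts, then print the stored value to confirm the change.
raise_timeout_limit() {
    crm_resource --resource p_rabbitmq-server \
        --set-parameter max_rabbitmqctl_timeouts --parameter-value 1000000 || return 1
    crm_resource --resource p_rabbitmq-server --get-parameter max_rabbitmqctl_timeouts
}

# Step 3: report every given lrmd.log whose tail contains a list_channels timeout.
# Returns 0 if at least one log shows the reproduction, 1 otherwise.
check_lrmd_logs() {
    found=1
    for log in "$@"; do
        [ -f "$log" ] || continue   # skips an unmatched glob pattern as well
        if tail -n 1000 "$log" \
            | grep -q "get_monitor(): 'rabbitmqctl list_channels' timed out"; then
            echo "reproduced: $log"
            found=0
        fi
    done
    return $found
}
```

Usage: run `raise_timeout_limit` on one controller before the tests, then `check_lrmd_logs /var/log/node-*.domain.tld/lrmd.log` after them; the glob assumes the per-node log layout quoted in step 3.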

Changed in mos:
assignee: MOS Oslo (mos-oslo) → Alexey Galkin (agalkin)
summary: - RabbitMQ falls down with error "Error: unable to connect to node 'rabbit@messaging-node-2': nodedown"
+ rabbitmqctl list_channels timeout causes RabbitMQ restart
description: updated
Changed in mos:
milestone: 9.0 → 9.0-updates
Alexey Galkin (agalkin) wrote :

This bug can't be reproduced. Temporarily moved to Invalid.
