RabbitMQ beam.smp and rabbitmqctl segfaults sporadically leading to different issues with deployment and scalability

Bug #1541819 reported by Vladimir Kuklin
This bug affects 2 people
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Alexey Lebedeff
8.0.x               Fix Released  High        Alexey Lebedeff
Mitaka              Fix Released  High        Alexey Lebedeff

Bug Description

Many recent bugs involving controller addition/deletion point to problems with RabbitMQ in both the master and 8.0 branches. Currently, we see deployment scripts or OCF scripts failing sporadically due to segfaults of the beam.smp and/or rabbitmqctl binaries.

Usually segfaults look like this:

http://pastebin.com/E49c70NF

Bugs that are related/caused by these segfaults are the following:

https://bugs.launchpad.net/fuel/+bug/1541040
https://bugs.launchpad.net/fuel/+bug/1541029
https://bugs.launchpad.net/fuel/+bug/1540915

Dmitry Mescheryakov (dmitrymex) wrote :

Vladimir, the segfaults are not that important. In our current RabbitMQ version, _any_ rabbitmqctl failure leads to a segfault. For instance, if rabbitmqctl is unable to connect to RabbitMQ, it will segfault. But it would normally return an error anyway. For example, take a look at the snapshot in bug https://bugs.launchpad.net/fuel/+bug/1541029 . The relevant files there are:

./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/kernel.log
./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/lrmd.log

In kernel.log you can see 5 beam.smp segfaults at 00:53. But if you look into lrmd.log, you will see that they correspond to a RabbitMQ restart which started at 00:51. That is, here the RabbitMQ restart caused the rabbitmqctl segfaults, and not vice versa.

We also don't like that rabbitmqctl segfaults instead of just returning an error, and we will look into that. But the segfaults themselves seem to be of little significance. Hence I am lowering the importance to High, and I am pretty sure that the issue can be deferred to a maintenance update.

tags: added: promoted-to-critical
Roman Podoliaka (rpodolyaka) wrote :

Based on feedback from Dmitry ^ and Alexey (in Slack) we believe it's safe to move this to 8.0-updates and continue the investigation of the root cause.

Alexey Lebedeff (alebedev-a) wrote :

Segfaults only happen when rabbitmqctl wants to print an error message to stderr. The rabbitmqctl code contains a construct that triggers undefined behaviour, and this undefined behaviour is what causes the segfault on Erlang 18.X.

Upstream bugs:
https://github.com/rabbitmq/rabbitmq-common/issues/53
http://bugs.erlang.org/browse/ERL-91

As the segfault happens only when rabbitmqctl needs to report some other error condition, it can't be the cause of any stability issue: the error is already there before the crash.

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/17055

tags: added: release-notes
Dmitry Mescheryakov (dmitrymex) wrote :

Text for the release note:
When printing an error report, rabbitmqctl might fail with a segmentation fault. This does not cause any problem, since rabbitmqctl would return an error code even if it succeeded in printing the error report. However, users might see a lot of segfault errors in kernel.log on controller nodes, or in lrmd.log on the master node, as a result of Pacemaker monitoring calls. As outlined above, these messages do not indicate any problem.

tags: added: 8.0 release-notes-done
removed: release-notes
Alexey Lebedeff (alebedev-a) wrote :

RabbitMQ 3.6.1 was packaged for 9.0.

Alexey Lebedeff (alebedev-a) wrote :

The proposed fix for 8.0 doesn't require any restarts of RabbitMQ servers because the changes are local to `rabbitmqctl`. It's enough to install the fixed package.

Fuel Devops McRobotson (fuel-devops-robot) wrote :

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/18898

Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/rabbitmq-server (8.0)

Change abandoned by Alexey Lebedeff <email address hidden> on branch: 8.0
Review: https://review.fuel-infra.org/17055

Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/centos7/rabbitmq-server (8.0)

Fix proposed to branch: 8.0
Change author: Alexey Lebedeff <email address hidden>
Review: https://review.fuel-infra.org/18903

Alexey Lebedeff (alebedev-a) wrote :

The bug is easily reproduced by stopping the RabbitMQ broker and then calling `rabbitmqctl status` in a loop until a coredump is produced. It usually takes just a few attempts.
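
The reproduction loop described above could be automated roughly as follows. This is only a sketch under stated assumptions: the helper names `died_from_segfault` and `attempts_until_segfault` are hypothetical, and the check accepts both a negative return code (how Python's `subprocess` reports a child killed by a signal) and 128 + signal number (how a shell wrapper typically reports it), since the report does not say which one `rabbitmqctl` produces.

```python
import signal
import subprocess
from typing import Optional, Sequence


def died_from_segfault(returncode: int) -> bool:
    """Return True if an exit status indicates death by SIGSEGV.

    subprocess reports a signal-killed child as a negative return code;
    a shell that waited for the crashed process reports 128 + signal.
    """
    sigsegv = int(signal.SIGSEGV)
    return returncode in (-sigsegv, 128 + sigsegv)


def attempts_until_segfault(
    command: Sequence[str], max_attempts: int = 50
) -> Optional[int]:
    """Run `command` repeatedly and return the 1-based attempt that segfaulted.

    Returns None if no attempt segfaulted within `max_attempts`.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if died_from_segfault(result.returncode):
            return attempt
    return None
```

With the broker already stopped, calling `attempts_until_segfault(["rabbitmqctl", "status"])` on an affected node would be expected to return a small number, while on a node with the fixed package it should return None.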

tags: added: rabbitmq
Alexey Galkin (agalkin) wrote :

Verified as fixed in 9.0-220

Roman Rufanov (rrufanov)
tags: added: ct1 customer-found support
Alexey Galkin (agalkin)
Changed in fuel:
status: Fix Committed → Fix Released
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/centos7/rabbitmq-server (8.0)

Change abandoned by Alexey Lebedeff <email address hidden> on branch: 8.0
Review: https://review.fuel-infra.org/18903

TatyanaGladysheva (tgladysheva) wrote :

Verified on 8.0 + MU4 updates.

Steps to verify:
1. On a controller node, stop the RabbitMQ broker:
   crm resource stop master_p_rabbitmq-server
2. Run 'rabbitmqctl status' several times.
3. Check kernel.log and lrmd.log for 'segfault' messages.

Actual results:
Before the fix, there are 'segfault' messages in the logs.
After the fix, there are no 'segfault' messages in the logs.
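
Step 3 of the verification could be scripted along these lines. This is a minimal sketch, not the procedure actually used for verification: the function names `segfault_lines` and `check_logs` are hypothetical, the log paths must be supplied by the caller, and the match is a plain case-insensitive substring search rather than a parser for any particular log format.

```python
from typing import Dict, Iterable, List


def segfault_lines(lines: Iterable[str], process: str = "") -> List[str]:
    """Return log lines that mention a segfault.

    If `process` is given (for example "beam.smp"), keep only lines that
    also contain that process name.
    """
    matches = []
    for line in lines:
        text = line.rstrip("\n")
        if "segfault" not in text.lower():
            continue
        if process and process not in text:
            continue
        matches.append(text)
    return matches


def check_logs(paths: Iterable[str], process: str = "") -> Dict[str, List[str]]:
    """Map each log file path to the segfault lines found in it."""
    found = {}
    for path in paths:
        # errors="replace" keeps the scan going if a log has stray bytes.
        with open(path, errors="replace") as handle:
            found[path] = segfault_lines(handle, process)
    return found
```

Running `check_logs` on the locations of kernel.log and lrmd.log before and after installing the fixed package should show matches only in the first case.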
