RabbitMQ beam.smp and rabbitmqctl segfault sporadically, leading to various issues with deployment and scalability
Bug #1541819 reported by Vladimir Kuklin. This bug affects 2 people.
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Fix Released | High | Alexey Lebedeff | |
| 8.0.x | Fix Released | High | Alexey Lebedeff | |
| Mitaka | Fix Released | High | Alexey Lebedeff | |
Bug Description
Judging by a number of bugs involving controller addition/deletion, we are seeing issues with RabbitMQ in both the master and 8.0 branches. Currently, deployment and OCF scripts fail sporadically due to segfaults of the beam.smp and/or rabbitmqctl binaries.
Usually segfaults look like this:
Bugs that are related to or caused by these segfaults include:
https:/
https:/
https:/
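A hedged way to confirm such segfaults in a diagnostic snapshot is to grep a node's kernel log for the crashing binaries. This is a sketch, not Fuel tooling; the path mirrors the file names quoted in the comment further down and is only an assumption about the snapshot layout:

```
# Sketch: list kernel-reported segfaults of the RabbitMQ binaries
# in a node's kernel log taken from a diagnostic snapshot.
grep segfault \
    ./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/kernel.log \
  | grep -E 'beam\.smp|rabbitmqctl'
```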
tags: added: promoted-to-critical
tags: added: release-notes
tags: added: 8.0 release-notes-done; removed: release-notes
tags: added: rabbitmq
tags: added: ct1 customer-found support
Changed in fuel: status: Fix Committed → Fix Released
Vladimir, the segfaults are not that important. In our current RabbitMQ version _any_ rabbitmqctl failure leads to a segfault. For instance, if rabbitmqctl is unable to connect to RabbitMQ, it will segfault, even though it would normally have returned an error anyway. For example, take a look at the snapshot in bug https://bugs.launchpad.net/fuel/+bug/1541029, specifically these files:
./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/kernel.log
./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/lrmd.log
In kernel.log you may see 5 beam.smp segfaults at 00:53. But if you look into lrmd.log, you will see that they correspond to a RabbitMQ restart which started at 00:51. That is, the RabbitMQ restart caused the rabbitmqctl segfaults, not vice versa.
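A minimal sketch of the cross-check described above, assuming the two log files from the snapshot; the lrmd.log match pattern is a guess at how the rabbitmq resource events are logged, not an exact quote:

```
# Timestamps of beam.smp/rabbitmqctl segfaults as seen by the kernel:
grep segfault kernel.log | grep -E 'beam\.smp|rabbitmqctl'
# Resource-agent activity mentioning rabbit around the same minutes (00:5x),
# to check whether a restart preceded the crashes:
grep -i rabbit lrmd.log | grep -E ' 00:5[0-9]:'
```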
We also don't like that rabbitmqctl segfaults instead of simply returning an error, and we will look into that. But the segfaults themselves seem to be of little significance. Hence I am lowering the importance to High, and I am fairly sure the fix can be deferred to a maintenance update.
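Since the practical concern is rabbitmqctl's exit behaviour rather than the crash itself, here is a hedged sketch of the defensive pattern a monitoring/OCF-style script can apply; the timeout value and messages are illustrative, not Fuel's actual implementation:

```
#!/bin/sh
# Illustrative sketch only: treat any non-zero rabbitmqctl exit, including
# a segfault (which the shell reports as 139 = 128 + SIGSEGV), as an
# ordinary "broker unavailable" error instead of letting the crash leak out.
if timeout 30 rabbitmqctl -q status >/dev/null 2>&1; then
    echo "rabbitmq is up"
else
    rc=$?
    echo "rabbitmqctl failed with exit code ${rc} (139 means SIGSEGV)" >&2
    exit 1
fi
```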