RabbitMQ beam.smp and rabbitmqctl segfault sporadically, leading to various issues with deployment and scalability
Bug #1541819 reported by Vladimir Kuklin. This bug affects 2 people.
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Fix Released | High | Alexey Lebedeff | |
| 8.0.x | Fix Released | High | Alexey Lebedeff | |
| Mitaka | Fix Released | High | Alexey Lebedeff | |
Bug Description
Judging by a number of bugs involving controller addition/deletion, we are seeing issues with RabbitMQ in both the master and 8.0 branches. Currently, deployment and OCF scripts fail sporadically due to segfaults of the beam.smp and/or rabbitmqctl binaries.
Usually segfaults look like this:
Bugs that are related to or caused by these segfaults include:
https:/
https:/
https:/
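A hedged way to confirm such segfaults in a diagnostic snapshot is to grep a node's kernel log for the crashing binaries. This is a sketch, not Fuel tooling; the path mirrors the file names quoted in the comment further down and is only an assumption about the snapshot layout:

```
# Sketch: list kernel-reported segfaults of the RabbitMQ binaries
# in a node's kernel log taken from a diagnostic snapshot.
grep segfault \
    ./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/kernel.log \
  | grep -E 'beam\.smp|rabbitmqctl'
```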
tags: added: promoted-to-critical
tags: added: release-notes
tags: added: 8.0 release-notes-done; removed: release-notes
tags: added: rabbitmq
tags: added: ct1 customer-found support
Changed in fuel: status: Fix Committed → Fix Released
Vladimir, the segfaults are not that important. In our current RabbitMQ version _any_ rabbitmqctl failure leads to a segfault. For instance, if rabbitmqctl is unable to connect to RabbitMQ, it will segfault, even though it would normally have returned an error anyway. For example, take a look at the snapshot in bug https://bugs.launchpad.net/fuel/+bug/1541029, specifically these files:
./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/kernel.log
./nailgun.test.domain.local/var/log/docker-logs/remote/node-1.test.domain.local/lrmd.log
In kernel.log you may see 5 beam.smp segfaults at 00:53. But if you look into lrmd.log, you will see that they correspond to a RabbitMQ restart which started at 00:51. That is, the RabbitMQ restart caused the rabbitmqctl segfaults, not vice versa.
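A minimal sketch of the cross-check described above, assuming the two log files from the snapshot; the lrmd.log match pattern is a guess at how the rabbitmq resource events are logged, not an exact quote:

```
# Timestamps of beam.smp/rabbitmqctl segfaults as seen by the kernel:
grep segfault kernel.log | grep -E 'beam\.smp|rabbitmqctl'
# Resource-agent activity mentioning rabbit around the same minutes (00:5x),
# to check whether a restart preceded the crashes:
grep -i rabbit lrmd.log | grep -E ' 00:5[0-9]:'
```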
We also don't like that rabbitmqctl segfaults instead of simply returning an error, and we will look into that. But the segfaults themselves seem to be of little significance. Hence I am lowering the importance to High, and I am fairly sure the fix can be deferred to a maintenance update.
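Since the practical concern is rabbitmqctl's exit behaviour rather than the crash itself, here is a hedged sketch of the defensive pattern a monitoring/OCF-style script can apply; the timeout value and messages are illustrative, not Fuel's actual implementation:

```
#!/bin/sh
# Illustrative sketch only: treat any non-zero rabbitmqctl exit, including
# a segfault (which the shell reports as 139 = 128 + SIGSEGV), as an
# ordinary "broker unavailable" error instead of letting the crash leak out.
if timeout 30 rabbitmqctl -q status >/dev/null 2>&1; then
    echo "rabbitmq is up"
else
    rc=$?
    echo "rabbitmqctl failed with exit code ${rc} (139 means SIGSEGV)" >&2
    exit 1
fi
```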