Race condition in mnesia_locker after node down

Bug #1496409 reported by Jorge Niedbalski
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
rabbitmq-server (Ubuntu)
Triaged
Low
Unassigned

Bug Description

[Description]

On a 3 nodes rabbitMQ cluster after a failure on one of the nodes (kernel crash) the following succession
of events has been registered:

===
This initial crash report

$ ack-grep -i mnesia_locker | wc -l
16

60-=CRASH REPORT==== 7-Sep-2015::14:48:45 ===
61- crasher:
--
66: {mnesia_locker,'rabbit@10-10-34-2',granted}}
67- in function gen_server2:terminate/3 (src/gen_server2.erl, line 1045)
68- ancestors: [worker_pool_sup,rabbit_sup,<0.110.0>]
69- messages: []
70- links: [<0.114.0>]
71- dictionary: [{{xtype_to_module,direct},rabbit_exchange_type_direct},
72- {{xtype_to_module,topic},rabbit_exchange_type_topic},
73- {random_seed,{5643,25632,27953}},
74- {{xtype_to_module,fanout},rabbit_exchange_type_fanout},
75- {worker_pool_worker,true},
76- {fhc_age_tree,{0,nil}},
--
88: Reason: {unexpected_info,{mnesia_locker,'rabbit@10-10-34-2',granted}}
89- Offender: [{pid,<0.142.0>},
90- {name,25},
91- {mfargs,{worker_pool_worker,start_link,[25]}},
92- {restart_type,transient},
93- {shutdown,4294967295},
94- {child_type,worker}]

sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1.log
5166:Mnesia('rabbit@10-10-34-1'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@10-10-34-2'}

sosreport-svz-op-fdc-os-3.00087142-20150909140820/var/log/rabbitmq/rabbit@10-10-34-3.log.1
545:Mnesia('rabbit@10-10-34-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@10-10-34-1'}

After this I started seeing this traces:

sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log:572:=CRASH REPORT==== 7-Sep-2015::14:48:45 ===
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log:573: crasher:
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-574- initial call: gen:init_it/6
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-575- pid: <0.2346.189>
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-576- registered_name: []
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-577- exception exit: {{unexpected_cast,{next_job_from,<0.4960.181>}},
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-578- {gen_server2,call,
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-579- [<0.22014.255>,
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-580- {submit,#Fun<rabbit_misc.6.25154013>,<0.2862.189>},
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-581- infinity]}}
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-582- in function gen_server2:terminate/3 (src/gen_server2.erl, line 1045)
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-583- ancestors: [rabbit_mirror_queue_slave_sup,rabbit_sup,<0.110.0>]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-584- messages: []
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-585- links: [<0.259.0>]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-586- dictionary: [{guid,{{3735536962,2437967587,3023977752,60868675},0}}]
sosreport-svz-op-fdc-os-1.00087142-20150909104806/var/log/rabbitmq/rabbit@10-10-34-1-sasl.log-587- trap_exit: true

Related Bugs:

- https://bugs.launchpad.net/mos/+bug/1401948
- https://github.com/erlang/otp/compare/maint...dgud:dgud%3Bmnesia%3Bsticky-race%3BOTP-11375

Possible related upstream commit:

- http://hg.rabbitmq.com/rabbitmq-server/rev/5a63c9e273cc

This seems to be related to [1] and [2] both fixes available since rabbitmq-server 3.4.0

[1] https://github.com/rabbitmq/rabbitmq-server/commit/1a32616b744f4dab09cba4e7a7e747c2b6550361#diff-3b9dc5e3c18be9549b0ab00763e4e123

[2] https://github.com/rabbitmq/rabbitmq-server/commit/238243b10ba6f666fcd8e84289525961fe6e68b9#diff-3b9dc5e3c18be9549b0ab00763e4e123

Tags: sts
tags: added: sts
description: updated
summary: - Race condition in mnesia_locker after nodedown
+ Race condition in mnesia_locker after node down
Revision history for this message
James Page (james-page) wrote :

Jorge

Have you seen this issue in Xenial or later releases of the RabbitMQ server package? Xenial included a 3.5.x series rabbitmq which should contain the fixes referenced.

Note that we also provide this version of RMQ for trusty uses via the Icehouse Ubuntu Cloud Archive.

Changed in rabbitmq-server (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
importance: Medium → Low
Revision history for this message
Jorge Niedbalski (niedbalski) wrote :

James,

No, I am not seeing this issue on > 3.5 series any longer. I think
it's safe to assume this issue has been fixed there.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.