RabbitMQ handles exit signals improperly

Bug #1577239 reported by Alexander Rubtsov
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Maintenance

Bug Description

 --- Environment ---
MOS Version: 7.0 + MU-3
RabbitMQ: 3.5.4-1~u14.04+mos3

 --- Description ---
The RabbitMQ cluster broke down and Pacemaker was not able to bring it back up automatically:
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED cic-2.domain.tld
Masters: [ cic-1.domain.tld ]
Slaves: [ cic-3.domain.tld ]

As a result, the OpenStack environment became inoperable:
the OpenStack CLI started to return "Service Unavailable (HTTP 503)" errors.

There is the following crash report in the RabbitMQ log files:
=CRASH REPORT==== 19-Apr-2016::22:22:06 ===
  crasher:
    initial call: rabbit_reader:init/2
    pid: <0.19859.1062>
    registered_name: []
    exception exit: channel_termination_timeout
      in function rabbit_reader:wait_for_channel_termination/3 (src/rabbit_reader.erl, line 633)
      in call from rabbit_reader:handle_exception/3 (src/rabbit_reader.erl, line 669)
      in call from rabbit_reader:terminate/2 (src/rabbit_reader.erl, line 492)
      in call from rabbit_reader:handle_other/2 (src/rabbit_reader.erl, line 418)
      in call from rabbit_reader:mainloop/4 (src/rabbit_reader.erl, line 380)
      in call from rabbit_reader:run/1 (src/rabbit_reader.erl, line 309)
      in call from rabbit_reader:start_connection/5 (src/rabbit_reader.erl, line 270)
    ancestors: [<0.12202.1062>,rabbit_tcp_client_sup,rabbit_sup,<0.846.0>]
    messages: [{'EXIT',#Port<0.198604>,normal}]
    links: []
    dictionary: [{{channel,1},
                   {<0.24887.1062>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.24887.1062>},{1,#Ref<0.0.1189.84882>}},
                  {process_name,
                      {rabbit_reader,
                          <<"192.168.2.32:53241 -> 192.168.2.68:5673">>}}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 25262
  neighbours:

There is a corresponding bug in the RabbitMQ server bug tracker on GitHub:
https://github.com/rabbitmq/rabbitmq-server/issues/544
According to the bug status, it was fixed in RabbitMQ server 3.6.1.

Is it possible to get RabbitMQ 3.6.1, or at least this particular fix, into MOS 7.0 updates?
If additional log files are required, I can provide them directly.

Changed in fuel:
milestone: none → 7.0-updates
assignee: nobody → MOS Linux (mos-linux)
importance: Undecided → High
status: New → Confirmed
tags: added: area-linux
Revision history for this message
Albert Syriy (asyriy) wrote :

I guess questions about updates (MOS 7.0-updates) are better directed to the MOS Maintenance team (https://launchpad.net/~mos-maintenance).

Changed in fuel:
assignee: MOS Linux (mos-linux) → MOS Maintenance (mos-maintenance)
Changed in fuel:
milestone: 7.0-updates → 7.0-mu-4
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I believe the upstream fix mentioned above is only cosmetic: it will prevent spamming the logs with channel_termination_timeout. That crash simply happens at the time of a controller crash (or network split), at the same moment the real problems are bringing RabbitMQ to its knees.

A lot of bugs have been fixed since RabbitMQ 3.5.4 / Erlang 16. But as far as I know an upgrade is out of the question, so the best thing to do is to find out why the OCF script failed to restart or reset RabbitMQ to a working state. Maybe it is the OCF script changes that should be ported.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

Alexey,

Please let us know which information is required to debug the OCF script.
The support team can provide RabbitMQ and Pacemaker log files and collect other information from the affected environment.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla1 for 7.0-updates

tags: added: sla1
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Alexander, steps to reproduce and Pacemaker / Rabbit logs would be helpful.

Revision history for this message
Bartosz Kupidura (zynzel) wrote :

Steps to reproduce:
1) Boot 2 VMs from volume in parallel
2) Wait for the VMs to reach the 'active' state
3) Go to 1)

Command used to boot a VM (a rough loop sketch is below):
nova --os-tenant-id <tenant-id> boot --block-device source=image,id=<image-id>,dest=volume,size=20,shutdown=remove,bootindex=0 --flavor <flavor> --nic net-id=<net-id1>,v4-fixed-ip=<fixed_ip1> --nic net-id=<net-id2>,v4-fixed-ip=<fixed_ip2> --availability-zone nova:<compute-host> --meta evacuation_policy=NoEvacuation --key_name cic_key --config-drive true <VM-name>
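
A rough shell sketch of that loop (all IDs, names, and the readiness/cleanup steps are hypothetical placeholders, not taken from the affected environment):

# Boot two volume-backed VMs in parallel, wait until they leave BUILD, clean up, repeat.
while true; do
    for name in repro-vm-1 repro-vm-2; do
        nova --os-tenant-id "$TENANT_ID" boot \
            --block-device source=image,id="$IMAGE_ID",dest=volume,size=20,shutdown=remove,bootindex=0 \
            --flavor "$FLAVOR" --nic net-id="$NET_ID" \
            --config-drive true "$name" &
    done
    wait
    # Crude poll for step 2; the original report only says "wait for the 'active' state".
    while nova list | grep -qE 'repro-vm-[12].*BUILD'; do sleep 5; done
    # Cleanup between iterations is an assumption; the original steps just say "Go to 1)".
    nova delete repro-vm-1 repro-vm-2
    sleep 30
done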

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Let me describe my findings from the logs and my thoughts on the topic.

1. Steps to reproduce: Bartosz, we started dozens of VMs in parallel and this worked like a charm on a fully operational OpenStack environment. The behavior you observe is caused by the issue; these are not "steps to reproduce". If you see failures during VM startup, that is the outcome of the issue, not its root cause.

2. As per comment #2, these "crashes" are harmless, and the fix you are asking to backport is just a cosmetic change; it will not bring stability. The root cause of the issue is not in these crashes; the crashes are only a signal.

3. Digging through the logs, I found the following:

   * There were about 1.2 million outstanding messages in queues before the crash, mostly in scheduler_fanout_* (a quick way to check for such backlogs is sketched below this list).
   * The cause of the first cluster partition was a missed net tick (heartbeat) that arrived too late: node-3 was marked "down" on node-2, while node-1 received the tick on time. This caused node-2 to intentionally disconnect from node-1, creating a network partition.
   * Pacemaker, via the OCF script, noticed the network partition and started to reassemble the cluster.
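
For reference, a quick way to check for such backlogs (a hedged sketch, assuming rabbitmqctl is run on a controller node and the queues live in the default "/" vhost):

# Show the 20 deepest queues, largest backlog first.
rabbitmqctl list_queues -p / name messages | sort -k2 -nr | head -20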

From my point of view, the combination of these things is the root cause of the issue. We discussed the findings with the oslo team, and here is what can be done to avoid this in the future (a rough sketch of the first two is below):

   * increase the net_ticktime parameter of RabbitMQ/Erlang to 60s (the default)
   * introduce an expiration policy on RabbitMQ's queues and exchanges, which should help against growing fanouts
   * increase the number of rabbitmqctl timeouts tolerated by Pacemaker (https://docs.mirantis.com/openstack/fuel/fuel-7.0/operations.html#how-to-make-rabbitmq-ocf-script-tolerate-rabbitmqctl-timeouts)

None of these workarounds has been tested properly, and they change the reference design of the product. Including them in an MU is subject to a wider discussion.
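
A minimal sketch of the first two workarounds (hedged: the config file path, policy name, and queue pattern are assumptions, not the exact MOS 7.0 settings):

# 1. Raise the Erlang distribution tick back to the 60s default.
#    Append to /etc/rabbitmq/rabbitmq-env.conf on every controller, then restart RabbitMQ via Pacemaker.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel net_ticktime 60"

# 2. Expire idle fanout/reply queues after 10 minutes; the policy name and pattern are illustrative.
rabbitmqctl set_policy -p / --apply-to queues expire-fanout '^(scheduler_fanout|reply)_' '{"expires":600000}'

Note that RabbitMQ applies at most one policy per queue, so in an HA setup the "expires" key may need to be merged into the existing HA policy rather than added as a separate one.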

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I completely agree with Denis. An additional point from the RabbitMQ side: the false network partition (caused by the low net_ticktime) triggered RabbitMQ bug https://github.com/rabbitmq/rabbitmq-server/issues/749. Stack traces can be found in the RabbitMQ logs by searching for the string `find_common`. The fix is here: https://github.com/rabbitmq/rabbitmq-server/pull/779, and it will be released with RabbitMQ 3.6.2.

So here are some more possible solutions:
- Disable the HA policy for queues - a lot of RabbitMQ bugs in 3.5.4 are directly related to HA queues (a sketch is below).
- Upgrade RabbitMQ to at least 3.5.7 (with all the patches that were backported for 8.0 and the patch above).
- If adding the expiration policy does not reduce the number of outstanding messages, then upgrading RabbitMQ to 3.6.2 will greatly reduce the load during the initial sync of HA queues, as it now supports batched sync.
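
A hedged sketch of the first option; the actual HA policy name in a MOS 7.0 deployment may differ, so check list_policies first:

# Find the HA policy currently applied to the queues, then drop it.
rabbitmqctl list_policies -p /
rabbitmqctl clear_policy -p / ha-all    # 'ha-all' is an assumed name; use the one listed above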

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Moving to Incomplete until the telco team gets results.

Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
milestone: 7.0-mu-4 → 7.0-updates
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Invalid for 7.0-updates as there has been no update for more than a month.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

The issue has not been reproduced during the telco team's tests. Closing as Invalid.

Changed in fuel:
status: Incomplete → Invalid