RabbitMQ handles exit signals improperly

Bug #1577239 reported by Alexander Rubtsov
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Maintenance

Bug Description

 --- Environment ---
MOS Version: 7.0 + MU-3
RabbitMQ: 3.5.4-1~u14.04+mos3

 --- Description ---
The RabbitMQ cluster broke down and Pacemaker was not able to bring it back up automatically:
Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED cic-2.domain.tld
Masters: [ cic-1.domain.tld ]
Slaves: [ cic-3.domain.tld ]

As a result, the OpenStack environment became inoperable:
the OpenStack CLI started to return "Service Unavailable (HTTP 503)" errors.

There is the following crash report in the RabbitMQ log files:
=CRASH REPORT==== 19-Apr-2016::22:22:06 ===
  crasher:
    initial call: rabbit_reader:init/2
    pid: <0.19859.1062>
    registered_name: []
    exception exit: channel_termination_timeout
      in function rabbit_reader:wait_for_channel_termination/3 (src/rabbit_reader.erl, line 633)
      in call from rabbit_reader:handle_exception/3 (src/rabbit_reader.erl, line 669)
      in call from rabbit_reader:terminate/2 (src/rabbit_reader.erl, line 492)
      in call from rabbit_reader:handle_other/2 (src/rabbit_reader.erl, line 418)
      in call from rabbit_reader:mainloop/4 (src/rabbit_reader.erl, line 380)
      in call from rabbit_reader:run/1 (src/rabbit_reader.erl, line 309)
      in call from rabbit_reader:start_connection/5 (src/rabbit_reader.erl, line 270)
    ancestors: [<0.12202.1062>,rabbit_tcp_client_sup,rabbit_sup,<0.846.0>]
    messages: [{'EXIT',#Port<0.198604>,normal}]
    links: []
    dictionary: [{{channel,1},
                   {<0.24887.1062>,{method,rabbit_framing_amqp_0_9_1}}},
                  {{ch_pid,<0.24887.1062>},{1,#Ref<0.0.1189.84882>}},
                  {process_name,
                      {rabbit_reader,
                          <<"192.168.2.32:53241 -> 192.168.2.68:5673">>}}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 25262
  neighbours:

There is a corresponding bug in the RabbitMQ server bug tracker on GitHub:
https://github.com/rabbitmq/rabbitmq-server/issues/544
According to the bug status, it was fixed in RabbitMQ server 3.6.1.

Is it possible to get RabbitMQ 3.6.1, or at least this particular fix, into MOS 7.0 updates?
If additional log files are required, I can provide them directly.

Changed in fuel:
milestone: none → 7.0-updates
assignee: nobody → MOS Linux (mos-linux)
importance: Undecided → High
status: New → Confirmed
tags: added: area-linux
Revision history for this message
Albert Syriy (asyriy) wrote :

I guess questions about updates (MOS 7.0-updates) are better directed to the MOS Maintenance team (https://launchpad.net/~mos-maintenance).

Changed in fuel:
assignee: MOS Linux (mos-linux) → MOS Maintenance (mos-maintenance)
Changed in fuel:
milestone: 7.0-updates → 7.0-mu-4
Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I believe the upstream fix mentioned above is only cosmetic: it will prevent spamming the logs with channel_termination_timeout. That crash simply happens at the time of a controller crash (or network split), at the same moment the real problems are bringing RabbitMQ to its knees.

A lot of bugs have been fixed since RabbitMQ 3.5.4 / Erlang 16. But as far as I know an upgrade is out of the question, so the best thing to do is to find out why the OCF script failed to restart or reset RabbitMQ to a working state. Maybe it is the OCF script changes that should be ported.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

Alexey,

Please let us know which information is required to debug the OCF script.
The support team can provide RabbitMQ and Pacemaker log files and collect other information from the affected environment.

Revision history for this message
Alexander Rubtsov (arubtsov) wrote :

sla1 for 7.0-updates

tags: added: sla1
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Alexander, steps to reproduce and Pacemaker / Rabbit logs would be helpful.

Revision history for this message
Bartosz Kupidura (zynzel) wrote :

Steps to reproduce:
1) Boot 2 VMs from volume in parallel
2) Wait for the VMs to reach the 'active' state
3) Go to 1)

Command used to boot a VM (a rough loop sketch is below):
nova --os-tenant-id <tenant-id> boot --block-device source=image,id=<image-id>,dest=volume,size=20,shutdown=remove,bootindex=0 --flavor <flavor> --nic net-id=<net-id1>,v4-fixed-ip=<fixed_ip1> --nic net-id=<net-id2>,v4-fixed-ip=<fixed_ip2> --availability-zone nova:<compute-host> --meta evacuation_policy=NoEvacuation --key_name cic_key --config-drive true <VM-name>
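
A rough shell sketch of that loop (all IDs, names, and the readiness/cleanup steps are hypothetical placeholders, not taken from the affected environment):

# Boot two volume-backed VMs in parallel, wait until they leave BUILD, clean up, repeat.
while true; do
    for name in repro-vm-1 repro-vm-2; do
        nova --os-tenant-id "$TENANT_ID" boot \
            --block-device source=image,id="$IMAGE_ID",dest=volume,size=20,shutdown=remove,bootindex=0 \
            --flavor "$FLAVOR" --nic net-id="$NET_ID" \
            --config-drive true "$name" &
    done
    wait
    # Crude poll for step 2; the original report only says "wait for the 'active' state".
    while nova list | grep -qE 'repro-vm-[12].*BUILD'; do sleep 5; done
    # Cleanup between iterations is an assumption; the original steps just say "Go to 1)".
    nova delete repro-vm-1 repro-vm-2
    sleep 30
done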

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Let me describe my findings from the logs and my thoughts on the topic.

1. Steps to reproduce: Bartosz, we started dozens of VMs in parallel and this worked like a charm on a fully operational OpenStack environment. The behavior you observe is caused by the issue; these are not "steps to reproduce". If you see failures during VM startup, that is the outcome of the issue, not its root cause.

2. As per comment #2, these "crashes" are harmless, and the fix you are asking to backport is just a cosmetic change; it will not bring stability. The root cause of the issue is not in these crashes; the crashes are only a signal.

3. Digging through the logs, I found the following:

   * There were about 1.2 million outstanding messages in queues before the crash, mostly in scheduler_fanout_* (a quick way to check for such backlogs is sketched below this list).
   * The cause of the first cluster partition was a missed net tick (heartbeat) that arrived too late: node-3 was marked "down" on node-2, while node-1 received the tick on time. This caused node-2 to intentionally disconnect from node-1, creating a network partition.
   * Pacemaker, via the OCF script, noticed the network partition and started to reassemble the cluster.
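
For reference, a quick way to check for such backlogs (a hedged sketch, assuming rabbitmqctl is run on a controller node and the queues live in the default "/" vhost):

# Show the 20 deepest queues, largest backlog first.
rabbitmqctl list_queues -p / name messages | sort -k2 -nr | head -20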

From my point of view, the combination of these things is the root cause of the issue. We discussed the findings with the oslo team, and here is what can be done to avoid this in the future (a rough sketch of the first two is below):

   * increase the net_ticktime parameter of RabbitMQ/Erlang to 60s (the default)
   * introduce an expiration policy on RabbitMQ's queues and exchanges, which should help against growing fanouts
   * increase the number of rabbitmqctl timeouts tolerated by Pacemaker (https://docs.mirantis.com/openstack/fuel/fuel-7.0/operations.html#how-to-make-rabbitmq-ocf-script-tolerate-rabbitmqctl-timeouts)

None of these workarounds has been tested properly, and they change the reference design of the product. Including them in an MU is subject to a wider discussion.
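
A minimal sketch of the first two workarounds (hedged: the config file path, policy name, and queue pattern are assumptions, not the exact MOS 7.0 settings):

# 1. Raise the Erlang distribution tick back to the 60s default.
#    Append to /etc/rabbitmq/rabbitmq-env.conf on every controller, then restart RabbitMQ via Pacemaker.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel net_ticktime 60"

# 2. Expire idle fanout/reply queues after 10 minutes; the policy name and pattern are illustrative.
rabbitmqctl set_policy -p / --apply-to queues expire-fanout '^(scheduler_fanout|reply)_' '{"expires":600000}'

Note that RabbitMQ applies at most one policy per queue, so in an HA setup the "expires" key may need to be merged into the existing HA policy rather than added as a separate one.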

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

I completely agree with Denis. An additional point from the RabbitMQ side: the false network partition (caused by the low net_ticktime) triggered RabbitMQ bug https://github.com/rabbitmq/rabbitmq-server/issues/749. Stack traces can be found in the RabbitMQ logs by searching for the string `find_common`. The fix is here: https://github.com/rabbitmq/rabbitmq-server/pull/779, and it will be released with RabbitMQ 3.6.2.

So here are some more possible solutions:
- Disable the HA policy for queues - a lot of RabbitMQ bugs in 3.5.4 are directly related to HA queues (a sketch is below).
- Upgrade RabbitMQ to at least 3.5.7 (with all the patches that were backported for 8.0 and the patch above).
- If adding the expiration policy does not reduce the number of outstanding messages, then upgrading RabbitMQ to 3.6.2 will greatly reduce the load during the initial sync of HA queues, as it now supports batched sync.
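
A hedged sketch of the first option; the actual HA policy name in a MOS 7.0 deployment may differ, so check list_policies first:

# Find the HA policy currently applied to the queues, then drop it.
rabbitmqctl list_policies -p /
rabbitmqctl clear_policy -p / ha-all    # 'ha-all' is an assumed name; use the one listed above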

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Moving to Incomplete until the telco team gets results.

Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
milestone: 7.0-mu-4 → 7.0-updates
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Invalid for 7.0-updates as there has been no update for more than a month.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

The issue has not been reproduced during the telco team's tests. Closing as Invalid.

Changed in fuel:
status: Incomplete → Invalid