rabbit died and everything else died

Bug #1772236 reported by Iain Lane
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
Auto Package Testing      Fix Released  Undecided   Unassigned
rabbitmq-server (Ubuntu)  New           Undecided   Unassigned

Bug Description

Why did it die?

Should it have self-restarted?

ubuntu@juju-prod-ues-proposed-migration-machine-1:~$ journalctl -u rabbitmq-server.service -n1000 | cat
-- Logs begin at Sun 2018-05-20 00:18:25 UTC, end at Sun 2018-05-20 08:58:27 UTC. --
May 20 04:00:11 juju-prod-ues-proposed-migration-machine-1 systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=137/n/a
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: Stopping and halting node 'rabbit@ps45-10-25-180-146' ...
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: Error: unable to connect to node 'rabbit@ps45-10-25-180-146': nodedown
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: DIAGNOSTICS
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: ===========
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: attempted to contact: ['rabbit@ps45-10-25-180-146']
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: rabbit@ps45-10-25-180-146:
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: * connected to epmd (port 4369) on ps45-10-25-180-146
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: * epmd reports: node 'rabbit' not running at all
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: other nodes on ps45-10-25-180-146: ['rabbitmq-cli-28979']
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: * suggestion: start the node
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: current node details:
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: - node name: 'rabbitmq-cli-28979@juju-prod-ues-proposed-migration-machine-1'
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: - home dir: .
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 rabbitmq[28971]: - cookie hash: 7+AChRZDewWFJK8SEUhx+Q==
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=2
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 systemd[1]: rabbitmq-server.service: Unit entered failed state.
May 20 04:00:12 juju-prod-ues-proposed-migration-machine-1 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.

Revision history for this message
Iain Lane (laney) wrote :

OK, it got OOM killed...

[Sun May 20 03:58:24 2018] snmpd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[Sun May 20 03:58:24 2018] snmpd cpuset=/ mems_allowed=0
[Sun May 20 03:58:24 2018] CPU: 0 PID: 27747 Comm: snmpd Tainted: G OE 4.4.0-116-generic #140-Ubuntu
[Sun May 20 03:58:24 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Sun May 20 03:58:24 2018] 0000000000000286 e1b1ea220281c262 ffff8800785079d8 ffffffff813ffc13
[Sun May 20 03:58:24 2018] ffff880078507b90 ffff88007b410e00 ffff880078507a48 ffffffff8121012e
[Sun May 20 03:58:24 2018] ffff880011910e00 ffff88007b411150 ffff88007b410e00 ffff880078507b90
[Sun May 20 03:58:24 2018] Call Trace:
[Sun May 20 03:58:24 2018] [<ffffffff813ffc13>] dump_stack+0x63/0x90
[Sun May 20 03:58:24 2018] [<ffffffff8121012e>] dump_header+0x5a/0x1c5
[Sun May 20 03:58:24 2018] [<ffffffff811968f2>] oom_kill_process+0x202/0x3c0
[Sun May 20 03:58:24 2018] [<ffffffff81196d19>] out_of_memory+0x219/0x460
[Sun May 20 03:58:24 2018] [<ffffffff8119cd45>] __alloc_pages_slowpath.constprop.88+0x965/0xb00
[Sun May 20 03:58:24 2018] [<ffffffff8119d168>] __alloc_pages_nodemask+0x288/0x2a0
[Sun May 20 03:58:24 2018] [<ffffffff811e6ccc>] alloc_pages_current+0x8c/0x110
[Sun May 20 03:58:24 2018] [<ffffffff81192e0b>] __page_cache_alloc+0xab/0xc0
[Sun May 20 03:58:24 2018] [<ffffffff81195390>] filemap_fault+0x150/0x400
[Sun May 20 03:58:24 2018] [<ffffffff812a81d6>] ext4_filemap_fault+0x36/0x50
[Sun May 20 03:58:24 2018] [<ffffffff811c2216>] __do_fault+0x56/0xf0
[Sun May 20 03:58:24 2018] [<ffffffff811c5d75>] handle_mm_fault+0xfa5/0x1820
[Sun May 20 03:58:24 2018] [<ffffffff811bf088>] ? list_lru_add+0x58/0x120
[Sun May 20 03:58:24 2018] [<ffffffff81215c33>] ? __fput+0x193/0x230
[Sun May 20 03:58:24 2018] [<ffffffff8106c747>] __do_page_fault+0x197/0x400
[Sun May 20 03:58:24 2018] [<ffffffff8106ca17>] trace_do_page_fault+0x37/0xe0
[Sun May 20 03:58:24 2018] [<ffffffff81064fb9>] do_async_page_fault+0x19/0x70
[Sun May 20 03:58:24 2018] [<ffffffff81851a08>] async_page_fault+0x28/0x30
[Sun May 20 03:58:24 2018] Mem-Info:
[Sun May 20 03:58:24 2018] active_anon:464532 inactive_anon:6207 isolated_anon:0
                            active_file:94 inactive_file:82 isolated_file:0
                            unevictable:913 dirty:0 writeback:0 unstable:0
                            slab_reclaimable:6604 slab_unreclaimable:4637
                            mapped:855 shmem:6557 pagetables:2796 bounce:0
                            free:14212 free_pcp:113 free_cma:0
[Sun May 20 03:58:24 2018] Node 0 DMA free:10180kB min:356kB low:444kB high:532kB active_anon:5320kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:4kB slab_reclaimable:40kB slab_unreclaimable:76kB kernel_stack:32kB pagetables:76kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Sun May 20 03:58:24 2018] lowmem_reserve[]: 0 1945 1945 1945 1945
[Sun May 20 03:58:24 2018] Node 0 DMA32 fr...

Revision history for this message
Iain Lane (laney) wrote :

also: we should have been notified by something other than the workers falling over

maybe consider multiple servers as a mitigation: https://www.rabbitmq.com/ha.html
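
For reference, the mirroring that guide describes is set up with a policy, roughly like the following, assuming a second node were already clustered with this one (the policy name and the match-everything pattern are only illustrative):

 # mirror every queue onto all nodes in the cluster
 $ sudo rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'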

Revision history for this message
Iain Lane (laney) wrote :

I've added delivery_mode=2 which is supposed to make the server write the messages to disk, so maybe next time we won't lose all queued requests.
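
One thing worth double-checking: persistent messages only survive a broker restart if the queues holding them are declared durable as well. That's quick to verify (a sketch; the real queue names will differ):

 $ sudo rabbitmqctl list_queues name durable messages_persistent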

Revision history for this message
Iain Lane (laney) wrote :

Worth noting that we ran for many months before hitting this, and then:

ubuntu@juju-prod-ues-proposed-migration-machine-1:~$ dmesg -T | grep "Out of memory: Kill" | uniq
[Wed May 9 16:06:23 2018] Out of memory: Kill process 1408 (beam) score 215 or sacrifice child
[Sun May 20 03:58:24 2018] Out of memory: Kill process 19495 (beam) score 437 or sacrifice child
[Fri Jun 1 02:28:15 2018] Out of memory: Kill process 6569 (beam) score 428 or sacrifice child
[Fri Jun 8 05:37:54 2018] Out of memory: Kill process 1142 (beam) score 434 or sacrifice child

4 times in a month.

One thing that happened "sort of" around this time is

  https://launchpad.net/ubuntu/+source/erlang/1:18.3-dfsg-1ubuntu3.1

but that's only a vague correlation, not causation.

Does seem like rabbit's memory usage grows over time until it's eventually killed.

Revision history for this message
Iain Lane (laney) wrote :

For autopkgtest-cloud I just cowboyed a change to add Restart=on-failure to rabbitmq-server.service. Maybe that will help us mitigate this, in combination with delivery_mode=2.
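
For the record, roughly what that change looks like done as a systemd drop-in instead of editing the unit directly, so it survives package upgrades (the RestartSec value is just an example):

 $ sudo mkdir -p /etc/systemd/system/rabbitmq-server.service.d
 $ printf '[Service]\nRestart=on-failure\nRestartSec=10\n' | \
     sudo tee /etc/systemd/system/rabbitmq-server.service.d/override.conf
 $ sudo systemctl daemon-reload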

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I first thought we could log some data like:
 $ rabbitmqctl list_queues name durable owner_pid messages_ready messages_unacknowledged messages messages_ready_ram messages_unacknowledged_ram messages_ram messages_persistent message_bytes message_bytes_ram message_bytes_persistent memory state
via cron.
But then, we don't yet know exactly what we're looking for.

I found that the service-oriented
 $ rabbitmqctl report
has all the data you could want.
If we don't gather it too often, and maybe even gzip it, that should be manageable.

In my test it came to 7.5k raw and 2.6k gzipped.
A real case might be bigger, but if we do that hourly or so, we would see which element grows over time.
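
Something like this dropped into /etc/cron.hourly would do it (the paths and retention here are just a sketch):

 #!/bin/sh
 # /etc/cron.hourly/rabbitmq-report: snapshot the full report, gzipped and timestamped
 rabbitmqctl report | gzip > /var/log/rabbitmq/report-$(date +%Y%m%d%H%M).gz
 # prune old snapshots so the reports don't fill the disk themselves
 find /var/log/rabbitmq -name 'report-*.gz' -mtime +30 -delete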

Especially interesting is the definition of the base memory counter:
  memory Bytes of memory consumed by the Erlang process associated with the queue, including
         stack, heap and internal structures.

Yeah, that could be useful next time this happens.

Revision history for this message
Skia (hyask) wrote :

There have been many improvements on that front:
* RabbitMQ's memory consumption has been studied closely, bringing many fixes in autopkgtest-cloud.
* We now have a watchdog that restarts RabbitMQ when things go bad.
* That watchdog was exercised a lot before the memory-consumption fixes landed, and the code deals with a RabbitMQ restart correctly.

Changed in auto-package-testing:
status: New → Fix Released