[SRU] rabbitmq-server does not properly shutdown

Bug #1784757 reported by Ivan Kurnosov on 2018-08-01
34
This bug affects 6 people
Affects Status Importance Assigned to Milestone
rabbitmq-server (Debian)
Fix Released
Unknown
rabbitmq-server (Ubuntu)
Medium
Unassigned
Bionic
Medium
Unassigned
Cosmic
Medium
Unassigned

Bug Description

[Impact]
TBD

[Test Case]
TBD

[Regression Potential]
TBD

[Fix]
TBD

The fix is available upstream in Debian as of 3.7.6, so would need backported for bionic and cosmic:

 rabbitmq-server | 3.5.7-1 | xenial | source, all
 rabbitmq-server | 3.6.10-1 | bionic | source, all
 rabbitmq-server | 3.6.10-1 | cosmic | source, all
 rabbitmq-server | 3.7.8-4ubuntu2 | disco | source, all
 rabbitmq-server | 3.7.8-4ubuntu2 | eoan | source, all

[Discussion]
TBD

[Original Report]
When I run `systemctl restart rabbitmq-server` it waits for 90 seconds then systemd sends SIGKILL to it.

Presumably the `epmd` process does not receive SIGTERM, since if I run `kill 1493` (or whatever pid it currently is) then restart happens straight after that successfully

● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: deactivating (final-sigterm) since Wed 2018-08-01 01:17:04 UTC; 7s ago
  Process: 1183 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 178 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS)
  Process: 177 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 177 (code=killed, signal=TERM)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/rabbitmq-server.service
           └─1493 /usr/lib/erlang/erts-9.2/bin/epmd -daemon

Aug 01 01:11:20 rmq-1 systemd[1]: rabbitmq-server.service: Failed to reset devices.list: Operation not permitted
Aug 01 01:11:20 rmq-1 systemd[1]: Starting RabbitMQ Messaging Server...
Aug 01 01:11:25 rmq-1 rabbitmq[178]: Waiting for 'rabbit@rmq-1'
Aug 01 01:11:25 rmq-1 rabbitmq[178]: pid is 204
Aug 01 01:11:30 rmq-1 systemd[1]: Started RabbitMQ Messaging Server.
Aug 01 01:17:04 rmq-1 systemd[1]: Stopping RabbitMQ Messaging Server...
Aug 01 01:17:06 rmq-1 rabbitmq[1183]: Stopping and halting node 'rabbit@rmq-1'

Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please execute the following command only once, as it will automatically gather debugging information, in a terminal:
apport-collect 1784757

When reporting bugs in the future please use apport by using 'ubuntu-bug' and the name of the package affected. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

Changed in rabbitmq-server (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for rabbitmq-server (Ubuntu) because there has been no activity for 60 days.]

Changed in rabbitmq-server (Ubuntu):
status: Incomplete → Expired
Alex Tomkins (tomkins) wrote :

The rabbitmq-server-release GitHub repo has a systemd file which uses shutdown instead of stop:

https://github.com/rabbitmq/rabbitmq-server-release/blob/master/packaging/debs/Debian/debian/rabbitmq-server.service

Switching to this will shut down rabbitmq in the expected way, instead of having to wait ~90 seconds.

Changed in rabbitmq-server (Ubuntu):
status: Expired → Confirmed

The attachment "rabbitmq-stop-shutdown.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch

Thanks for the update Alex.
In general the package seems rather different than upstream in terms of systemd integration.
We will have to take a wider look at this than just this line.

tags: added: server-next
Changed in rabbitmq-server (Ubuntu):
importance: Undecided → Medium
carl2187 (carl2187) wrote :

I can confirm this issue on the latest versions of 18.04.2, somewhat nasty, really slows down reboots when RabbitMQ is installed. Here's reproduce steps to see the bug in action:

rabbitmq installed via "sudo apt install rabbitmq-server"

Stop the service: "sudo systemctl stop rabbitmq-server.service"

Then running "status", we can see the 90 second delay between the stop attempt at 10:31:06, and it giving up and killing the service at 10:32:36

$ sudo systemctl status rabbitmq-server.service
[sudo] password for nearlotech:
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2019-05-09 10:32:36 MDT; 1min 32s ago
  Process: 2409 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 899 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS)
  Process: 898 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 898 (code=killed, signal=TERM)

May 09 10:30:43 ubuntu1804vm systemd[1]: Starting RabbitMQ Messaging Server...
May 09 10:30:46 ubuntu1804vm rabbitmq[899]: Waiting for 'rabbit@ubuntu1804vm'
May 09 10:30:46 ubuntu1804vm rabbitmq[899]: pid is 978
May 09 10:30:50 ubuntu1804vm systemd[1]: Started RabbitMQ Messaging Server.
May 09 10:31:05 ubuntu1804vm systemd[1]: Stopping RabbitMQ Messaging Server...
May 09 10:31:06 ubuntu1804vm rabbitmq[2409]: Stopping and halting node 'rabbit@ubuntu1804vm'
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: State 'stop-final-sigterm' timed out. Killing.
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: Killing process 2706 (epmd) with signal SIGKILL.
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: Failed with result 'timeout'.
May 09 10:32:36 ubuntu1804vm systemd[1]: Stopped RabbitMQ Messaging Server.

Hmm, the same took 1 second for me.

root@b:~# systemctl status rabbitmq-server
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Fri 2019-05-10 07:16:00 UTC; 5s ago
  Process: 6004 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 5348 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 5348 (code=killed, signal=TERM)

May 10 07:15:40 b systemd[1]: Starting RabbitMQ Messaging Server...
May 10 07:15:41 b rabbitmq[5349]: Waiting for rabbit@b
May 10 07:15:41 b rabbitmq[5349]: pid is 5365
May 10 07:15:43 b systemd[1]: Started RabbitMQ Messaging Server.
May 10 07:15:43 b systemd[1]: rabbitmq-server.service: Failed to reset devices.list: Operation not permitted
May 10 07:15:59 b systemd[1]: Stopping RabbitMQ Messaging Server...
May 10 07:15:59 b rabbitmq[6004]: Stopping and halting node rabbit@b
May 10 07:16:00 b systemd[1]: Stopped RabbitMQ Messaging Server.

Do you have any config changes or ongoing messages running while this shutdown happens that might explain and lead us to the root cause?

The stop action is just:
  $ /usr/sbin/rabbitmqctl stop

You might change the loglevel [1] and then in a second console attach to the log to see what is going on in that time?

[1]: https://www.rabbitmq.com/logging.html

Changed in rabbitmq-server (Ubuntu):
status: Confirmed → Incomplete
Bryce Harrington (bryce) wrote :

Like Christian, I also am seeing only a 1 second stop. I tried a few different things but couldn't reproduce the delay.

I poked through /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/sbin/rabbitmq-server, but it's unclear what the technical difference is between 'stop' and 'shutdown'. Some comments in the code suggest that under some circumstances a wait might be triggered - if you instrument that code perhaps you can identify if it is hitting that wait, and if so why.

Without knowing more, just switching to using 'shutdown' feels like a kludge/workaround, and probably not what we should deploy as a fix to this. Instead, some additional investigation needs done to figure out why it is delaying, or at least more specific reproduction steps so others can do that sleuthing.

So, as I don't think there's anything more we can do on this end, I'm going to drop the server-next tag for the time being. Feel free to re-add it if you think there's enough info for us to take another stab at it though.

tags: removed: patch server-next
Ivan Kurnosov (zerkms) wrote :

I'm not sure why I did not mention originally, sorry for that, but there is a chance that what makes it reproducible is starting the daemon inside the lxd/lxc

For those who don't use lxd - the commands would be (I don't have access to linuxes or my infrastructure at the moment, so might be a bit incorrect):

lxc launch ubuntu:bionic
lxc exec <container name> bash

inside container:
apt update
apt install rabbitmq-server
systemct restart rabbitmq-server

Alex Tomkins (tomkins) wrote :
Download full text (3.8 KiB)

Fixes in Debian packaging which also address the problem:

- https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/commit/9764f7e6ffeaa99325b8e5ff41e80b1c4b2894d3
- https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/commit/c5b99f5c4811f460b73c8c8505a98fa55dcd752b

If I start the server with just `service rabbitmq-server start`, my process tree looks like:

rabbitmq 11316 0.0 0.1 4628 800 ? Ss 14:59 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 11337 0.0 0.3 4628 1792 ? S 14:59 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 11633 8.8 12.4 2165876 61504 ? Sl 14:59 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/ebin -noshell -noinput -s rabbit boot -sname rabbit@golestandt -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@golestandt-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@golestandt" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
rabbitmq 11745 0.1 0.1 4520 728 ? Ss 14:59 0:00 erl_child_setup 65536
rabbitmq 11809 0.0 0.2 8264 1072 ? Ss 14:59 0:00 inet_gethost 4
rabbitmq 11810 0.0 0.3 10384 1660 ? S 14:59 0:00 inet_gethost 4
rabbitmq 11504 0.0 0.0 26852 204 ? S 14:59 0:00 /usr/lib/erlang/erts-9.2/bin/epmd -daemon

However, doing `service epmd start` before that gives:

epmd 12226 0.0 0.3 26852 1728 ? Ss 15:01 0:00 /usr/bin/epmd -systemd
rabbitmq 12263 0.0 0.1 4628 816 ? Ss 15:01 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 12287 0.0 0.3 4628 1736 ? S 15:01 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 12582 25.5 13.2 2171584 65160 ? Sl 15:01 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/ebin -noshell -noinput -s rabbit boot -sname rabbit@golestandt -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rab...

Read more...

Bryce Harrington (bryce) wrote :

Alex, thanks for looking up the Debian upstream situation, this is looking a lot more actionable so am re-adding it to server-next.

Sounds like if we just picked the two changes to the service file from the debian commits, those would solve the problem? If so, that should be straightforward to SRU.

tags: added: server-next
Changed in rabbitmq-server (Ubuntu):
status: Incomplete → Triaged
Bryce Harrington (bryce) on 2019-06-09
description: updated
Changed in rabbitmq-server (Ubuntu Bionic):
status: New → Triaged
Changed in rabbitmq-server (Ubuntu Cosmic):
status: New → Triaged
importance: Undecided → Medium
Changed in rabbitmq-server (Ubuntu Bionic):
importance: Undecided → Medium
Changed in rabbitmq-server (Ubuntu):
status: Triaged → Fix Released
summary: - rabbitmq-server does not properly shutdown
+ [SRU] rabbitmq-server does not properly shutdown
Changed in rabbitmq-server (Debian):
status: Unknown → Fix Released
Bryce Harrington (bryce) wrote :

I'm still not able to reproduce the error.

If you can provide an exact set of steps to run in an lxc container it'd help a lot. Here's what I've got so far:

```
$ lxc launch ubuntu:bionic bionic-lp-1784757
Creating bionic-lp-1784757
Starting bionic-lp-1784757
trent:~/ubuntu/RabbitmqServer/sru.1784757$ lxc exec bionic-lp-1784757 -- bash
root@bionic-lp-1784757:~# apt update
Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
(...)
root@bionic-lp-1784757:~# ps aux | grep ^rabbit
rabbitmq 1390 0.0 0.0 4628 868 ? Ss 01:30 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 1403 0.0 0.0 4628 1844 ? S 01:30 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 1553 0.0 0.0 26852 204 ? S 01:30 0:00 /usr/lib/erlang/erts-9.2/bin/epmd -daemon
rabbitmq 1684 0.0 0.1 4291312 62276 ? Sl 01:30 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 96 -P 1048576 (...)
rabbitmq 1834 0.0 0.0 4520 804 ? Ss 01:30 0:00 erl_child_setup 65536
rabbitmq 1913 0.0 0.0 8264 1128 ? Ss 01:30 0:00 inet_gethost 4
rabbitmq 1914 0.0 0.0 14616 1888 ? S 01:30 0:00 inet_gethost 4
root@bionic-lp-1784757:~# sudo time --format %E systemctl stop rabbitmq-server.service
0:00.88
```

I've also experimented with having epmd started or stopped, in various order with the start/stop of rabbitmq-server, but it always takes only a second or two to start or stop. Debian's changes suggest that the bug is some interrelationship with epmd, but it's unclear from the patch what exactly that is.

Are there any config settings, architecture specificness, or other considerations that need taken into account?

Changed in rabbitmq-server (Ubuntu Bionic):
status: Triaged → Incomplete
carl2187 (carl2187) wrote :

# I run into this problem reliably in Bionic following this process, (as of 7/19/2019, this process caused the bug to manifest)

# Clean install ubuntu server 18.04.2 (using traditional installer media, not live media)
# (installing into a gen2 Hyper-V VM running on Windows 10 1903)

# after install and first reboot, patch it up all the way:

sudo apt update
sudo apt full-upgrade

sudo shutdown -r now

sudo apt install -y apt-transport-https locales software-properties-common curl

# setup repo for nodejs version 8.x:
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -

sudo apt install -y fonts-noto nodejs git supervisor nginx ssl-cert postgresql adduser redis-server rabbitmq-server libstdc++6 libxml2 libboost-regex-dev zlib1g fonts-dejavu fonts-liberation ttf-mscorefonts-installer fonts-crosextra-carlito fonts-takao-gothic fonts-opensymbol libxss1 libcairo2 xvfb libxtst6 libgconf2-4 libasound2

sudo shutdown -r now

# on this reboot after Rabitmq-server has been installed (along with all that other junk) then the bug manifests reliably in bionic.

carl2187 (carl2187) wrote :

# my method to fix the problem in Bionic for now is simple, borrowing from the upstream Debian patch:

sudo nano /lib/systemd/system/rabbitmq-server.service

# make the [Unit] definition look like this:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target epmd@0.0.0.0.socket
Wants=network.target epmd@0.0.0.0.socket

# then reload and reboot, no more delay at Rabbitmq
sudo systemctl daemon-reload
sudo systemctl restart rabbitmq-server.service

sudo shutdown -r now

carl2187 (carl2187) wrote :

# forgot to include the shutdown changes, so here's a full procedure to fix this in Bionic:

sudo nano /lib/systemd/system/rabbitmq-server.service

# fixup the [Unit] entry to look like this:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target epmd@0.0.0.0.socket
Wants=network.target epmd@0.0.0.0.socket

# change just the ExecStop under [Service] to use "shutdown" instead of "stop":
ExecStop=/usr/sbin/rabbitmqctl shutdown

# reload and reboot:
sudo systemctl stop rabbitmq-server.service
sudo systemctl daemon-reload
sudo systemctl restart rabbitmq-server.service
sudo shutdown -r now

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.