[SRU] rabbitmq-server does not properly shutdown

Bug #1784757 reported by Ivan Kurnosov on 2018-08-01
44
This bug affects 6 people
Affects Status Importance Assigned to Milestone
rabbitmq-server (Debian)
Fix Released
Unknown
rabbitmq-server (Ubuntu)
Medium
Unassigned
Bionic
Medium
Sergio Durigan Junior
Cosmic
Medium
Unassigned

Bug Description

[Impact]

The systemd file rabbitmq-server.service on Bionic uses "Type=simple" when defining the service, but unfortunately this doesn't work very well for rabbitmq-server. In certain situations, systemd will fail to keep track of a start/stop/restart event, and will hang for 90 seconds before giving the prompt back to the user. Another problem is that rabbitmq-server must start after the epmd service, so we need to explicitly declare this dependency in the service file.

[Test Case]

Although I was able to reproduce this almost 100% of the time, there were rare occasions when the restart procedure finished normally. I was also only able to reproduce it using a bionic VM, not a container. If you have multipass or lxd configured to launch VMs, that should be easy.

The steps are:

$ lxc launch ubuntu-daily:bionic --vm bug1784757-rabbitmq-server # or use multipass
$ lxc shell bug1784757-rabbitmq-server
# apt update
# apt install rabbitmq-server -y
# systemctl restart rabbitmq-server.service

In a normal scenario, the restart should take around 3 seconds or less. With the bug, it takes around 90 seconds. If you can't reproduce it, try running "systemctl restart" again. A quick way to trigger it is to run a for loop like:

# for i in $(seq 10); do time systemctl restart rabbitmq-server.service ; done

[Regression Potential]

* Because rabbitmq-server implements systemd's "Type=notify" using socat to communicate with systemd-notify over a socket, we will be introducing another point of failure (socat) in the mix.

* So, albeit unlikely, there may be a problem when using socat. That would not be a regression, though, since the outcome would be the same as we have today: "systemctl restart" would not properly work, even though the service did restart.

[Original Report]

When I run `systemctl restart rabbitmq-server` it waits for 90 seconds then systemd sends SIGKILL to it.

Presumably the `epmd` process does not receive SIGTERM, since if I run `kill 1493` (or whatever pid it currently is) then restart happens straight after that successfully

● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: deactivating (final-sigterm) since Wed 2018-08-01 01:17:04 UTC; 7s ago
  Process: 1183 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 178 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS)
  Process: 177 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 177 (code=killed, signal=TERM)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/rabbitmq-server.service
           └─1493 /usr/lib/erlang/erts-9.2/bin/epmd -daemon

Aug 01 01:11:20 rmq-1 systemd[1]: rabbitmq-server.service: Failed to reset devices.list: Operation not permitted
Aug 01 01:11:20 rmq-1 systemd[1]: Starting RabbitMQ Messaging Server...
Aug 01 01:11:25 rmq-1 rabbitmq[178]: Waiting for 'rabbit@rmq-1'
Aug 01 01:11:25 rmq-1 rabbitmq[178]: pid is 204
Aug 01 01:11:30 rmq-1 systemd[1]: Started RabbitMQ Messaging Server.
Aug 01 01:17:04 rmq-1 systemd[1]: Stopping RabbitMQ Messaging Server...
Aug 01 01:17:06 rmq-1 rabbitmq[1183]: Stopping and halting node 'rabbit@rmq-1'

Related branches

Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please execute the following command only once, as it will automatically gather debugging information, in a terminal:
apport-collect 1784757

When reporting bugs in the future please use apport by using 'ubuntu-bug' and the name of the package affected. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

Changed in rabbitmq-server (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for rabbitmq-server (Ubuntu) because there has been no activity for 60 days.]

Changed in rabbitmq-server (Ubuntu):
status: Incomplete → Expired
Alex Tomkins (tomkins) wrote :

The rabbitmq-server-release GitHub repo has a systemd file which uses shutdown instead of stop:

https://github.com/rabbitmq/rabbitmq-server-release/blob/master/packaging/debs/Debian/debian/rabbitmq-server.service

Switching to this will shut down rabbitmq in the expected way, instead of having to wait ~90 seconds.

Changed in rabbitmq-server (Ubuntu):
status: Expired → Confirmed

The attachment "rabbitmq-stop-shutdown.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch

Thanks for the update Alex.
In general the package seems rather different than upstream in terms of systemd integration.
We will have to take a wider look at this than just this line.

tags: added: server-next
Changed in rabbitmq-server (Ubuntu):
importance: Undecided → Medium
carl2187 (carl2187) wrote :

I can confirm this issue on the latest versions of 18.04.2, somewhat nasty, really slows down reboots when RabbitMQ is installed. Here's reproduce steps to see the bug in action:

rabbitmq installed via "sudo apt install rabbitmq-server"

Stop the service: "sudo systemctl stop rabbitmq-server.service"

Then running "status", we can see the 90 second delay between the stop attempt at 10:31:06, and it giving up and killing the service at 10:32:36

$ sudo systemctl status rabbitmq-server.service
[sudo] password for nearlotech:
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Thu 2019-05-09 10:32:36 MDT; 1min 32s ago
  Process: 2409 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 899 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SUCCESS)
  Process: 898 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 898 (code=killed, signal=TERM)

May 09 10:30:43 ubuntu1804vm systemd[1]: Starting RabbitMQ Messaging Server...
May 09 10:30:46 ubuntu1804vm rabbitmq[899]: Waiting for 'rabbit@ubuntu1804vm'
May 09 10:30:46 ubuntu1804vm rabbitmq[899]: pid is 978
May 09 10:30:50 ubuntu1804vm systemd[1]: Started RabbitMQ Messaging Server.
May 09 10:31:05 ubuntu1804vm systemd[1]: Stopping RabbitMQ Messaging Server...
May 09 10:31:06 ubuntu1804vm rabbitmq[2409]: Stopping and halting node 'rabbit@ubuntu1804vm'
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: State 'stop-final-sigterm' timed out. Killing.
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: Killing process 2706 (epmd) with signal SIGKILL.
May 09 10:32:36 ubuntu1804vm systemd[1]: rabbitmq-server.service: Failed with result 'timeout'.
May 09 10:32:36 ubuntu1804vm systemd[1]: Stopped RabbitMQ Messaging Server.

Hmm, the same took 1 second for me.

root@b:~# systemctl status rabbitmq-server
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Fri 2019-05-10 07:16:00 UTC; 5s ago
  Process: 6004 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
  Process: 5348 ExecStart=/usr/sbin/rabbitmq-server (code=killed, signal=TERM)
 Main PID: 5348 (code=killed, signal=TERM)

May 10 07:15:40 b systemd[1]: Starting RabbitMQ Messaging Server...
May 10 07:15:41 b rabbitmq[5349]: Waiting for rabbit@b
May 10 07:15:41 b rabbitmq[5349]: pid is 5365
May 10 07:15:43 b systemd[1]: Started RabbitMQ Messaging Server.
May 10 07:15:43 b systemd[1]: rabbitmq-server.service: Failed to reset devices.list: Operation not permitted
May 10 07:15:59 b systemd[1]: Stopping RabbitMQ Messaging Server...
May 10 07:15:59 b rabbitmq[6004]: Stopping and halting node rabbit@b
May 10 07:16:00 b systemd[1]: Stopped RabbitMQ Messaging Server.

Do you have any config changes or ongoing messages running while this shutdown happens that might explain and lead us to the root cause?

The stop action is just:
  $ /usr/sbin/rabbitmqctl stop

You might change the loglevel [1] and then in a second console attach to the log to see what is going on in that time?

[1]: https://www.rabbitmq.com/logging.html

Changed in rabbitmq-server (Ubuntu):
status: Confirmed → Incomplete
Bryce Harrington (bryce) wrote :

Like Christian, I also am seeing only a 1 second stop. I tried a few different things but couldn't reproduce the delay.

I poked through /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/sbin/rabbitmq-server, but it's unclear what the technical difference is between 'stop' and 'shutdown'. Some comments in the code suggest that under some circumstances a wait might be triggered - if you instrument that code perhaps you can identify if it is hitting that wait, and if so why.

Without knowing more, just switching to using 'shutdown' feels like a kludge/workaround, and probably not what we should deploy as a fix to this. Instead, some additional investigation needs done to figure out why it is delaying, or at least more specific reproduction steps so others can do that sleuthing.

So, as I don't think there's anything more we can do on this end, I'm going to drop the server-next tag for the time being. Feel free to re-add it if you think there's enough info for us to take another stab at it though.

tags: removed: patch server-next
Ivan Kurnosov (zerkms) wrote :

I'm not sure why I did not mention originally, sorry for that, but there is a chance that what makes it reproducible is starting the daemon inside the lxd/lxc

For those who don't use lxd - the commands would be (I don't have access to linuxes or my infrastructure at the moment, so might be a bit incorrect):

lxc launch ubuntu:bionic
lxc exec <container name> bash

inside container:
apt update
apt install rabbitmq-server
systemct restart rabbitmq-server

Alex Tomkins (tomkins) wrote :
Download full text (3.8 KiB)

Fixes in Debian packaging which also address the problem:

- https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/commit/9764f7e6ffeaa99325b8e5ff41e80b1c4b2894d3
- https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/commit/c5b99f5c4811f460b73c8c8505a98fa55dcd752b

If I start the server with just `service rabbitmq-server start`, my process tree looks like:

rabbitmq 11316 0.0 0.1 4628 800 ? Ss 14:59 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 11337 0.0 0.3 4628 1792 ? S 14:59 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 11633 8.8 12.4 2165876 61504 ? Sl 14:59 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/ebin -noshell -noinput -s rabbit boot -sname rabbit@golestandt -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@golestandt-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@golestandt" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672
rabbitmq 11745 0.1 0.1 4520 728 ? Ss 14:59 0:00 erl_child_setup 65536
rabbitmq 11809 0.0 0.2 8264 1072 ? Ss 14:59 0:00 inet_gethost 4
rabbitmq 11810 0.0 0.3 10384 1660 ? S 14:59 0:00 inet_gethost 4
rabbitmq 11504 0.0 0.0 26852 204 ? S 14:59 0:00 /usr/lib/erlang/erts-9.2/bin/epmd -daemon

However, doing `service epmd start` before that gives:

epmd 12226 0.0 0.3 26852 1728 ? Ss 15:01 0:00 /usr/bin/epmd -systemd
rabbitmq 12263 0.0 0.1 4628 816 ? Ss 15:01 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 12287 0.0 0.3 4628 1736 ? S 15:01 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 12582 25.5 13.2 2171584 65160 ? Sl 15:01 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db -zdbbl 32000 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/ebin -noshell -noinput -s rabbit boot -sname rabbit@golestandt -boot start_sasl -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rab...

Read more...

Bryce Harrington (bryce) wrote :

Alex, thanks for looking up the Debian upstream situation, this is looking a lot more actionable so am re-adding it to server-next.

Sounds like if we just picked the two changes to the service file from the debian commits, those would solve the problem? If so, that should be straightforward to SRU.

tags: added: server-next
Changed in rabbitmq-server (Ubuntu):
status: Incomplete → Triaged
Bryce Harrington (bryce) on 2019-06-09
description: updated
Changed in rabbitmq-server (Ubuntu Bionic):
status: New → Triaged
Changed in rabbitmq-server (Ubuntu Cosmic):
status: New → Triaged
importance: Undecided → Medium
Changed in rabbitmq-server (Ubuntu Bionic):
importance: Undecided → Medium
Changed in rabbitmq-server (Ubuntu):
status: Triaged → Fix Released
summary: - rabbitmq-server does not properly shutdown
+ [SRU] rabbitmq-server does not properly shutdown
Changed in rabbitmq-server (Debian):
status: Unknown → Fix Released
Bryce Harrington (bryce) wrote :

I'm still not able to reproduce the error.

If you can provide an exact set of steps to run in an lxc container it'd help a lot. Here's what I've got so far:

```
$ lxc launch ubuntu:bionic bionic-lp-1784757
Creating bionic-lp-1784757
Starting bionic-lp-1784757
trent:~/ubuntu/RabbitmqServer/sru.1784757$ lxc exec bionic-lp-1784757 -- bash
root@bionic-lp-1784757:~# apt update
Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
(...)
root@bionic-lp-1784757:~# ps aux | grep ^rabbit
rabbitmq 1390 0.0 0.0 4628 868 ? Ss 01:30 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 1403 0.0 0.0 4628 1844 ? S 01:30 0:00 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 1553 0.0 0.0 26852 204 ? S 01:30 0:00 /usr/lib/erlang/erts-9.2/bin/epmd -daemon
rabbitmq 1684 0.0 0.1 4291312 62276 ? Sl 01:30 0:02 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 96 -P 1048576 (...)
rabbitmq 1834 0.0 0.0 4520 804 ? Ss 01:30 0:00 erl_child_setup 65536
rabbitmq 1913 0.0 0.0 8264 1128 ? Ss 01:30 0:00 inet_gethost 4
rabbitmq 1914 0.0 0.0 14616 1888 ? S 01:30 0:00 inet_gethost 4
root@bionic-lp-1784757:~# sudo time --format %E systemctl stop rabbitmq-server.service
0:00.88
```

I've also experimented with having epmd started or stopped, in various order with the start/stop of rabbitmq-server, but it always takes only a second or two to start or stop. Debian's changes suggest that the bug is some interrelationship with epmd, but it's unclear from the patch what exactly that is.

Are there any config settings, architecture specificness, or other considerations that need taken into account?

Changed in rabbitmq-server (Ubuntu Bionic):
status: Triaged → Incomplete
carl2187 (carl2187) wrote :

# I run into this problem reliably in Bionic following this process, (as of 7/19/2019, this process caused the bug to manifest)

# Clean install ubuntu server 18.04.2 (using traditional installer media, not live media)
# (installing into a gen2 Hyper-V VM running on Windows 10 1903)

# after install and first reboot, patch it up all the way:

sudo apt update
sudo apt full-upgrade

sudo shutdown -r now

sudo apt install -y apt-transport-https locales software-properties-common curl

# setup repo for nodejs version 8.x:
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -

sudo apt install -y fonts-noto nodejs git supervisor nginx ssl-cert postgresql adduser redis-server rabbitmq-server libstdc++6 libxml2 libboost-regex-dev zlib1g fonts-dejavu fonts-liberation ttf-mscorefonts-installer fonts-crosextra-carlito fonts-takao-gothic fonts-opensymbol libxss1 libcairo2 xvfb libxtst6 libgconf2-4 libasound2

sudo shutdown -r now

# on this reboot after Rabitmq-server has been installed (along with all that other junk) then the bug manifests reliably in bionic.

carl2187 (carl2187) wrote :

# my method to fix the problem in Bionic for now is simple, borrowing from the upstream Debian patch:

sudo nano /lib/systemd/system/rabbitmq-server.service

# make the [Unit] definition look like this:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target epmd@0.0.0.0.socket
Wants=network.target epmd@0.0.0.0.socket

# then reload and reboot, no more delay at Rabbitmq
sudo systemctl daemon-reload
sudo systemctl restart rabbitmq-server.service

sudo shutdown -r now

carl2187 (carl2187) wrote :

# forgot to include the shutdown changes, so here's a full procedure to fix this in Bionic:

sudo nano /lib/systemd/system/rabbitmq-server.service

# fixup the [Unit] entry to look like this:
[Unit]
Description=RabbitMQ Messaging Server
After=network.target epmd@0.0.0.0.socket
Wants=network.target epmd@0.0.0.0.socket

# change just the ExecStop under [Service] to use "shutdown" instead of "stop":
ExecStop=/usr/sbin/rabbitmqctl shutdown

# reload and reboot:
sudo systemctl stop rabbitmq-server.service
sudo systemctl daemon-reload
sudo systemctl restart rabbitmq-server.service
sudo shutdown -r now

Changed in rabbitmq-server (Ubuntu Cosmic):
status: Triaged → Won't Fix

I'm looking at this bug now.

I was able to reproduce this bug when using a bionic VM. I didn't have to go through all the steps from comment #13, but I also noticed that the bug doesn't manifest all the time; sometimes, a "systemct restart rabbitmq-server" will take just 1 or 2 seconds. However, I was able to reproduce the issue in one out of every two attempts. I'm preparing the fix now.

Changed in rabbitmq-server (Ubuntu Bionic):
status: Incomplete → Triaged
Dan Streetman (ddstreet) wrote :

adding @nicolasbock and @mruffell for their awareness, as rabbitmq-server is in x/b/f-proposed for bug 1874075 currently; hopefully that bug can be verified this week so it can be released at the 7-day aging period next monday, to clear the way for this one (normally i'd suggest just re-uploading with the fixes for this to release them together, but bug 1874075 has been going for a while so it should probably go out asap)

On Tuesday, July 07 2020, Dan Streetman wrote:

> adding @nicolasbock and @mruffell for their awareness, as rabbitmq-
> server is in x/b/f-proposed for bug 1874075 currently; hopefully that
> bug can be verified this week so it can be released at the 7-day aging
> period next monday, to clear the way for this one (normally i'd suggest
> just re-uploading with the fixes for this to release them together, but
> bug 1874075 has been going for a while so it should probably go out
> asap)

Thanks, Dan.

I have the fix ready for this bug, by the way. If you're interested,
you can take a look at it here:

  https://code.launchpad.net/~sergiodj/ubuntu/+source/rabbitmq-server/+git/rabbitmq-server/+ref/bug1784757-fix-server-restart-hang

I'll wait until the package enters -update before I resume the SRU
process.

Cheers,

--
Sergio
GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14

Changed in rabbitmq-server (Ubuntu Bionic):
assignee: nobody → Sergio Durigan Junior (sergiodj)
description: updated

Hello Ivan, or anyone else affected,

Accepted rabbitmq-server into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/rabbitmq-server/3.6.10-1ubuntu0.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in rabbitmq-server (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.