Rabbitmq-server times out during server startup

Bug #1783325 reported by Th
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
rabbitmq-server (Ubuntu)
Invalid
Low
Unassigned

Bug Description

Ubuntu version
Description: Ubuntu 16.04.5 LTS
Release: 16.04

Package version
rabbitmq-server 3.5.7-1ubuntu0.16.04.2

What happens when server boots and Rabbitmq-server starts up:

Jul 19 09:19:33 hostname systemd[1]: Started RabbitMQ Messaging Server.
...
Jul 19 09:20:23 hostname rabbitmq[1002]: Waiting for 'rabbit@hostname' ...
Jul 19 09:20:23 hostname rabbitmq[1002]: pid is 1069 ...
...
Jul 19 09:21:03 hostname systemd[1]: rabbitmq-server.service: Start-post operation timed out. Stopping.
Jul 19 09:21:03 hostname systemd[1]: Failed to start RabbitMQ Messaging Server.
Jul 19 09:21:03 hostname systemd[1]: rabbitmq-server.service: Unit entered failed state.
Jul 19 09:21:03 hostname systemd[1]: rabbitmq-server.service: Failed with result 'timeout'.

At the same time this gets logged in /var/log/rabbitmq:
Server startup complete; 6 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * webmachine
 * rabbitmq_management_agent
 * mochiweb
 * amqp_client

No errors nor warnings get logged in /var/log/rabbitmq.

What I excepected to happen:
Rabbitmq-server should start without timing out.

Additional details:

I have seen this issue on multiple Ubuntu 16.04 servers that are running latest Rabbitmq server installed from Ubuntu repository. The timeout happens randomly, sometimes Rabbitmq server does not time out after reboot. The Rabbitmq server is standalone, not part of cluster.

As a workaround I have commented out the ExecStartPost line from Rabbitmq server Systemd unit file (/lib/systemd/system/rabbitmq-server.service). This has at least so far prevented the timeouts and I haven't yet seen any problems caused by this change.

Tags: xenial
tags: added: xenial
Revision history for this message
Th (timo-henriksson) wrote :

I found one common thing for hosts have had this problem: They are all vms that run on older and slower hardware. Vms on newer hardware with similar software and configuration have not had this problem.

Revision history for this message
James Page (james-page) wrote :

rabbitmq is also very sensitive to any hostname <-> IP mismatch as it uses this to communicate with the RMQ daemon.

'rabbit@hostname' looks suspicious here - I'd suggest that maybe something is a bit wonky.

If a "sudo su -" results a warning about hostnames this is def the issue.

Changed in rabbitmq-server (Ubuntu):
status: New → Incomplete
importance: Undecided → Low
Revision history for this message
James Page (james-page) wrote :

Please check that forward/reverse hostname to IP is working OK on the impacted systems; if it is then set the bug back to New with an appropriate comment, if not then we can close out this bug report IMHO.

Revision history for this message
Th (timo-henriksson) wrote :

The "@hostname" is placeholder for real hostname which I did not want to publish. Sorry for confusion.

"Sudo su -" and forward/reverse hostname to ip lookups work without problem or warnings.

Revision history for this message
Th (timo-henriksson) wrote :

Provided required information.

Changed in rabbitmq-server (Ubuntu):
status: Incomplete → New
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

After the reboot, whether it was one where rabbit failed to start or not, does a restart of the rabbit service always work?

Revision history for this message
Th (timo-henriksson) wrote :

Manual restarts have worked without problems.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Maybe post the output of these commands when run on a system that was rebooted and where rabbit failed to start:

sudo systemd-analyze blame

and

sudo systemd-analyze critical-chain

Maybe the resolver or network wasn't ready when rabbit tried to start.

Revision history for this message
Th (timo-henriksson) wrote :
Download full text (3.3 KiB)

Blame output:
2min 42.650s solr.service
    1min 30.057s rabbitmq-server.service
    1min 25.356s shibd.service
    1min 18.921s kong.service
    1min 13.311s postgresql@9.5-main.service
         47.776s apache2.service
         22.732s cloud-init-local.service
         22.192s snapd.service
         21.243s lwsmd.service
         13.928s accounts-daemon.service
         12.658s lxd-containers.service
         10.374s mdadm.service
         10.362s grub-common.service
         10.143s iscsid.service
          9.820s dnsmasq.service
          8.827s systemd-logind.service
          8.790s cloud-config.service
          8.669s rsyslog.service
          8.541s rc-local.service
          8.515s tomcat8.service
          8.327s systemd-journald.service
          8.326s dev-vda1.device
          7.493s ssh.service
          6.214s apparmor.service
          5.841s apt-daily.service
          4.500s systemd-tmpfiles-setup.service
          3.542s polkitd.service
          2.698s user@998.service
          2.618s snapd.seeded.service
          2.490s open-iscsi.service
          2.378s irqbalance.service
          2.332s console-setup.service
          2.157s lvm2-monitor.service
          2.130s systemd-journal-flush.service
          2.067s systemd-modules-load.service
          2.060s cloud-init.service
          1.978s systemd-tmpfiles-setup-dev.service
          1.630s cloud-final.service
          1.439s apport.service
          1.380s systemd-user-sessions.service
          1.353s postgresql.service
          1.350s networking.service
          1.191s systemd-udevd.service
          1.142s keyboard-setup.service
          1.062s resolvconf.service
           982ms setvtrgb.service
           747ms sysstat.service
           652ms systemd-remount-fs.service
           510ms systemd-sysctl.service
           435ms kmod-static-nodes.service
           429ms ufw.service
           383ms systemd-udev-trigger.service
           186ms user@919603791.service
           115ms systemd-random-seed.service
            93ms systemd-update-utmp.service
            91ms systemd-timesyncd.service
            79ms ondemand.service
            77ms snapd.socket
            69ms plymouth-read-write.service
            53ms lxd.socket
            49ms dev-mqueue.mount
            43ms dev-hugepages.mount
            36ms plymouth-quit.service
            35ms sys-kernel-debug.mount
            23ms systemd-update-utmp-runlevel.service
            22ms sys-fs-fuse-connections.mount
            15ms plymouth-quit-wait.service

Critical-chain output:
graphical.target @3min 31.194s
└─multi-user.target @3min 31.193s
  └─solr.service @48.542s +2min 42.650s
    └─remote-fs.target @48.534s
      └─remote-fs-pre.target @48.530s
        └─open-iscsi.service @46.036s +2.490s
          └─iscsid.service @35.889s +10.143s
            └─network-online.target @35.886s
              └─cloud-init.service @33.748s +2.060s
                └─networking.service @32.395s +1.350s
                  └─network-pre.target @32.375s
                    └─cloud-init-local.service @9.641s +22.732s
                      └─systemd-remount-fs.service @5.950s +652ms
                        └─system...

Read more...

Revision history for this message
Th (timo-henriksson) wrote :

Reproducing the problem to get the above outputs gave me better understading on conditions that lead to this problem.

If there is high load on virtualisation hosts when vms containing Rabbit server get rebooted the timeout is likely to happen. If reboots happen when virtualisation hosts have low load Rabbit starts without problems.

Maybe vm is too slow during startup for Rabbit to be ready before Systemd times it out.
If that is the case maybe this is not Rabbit bug after all?

Revision history for this message
James Page (james-page) wrote :

@Th - I think you're probably correct in that this is a load issue, rather than a specific rabbitmq software issue. It looks like the system is so loaded you're hitting some sort of 90 second timeout at which point systemd just gives up:

 /etc/systemd/system.conf: DefaultTimeoutStartSec=90s

you could try increasing this to see if that gives rmq enough time to kick into life.

Changed in rabbitmq-server (Ubuntu):
status: New → Invalid
Revision history for this message
Noel McLoughlin (noelmcloughlin) wrote :

The following 'worked for me' on CentOS verion 8.

$ cat /etc/systemd/system/rabbitmq-server/limits.conf
[Service]
LimitNOFILE=65536

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.