Comment 68 for bug 1874075

Matthew Ruffell (mruffell) wrote :

Okay, I made test packages for Bionic and Xenial based on the above.

The PPA is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test

It contains (based off of -updates):
Xenial:
rabbitmq-server 3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic:
rabbitmq-server 3.6.10-1ubuntu0.1+lp1874075v20200629b1

Debdiffs for the above builds are:
Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/

On Bionic:
When you install the test packages on both nodes, reboot them, and then attempt to reproduce, the node which is attempting to rejoin the cluster stays in the systemd activating state, and the wrapper script terminates after 5 minutes (300 seconds), i.e. 10x 30000ms timeouts. When the wrapper script terminates, it exits with an error exit code, and systemd restarts the service. This repeats until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

On Xenial:
When you install the test packages on both nodes, reboot them, and then attempt to reproduce, the node which is attempting to rejoin the cluster stays in the systemd activating state, and the wrapper script terminates after 60 seconds. This is much shorter than on Bionic. When the wrapper script terminates, it exits with an error exit code, and systemd restarts the service. This repeats until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

It seems the timeouts are governed by the mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout values, and the -t 600 that we pass into 'rabbitmqctl wait' is ignored. Nicolas, it seems you are right: if we didn't want our services to restart every 60 (Xenial) or 300 (Bionic) seconds, we would need to adjust these timeouts. The problem is, we would have to introduce new configuration files to do this, which is normally frowned upon when doing an SRU.
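
To make the arithmetic behind the Bionic figure concrete, here is a rough sketch. The retry limit of 10 and the per-retry timeout of 30000 ms are the values implied by the observed 300 second delay; they are assumptions and were not read from a running node. The function name is hypothetical. The Xenial values are not stated here, so they are left out.

```python
# Sketch only: worst-case time spent waiting for Mnesia tables before
# startup gives up, assuming every retry runs to its full timeout.

def table_loading_wait_seconds(retry_limit: int, retry_timeout_ms: int) -> float:
    """Total wait in seconds if each of retry_limit attempts times out."""
    return retry_limit * retry_timeout_ms / 1000


# Assumed Bionic values: 10 retries of 30000 ms each.
bionic_wait = table_loading_wait_seconds(retry_limit=10, retry_timeout_ms=30000)
print(f"Bionic: startup gives up after {bionic_wait:.0f} s")

# With RestartSec=10, systemd waits 10 s before starting the service again,
# so one fail-and-restart cycle takes roughly the wait plus that delay.
restart_sec = 10
print(f"Bionic: one restart cycle is about {bionic_wait + restart_sec:.0f} s")
```

Under these assumptions, lowering either value shortens each cycle, and raising them lengthens it, which is the trade-off behind the question of whether to ship extra config.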

Now that we have Restart=on-failure and RestartSec=10, should I add config to change mnesia_table_loading_retry_timeout? To be honest, I am happy with leaving the timeouts as they are and just relying on Restart=on-failure to do its job. @ddstreet, do you have any strong opinions? Is it unacceptable for a service to restart every 60 seconds until the node can rejoin the cluster?

Nicolas, can you install and test these packages and double check that you also see what I see? If everything is good, you can submit new debdiffs for Xenial and Bionic based on mine, and we can get some new builds into -proposed.

Nicolas, I think you were more or less right all along, and all you were missing was Restart=on-failure and RestartSec=10 in the service file.