Okay, I made test packages for Bionic and Xenial based on the above:
The PPA is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test
It contains (based off of -updates):
Xenial:
rabbitmq-server 3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic:
rabbitmq-server 3.6.10-1ubuntu0.1+lp1874075v20200629b1
Debdiffs for the above builds are:
Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/
On Bionic:
When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node that is attempting to rejoin the cluster stays in the systemd activating state, and the wrapper script terminates after 5 minutes (300 seconds), i.e. 10 x 30000 ms timeouts. When the wrapper script terminates, it exits with an error code, and systemd restarts the service. This repeats until the node joins the cluster, at which point the systemd status turns active. Problem is fixed.
On Xenial:
When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node that is attempting to rejoin the cluster stays in the systemd activating state, and the wrapper script terminates after 60 seconds. This is much shorter than on Bionic. When the wrapper script terminates, it exits with an error code, and systemd restarts the service. This repeats until the node joins the cluster, at which point the systemd status turns active. Problem is fixed.
The timeouts seem to be governed by the mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout values, ignoring the -t 600 that we pass into 'rabbitmqctl wait'. Nicolas, it seems you are right: if we didn't want our services to restart every 60 (Xenial) or 300 (Bionic) seconds, we would need to adjust these timeouts. The problem is that we would have to introduce new configuration files to do this, which is normally frowned upon when doing an SRU.
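To make the timing concrete, here is a rough sketch of the arithmetic in Python. It assumes a retry limit of 10 and a retry timeout of 30000 ms on Bionic, which is what the observed 300 seconds implies, and it takes the 60 seconds on Xenial as an observed figure only. The function names are made up for illustration, and the cycle length ignores the time the service itself takes to start.

```python
# Observed wrapper-script timeouts from the tests above, in seconds.
OBSERVED_TIMEOUT_S = {"xenial": 60, "bionic": 300}

# RestartSec=10 from the service file change discussed in this bug.
RESTART_SEC = 10


def retry_window_s(retry_limit: int, retry_timeout_ms: int) -> float:
    """Total time spent waiting for Mnesia tables before giving up."""
    return retry_limit * retry_timeout_ms / 1000


def restart_cycle_s(timeout_s: float, restart_sec: float = RESTART_SEC) -> float:
    """Approximate seconds between consecutive start attempts."""
    return timeout_s + restart_sec


# Assumed Bionic values: 10 retries of 30000 ms each give the observed 300 s.
assert retry_window_s(10, 30000) == OBSERVED_TIMEOUT_S["bionic"]

for release, timeout_s in OBSERVED_TIMEOUT_S.items():
    cycle = restart_cycle_s(timeout_s)
    print(f"{release}: one start attempt roughly every {cycle:.0f} s")
```

So with the settings as they are, a node that cannot rejoin would be restarted roughly every 70 seconds on Xenial and every 310 seconds on Bionic.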
Now that we have Restart=on-failure and RestartSec=10, should I add config to change mnesia_table_loading_retry_timeout? To be honest, I am happy with leaving the timeouts as they are and just relying on Restart=on-failure to do its job. @ddstreet, do you have any strong opinions? Is a service restarting every 60 seconds until the node can rejoin the cluster unacceptable?
Nicolas, can you install and test these packages and double check that you also see what I see? If everything is good, you can submit new debdiffs for Xenial and Bionic based on mine, and we can get some new builds into -proposed.
Nicolas, I think you were more or less right all along, and all you were missing was Restart=on-failure and RestartSec=10 in the service file.
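For anyone who wants to try the two settings without the test packages, a minimal sketch that renders them as a systemd [Service] fragment follows. The test packages change the packaged service file itself; using the output as a drop-in override is my assumption here, and render_service_fragment is a hypothetical helper name.

```python
import configparser
import io


def render_service_fragment(restart: str = "on-failure", restart_sec: int = 10) -> str:
    """Render a [Service] section containing only the two restart settings."""
    parser = configparser.ConfigParser()
    # systemd key names are case sensitive, so stop configparser lowercasing them.
    parser.optionxform = str
    parser["Service"] = {"Restart": restart, "RestartSec": str(restart_sec)}
    buf = io.StringIO()
    # Write "Key=value" without spaces around the delimiter.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


print(render_service_fragment())
```

Where the fragment should live is left open on purpose; the debdiffs linked above are the authoritative version of the change.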