Comment 67 for bug 1874075

Matthew Ruffell (mruffell) wrote :

I agree that the fixes in -proposed for Focal and Groovy are good as they are, since they work nicely with Type=notify on those systems.

I spent some time testing on Bionic. I can confirm that ExecStartPost exists for the very reason Nicolas describes: RabbitMQ returns 0 regardless of whether the server started correctly.

When I removed ExecStartPost, and set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and reproduced, RabbitMQ went through 10x 30000ms timeouts and then exited 0. systemd treated this as a clean, expected stop, so Restart=on-failure did not trigger and the service was stopped with ExitSuccess. The result is that the service dies after 5 minutes. It's an improvement over 90 seconds, but not quite the 10 minute gold standard.
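
To make the 5 minute figure concrete, here is the arithmetic as a small shell sketch. The variable names are mine, and the values are the 10 retries of 30000ms each observed above, which I am assuming correspond to mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout.

```shell
# Values assumed from the behaviour observed above: 10 retries of 30000 ms.
retry_limit=10
retry_timeout_ms=30000

# Total time RabbitMQ waits for the cluster before giving up.
total_s=$(( retry_limit * retry_timeout_ms / 1000 ))
total_min=$(( total_s / 60 ))

echo "RabbitMQ gives up after ${total_s}s (${total_min} min)"
```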

I then added ExecStartPost back in, along with the modification Nicolas put forward for the wrapper script, /usr/lib/rabbitmq/bin/rabbitmq-server-wait:

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE

I then went and reproduced. systemd now doesn't treat the service as started until it actually joins the cluster; instead, it stays in the activating state while it waits for the cluster leader to turn up.
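
As a rough sketch of why this works: the wrapper's exit status is whatever rabbitmqctl wait returns, so a timeout becomes a failed ExecStartPost, which systemd counts as a failed start. The function name wait_for_rabbit and the RABBITMQCTL override variable are hypothetical, added only for illustration; the packaged wrapper script does more than this.

```shell
# Hypothetical override so the command can be swapped out when experimenting.
RABBITMQCTL="${RABBITMQCTL:-/usr/lib/rabbitmq/bin/rabbitmqctl}"

# Hypothetical helper: block until the node behind the given PID file is up,
# or fail once the 600 second timeout passes.
wait_for_rabbit() {
    # The function returns rabbitmqctl's exit status unchanged, so a
    # timeout or error propagates to the caller (here, ExecStartPost).
    "$RABBITMQCTL" wait -t 600 "$1"
}
```

The wrapper would then end with something like wait_for_rabbit "$RABBITMQ_PID_FILE" as its last command.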

I left the VM in this activating state for quite some time. Each time the 10x 30000ms timeouts finish, the wrapper script exits with failure, systemd restarts the service, and we go back to another 10x 30000ms cycle. In my testing it did not stop after 2 rounds; it keeps going indefinitely, likely due to Restart=on-failure and RestartSec=10 being set.

This works wonderfully, and fixes the problem. I'm sorry I ever doubted your solution, Nicolas.

While I do think the better solution is to set Type=notify, on Bionic it would require the socat dependency to actually send the notification to systemd. As Dan mentioned before, that is unacceptable for an SRU, and the apt upgrade reasoning makes sense. We can enjoy perfect systemd-controlled resilience on Focal onwards.

On Bionic, I think staying with Type=simple is fine, and we just need to set

TimeoutStartSec=600
Restart=on-failure
RestartSec=10

and make the following change to /usr/lib/rabbitmq/bin/rabbitmq-server-wait:

/usr/lib/rabbitmq/bin/rabbitmqctl wait -t 600 $RABBITMQ_PID_FILE
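
For anyone who wants to try the systemd side before a test package exists, the three settings could probably be applied as a drop-in override rather than by editing the packaged unit. This is only a sketch: the drop-in file name and the DROPIN_DIR variable are my own, and the unit name rabbitmq-server is assumed. It writes to a temporary directory unless DROPIN_DIR is set.

```shell
# Hypothetical: on a real system DROPIN_DIR would be
# /etc/systemd/system/rabbitmq-server.service.d (unit name assumed).
# Defaults to a scratch directory so nothing on the system is touched.
dropin_dir="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p "$dropin_dir"

# The three settings proposed above, as a [Service] override.
cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
TimeoutStartSec=600
Restart=on-failure
RestartSec=10
EOF

echo "wrote $dropin_dir/override.conf"
# On a real system, follow with:
#   systemctl daemon-reload && systemctl restart rabbitmq-server
```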

I don't think we need to change the default mnesia_table_loading_retry_limit or mnesia_table_loading_retry_timeout values. With Restart=on-failure set in systemd, once the wrapper script dies at the 5 minute timeout, the service is simply restarted and we begin anew.

I'll make a test package now, and see how it goes.