mojo upgrade stable_to_next_ha xenial queens failed with rabbitmq app startup timing out

Bug #1824842 reported by Alex Kavanagh
This bug affects 1 person
Affects: OpenStack RabbitMQ Server Charm
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

Running the stable_to_next_ha spec in mojo for xenial/queens, the rabbitmq-server application failed during the upgrade:

juju status rabbitmq-server
Model Controller Cloud/Region Version SLA Timestamp
mojo serverstack serverstack/serverstack 2.5.4 unsupported 15:34:20Z

App Version Status Scale Charm Store Rev OS Notes
rabbitmq-server 3.6.10 error 3 rabbitmq-server local 151 ubuntu

Unit Workload Agent Machine Public address Ports Message
rabbitmq-server/0* error idle 37 10.5.0.51 5672/tcp hook failed: "config-changed"
rabbitmq-server/1 active idle 38 10.5.0.44 5672/tcp Unit is ready and clustered
rabbitmq-server/2 active idle 39 10.5.0.80 5672/tcp Unit is ready and clustered

Machine State DNS Inst id Series AZ Message
37 started 10.5.0.51 49c1ca9a-d4a2-4f85-838b-1dba8ff5fd66 xenial nova ACTIVE
38 started 10.5.0.44 3a5260d1-9db0-47ab-aba3-37c73da8f1b4 xenial nova ACTIVE
39 started 10.5.0.80 c0aa0ae0-99b8-41f4-bb55-2add3fd39f6e xenial nova ACTIVE

Excerpt from end of juju log for rabbitmq-server/0 unit:

2019-04-15 15:23:08 INFO juju-log Cookie already synchronized with peer.
2019-04-15 15:23:08 WARNING juju-log min-cluster-size is not defined, race conditions may occur if this is not a single unit deployment.
2019-04-15 15:23:08 INFO juju-log Clustering with new node
2019-04-15 15:23:08 INFO juju-log Host already clustered with rabbit@juju-d3ae8a-mojo-37.
2019-04-15 15:23:08 DEBUG juju-log am I clustered?: True
2019-04-15 15:23:10 DEBUG juju-log Running ['/usr/sbin/rabbitmqctl', 'set_policy', 'TTL', '^(versioned_)?notifications.*', '{"message-ttl":3600000}', '--priority', '1', '--apply-to', 'queues', '-p', 'openstack']
2019-04-15 15:23:11 DEBUG config-changed Setting policy "TTL" for pattern "^(versioned_)?notifications.*" to "{\"message-ttl\":3600000}" with priority "1"
2019-04-15 15:23:12 DEBUG juju-log Waiting 0 seconds for operation ...
2019-04-15 15:23:55 DEBUG config-changed Job for rabbitmq-server.service failed because the control process exited with error code. See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
2019-04-15 15:23:55 DEBUG juju-log Waiting for rabbitmq app to start: /<email address hidden>
2019-04-15 15:23:56 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']
2019-04-15 15:23:57 DEBUG config-changed Waiting for 'rabbit@juju-d3ae8a-mojo-37'
2019-04-15 15:23:57 DEBUG config-changed pid is 961
2019-04-15 15:23:57 DEBUG config-changed Error: process_not_running
2019-04-15 15:23:59 DEBUG config-changed Error: unable to connect to node 'rabbit@juju-d3ae8a-mojo-37': nodedown
2019-04-15 15:23:59 DEBUG config-changed
2019-04-15 15:23:59 DEBUG config-changed DIAGNOSTICS
2019-04-15 15:23:59 DEBUG config-changed ===========
2019-04-15 15:23:59 DEBUG config-changed
2019-04-15 15:23:59 DEBUG config-changed attempted to contact: ['rabbit@juju-d3ae8a-mojo-37']
2019-04-15 15:23:59 DEBUG config-changed
2019-04-15 15:23:59 DEBUG config-changed rabbit@juju-d3ae8a-mojo-37:
2019-04-15 15:23:59 DEBUG config-changed * connected to epmd (port 4369) on juju-d3ae8a-mojo-37
2019-04-15 15:23:59 DEBUG config-changed * epmd reports: node 'rabbit' not running at all
2019-04-15 15:23:59 DEBUG config-changed no other nodes on juju-d3ae8a-mojo-37
2019-04-15 15:23:59 DEBUG config-changed * suggestion: start the node
2019-04-15 15:23:59 DEBUG config-changed
2019-04-15 15:23:59 DEBUG config-changed current node details:
2019-04-15 15:23:59 DEBUG config-changed - node name: 'rabbitmq-cli-79@juju-d3ae8a-mojo-37'
2019-04-15 15:23:59 DEBUG config-changed - home dir: /var/lib/rabbitmq
2019-04-15 15:23:59 DEBUG config-changed - cookie hash: qG36QeKV+2C7hymnHZ1e1A==
2019-04-15 15:23:59 DEBUG config-changed
2019-04-15 15:23:59 DEBUG config-changed Traceback (most recent call last):
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/config-changed", line 972, in <module>
2019-04-15 15:23:59 DEBUG config-changed hooks.execute(sys.argv)
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/charmhelpers/core/hookenv.py", line 914, in execute
2019-04-15 15:23:59 DEBUG config-changed self._hooks[hook_name]()
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 895, in wrapped_f
2019-04-15 15:23:59 DEBUG config-changed wait_app()
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 434, in wait_app
2019-04-15 15:23:59 DEBUG config-changed raise ex
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 424, in wait_app
2019-04-15 15:23:59 DEBUG config-changed rabbitmqctl('wait', pid_file)
2019-04-15 15:23:59 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 376, in rabbitmqctl
2019-04-15 15:23:59 DEBUG config-changed subprocess.check_call(cmd)
2019-04-15 15:23:59 DEBUG config-changed File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
2019-04-15 15:23:59 DEBUG config-changed raise CalledProcessError(retcode, cmd)
2019-04-15 15:23:59 DEBUG config-changed subprocess.CalledProcessError: Command '['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']' returned non-zero exit status 70
2019-04-15 15:24:00 ERROR juju.worker.uniter.operation runhook.go:132 hook "config-changed" failed: exit status 1
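
For context, the failing call chain in the traceback reduces to roughly the sketch below (reconstructed from the log above; the function bodies are assumptions, the real code is in hooks/rabbit_utils.py):

    import subprocess

    def rabbitmqctl(action, *args):
        # The charm shells out to rabbitmqctl, hard-capped at 180 seconds
        # by the external 'timeout' command (see the Running [...] line).
        subprocess.check_call(
            ['timeout', '180', '/usr/sbin/rabbitmqctl', action] + list(args))

    def wait_app(pid_file):
        # 'rabbitmqctl wait <pidfile>' blocks until the rabbit app reports
        # itself running; here it exits 70 (rabbitmqctl's own error) because
        # the node is down -- a kill by 'timeout' would have returned 124.
        rabbitmqctl('wait', pid_file)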

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

More details from another run: on the failed unit, rabbitmq-server didn't restart:

Apr 15 16:44:28 juju-494bf1-mojo-37 rabbitmq[13822]: pid is 13832
Apr 15 16:45:57 juju-494bf1-mojo-37 systemd[1]: rabbitmq-server.service: Start-post operation timed out. Stopping.
Apr 15 16:47:27 juju-494bf1-mojo-37 systemd[1]: rabbitmq-server.service: State 'stop-final-sigterm' timed out. Killing.
Apr 15 16:47:27 juju-494bf1-mojo-37 systemd[1]: Failed to start RabbitMQ Messaging Server.
Apr 15 16:47:27 juju-494bf1-mojo-37 systemd[1]: rabbitmq-server.service: Unit entered failed state.
Apr 15 16:47:27 juju-494bf1-mojo-37 systemd[1]: rabbitmq-server.service: Failed with result 'timeout'.
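
For reference, the "Start-post operation timed out" above is governed by the unit's TimeoutStartSec; a hypothetical systemd drop-in to raise it beyond rabbit's 5-minute retry window (the path and the 360 s value are assumptions, not something the charm writes today) would look like:

    # /etc/systemd/system/rabbitmq-server.service.d/override.conf
    [Service]
    TimeoutStartSec=360

followed by a "systemctl daemon-reload".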

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Also from the relevant rabbitmq-server unit (which failed, with the rabbitmq instance stopped):

=INFO REPORT==== 15-Apr-2019::16:44:29 ===
Memory limit set to 598MB of 1495MB total.

=INFO REPORT==== 15-Apr-2019::16:44:29 ===
Enabling free disk space monitoring

=INFO REPORT==== 15-Apr-2019::16:44:29 ===
Disk free limit set to 50MB

=INFO REPORT==== 15-Apr-2019::16:44:29 ===
Limiting to approx 65436 file handles (58890 sockets)

=INFO REPORT==== 15-Apr-2019::16:44:29 ===
FHC read buffering: OFF
FHC write buffering: ON

=INFO REPORT==== 15-Apr-2019::16:44:30 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=WARNING REPORT==== 15-Apr-2019::16:45:00 ===
Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
                                        [rabbit_user,rabbit_user_permission,
                                         rabbit_vhost,rabbit_durable_route,
                                         rabbit_durable_exchange,
                                         rabbit_runtime_parameters,
                                         rabbit_durable_queue]}

=INFO REPORT==== 15-Apr-2019::16:45:00 ===
Waiting for Mnesia tables for 30000 ms, 8 retries left

=WARNING REPORT==== 15-Apr-2019::16:45:30 ===
Error while waiting for Mnesia tables: {timeout_waiting_for_tables,
                                        [rabbit_user,rabbit_user_permission,
                                         rabbit_vhost,rabbit_durable_route,
                                         rabbit_durable_exchange,
                                         rabbit_runtime_parameters,
                                         rabbit_durable_queue]}

=INFO REPORT==== 15-Apr-2019::16:45:30 ===
Waiting for Mnesia tables for 30000 ms, 7 retries left

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

== Debugging rabbitmq issue: https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1824842 ==

The defaults are 10 retries to connect to a peer, with 30 seconds between attempts (i.e. 5 minutes in total), after which the node voluntarily stops.

From: https://www.rabbitmq.com/clustering.html#restarting

> A stopping node picks an online cluster member (only disc nodes will be considered) to sync with after restart. Upon restart the node will try to contact that peer 10 times by default, with 30 second response timeouts. In case the peer becomes available in that time interval, the node successfully starts, syncs what it needs from the peer and keeps going. If the peer does not become available, the restarted node will give up and voluntarily stop.
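
Those defaults correspond to rabbit app settings; a sketch of overriding them in /etc/rabbitmq/rabbitmq.config (these key names are the ones documented for newer RabbitMQ releases -- whether 3.6.10 honours both is an assumption):

    [
     {rabbit, [
       %% ms to wait per attempt for Mnesia tables
       {mnesia_table_loading_retry_timeout, 30000},
       %% attempts before the node gives up and stops
       {mnesia_table_loading_retry_limit, 10}
     ]}
    ].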

rabbit (juju leader) (/0):
stopped at: 16:42:57
starting at: 16:44:29
errors out at (gave up at): 16:45:30
(why does it stop here??)
In the juju log for /0 (unit-rabbitmq-server-0):

2019-04-15 16:47:27 DEBUG config-changed Job for rabbitmq-server.service failed because a timeout was exceeded. See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
2019-04-15 16:47:27 DEBUG juju-log Waiting for rabbitmq app to start: /<email address hidden>
2019-04-15 16:47:27 DEBUG juju-log Running ['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']

This log line appears at the END of the command, so the wait was started at (optimistically) 16:44:27, which concurs with the 'starting at' time above.

i.e. this command kills the start attempt after 3 minutes, whereas rabbit (by default) allows itself 5 minutes to complete the restart before giving up.

I suspect that the rabbit log ends "early" (i.e. there are missing retries) because the rug is pulled out from under the process (it is most likely pkilled), so the buffer never got flushed. However, I don't know if this is really true.

rabbit slave (/1):
says it is asked to stop at 16:43:57
starts complaining about /0 at 16:43:57 with ERROR REPORT notices
stops at 16:43:58
At 16:45:30 complains about /0 with an mnesia_event of {inconsistent_database, starting_partitioned_network, 'rabbit@juju-494bf1-mojo-37'}
starts at 16:45:30
waits for tables to come back and restarts its TCP listener
at 16:46:01 it's back up fully, and starts accepting AMQP connections.

rabbit slave (/2):
notices that /0 goes away at 16:42:58
starts closing AMQP connections at 16:42:58 with WARNING reports
starts accepting AMQP connections again at 16:42:50
starts mirroring queues at 16:43:04
at 16:43:05 complains about /0 and /1 dying out of the mirrors
Hundreds of INFO reports on mirror queue activities (deaths, sync, batch sizes) until 16:44:28 when it gets told to stop.
At 16:44:28 this node stops
At 16:46:01 it comes back up
At 16:46:02 tables are back
At 16:46:03 mirrored queues start coming back
At 16:46:03 TCP comes up and the first AMQP connections are accepted.

Thoughts:

1. The timeout command should wait longer than 5 minutes to allow rabbitmq to gracefully quit (see the sketch after this list).
2. It's not clear if rabbit/0 would have rejoined the party ... but we can't see the logs for what it was doing from 16:45:30 until it was killed at 16:47:27.
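
A minimal sketch of thought (1): lengthen the hard cap in the charm's rabbitmqctl wrapper so it comfortably exceeds rabbit's own 10 x 30 s window (the grace figure is an assumption, not a value from the charm):

    import subprocess

    # rabbit voluntarily stops after ~10 retries x 30 s = 300 s, so the
    # external cap must be larger than that, not 180 s.
    RABBIT_SELF_TIMEOUT = 10 * 30   # seconds rabbit retries before stopping
    GRACE = 60                      # extra headroom (assumed value)

    def wait_app(pid_file):
        subprocess.check_call(['timeout', str(RABBIT_SELF_TIMEOUT + GRACE),
                               '/usr/sbin/rabbitmqctl', 'wait', pid_file])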


Ryan Beisner (1chb1n)
Changed in charm-rabbitmq-server:
importance: Undecided → Medium
status: New → Confirmed
Ryan Beisner (1chb1n)
Changed in charm-rabbitmq-server:
milestone: none → 19.07
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 19.10 → 20.01
tags: added: charm-upgrade
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-rabbitmq-server:
milestone: 20.05 → 20.08
James Page (james-page)
Changed in charm-rabbitmq-server:
milestone: 20.08 → none