After charm-upgrade, config-changed times out waiting for the service to come back

Bug #1912638 reported by Aurelien Lourot
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
New
Undecided
Unassigned

Bug Description

Seen on OSCI when upgrading the charm from bionic-queens stable to -next:
http://osci:8080/job/mojo_runner/24910

~~~~~~~~~~
2021-01-20 13:30:41 INFO juju-log Unit is ready
2021-01-20 13:30:42 INFO juju.worker.uniter.operation runhook.go:142 ran "update-status" hook (via explicit, bespoke hook script)
2021-01-20 13:31:48 INFO juju.worker.uniter.charm bundles.go:79 downloading local:bionic/rabbitmq-server-151 from API server
2021-01-20 13:31:48 INFO juju.downloader download.go:111 downloading from local:bionic/rabbitmq-server-151
2021-01-20 13:31:48 INFO juju.downloader download.go:94 download complete ("local:bionic/rabbitmq-server-151")
2021-01-20 13:31:48 INFO juju.downloader download.go:174 download verified ("local:bionic/rabbitmq-server-151")
2021-01-20 13:31:49 INFO juju.worker.uniter resolver.go:147 found queued "upgrade-charm" hook
2021-01-20 13:31:56 INFO juju-log Unknown hook upgrade-charm.real - skipping.
2021-01-20 13:32:03 INFO juju-log Unit is ready
2021-01-20 13:32:04 INFO juju.worker.uniter.operation runhook.go:142 ran "upgrade-charm" hook (via explicit, bespoke hook script)
2021-01-20 13:32:04 INFO juju.worker.uniter resolver.go:147 found queued "config-changed" hook
2021-01-20 13:32:05 INFO juju-log Updating hosts file with: {'172.17.104.66': 'juju-f007d1-auto-osci-sv04-39'} (current: ['127.0.0.1 localhost\n', '\n', '# The following lines are desirable for IPv6 capable hosts\n', '::1 ip6-localhost ip6-loopback\n', 'fe00::0 ip6-localnet\n', 'ff00::0 ip6-mcastprefix\n', 'ff02::1 ip6-allnodes\n', 'ff02::2 ip6-allrouters\n', 'ff02::3 ip6-allhosts\n', '172.17.104.66 juju-f007d1-auto-osci-sv04-39\n', '172.17.104.37 juju-f007d1-auto-osci-sv04-37\n', '172.17.104.50 juju-f007d1-auto-osci-sv04-38\n'])
2021-01-20 13:32:05 INFO juju-log Removing line '172.17.104.66 juju-f007d1-auto-osci-sv04-39
' from hosts file
2021-01-20 13:32:05 INFO juju-log Changing ownership of path /var/lib/rabbitmq to rabbitmq:rabbitmq
2021-01-20 13:32:05 INFO juju-log Changing perms of path /var/lib/rabbitmq
2021-01-20 13:33:29 WARNING config-changed Job for rabbitmq-server.service failed because the control process exited with error code.
2021-01-20 13:33:29 WARNING config-changed See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
2021-01-20 13:33:31 WARNING config-changed Error: process_not_running
2021-01-20 13:33:32 WARNING config-changed Error: unable to connect to node 'rabbit@juju-f007d1-auto-osci-sv04-39': nodedown
2021-01-20 13:33:32 WARNING config-changed
2021-01-20 13:33:32 WARNING config-changed DIAGNOSTICS
2021-01-20 13:33:32 WARNING config-changed ===========
2021-01-20 13:33:32 WARNING config-changed
2021-01-20 13:33:32 WARNING config-changed attempted to contact: ['rabbit@juju-f007d1-auto-osci-sv04-39']
2021-01-20 13:33:32 WARNING config-changed
2021-01-20 13:33:32 WARNING config-changed rabbit@juju-f007d1-auto-osci-sv04-39:
2021-01-20 13:33:32 WARNING config-changed * connected to epmd (port 4369) on juju-f007d1-auto-osci-sv04-39
2021-01-20 13:33:32 WARNING config-changed * epmd reports: node 'rabbit' not running at all
2021-01-20 13:33:32 WARNING config-changed no other nodes on juju-f007d1-auto-osci-sv04-39
2021-01-20 13:33:32 WARNING config-changed * suggestion: start the node
2021-01-20 13:33:32 WARNING config-changed
2021-01-20 13:33:32 WARNING config-changed current node details:
2021-01-20 13:33:32 WARNING config-changed - node name: 'rabbitmq-cli-25@juju-f007d1-auto-osci-sv04-39'
2021-01-20 13:33:32 WARNING config-changed - home dir: /var/lib/rabbitmq
2021-01-20 13:33:32 WARNING config-changed - cookie hash: v8dn+KKQAkI0C8qPyyiweQ==
2021-01-20 13:33:32 WARNING config-changed
2021-01-20 13:33:33 WARNING config-changed Traceback (most recent call last):
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/hooks/config-changed", line 878, in <module>
2021-01-20 13:33:33 WARNING config-changed hooks.execute(sys.argv)
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/charmhelpers/core/hookenv.py", line 945, in execute
2021-01-20 13:33:33 WARNING config-changed self._hooks[hook_name]()
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/hooks/rabbit_utils.py", line 949, in wrapped_f
2021-01-20 13:33:33 WARNING config-changed wait_app()
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/hooks/rabbit_utils.py", line 476, in wait_app
2021-01-20 13:33:33 WARNING config-changed raise ex
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/hooks/rabbit_utils.py", line 466, in wait_app
2021-01-20 13:33:33 WARNING config-changed rabbitmqctl('wait', pid_file)
2021-01-20 13:33:33 WARNING config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-2/charm/hooks/rabbit_utils.py", line 418, in rabbitmqctl
2021-01-20 13:33:33 WARNING config-changed subprocess.check_call(cmd)
2021-01-20 13:33:33 WARNING config-changed File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
2021-01-20 13:33:33 WARNING config-changed raise CalledProcessError(retcode, cmd)
2021-01-20 13:33:33 WARNING config-changed subprocess.CalledProcessError: Command '['timeout', '180', '/usr/sbin/rabbitmqctl', 'wait', '/<email address hidden>']' returned non-zero exit status 70.
2021-01-20 13:33:33 ERROR juju.worker.uniter.operation runhook.go:136 hook "config-changed" (via explicit, bespoke hook script) failed: exit status 1
~~~~~~~~~~

From <email address hidden> however we see that at 13:33 the RabbitMQ service has been already back for a minute:

~~~~~~~~~~
=INFO REPORT==== 20-Jan-2021::13:30:38 ===
connection <0.28484.0> (172.17.104.36:59510 -> 172.17.104.66:5672 - ceilometer-polling:25892:2dfbe82f-ff34-4e8d-9ddf-e794aca353ad): user 'ceilometer' authenticated and granted access to vhost 'openstack'

=ERROR REPORT==== 20-Jan-2021::13:32:07 ===
** Connection attempt from disallowed node 'rabbitmq-cli-26@juju-f007d1-auto-osci-sv04-39' **

=INFO REPORT==== 20-Jan-2021::13:32:32 ===
rabbit on node 'rabbit@juju-f007d1-auto-osci-sv04-38' down

=INFO REPORT==== 20-Jan-2021::13:32:32 ===
Keep rabbit@juju-f007d1-auto-osci-sv04-38 listeners: the node is already back

=INFO REPORT==== 20-Jan-2021::13:32:33 ===
node 'rabbit@juju-f007d1-auto-osci-sv04-38' down: connection_closed

=INFO REPORT==== 20-Jan-2021::13:32:33 ===
node 'rabbit@juju-f007d1-auto-osci-sv04-38' up

=INFO REPORT==== 20-Jan-2021::13:32:34 ===
accepting AMQP connection <0.29738.0> (172.17.104.31:33892 -> 172.17.104.66:5672)
~~~~~~~~~~

This could be an instance of lp:1710247 where the /var/lib/rabbitmq/mnesia/*.pid file gets created with the wrong name but we can't tell because our CI system doesn't gather /var/lib/rabbitmq/ .

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
tags: added: charm-upgrade
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

CI change for gathering /var/lib/rabbitmq/ from now on: https://github.com/openstack-charmers/bot-control/pull/199

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Happened again: http://osci:8080/job/mojo_runner/24932

We can see the the config-changed hook expects a <email address hidden> file in order to know that the service has come back after the charm-upgrade. At 17:30:06 it times out, although /<email address hidden> shows that the service was back 30 seconds earlier. On the other hand the logs shows a lot of errors and struggles until 17:33:14 when it seems to have then stabilized, so maybe increasing the timeout is the way to go.

This time we have collected /var/lib/rabbitmq and we can see that the pid file exists and is named <email address hidden> as expected by the errored hook, so this is definitely not a duplicate of lp:1710247 which was a mismatch between the name of the produced pid file (containing an IP address) and the name of the expected pid file.

Attaching the logs for later reference. I think it's worth trying to increase the timeout and if my intuition is right, the workaround for now on would just be to resolve the failed hook ~1 minute after the failure.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
summary: - After charm-upgrade, config-changed thinks that the service hasn't come
- back
+ After charm-upgrade, config-changed times out waiting for the service to
+ come back
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.