Upgrade collision on package upgrade

Bug #1778829 reported by Xav Paice
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack RabbitMQ Server Charm
Fix Released
High
David Ames

Bug Description

Using charm stable/18.05, on a xenial/pike cloud.

source was set to cloud:xenial-pike

Ran 'juju config rabbitmq-server source="cloud:xenial-queens"'

The cluster immediately ran upgrade steps, as expected, and the cluster went down while packages updated. We expected that.

In the unit log, we see:

2018-06-27 00:59:39 DEBUG config-changed Reading package lists...
2018-06-27 00:59:39 DEBUG config-changed Building dependency tree...
2018-06-27 00:59:39 DEBUG config-changed Reading state information...
2018-06-27 00:59:39 DEBUG config-changed The following package was automatically installed and is no longer required:
2018-06-27 00:59:39 DEBUG config-changed libfreetype6
2018-06-27 00:59:39 DEBUG config-changed Use 'apt autoremove' to remove it.
2018-06-27 00:59:39 DEBUG config-changed 0 upgraded, 0 newly installed, 0 to remove and 57 not upgraded.
2018-06-27 00:59:39 DEBUG config-changed 1 not fully installed or removed.
2018-06-27 00:59:39 DEBUG config-changed After this operation, 0 B of additional disk space will be used.
2018-06-27 00:59:39 DEBUG config-changed Setting up rabbitmq-server (3.6.10-1~cloud0) ...
2018-06-27 00:59:42 DEBUG config-changed Job for rabbitmq-server.service failed because the control process exited with error code. See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
2018-06-27 00:59:42 DEBUG config-changed invoke-rc.d: initscript rabbitmq-server, action "start" failed.
2018-06-27 00:59:42 DEBUG config-changed ESC[0;1;31m●ESC[0m rabbitmq-server.service - RabbitMQ Messaging Server
2018-06-27 00:59:42 DEBUG config-changed Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
2018-06-27 00:59:42 DEBUG config-changed Active: ESC[0;1;31mfailedESC[0m (Result: exit-code) since Wed 2018-06-27 00:59:42 UTC; 7ms ago
2018-06-27 00:59:42 DEBUG config-changed Process: 1043950 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait ESC[0;1;31m(code=exited, status=70)ESC[0m
2018-06-27 00:59:42 DEBUG config-changed Process: 1043949 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=0/SUCCESS)
2018-06-27 00:59:42 DEBUG config-changed Main PID: 1043949 (code=exited, status=0/SUCCESS)
2018-06-27 00:59:42 DEBUG config-changed
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:40 juju-ed63ca-0-lxd-16 systemd[1]: Starting RabbitMQ Messaging....
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:41 juju-ed63ca-0-lxd-16 rabbitmq[1043950]: Waiting for 'rabbit@j...
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:41 juju-ed63ca-0-lxd-16 rabbitmq[1043950]: pid is 1043958
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:42 juju-ed63ca-0-lxd-16 rabbitmq[1043950]: Error: process_not_ru...
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:42 juju-ed63ca-0-lxd-16 systemd[1]: ESC[0;1;39mrabbitmq-server.service: Co...0ESC[0m
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:42 juju-ed63ca-0-lxd-16 systemd[1]: ESC[0;1;31mFailed to start RabbitMQ Me....ESC[0m
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:42 juju-ed63ca-0-lxd-16 systemd[1]: ESC[0;1;39mrabbitmq-server.service: Un....ESC[0m
2018-06-27 00:59:42 DEBUG config-changed Jun 27 00:59:42 juju-ed63ca-0-lxd-16 systemd[1]: ESC[0;1;39mrabbitmq-server.service: Fa....ESC[0m
2018-06-27 00:59:42 DEBUG config-changed Hint: Some lines were ellipsized, use -l to show in full.
2018-06-27 00:59:42 DEBUG config-changed dpkg: error processing package rabbitmq-server (--configure):
2018-06-27 00:59:42 DEBUG config-changed subprocess installed post-installation script returned error exit status 1
2018-06-27 00:59:43 DEBUG config-changed Errors were encountered while processing:
2018-06-27 00:59:43 DEBUG config-changed rabbitmq-server
2018-06-27 00:59:44 DEBUG config-changed E: Sub-process /usr/bin/dpkg returned an error code (1)
2018-06-27 00:59:44 DEBUG config-changed Traceback (most recent call last):
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/config-changed", line 858, in <module>
2018-06-27 00:59:44 DEBUG config-changed hooks.execute(sys.argv)
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/core/hookenv.py", line 823, in execute
2018-06-27 00:59:44 DEBUG config-changed self._hooks[hook_name]()
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/rabbit_utils.py", line 853, in wrapped_f
2018-06-27 00:59:44 DEBUG config-changed f(*args, **kwargs)
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/contrib/hardening/harden.py", line 79, in _harden_inner2
2018-06-27 00:59:44 DEBUG config-changed return f(*args, **kwargs)
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/config-changed", line 739, in config_changed
2018-06-27 00:59:44 DEBUG config-changed add_source(config('source'), config('key'))
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 391, in add_source
2018-06-27 00:59:44 DEBUG config-changed fn(*m.groups())
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 487, in _add_cloud_distro_check
2018-06-27 00:59:44 DEBUG config-changed _add_cloud_pocket("{}-{}".format(cloud_archive_release, openstack_release))
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 445, in _add_cloud_pocket
2018-06-27 00:59:44 DEBUG config-changed fatal=True)
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 208, in apt_install
2018-06-27 00:59:44 DEBUG config-changed _run_apt_command(cmd, fatal)
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 560, in _run_apt_command
2018-06-27 00:59:44 DEBUG config-changed retry_message="Couldn't acquire DPKG lock")
2018-06-27 00:59:44 DEBUG config-changed File "/var/lib/juju/agents/unit-rabbitmq-server-0/charm/hooks/charmhelpers/fetch/ubuntu.py", line 536, in _run_with_retries
2018-06-27 00:59:44 DEBUG config-changed result = subprocess.check_call(cmd, **kwargs)
2018-06-27 00:59:44 DEBUG config-changed File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
2018-06-27 00:59:44 DEBUG config-changed raise CalledProcessError(retcode, cmd)
2018-06-27 00:59:44 DEBUG config-changed subprocess.CalledProcessError: Command '['apt-get', '--assume-yes', '--option=Dpkg::Options::=--force-confold', 'install']' returned non-zero exit status 100
2018-06-27 00:59:44 ERROR juju.worker.uniter.operation runhook.go:113 hook "config-changed" failed: exit status 1
2018-06-27 00:59:44 DEBUG juju.worker.uniter.operation executor.go:84 lock released
2018-06-27 00:59:44 INFO juju.worker.uniter resolver.go:100 awaiting error resolution for "config-changed" hook
2018-06-27 00:59:44 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] error: hook failed: "config-changed"

All 3 units are unable to start, when I try to start the service:

root@juju-ed63ca-0-lxd-16:~# systemctl start rabbitmq-server.service
Job for rabbitmq-server.service failed because the control process exited with error code. See "systemctl status rabbitmq-server.service" and "journalctl -xe" for details.
root@juju-ed63ca-0-lxd-16:~# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-06-27 01:04:05 UTC; 12s ago
  Process: 1044703 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=70)
  Process: 1044702 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=0/SUCCESS)
 Main PID: 1044702 (code=exited, status=0/SUCCESS)

Jun 27 01:04:02 juju-ed63ca-0-lxd-16 systemd[1]: Starting RabbitMQ Messaging Server...
Jun 27 01:04:03 juju-ed63ca-0-lxd-16 rabbitmq[1044703]: Waiting for 'rabbit@juju-ed63ca-0-lxd-16'
Jun 27 01:04:03 juju-ed63ca-0-lxd-16 rabbitmq[1044703]: pid is 1044713
Jun 27 01:04:05 juju-ed63ca-0-lxd-16 rabbitmq[1044703]: Error: process_not_running
Jun 27 01:04:05 juju-ed63ca-0-lxd-16 systemd[1]: rabbitmq-server.service: Control process exited, code=exited status=70
Jun 27 01:04:05 juju-ed63ca-0-lxd-16 systemd[1]: Failed to start RabbitMQ Messaging Server.
Jun 27 01:04:05 juju-ed63ca-0-lxd-16 systemd[1]: rabbitmq-server.service: Unit entered failed state.
Jun 27 01:04:05 juju-ed63ca-0-lxd-16 systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.

Revision history for this message
Xav Paice (xavpaice) wrote :

Subscribed field-high as this customer affecting bug caused significant downtime.

James Troup (elmo)
tags: added: upgrade
Revision history for this message
David Ames (thedac) wrote :

Xav,

I have run a number of 18.05 xenial-pike > xenial-queens upgrades without seeing this.

The log entry suggests apt was already in use
retry_message="Couldn't acquire DPKG lock"

Can we get the rabbitmq log directories to try and figure out what went wrong?

Changed in charm-rabbitmq-server:
status: New → Incomplete
Revision history for this message
Xav Paice (xavpaice) wrote :
Download full text (5.1 KiB)

The apt log does show:

(rabbitmq-server/0):
Start-Date: 2018-06-27 00:47:57
Commandline: apt-get --assume-yes --option=Dpkg::Options::=--force-confold install rabbitmq-server python-amqplib lockfile-progs
Upgrade: rabbitmq-server:amd64 (3.5.7-1ubuntu0.16.04.2, 3.6.10-1~cloud0)
Error: Sub-process /usr/bin/dpkg returned an error code (1)
End-Date: 2018-06-27 00:48:03

In the Rabbit log at the same time, I see:

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'conductor.juju-ed63ca-0-lxd-13' in vhost 'openstack': Synchronising: 0 messages to synchronise

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'engine_worker.cd770ae7-dcb3-415e-a814-29ec77451484' in vhost 'openstack': Master <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'heat-engine-listener.230c2044-2727-408b-8ee4-c06b4df53488' in vhost 'openstack': Master <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'heat-engine-listener.9602fb4b-8ec3-4c1f-bcc1-e4e5a941d354' in vhost 'openstack': Master <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'q-server-resource-versions_fanout_22a68acb7db54262b85aebaf325b7762' in vhost 'openstack': Master <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'neutron-vo-Network-1.0_fanout_dce4d356679c467ea2e7c64658537de4' in vhost 'openstack': Master <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'engine_worker.bb85dbea-b12e-4d20-b974-d4702ff16544' in vhost 'openstack': Synchronising: 0 messages to synchronise

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'reply_1f0019e691bf4751be5b03c1db8db8a2' in vhost 'openstack': Slave <email address hidden> saw deaths of mirrors <email address hidden>

=INFO REPORT==== 27-Jun-2018::00:47:40 ===
Mirrored queue 'reply_1f0019e691bf4751be5b03c1db8db8a2' in vhost 'openstack': Promoting slave <email address hidden> to master

<snip a million of those>

=INFO REPORT==== 27-Jun-2018::00:47:46 ===
accepting AMQP connection <0.8368.82> (10.243.254.189:46246 -> 10.243.254.206:5672)

=INFO REPORT==== 27-Jun-2018::00:47:49 ===
accepting AMQP connection <0.8374.82> (10.243.254.216:37658 -> 10.243.254.206:5672)

=INFO REPORT==== 27-Jun-2018::00:47:52 ===
accepting AMQP connection <0.8385.82> (10.243.254.191:38186 -> 10.243.254.206:5672)

=INFO REPORT==== 27-Jun-2018::00:47:52 ===
accepting AMQP connection <0.8397.82> (10.243.254.228:60470 -> 10.243.254.206:5672)

=INFO REPORT==== 27-Jun-2018::00:47:53 ===
accepting AMQP connection <0.8409.82> (10.243.254.191:38188 -> 10.243.254.206:5672)

=INFO REPORT==== 27-Jun-2018::00:47:56 ===
accepting AMQP connection <0.8420.82> (10.243.254.195:38466 -...

Read more...

Changed in charm-rabbitmq-server:
status: Incomplete → New
Revision history for this message
David Ames (thedac) wrote :

< What's happened is that all the 3 rabbit units have done the upgrade at almost the same time, and because they're running at different versions, rabbit won't start back up because the erlang version is different to the other units.

That is the root cause. We have had 2 or more units stop and attempt to restart rabbit at the same time.

So this bug is not specific to the queens upgrade.

Right now the hooks.rabbit_utils.cluster_wait is used to attempt to avoid collisions. We need to add this to the upgrade-charm function before the package install.

summary: - Upgrade to queens results in stopped cluster
+ Upgrade collision on package upgrade
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → High
milestone: none → 18.08
Revision history for this message
James Page (james-page) wrote :
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
assignee: nobody → David Ames (thedac)
tags: added: backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.openstack.org/581440
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=86edf1e5d86b97ec071578b1e910cd94b7a019e4
Submitter: Zuul
Branch: master

commit 86edf1e5d86b97ec071578b1e910cd94b7a019e4
Author: David Ames <email address hidden>
Date: Tue Jul 10 10:48:21 2018 -0700

    Avoid package upgrade collisions

    During upgrade to a new cloud archive and a new rabbitmq-server
    package, the package update will stop the rabbitmq service. If two or
    more units do this near the same time the services will fail to restart.

    This change adds the cluster_wait function before a package upgrade to
    avoid collisions.

    Fix relation_get for cluster_with causing extraneous hooks to execute.
    Fix nagios stats collection logic.

    Change-Id: I8d9aa38d917583fa45b5570eca9a78c813303e2b
    Closes-Bug: #1778829

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/18.05)

Fix proposed to branch: stable/18.05
Review: https://review.openstack.org/585137

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/18.05)

Reviewed: https://review.openstack.org/585137
Committed: https://git.openstack.org/cgit/openstack/charm-rabbitmq-server/commit/?id=9005a38f1da1cde68838c23d49e454bcf5074a82
Submitter: Zuul
Branch: stable/18.05

commit 9005a38f1da1cde68838c23d49e454bcf5074a82
Author: David Ames <email address hidden>
Date: Tue Jul 10 10:48:21 2018 -0700

    Avoid package upgrade collisions

    During upgrade to a new cloud archive and a new rabbitmq-server
    package, the package update will stop the rabbitmq service. If two or
    more units do this near the same time the services will fail to restart.

    This change adds the cluster_wait function before a package upgrade to
    avoid collisions.

    Fix relation_get for cluster_with causing extraneous hooks to execute.
    Fix nagios stats collection logic.

    Change-Id: I8d9aa38d917583fa45b5570eca9a78c813303e2b
    Closes-Bug: #1778829
    (cherry picked from commit 86edf1e5d86b97ec071578b1e910cd94b7a019e4)

David Ames (thedac)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.