upgrade-series complete hangs when upgrade from Focal to Jammy due to .erlang.cookie being overwritten

Bug #2006484 reported by Chuan Li
This bug affects 1 person
Affects                          Status        Importance  Assigned to    Milestone
OpenStack RabbitMQ Server Charm  Fix Released  High        Alex Kavanagh
rabbitmq-server (Ubuntu)         New           Undecided   Unassigned

Bug Description

Charm version: 3.9/stable #154
Upgrading the series from Focal to Jammy failed.

Steps:

#juju deploy --series focal --channel 3.9/stable rabbitmq-server -n 3
#juju set-series rabbitmq-server jammy
#juju run-action --wait rabbitmq-server/0 pause
#juju upgrade-series 0 prepare jammy
#juju ssh 0 sudo apt update
#juju ssh 0 sudo apt full-upgrade
#juju ssh 0 sudo do-release-upgrade
#juju upgrade-series 0 complete

The last command prints the following and then hangs:

machine-0 complete phase started
machine-0 start units after series upgrade
rabbitmq-server/0 post-series-upgrade hook running

unit-log:
2023-02-07 15:08:33 WARNING unit.rabbitmq-server/0.post-series-upgrade logger.go:60 subprocess.CalledProcessError: Command '['/usr/sbin/rabbitmqctl', 'cluster_status', '--formatter=json']' returned non-zero exit status 69.

rabbitmq-server.service remains in the "activating" state.

BTW, I can successfully deploy the same charm onto 3 Jammy machines.

Detail steps and output: https://pastebin.canonical.com/p/tnCsfFpr7q/

Alex Kavanagh (ajkavanagh) wrote :

I've reproduced the error, or at least I think I have. What seems to be happening during the series-upgrade is that /var/lib/rabbitmq/.erlang.cookie is overwritten when the new packages are installed.

When the .erlang.cookie doesn't match across the cluster, the cluster won't re-form, and so the hook times out. This may be new behaviour in the package, or something different in the upstream code.
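
A quick way to confirm that the cookies have diverged is to compare the cookie file from each node. The following is a minimal sketch. It assumes each unit's .erlang.cookie has already been copied to a local file, and `cookies_match` is a hypothetical helper, not part of the charm or the package.

```shell
#!/bin/sh
# cookies_match FILE...
# Hypothetical helper: succeeds (exit 0) only if every given cookie file
# is byte-identical to the first one. Expects at least one file.
cookies_match() {
    first=$1
    shift
    for f in "$@"; do
        # cmp -s prints nothing and returns non-zero when the files differ
        if ! cmp -s "$first" "$f"; then
            echo "cookie mismatch: $first vs $f" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `cookies_match cookie-0 cookie-1 cookie-2` (hypothetical file names) fails and names the first mismatching file if any unit's cookie differs from the first one. In that situation the cluster will not re-form.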

What I think we need to do is update the post-series-upgrade hook so that the peer-storage('cookie') value is synced to .erlang.cookie before it attempts to start the service.
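
The proposed sync can be sketched in shell as follows. This is an illustration under assumptions, not the charm's code. `sync_cookie` is a hypothetical helper, and the expected value stands in for whatever the charm holds in peer storage. Setting the file's ownership and starting rabbitmq-server afterwards are left out.

```shell
#!/bin/sh
# sync_cookie COOKIE_FILE EXPECTED_COOKIE
# Hypothetical helper: make COOKIE_FILE contain EXPECTED_COOKIE (the value
# held in peer storage) before the service is started.
# Prints "unchanged" or "rewritten".
sync_cookie() {
    cookie_file=$1
    expected=$2
    # $(cat ...) strips trailing newlines, so a trailing newline in the
    # file is not treated as a difference.
    if [ -e "$cookie_file" ] && [ "$(cat "$cookie_file")" = "$expected" ]; then
        echo unchanged
        return 0
    fi
    # Subshell keeps the umask change local; a newly created file gets
    # owner-only permissions, like the postinst's "umask 077".
    (
        umask 077
        printf '%s' "$expected" > "$cookie_file"
    ) || return 1
    echo rewritten
}
```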

Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → High
Alex Kavanagh (ajkavanagh) wrote :

I've confirmed that it's the package upgrade that changes /var/lib/rabbitmq/.erlang.cookie, so this could be a packaging bug or an upstream bug.

Alex Kavanagh (ajkavanagh) wrote :

Okay, further investigation suggests it's a combination of a packaging issue (deliberate?) and the way the charm generates the cookie:

From the package's debian/rabbitmq-server.postinst file:

 # We generate the erlang cookie by ourselves, just to make sure we don't
 # leave the job to erlang that doesn't do it with enough entropy.
 if ! [ -e /var/lib/rabbitmq/.erlang.cookie ] ; then
  OLD_UMASK=$(umask)
  umask 077; openssl rand -base64 -out /var/lib/rabbitmq/.erlang.cookie 42
  umask ${OLD_UMASK}
 else
  # This matches an Erlang generated cookie file: 20 upper case chars
  if grep -q -E '^[A-Z]{20}$' /var/lib/rabbitmq/.erlang.cookie ; then
   OLD_UMASK=$(umask)
   umask 077; openssl rand -base64 -out /var/lib/rabbitmq/.erlang.cookie 42
   umask ${OLD_UMASK}
   if [ ""$(ps --no-headers -o comm 1) = "systemd" ] ; then
    if systemctl is-active --quiet rabbitmq-server.service ; then
     systemctl restart rabbitmq-server.service
    fi
   fi
  fi
 fi
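
To check in advance which branch of the postinst logic above will run on a given node, the same tests can be repeated without modifying anything. Below is a minimal sketch. `postinst_cookie_action` is a hypothetical name, and the function only mirrors the quoted script.

```shell
#!/bin/sh
# postinst_cookie_action COOKIE_FILE
# Hypothetical helper: reports, without changing anything, which branch of
# the quoted postinst cookie logic would run for COOKIE_FILE:
#   generate   - no cookie file exists; a fresh openssl cookie is written
#   regenerate - the file has a line of exactly 20 upper-case letters
#                (an Erlang-style cookie); it would be REPLACED and a
#                running rabbitmq-server restarted
#   keep       - any other existing cookie is left untouched
postinst_cookie_action() {
    if ! [ -e "$1" ]; then
        echo generate
    elif grep -q -E '^[A-Z]{20}$' "$1"; then
        echo regenerate
    else
        echo keep
    fi
}
```

On a node deployed by the charm, `postinst_cookie_action /var/lib/rabbitmq/.erlang.cookie` would print `regenerate` for a cookie like the one shown below. That is the case that breaks the cluster during the upgrade.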

A cookie generated by the charm looks like this:

# cat .erlang.cookie
STPUDRGPEHHJDUXOISZO

And the grep command:

# grep -q -E '^[A-Z]{20}$' /var/lib/rabbitmq/.erlang.cookie
# echo $?
0

Whereas for a 'new' cookie:

# cat .erlang.cookie
xbU/nSZTfUW8bl24b2ssjCj9Z8EGqwtdqaRVn1TJfjrrYXb/gG1t+0mz

and the grep command:

# grep -q -E '^[A-Z]{20}$' /var/lib/rabbitmq/.erlang.cookie
# echo $?
1

Therefore, I'm fairly confident it's caused by the package. According to debian/README, this is intentional:

2/ Erlang cookie

Prior to Debian version 3.9.8-3, rabbitmq-server generated an Erlang
"magic cookie" shared secret if one did not exist. This secret is
stored in /var/lib/rabbitmq/.erlang.cookie

However, due to predictable seeds and a non-cryptographic randomizer,
the automatically-generated secret written by Erlang only supplies 20
to 40 bits of entropy. This allows a remote attacker with access to
port 25672 to brute-force the "magic cookie," potentially within
minutes, authenticate as a remote node, and EXECUTE ARBITRARY CODE.

Since 3.9.8-3, the rabbitmq-server node will use openssl to generate a
cryptographically-secure cookie during first installation, mitigating
this vulnerability.

Servers which installed a prior version, and are upgrading to 3.9.8-3
or higher, ARE STILL VULNERABLE, as the package will not regenerate
the secret if it exists already. This is because the secret is
designed to be shared between nodes in a cluster, and thus
regenerating it would break existing clusters.

Operators upgrading from earlier versions of rabbitmq-server are
strongly encouraged to generate a new secret. This can be done via:

    openssl rand -base64 42 >/var/lib/rabbitmq/.erlang.cookie

---

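i.e. the postinst script replaces an existing Erlang-style cookie, even though the README says the package will not regenerate an existing secret because doing so would break existing clusters. I don't think the postinst script should be doing this.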

Thoughts welcome.

Billy Olsen (billy-olsen) wrote :

Yes, I think this is a rabbitmq-server packaging bug if the package itself is replacing the .erlang.cookie file in a clustered installation. It will indeed break the cluster, as the README calls out.

Changed in charm-rabbitmq-server:
assignee: nobody → Alex Kavanagh (ajkavanagh)
summary: - upgrade-series complete hangs when upgrade from Focal to Jammy
+ upgrade-series complete hangs when upgrade from Focal to Jammy due to
+ .erlang.cookie being overwritten
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Alex Kavanagh (ajkavanagh) wrote :

Review that fixed the bug: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/874783 (unfortunately, I got the bug number wrong in the commit message)

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/jammy)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/876921
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/e381a20f1d5238caf567ac66c13b3a943ecc654d
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit e381a20f1d5238caf567ac66c13b3a943ecc654d
Author: Alex Kavanagh <email address hidden>
Date: Wed Feb 22 14:56:27 2023 +0000

    Fix focal to jammy series upgrade

    This is a fix/workaround to the package upgrade bug that affects the
    charm. The post-inst package script updates the .erlang.cookie if it is
    insecure during the upgrade of rabbit from 3.8 to 3.9. This breaks the
    series-upgrade resulting in a charm erroring on the post-series-upgrade
    hook.

    This fix works by checking if the .erlang.cookie has changed during the
    post-series-upgrade hook and either updating the cookie in peer storage
    (if it is insecure) or ensuring that the cookie from peer storage is
    written to the .erlang.cookie if it isn't the leader. This ensures that
    the cluster continues to work and that the series-upgrade can be
    completed across the cluster.

    Change-Id: I540ea8da85b3b4326ccb8194f1d8b1050b04eae9
    Closes-Bug: #2006484
    (cherry picked from commit 55b985f55ca4eb2b2a7229c8dcc70abc8c8940f4)
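
Here is one reading of the decision logic in the commit message above, sketched in shell. This is not the charm's actual implementation. `cookie_action`, `is_insecure_cookie` and the `true`/`false` leader flag are hypothetical, and the sketch only prints the action it would take. How it handles a leader whose stored cookie is already secure is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the post-series-upgrade cookie decision.

# is_insecure_cookie COOKIE
# Same test as the package postinst: exactly 20 upper-case letters is
# treated as an Erlang-generated (low-entropy) cookie.
is_insecure_cookie() {
    printf '%s\n' "$1" | grep -q -E '^[A-Z]{20}$'
}

# cookie_action FILE_COOKIE PEER_COOKIE IS_LEADER
#   FILE_COOKIE - current contents of /var/lib/rabbitmq/.erlang.cookie
#   PEER_COOKIE - cookie the charm holds in peer storage
#   IS_LEADER   - "true" if this unit is the leader, otherwise "false"
cookie_action() {
    file_cookie=$1
    peer_cookie=$2
    is_leader=$3
    if [ "$file_cookie" = "$peer_cookie" ]; then
        echo none                 # the upgrade did not touch the cookie
    elif [ "$is_leader" = true ] && is_insecure_cookie "$peer_cookie"; then
        echo update-peer-storage  # leader publishes the new, secure cookie
    else
        # Non-leaders (and, by assumption, a leader whose stored cookie is
        # already secure) restore the cluster-wide cookie from peer storage.
        echo write-cookie-file
    fi
}
```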

tags: added: in-stable-jammy
Alex Kavanagh (ajkavanagh) wrote :

The jammy/stable version of the charm now includes this fix.

Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released