MRE updates of rabbitmq-server for Jammy,Focal

Bug #2060248 reported by Mitchell Dzurick
Affects                   Status       Importance  Assigned to       Milestone
rabbitmq-server (Ubuntu)  New          Undecided   Mitchell Dzurick
Focal                     In Progress  Undecided   Unassigned
Jammy                     In Progress  Undecided   Unassigned

Bug Description

[ Impact ]
This bug tracks an update of the rabbitmq-server package in Ubuntu to the following versions:
 * Focal (20.04): rabbitmq-server 3.8.3
 * Jammy (22.04): rabbitmq-server 3.9.27
(NOTE) - Jammy is only updating to 3.9.27 because 3.9.28 requires Erlang 24.3. If Erlang updates in the future, then we can upgrade further.
(NOTE) - Focal is only updating to 3.8.3 from 3.8.2 because 3.8.4 requires etcd v3.4.
This is the first MRE of rabbitmq-server.
Upstream has a very rapid release cadence with micro releases that contain many bug fixes that would be good to bring into our LTS releases.
One major hurdle is the lack of proper dep8 tests. A limited suite of dep8 tests was created for this MRE and is planned to be integrated into newer releases once approved.
rabbitmq-server is a complicated package, and the new dep8 tests cannot cover everything. Therefore our OpenStack charms CI/CD ran the new version to provide more confidence in the package, and to at least verify that our workflow works. The results of these runs can be found at https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/915836.
In addition, Jammy (and only Jammy) has GitHub workflows to build and test the package; the results can be found at https://github.com/mitchdz/rabbitmq-server-3-9-27-tests/actions/runs/8955069098/job/24595393599.

Changelogs can be found at https://github.com/rabbitmq/rabbitmq-server/tree/main/release-notes

The Jammy micro release introduces a new ERL_MAX_PORTS variable. A delta was introduced in Ubuntu to respect the previous values on upgrade, so no action is required by the user and a micro-release upgrade cannot cause unforeseen downtime by limiting the max Erlang ports.

[ Test Plan ]
The test plan for rabbitmq-server involves 3 different types of tests.

(A) OpenStack CI/CD
This is what we run for CI/CD. Testing the newer version in CI/CD exercises real-world use cases, and is the minimum that should be done to ensure our own tooling works. The tester will need to ask the OpenStack team to run the new version. An example run, as mentioned before, is:

https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/915836

(B) dep8 tests
New dep8 tests were added to the package which must pass. These cover simple, but real use-cases.

(C) Upgrade testing
1. lxc launch ubuntu:jammy j-vm --vm
2. lxc shell j-vm
3. sudo apt install -y rabbitmq-server
4. Enable proposed
5. sudo apt install -y rabbitmq-server
# ensure no errors or issues during upgrade
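
Step 4 ("Enable proposed") can be sketched as below. This is only an illustration: the mirror URL and component list assume an amd64 system using archive.ubuntu.com, and the helper name `proposed_source_line` and the file name `proposed.list` are hypothetical choices, not part of the package or this test plan.

```shell
#!/bin/sh
# Print the apt source line that enables the -proposed pocket for a release.
# Assumption: amd64 archive mirror; other architectures use a different host.
proposed_source_line() {
    release="$1"
    printf 'deb http://archive.ubuntu.com/ubuntu %s-proposed main universe\n' "$release"
}

# Possible usage inside the VM (steps 4 and 5):
#   proposed_source_line jammy | sudo tee /etc/apt/sources.list.d/proposed.list
#   sudo apt update
#   sudo apt install -y -t jammy-proposed rabbitmq-server
#   systemctl is-active rabbitmq-server
```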

For Jammy, also ensure ERL_MAX_PORTS and LimitNOFILE are correctly honored on upgrade. Checking ERL_MAX_PORTS is easy to do with the following Erlang command:
# rabbitmqctl eval 'erlang:system_info(port_limit).'

To test this, the following steps will be performed.
1. lxc launch ubuntu:jammy j-vm --vm
2. lxc shell j-vm
3. sudo apt install -y rabbitmq-server
4. Set LimitNOFILE to a higher value than the default
5. Upgrade
6. Check max open file handles and max Erlang ports; they should match.

A test should also be made for differing ERL_MAX_PORTS, specifically where LimitNOFILE > ERL_MAX_PORTS, and ensure the max open file handles and max erlang ports are respected on upgrade. This is the same as the steps above, except step 4 also adds the ERL_MAX_PORTS environment variable.
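
The check in step 6 could be scripted roughly as follows. This is a sketch under assumptions: the helper names `soft_nofile_limit` and `expect_limit` are hypothetical, the value 100000 is only an example, and the `head -n1` guard against multiple matching PIDs is an addition not present in the steps above.

```shell
#!/bin/sh
# Extract the soft "Max open files" limit from a /proc/<pid>/limits style file.
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Compare an observed value against the expected one and report the result.
expect_limit() {
    label="$1"; expected="$2"; actual="$3"
    if [ "$expected" = "$actual" ]; then
        echo "OK: $label = $actual"
    else
        echo "MISMATCH: $label expected $expected, got $actual"
        return 1
    fi
}

# Possible usage inside the VM after the upgrade (100000 is an example value):
#   nofile=$(soft_nofile_limit "/proc/$(pgrep -f rabbitmq | head -n1)/limits")
#   ports=$(rabbitmqctl eval 'erlang:system_info(port_limit).')
#   expect_limit LimitNOFILE 100000 "$nofile"
#   expect_limit ERL_MAX_PORTS 100000 "$ports"
```

For the LimitNOFILE > ERL_MAX_PORTS case, the two `expect_limit` calls would simply be given different expected values.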

[ Where problems could occur ]
* This is the first MRE of this package, so extra caution should be taken.
* Upgrading the server may cause downtime during upgrade.
* Upgrade failures can occur if users have misconfigured rabbitmq-server and the maintainer scripts attempt to stop/start the server.
* A change was made to Jammy to respect the maximum Erlang ports on upgrade. This change could cause erratic behavior.


description: updated
description: updated
Changed in rabbitmq-server (Ubuntu):
assignee: nobody → Mitchell Dzurick (mitchdz)
Changed in rabbitmq-server (Ubuntu Focal):
status: New → In Progress
Changed in rabbitmq-server (Ubuntu Jammy):
status: New → In Progress
no longer affects: rabbitmq-server (Ubuntu Mantic)
no longer affects: rabbitmq-server (Ubuntu Noble)
Revision history for this message
Paride Legovini (paride) wrote :

Focal MRE uploaded:

Uploading rabbitmq-server_3.8.3-0ubuntu0.1.dsc
Uploading rabbitmq-server_3.8.3.orig.tar.xz
Uploading rabbitmq-server_3.8.3-0ubuntu0.1.debian.tar.xz
Uploading rabbitmq-server_3.8.3-0ubuntu0.1_source.buildinfo
Uploading rabbitmq-server_3.8.3-0ubuntu0.1_source.changes

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Thanks Paride!

Just a heads up - let's please block both jammy/focal in the -proposed pocket until our openstack team can run the CI/CD with these versions.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Thank you for highlighting the change to the default concurrent client connections.

Regarding this change, specifically:

> Jammy Changes:
> - Notices:
> + Nodes now default to 65536 concurrent client connections instead of
> using the effective kernel open file handle limit.

Can you elaborate if in the case of Ubuntu Jammy that new limit of 65536 will be higher or lower than what was allowed before, assuming a default installation (i.e., deploy system, install rabbitmq-server)?

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Thanks for the question Andreas,

Looking into this, I noticed the defaults are already 65536!

root@j:~# dpkg -s rabbitmq-server | grep Version
Version: 3.9.13-1ubuntu0.22.04.2
root@j:~# cat /proc/$(pgrep -f rabbitmq)/limits | grep "Max open files"
Max open files 65536 65536 files

I thought wow, what a coincidence, maybe too much of a coincidence. It turns out that this default is _already_ available in the current Jammy version 3.9.13. Debian has set the cap via systemd service since 9 years ago[0]!

ubuntu@ip-172-31-36-170:~$ sudo systemctl show rabbitmq-server.service --property=LimitNOFILE
LimitNOFILE=65536

[0] - https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/-/commit/6ced1e42f48d114edf9140ce6e04b1d6581515cd

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I added that changelog entry due to v3.9.23 making that change[0].

Before pushing, I would like to remove that changelog entry so as not to cause any confusion, but will wait for the rest of the review in case some more changes are needed.

[0] - https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.23

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

One more thing I'd like to have a discussion on - if rabbitmq-server's database has open queues, they will be lost on upgrade. This is already the current behavior, and I assume this is fine, but I'd like to at the very least bring it up.

For example, if you open up an lxc shell, and run the new hello-world test that is in the MP for both Jammy/Focal, you can see:

root@j:~# rabbitmqctl list_queues
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
hello 1
root@j:~# sudo apt install --reinstall rabbitmq-server
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 27 not upgraded.
Need to get 15.2 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 rabbitmq-server all 3.9.13-1ubuntu0.22.04.2 [15.2 MB]
Fetched 15.2 MB in 1s (11.7 MB/s)
(Reading database ... 36794 files and directories currently installed.)
Preparing to unpack .../rabbitmq-server_3.9.13-1ubuntu0.22.04.2_all.deb ...
Unpacking rabbitmq-server (3.9.13-1ubuntu0.22.04.2) over (3.9.13-1ubuntu0.22.04.2) ...
Setting up rabbitmq-server (3.9.13-1ubuntu0.22.04.2) ...
Processing triggers for man-db (2.10.2-1) ...
Scanning processes...

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
root@j:~# rabbitmqctl list_queues
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
root@j:~#

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Please trim the d/changelog file. It's not supposed to contain the entirety of upstream's changes. For inspiration, I suggest looking at the postgresql MRE ones. They highlight some important changes, just a few, and then link to the upstream release notes.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Andreas, got it. I'll update the section about the file handle limit change as well. Want me to just update the MP I already have up? Or should I make a new MP?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> I thought wow, what a coincidence, maybe too much of a coincidence. It turns out that this default is
> _already_ available in the current Jammy version 3.9.13. Debian has set the cap via systemd service since
> 9 years ago[0]!
>
> ubuntu@ip-172-31-36-170:~$ sudo systemctl show rabbitmq-server.service --property=LimitNOFILE
> LimitNOFILE=65536

So looking at the upstream release notes, where it says:

> 1. Pick a new limit value they would like to use, for instance, 100K
> 2. Set the maximum open file handle limit (for example, via `systemd`
> or similar tooling) for the OS user used by RabbitMQ to 100K
> 3. Set the ERL_MAX_PORTS environment variable to 100K

What I would like to determine is potential regressions on an update due to this change.

So let's imagine some scenarios:

a) jammy package, with the 65536 limit in the systemd unit
If the user wants to raise that limit, is it a matter of just bumping LimitNOFILE in the systemd unit? Assuming yes, and the new limit was set via an override (and not by editing the unit directly in /lib/systemd/system), and is working, what happens if they upgrade to the version from this update? I assume that bumped limit would be ignored, because now the kernel limit is ignored, and rabbit uses another mechanism to set the maximum concurrent connections. Can you check if this scenario would indeed regress?

b) with the updated package, how would the user raise the number of concurrent connections? I presume a change to the unit file is in order again. In fact, two changes, if I understood this correctly:
b1) bump LimitNOFILE
b2) export ERL_MAX_PORTS to the same value (new directive in the unit: it's not there today)

Is that it for (b)?
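
As an illustration of what (b1) and (b2) might look like together, here is a sketch that renders a systemd drop-in. It assumes the server picks up ERL_MAX_PORTS from the unit's environment, as the upstream release notes quoted above suggest; the helper name `render_limit_dropin`, the file name `limits.conf`, and the value 100000 are hypothetical.

```shell
#!/bin/sh
# Print a systemd drop-in that sets the open file limit and ERL_MAX_PORTS
# to the same value.
render_limit_dropin() {
    limit="$1"
    cat <<EOF
[Service]
LimitNOFILE=${limit}
Environment=ERL_MAX_PORTS=${limit}
EOF
}

# Possible usage (drop-in file name is a hypothetical choice):
#   sudo mkdir -p /etc/systemd/system/rabbitmq-server.service.d
#   render_limit_dropin 100000 | sudo tee /etc/systemd/system/rabbitmq-server.service.d/limits.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart rabbitmq-server
```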

If (a) indeed regresses, how could we prevent that regression from happening? Without thinking much, three options jump to my mind:
1) revert the upstream change from 3.9.23 that introduced this
2) have some way to automatically export ERL_MAX_PORTS to match LimitNOFILE, maybe with a wrapper script (sounds cumbersome)
3) limit the MRE to 3.9.22, just prior to the change in 3.9.23. Would have to check what we are missing all the way up to 3.9.27 and if it's still worth updating to 3.9.22.
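
To make option 2 more concrete, a wrapper could derive ERL_MAX_PORTS from the limit the process actually runs with. This is a hypothetical sketch, not what the package does: the function name `effective_max_ports` is made up, and it relies on the fact that a process started by systemd with LimitNOFILE set sees that value via `ulimit -n`.

```shell
#!/bin/sh
# Decide which ERL_MAX_PORTS value to use: keep an explicit setting,
# otherwise follow the soft open file limit of the current process.
effective_max_ports() {
    explicit="$1"   # current ERL_MAX_PORTS value, may be empty
    nofile="$2"     # e.g. the output of `ulimit -n`
    if [ -n "$explicit" ]; then
        echo "$explicit"
    elif [ "$nofile" = "unlimited" ]; then
        echo ""     # no limit configured: leave ERL_MAX_PORTS unset
    else
        echo "$nofile"
    fi
}

# Possible usage in a hypothetical wrapper script:
#   ports=$(effective_max_ports "${ERL_MAX_PORTS:-}" "$(ulimit -n)")
#   if [ -n "$ports" ]; then export ERL_MAX_PORTS="$ports"; fi
#   exec rabbitmq-server "$@"
```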

I think that if going with >= 3.9.23, we should at least export ERL_MAX_PORTS in the systemd unit file, so that users checking that file will at least be aware of that variable and what it does. But that would still not solve the possible regression in behavior.

Note I don't think the scenario in (a) would make rabbit fail to start, it would just serve fewer connections, and eventually services trying to connect would fail, if I understood this correctly.

What do you think?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> I added that changelog entry due to v3.9.23 making that change[0].

The ERL_MAX_PORTS/max connections change (not necessarily the full steps for the workaround) should definitely be in the changelog, but you wrote 218 lines of changelog :)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> if rabbitmq-server's database has open queues, they will be lost on upgrade.

Isn't queue persistence a property of the queue?

Quick googling found this[1], but I'm sure there are more details out there.

1. https://www.rabbitmq.com/docs/classic-queues#:~:text=Persistence%20(Durable%20Storage)%20in%20Classic,transient%20or%20messages%20are%20transient
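
A quick way to see which queues would not survive a broker restart is to list the `durable` property next to each queue. The sketch below filters that listing; it assumes the tab-separated output that rabbitmqctl prints for list commands, and the helper name `transient_queues` is hypothetical. Note that a durable queue alone is not enough: messages also have to be published as persistent to survive a restart.

```shell
#!/bin/sh
# Read `rabbitmqctl list_queues name durable messages` output on stdin and
# print the queues that are not durable. Banner and header lines are skipped
# because their second tab-separated field is never "false".
transient_queues() {
    awk -F'\t' '$2 == "false" { print $1 " (" $3 " messages)" }'
}

# Possible usage before an upgrade:
#   rabbitmqctl list_queues name durable messages | transient_queues
```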

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I also just realized that this bug does not have the SRU template filled in, please do that.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Regarding comment #10, I was planning on documenting the need to set the environment variable, but hadn't considered having systemd export the variable. I really like that idea.

I'll test and include that change in my MP soon.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I will reject this upload from the unapproved queue due to the changes we discussed above (changelog, and the unit file with the env var), and there will probably be some more testing of the upgrade scenarios which I will work on with Mitchell.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

In the focal changelog, do you have more details about this change:

      * An undocumented "automagic login" feature on the login form was
       removed.

Any idea how it could have been used? Even though it says it's undocumented, if it was heavily used anyway, then we should consider this change.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

https://github.com/rabbitmq/rabbitmq-management/pull/748

In the comment chain:
> As this wasn't a documented feature, I didn't think that it could be considered a breaking change. Let's discuss privately re PCF. I am not aware of anyone other than us using this feature.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I'll see if I can find evidence that it was used by many people, but it doesn't seem like there was a lot of blowing up in the pull request after it was merged.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I'm investigating the jammy MRE - specifically the addition of ERL_MAX_PORTS. I'm not a fan of the ideas I have so far. The concern is that a user may have increased the cap, and the new default would silently cap it again on upgrade, which could cause failures.

This of course will be documented in the d/changelog, but I don't think a "you didn't look at d/changelog" is a valid excuse to let that behavior happen, it should be expected that if you modify something as simple as the open file handle limit, that should be respected on upgrade.

The current ideas/why I don't like them:

1. parse the systemd unit configuration - add a new line to add the environment variable to the systemd unit on upgrade
    - Just not a fan of parsing like this, in-case the user has made other modifications such as various comments that could mangle the parsing.

2. Remove the addition of this environment variable
    - This diverges from upstream and could cause confusion when looking at upstream docs. This might not be an issue in general, but I think in this case it should be considered one, because having a too high value could cause unexpected DoS and be difficult to debug.

3. modify the upstream code to, at runtime, match ERL_MAX_PORTS from the systemd unit, and leave it uncapped if LimitNOFILE is not present
    - Same issue as #2, diverges from upstream. Also, this assumes that the user knowingly disabled the limit in systemd and is not using another mechanism to cap the open file handle limit. This has the potential of not playing nice with whatever mechanism the user decides to use.

Due to this, I am thinking to have the Jammy MRE upgrade to version v3.9.22 instead, as the bug fixes between v3.9.{23..27} are of course nice to have, but I think are fine to skip out on.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

As per the previous comment - I will do some research on v3.9.22 for Jammy, ensuring that no bugs are introduced there that are fixed in a more recent version, and proceed to do the MRE for that version instead.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I prepared a v3.9.22 in Jammy and have run the autopkgtest. I am seeing a failure[0] with the new rpc test on ppc64el. I re-ran it twice so far with the same timeout, and am waiting for a third run to see the results.

The logs unfortunately do not have a lot of useful information, so I'm hoping it's just a flaky situation, although a new test being flaky is not ideal.

[0] - https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-mitchdz-rabbitmq-server-mre-2204-v2/jammy/ppc64el/r/rabbitmq-server/20240619_044237_d738a@/log.gz

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :
Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Finally got my hands on a ppc64el system. Running the rpc test manually now.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Okay, manually ran the rpc test on a ppc64el system and it passed fine.

Ran on bobonevm4

$ dpkg -s rabbitmq-server | grep Version:
Version: 3.9.22-0ubuntu0.1~jammy1
$ python3 rpc_server.py &
[1] 5069
$ client_result=$(python3 rpc_client.py)
 [.] fib(30)
$ echo $client_result
[x] Requesting fib(30) [.] Got 832040

This is just from manually copying and pasting the contents of the test. Maybe the autopkgtest runner was flaky? I just re-ran another test from the PPA so will wait to hear back from that.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Okay quick response - this time the dep8 test passed in the LP runners[0].

3 dep8 test failures in a row seems fishy though, so I'd like to run the test a few more times to see if it is really flaky, or if the runners were just sad on the previous runs. Will run 2 more tests now.

[0] - https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-mitchdz-rabbitmq-server-mre-2204-v2/jammy/ppc64el/r/rabbitmq-server/20240626_202946_36422@/log.gz

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

While waiting for the new autopkgtest to publish and run, I am going to attempt to kick off the upstream CI/CD tests for 3.9.22.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This still needs the SRU template filled in.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Apologies for the delay, been busy with +1.

description: updated
description: updated