MRE updates of rabbitmq-server for Jammy, Focal

Bug #2060248 reported by Mitchell Dzurick
Affects / Status / Importance / Assigned to:
  rabbitmq-server (Ubuntu): New / Undecided / Mitchell Dzurick
  rabbitmq-server (Ubuntu Focal): In Progress / Undecided / Unassigned
  rabbitmq-server (Ubuntu Jammy): In Progress / Undecided / Unassigned

Bug Description

This bug tracks an update for the rabbitmq-server package in Ubuntu.

This bug tracks an update to the following versions:

 * Focal (20.04): rabbitmq-server 3.8.3
 * Jammy (22.04): rabbitmq-server 3.9.27

(NOTE) - Jammy is only updating to 3.9.27 because 3.9.28 requires Erlang 24.3. If Erlang is updated in the future, we can upgrade further.
(NOTE) - Focal is only updating from 3.8.2 to 3.8.3 because 3.8.4 requires etcd v3.4.

This is the first MRE of rabbitmq-server.

Upstream has a very rapid release cadence, with micro releases containing many bug fixes that would be good to bring into our LTS releases.

One major hurdle is the lack of proper dep8 tests. A limited suite of dep8 tests was created for this MRE, with the plan to integrate them into newer releases once approved.
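For context, the new tests are wired up through debian/tests/control. A rough sketch of such a stanza is shown below; the test names come from this MRE, but the dependencies and restrictions are illustrative guesses rather than the exact content of the upload:

```
# Sketch only; Depends and Restrictions here are guesses, not the real stanza.
Tests: hello-world publish-subscribe rpc work-queue
Depends: rabbitmq-server, python3-pika
Restrictions: allow-stderr, isolation-container
```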

rabbitmq-server is a complicated package and the new dep8 tests will not be able to cover everything, so our OpenStack charms CI/CD was run against the new version to provide more confidence in the package and to at least verify that our workflow works. The results of these runs can be found at https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/915836.

In addition to this, only Jammy has GitHub workflows to build and test the package; the results can be found at https://github.com/mitchdz/rabbitmq-server-3-9-27-tests/actions/runs/8955069098/job/24595393599.

Reviewing the changes, there is only one change that I want to bring to attention: version 3.9.23 (https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.23) introduces the following change:
Nodes now default to 65536 concurrent client connections instead of using the effective kernel open file handle limit
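For anyone affected by this after the update, a minimal sketch of the extra configuration on Ubuntu could look like the following (the drop-in file name and the 100K figure are illustrative, and it assumes the stock rabbitmq-server.service unit):

``` sh
# Create a systemd drop-in that raises the open file handle limit and tells
# the Erlang runtime to allow the same number of ports (connections/files).
sudo mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=102400
# RabbitMQ >= 3.9.23 no longer derives its connection limit from the kernel
# open file handle limit, so ERL_MAX_PORTS has to be set explicitly.
Environment=ERL_MAX_PORTS=102400
EOF
sudo systemctl daemon-reload
sudo systemctl restart rabbitmq-server
```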

------------------------------------------------------------------------------

Jammy Changes:
  - Notices:
    + Nodes now default to 65536 concurrent client connections instead of
      using the effective kernel open file handle limit. Users who want to
      override this default, that is, have nodes that should support more
      concurrent connections and open files, now have to perform an additional
      configuration step:

      1. Pick a new limit value they would like to use, for instance, 100K
      2. Set the maximum open file handle limit (for example, via `systemd`
         or similar tooling) for the OS user used by RabbitMQ to 100K
      3. Set the ERL_MAX_PORTS environment variable to 100K

      This change was introduced because of a change in several Linux
      distributions: they now use a default open file handle limit so high
      that it causes significant (say, 1.5 GiB) memory preallocation by the
      Erlang runtime.
  - Updates:
    + Free disk space monitor robustness improvements.
    + `raft.adaptive_failure_detector.poll_interval` exposes aten's
      poll_interval setting to RabbitMQ users. Increasing it can reduce the
      probability of false positives in clusters where inter-node
      communication links are used at close to maximum capacity. The default
      is `5000` (5 seconds).
    + When both `disk_free_limit.relative` and `disk_free_limit.absolute`,
      or both `vm_memory_high_watermark.relative` and
      `vm_memory_high_watermark.absolute` are set, the absolute settings will
      now take precedence.
    + New key supported by `rabbitmqctl list_queues`:
      `effective_policy_definition` that returns merged definitions of regular
      and operator policies effective for the queue.
    + New HTTP API endpoint, `GET /api/config/effective`, returns effective
      node configuration. This is an HTTP API counterpart of
      `rabbitmq-diagnostics environment`.
    + Force GC after definition import to reduce peak memory load by mostly
      idle nodes that import a lot of definitions.
    + A way to configure an authentication timeout, much like in some other
      protocols RabbitMQ supports.
    + Windows installer: service startup is now optional. More environment
      variables are respected by the installer.
    + In environments where DNS resolution is not yet available at the time
      RabbitMQ nodes boot and try to perform peer discovery, such as CoreDNS
      with default caching interval of 30s on Kubernetes, nodes now will
      retry hostname resolution (including of their own host) several times
      with a wait interval.
    + Prometheus plugin now exposes one more metric, process_start_time_seconds,
      the moment of node process startup in seconds.
    + Reduce log noise when `sysctl` cannot be accessed by node memory
      monitor.
    + Shovels now handle consumer delivery timeouts gracefully and restart.
    + Optimization: internal message GUID is no longer generated for quorum
      queues and streams, as they are specific to classic queues.
    + Two more AMQP 1.0 connection lifecycle events are now logged.
    + TLS configuration for inter-node stream replication connections now can
      use function references and definitions.
    + Stream protocol connection logging is now less verbose.
    + Max stream segment size is now limited to 3 GiB to avoid a potential
      stream position overflow.
    + Logging messages that use microseconds now use "us" for the SI symbol to
      be compatible with more tools.
    + Consul peer discovery now supports client-side TLS options, much like
      its Kubernetes and etcd peers.
    + A minor quorum queue optimization.
    + 40 to 50% throughput improvement for some workloads where AMQP 0-9-1
      clients consumed from a stream (https://rabbitmq.com/stream.html).
    + Configuration of fallback secrets for Shovel and Federation credential
      obfuscation. This feature allows for secret rotation during rolling
      cluster node restarts.
    + Reduced memory footprint of individual consumer acknowledgements of
      quorum queue consumers.
    + `rabbitmq-diagnostics status` now reports crypto library (OpenSSL,
      LibreSSL, etc) used by the runtime, as well as its version details.
    + With a lot of busy quorum queues, nodes hosting a moderate number of
      leader replicas could experience growing memory footprint of one of the
      Raft implementation processes.
    + Re-introduced key file log rotation settings. Some log rotation settings
      were left behind during the migration to the standard runtime logger
      starting with 3.9.0. Now some key settings have been re-introduced.
    + Cleaned up some compiler options that are no longer relevant.
    + Quorum queues: better forward compatibility with RabbitMQ 3.10.
    + Significantly faster queue re-import from definitions on subsequent node
      restarts. Initial definition import still takes the same amount of time
      as before.
    + Significantly faster exchange re-import from definitions on subsequent
      node restarts. Initial definition import still takes the same amount of
      time as before.
    + RabbitMQ nodes will now filter out certain log messages related to
      connections, channels, and queue leader replicas receiving internal
      protocol messages sent to this node before a restart. These messages
      usually raise more questions and cause confusion than help.
    + More Erlang 24.3 `eldap` library compatibility improvements.
    + Restart of a node that hosted one or more stream leaders resulted in
      their consumers not "re-attaching" to the newly elected leader.
    + Large fanouts experienced a performance regression when streams were not
      enabled using a feature flag.
    + Stream management plugin did not support mixed version clusters.
    + Stream deletion did not result in a `basic.cancel` being sent to AMQP
      0-9-1 consumers.
    + Stream clients did not receive a correct stream unavailability error in
      some cases.
    + It is again possible to clear user tags and update the password in a
      single operation.
    + Forward compatibility with Erlang 25.
    + File handle cache efficiency improvements.
    + Unknown stream properties (e.g. those requested by a node that runs a
      newer version) are now handled gracefully.
    + Temporary hostname resolution issues (attempts that fail with `nxdomain`)
      are now handled more gracefully and with a delay of several seconds.
    + Build time compatibility with Elixir 1.13.
    + `auth_oauth2.additional_scopes_key` in `rabbitmq.conf` was not converted
       correctly during configuration translation and thus had no effect.
    + Adapt to a breaking Erlang 24.3 LDAP client change.
    + Shovels now can be declared with `delete-after` parameter set to `0`.
      Such shovels will immediately stop instead of erroring and failing to
      start after a node restart.
    + Support for Consul 1.1 response code changes
      when an operation is attempted on a non-existent health check.
  - Bug Fixes:
    + Classic queues with Single Active Consumer enabled could run into an
      exception.
    + When a global parameter was cleared,
      nodes emitted an internal event of the wrong type.
    + Fixed a type analyzer definition.
    + LDAP server password could end up in the logs in certain types of
      exceptions.
    + `rabbitmq-diagnostics status` now handles server responses where free
      disk space is not yet computed. This is the case with nodes early in the
      boot process.
    + Management UI links now include "noopener" and "noreferrer" attributes
      to protect them against reverse tabnabbing. Note that since management
      UI only includes a small number of external links to trusted resources,
      reverse tabnabbing is unlikely to affect most users. However, it can
      show up in security scanner results and become an issue in environments
      where a modified version of RabbitMQ is offered as a service.
    + Plugin could stop in environments where no static Shovels were defined
      and a specific sequence of events happens at the same time.
    + When installation directory was overridden, the plugins directory did
      not respect the updated base installation path.
    + Intra-cluster communication link metric collector could run into an
      exception when peer connection has just been re-established, e.g. after
      a peer node restart.
    + When a node was put into maintenance mode, it closed all MQTT client
      connections cluster-wide instead of just local client connections.
    + Reduced log noise from exceptions connections could run into when a
      client was closing its connection end concurrently with other activity.
    + `rabbitmq-env-conf.bat` on Windows could fail to load when its path
      contained spaces.
    + Stream declaration could run into an exception when stream parameters
      failed validation.
    + Some counters on the Overview page have been moved to global counters
      introduced in RabbitMQ 3.9.
    + Avoid an exception when MQTT client closes TCP connection before server
      could fully process a `CONNECT` frame sent earlier by the same client.
    + Channels on connections to mixed clusters that had 3.8 nodes in them
      could run into an exception.
    + Inter-node cluster link statistics did not have any data when TLS was
      enabled for them.
    + Quorum queues now correctly propagate errors when a `basic.get` (polling
      consumption) operation hits a timeout.
    + Stream consumer that used AMQP 0-9-1 instead of a stream protocol
      client, and disconnected, leaked a file handle.
    + Max frame size and client heartbeat parameters for RabbitMQ stream
      clients were not correctly set when taken from `rabbitmq.conf`.
    + Removed a duplicate exchange decorator set operation.
    + Node restarts could result in a hashing ring inconsistency.
    + Avoid seeding default user in old clusters that still use the deprecated
      `management.load_definitions` option.
    + Streams could run into an exception or fetch stale stream position data
      in some scenarios.
    + `rabbitmqctl set_log_level` did not have any effect on logging via
      `amq.rabbitmq.log`.
    + `rabbitmq-diagnostics status` is now more resilient and won't fail if
      free disk space monitoring repeatedly fails (gets disabled) on the node.
    + CLI tools failed to run on Erlang 25 because an old version of Elixir
      (compiled on Erlang 21) was used in the release pipeline. Erlang 25 no
      longer loads modules compiled on Erlang 21 or older.
    + Default log level used a four-character severity abbreviation instead of
      more common longer format, for example, `warn` instead of `warning`.
    + `rabbitmqctl set_log_level` documentation clarification.
    + Nodes now make sure that maintenance mode status table exists after node
      boot as long as the feature flag is enabled.
    + "In flight" messages directed to an exchange that has just been deleted
      will be silently dropped or returned back to the publisher instead of
      causing an exception.
    + `rabbitmq-upgrade await_online_synchronized_mirror` is now a no-op in
      single node clusters.
    + One metric that was exposed via CLI tools and management plugin's HTTP
      API was not exposed via Prometheus scraping API.
    + Stream delivery rate could drop if concurrent stream consumers consumed
      in a way that made them reach the end of the stream often.
    + If a cluster that had streams enabled was upgraded with a jump of
      multiple patch releases, stream state could fail an upgrade.
    + Significantly faster queue re-import from definitions on subsequent node
      restarts. Initial definition import still takes the same amount of time
      as before.
    + When a policy contained keys unsupported by a particular queue
      type, and was later updated or superseded by a higher priority policy,
      the effective optional argument list could become inconsistent (the
      policy would not have the expected effect).
    + Priority queues could run into an exception in some cases.
    + Maintenance mode could run into a timeout during queue leadership
      transfer.
    + Prometheus collector could run into an exception early on node's
      schema database sync.
    + Connection data transfer rate units were incorrectly displayed when
      rate was less than 1 kiB per second.
    + `rabbitmqadmin` now correctly loads TLS-related keys from its
      configuration file.
    + Corrected a help message for node memory usage tool tip.
* Added new dep8 tests:
  - d/t/hello-world
  - d/t/publish-subscribe
  - d/t/rpc
  - d/t/work-queue
* Remove patches fixed upstream:
  - d/p/lp1999816-fix-rabbitmqctl-status-disk-free-timeout.patch

------------------------------------------------------------------------------

Focal Changes:
* New upstream version 3.8.3 (LP: #2060248).
 - Updates:
   + Some Proxy protocol errors are now logged at debug level.
     This reduces log noise in environments where TCP load balancers and
     proxies perform health checks by opening a TCP connection but never
     sending any data.
   + Quorum queue deletion operation no longer supports the "if unused" and
     "if empty" options. They are typically used for transient queues and
     don't make much sense for quorum ones.
   + Do not treat applications that do not depend on rabbit as plugins.
     This is especially important for applications that should not be stopped
     before rabbit is stopped.
   + RabbitMQ nodes will now gracefully shutdown when receiving a `SIGTERM`
     signal. Previously the runtime would invoke a default handler that
     terminates the VM giving RabbitMQ no chance to execute its shutdown
     steps.
   + Every cluster now features a persistent internal cluster ID that can be
     used by core features or plugins. Unlike the human-readable cluster name,
     the value cannot be overridden by the user.
   + Sped up execution of boot steps by a factor of 2N, where N is the number
     of attributes per step.
   + New health checks that can be used to determine if it's a good moment to
     shut down a node for an upgrade.

     ``` sh
     # Exits with a non-zero code if target node hosts leader replica of at
     # least one queue that has out-of-sync mirror.
     rabbitmq-diagnostics check_if_node_is_mirror_sync_critical

     # Exits with a non-zero code if one or more quorum queues will lose
     # online quorum should target node be shut down
     rabbitmq-diagnostics check_if_node_is_quorum_critical
     ```
   + Management and Management Agent Plugins:
     * An undocumented "automagic login" feature on the login form was
       removed.
     * A new `POST /login` endpoint can be used by custom management UI login
       forms to authenticate the user and set the cookie.
     * A new `POST /rebalance/queues` endpoint that is the HTTP API equivalent
       of `rabbitmq-queues rebalance`
     * Warning about a missing `handle.exe` in `PATH` on Windows is now only
       logged every 10 minutes.
     * `rabbitmqadmin declare queue` now supports a new `queue_type` parameter
       to simplify declaration of quorum queues.
     * HTTP API request log entries now include the acting user.
     * Content Security Policy headers are now also set for static assets such
       as JavaScript files.
   + Prometheus Plugin:
     * Add option to aggregate metrics for channels, queues & connections.
       Metrics are now aggregated by default (safe by default). The previous
       per-object behaviour can be restored by setting
       `prometheus.return_per_object_metrics = true`.
   + Kubernetes Peer Discovery Plugin:
     * The plugin will now notify the Kubernetes API of node startup and peer
       stop/unavailability events.
   + Federation Plugin:
     * "Command" operations such as binding propagation now use a separate
       channel for all links, preventing latency spikes for asynchronous
       operations (such as message publishing) (a head-of-line blocking
       problem).
   + Auth Backend OAuth 2 Plugin:
     * Additional scopes can be fetched from a predefined JWT token field.
       Those scopes will be combined with the standard scopes field.
   + Trust Store Plugin:
     * HTTPS certificate provider will no longer terminate if upstream
       service response contains invalid JSON.
   + MQTT Plugin:
     * Avoid blocking when registering or unregistering a client ID.
   + AMQP 1.0 Client Plugin:
     * Handle heartbeat in `close_sent/2`.
 - Bug Fixes:
   + Reduced scheduled GC activity in connection socket writer to one run per
     1 GiB of data transferred, with an option to change the value or disable
     scheduled run entirely.
   + Eliminated an inefficiency in recovery of quorum queues with a backlog of
     messages.
   + In a case where a node hosting a quorum queue replica went offline and
     was removed from the cluster, and later came back, quorum queues could
     enter a loop of Raft leader elections.
   + Quorum queues with dead lettering could fail to recover.
   + The node can now recover even if the virtual host recovery terms file
     was corrupted.
   + Autoheal could fail to finish if one of its state transitions initiated
     by a remote node timed out.
   + Syslog client is now started even when Syslog logging is configured only
     for some log sinks.
   + Policies that quorum queues ignored were still listed as applied to them.
   + If a quorum queue leader rebalancing operation timed out, CLI tools
     failed with an exception instead of a sensible internal API response.
   + Handle timeout error on the rebalance function.
   + Handle and raise protocol error for absent queues assumed to be alive.
   + `rabbitmq-diagnostics status` failed to display the results when executed
     against a node that had high VM watermark set as an absolute value
     (using `vm_memory_high_watermark.absolute`).
   + Management and Management Agent Plugins:
     * Consumer section on individual page was unintentionally hidden.
     * Fix queue-type select by adding unsafe-inline CSP policy.
   + Etcd Peer Discovery Plugin:
     * Only run healthcheck when backend is configured.
   + Federation Plugin:
     * Use vhost to delete federated exchange.
* Added new dep8 tests:
  - d/t/smoke-test
  - d/t/hello-world
  - d/t/publish-subscribe
  - d/t/rpc
  - d/t/work-queue

Related branches

description: updated
description: updated
Changed in rabbitmq-server (Ubuntu):
assignee: nobody → Mitchell Dzurick (mitchdz)
Changed in rabbitmq-server (Ubuntu Focal):
status: New → In Progress
Changed in rabbitmq-server (Ubuntu Jammy):
status: New → In Progress
no longer affects: rabbitmq-server (Ubuntu Mantic)
no longer affects: rabbitmq-server (Ubuntu Noble)
Revision history for this message
Paride Legovini (paride) wrote :

Focal MRE uploaded:

Uploading rabbitmq-server_3.8.3-0ubuntu0.1.dsc
Uploading rabbitmq-server_3.8.3.orig.tar.xz
Uploading rabbitmq-server_3.8.3-0ubuntu0.1.debian.tar.xz
Uploading rabbitmq-server_3.8.3-0ubuntu0.1_source.buildinfo
Uploading rabbitmq-server_3.8.3-0ubuntu0.1_source.changes

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Thanks Paride!

Just a heads up - let's please block both jammy/focal in the -proposed pocket until our openstack team can run the CI/CD with these versions.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Thank you for highlighting the change to the default concurrent client connections.

Regarding this change, specifically:

> Jammy Changes:
> - Notices:
> + Nodes now default to 65536 concurrent client connections instead of
> using the effective kernel open file handle limit.

Can you elaborate on whether, in the case of Ubuntu Jammy, that new limit of 65536 will be higher or lower than what was allowed before, assuming a default installation (i.e., deploy the system, install rabbitmq-server)?

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Thanks for the question Andreas,

Looking into this, I noticed the defaults are already 65536!

root@j:~# dpkg -s rabbitmq-server | grep Version
Version: 3.9.13-1ubuntu0.22.04.2
root@j:~# cat /proc/$(pgrep -f rabbitmq)/limits | grep "Max open files"
Max open files 65536 65536 files

I thought wow, what a coincidence, maybe too much of a coincidence. It turns out that this default is _already_ in effect in the current Jammy version 3.9.13. Debian has set this cap in the systemd service for nine years[0]!

ubuntu@ip-172-31-36-170:~$ sudo systemctl show rabbitmq-server.service --property=LimitNOFILE
LimitNOFILE=65536

[0] - https://salsa.debian.org/openstack-team/third-party/rabbitmq-server/-/commit/6ced1e42f48d114edf9140ce6e04b1d6581515cd

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I added that changelog entry due to v3.9.23 making that change[0].

Before pushing, I would like to remove that changelog entry so as not to cause any confusion, but I will wait for the rest of the review in case more changes are needed.

[0] - https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.23

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

One more thing I'd like to discuss: if rabbitmq-server's database has open queues, they will be lost on upgrade. This is already the current behavior, and I assume it is fine, but I'd like to at the very least bring it up.

For example, if you open up an lxc shell, and run the new hello-world test that is in the MP for both Jammy/Focal, you can see:

root@j:~# rabbitmqctl list_queues
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
hello 1
root@j:~# sudo apt install --reinstall rabbitmq-server
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 27 not upgraded.
Need to get 15.2 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 rabbitmq-server all 3.9.13-1ubuntu0.22.04.2 [15.2 MB]
Fetched 15.2 MB in 1s (11.7 MB/s)
(Reading database ... 36794 files and directories currently installed.)
Preparing to unpack .../rabbitmq-server_3.9.13-1ubuntu0.22.04.2_all.deb ...
Unpacking rabbitmq-server (3.9.13-1ubuntu0.22.04.2) over (3.9.13-1ubuntu0.22.04.2) ...
Setting up rabbitmq-server (3.9.13-1ubuntu0.22.04.2) ...
Processing triggers for man-db (2.10.2-1) ...
Scanning processes...

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.
root@j:~# rabbitmqctl list_queues
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
root@j:~#

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Please trim the d/changelog file. It's not supposed to contain the entirety of upstream's changes. For inspiration, I suggest looking at the postgresql MRE ones. They highlight some important changes, just a few, and then link to the upstream release notes.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Andreas, got it. I'll update the section about the file handle limit change as well. Want me to just update the MP I already have up? Or should I make a new MP?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> I thought wow, what a coincidence, maybe too much of a coincidence. It turns out that this default is
> _already_ available in the current Jammy version 3.9.13. Debian has set the cap via systemd service since
> 9 years ago[0]!
>
> ubuntu@ip-172-31-36-170:~$ sudo systemctl show rabbitmq-server.service --property=LimitNOFILE
> LimitNOFILE=65536

So looking at the upstream release notes, where it says:

> 1. Pick a new limit value they would like to use, for instance, 100K
> 2. Set the maximum open file handle limit (for example, via `systemd`
> or similar tooling) for the OS user used by RabbitMQ to 100K
> 3. Set the ERL_MAX_PORTS environment variable to 100K

What I would like to determine is potential regressions on an update due to this change.

So let's imagine some scenarios:

a) jammy package, with the 65536 limit in the systemd unit
If the user wants to raise that limit, is it a matter of just bumping LimitNOFILE in the systemd unit? Assuming yes, and the new limit was set via an override (and not by editing the unit directly in /lib/systemd/system), and is working, what happens if they upgrade to the version from this update? I assume that bumped limit would be ignored, because now the kernel limit is ignored, and rabbit uses another mechanism to set the maximum concurrent connections. Can you check if this scenario would indeed regress?

b) with the updated package, how would the user raise the number of concurrent connections? I presume a change to the unit file is in order again. In fact, two changes, if I understood this correctly:
b1) bump LimitNOFILE
b2) export ERL_MAX_PORTS to the same value (new directive in the unit: it's not there today)

Is that it for (b)?

If (a) indeed regresses, how could we prevent that regression from happening? Without thinking much, three options jump to my mind:
1) revert the upstream change from 3.9.23 that introduced this
2) have some way to automatically export ERL_MAX_PORTS to match LimitNOFILE, maybe with a wrapper script (sounds cumbersome; see the sketch at the end of this comment)
3) limit the MRE to 3.9.22, just prior to the change in 3.9.23. Would have to check what we are missing all the way up to 3.9.27 and if it's still worth updating to 3.9.22.

I think that if going with >= 3.9.23, we should at least export ERL_MAX_PORTS in the systemd unit file, so that users checking that file will at least be aware of that variable and what it does. But that would still not solve the possible regression in behavior.

Note I don't think the scenario in (a) would make rabbit fail to start; it would just serve fewer connections, and eventually services trying to connect would fail, if I understood this correctly.

What do you think?
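To illustrate option (2), the wrapper could be something as small as the sketch below. The path and file name are hypothetical, and it assumes the unit's ExecStart could be pointed at such a script:

``` sh
#!/bin/sh
# Hypothetical wrapper (e.g. /usr/lib/rabbitmq/bin/rabbitmq-server-wrapper):
# export ERL_MAX_PORTS to match whatever open file handle limit systemd (or
# the admin) applied to this process, then hand over to the real server.
set -e
nofile="$(ulimit -n)"
if [ -n "$nofile" ] && [ "$nofile" != "unlimited" ]; then
    export ERL_MAX_PORTS="$nofile"
fi
exec /usr/sbin/rabbitmq-server "$@"
```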

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> I added that changelog entry due to v3.9.23 making that change[0].

The ERL_MAX_PORTS/max connections change (not necessarily the full steps for the workaround) should definitely be in the changelog, but you wrote 218 lines of changelog :)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> if rabbitmq-server's database has open queues, they will be lost on upgrade.

Isn't queue persistence a property of the queue?

Quick googling found this[1], but I'm sure there are more details out there.

1. https://www.rabbitmq.com/docs/classic-queues#:~:text=Persistence%20(Durable%20Storage)%20in%20Classic,transient%20or%20messages%20are%20transient
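A quick way to check that on a test instance (a sketch only; it assumes the management plugin is enabled and rabbitmqadmin is available) would be to declare a durable queue, publish a persistent message, and restart the broker:

``` sh
# A durable queue holding a persistent message (delivery_mode=2) is expected
# to survive a broker restart; transient queues and messages are not.
rabbitmqadmin declare queue name=hello-durable durable=true
rabbitmqadmin publish exchange=amq.default routing_key=hello-durable \
    payload="hello" properties='{"delivery_mode": 2}'
sudo systemctl restart rabbitmq-server
rabbitmqctl list_queues name durable messages
```

If the queues in the hello-world test are declared non-durable (or the messages are transient), losing them across the reinstall would be expected behaviour rather than a regression.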

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I also just realized that this bug does not have the SRU template filled in, please do that.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Regarding comment #10: I was planning on documenting the need to set the environment variable, but didn't consider having systemd export the variable. I really like that idea.

I'll test and include that change in my MP soon.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I will reject this upload from the unapproved queue due to the changes we discussed above (changelog, and the unit file with the env var), and there will probably be some more testing of the upgrade scenarios which I will work on with Mitchell.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

In the focal changelog, do you have more details about this change:

      * An undocumented "automagic login" feature on the login form was
       removed.

Any idea how it could have been used? Even though it says it's undocumented, if it was heavily used nevertheless, then we should reconsider this change.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

https://github.com/rabbitmq/rabbitmq-management/pull/748

In the comment chain:
> As this wasn't a documented feature, I didn't think that it could be considered a breaking change. Let's discuss privately re PCF. I am not aware of anyone other than us using this feature.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I'll see if I can find evidence that it was used by many people, but there doesn't seem to have been much pushback in the pull request after it was merged.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I'm investigating the jammy MRE - specifically the addition of ERL_MAX_PORTS. I'm not a fan of the ideas I have so far. The concern is that a user may have increased the cap, and on upgrade the new default would silently cap it again, which could cause failures.

This will of course be documented in d/changelog, but I don't think "you didn't look at d/changelog" is a valid excuse to let that behavior happen; it should be expected that if you modify something as simple as the open file handle limit, it will be respected on upgrade.

The current ideas/why I don't like them:

1. parse the systemd unit configuration - add a new line to set the environment variable in the systemd unit on upgrade
    - Just not a fan of parsing like this, in case the user has made other modifications, such as various comments, that could mangle the parsing.

2. Remove the addition of this environment variable
    - This diverges from upstream and could cause confusion when looking at upstream docs. That might not be an issue on its own, but I think it should be considered one here, because too high a value could cause an unexpected DoS and be difficult to debug.

3. modify the upstream code to, at runtime, match ERL_MAX_PORTS to LimitNOFILE from the systemd unit, and leave it uncapped if LimitNOFILE is not present
    - Same issue as #2, it diverges from upstream. Also, this assumes that the user knowingly disabled the limit in systemd and is not using another mechanism to cap the open file handle limit. This has the potential of not playing nicely with whatever mechanism the user decides to use.

Due to this, I am thinking of having the Jammy MRE upgrade to version v3.9.22 instead, as the bug fixes between v3.9.{23..27} are of course nice to have, but I think they are fine to skip.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

As per the previous comment - I will do some research into v3.9.22 for Jammy, ensuring that no bugs are present there that are fixed in a more recent version, and proceed to do the MRE for that version instead.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

I prepared a v3.9.22 upload for Jammy and have run the autopkgtests; I am seeing a failure[0] with the new rpc test on ppc64el. I have re-run it twice so far with the same timeout, and am waiting for a third run to see the results.

The logs unfortunately do not have a lot of useful information, so I'm hoping it's just a flaky situation, although a new test being flaky is not ideal.

[0] - https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-mitchdz-rabbitmq-server-mre-2204-v2/jammy/ppc64el/r/rabbitmq-server/20240619_044237_d738a@/log.gz

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :
Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Finally got my hands on a ppc64el system. Running the rpc test manually now.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Okay, manually ran the rpc test on a ppc64el system and it passed fine.

Ran on bobonevm4

$ dpkg -s rabbitmq-server | grep Version:
Version: 3.9.22-0ubuntu0.1~jammy1
$ python3 rpc_server.py &
[1] 5069
$ client_result=$(python3 rpc_client.py)
 [.] fib(30)
$ echo $client_result
[x] Requesting fib(30) [.] Got 832040

This is just from manually copy+pasting the contents of the test. Maybe the autopkgtest runner was flaky? I just re-ran another test from the PPA, so I will wait to hear back from that.

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

Okay quick response - this time the dep8 test passed in the LP runners[0].

3 dep8 test failures in a row seems fishy though, so I'd like to run the test a few more times to see if it is really flaky, or if the runners were just sad on the previous runs. Will run 2 more tests now.

[0] - https://autopkgtest.ubuntu.com/results/autopkgtest-jammy-mitchdz-rabbitmq-server-mre-2204-v2/jammy/ppc64el/r/rabbitmq-server/20240626_202946_36422@/log.gz

Revision history for this message
Mitchell Dzurick (mitchdz) wrote :

While waiting for the new autopkgtest to publish and run, I am going to attempt to kick off the upstream CI/CD tests for 3.9.22.
