sync-mirror-pocket fails after 30 minutes

Bug #2020909 reported by Jorge Merlino
This bug affects 9 people
Affects                          Status         Importance  Assigned to  Milestone
Landscape Bundles                Fix Committed  Undecided   Kevin Nasto
Landscape Server                 Confirmed      Undecided   Kevin Nasto
OpenStack RabbitMQ Server Charm  Fix Committed  Wishlist    Kevin Nasto
Jammy                            Fix Released   Undecided   Unassigned

Bug Description

When a sync mirror command is issued and takes longer than 30 minutes to finish (which is not unusual when Landscape is installed for the first time), the activity shows as failed with the result text "No transition: delivered=>delivered".

In the job-handler.log file we can see this error:

'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more'

so this seems to be by design. Furthermore, after this error is shown, reprepro continues running and eventually the sync process finishes successfully even if it is shown as failed in Landscape.
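
The timeout in that error is expressed in milliseconds. A quick shell check, using only the value quoted in the log line above, confirms that it corresponds to the 30 minutes observed:

```shell
# Convert the consumer timeout from the error message (milliseconds) to minutes.
timeout_ms=1800000
timeout_min=$(( timeout_ms / 60000 ))
echo "${timeout_min} minutes"   # prints "30 minutes"
```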

In conclusion, I think 30 minutes is not long enough for this timeout, as sync activities can reasonably take several hours with large repositories and slower network connections.

information type: Proprietary → Public
Steven LaCosse (motosteven) wrote :

I am running into this issue as well during the initial sync of larger repos, for example when syncing the release pocket of bionic.

Issuing the command:

landscape-api sync-mirror-pocket release bionic ubuntu

fails after 30 minutes, with the same error in the log as Jorge's:

(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more', 0, 0) content = None

I do not see anything in the docs about configuring this timeout. Is there a workaround to change the value?

Rigoberto Sanchez (sanch1) wrote :

I've hit the same issue. If you look at the activity you can see the progress still going, so it's definitely still syncing, but I get those error messages too. I'm not sure where to change that setting either.

Changed in landscape:
status: New → Confirmed
Andy Wu (qch2012) wrote :

I've hit this issue in one of the deployments. The failure is caused by the default consumer_timeout value (30 minutes) in RabbitMQ. Depending on the network bandwidth, the first sync pocket activity can take more than 30 minutes. While the sync is in progress, RabbitMQ closes the session after the timeout, causing the activity to fail in Landscape.

The workaround is to increase or disable the consumer timeout before the first sync:

juju ssh rabbitmq-server/leader

cd /etc/rabbitmq
echo "
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
]. " | sudo tee /etc/rabbitmq/advanced.config

# restart rabbitmq
sudo systemctl restart rabbitmq-server

# ensure timeout is disabled
sudo rabbitmq-diagnostics environment | grep consumer_timeout
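
If disabling the timeout entirely is not desirable, a finite value can be written to the same file instead. The following is only a sketch under that assumption: the helper name render_advanced_config and the 6-hour figure are illustrative and do not come from this report. RabbitMQ expects the value in milliseconds.

```shell
# Hypothetical helper: print an advanced.config body with a finite consumer
# timeout. The argument is a number of hours, converted to milliseconds.
render_advanced_config() {
  hours="$1"
  ms=$(( hours * 60 * 60 * 1000 ))
  printf '[\n  {rabbit, [\n    {consumer_timeout, %d}\n  ]}\n].\n' "$ms"
}

# Preview the file contents for a 6-hour timeout (21600000 ms).
render_advanced_config 6

# To apply it on the RabbitMQ unit, pipe the output to
#   sudo tee /etc/rabbitmq/advanced.config
# and then restart rabbitmq-server as in the workaround above.
```

The preview step only prints the config, so the value can be checked before anything on the unit is touched.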

Peter Jose De Sousa (pjds) wrote :

Just a note on this bug - the reprepro process will continue running. The safest way to handle this situation is to locate the reprepro process and wait for it to complete.

DO NOT RUN THE UNBLOCK SCRIPT - this will just unblock Landscape, but reprepro will still be running. If the sync is triggered again while it runs, it will cause an inconsistent state.
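
One way to wait for reprepro to finish can be sketched as follows. This assumes pgrep is available on the Landscape server and that the process is named exactly reprepro. The function name and the 60-second polling interval are illustrative, not part of Landscape.

```shell
# Hypothetical helper: block until no process with the given exact name remains.
wait_for_process_exit() {
  name="$1"
  interval="${2:-60}"   # seconds between checks
  while pgrep -x "$name" >/dev/null 2>&1; do
    sleep "$interval"
  done
}

wait_for_process_exit reprepro 60
echo "reprepro is no longer running"
```

The function returns immediately if no matching process exists, so it is safe to run even when the sync has already completed.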

Peter Jose De Sousa (pjds) wrote :

Marking as field high because this affects multiple deployments.

Felipe Reyes (freyes) wrote :

Adding a task for charm-rabbitmq-server, since it seems the consumer_timeout config option should be exposed there.

Changed in charm-rabbitmq-server:
importance: Undecided → Wishlist
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: New → In Progress
Changed in landscape:
assignee: nobody → Kevin Nasto (silverdrake11)
Changed in charm-rabbitmq-server:
assignee: nobody → Kevin Nasto (silverdrake11)
Kevin Nasto (silverdrake11) wrote :

To recreate this issue quickly, set consumer_timeout = 60000 in RabbitMQ. This is equal to one minute (the lowest value RabbitMQ will support).

The following repo sync should take about 2 minutes, so with a one-minute consumer timeout it should not finish in time and will reproduce the bug. The activity should fail with "No transition: delivered=>delivered".

landscape-api create-distribution ubuntu

landscape-api create-series --pockets security --components restricted --architectures i386 --gpg-key mirror-key --mirror-uri http://archive.ubuntu.com/ubuntu/ --mirror-series focal focal ubuntu
landscape-api sync-mirror-pocket security focal ubuntu --json
landscape-api get-activities --query type:SyncPocketRequest --limit 1

Then delete the series, set a longer timeout, restart RabbitMQ, and try again. The sync should then succeed.

landscape-api remove-series focal ubuntu

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/920767
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/edf7791f4032951ccfaa336c7ccd734bdef1873d
Submitter: "Zuul (22348)"
Branch: master

commit edf7791f4032951ccfaa336c7ccd734bdef1873d
Author: Kevin Nasto <email address hidden>
Date: Wed May 29 12:26:51 2024 -0500

    Add consumer-timeout config option

    Exposes the consumer timeout configuration value that was changed in the jammy
    version of rabbitmq. The value used to default to unlimited, but was changed to
    30 minutes. This causes clients to timeout for longer-running jobs.

    This change doesn't alter the payload's default value, it override it only when
    the user provides a value via juju config.

    Closes-Bug: #2020909
    Related-Bug: #2067424

    Change-Id: I552c04c964c98b2a700cd47e98a2ae491f5fd47b

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Peter Jose De Sousa (pjds) wrote :

Thanks @Kevin - I'm going to need you to synchronise releases, not security. If I recall correctly, security is only a few gigabytes, whereas releases is a few hundred gigabytes.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (stable/jammy)
Felipe Reyes (freyes) wrote :

I'm leaving Focal out of the backport because that version deploys rabbitmq-server-3.8, where the consumer_timeout is unlimited, and we don't see the need to expose this config option for that payload. No users have reported issues with that default, so from a risk management perspective it's better not to do the backport.

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/920793
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/81011f9e0cc419c4361434427357bd8ffe88e9a9
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit 81011f9e0cc419c4361434427357bd8ffe88e9a9
Author: Kevin Nasto <email address hidden>
Date: Wed May 29 12:26:51 2024 -0500

    Add consumer-timeout config option

    Exposes the consumer timeout configuration value that was changed in the jammy
    version of rabbitmq. The value used to default to unlimited, but was changed to
    30 minutes. This causes clients to timeout for longer-running jobs.

    This change doesn't alter the payload's default value, it override it only when
    the user provides a value via juju config.

    Closes-Bug: #2020909
    Related-Bug: #2067424

    Change-Id: I552c04c964c98b2a700cd47e98a2ae491f5fd47b
    (cherry picked from commit edf7791f4032951ccfaa336c7ccd734bdef1873d)

Kevin Nasto (silverdrake11) wrote :
Changed in landscape-bundles:
status: New → Fix Committed
Changed in landscape-bundles:
assignee: nobody → Kevin Nasto (silverdrake11)
Mitch Burton (mitchburton) wrote :

I've done more testing:

deployed landscape-scalable on jammy
replaced rabbitmq-server with the new version with config consumer-timeout=19000000000
scaled rabbitmq-server up to 3 units
scaled landscape-server up to 3 units
scaled postgresql up to 3 units
created a repo mirror of jammy release amd64
synced mirror

The sync is still running, but it has been running for almost 3 hours now with no errors. I'll update when it's done.

I also confirmed that the consumer-timeout setting was set on all 3 rabbitmq-server units.
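
For context, here is a rough conversion of the value used in this test. It assumes the charm option is expressed in milliseconds, like RabbitMQ's own consumer_timeout; the report does not state the unit, so treat this as an assumption.

```shell
# Convert the consumer-timeout value used in the test to whole days,
# assuming (not confirmed in this report) that the unit is milliseconds.
consumer_timeout_ms=19000000000
ms_per_day=$(( 24 * 60 * 60 * 1000 ))
timeout_days=$(( consumer_timeout_ms / ms_per_day ))
echo "about ${timeout_days} days"

# The option would be set on the application with something like:
#   juju config rabbitmq-server consumer-timeout=19000000000
```

Under that assumption the timeout is far longer than any realistic mirror sync, which matches the multi-hour run completing without errors.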

Mitch Burton (mitchburton) wrote :

The aforementioned sync succeeded after about 5 hours.
