RabbitMQ TCP backlog can be insufficient

Bug #1854704 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari
Milestone: (none)

Bug Description

From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

We need to tune the default RabbitMQ TCP listen backlog. Currently it defaults
to 128, but here is what happens:
Say we have 1500 total RabbitMQ client connections spread across a 3-node
cluster, evenly distributed so each node has 500 clients.

Then we stop RabbitMQ on one of the nodes.

Now those 500 client connections all immediately fail over to the other two
nodes. Assuming a roughly even split, each node gets 250 new connections
simultaneously. Since the TCP listen backlog is only 128, a large number of
the failover connections cannot connect and get ECONNREFUSED because the
kernel just drops them.

Eventually things retry and the backlog clears, but it makes things noisy in
the logs and makes failover take a little bit longer.
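
A quick way to confirm this on an affected broker host is to look at the
listener's accept queue and the kernel's listen-queue counters; a minimal
sketch, assuming Linux and the default AMQP port (5672):

    # For a LISTEN socket, ss reports the configured backlog in the Send-Q
    # column and the current accept-queue depth in Recv-Q.
    ss -ltn 'sport = :5672'

    # Cumulative counters for listen-queue overflows and dropped SYNs.
    netstat -s | grep -iE 'listen|overflow'

If those counters climb while clients fail over, the backlog is the bottleneck.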

The upstream docs discuss this tuning here:
https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog
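
The tuning described there boils down to raising the listener backlog in
rabbitmq.conf and the kernel limit that caps it; a minimal sketch, with 4096 as
an illustrative value (not necessarily what the tripleo-heat-templates change
sets; see the review below):

    # /etc/rabbitmq/rabbitmq.conf (new-style config format)
    tcp_listen_options.backlog = 4096

    # The kernel silently caps any listen() backlog at net.core.somaxconn,
    # so raise that as well:
    sysctl -w net.core.somaxconn=4096

Both knobs have to move together, since the effective backlog is the smaller of
the two values.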

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/696827
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0
Submitter: Zuul
Branch: master

commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/699462

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/699573

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/699462
Reason: Clearing the gate now, see https://bugs.launchpad.net/tripleo/+bug/1856864
Do not restore the patch yet, I'll take care of it when the gate is back online.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/699915

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/699916

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.opendev.org/699915
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5c8f9c67f1bd3eccaf68490b7432935deb171776
Submitter: Zuul
Branch: stable/rocky

commit 5c8f9c67f1bd3eccaf68490b7432935deb171776
Author: Michele Baldessari <email address hidden>
Date: Thu Dec 19 07:13:41 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/699916
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9e868e15de576ef6ae2d873edc7f6ffed469165d
Submitter: Zuul
Branch: stable/queens

commit 9e868e15de576ef6ae2d873edc7f6ffed469165d
Author: Michele Baldessari <email address hidden>
Date: Thu Dec 19 07:13:41 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)
    (cherry picked from commit 5c8f9c67f1bd3eccaf68490b7432935deb171776)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/699573
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=e7889891987c6108cb66ef06cd0f48868fc3809f
Submitter: Zuul
Branch: stable/stein

commit e7889891987c6108cb66ef06cd0f48868fc3809f
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/699462
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=24e723475c409b565307261d1b3a2248edb404b5
Submitter: Zuul
Branch: stable/train

commit 24e723475c409b565307261d1b3a2248edb404b5
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.3.1

This issue was fixed in the openstack/tripleo-heat-templates 11.3.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates rocky-eol

This issue was fixed in the openstack/tripleo-heat-templates rocky-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates stein-eol

This issue was fixed in the openstack/tripleo-heat-templates stein-eol release.
