RabbitMQ TCP backlog can be insufficient

Bug #1854704 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari
Milestone: (none)

Bug Description

From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

We need to tune the default RabbitMQ TCP listen backlog. Currently it defaults
to 128, but here is what happens:
Say we have 1500 total RabbitMQ client connections spread across a 3-node
cluster, evenly distributed so each node has 500 clients.

Then we stop RabbitMQ on one of the nodes.

Now those 500 client connections all immediately fail over to the other two
nodes. Assuming a roughly even split, each node gets 250 new connections
simultaneously. Since the TCP listen backlog is only 128, a large number of
the failover connections cannot connect and get ECONNREFUSED because the
kernel just drops them.

Eventually things retry and the backlog clears, but it makes things noisy in
the logs and makes failover take a little bit longer.
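
A quick way to confirm this on an affected broker host is to look at the
listener's accept queue and the kernel's listen-queue counters; a minimal
sketch, assuming Linux and the default AMQP port (5672):

    # For a LISTEN socket, ss reports the configured backlog in the Send-Q
    # column and the current accept-queue depth in Recv-Q.
    ss -ltn 'sport = :5672'

    # Cumulative counters for listen-queue overflows and dropped SYNs.
    netstat -s | grep -iE 'listen|overflow'

If those counters climb while clients fail over, the backlog is the bottleneck.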

The upstream docs discuss this tuning here:
https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog
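
The tuning described there boils down to raising the listener backlog in
rabbitmq.conf and the kernel limit that caps it; a minimal sketch, with 4096 as
an illustrative value (not necessarily what the tripleo-heat-templates change
sets; see the review below):

    # /etc/rabbitmq/rabbitmq.conf (new-style config format)
    tcp_listen_options.backlog = 4096

    # The kernel silently caps any listen() backlog at net.core.somaxconn,
    # so raise that as well:
    sysctl -w net.core.somaxconn=4096

Both knobs have to move together, since the effective backlog is the smaller of
the two values.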

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/696827
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0
Submitter: Zuul
Branch: master

commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/699462

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/699573

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/train)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/699462
Reason: Clearing the gate now, see https://bugs.launchpad.net/tripleo/+bug/1856864
Do not restore the patch yet, I'll take care of it when the gate is back online.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/699915

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/699916

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.opendev.org/699915
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5c8f9c67f1bd3eccaf68490b7432935deb171776
Submitter: Zuul
Branch: stable/rocky

commit 5c8f9c67f1bd3eccaf68490b7432935deb171776
Author: Michele Baldessari <email address hidden>
Date: Thu Dec 19 07:13:41 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/699916
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9e868e15de576ef6ae2d873edc7f6ffed469165d
Submitter: Zuul
Branch: stable/queens

commit 9e868e15de576ef6ae2d873edc7f6ffed469165d
Author: Michele Baldessari <email address hidden>
Date: Thu Dec 19 07:13:41 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)
    (cherry picked from commit 5c8f9c67f1bd3eccaf68490b7432935deb171776)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/699573
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=e7889891987c6108cb66ef06cd0f48868fc3809f
Submitter: Zuul
Branch: stable/stein

commit e7889891987c6108cb66ef06cd0f48868fc3809f
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/699462
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=24e723475c409b565307261d1b3a2248edb404b5
Submitter: Zuul
Branch: stable/train

commit 24e723475c409b565307261d1b3a2248edb404b5
Author: Michele Baldessari <email address hidden>
Date: Mon Dec 2 09:03:06 2019 +0100

    Increase rabbitmq tcp backlog

    From https://bugzilla.redhat.com/show_bug.cgi?id=1778428

    We need to tune the default rabbitmq tcp listen backlog. Currently it defaults
    to 128, but here's what happens:
    Say we have 1500 total rabbitmq client connections spread across a 3 node
    cluster, evenly distributed so each node has 500 clients.

    Then, we stop rabbitmq on one of the nodes.

    Now those 500 client connections all immediately fail over to the other two
    nodes. Assume roughly even split, and each gets 250 connections simultaneously.
    Since the tcp listen backlog is only 128, a large number of the failover
    connections cannot connect and get ECONNREFUSED because the kernel just drops
    them.

    Eventually things retry and the backlog clears, but it just makes things noisy
    in the logs and makes failover take a little bit longer.

    Upstream docs discuss here:
    https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-connection-backlog

    Suggested-By: John Eckersberg <email address hidden>
    Closes-Bug: #1854704

    Change-Id: If6da4aff016db9a72e1cb9dfc9731f06e062f64d
    (cherry picked from commit 9f4832fcc4d939da3d4e7f83e26c4f934bff7dc0)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.3.1

This issue was fixed in the openstack/tripleo-heat-templates 11.3.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates rocky-eol

This issue was fixed in the openstack/tripleo-heat-templates rocky-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates stein-eol

This issue was fixed in the openstack/tripleo-heat-templates stein-eol release.
