Bug #1780348 "default gc_thresh settings for Linux are too small" : Bugs : OpenStack Neutron Open vSwitch Charm

James Page (james-page) wrote on 2018-07-16:

#1

This can be set using the sysctl flags configuration option, but I agree that increasing the defaults makes sense as well.

Changed in charm-nova-compute:
status:	New → Triaged
importance:	Undecided → Low

Dmitrii Shcherbakov (dmitriis) wrote on 2019-02-13:

#2

Just to add some more context.

The default ARP-related sysctl settings come from the kernel defaults (described here http://man7.org/linux/man-pages/man7/arp.7.html).

net.ipv4.neigh.default.gc_interval = 30
net.ipv4.neigh.default.gc_stale_time = 60
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

As soon as gc_thresh3 is hit, MAC learning stops, and it is up to gc to clear stale entries if it can (it doesn't delete static entries, for example):
http://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/tree/net/core/neighbour.c#n314
    net_info_ratelimited("%s: neighbor table overflow!\n",
                         tbl->id);

The hash table used for ARP table lookups in the kernel grows with the number of neighbor table entries:

https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/tree/net/core/neighbour.c?id=aa07f7dcb959603e1e6d56db7281b1d36bce9928#n395

https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/tree/net/core/neighbour.c?id=aa07f7dcb959603e1e6d56db7281b1d36bce9928#n532

    if (atomic_read(&tbl->entries) > (1 << nht->hash_shift))
        nht = neigh_hash_grow(tbl, nht->hash_shift + 1);
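
To make the growth rule in the quoted check concrete, here is a minimal Python sketch that replays it. It is not kernel code. The starting hash_shift of 3 is an assumption, and the function name is made up for illustration.

```python
def hash_shift_after(entries: int, initial_shift: int = 3) -> int:
    """Replay the quoted check for `entries` insertions and return hash_shift.

    initial_shift=3 is an assumption, not something stated in this bug.
    """
    shift = initial_shift
    for count in range(1, entries + 1):
        # Same condition as the kernel snippet: grow by one step (doubling
        # the bucket count) once entries exceed the number of buckets.
        if count > (1 << shift):
            shift += 1
    return shift


if __name__ == "__main__":
    for n in (128, 1024, 32768):
        shift = hash_shift_after(n)
        print(f"{n} entries: hash_shift={shift}, buckets={1 << shift}")
```

Under these assumptions the bucket count tracks the entry count, so raising gc_thresh3 raises the ceiling on the hash table size as well.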

* ARP table thresholds are not namespaced and can be modified on a per-system (kernel) basis. While ARP table entries have namespace affinity (`ip neigh` returns only entries relevant to a particular namespace), they share the same storage (the same global kernel neighbor table). So it is important to tune the global thresholds as we have many namespaces with their own contents (fip, qrouter, snat, dhcp);
* ARP table entries for floating IPs are only added to the source hypervisor host's FIP namespace ARP table (not to the destination hypervisor host's FIP namespace ARP table unless you ping a FIP from that namespace specifically);
* ARP table entries for remote DVR ports are added to destination FIP namespaces when ARP responses for FIPs are made (i.e. there may be as many entries as there are hypervisor hosts in the extreme case where VMs on every other hypervisor ping a FIP on one specific hypervisor);
* ARP table size will matter if you have a lot of east-west FIP to FIP communication.
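
Since the thresholds are global while `ip neigh` shows one namespace at a time, it can help to total the per-namespace counts before comparing them with gc_thresh3. The sketch below only parses text that you have already collected from `ip neigh show` in each namespace. The namespace names and sample lines are made up, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict


def count_neigh_entries(ip_neigh_output: str) -> Counter:
    """Count neighbour entries by state in `ip neigh show` text.

    Assumes one entry per line with the state (REACHABLE, STALE, ...)
    as the last whitespace-separated field.
    """
    states = Counter()
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        if fields:
            states[fields[-1]] += 1
    return states


def total_across_namespaces(outputs: Dict[str, str]) -> int:
    """Sum entry counts over all namespaces (name -> `ip neigh show` text)."""
    return sum(sum(count_neigh_entries(text).values()) for text in outputs.values())


if __name__ == "__main__":
    # Illustrative sample data only; not taken from a real system.
    sample = {
        "qrouter-example": (
            "10.0.0.1 dev qr-1 lladdr fa:16:3e:00:00:01 REACHABLE\n"
            "10.0.0.2 dev qr-1 lladdr fa:16:3e:00:00:02 STALE\n"
        ),
        "fip-example": "192.0.2.10 dev fg-1 lladdr fa:16:3e:00:00:03 PERMANENT\n",
    }
    print(total_across_namespaces(sample))
```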

Upstream kernel discussions around per-namespace tables and bumping up default limits:
https://lkml.org/lkml/2018/7/17/550

Example cloudinit-userdata config to enable this via a sysctl drop-in:

juju model-config userdata-sysctl-conf.yaml

$ cat userdata-sysctl-conf.yaml
cloudinit-userdata: |
  write_files:
  - content: |
      net.ipv4.neigh.default.gc_thresh1 = 16384
      net.ipv4.neigh.default.gc_thresh2 = 28672
      net.ipv4.neigh.default.gc_thresh3 = 32768
    owner: "root:root"
    path: /etc/sysctl.d/network-tuning.conf
    permissions: '0644'
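
One way to confirm that a drop-in like the one above took effect is to read the values back from /proc/sys. This is a minimal sketch that assumes the standard procfs layout for these keys. The function name and the `proc_root` parameter are illustrative and exist only to make the helper testable.

```python
from pathlib import Path


def read_gc_thresholds(family: str = "ipv4", proc_root: str = "/proc/sys") -> dict:
    """Return the current gc_thresh1..3 values for the given address family."""
    base = Path(proc_root) / "net" / family / "neigh" / "default"
    return {
        name: int((base / name).read_text().strip())
        for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    }


if __name__ == "__main__":
    # Reads the live values on a Linux host.
    print(read_gc_thresholds("ipv4"))
```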

James Page (james-page) wrote on 2019-02-22:

#3

Are:

net.ipv4.neigh.default.gc_thresh1=1024
net.ipv4.neigh.default.gc_thresh2=2048
net.ipv4.neigh.default.gc_thresh3=4096

or

net.ipv4.neigh.default.gc_thresh1 = 16384
net.ipv4.neigh.default.gc_thresh2 = 28672
net.ipv4.neigh.default.gc_thresh3 = 32768

sane changes to always make? What's a safe default? Does this only cover IPv4? If so, do we need to cover something similar for IPv6?

Changed in charm-neutron-gateway:
status:	New → Triaged
importance:	Undecided → Medium
Changed in charm-nova-compute:
importance:	Low → Medium

Jay Vosburgh (jvosburgh) wrote on 2019-02-22:

#4

James,

First, yes, IPv6 has separate neighbour table parameters:

net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

net.ipv6.neigh.default.gc_thresh1 = 128
net.ipv6.neigh.default.gc_thresh2 = 512
net.ipv6.neigh.default.gc_thresh3 = 1024

Second, given a choice between your two possible sets, I would suggest the second (higher values) set. Explanation to follow:

gc_thresh3 is the absolute limit for the table size, but gc_thresh2 is also important, as when the table size exceeds gc_thresh2 the kernel will aggressively prune entries (deleting entries older than 5 seconds) which could lead to large volumes of ARP or ndisc traffic in pathological situations.

The correct setting for these is ultimately workload dependent, and changing the baked-in kernel default to very large values makes little sense for the typical desktop install, for example. As described in a prior comment, the values may be tuned via cloud-init, so there seems to be no reason to modify the kernel defaults.

I would recommend setting gc_thresh2 to a value equal to the maximum expected number of neighbour entries for a system, plus some head room. The head room can be quite generous (25-50%), as there is no cost to unused capacity in the table. A single neighbour entry is roughly 0.5 KB, so there is minimal risk of excessive memory consumption if the table limit is too high (e.g., 10,000 bogus or stale entries is 5 MB of memory).

In this case, the gc_thresh3 should be set to a value comfortably above the gc_thresh2 value, e.g., another 25%-50% above gc_thresh2, thinking of it as surge capacity.
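
The sizing rule above can be sketched as a small calculation. The 25% defaults are simply the low end of the ranges suggested, the 512-byte entry size stands in for "roughly 0.5 KB", and the function name is made up. Treat the output as a starting point, not a recommendation.

```python
import math

ENTRY_SIZE_BYTES = 512  # "roughly 0.5 KB" per neighbour entry, per the comment


def suggest_thresholds(expected_entries: int, headroom: float = 0.25, surge: float = 0.25) -> dict:
    """Derive gc_thresh2 and gc_thresh3 from the expected maximum entry count."""
    gc_thresh2 = math.ceil(expected_entries * (1 + headroom))
    gc_thresh3 = math.ceil(gc_thresh2 * (1 + surge))
    # Memory used if the table fills all the way up to gc_thresh3.
    worst_case_mib = gc_thresh3 * ENTRY_SIZE_BYTES / (1024 * 1024)
    return {
        "gc_thresh2": gc_thresh2,
        "gc_thresh3": gc_thresh3,
        "worst_case_mib": worst_case_mib,
    }


if __name__ == "__main__":
    print(suggest_thresholds(10000))
```

gc_thresh1 is deliberately left out of the calculation, since its best value depends on how often IP to MAC mappings change.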

If your choice is to bake in either the 1024/2048/4096 or 16384/28672/32768 values, I would probably go with the higher values, with the following caveat, below.

Separately, it may also not be desirable to raise the gc_thresh1 to large values; the function of this setting is that if the table contains fewer than gc_thresh1 entries, the kernel will never remove (time out) stale entries. In a "home network" type of environment, this is a reasonable behavior, but in a cloud environment, if IP -> MAC address mappings change, a high gc_thresh1 may lead to hiccups in IP reachability. On the other hand, if the mappings will never, ever change, a high gc_thresh1 may reduce spurious ARP traffic.
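
As a summary of how the three thresholds interact, the sketch below maps a table size to the behaviour described in this thread. It is a simplified model of the description, not of the kernel code. The labels are made up, and the defaults are the kernel defaults listed above.

```python
def gc_behaviour(entries: int, gc_thresh1: int = 128, gc_thresh2: int = 512, gc_thresh3: int = 1024) -> str:
    """Classify the neighbour table state for a given number of entries."""
    if entries < gc_thresh1:
        # Below gc_thresh1 stale entries are never timed out.
        return "no-gc"
    if entries <= gc_thresh2:
        # Normal operation: periodic garbage collection.
        return "periodic-gc"
    if entries < gc_thresh3:
        # Above gc_thresh2 entries older than 5 seconds are pruned.
        return "aggressive-gc"
    # At gc_thresh3 new entries are refused unless gc can free space.
    return "hard-limit"


if __name__ == "__main__":
    for n in (100, 300, 600, 1024):
        print(n, gc_behaviour(n))
```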

James Troup (elmo) wrote on 2019-02-22: Re: [Bug 1780348] Re: default gc_thresh settings for Linux are too small

#5

James Page <email address hidden> writes:

> Are:
>
> net.ipv4.neigh.default.gc_thresh1=1024
> net.ipv4.neigh.default.gc_thresh2=2048
> net.ipv4.neigh.default.gc_thresh3=4096
>
> or
>
> net.ipv4.neigh.default.gc_thresh1 = 16384
> net.ipv4.neigh.default.gc_thresh2 = 28672
> net.ipv4.neigh.default.gc_thresh3 = 32768
>
> sane changes to always make?

FWIW, we've been running with the first set of values on both internal
and customer clouds for a couple of years now without issue. We're
now running a customer cloud on the 2nd set of values.

I'm also subscribing field-medium to this bug as the current default
values caused a cloud-wide outage today and we really need to get them
bumped up for at least neutron-gateway.

--
James

Ryan Beisner (1chb1n) on 2019-02-24

Changed in charm-nova-compute:
milestone:	none → 19.04
Changed in charm-neutron-gateway:
milestone:	none → 19.04
Changed in charm-nova-compute:
assignee:	nobody → Alex Kavanagh (ajkavanagh)
Changed in charm-neutron-gateway:
assignee:	nobody → Alex Kavanagh (ajkavanagh)

Ryan Beisner (1chb1n) on 2019-02-27

Changed in charm-neutron-gateway:
assignee:	Alex Kavanagh (ajkavanagh) → Pete Vander Giessen (petevg)
Changed in charm-nova-compute:
assignee:	Alex Kavanagh (ajkavanagh) → Pete Vander Giessen (petevg)
Changed in charm-neutron-gateway:
importance:	Medium → High
Changed in charm-nova-compute:
importance:	Medium → High

Pen Gale (pengale) on 2019-02-27

Changed in charm-neutron-openvswitch:
status:	New → In Progress
Changed in charm-nova-compute:
status:	Triaged → In Progress
Changed in charm-neutron-gateway:
status:	Triaged → In Progress
Changed in charm-neutron-openvswitch:
assignee:	nobody → Pete Vander Giessen (petevg)
milestone:	none → 19.04
importance:	Undecided → High

Pen Gale (pengale) wrote on 2019-02-27:

#6

Added charm-neutron-openvswitch on advice from @icey.

Also setting net.nf_conntrack_max and net.netfilter.nf_conntrack_max to one million, to address further potential issues.

Ryan Beisner (1chb1n) wrote on 2019-02-27:

#7

Track the status of patches at:

https://review.openstack.org/#/q/topic:bug/1780348+(status:open+OR+status:merged)

James Troup (elmo) wrote on 2019-02-27:

#8

Pete Vander Giessen <email address hidden> writes:

> Added charm-neutron-openvswitch on advice from @icey.
>
> Also setting net.nf_conntrack_max and net.netfilter.nf_conntrack_max to
> one million, to address further potential issues.

If we're changing nf_conntrack_max, we should also check that the
value of net.netfilter.nf_conntrack_buckets still makes sense.

And (as a much lower priority and likely to be much more
controversial) we should also consider reviewing the default
net.netfilter.nf_conntrack_tcp_timeout_established as, last I looked,
it's 5 days and that can negate a lot of the benefit of just raising
nf_conntrack_max.
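
To put rough numbers on the two points above, here is a small sketch. The target of 4 entries per bucket is an assumption based on the ratio older kernels used by default and is not a value taken from this bug. The function name is illustrative.

```python
# The 5 days mentioned above for tcp_timeout_established, in seconds.
ESTABLISHED_TIMEOUT_SECONDS = 5 * 24 * 60 * 60


def suggested_buckets(conntrack_max: int, entries_per_bucket: int = 4) -> int:
    """Bucket count that keeps the average chain at `entries_per_bucket` when full.

    entries_per_bucket=4 is an assumed target; tune it for your workload.
    """
    # Ceiling division without floating point.
    return -(-conntrack_max // entries_per_bucket)


if __name__ == "__main__":
    print(suggested_buckets(1_000_000))
    print(ESTABLISHED_TIMEOUT_SECONDS)
```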

--
James

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix proposed to charm-nova-compute (master)

#9

Fix proposed to branch: master
Review: https://review.openstack.org/639984

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix proposed to charm-neutron-gateway (master)

#10

Fix proposed to branch: master
Review: https://review.openstack.org/639985

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix merged to charm-neutron-gateway (master)

#11

Reviewed: https://review.openstack.org/639985
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=53b58388d37a3b2b2674989e09c216ae7ce76c9e
Submitter: Zuul
Branch: master

commit 53b58388d37a3b2b2674989e09c216ae7ce76c9e
Author: Pete Vander Giessen <email address hidden>
Date: Wed Feb 27 16:04:27 2019 +0100

Added gc_threshold overrides to sysctl.conf

    When clouds have a large number of hosts, the default size of the ARP
    cache is too small. The cache can overflow, which means that the
    system has no way to reach some ip addresses.

    Setting the threshold limits higher addresses the situation, in a
    reasonably safe way (the maximum impact is 5MB or so of additional RAM
    used). Docs on ARP at http://man7.org/linux/man-pages/man7/arp.7.html,
    and more discussion of the issue in the bug.

Change-Id: I701141784224f5f870f6da73a24bed8015694409
Closes-Bug: 1780348

Changed in charm-neutron-gateway:
status:	In Progress → Fix Committed
Changed in charm-neutron-openvswitch:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix merged to charm-neutron-openvswitch (master)

#12

Reviewed: https://review.openstack.org/639723
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=00ca87fec3b59d24665c7db5886647ea9b2ca114
Submitter: Zuul
Branch: master

commit 00ca87fec3b59d24665c7db5886647ea9b2ca114
Author: Pete Vander Giessen <email address hidden>
Date: Wed Feb 27 17:04:19 2019 +0100

Added gc_threshold overrides to sysctl.conf

    When clouds have a large number of hosts, the default size of the ARP
    cache is too small. The cache can overflow, which means that the
    system has no way to reach some ip addresses.

    Setting the threshold limits higher addresses the situation, in a
    reasonably safe way (the maximum impact is 5MB or so of additional RAM
    used). Docs on ARP at http://man7.org/linux/man-pages/man7/arp.7.html,
    and more discussion of the issue in the bug.

Change-Id: I329ec51eff85a2a99a929c67ff0c68b3b36d7273
Closes-Bug: 1780348

Changed in charm-nova-compute:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2019-02-28: Fix merged to charm-nova-compute (master)

#13

Reviewed: https://review.openstack.org/639984
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=c9a19c40777ea40d64c22070c1346c14708fabdf
Submitter: Zuul
Branch: master

commit c9a19c40777ea40d64c22070c1346c14708fabdf
Author: Pete Vander Giessen <email address hidden>
Date: Wed Feb 27 15:50:05 2019 +0100

Added gc_threshold overrides to sysctl.conf

    When clouds have a large number of hosts, the default size of the ARP
    cache is too small. The cache can overflow, which means that the
    system has no way to reach some ip addresses.

    Setting the threshold limits higher addresses the situation, in a
    reasonably safe way (the maximum impact is 5MB or so of additional RAM
    used). Docs on ARP at http://man7.org/linux/man-pages/man7/arp.7.html,
    and more discussion of the issue in the bug.

Change-Id: Iaf8382ee0b42e1444cfea589bb05a687cd0c23fa
Closes-Bug: 1780348

Pen Gale (pengale) wrote on 2019-02-28:

#14

Quick note on the fix here: we simply added some sensible settings to be the default value for sysctl in the charms in question. Operators can further change these values to whatever they see fit.

I also moved a sneaky vm swappiness setting in nova compute out of code and into the default config, where it's more obvious that it's there.

Nobuto Murata (nobuto) wrote on 2019-02-28:

#15

> I also moved a sneaky vm swappiness setting in nova compute out of code and into the default config, where it's more obvious that it's there.

Good catch. Ceph OSD charm removed vm.swappiness=1 some time ago after some discussions. As we tend to colocate nova-compute and ceph-osd as converged architecture, we might want to remove the default vm.swappiness from nova-compute as well.
https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=3527bf4ae1723a10f49774fef646aaa5b9fc0c45

OpenStack Infra (hudson-openstack) wrote on 2019-03-15: Related fix proposed to charm-nova-compute (master)

#16

Related fix proposed to branch: master
Review: https://review.openstack.org/643626

Edward Hope-Morley (hopem) wrote on 2019-03-15:

#17

@petevg ^^

OpenStack Infra (hudson-openstack) wrote on 2019-03-16: Related fix merged to charm-nova-compute (master)

#18

Reviewed: https://review.openstack.org/643626
Committed: https://git.openstack.org/cgit/openstack/charm-nova-compute/commit/?id=53efb5d2f4ea2bce6f6c6566ba83dab6652eb28f
Submitter: Zuul
Branch: master

commit 53efb5d2f4ea2bce6f6c6566ba83dab6652eb28f
Author: Edward Hope-Morley <email address hidden>
Date: Fri Mar 15 16:18:20 2019 +0000

Fixup commit c9a19c4

    Remove vm.swappiness setting as per [1] and
    add net.netfilter.nf_conntrack_buckets as
    per [2].

[1] https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=3527bf4ae1723a10f49774fef646aaa5b9fc0c45
[2] https://bugs.launchpad.net/charm-nova-compute/+bug/1780348/comments/8

Change-Id: I44506c94927bb93002b040db09d7cc7c1c99d133
Related-Bug: #1780348

OpenStack Infra (hudson-openstack) wrote on 2019-03-18: Related fix proposed to charm-neutron-gateway (master)

#19

Related fix proposed to branch: master
Review: https://review.openstack.org/643893

OpenStack Infra (hudson-openstack) wrote on 2019-03-18: Related fix proposed to charm-neutron-openvswitch (master)

#20

Related fix proposed to branch: master
Review: https://review.openstack.org/643898

OpenStack Infra (hudson-openstack) wrote on 2019-03-19: Related fix merged to charm-neutron-openvswitch (master)

#21

Reviewed: https://review.openstack.org/643898
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=9b094b8ef8855638696689d02a1b3d4c0822997c
Submitter: Zuul
Branch: master

commit 9b094b8ef8855638696689d02a1b3d4c0822997c
Author: Edward Hope-Morley <email address hidden>
Date: Mon Mar 18 09:22:39 2019 +0000

Fixup commit 00ca87f

Add net.netfilter.nf_conntrack_buckets as per [1].

[1] https://bugs.launchpad.net/charm-neutron-gateway/+bug/1780348/comments/8

Change-Id: I6c143230943668c31378349d2f4f92de537ced64
Related-Bug: #1780348

OpenStack Infra (hudson-openstack) wrote on 2019-03-19: Related fix merged to charm-neutron-gateway (master)

#22

Reviewed: https://review.openstack.org/643893
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=3bd24352677169034911874d225a4b8a6fd2018d
Submitter: Zuul
Branch: master

commit 3bd24352677169034911874d225a4b8a6fd2018d
Author: Edward Hope-Morley <email address hidden>
Date: Mon Mar 18 09:16:44 2019 +0000

Fixup commit 53b5838

Add net.netfilter.nf_conntrack_buckets as per [1].

[1] https://bugs.launchpad.net/charm-neutron-gateway/+bug/1780348/comments/8

Change-Id: I1aa261973dd34bdea519c3195f46a3cc0dfd863a
Related-Bug: #1780348

David Ames (thedac) on 2019-04-17

Changed in charm-nova-compute:
status:	Fix Committed → Fix Released
Changed in charm-neutron-gateway:
status:	Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status:	Fix Committed → Fix Released

Michael Boniface (mjboniface1) wrote on 2019-04-24:

#23

The bug fix to set sysctl defaults on the Neutron Gateway has broken OpenStack on LXD, as the variables are not available in containers; see https://ask.openstack.org/en/question/121359/error-neutron-gateway-in-openstack-on-lxd/. The workaround is to set the value to "" explicitly in the bundle so that it is set by the user rather than taken from the defaults.
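
One possible guard against this failure mode is to check which keys actually exist under /proc/sys before trying to apply them. The sketch below is a hypothetical helper and not what the charms do. It also assumes that every dot in a key is a path separator, which does not hold for keys containing interface names with dots in them.

```python
from pathlib import Path
from typing import Dict, Tuple


def split_available(settings: Dict[str, str], proc_root: str = "/proc/sys") -> Tuple[dict, dict]:
    """Split sysctl settings into (available, missing) based on procfs entries."""
    available, missing = {}, {}
    for key, value in settings.items():
        # net.ipv4.neigh.default.gc_thresh1 -> <proc_root>/net/ipv4/neigh/default/gc_thresh1
        path = Path(proc_root).joinpath(*key.split("."))
        (available if path.exists() else missing)[key] = value
    return available, missing


if __name__ == "__main__":
    wanted = {
        "net.ipv4.neigh.default.gc_thresh3": "32768",
        "net.netfilter.nf_conntrack_max": "1000000",
    }
    ok, skipped = split_available(wanted)
    print("apply:", ok)
    print("skip:", skipped)
```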

Peter Sabaini (peter-sabaini) wrote on 2019-05-15:

#24

Also see follow-on bug #1829047

Affects	Status	Importance	Assigned to	Milestone
OpenStack Neutron Gateway Charm	Fix Released	High	Pen Gale	OpenStack Neutron Gateway Charm 19.04
OpenStack Neutron Open vSwitch Charm	Fix Released	High	Pen Gale	OpenStack Neutron Open vSwitch Charm 19.04
OpenStack Nova Compute Charm	Fix Released	High	Pen Gale	OpenStack Nova Compute Charm 19.04