sysctl values applied by autotune are bad

Bug #1798794 reported by Edward Hope-Morley
This bug affects 1 person
Affects                Status        Importance  Assigned to     Milestone
Ceph OSD Charm         In Progress   High        Brett Milford
OpenStack Charm Guide  Fix Released  Medium      Peter Matulis

Bug Description

This is, in some sense, an extension of bug 1770171, which dealt with removing non-sane sysctl settings from the "sysctl" config option.

The ceph-osd charm has an autotune config option which, if set to True, will:

"attempt to tune your network card sysctls and hard drive settings.
This changes hard drive read ahead settings and max_sectors_kb.
For the network card this will detect the link speed and make
appropriate sysctl changes. Enabling this option should generally
be safe."

This essentially translates (for networking) to the following sysctls being set [1]:

'net.core.rmem_default': 524287,
'net.core.wmem_default': 524287,
'net.core.rmem_max': 524287,
'net.core.wmem_max': 524287,
'net.core.optmem_max': 524287,
'net.core.netdev_max_backlog': 300000,
'net.ipv4.tcp_rmem': '10000000 10000000 10000000',
'net.ipv4.tcp_wmem': '10000000 10000000 10000000',
'net.ipv4.tcp_mem': '10000000 10000000 10000000'

[1] https://github.com/openstack/charms.ceph/blob/master/ceph/utils.py#L105
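
For context, here is a minimal sketch of how a map like this might be persisted and applied, assuming the common /etc/sysctl.d drop-in pattern that Trent describes later in this bug; the file path and helper name are hypothetical and this is not the charm's actual code:

    import subprocess

    # Values mirror the map in [1]; path and function name are illustrative only.
    AUTOTUNE_NET_SYSCTLS = {
        'net.core.rmem_default': 524287,
        'net.core.wmem_default': 524287,
        'net.core.rmem_max': 524287,
        'net.core.wmem_max': 524287,
        'net.core.optmem_max': 524287,
        'net.core.netdev_max_backlog': 300000,
        'net.ipv4.tcp_rmem': '10000000 10000000 10000000',
        'net.ipv4.tcp_wmem': '10000000 10000000 10000000',
        'net.ipv4.tcp_mem': '10000000 10000000 10000000',
    }

    def apply_sysctls(settings, path='/etc/sysctl.d/51-ceph-osd-autotune.conf'):
        """Write a sysctl.d drop-in and load it immediately (illustrative only)."""
        with open(path, 'w') as f:
            for key, value in settings.items():
                f.write('{} = {}\n'.format(key, value))
        subprocess.check_call(['sysctl', '-p', path])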

While on the surface these settings might seem to ensure, for example, that packets never get dropped, setting netdev_max_backlog to 300000 is high enough that things can time out while packets sit in the enormous backlog queues this value permits to grow. If the system can't keep up with the incoming packet rate, this setting just adds a lot of latency to the mix without making things better. This sysctl is really a "surge" capacity, and it is doubtful that surges of 300,000 packets are commonplace in ceph deployments.

Also, setting 'net.ipv4.tcp_*mem' to the same high value for min/default/max is likely to lead to problems on heavily loaded systems:

comment from @jvosburgh:

"[this] is a limit for TCP as a whole (not a per-socket limit), and is measured in units of pages, not bytes as with tcp_rmem/wmem. Setting all of these to the same value will cause TCP to go from 'all fine here' immediately to 'out of memory' without any attempt to moderate its memory use before simply failing memory requests for socket buffer data. That's likely not what was intended. The value chosen, 10 million pages, which, at 4K per page is 38GB, is probably absurdly too high."

So we either need to remove the autotune feature completely or find saner values for these "optimisations", but it seems unlikely that we will find a magic one-size-fits-all set, and perhaps we should just let users manipulate their sysctls via the sysctl config option.

NOTE: removing these settings from the charm will ensure that they are never set for future deployments, but existing deployments will need to change these settings back to their prior values manually, since the charm will no longer have any knowledge of their existence.
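
Since those prior values would have to be tracked outside the charm, something along these lines could be used to record a host's live values before any change is made; the key list simply mirrors [1] and the output path is arbitrary, not anything the charm provides:

    import json
    import subprocess

    # Sysctls the autotune option touches for networking, per [1].
    AUTOTUNE_KEYS = [
        'net.core.rmem_default', 'net.core.wmem_default',
        'net.core.rmem_max', 'net.core.wmem_max',
        'net.core.optmem_max', 'net.core.netdev_max_backlog',
        'net.ipv4.tcp_rmem', 'net.ipv4.tcp_wmem', 'net.ipv4.tcp_mem',
    ]

    def record_current_values(outfile='/var/tmp/pre-change-sysctls.json'):
        """Save the live values so they can be compared or restored later."""
        values = {k: subprocess.check_output(['sysctl', '-n', k]).decode().strip()
                  for k in AUTOTUNE_KEYS}
        with open(outfile, 'w') as f:
            json.dump(values, f, indent=2)
        return values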

James Page (james-page)
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
James Page (james-page)
Changed in charm-ceph-osd:
status: New → Triaged
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: Medium → Low
importance: Low → High
Ryan Beisner (1chb1n) wrote :

Given the marginal usefulness in operations, the potential for highly undesirable impact, and the fact that it is difficult, if not impossible, to autotune with a one-size-fits-all approach, I would be in favor of deprecating this charm config option.

In the meantime, we should adjust (and backport) the config.yaml descriptions to reference this bug and to instill elevated caution in users who might consider turning it on.

David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
James Page (james-page)
Changed in charm-ceph-osd:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-ceph-osd:
milestone: 20.05 → 20.08
Vern Hart (vern) wrote :

Why does this keep slipping? Can we at least change the charm doc that says "Enabling this option should generally be safe" if we believe it's not safe?

Chris MacNaughton (chris.macnaughton) wrote :

It keeps "slipping" because it is just updated to the next Milestone. It is not on a roadmap as it hasn't been given enough priority. It would probably make sense for the OpenStack Charm bugs to not be given a milestone until they are actually targeted to one but that is the general practice.

Changed in charm-ceph-osd:
milestone: 20.08 → none
Changed in charm-ceph-osd:
assignee: nobody → Brett Milford (brettmilford)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739402

Changed in charm-ceph-osd:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/739402
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=08d56bb04064d4252b358f99aaeff7f024f8c6c0
Submitter: Zuul
Branch: master

commit 08d56bb04064d4252b358f99aaeff7f024f8c6c0
Author: Brett Milford <email address hidden>
Date: Mon Jul 6 15:23:37 2020 +1000

    Warning description for autotune config.

    Change-Id: Ieaccc18a39d018d120ae8bd6ee62b97f30d90e41
    Partial-Bug: #1798794

Changed in charm-guide:
assignee: nobody → Peter Matulis (petermatulis)
status: New → In Progress
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-guide (master)

Reviewed: https://review.opendev.org/739629
Committed: https://git.openstack.org/cgit/openstack/charm-guide/commit/?id=06f4cdaa65b38218511f6002c8b7ab02707acb85
Submitter: Zuul
Branch: master

commit 06f4cdaa65b38218511f6002c8b7ab02707acb85
Author: Brett Milford <email address hidden>
Date: Tue Jul 7 11:02:31 2020 +1000

    Add deprecation notice for ceph-osd autotune option

    Change-Id: I28e80de167bd24e4b03d01c8898aed9c709bc069
    Closes-Bug: #1798794

Changed in charm-guide:
status: In Progress → Fix Released
Trent Lloyd (lathiat) wrote :

It's worth noting that when this is removed we should add code to remove the /etc/sysctl.d files from disk. Reverting the actual running config values is trickier, as it is not entirely straightforward to reset them back to their defaults: the sysctl tool won't reset values that no longer exist in /etc/sysctl.d, and some of the options are calculated to different values on boot based on RAM size.

Additionally, if we did reset them, sysctl should then be re-run to parse /etc/sysctl.d again so that, if the same values were overridden in other files, they are set correctly again.

I previously tried to quantify this to help someone set the values back to the defaults; here is what I found.

For all values below, the line with the leading comment (#) is the default value and the uncommented line is the value set by the charm in an environment that was, I think, 10GbE (the option would pick different settings for 1G links).

#net.core.netdev_max_backlog=1000
net.core.netdev_max_backlog=300000

#net.core.wmem_max = 212992
net.core.wmem_max=524287

#net.ipv4.tcp_rmem = 4096 131072 6291456 #(default changes based on RAM, but maxes out at 131072 after about 8G ram)
net.ipv4.tcp_rmem=10000000 10000000 10000000

#net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_wmem=10000000 10000000 10000000

#net.ipv4.tcp_mem = 22995 30660 45990 (for 2GB)
#net.ipv4.tcp_mem = 1540740 2054321 3081480 (for 128GB)
#net.ipv4.tcp_mem = 3087000 4116000 6174000 (for 256GB)
#net.ipv4.tcp_mem = 6174000 8232000 12348000 (estimated for 512GB)
net.ipv4.tcp_mem=10000000 10000000 10000000

#net.core.wmem_default = 212992
net.core.wmem_default=524287

#net.core.optmem_max = 20480
net.core.optmem_max=524287

#net.core.rmem_default = 212992
net.core.rmem_default=524287

#net.core.rmem_max = 212992
net.core.rmem_max=524287

I don't recall but this was likely for a 4.4 or 4.15 kernel.
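
Putting these notes together, a rough sketch of what a manual revert could look like on an affected host; the drop-in path is hypothetical, and the RAM-dependent tcp_rmem/tcp_wmem/tcp_mem defaults are deliberately left out since they vary per machine:

    import os
    import subprocess

    # RAM-independent defaults captured in the comment above (4.4/4.15-era kernels).
    DEFAULTS = {
        'net.core.netdev_max_backlog': '1000',
        'net.core.rmem_default': '212992',
        'net.core.wmem_default': '212992',
        'net.core.rmem_max': '212992',
        'net.core.wmem_max': '212992',
        'net.core.optmem_max': '20480',
    }

    # Hypothetical location of the charm's autotune drop-in.
    CHARM_DROPIN = '/etc/sysctl.d/51-ceph-osd-autotune.conf'

    def revert_autotuned_sysctls():
        """Remove the drop-in, restore captured defaults, then re-apply system config."""
        if os.path.exists(CHARM_DROPIN):
            os.remove(CHARM_DROPIN)
        for key, value in DEFAULTS.items():
            subprocess.check_call(['sysctl', '-w', '{}={}'.format(key, value)])
        # Re-parse /etc/sysctl.d and friends so values legitimately overridden in
        # other files are applied again (sysctl --system reads all standard paths).
        subprocess.check_call(['sysctl', '--system'])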
