CephOSD/Compute nodes crash under memory pressure unless custom tuned profile is used

Bug #1800232 reported by John Fulton
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: John Fulton
Milestone: (none)

Bug Description

Under very heavy I/O load, a hyperconverged OpenStack-Ceph cluster has collapsed multiple times, almost immediately, in the following chain reaction:

- OOM killer runs
- random nova instances and ceph-osd processes are killed
- Ceph goes into recovery mode
- OSDs consume much more memory than before
- OSDs hit cgroup limit and crash
- remaining OSDs come under even more pressure than before, and the cycle repeats
- Ceph reaches a point where some data is inaccessible (not lost) because too many OSDs are down
- VMs hang and eventually log stack traces reporting I/O timeouts of more than 120 seconds

The issue can be worked around by lowering the dirty ratio and disabling KSM. Here's a tuned profile:

[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
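
As a rough illustration of what the two sysctl ratios mean in bytes, here is a minimal Python sketch. It assumes a hypothetical host with 256 GiB of dirtyable memory; the kernel applies the ratios to available (free plus reclaimable) memory rather than total RAM, so real thresholds will differ. The 40/10 comparison values are an assumption about what the included throughput-performance profile sets and should be checked on the target system.

```python
GIB = 1024 ** 3


def dirty_thresholds(available_bytes, dirty_ratio, dirty_background_ratio):
    """Return (background_bytes, throttle_bytes) for the given percentages.

    background_bytes: dirty data at which background writeback starts.
    throttle_bytes: dirty data at which writing processes are made to
    write out dirty pages themselves.
    """
    background = available_bytes * dirty_background_ratio // 100
    throttle = available_bytes * dirty_ratio // 100
    return background, throttle


# Hypothetical figure, not taken from the bug report.
available = 256 * GIB

# Values from the profile above.
tuned_bg, tuned_throttle = dirty_thresholds(available, 10, 3)
# Assumed values of the included throughput-performance profile.
base_bg, base_throttle = dirty_thresholds(available, 40, 10)

print(f"profile:  background {tuned_bg / GIB:.2f} GiB, throttle {tuned_throttle / GIB:.2f} GiB")
print(f"included: background {base_bg / GIB:.2f} GiB, throttle {base_throttle / GIB:.2f} GiB")
```

With lower thresholds, less unwritten data can accumulate in the page cache before writeback starts, which presumably leaves more headroom for the OSDs and guests.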

If this profile is installed on an OpenStack node as /usr/lib/tuned/ceph-osd-hci/tuned.conf, activate it by running the following as root:

# tuned-adm profile ceph-osd-hci

tuned records the active profile, so these kernel settings are reapplied at boot as long as the tuned service is enabled.
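
One way to confirm that the settings took effect is to read the values back from procfs and sysfs (`tuned-adm active` reports which profile is selected). The following is a minimal sketch, not part of the fix; the function name check_tuning and its root argument, which exists only so the check can be pointed at a test directory, are made up for illustration.

```python
from pathlib import Path

# Expected values, taken from the tuned profile in this report.
EXPECTED = {
    "proc/sys/vm/dirty_ratio": "10",
    "proc/sys/vm/dirty_background_ratio": "3",
    "sys/kernel/mm/ksm/run": "0",
}


def check_tuning(root="/"):
    """Return {path: (expected, actual)} for every setting that differs.

    actual is None when the file is missing or unreadable.
    """
    mismatches = {}
    for rel, expected in EXPECTED.items():
        path = Path(root) / rel
        try:
            actual = path.read_text().strip()
        except OSError:
            actual = None
        if actual != expected:
            mismatches[str(path)] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    problems = check_tuning()
    for path, (expected, actual) in problems.items():
        print(f"{path}: expected {expected}, found {actual}")
    if not problems:
        print("all settings match the profile")
```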

John Fulton (jfulton-org) wrote :

The suggested tuned profile could be encoded in a map like this:

    tuned_custom_profile:
      name: my_custom_profile
      sections:
        - name: main
          params:
            - option: summary
              value: ceph-osd Filestore tuned profile
            - option: include
              value: throughput-performance
        - name: sysctl
          params:
            - option: vm.dirty_ratio
              value: 10
            - option: vm.dirty_background_ratio
              value: 3
        - name: sysfs
          params:
            - option: /sys/kernel/mm/ksm/run
              value: 0
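
To show how that map lines up with the INI file, here is a small Python sketch that renders the same structure as tuned.conf text. The render_tuned_conf helper is hypothetical; this report does not show how the deployment tooling would actually consume the map.

```python
def render_tuned_conf(profile):
    """Render a {'name': ..., 'sections': [...]} map as tuned.conf text."""
    lines = []
    for section in profile["sections"]:
        lines.append(f"[{section['name']}]")
        for param in section["params"]:
            lines.append(f"{param['option']}={param['value']}")
    return "\n".join(lines) + "\n"


# The same data as the YAML map above, written as a Python dict.
tuned_custom_profile = {
    "name": "my_custom_profile",
    "sections": [
        {"name": "main", "params": [
            {"option": "summary", "value": "ceph-osd Filestore tuned profile"},
            {"option": "include", "value": "throughput-performance"},
        ]},
        {"name": "sysctl", "params": [
            {"option": "vm.dirty_ratio", "value": 10},
            {"option": "vm.dirty_background_ratio", "value": 3},
        ]},
        {"name": "sysfs", "params": [
            {"option": "/sys/kernel/mm/ksm/run", "value": 0},
        ]},
    ],
}

print(render_tuned_conf(tuned_custom_profile), end="")
```

Each section becomes an INI header and each option/value pair becomes one `option=value` line, so the map carries the same information as the profile in the bug description.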

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/613698

Changed in tripleo:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/613699

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/613698
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=3c953e877b74f2c4d93e57d3b093d9171e897e72
Submitter: Zuul
Branch: master

commit 3c953e877b74f2c4d93e57d3b093d9171e897e72
Author: John Fulton <email address hidden>
Date: Fri Oct 26 18:03:52 2018 -0400

    Allow user to define a custom tuned profile

    Add custom_profile parameter to tripleo::profile::base::tuned
    which may be a string in INI format describing a tuned profile
    which puppet will then create and apply with the name supplied
    by the profile parameter.

    Change-Id: Iba17d86bbdd710623ba1ba44b1ea5d4c1b99c541
    Related-Bug: #1800232

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/613699
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bc0246ff8a974f63229ed334b0c867472a3c9dc9
Submitter: Zuul
Branch: master

commit bc0246ff8a974f63229ed334b0c867472a3c9dc9
Author: John Fulton <email address hidden>
Date: Fri Oct 26 18:06:10 2018 -0400

    Add TunedCustomProfile parameter and HCI Ceph filestore environment

    Add TunedCustomProfile parameter which may contain a string in
    INI format describing a custom tuned profile. Also provide a new
    environment file for users of hyperconverged Ceph deployments
    using the Ceph filestore storage backend. The tuned profile is
    based on heavy I/O load testing. The provided environment file
    creates /etc/tuned/ceph-filestore-osd-hci/tuned.conf whose
    content is the following and sets this tuned profile to be active.

    [main]
    summary=ceph-osd Filestore tuned profile
    include=throughput-performance
    [sysctl]
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3
    [sysfs]
    /sys/kernel/mm/ksm/run=0

    Depends-On: Iba17d86bbdd710623ba1ba44b1ea5d4c1b99c541
    Change-Id: Iaa1c82cefac5c8f2959fd7aeb57bd6860fd9096a
    Closes-Bug: #1800232
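
For illustration, an environment file along the lines described in this commit message could be generated as sketched below. TunedCustomProfile comes from the commit message; the TunedProfileName parameter and the heat_environment helper are assumptions and should be checked against the merged templates.

```python
import textwrap

# Profile content as quoted in the commit message.
PROFILE_INI = """\
[main]
summary=ceph-osd Filestore tuned profile
include=throughput-performance
[sysctl]
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
[sysfs]
/sys/kernel/mm/ksm/run=0
"""


def heat_environment(profile_name, profile_ini):
    """Build Heat environment YAML that passes the INI text as a block scalar."""
    # Indent the INI text so it sits inside the YAML literal block ("|").
    body = textwrap.indent(profile_ini, "    ")
    return (
        "parameter_defaults:\n"
        f"  TunedProfileName: {profile_name}\n"  # assumed parameter name
        "  TunedCustomProfile: |\n"
        f"{body}"
    )


print(heat_environment("ceph-filestore-osd-hci", PROFILE_INI), end="")
```

Passing the profile as a literal block keeps the INI text byte for byte, which matters because the parameter is described as a string in INI format.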

Changed in tripleo:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625110

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625111

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/625112

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/625113

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.openstack.org/625111
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=19eb214ca8bd79f2953b465ff6a1b2444dc3c0db
Submitter: Zuul
Branch: stable/rocky

commit 19eb214ca8bd79f2953b465ff6a1b2444dc3c0db
Author: John Fulton <email address hidden>
Date: Fri Oct 26 18:03:52 2018 -0400

    Allow user to define a custom tuned profile

    Add custom_profile parameter to tripleo::profile::base::tuned
    which may be a string in INI format describing a tuned profile
    which puppet will then create and apply with the name supplied
    by the profile parameter.

    Change-Id: Iba17d86bbdd710623ba1ba44b1ea5d4c1b99c541
    Related-Bug: #1800232
    (cherry picked from commit 3c953e877b74f2c4d93e57d3b093d9171e897e72)

tags: added: in-stable-rocky

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/628261

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-docs (master)

Reviewed: https://review.openstack.org/628261
Committed: https://git.openstack.org/cgit/openstack/tripleo-docs/commit/?id=08b277406813812f013ac4c044606f2df9947e57
Submitter: Zuul
Branch: master

commit 08b277406813812f013ac4c044606f2df9947e57
Author: John Fulton <email address hidden>
Date: Thu Jan 3 13:46:12 2019 -0500

    Add documentation on deploying with custom tuned profiles

    Change-Id: If4298a186933a7662de9548a6d602df7c9eeff22
    Related-Bug: #1800232

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.3.0

This issue was fixed in the openstack/tripleo-heat-templates 10.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/625110
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ff7c6e28598d51bb73624465f260ec90aed7ac4f
Submitter: Zuul
Branch: stable/rocky

commit ff7c6e28598d51bb73624465f260ec90aed7ac4f
Author: John Fulton <email address hidden>
Date: Fri Oct 26 18:06:10 2018 -0400

    Add TunedCustomProfile parameter and HCI Ceph filestore environment

    Add TunedCustomProfile parameter which may contain a string in
    INI format describing a custom tuned profile. Also provide a new
    environment file for users of hyperconverged Ceph deployments
    using the Ceph filestore storage backend. The tuned profile is
    based on heavy I/O load testing. The provided environment file
    creates /etc/tuned/ceph-filestore-osd-hci/tuned.conf whose
    content is the following and sets this tuned profile to be active.

    [main]
    summary=ceph-osd Filestore tuned profile
    include=throughput-performance
    [sysctl]
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3
    [sysfs]
    /sys/kernel/mm/ksm/run=0

    Depends-On: Iba17d86bbdd710623ba1ba44b1ea5d4c1b99c541
    Depends-On: https://review.openstack.org/628429
    Change-Id: Iaa1c82cefac5c8f2959fd7aeb57bd6860fd9096a
    Closes-Bug: #1800232
    (cherry picked from commit bc0246ff8a974f63229ed334b0c867472a3c9dc9)

tags: added: in-stable-queens

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/625113
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8b2912df1c8c8a909c441b1b0c5c675e0fa7420c
Submitter: Zuul
Branch: stable/queens

commit 8b2912df1c8c8a909c441b1b0c5c675e0fa7420c
Author: John Fulton <email address hidden>
Date: Fri Oct 26 18:06:10 2018 -0400

    Add TunedCustomProfile parameter and HCI Ceph filestore environment

    Add TunedCustomProfile parameter which may contain a string in
    INI format describing a custom tuned profile. Also provide a new
    environment file for users of hyperconverged Ceph deployments
    using the Ceph filestore storage backend. The tuned profile is
    based on heavy I/O load testing. The provided environment file
    creates /etc/tuned/ceph-filestore-osd-hci/tuned.conf whose
    content is the following and sets this tuned profile to be active.

    [main]
    summary=ceph-osd Filestore tuned profile
    include=throughput-performance
    [sysctl]
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3
    [sysfs]
    /sys/kernel/mm/ksm/run=0

    Depends-On: Iba17d86bbdd710623ba1ba44b1ea5d4c1b99c541
    Change-Id: Iaa1c82cefac5c8f2959fd7aeb57bd6860fd9096a
    Closes-Bug: #1800232
    (cherry picked from commit bc0246ff8a974f63229ed334b0c867472a3c9dc9)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 8.3.0

This issue was fixed in the openstack/tripleo-heat-templates 8.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.3.0

This issue was fixed in the openstack/tripleo-heat-templates 9.3.0 release.
