Expected-osd-count set to 0 results in "too few PGs per OSD"

Bug #1782245 reported by Alexander Litvinov
This bug affects 1 person

Affects: Ceph Monitor Charm
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm deploying my cloud with expected-osd-count set to 0, which is supposed to still result in a valid crushmap.

bundle:
https://pastebin.canonical.com/p/r8YZM8vHG3/

ubuntu@juju-8d6642-14-lxd-3:~$ sudo ceph status
  cluster:
    id: b3ade16c-8a04-11e8-9946-00163e24e2c7
    health: HEALTH_WARN
            too few PGs per OSD (4 < min 30)

  services:
    mon: 3 daemons, quorum juju-8d6642-16-lxd-2,juju-8d6642-14-lxd-3,juju-8d6642-18-lxd-0
    mgr: juju-8d6642-14-lxd-3(active), standbys: juju-8d6642-16-lxd-2, juju-8d6642-18-lxd-0
    osd: 140 osds: 135 up, 135 in

  data:
    pools: 1 pools, 200 pgs
    objects: 0 objects, 0 bytes
    usage: 272 GB used, 491 TB / 491 TB avail
    pgs: 200 active+clean
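
For reference: with the pool's 200 PGs replicated 3 ways (assuming the default replica count) across the 135 OSDs that are in, each OSD holds roughly 200 * 3 / 135 ≈ 4.4 PG copies, hence the "4 < min 30" warning.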

----------------------------------------------------------------------

ubuntu@fnos-inf01:~/ALEX2/cpe-deployments$ juju config nvme-ceph-mon
application: nvme-ceph-mon
charm: ceph-mon
settings:
  auth-supported:
    default: cephx
    description: |
      Which authentication flavour to use.
      .
      Valid options are "cephx" and "none". If "none" is specified,
      keys will still be created and deployed so that it can be
      enabled later.
    source: default
    type: string
    value: cephx
  ceph-cluster-network:
    description: |
      The IP address and netmask of the cluster (back-side) network (e.g.,
      192.168.0.0/24)
      .
      If multiple networks are to be used, a space-delimited list of a.b.c.d/x
      can be provided.
    source: unset
    type: string
  ceph-public-network:
    description: |
      The IP address and netmask of the public (front-side) network (e.g.,
      192.168.0.0/24)
      .
      If multiple networks are to be used, a space-delimited list of a.b.c.d/x
      can be provided.
    source: unset
    type: string
  config-flags:
    description: |
      User provided Ceph configuration. Supports a string representation of
      a python dictionary where each top-level key represents a section in
      the ceph.conf template. You may only use sections supported in the
      template.
      .
      WARNING: this is not the recommended way to configure the underlying
      services that this charm installs and is used at the user's own risk.
      This option is mainly provided as a stop-gap for users that either
      want to test the effect of modifying some config or who have found
      a critical bug in the way the charm has configured their services
      and need it fixed immediately. We ask that whenever this is used,
      that the user consider opening a bug on this charm at
      http://bugs.launchpad.net/charms providing an explanation of why the
      config was needed so that we may consider it for inclusion as a
      natively supported config in the charm.
    source: user
    type: string
    value: '{''global'': {''mon max pg per osd'': 100000}}'
  customize-failure-domain:
    default: false
    description: |
      Setting this to true will tell Ceph to replicate across Juju's
      Availability Zone instead of specifically by host.
    source: user
    type: boolean
    value: true
  default-rbd-features:
    description: |
      Restrict the rbd features used to the specified level. If set, this will
      inform clients that they should set the config value `rbd default
      features`, for example:
      .
        rbd default features = 1
      .
      This needs to be set to 1 when deploying a cloud with the nova-lxd
      hypervisor.
    source: unset
    type: int
  expected-osd-count:
    default: 0
    description: |
      Number of OSDs expected to be deployed in the cluster. This value is used
      for calculating the number of placement groups on pool creation. The
      number of placement groups for new pools is based on the actual number
      of OSDs in the cluster or the expected-osd-count, whichever is greater.
      A value of 0 will cause the charm to only consider the actual number of
      OSDs in the cluster.
    source: default
    type: int
    value: 0
  harden:
    description: |
      Apply system hardening. Supports a space-delimited list of modules
      to run. Supported modules currently include os, ssh, apache and mysql.
    source: unset
    type: string
  key:
    description: |
      Key ID to import to the apt keyring to support use with arbitrary source
      configuration from outside of Launchpad archives or PPAs.
    source: unset
    type: string
  loglevel:
    default: 1
    description: Mon and OSD debug level. Max is 20.
    source: default
    type: int
    value: 1
  monitor-count:
    default: 3
    description: |
      Number of ceph-mon units to wait for before attempting to bootstrap the
      monitor cluster. For production clusters the default value of 3 ceph-mon
      units is normally a good choice.
      .
      For test and development environments you can enable single-unit
      deployment by setting this to 1.
      .
      NOTE: To establish quorum and enable partition tolerance an odd number of
      ceph-mon units is required.
    source: default
    type: int
    value: 3
  monitor-hosts:
    description: |
      A space-separated list of ceph mon hosts to use. This field is only used
      to migrate an existing cluster to a juju-managed solution and should
      otherwise be left unset.
    source: unset
    type: string
  monitor-secret:
    description: |
      The Ceph secret key used by Ceph monitors. This value will become the
      mon.key. To generate a suitable value use:
      .
        ceph-authtool /dev/stdout --name=mon. --gen-key
      .
      If left empty, a secret key will be generated.
      .
      NOTE: Changing this configuration after deployment is not supported and
      new service units will not be able to join the cluster.
    source: user
    type: string
    value: AQA8UzZbA1MwMhAAq4MFuGYi69tt5339CgK8iQ==
  nagios_context:
    default: juju
    description: |
      Used by the nrpe-external-master subordinate charm.
      A string that will be prepended to instance name to set the hostname
      in nagios. So for instance the hostname would be something like:
      .
          juju-myservice-0
      .
      If you're running multiple environments with the same services in them
      this allows you to differentiate between them.
    source: default
    type: string
    value: juju
  nagios_degraded_thresh:
    default: 1
    description: Threshold for degraded ratio (0.1 = 10%)
    source: default
    type: float
    value: 1
  nagios_ignore_nodeepscub:
    default: false
    description: Whether to ignore the nodeep-scrub flag
    source: default
    type: boolean
    value: false
  nagios_misplaced_thresh:
    default: 10
    description: Threshold for misplaced ratio (0.1 = 10%)
    source: default
    type: float
    value: 10
  nagios_recovery_rate:
    default: "1"
    description: Recovery rate below which we consider recovery to be stalled
    source: default
    type: string
    value: "1"
  nagios_servicegroups:
    default: ""
    description: |
      A comma-separated list of nagios servicegroups. If left empty, the
      nagios_context will be used as the servicegroup.
    source: default
    type: string
    value: ""
  no-bootstrap:
    default: false
    description: |
      Causes the charm to not do any of the initial bootstrapping of the
      Ceph monitor cluster. This is only intended to be used when migrating
      from the ceph all-in-one charm to a ceph-mon / ceph-osd deployment.
      Refer to the Charm Deployment guide at https://docs.openstack.org/charm-deployment-guide/latest/
      for more information.
    source: default
    type: boolean
    value: false
  pgs-per-osd:
    default: 100
    description: |
      The number of placement groups per OSD to target. It is important to
      properly size the number of placement groups per OSD as too many
      or too few placement groups per OSD may cause resource constraints and
      performance degradation. This value comes from the recommendation of
      the Ceph placement group calculator (http://ceph.com/pgcalc/) and
      recommended values are:
      .
      100 - If the cluster OSD count is not expected to increase in the
            foreseeable future.
      200 - If the cluster OSD count is expected to increase (up to 2x) in the
            foreseeable future.
      300 - If the cluster OSD count is expected to increase between 2x and 3x
            in the foreseeable future.
    source: default
    type: int
    value: 100
  prefer-ipv6:
    default: false
    description: |
      If True enables IPv6 support. The charm will expect network interfaces
      to be configured with an IPv6 address. If set to False (default) IPv4
      is expected.
      .
      NOTE: these charms do not currently support IPv6 privacy extension. In
      order for this charm to function correctly, the privacy extension must be
      disabled and a non-temporary address must be configured/available on
      your network interface.
    source: default
    type: boolean
    value: false
  source:
    description: |
      Optional configuration to support use of additional sources such as:
      .
        - ppa:myteam/ppa
        - cloud:xenial-proposed/ocata
        - http://my.archive.com/ubuntu main
      .
      The last option should be used in conjunction with the key configuration
      option.
    source: user
    type: string
    value: cloud:xenial-queens
  sysctl:
    default: '{ kernel.pid_max : 2097152, vm.max_map_count : 524288, kernel.threads-max:
      2097152 }'
    description: "YAML-formatted associative array of sysctl key/value pairs to be
      set\npersistently. By default we set pid_max, max_map_count and \nthreads-max
      to a high value to avoid problems with large numbers (>20)\nof OSDs recovering.
      very large clusters should set those values even\nhigher (e.g. max for kernel.pid_max
      is 4194303).\n"
    source: default
    type: string
    value: '{ kernel.pid_max : 2097152, vm.max_map_count : 524288, kernel.threads-max:
      2097152 }'
  use-direct-io:
    default: true
    description: Configure use of direct IO for OSD journals.
    source: default
    type: boolean
    value: true
  use-syslog:
    default: false
    description: |
      If set to True, supporting services will log to syslog.
    source: default
    type: boolean
    value: false
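
Note that in the ceph-mon config above, expected-osd-count is left at its default of 0 and config-flags is being used to raise the 'mon max pg per osd' limit. If the final cluster size is known up front it can be supplied to the charm instead, e.g. something like (assuming the planned 140 OSDs in this deployment):

  juju config nvme-ceph-mon expected-osd-count=140

This only influences the sizing of pools created afterwards; a pool that already exists keeps its pg_num unless it is resized manually.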

----------------------------------------------------------------------------------------

ubuntu@fnos-inf01:~/ALEX2/cpe-deployments$ juju config nvme-ceph-osd
application: nvme-ceph-osd
charm: ceph-osd
settings:
  aa-profile-mode:
    default: disable
    description: |
      Enable apparmor profile. Valid settings: 'complain', 'enforce' or
      'disable'.
      .
      NOTE: changing the value of this option is disruptive to a running Ceph
      cluster as all ceph-osd processes must be restarted as part of changing
      the apparmor profile enforcement mode. Always test in pre-production
      before enabling AppArmor on a live cluster.
    source: user
    type: string
    value: complain
  autotune:
    default: false
    description: |
      Enabling this option will attempt to tune your network card sysctls and
      hard drive settings. This changes hard drive read ahead settings and
      max_sectors_kb. For the network card this will detect the link speed
      and make appropriate sysctl changes. Enabling this option should
      generally be safe.
    source: user
    type: boolean
    value: true
  availability_zone:
    description: |
      Custom availability zone to provide to Ceph for the OSD placement
    source: unset
    type: string
  bluestore:
    default: false
    description: |
      Use experimental bluestore storage format for OSD devices; only supported
      in Ceph Jewel (10.2.0) or later.
      .
      Note that despite bluestore being the default for Ceph Luminous, if this
      option is False, OSDs will still use filestore.
    source: user
    type: boolean
    value: true
  bluestore-block-db-size:
    default: 0
    description: |
      Size of a partition or file to use for BlueStore metadata
      or RocksDB SSTs. A default value is not set as it is calculated
      by ceph-disk if not specified.
    source: default
    type: int
    value: 0
  bluestore-block-wal-size:
    default: 0
    description: |
      Size of a partition or file to use for BlueStore WAL (RocksDB WAL).
      A default value is not set as it is calculated by ceph-disk if
      not specified.
    source: default
    type: int
    value: 0
  bluestore-db:
    description: |
      Path to a BlueStore WAL db block device or file
    source: user
    type: string
    value: /dev/nvme10n1
  bluestore-wal:
    description: |
      Path to a BlueStore WAL block device or file.
    source: user
    type: string
    value: /dev/nvme10n1
  ceph-cluster-network:
    description: |
      The IP address and netmask of the cluster (back-side) network (e.g.,
      192.168.0.0/24)
      .
      If multiple networks are to be used, a space-delimited list of a.b.c.d/x
      can be provided.
    source: unset
    type: string
  ceph-public-network:
    description: |
      The IP address and netmask of the public (front-side) network (e.g.,
      192.168.0.0/24)
      .
      If multiple networks are to be used, a space-delimited list of a.b.c.d/x
      can be provided.
    source: unset
    type: string
  config-flags:
    description: |
      User provided Ceph configuration. Supports a string representation of
      a python dictionary where each top-level key represents a section in
      the ceph.conf template. You may only use sections supported in the
      template.
      .
      WARNING: this is not the recommended way to configure the underlying
      services that this charm installs and is used at the user's own risk.
      This option is mainly provided as a stop-gap for users that either
      want to test the effect of modifying some config or who have found
      a critical bug in the way the charm has configured their services
      and need it fixed immediately. We ask that whenever this is used,
      that the user consider opening a bug on this charm at
      http://bugs.launchpad.net/charms providing an explanation of why the
      config was needed so that we may consider it for inclusion as a
      natively supported config in the charm.
    source: unset
    type: string
  crush-initial-weight:
    description: |
      The initial crush weight for OSDs newly added to the crushmap. Use this
      option only if you wish to set the weight for newly added OSDs in order
      to gradually increase the weight over time. Be very aware that setting
      this overrides the default setting, which can lead to imbalance in the
      cluster, especially if there are OSDs of different sizes in use. By
      default, the initial crush weight for the newly added osd is set to its
      volume size in TB. Leave this option unset to use the default provided
      by Ceph itself. This option only affects NEW OSDs, not existing ones.
    source: unset
    type: float
  customize-failure-domain:
    default: false
    description: |
      Setting this to true will tell Ceph to replicate across Juju's
      Availability Zone instead of specifically by host.
    source: user
    type: boolean
    value: true
  ephemeral-unmount:
    description: |
      Cloud instances provide ephemeral storage which is normally mounted
      on /mnt.
      .
      Setting this option to the path of the ephemeral mountpoint will force
      an unmount of the corresponding device so that it can be used as an OSD
      storage device. This is useful for testing purposes (cloud deployment
      is not a typical use case).
    source: unset
    type: string
  harden:
    description: |
      Apply system hardening. Supports a space-delimited list of modules
      to run. Supported modules currently include os, ssh, apache and mysql.
    source: unset
    type: string
  ignore-device-errors:
    default: false
    description: |
      By default, the charm will raise errors if a whitelisted device is found,
      but for some reason the charm is unable to initialize the device for use
      by Ceph.
      .
      Setting this option to 'True' will result in the charm classifying such
      problems as warnings only and will not result in a hook error.
    source: default
    type: boolean
    value: false
  key:
    description: |
      Key ID to import to the apt keyring to support use with arbitrary source
      configuration from outside of Launchpad archives or PPAs.
    source: unset
    type: string
  loglevel:
    default: 1
    description: OSD debug level. Max is 20.
    source: default
    type: int
    value: 1
  max-sectors-kb:
    default: 1.048576e+06
    description: |
      This parameter will adjust every block device in your server to allow
      greater IO operation sizes. If you have a RAID card with cache on it
      consider tuning this much higher than the 1MB default. 1MB is a safe
      default for spinning HDDs that don't have much cache.
    source: default
    type: int
    value: 1.048576e+06
  nagios_context:
    default: juju
    description: |
      Used by the nrpe-external-master subordinate charm.
      A string that will be prepended to instance name to set the hostname
      in nagios. So for instance the hostname would be something like:
      .
          juju-myservice-0
      .
      If you're running multiple environments with the same services in them
      this allows you to differentiate between them.
    source: default
    type: string
    value: juju
  nagios_servicegroups:
    default: ""
    description: |
      A comma-separated list of nagios servicegroups.
      If left empty, the nagios_context will be used as the servicegroup
    source: default
    type: string
    value: ""
  osd-devices:
    default: /dev/vdb
    description: |
      The devices to format and set up as OSD volumes.
      .
      These devices are the range of devices that will be checked for and
      used across all service units, in addition to any volumes attached
      via the --storage flag during deployment.
      .
      For ceph >= 0.56.6 these can also be directories instead of devices - the
      charm assumes anything not starting with /dev is a directory instead.
    source: user
    type: string
    value: /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1
      /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1
  osd-encrypt:
    default: false
    description: |
      By default, the charm will not encrypt Ceph OSD devices; however, by
      setting osd-encrypt to True, Ceph's dmcrypt support will be used to
      encrypt OSD devices.
      .
      Specifying this option on a running Ceph OSD node will have no effect
      until new disks are added, at which point new disks will be encrypted.
    source: user
    type: boolean
    value: true
  osd-encrypt-keymanager:
    default: ceph
    description: |
      Keymanager to use for storage of dm-crypt keys used for OSD devices;
      by default 'ceph' itself will be used for storage of keys, making use
      of the key/value storage provided by the ceph-mon cluster.
      .
      Alternatively 'vault' may be used for storage of dm-crypt keys. Both
      approaches ensure that keys are never written to the local filesystem.
      This also requires a relation to the vault charm.
    source: user
    type: string
    value: vault
  osd-format:
    default: xfs
    description: |
      Format of filesystem to use for OSD devices; supported formats include:
      .
        xfs (Default >= 0.48.3)
        ext4 (Only option < 0.48.3)
        btrfs (experimental and not recommended)
      .
      Only supported with ceph >= 0.48.3.
    source: default
    type: string
    value: xfs
  osd-journal:
    description: |
      The device to use as a shared journal drive for all OSDs. By default
      a journal partition will be created on each OSD volume device for use by
      that OSD.
      .
      Only supported with ceph >= 0.48.3.
    source: unset
    type: string
  osd-journal-size:
    default: 1024
    description: |
      Ceph OSD journal size. The journal size should be at least twice the
      product of the expected drive speed multiplied by filestore max sync
      interval. However, the most common practice is to partition the journal
      drive (often an SSD), and mount it such that Ceph uses the entire
      partition for the journal.
      .
      Only supported with ceph >= 0.48.3.
    source: default
    type: int
    value: 1024
  osd-max-backfills:
    description: |
      The maximum number of backfills allowed to or from a single OSD.
      .
      Setting this option on a running Ceph OSD node will not affect running
      OSD devices, but will add the setting to ceph.conf for the next restart.
    source: unset
    type: int
  osd-recovery-max-active:
    description: |
      The number of active recovery requests per OSD at one time. More requests
      will accelerate recovery, but the requests place an increased load on the
      cluster.
      .
      Setting this option on a running Ceph OSD node will not affect running
      OSD devices, but will add the setting to ceph.conf for the next restart.
    source: unset
    type: int
  prefer-ipv6:
    default: false
    description: |
      If True enables IPv6 support. The charm will expect network interfaces
      to be configured with an IPv6 address. If set to False (default) IPv4
      is expected.
      .
      NOTE: these charms do not currently support IPv6 privacy extension. In
      order for this charm to function correctly, the privacy extension must be
      disabled and a non-temporary address must be configured/available on
      your network interface.
    source: default
    type: boolean
    value: false
  source:
    description: |
      Optional configuration to support use of additional sources such as:
      .
        - ppa:myteam/ppa
        - cloud:xenial-proposed/ocata
        - http://my.archive.com/ubuntu main
      .
      The last option should be used in conjunction with the key configuration
      option.
    source: user
    type: string
    value: cloud:xenial-queens
  sysctl:
    default: '{ kernel.pid_max : 2097152, vm.max_map_count : 524288, kernel.threads-max:
      2097152, vm.vfs_cache_pressure: 1, vm.swappiness: 1 }'
    description: |
      YAML-formatted associative array of sysctl key/value pairs to be set
      persistently. By default we set pid_max, max_map_count and
      threads-max to a high value to avoid problems with large numbers (>20)
      of OSDs recovering. very large clusters should set those values even
      higher (e.g. max for kernel.pid_max is 4194303).
    source: user
    type: string
    value: |
      net.ipv4.tcp_sack: 1,
      net.ipv4.tcp_low_latency: 1,
      net.ipv4.tcp_adv_win_scale: 1,
      net.core.rmem_max: 268435456,
      net.core.wmem_max: 268435456,
      net.ipv4.tcp_rmem: 4096 87380 134217728,
      net.ipv4.tcp_wmem: 4096 65536 134217728,
      net.ipv4.tcp_no_metrics_save: 1,
      net.ipv4.tcp_mtu_probing: 1,
      net.core.netdev_max_backlog: 250000,
      net.ipv4.tcp_congestion_control: bbr,
      net.core.default_qdisc: fq
  use-direct-io:
    default: true
    description: Configure use of direct IO for OSD journals.
    source: default
    type: boolean
    value: true
  use-syslog:
    default: false
    description: |
      If set to True, supporting services will log to syslog.
    source: default
    type: boolean
    value: false

Tags: cpe-onsite
Revision history for this message
James Page (james-page) wrote :

"A value of 0 will cause the charm to only consider the actual number of OSDs in the cluster."

If you have that many OSDs you really do need to set expected-osd-count: the PG calculation happens early in the deployment, based on the number of "in" OSDs at that point in time, which will be inaccurate. Then, as the remaining OSDs join, the PGs spread out and the number per OSD drops well below the target of 100/200/300.
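
For illustration, a rough sketch (not the charm's actual code) of the kind of PG sizing calculation involved, showing why a calculation done early, with only a handful of OSDs in, lands far below target once all 140 OSDs have joined:

    # Rough sketch only -- not the charm's implementation.
    # A pool's PG count is typically sized so that each OSD ends up hosting
    # roughly pgs-per-osd PG copies (100 by default in this charm).
    def suggested_pg_num(osd_count, pgs_per_osd=100, replicas=3):
        # Total desired PG copies divided by the replication factor;
        # real implementations also round the result (often to a power
        # of two) and cap it per pool.
        return int(osd_count * pgs_per_osd / replicas)

    # Early in the deployment, with (say) only 6 OSDs up and in:
    print(suggested_pg_num(6))    # 200
    # With expected-osd-count=140, or once all OSDs have joined:
    print(suggested_pg_num(140))  # 4666

Since pg_num is fixed at pool creation time (Luminous has no PG autoscaler), the PGs created early end up spread over 135 OSDs, which is the ~4 PG copies per OSD reported in the status output above.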

Changed in charm-ceph-mon:
status: New → Won't Fix