clock skew results in abnormal memory usage

Bug #2055143 reported by Nishant Dash
This bug affects 1 person

Affects: Ceph Monitor Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

- Given an NTP misconfiguration or a timeout reaching the NTP servers, there can be a temporary period of time drift or clock skew across a ceph-mon cluster. While that is bad in itself, what I see is that the ceph-mon@... service (/usr/bin/ceph-mon) ends up consuming gigabytes of memory (it maxed out the node's memory at 24G) when it should normally use far less (for my cluster, roughly 500-800M). See the sketch after the setup listing below for how the skew can be confirmed.

- My Setup:
3 VMs, with the microk8s charm plus the ceph-mon and ceph-osd charms deployed to them; juju status shows:
```
App Version Status Scale Charm Channel Rev Exposed Message
ceph-csi v3.9.0,v0,v3... active 3 ceph-csi 1.28/stable 37 no Versions: cephfs=v3.9.0, config=v0, rbd=v3.9.0
ceph-mon 17.2.6 active 3 ceph-mon quincy/stable 201 no Unit is ready and clustered
ceph-osd 17.2.6 active 3 ceph-osd quincy/stable 576 no Unit is ready (1 OSD)
grafana-agent-microk8s active 3 grafana-agent edge 52 no
microk8s 1.28.3 active 3 microk8s 1.28/stable 213 no node is ready
ntp 4.2 active 3 ntp stable 50 no chrony: Ready
```
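
As mentioned above, a minimal sketch of how the skew and the NTP state can be confirmed, assuming a unit name of ceph-mon/0 and that chrony is the active time daemon (as the ntp charm line above suggests):

```
# Ceph raises a MON_CLOCK_SKEW health warning when the mon clocks diverge
juju ssh ceph-mon/0 -- sudo ceph health detail

# chrony's view of the current offset and of the configured NTP servers
juju ssh ceph-mon/0 -- chronyc tracking
juju ssh ceph-mon/0 -- chronyc sources -v
```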

Let me know if there is anything more I can provide or clarify.
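
To attribute the usage to the ceph-mon daemon itself rather than to the node as a whole, something along these lines could be run across the units (a sketch; ps reports RSS in KiB):

```
# resident set size of the /usr/bin/ceph-mon process on every mon unit
juju run -a ceph-mon -- ps -C ceph-mon -o pid,rss,cmd
```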

Facundo Ciccioli (fandanbango) wrote:

I somewhat reproduced this. Deployed this bundle:

```
## Variables
loglevel: &loglevel 1
source: &source distro
num_mon_units: &num_mon_units 3

series: jammy
applications:
  ceph-mon:
    charm: ch:ceph-mon
    channel: quincy/edge
    series: jammy
    num_units: *num_mon_units
    constraints: mem=2G
    options:
      source: *source
      loglevel: *loglevel
      monitor-count: *num_mon_units
      monitor-secret: ...
      expected-osd-count: 3
  ceph-osd:
    charm: ch:ceph-osd
    channel: quincy/edge
    series: jammy
    num_units: 1
    constraints: mem=1G
    options:
      source: *source
      loglevel: *loglevel
      osd-devices: '' # must be empty string when using juju storage
    storage:
      osd-devices: cinder,10G,1
relations:
  - [ ceph-mon, ceph-osd ]
```
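
For reference, assuming the snippet above is saved as bundle.yaml (with a real monitor-secret filled in where it is elided), the deployment itself is just:

```
juju deploy ./bundle.yaml
```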

I set one mon's clock 15 seconds into the future and got the corresponding health warning:

clock skew detected on mon.juju-1e1b6a-lp2055143-14
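
One way to introduce that kind of skew (a sketch; not necessarily how it was done here) is to stop chrony on a single mon and step its clock:

```
# on the target mon unit (reached with `juju ssh <unit>`):
sudo systemctl stop chrony                # keep chrony from correcting the step right away
sudo date -s "@$(( $(date +%s) + 15 ))"   # step the clock 15 seconds into the future
```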

The initial memory condition was:

```
ubuntu@fandanbango-bastion:~$ juju run -a ceph-mon -- free -h
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 473Mi 123Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/6
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 477Mi 130Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/7
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 501Mi 95Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/8
```

Five hours later the situation was:

```
ubuntu@fandanbango-bastion:~/lp2055143$ juju run -a ceph-mon -- free -h
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 554Mi 68Mi 4.0Mi 1.3Gi 1.2Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/6
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 558Mi 67Mi 4.0Mi 1.3Gi 1.2Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/7
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 859Mi 64Mi 4.0Mi 1.0Gi 928Mi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/8
```

I've re-established clock synchronization via NTP and will keep monitoring to see whether the trend continues or levels off.
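
For completeness, a sketch of how synchronization can be restored on the skewed mon and the trend watched from the bastion (the 5-minute sampling interval is arbitrary):

```
# on the skewed mon unit: restart chrony and step straight back to NTP time
sudo systemctl start chrony
sudo chronyc makestep

# on the bastion: sample memory across the mon units every 5 minutes
watch -n 300 "juju run -a ceph-mon -- free -h"
```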
