clock skew results in abnormal memory usage

Bug #2055143 reported by Nishant Dash
This bug affects 1 person

Affects: Ceph Monitor Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

- Given an NTP misconfiguration or a timeout reaching the NTP servers, there can be a temporary period of time drift or clock skew across a ceph-mon cluster. While that is bad in itself, what I see is that the ceph-mon@... service (/usr/bin/ceph-mon) ends up consuming gigabytes of memory (it maxed out the node's memory at 24G) when it should normally use far less (for my cluster, roughly 500-800M). See the sketch after the setup listing below for how the skew can be confirmed.

- My Setup:
3 VMs, with the microk8s charm plus the ceph-mon and ceph-osd charms deployed to them; juju status shows:
```
App Version Status Scale Charm Channel Rev Exposed Message
ceph-csi v3.9.0,v0,v3... active 3 ceph-csi 1.28/stable 37 no Versions: cephfs=v3.9.0, config=v0, rbd=v3.9.0
ceph-mon 17.2.6 active 3 ceph-mon quincy/stable 201 no Unit is ready and clustered
ceph-osd 17.2.6 active 3 ceph-osd quincy/stable 576 no Unit is ready (1 OSD)
grafana-agent-microk8s active 3 grafana-agent edge 52 no
microk8s 1.28.3 active 3 microk8s 1.28/stable 213 no node is ready
ntp 4.2 active 3 ntp stable 50 no chrony: Ready
```
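
As mentioned above, a minimal sketch of how the skew and the NTP state can be confirmed, assuming a unit name of ceph-mon/0 and that chrony is the active time daemon (as the ntp charm line above suggests):

```
# Ceph raises a MON_CLOCK_SKEW health warning when the mon clocks diverge
juju ssh ceph-mon/0 -- sudo ceph health detail

# chrony's view of the current offset and of the configured NTP servers
juju ssh ceph-mon/0 -- chronyc tracking
juju ssh ceph-mon/0 -- chronyc sources -v
```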

Let me know if there is anything more I can provide or clarify.
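
To attribute the usage to the ceph-mon daemon itself rather than to the node as a whole, something along these lines could be run across the units (a sketch; ps reports RSS in KiB):

```
# resident set size of the /usr/bin/ceph-mon process on every mon unit
juju run -a ceph-mon -- ps -C ceph-mon -o pid,rss,cmd
```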

Facundo Ciccioli (fandanbango) wrote:

I somewhat reproduced this. Deployed this bundle:

```
## Variables
loglevel: &loglevel 1
source: &source distro
num_mon_units: &num_mon_units 3

series: jammy
applications:
  ceph-mon:
    charm: ch:ceph-mon
    channel: quincy/edge
    series: jammy
    num_units: *num_mon_units
    constraints: mem=2G
    options:
      source: *source
      loglevel: *loglevel
      monitor-count: *num_mon_units
      monitor-secret: ...
      expected-osd-count: 3
  ceph-osd:
    charm: ch:ceph-osd
    channel: quincy/edge
    series: jammy
    num_units: 1
    constraints: mem=1G
    options:
      source: *source
      loglevel: *loglevel
      osd-devices: '' # must be empty string when using juju storage
    storage:
      osd-devices: cinder,10G,1
relations:
  - [ ceph-mon, ceph-osd ]
```
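
For reference, assuming the snippet above is saved as bundle.yaml (with a real monitor-secret filled in where it is elided), the deployment itself is just:

```
juju deploy ./bundle.yaml
```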

I set one mon's clock 15 seconds into the future and got the corresponding health warning:

clock skew detected on mon.juju-1e1b6a-lp2055143-14
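
One way to introduce that kind of skew (a sketch; not necessarily how it was done here) is to stop chrony on a single mon and step its clock:

```
# on the target mon unit (reached with `juju ssh <unit>`):
sudo systemctl stop chrony                # keep chrony from correcting the step right away
sudo date -s "@$(( $(date +%s) + 15 ))"   # step the clock 15 seconds into the future
```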

The initial memory condition was:

```
ubuntu@fandanbango-bastion:~$ juju run -a ceph-mon -- free -h
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 473Mi 123Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/6
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 477Mi 130Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/7
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 501Mi 95Mi 4.0Mi 1.3Gi 1.3Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/8
```

Five hours later the situation was:

```
ubuntu@fandanbango-bastion:~/lp2055143$ juju run -a ceph-mon -- free -h
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 554Mi 68Mi 4.0Mi 1.3Gi 1.2Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/6
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 558Mi 67Mi 4.0Mi 1.3Gi 1.2Gi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/7
- Stdout: |2
                   total used free shared buff/cache available
    Mem: 1.9Gi 859Mi 64Mi 4.0Mi 1.0Gi 928Mi
    Swap: 0B 0B 0B
  UnitId: ceph-mon/8
```

I've re-established clock synchronization via NTP and will keep monitoring to see whether the trend continues or levels off.
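
For completeness, a sketch of how synchronization can be restored on the skewed mon and the trend watched from the bastion (the 5-minute sampling interval is arbitrary):

```
# on the skewed mon unit: restart chrony and step straight back to NTP time
sudo systemctl start chrony
sudo chronyc makestep

# on the bastion: sample memory across the mon units every 5 minutes
watch -n 300 "juju run -a ceph-mon -- free -h"
```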
