clock skew results in abnormal memory usage
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph Monitor Charm | New | Undecided | Unassigned |
Bug Description
- Given an NTP misconfiguration, or a timeout reaching the NTP servers, there can be a temporary situation where there is time drift or clock skew across a ceph-mon cluster. While that is bad in itself, what I see is that the ceph-mon@... service (/usr/bin/ceph-mon) ends up taking gigabytes of memory (it maxed out the node's memory at 24G) when it should normally take far less (for this cluster, roughly 500-800M; see the quick checks after the status output below).
- My setup:
3 VMs, with the microk8s charm deployed to them, plus the ceph-mon and ceph-osd charms:
```
ceph-csi v3.9.0,v0,v3... active 3 ceph-csi 1.28/stable 37 no Versions: cephfs=v3.9.0, config=v0, rbd=v3.9.0
ceph-mon 17.2.6 active 3 ceph-mon quincy/stable 201 no Unit is ready and clustered
ceph-osd 17.2.6 active 3 ceph-osd quincy/stable 576 no Unit is ready (1 OSD)
grafana-
microk8s 1.28.3 active 3 microk8s 1.28/stable 213 no node is ready
ntp 4.2 active 3 ntp stable 50 no chrony: Ready
```
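In case it helps anyone hitting the same thing, these are the kinds of commands I use to confirm the skew and watch the mon daemon's resident memory. This is only a sketch: `ceph time-sync-status` is the upstream mon command for the skew report, and the `ps`/`free` calls just assume the daemon process is named ceph-mon on the units.
```
# Does the cluster currently report clock skew? (asked from every mon unit)
juju run -a ceph-mon -- sudo ceph time-sync-status

# Overall memory and the ceph-mon daemon's RSS on every mon unit
juju run -a ceph-mon -- free -h
juju run -a ceph-mon -- ps -o rss,etime,cmd -C ceph-mon
```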
Let me know if there is anything more I can provide or clarify.
I somewhat reproduced this. Deployed this bundle:
```
## Variables
loglevel: &loglevel 1
source: &source distro
num_mon_units: &num_mon_units 3

series: jammy

applications:
  ceph-mon:
    charm: ch:ceph-mon
    channel: quincy/edge
    series: jammy
    num_units: *num_mon_units
    constraints: mem=2G
    options:
      source: *source
      loglevel: *loglevel
      monitor-count: *num_mon_units
      monitor-secret: ...
      expected-osd-count: 3
  ceph-osd:
    charm: ch:ceph-osd
    channel: quincy/edge
    series: jammy
    num_units: 1
    constraints: mem=1G
    options:
      source: *source
      loglevel: *loglevel
      osd-devices: ''  # must be empty string when using juju storage
    storage:
      osd-devices: cinder,10G,1

relations:
  - [ ceph-mon, ceph-osd ]
```
Set one mon's clock 15 seconds into the future, and got the corresponding warning:
clock skew detected on mon.juju-1e1b6a-lp2055143-14
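For anyone else trying to reproduce, this is roughly how I skewed a single mon; a sketch only, since the unit name is just an example and the time-sync service to stop depends on what the image runs (chrony or systemd-timesyncd):
```
# On ONE mon unit only: stop the active time-sync service so the skew sticks
# (chrony here; it may be systemd-timesyncd on other images),
# then jump the clock ~15 seconds into the future
juju ssh ceph-mon/8 'sudo systemctl stop chrony; sudo date -s "$(date -d "+15 seconds")"'
```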
The initial memory condition was:
```
ubuntu@fandanbango-bastion:~$ juju run -a ceph-mon -- free -h
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       473Mi       123Mi       4.0Mi       1.3Gi       1.3Gi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/6
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       477Mi       130Mi       4.0Mi       1.3Gi       1.3Gi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/7
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       501Mi        95Mi       4.0Mi       1.3Gi       1.3Gi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/8
```
Five hours later the situation was:
```
ubuntu@fandanbango-bastion:~/lp2055143$ juju run -a ceph-mon -- free -h
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       554Mi        68Mi       4.0Mi       1.3Gi       1.2Gi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/6
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       558Mi        67Mi       4.0Mi       1.3Gi       1.2Gi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/7
- Stdout: |2
                 total        used        free      shared  buff/cache   available
    Mem:         1.9Gi       859Mi        64Mi       4.0Mi       1.0Gi       928Mi
    Swap:           0B          0B          0B
  UnitId: ceph-mon/8
```
I've restored clock synchronization via NTP and will keep monitoring to see whether the trend continues or flatlines.
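To watch the trend without babysitting it, something along these lines is what I have in mind (a sketch; the 10-minute interval is arbitrary):
```
# Sample total memory and ceph-mon RSS on all mon units every 10 minutes
while true; do
    date
    juju run -a ceph-mon -- free -h
    juju run -a ceph-mon -- ps -o rss,cmd -C ceph-mon
    sleep 600
done
```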