memory overhead of qemu-kvm with ceph rbd and ram-allocation-ratio=0.9 leads to memory starvation

Bug #1674481 reported by Drew Freiberger
This bug affects 16 people
Affects                                Status      Importance  Assigned to  Milestone
OpenStack Nova Cloud Controller Charm  Invalid     Undecided   Unassigned
ceph (Ubuntu)                          Incomplete  Medium      Unassigned
qemu (Ubuntu)                          Confirmed   Medium      Unassigned

Bug Description

We have observed up to 20% memory overhead on 18 GB nova instances, several of which are packed onto each node: each takes up to 22 GB of resident memory when fully utilized. Our standard ram-allocation-ratio is 0.9, and reserved-host-memory is set to 5120 MB on the nova-compute charm.

The nodes have 512 GB of RAM and were left with less than 8 GB free, with 8 GB of swap in use. This is a high-consumption CI/CD environment using Ceph ephemeral disks.

We have since worked around memory starvation by adding nodes and reducing ram_allocation_ratio to 0.7.
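
A rough back-of-the-envelope model may help show why a ratio of 0.9 runs these nodes out of memory. It is only a sketch under stated assumptions: it assumes the scheduler caps instance memory at total RAM × ratio minus the reserved host memory, reads 512 GB as 512 × 1024 MB, and uses the 22049 MB RSS measured in this report as the worst case per instance. The function names are hypothetical.

```python
import math

def max_instances(total_mb, ratio, reserved_mb, flavor_mb):
    """Instances of one flavor that fit under the (assumed) scheduler limit."""
    schedulable_mb = total_mb * ratio - reserved_mb
    return math.floor(schedulable_mb / flavor_mb)

def worst_case_rss_mb(instances, rss_per_instance_mb):
    """Total resident memory if every instance reaches the observed RSS."""
    return instances * rss_per_instance_mb

TOTAL_MB = 512 * 1024   # assumption: "512 GB" means 512 GiB
RESERVED_MB = 5120      # reserved-host-memory from this report
FLAVOR_MB = 18432       # the -m value of the instances
OBSERVED_RSS_MB = 22049 # RSS of a typical high-use instance in this report

for ratio in (0.9, 0.7):
    count = max_instances(TOTAL_MB, ratio, RESERVED_MB, FLAVOR_MB)
    rss = worst_case_rss_mb(count, OBSERVED_RSS_MB)
    verdict = "exceeds" if rss > TOTAL_MB else "fits within"
    print(f"ratio {ratio}: {count} instances, {rss} MB worst-case RSS, "
          f"{verdict} {TOTAL_MB} MB of physical RAM")
```

Under these assumptions, 0.9 admits 25 instances whose worst-case RSS (551225 MB) exceeds the 524288 MB of physical RAM, while 0.7 admits 19 instances (418931 MB), which fits.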

Does this resident memory overhead seem abnormally high? Is there anything in the below qemu process that may be causing this that we need to account for in the ram-allocation-ratio settings?

Specs:
Xenial series
nova-cloud-controller 13.1.2 charm rev. 503
nova-compute 13.1.2 charm rev. 135

Note the -m 18432(MB) argument and 22049 MB RSS of this typical high-use instance's process:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
libvirt+ 2020854 208 4.2 49713524 22579016 ? Sl 19:06 235:28 /usr/bin/qemu-system-x86_64 -name instance-XXXXXXXX -S -machine pc-i440fx-xenial,accel=kvm,usb=off -cpu Haswell-noTSX -m 18432 -realtime mlock=off -smp 8,sockets=8,cores=1,threads=1 -uuid XXXXXXXX -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=13.1.2,serial=XXXXXXXXXXXX,uuid=XXXXXXXXX,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-instance-000173b7/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:nova/XXXXXXX-23e3-40c0-9038-3dd837e5b1a3_disk:id=nova-compute:key=XXXXXXXXX==:auth_supported=cephx\;none:mon_host=X.Y.Z.A\:6789\;1X.Y.Z.B\:6789\;X.Y.Z.C\:6789,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=XX:XX:XX:XX:XX:XX,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/29ce4bc7-23e3-40c0-9038-3dd837e5b1a3/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
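
For anyone who wants to measure the same overhead on their own hosts, here is a minimal sketch that derives it from a `ps aux` line like the one above. It assumes the sixth column is RSS in KiB and that guest memory is passed as a plain `-m <MB>` argument, as in this report; the function name and the shortened sample line are for illustration only.

```python
import re

def qemu_overhead(ps_line):
    """Return (rss_mb, guest_mb, overhead_fraction) for one `ps aux` line."""
    fields = ps_line.split()
    rss_kb = int(fields[5])  # USER PID %CPU %MEM VSZ RSS ...; RSS is in KiB
    # Assumption: guest RAM is given as "-m <MB>". The required whitespace
    # after "-m" keeps -machine, -mon and -msg from matching.
    match = re.search(r"\s-m\s+(\d+)", ps_line)
    if match is None:
        raise ValueError("no plain '-m <MB>' argument found")
    guest_mb = int(match.group(1))
    rss_mb = rss_kb / 1024
    return rss_mb, guest_mb, (rss_mb - guest_mb) / guest_mb

# Shortened copy of the process line from this report.
SAMPLE = ("libvirt+ 2020854 208 4.2 49713524 22579016 ? Sl 19:06 235:28 "
          "/usr/bin/qemu-system-x86_64 -name instance-XXXXXXXX -S "
          "-machine pc-i440fx-xenial,accel=kvm,usb=off -cpu Haswell-noTSX "
          "-m 18432 -realtime mlock=off -msg timestamp=on")

rss_mb, guest_mb, overhead = qemu_overhead(SAMPLE)
print(f"RSS {rss_mb:.0f} MB for a {guest_mb} MB guest: {overhead:.1%} overhead")
```

For the instance in this report, that works out to roughly 19.6% of resident memory beyond the configured guest RAM.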

Xav Paice (xavpaice)
tags: added: canonical-bootstack
youshotwhointhatwhatnow (moloney-brendan) wrote :

I am not using openstack, but have the same problem with QEMU/KVM virtual machines that have Ceph RBD disks attached. There is an issue on the Ceph tracker (http://tracker.ceph.com/issues/20054#change-93573) but it isn't clear if the bug is on their side or if it is in QEMU.

James Page (james-page) wrote :

Just to confirm what I think I see in the qemu params - are these instances booted from ceph using the nova-compute support for ephemeral storage on ceph?

James Page (james-page) wrote :

Qemu bug report - bug 1701449

James Page (james-page) wrote :

Raising bug tasks for distro packages alongside the charm bug.

Changed in charm-nova-cloud-controller:
status: New → Invalid
James Page (james-page) wrote :

Marking charm bug as invalid as this is an issue with librbd or qemu, not the charm itself.

Changed in ceph (Ubuntu):
importance: Undecided → Medium
Changed in qemu (Ubuntu):
importance: Undecided → Medium
status: New → Incomplete
Changed in ceph (Ubuntu):
status: New → Incomplete
James Page (james-page) wrote :

Marking this as 'Medium' for now - we may want to bump that to 'High'; right now we have a reproducer in the context of qemu/librbd but not directly with librbd, which indicates this is a qemu-specific issue.

summary: - memory overhead of qemu-kvm and ram-allocation-ratio=0.9 leads to memory
- starvation
+ memory overhead of qemu-kvm with ceph rbd and ram-allocation-ratio=0.9
+ leads to memory starvation
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for ceph (Ubuntu) because there has been no activity for 60 days.]

Changed in ceph (Ubuntu):
status: Incomplete → Expired
Changed in qemu (Ubuntu):
status: Expired → New
Changed in ceph (Ubuntu):
status: Expired → New
Andreas Hasenack (ahasenack) wrote :

Looks like #1701449 is a duplicate of this one, or the other way around.

James Page (james-page) wrote :

Upstream bug: http://tracker.ceph.com/issues/36192

Commits are in place in the nautilus, mimic and luminous branches:

$ git tag --contains 2b3761d7a247eaba12bbdb8e0fee4bd9cfc89041
v13.2.3
v13.2.4
v13.2.5
v13.2.6

So this should be resolved for >= cosmic (which has 13.2.4 with 13.2.6 in SRU).

$ git tag --contains 5b173a3ef2f9e9f113556f085b1e6d0f17cbe388
v12.2.11
v12.2.12

and for >= bionic or xenial/{pike,queens} (which all have 12.2.11 in updates).
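
As a quick way to check an installed version against the tags listed above, something like the following sketch could be used. It assumes the version string starts with the upstream Ceph version (for example the leading part of a package version), and it assumes every release from nautilus (14.x) onward carries the fix, based only on the statement that the commits are in the nautilus branch. The helper names are hypothetical.

```python
import re

# First tagged release containing the fix on each branch, per the
# `git tag --contains` output above (luminous = 12.x, mimic = 13.x).
FIRST_FIXED = {12: (12, 2, 11), 13: (13, 2, 3)}

def parse_version(version):
    """Extract (major, minor, patch) from strings like 'v12.2.11' or '13.2.4-0ubuntu1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())

def has_librbd_fix(version):
    """True if this Ceph version should contain the fix, under the assumptions above."""
    parsed = parse_version(version)
    major = parsed[0]
    if major in FIRST_FIXED:
        return parsed >= FIRST_FIXED[major]
    # Assumption: nautilus (14.x) and later all include the commit;
    # anything older than luminous does not.
    return major > 13

for candidate in ("12.2.10", "12.2.11", "13.2.2", "13.2.4"):
    print(candidate, has_librbd_fix(candidate))
```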

James Page (james-page) wrote :

The original bug report lacks a ceph/openstack/qemu version - was this prior to the releases detailed in comment #10?

Changed in ceph (Ubuntu):
status: New → Incomplete
James Page (james-page) wrote :

This issue may be resolved in later ceph releases present in Ubuntu.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed