high memory usage when using rbd with client caching

Bug #1701449 reported by Nick
This bug affects 14 people
Affects  Status  Importance  Assigned to  Milestone
QEMU     New     Undecided   Unassigned

Bug Description

Hi,
we are experiencing quite high memory usage by a single qemu process (used with KVM) when using RBD with client caching as a disk backend. We are testing with qemu virtual machines with 3GB of memory and a 128MB RBD client cache. When running 'fio' in the virtual machine, you can see that after some time the machine uses a lot more memory (RSS) on the hypervisor than it should. We have seen values of 250% memory overhead (on real production machines, not artificial fio tests). I reproduced this with qemu version 2.9 as well.

Here are the contents of our ceph.conf on the hypervisor:
"""
[client]
rbd cache writethrough until flush = False
rbd cache max dirty = 100663296
rbd cache size = 134217728
rbd cache target dirty = 50331648
"""
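To make the byte values in the configuration above easier to read, here is a minimal Python sketch that converts them to MiB. The helper names (`to_mib`, `describe`) are made up for illustration; only the three numeric values come from the ceph.conf shown above.

```python
# Byte values copied from the [client] section of the ceph.conf above.
RBD_CACHE_SETTINGS = {
    "rbd cache size": 134217728,
    "rbd cache max dirty": 100663296,
    "rbd cache target dirty": 50331648,
}

MIB = 1024 * 1024


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB (hypothetical helper)."""
    return num_bytes / MIB


def describe(settings: dict) -> dict:
    """Return the same settings with values expressed in MiB."""
    return {name: to_mib(value) for name, value in settings.items()}


if __name__ == "__main__":
    for name, mib in describe(RBD_CACHE_SETTINGS).items():
        print(f"{name} = {mib:g} MiB")
```

This shows a 128 MiB cache with a 96 MiB dirty limit and a 48 MiB dirty target, which matches the 128MB client cache mentioned in the description.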

How to reproduce:
* create a virtual machine with an RBD-backed disk (100GB or so)
* install a Linux distribution on it (we are using Ubuntu)
* install fio (apt-get install fio)
* run fio multiple times with (e.g.) the following test file:
"""
# This job file tries to mimic the Intel IOMeter File Server Access Pattern
[global]
description=Emulation of Intel IOmeter File Server Access Pattern
randrepeat=0
filename=/root/test.dat
# IOMeter defines the server loads as the following:
# iodepth=1 Linear
# iodepth=4 Very Light
# iodepth=8 Light
# iodepth=64 Moderate
# iodepth=256 Heavy
iodepth=8
size=80g
direct=0
ioengine=libaio

[iometer]
stonewall
bs=4M
rw=randrw

[iometer_just_write]
stonewall
bs=4M
rw=write

[iometer_just_read]
stonewall
bs=4M
rw=read
"""

You can measure the virtual machine RSS usage on the hypervisor with:
  virsh dommemstat <machine name> | grep rss
or if you are not using libvirt:
  grep RSS /proc/<PID of qemu process>/status
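If you want to track the value over time, the same check can be sketched in Python. This is only an illustration: the function names are hypothetical, the sample status text is invented and not a measurement from this report, and "overhead" is assumed here to mean RSS above the configured guest memory, as a percentage of that guest memory.

```python
import re
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_vmrss_kib(pid: int) -> int:
    """Read VmRSS for a running process (Linux only)."""
    return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())


def overhead_percent(rss_kib: int, guest_mem_kib: int) -> float:
    """RSS above the guest memory size, as a percentage of guest memory.

    This definition is an assumption made for this sketch.
    """
    return (rss_kib - guest_mem_kib) / guest_mem_kib * 100.0


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = "Name:\tqemu-system-x86\nVmRSS:\t 4194304 kB\n"
    rss = parse_vmrss_kib(sample)
    guest = 3 * 1024 * 1024  # 3 GiB guest, expressed in KiB
    print(f"RSS: {rss} kB, overhead: {overhead_percent(rss, guest):.1f}%")
```

On the hypervisor you would call `read_vmrss_kib(<PID of qemu process>)` instead of parsing the sample string, for example once before and once after each fio run.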

When the RBD client cache is switched off, everything is fine again: the process no longer uses that much memory.

There is already a ticket on the ceph bug tracker for this ([1]). However, I can reproduce this memory behaviour only when using qemu (maybe it is using librbd in a special way?). Running 'fio' directly with the rbd engine does not result in such high memory usage.

[1] http://tracker.ceph.com/issues/20054

Markus Schade (lp-markusschade) wrote :

We are seeing pretty much the same issue, with even small (1G mem) virtual instances using 2-3GB of RSS after running I/O intensive applications. Live migrating the instance to another machine brings the memory usage back down, but it grows again once I/O resumes.

joconcepts (jonav) wrote :

Any update on this?

James Page (james-page) wrote :

Linking back to bug 1674481, which I think is the same issue seen in Ubuntu.

Nick (n6ck) wrote :

Is there any progress on solving this, or does anyone have an idea how to debug this further? I think we are kind of stuck in the ceph bug tracker issue as well [1].

[1] http://tracker.ceph.com/issues/20054

Andreas Hasenack (ahasenack) wrote :

Is there any reason we are keeping this bug and #1674481 separate? Are we not sure they are the same issue?

Jason Dillaman (jdillaman) wrote :

@Nick: if you can recreate the librbd memory growth, any chance you can help test a potential fix [1]?

[1] https://github.com/ceph/ceph/pull/24297
