KVM guest hash page table failed to allocate contiguous memory (CMA)

Bug #1781038 reported by bugproxy
This bug affects 2 people
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
Opinion
Critical
Canonical Kernel Team
linux (Ubuntu)
Invalid
Critical
Canonical Kernel Team

Bug Description

Per an email forwarded within IBM, we wish to use this Launchpad bug to work on the technical discussion with the Canonical development folks and the IBM KVM and kernel team surrounding the analysis made by Daniel Axtens of Canonical for the customer issue raised in Case #00177825.

The only statement at the moment by the KVM team was that there were various issues associated with CMA fragmentation causing issues with KVM guests. However, as mentioned, this bug is to allow the dialog amongst all the developers to see what can be done to help alleviate the situation or understand the root cause further.

Please also note that we should not be attaching customer data to this bug. If that is necessary then we expect Canonical to help provide a controlled environment for reviewing that data so we avoid any privacy issues (e.g. for GDPR compliance).

Here is the email from Daniel:

I have looked at the sosreport you uploaded. Here is my analysis so far.

Virtualisation on powerpc has some special requirements. To start a guest on a powerpc host, you need to allocate a contiguous area of memory to hold the guest's hash page table (HPT, or HTAB, depending on which document you look at). The HPT is required to track and manage guest memory.

Your error reports show qemu asking the kernel to allocate an HTAB, and the kernel reporting that it had insufficient memory to do so. The required memory for the HPT scales with the guest memory size - it should be about 1/128th of guest memory, so for a 16GB guest, that's 128MB. However, the HPT has to be allocated as a single contiguous memory region. (This is in contrast to regular guest memory, which is not required to be contiguous from the host point of view.)
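The 1/128th scaling rule above can be sketched as a quick back-of-the-envelope calculation (this only illustrates the scaling; the real kernel rounds the HPT to a power-of-two size):

```python
# Approximate HPT sizing per the ~1/128th rule described above.
def hpt_size_bytes(guest_mem_bytes: int) -> int:
    """Approximate hash page table size: ~1/128th of guest memory."""
    return guest_mem_bytes // 128

GIB = 1024 ** 3
# A 16 GB guest needs roughly a 128 MB HPT, allocated contiguously.
print(hpt_size_bytes(16 * GIB) // (1024 ** 2), "MB")
```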

The kernel keeps a special contiguous memory area (CMA) for these purposes, and keeps track of the total amounts in use and still available. These are shown in /proc/meminfo. From the system that ran the sosreport, we see:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

So there is a total of about 25.6GB of CMA, of which about 3.8GB remains. This is obviously more than 128MB, which suggests two possibilities:

- It's very possible that between the error and the sosreport, more contiguous memory became available. This would match the intermittent nature of the issue.

- It also might be that the failure was due to fragmentation of memory in the CMA pool. That is, there might be more than 128MB, but it might all be in chunks that are smaller than 128MB, or which don't have the required alignment for a HPT.
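A quick sanity check of the numbers above, parsing meminfo-style output (values taken from this report). Note that a sufficiently large CmaFree is necessary but not sufficient: the 128MB must also be contiguous and correctly aligned, which /proc/meminfo cannot tell you.

```python
# Parse the CMA counters as they appear in /proc/meminfo and compare
# CmaFree against the HPT requirement. Values copied from this report.
MEMINFO_SNIPPET = """\
CmaTotal:       26853376 kB
CmaFree:         4024448 kB
"""

def parse_cma(meminfo_text: str) -> dict:
    vals = {}
    for line in meminfo_text.splitlines():
        key, rest = line.split(":", 1)
        if key in ("CmaTotal", "CmaFree"):
            vals[key] = int(rest.split()[0]) * 1024  # kB -> bytes
    return vals

cma = parse_cma(MEMINFO_SNIPPET)
hpt_needed = 128 * 1024 * 1024  # 128 MB, for a 16 GB guest
print("CmaFree: %.1f GB" % (cma["CmaFree"] / 1024 ** 3))
# True here -- but fragmentation can still defeat the allocation.
print("enough total free for one HPT:", cma["CmaFree"] >= hpt_needed)
```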

Given that the system's uptime was 112 days when the sosreport was generated, it would be unsurprising if fragmentation had occurred! (Relatedly - you're running 4.4.0-109, which does not have the Spectre and Meltdown fixes.)

This issue has come up before - both in a public Canonical-IBM synchronised bug report[1], and with Red Hat[2]. It appears that there is some work within IBM to address this, but it seems to have stalled. I will get in touch with the IBM powerpc kernel team on their public mailing list and ask about the status. I will keep you updated.

In the meantime, I have a potential solution/workaround. By default, 5% of memory is reserved for CMA (kernel source: arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can increase this with a boot parameter, so for example to reserve 10%, you could boot with kvm_cma_resv_ratio=10. This can be set in petitboot. This should significantly reduce the incidence of this issue - perhaps eliminating it entirely - at the cost of locking away more of the system's memory. You would need to experiment to determine the optimal value. Perhaps given that you are seeing the problem only intermittently, a ratio of 7% would be sufficient - that would give you ~35GB of CMA.
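The arithmetic behind the suggested ratios, for the 512GB host in this report:

```python
# How much CMA a given kvm_cma_resv_ratio yields on a 512 GB host.
# 5% is the kernel default; 7% and 10% are the values suggested above.
GIB = 1024 ** 3

def cma_reservation(host_mem_bytes: int, ratio_percent: int) -> int:
    return host_mem_bytes * ratio_percent // 100

host = 512 * GIB
for ratio in (5, 7, 10):
    print("kvm_cma_resv_ratio=%d -> %.1f GB of CMA"
          % (ratio, cma_reservation(host, ratio) / GIB))
```

At 7% this gives about 35.8GB, matching the "~35GB" figure above.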

Please let me know if testing this setting would be an option for you. Please also let me know if you require further information on setting boot parameters with Petitboot.

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

Before we go any further, let's get the basic info here. Apparently there was a sosreport somewhere else, and a link would be good, but, here's what we need here -- at least -- to get started:

1. What is the server model and at least basic config info (I/O cards, firmware level)? Use /proc/meminfo, etc. Attach the syslog and the /var/log/libvirt/qemu logs.

2. What is running on the host (at least uname -a)? From the comment above, it sounds like it's an older fix level, so let's get it updated to the current level (and ensure the problem still exists) before proceeding: there is zero point in trying to figure out whether fixes that are known to exist in 16.04 are in this *particular* build level.

3. What is running on the guests? The exact same OS level? Please attach XML (from virsh dumpxml) for each guest running on the system when the failure occurs (and make a note of which one is from the failing guest). If we are 100% sure that, excepting unique IDs & filenames, the XMLs are identical, then don't attach duplicates.

4. Anything else you think we should know.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-169648 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Luciano Chavez (lnx1138) wrote :

Hello Canonical,

I've subscribed Daniel to this bug so he can help provide additional information on the situation he was working on. I expect he will add more folks from his end if necessary.

Lastly, if you feel the bug should be marked as private for this discussion, please feel free to do that. Thanks.

Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Revision history for this message
Daniel Axtens (daxtens) wrote :

Hi,

This came up in the context of a customer issue. I have asked them if we can share anonymised data here, and I will pass on any response.

From my analysis of the code while working the case, it would seem that you could reproduce this by spinning up and tearing down VMs of varying memory sizes in order to fragment the CMA. It looks like PCI pass-through would exacerbate the issue, although I don't believe this was a factor in this instance.

I wonder if this is fully 'solvable' per se - with memory overcommit it should be easy to simply run out of CMA space - but it should be possible to at least print much more helpful information either from the kernel or from qemu.

Regards,
Daniel
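Daniel's fragmentation hypothesis can be illustrated with a toy model (a sketch only, not the kernel's CMA allocator): after churn from interleaved allocations, total free space can exceed a request while no single contiguous run satisfies it.

```python
# Toy illustration of fragmentation in a fixed-size pool.
def largest_free_run(bitmap):
    """Longest run of free (False) slots in an allocation bitmap."""
    best = cur = 0
    for used in bitmap:
        cur = 0 if used else cur + 1
        best = max(best, cur)
    return best

# 64-slot pool; every other slot stays pinned by still-running guests
# after intervening guests have been torn down.
pool = [False] * 64
for i in range(0, 64, 2):
    pool[i] = True

free_total = pool.count(False)
print("free slots:", free_total)                       # 32 in total
print("largest contiguous run:", largest_free_run(pool))  # only 1
# A request for, say, 4 contiguous slots fails despite 32 free slots.
```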

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-07-12 10:00 EDT-------
Clearly, the general problem of running out of CMA allocatable space is not soluble in the current architecture, anyway. However, this is exactly why we need to know the particular situation at hand to understand this particular customer problem and whether there is something that can be done -- or, depending on what kernel level, has already been done.

Manoj Iyer (manjo)
Changed in linux (Ubuntu):
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
Revision history for this message
Daniel Axtens (daxtens) wrote :

Based on the most recent information we have available to us (2018-05-09):

1. What is the server model and at least basic config info (I/O cards, firmware level)? Use /proc/meminfo, etc. Attach the syslog and the /var/log/libvirt/qemu logs.

I am struggling a bit to determine the server model, but I'm uploading the relevant logs.

2. What is running on the host (at least uname -a)? From the comment above, it sounds like it's an older fix level, so let's get it updated to the current level (and ensure the problem still exists) before proceeding: there is zero point in trying to figure out whether fixes that are known to exist in 16.04 are in this *particular* build level.

Linux apsoscmp-as-a4p 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 20:00:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I don't have any answers for (3); the user has been asked.

Revision history for this message
Daniel Axtens (daxtens) wrote :
Revision history for this message
Daniel Axtens (daxtens) wrote :
Revision history for this message
Daniel Axtens (daxtens) wrote :
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-18 08:38 EDT-------
Thanks, Daniel. Are you confident that the provided logs, etc. are taken from a machine that is currently actually showing the symptoms? The customer does tend to take the view that all machines are the same at all times. I don't want to try to dig into information taken from another machine or even the same machine after rebooting, etc.

Revision history for this message
Daniel Axtens (daxtens) wrote : Re: [Bug 1781038] Comment bridged from LTC Bugzilla

Hi,

I am told that this is the same machine but not while it was currently
showing symptoms - due to the intermittent nature of the problem it
was taken some time later. This matches what I see in the logs so I
have no reason to doubt it.

Regards,
Daniel

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-07-20 19:03 EDT-------
Looking though the qemu logs, I'm not seeing anything obvious, though I will have a colleague look at them also on Monday. I see that even "some time later" we have a system with 512gb of RAM, the usual default 5% in CMA (25gb) and less than 4gb free in the CMA. I'm not sure how to see what's using up the CMA, so I'll ask about that, also.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-23 15:10 EDT-------
I'm still looking. I'm confident that daniel has correctly identified and explained the failure mode (i.e. out of enough CMA to create guests). I also don't *think* we have enough information at the moment to say for sure why this particular memory pool is exhausted; however, I'm still poking around to see what information we *do* need.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-23 16:30 EDT-------
I think the key patch (from LP 1632045) was pulled in as of 4.4.0-51.72. As far as I can tell, the CMA debugging is not in ubuntu (actually, I'm not positive it was even pulled in upstream, but that's less immediately important). I will continue to poke, but I don't think there's a good way even to determine, without a kdump, where that memory is in use. I think the only thing to do is try to work around the problem by devoting more memory to the CMA by specifying a boot parameter like cma=50g for example. That would at least greatly alleviate the problem, though not fix it. To fix it, I think we will need to arrange for a kdump from a failing system and then see exactly where the memory is in use.

I will keep asking my colleagues; however, that's where I am for now.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-24 00:03 EDT-------
There are a couple of points about the CMA and HPT allocations:

1. The HPT is probably getting sized according to the maximum memory of the guest, not the initial amount of memory. Since we haven't been given any specifics about the configuration of the guests, I can't tell whether they are configured with maxmem greater than current/initial memory. If the guests are configured with a very large maxmem, their HPTs will be much larger than the 128MB mentioned in previous comments, and that will obviously make it much more likely to have problems allocating the HPT.

(With a sufficiently recent host kernel, guest kernel and QEMU, the HPT can be resized while the guest is running, and in that case, QEMU determines the size of the HPT from the current memory rather than the maximum memory. However, HPT resizing went into the kernel later than 4.4, and I don't believe it has been backported into the Ubuntu version of 4.4, hence the "probably" in the previous paragraph.)

2. The memory in the CMA zone is not locked away in the way that previous comments imply. Memory in the CMA zone is still available for movable allocations, which includes page cache and anonymous pages for user processes, as well as memory for KVM guests. It is not available for kernel allocations (including things like network packet buffers).

Thus it is worthwhile trying a larger kvm_cma_resv_ratio value in situations like this. When fragmentation occurs, the parts of the CMA zone that are too fragmented to use for HPTs can still be used for running user processes and backing KVM guests.

3. Other relevant factors are whether the guest has any real PCI devices passed through to it, and whether the guest is backed with large pages. If the guest has any PCI devices passed through, then when the guest sets up the DDW (dynamic DMA windows) TCE (iommu) table at boot time, that will have the effect of pinning all the guest memory. Some of the guest memory may have been allocated from the CMA zone. Balbir's patch (which is in the Ubuntu 4.4 kernel now) will try to migrate any pages that are in the CMA zone to somewhere else before they get pinned, but if memory is in short supply, it may not succeed in moving the page out of the CMA zone, and also the patch doesn't cope with large (16M) pages, whether THP or explicit large pages.

Thus it would be worth disabling THP in the host.

I'm not sure whether explicit large pages could ever come from the CMA zone. If the guests were backed by large pages, it would be worth trying without large-page backing.
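Point 1 above can be checked directly from the guest configuration. The XML below is a hypothetical fragment of `virsh dumpxml` output (no guest XML was attached to this bug), showing a guest whose maxMemory greatly exceeds its current memory:

```python
# Check whether a guest's maxMemory exceeds its current memory, which
# (per point 1 above) would inflate the HPT on a 4.4 host without HPT
# resizing. GUEST_XML is a hypothetical example, not from this report.
import xml.etree.ElementTree as ET

GUEST_XML = """\
<domain type='kvm'>
  <name>example-guest</name>
  <maxMemory unit='KiB'>134217728</maxMemory>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
</domain>
"""

root = ET.fromstring(GUEST_XML)

def kib(tag):
    node = root.find(tag)
    return int(node.text) if node is not None else None

max_kib, cur_kib = kib("maxMemory"), kib("currentMemory")
if max_kib and cur_kib and max_kib > cur_kib:
    # Without HPT resizing, the HPT is sized for maxMemory: ~1/128th.
    print("HPT sized for maxMemory: ~%d MB" % (max_kib // 128 // 1024))
```

Here a 16GB guest with a 128GB maxMemory would need a ~1GB HPT rather than 128MB.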

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Triaged → Opinion
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-31 08:56 EDT-------
I believe the customer has asked to close this bug. Canonical, please confirm.

Revision history for this message
Daniel Axtens (daxtens) wrote :

Yes, we have closed the support case on our end at their request. Apparently increasing the reservation ratio has helped.

Paulus - Hi! Thanks for the info and clearing up some of my misunderstandings. Great to hear from you and I hope things are going well at OzLabs :)

Changed in linux (Ubuntu):
status: New → Invalid
Brad Figg (brad-figg)
tags: added: cscc