Default settings for virtlogd result in "too many open files" errors

Bug #1720887 reported by Chris Newcomer
This bug affects 3 people
Affects                       Status   Importance  Assigned to  Milestone
OpenStack Nova Compute Charm  Invalid  Undecided   Unassigned
libvirt (Ubuntu)              Expired  High        Unassigned

Bug Description

When deploying VM instances on nova-compute nodes, the following error occurs after approximately the 250th:

libvirtd[7907]: Unable to open file: /var/lib/nova/instances/######-####-####-####-#########/console.log: Too many open files

When setting "LimitNOFILE=infinity" and restarting virtlogd, the problem goes away.

This was reported under Ocata.

Tags: sts
Chris Newcomer (cnewcomer) wrote :

The workaround for now is:

mkdir -p /etc/systemd/system/virtlogd.service.d/
cat > /etc/systemd/system/virtlogd.service.d/limits.conf << EOF
[Service]
LimitNOFILE=infinity
EOF

systemctl daemon-reload
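
Note that systemctl daemon-reload only reloads the unit definitions; an already running virtlogd keeps its old limit until the service is restarted, as mentioned in the description. A minimal sketch for checking which limit a process is actually running with, assuming a Linux /proc filesystem. The helper name nofile_limits is made up for illustration:

```shell
# nofile_limits FILE
# Print the soft and hard "Max open files" values from a file in
# /proc/<pid>/limits format. (Hypothetical helper, not part of libvirt
# or systemd.)
nofile_limits() {
    # Fields: Max(1) open(2) files(3) <soft>(4) <hard>(5) files(6)
    awk '/^Max open files/ { print $4, $5 }' "$1"
}
```

For example, nofile_limits /proc/$(pgrep virtlogd)/limits, run as root after the restart, prints the soft and hard limit on one line.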

Felipe Reyes (freyes) wrote :

I wonder if this is something we should fix in the libvirt package.

tags: added: sts
James Page (james-page) wrote :

@freyes

Yes definitely!

Changed in charm-nova-compute:
status: New → Invalid
Changed in libvirt (Ubuntu):
status: New → Triaged
importance: Undecided → High
Felipe Reyes (freyes)
Changed in libvirt (Ubuntu):
assignee: nobody → Felipe Reyes (freyes)
assignee: Felipe Reyes (freyes) → nobody
Christian Ehrhardt  (paelzer) wrote :

Some of you knew that I was on PTO for a few days, but to everyone: sorry for the delay.
Yet OTOH this wasn't rocket science enough for anybody else to pick up in the meantime.

I started on it today, so expect updates soon ...

Thanks Chris already for the direct pointer to the workaround/fix in your case.

Christian Ehrhardt  (paelzer) wrote :

We at least have to think about whether infinity is the right value before changing it (and if we change it we want to push that upstream, so some careful thought is useful in any case).

Obviously it removes a blocker for hosts with many guests, but OTOH the limit is meant to protect the system from overload.

The current default limit is 8192 (read from /proc/<pid>/limits of the process).
The process starts with a base of about 15 open FDs and adds 1 per guest.
That would put the limit just short of 8k guests, which for a single host is high enough that I'd be fine with an admin having to opt in by changing a conffile.

But you reported that your cases break around 250 which should be working IMHO.

The comments in the upstream .service file already kind of support my theory that the number should be sufficient.
# Need to have at least one file open per guest (eg QEMU
# stdio log), but might be more (eg serial console logs)
# libvirtd.service written to expect 4096 guests, so if we
# guess at 2 log files per guest here (stdio + 1 serial):
LimitNOFILE=8192

But that also means that if you have many serials by default, you'll exceed it much faster.
Nevertheless, to break around 250 you'd need about 32 serials per guest, which seems a bit much, right?
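
The estimate above is plain integer arithmetic: subtract the base FDs from the limit and divide by the number of FDs each guest adds. A small sketch, where max_guests is a hypothetical helper and the base of 15 is the approximate figure quoted above:

```shell
# max_guests LIMIT BASE PER_GUEST
# Number of guests that fit under LIMIT open files when the daemon itself
# holds BASE descriptors and every guest adds PER_GUEST more.
max_guests() {
    echo $(( ($1 - $2) / $3 ))
}

max_guests 8192 15 1            # one FD per guest: prints 8177
max_guests 8192 15 2            # stdio log + one serial log: prints 4088

# Reverse question: how many FDs per guest would exhaust 8192 at 250 guests?
echo $(( (8192 - 15) / 250 ))   # prints 32
```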

Christian Ehrhardt  (paelzer) wrote :

IIRC Openstack adds two in most cases, like:
    <serial type='file'>
      <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/console.log'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='file'>
      <source path='/var/lib/nova/instances/7c0dcd78-d6b4-4575-a882-ee5d29c64fe0/console.log'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>

Twice the same path btw, but also same alias.
I checked and that still means only 2 files per guest.
The default guest log like:
  /var/log/libvirt/qemu/x-test1.log
And the file redirected console like:
  /tmp/consoleIDX2.log

So under these conditions the assumption of ~4096 should work.
Looking closer, I found that under this config there are also 2 pipes alongside the two logs.
So a single guest looks like:
l-wx------ 1 root root 64 Sep 25 20:48 22 -> /var/log/libvirt/qemu/x-test3.log
lr-x------ 1 root root 64 Sep 25 20:48 23 -> pipe:[5014140]
l-wx------ 1 root root 64 Sep 25 20:48 24 -> /tmp/consoleIDX3.log
lr-x------ 1 root root 64 Sep 25 20:48 25 -> pipe:[5014157]

With 5 guests I had 20 new FDs on the virtlogd service.
That said the limit should still be enough for 1024 guests.
Testing against those numbers now ...

Christian Ehrhardt  (paelzer) wrote :

Yeah I can reach the thousands (which already is kind of dense considering that a real case will not drive the nonsense 64MB guests I use for testing).

So what do we need now - TL;DR:
1. Eventually we want to get the change accepted upstream
2. To get that we need some good data on why 8192 is not sufficient
3. We need to understand how you exceed these numbers with just ~250 guests

@Chris - Please:
1. check the FDs your virtlogd has attached (get its pid and check /proc/<pid>/fd)
2. report how many get added per guest
3. report which ones get added per guest
4. share the generated guest XML (to hopefully show us what is the reason for so many logs)
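
The first three requests can be answered in one go by grouping the symlink targets under /proc/<pid>/fd. A rough sketch, assuming a Linux /proc filesystem; summarize_fds and its category names are invented here for illustration:

```shell
# summarize_fds DIR
# Classify every descriptor in DIR (normally /proc/<pid>/fd) by what it
# points at and print a count per category.
summarize_fds() {
    fd_dir=$1
    for fd in "$fd_dir"/*; do
        # Entries in /proc/<pid>/fd are symlinks; this also skips the
        # unexpanded glob when the directory is empty.
        [ -L "$fd" ] || continue
        target=$(readlink "$fd")
        case $target in
            pipe:*)        echo pipe ;;
            socket:*)      echo socket ;;
            */console.log) echo console-log ;;
            *.log)         echo other-log ;;
            *)             echo other ;;
        esac
    done | sort | uniq -c
}
```

Running summarize_fds /proc/<pid>/fd as root before and after starting a single guest, and comparing the two outputs, shows how many FDs of each kind one guest adds.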

Changed in libvirt (Ubuntu):
status: Triaged → Incomplete
Christian Ehrhardt  (paelzer) wrote :

At 1022 I exceed the limit of my basic bridge, but that is fine I guess.
At this point I use ~4000 files and things are working.

virsh list | tail; sudo ls -laF /proc/2905/fd | wc -l;
 1035 x-test1014 running
 1036 x-test1015 running
 1037 x-test1016 running
 1038 x-test1017 running
 1039 x-test1018 running
 1040 x-test1019 running
 1041 x-test1020 running
 1042 x-test1021 running
 1043 x-test1022 running

4102

So I'm really looking forward to seeing the details of your case and why it causes this (in regard to the virtlogd limit).

Chris Newcomer (cnewcomer) wrote :

This is the data that led me to believe the default setting was low. I gathered this from a customer:

ubuntu@xxxxxxxxx:~$ virsh list | wc -l
212
ubuntu@xxxxxxxxx:~$ sudo lsof -p $(pgrep virtlogd) | wc -l
893
ubuntu@xxxxxxxxx:~$ sudo grep "open files" /proc/$(pgrep virtlogd)/limits
Max open files 1024 4096 files

This is with 212 VMs and when the lsof output gets over 1024, it throws errors.

Thanks,
Chris
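
One caveat when reading these numbers: lsof -p also lists entries that are not file descriptors (cwd, txt, mem, plus a header line), and virsh list prints header lines, so both counts run somewhat high. A sketch that counts only real descriptors, assuming a Linux /proc filesystem; count_open_fds is a hypothetical helper:

```shell
# count_open_fds PID
# Count the entries in /proc/PID/fd, i.e. the descriptors that count
# against the "Max open files" limit. Needs root for processes owned by
# other users.
count_open_fds() {
    # The arithmetic expansion strips any padding wc may add.
    echo $(( $(ls "/proc/$1/fd" | wc -l) ))
}
```

For example, count_open_fds "$(pgrep virtlogd)" gives the FD count, and virsh list --name | grep -c . gives the number of running guests without the table header.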

Christian Ehrhardt  (paelzer) wrote :

That is very interesting, Chris,
as it should not be at 1024/4096 at all.

The limit was not set back in libvirt 2.5 and before, so older versions had a higher limit - that much is true.
I checked and the limits are:
    Artful: Max open files 8192 8192 files
Pre-Artful: Max open files 1048576 1048576 files

But both limits are way above yours.

Your case has ~4 files per guest, but with the right limit that should be enough for close to 2k guests.

I'd suggest two things:
1. I'll discuss upstream to raise the 8192 to 16384 as it seems Openstack (as a common setup) has 4 files per guest.
I'll do that and link the Mailing List entry here

2. We need to find why you have a limit applied even lower than what the service file configures.
   @Chris - can you have a look on that system why that could be so low - any extra limits
    applied somewhere?
    Does the virtlogd service file have the 8192 by default - if it has 1024/4096 what changed it?
    If it has 8192, can you see on the system where the 1024/4096 might come from?
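
A possible way to look for the source of the limit is to search the unit file and any drop-ins for LimitNOFILE. A minimal sketch; find_nofile_settings is a hypothetical helper, and the paths in the usage note are the usual Ubuntu locations, assumed rather than taken from the affected system:

```shell
# find_nofile_settings FILE...
# Print every active (not commented out) LimitNOFILE= line in the given
# unit files or drop-ins, prefixed with the file name.
find_nofile_settings() {
    # -H prints the file name even for a single file; missing files are
    # ignored so an unmatched glob does not hide real matches.
    matches=$(grep -H '^[[:space:]]*LimitNOFILE=' "$@" 2>/dev/null || true)
    if [ -n "$matches" ]; then
        printf '%s\n' "$matches"
    else
        echo "no LimitNOFILE directive found"
    fi
}
```

For example: find_nofile_settings /lib/systemd/system/virtlogd.service /etc/systemd/system/virtlogd.service.d/*.conf. If nothing is found, that would at least be consistent with 1024/4096, which is what a service typically gets when no LimitNOFILE is set at all; that is a guess, not something confirmed in this report. systemctl show -p LimitNOFILE virtlogd.service reports the value systemd itself applies.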

Christian Ehrhardt  (paelzer) wrote :

Submitted a change upstream to increase the limit based on what we found (4 files per guest).
Also we had similar discussions for disks on s390x, so doing those in one shot.

Reference: https://www.redhat.com/archives/libvir-list/2017-October/msg00735.html

Further on I'm waiting for some data/insight on why the original system has a lower limit than what the service configures.

Chris Newcomer (cnewcomer) wrote :

Christian,

The customer that I was working with has changed their environment, so I cannot go back to them. I will have to deploy openstack in our test lab to verify the default settings. It will take some time for me to get this checked out. I will reply back when I get it running. Thanks.

Christian Ehrhardt  (paelzer) wrote :

Thanks Chris,
until then I'll continue bringing slightly increased limits upstream based on what we learned here.

Christian Ehrhardt  (paelzer) wrote :

FYI - my upstream change to raise the limits got accepted.
But that doesn't help in your case, where you were lower than the limit set by the service.
So no change to this bug.

Launchpad Janitor (janitor) wrote :

[Expired for libvirt (Ubuntu) because there has been no activity for 60 days.]

Changed in libvirt (Ubuntu):
status: Incomplete → Expired