nl_recv returned with error: No buffer space available

Bug #2025647 reported by Roberto Bartzen Acosta
This bug affects 1 person
Affects: libvirt (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We are seeing the following error when creating VMs:

Jun 28 19:33:29 SR-0001 libvirtd[5289]: nl_recv returned with error: No buffer space available

This problem occurs in deployments with different OS versions.

Ubuntu 20.04.4 LTS
ii libnl-3-200:amd64 3.4.0-1
ii libvirt-daemon 6.0.0-0ubuntu8.16

Ubuntu 22.04 LTS
ii libnl-3-200:amd64 3.5.0-0.1
ii libvirt-daemon 8.0.0-1ubuntu7.5

The host has around 40 native interfaces (bonds, VLANs, tunnel bridges, mgmt bridges, physical NICs, etc.) plus 40 or more tap interfaces (depending on the number of VMs allocated); we observed errors once more than 40 tap interfaces were added to the host.

I saw with strace that libvirt receives messages larger than 128k (one message of approximately 170k). The MSG_PEEK mechanism appears to be working, but it may fail in some cases.

In a "good received flow" of incoming messages I see something like the log below, multiple messages for the same sequence number until an error occurs.

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1336, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},....

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 2840

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1420, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},.....

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 30740

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1420, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},....

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 31292

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1772, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},....

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 31984

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1716, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517}, ...

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 31296

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},...

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC},
MSG_PEEK|MSG_TRUNC) = 31280

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517}, ...

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 24124

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517},

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1688146694, nlmsg_pid=3840517}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20

3840517 sendmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=48, nlmsg_type=RTM_NEWQDISC, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE, nlmsg_seq=1688146695, nlmsg_pid=0}, {tcm_family=AF_UNSPEC, tcm_ifindex=if_nametoindex("tapf28271ff-34"), tcm_handle=0, tcm_parent=4294967295, tcm_info=0}, [{nla_len=12, nla_type=TCA_KIND}, "noqueue"]], iov_len=48}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 48

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36

3840517 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=NLMSG_ERROR, nlmsg_flags=NLM_F_CAPPED, nlmsg_seq=1688146695, nlmsg_pid=3840517}, {error=0, msg={nlmsg_len=48, nlmsg_type=RTM_NEWQDISC, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE, nlmsg_seq=1688146695, nlmsg_pid=0}}], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36
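
For reference, the pattern visible in the trace above is the standard two-step netlink read: a recvmsg() with an empty iovec and MSG_PEEK|MSG_TRUNC returns the size of the next queued message without consuming it, and the follow-up recvmsg() reads it for real. A minimal C sketch of that pattern (the helper name and error handling are illustrative, assuming an already-connected NETLINK_ROUTE socket fd):

#include <stdlib.h>
#include <sys/uio.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Illustrative helper: peek the size of the next netlink message,
 * then read it into a buffer of exactly that size. Returns bytes
 * read, or -1 with errno set -- ENOBUFS ("No buffer space available")
 * means the kernel already dropped messages on this socket. */
static ssize_t netlink_peek_then_read(int fd, void **out)
{
    struct sockaddr_nl nladdr;
    struct iovec iov = { .iov_base = NULL, .iov_len = 0 };
    struct msghdr msg = {
        .msg_name = &nladdr, .msg_namelen = sizeof(nladdr),
        .msg_iov = &iov, .msg_iovlen = 1,
    };

    /* Step 1: the "MSG_PEEK|MSG_TRUNC ... = 30740" calls in the trace:
     * the return value is the real message size, nothing is dequeued. */
    ssize_t len = recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
    if (len < 0)
        return -1;

    void *buf = malloc(len);
    if (!buf)
        return -1;

    /* Step 2: read the message for real, now that the buffer fits. */
    iov.iov_base = buf;
    iov.iov_len = len;
    len = recvmsg(fd, &msg, 0);
    if (len < 0) {
        free(buf);
        return -1;
    }

    *out = buf;
    return len;
}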

There are some abandoned discussions, as pointed out in [1], [2], and [3], about MSG_PEEK and having libnl dynamically allocate larger buffers. libvirt already enables MSG_PEEK at startup and sets the default buffer to 128k, but even so, messages like this are still logged. What could the problem be? Any ideas?
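
For reference, that startup setup corresponds to two libnl-3 calls. A minimal sketch of the equivalent socket setup (the wrapper function is illustrative, not libvirt's actual source; the libnl API names are real):

#include <netlink/netlink.h>
#include <netlink/socket.h>

/* Illustrative wrapper: open a NETLINK_ROUTE socket the way the bug
 * description assumes libvirt does -- 128k rx buffer plus peeking. */
static struct nl_sock *open_route_socket(void)
{
    struct nl_sock *sk = nl_socket_alloc();
    if (!sk)
        return NULL;

    if (nl_connect(sk, NETLINK_ROUTE) < 0)
        goto err;

    /* Grow the kernel-side rx socket buffer to 128k (libnl's default
     * is 32k); 0 leaves the tx buffer at its default. */
    if (nl_socket_set_buffer_size(sk, 131072, 0) < 0)
        goto err;

    /* Tell nl_recv() to peek first and size its user-space buffer to
     * the pending message instead of reading into a fixed buffer. */
    nl_socket_enable_msg_peek(sk);
    return sk;

err:
    nl_socket_free(sk);
    return NULL;
}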

This problem does not occur when we have few interfaces on the hypervisor; could it be a scaling problem with libnl?

Regards,
Roberto

[1] - https://listman.redhat.com/archives/libvir-list/2017-July/msg00566.html
[2] - https://<email address hidden>/msg149554.html
[3] - https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1794997

Revision history for this message
Roberto Bartzen Acosta (rbartzen) wrote :

Netlink drops are observed when new VMs are created (note the non-zero Drops column below).

cat /proc/net/netlink
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
ff411bb5d82fc000 0 3015708447 00000551 0 0 0 2 47 187536
ff411bb5da735000 15 3749713065 00000001 0 0 0 2 3806 142072
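
For context: the Drops column counts messages the kernel discarded because the socket's receive queue was full, and once that happens the next recvmsg() on the socket fails with ENOBUFS, i.e. "No buffer space available". One low-level mitigation (illustrative only, not something libvirt does itself) is forcing a receive buffer larger than the net.core.rmem_max cap:

#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: enlarge the kernel receive buffer of an open
 * netlink socket fd. SO_RCVBUFFORCE ignores the net.core.rmem_max
 * cap but requires CAP_NET_ADMIN. */
static int netlink_force_rcvbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
                   &bytes, sizeof(bytes)) < 0) {
        perror("setsockopt(SO_RCVBUFFORCE)");
        return -1;
    }
    return 0;
}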

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Hi Roberto,

Thanks for taking the time to report this bug and trying to make Ubuntu better.

As you mentioned, this issue does not seem to be reproducible locally (at least I can't reproduce it), which makes it harder to understand what is going on. I went through the links you added, and from this LP bug:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1794997

This does not seem to be a libvirt bug but a libnl one. I'll try to find some guidance within the Canonical Server team and we will get in touch here.

Could you share how exactly you are creating the VMs?

Revision history for this message
Roberto Bartzen Acosta (rbartzen) wrote (last edit ):

Hi Lucas,

Thank you for looking at this report.

VMs are created via OpenStack. The problem occurs with both Yoga and Ussuri OpenStack deployments (with the package versions listed in the bug description).

Via OpenStack, these VMs are associated with Networks, Subnets, Security Groups, key pairs, etc.

In my tests, after creating the VMs, I run iperf from one VM to another, which loads the hypervisor's CPU and memory.

After a few VMs are created and there is some load on the host, I see the buffer error messages.

I believe that it is not a problem with libvirt itself, but with the memory allocation that libnl makes during the reception of messages in the netlink event callback.
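
To make that code path concrete: libnl allocates the receive buffer inside nl_recv(), which runs every time the event socket becomes readable. A minimal sketch of that receive path (the callback and function names are illustrative; the libnl-3 API calls are real):

#include <stdio.h>
#include <netlink/netlink.h>
#include <netlink/socket.h>
#include <netlink/msg.h>

/* Illustrative callback, invoked once per valid netlink message. */
static int on_link_event(struct nl_msg *msg, void *arg)
{
    (void)arg;
    nl_msg_dump(msg, stdout);   /* a real consumer would parse RTM_* here */
    return NL_OK;
}

/* nl_recvmsgs_default() -> nl_recv() is where libnl peeks and then
 * malloc()s a buffer sized to the pending message; an ENOBUFS from
 * recvmsg() surfaces here as a negative NLE_* return code. */
static int drain_events(struct nl_sock *sk)
{
    nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM, on_link_event, NULL);
    return nl_recvmsgs_default(sk);
}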

I changed a memory reservation parameter for VMs and noticed that the problem no longer occurred (as far as I've tested it), but that doesn't guarantee it can't occur in the future.

sysctl vm.min_free_kbytes
vm.min_free_kbytes = 4194303

I changed this to vm.min_free_kbytes = 8192 (i.e., from roughly 4 GiB reserved down to 8 MiB)...

Do you have any idea of additional parameters that need to be set/tuned so that libvirt's memory allocation can work correctly?

Additional memory footprint info is in the attached file.

Regards,
Roberto

Revision history for this message
Paride Legovini (paride) wrote :

Hello Roberto,

I am not the subject matter expert here, but I'd say this is something that depends on the load OpenStack / libvirt has to handle. Maybe that kind of tuning is documented in those projects?

I'm marking this report as Incomplete for now, to encourage further discussion, but I don't believe we have a bug in Ubuntu here.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
as the others said, thanks for reporting this and for also looking at it from the right angle, as IMHO this is correct:

> I believe that it is not a problem with libvirt itself, but with the memory allocation that libnl makes during the reception of messages in the netlink event callback.

Yeah, but the question for me right now is: "is it a bug, or just too much pressure with no other choice?"

Just a summary (for my own brain next time I come by):
- read [1]
- all that is still true, a history trace would be
  - 1. libnl: systems grew, grow initial buffer [2]
  - 2. libnl: peeking by default [3]
  - 3. libvirt: even more buffer + peeking [4]
- All of that has been in Ubuntu for quite a while; the versions you use have it

> I changed a memory reservation parameter for VMs and
> noticed that the problem no longer occurred (as far
> as I've tested it), but that doesn't guarantee it
> can't occur in the future.
> sysctl vm.min_free_kbytes
> vm.min_free_kbytes = 4194303
> I changed this to vm.min_free_kbytes = 8192 KBytes...

So you reduced vm.min_free_kbytes from ~4 GiB to 8 MiB, causing the kernel to keep far less memory available for critical internal allocations. That sounds like asking for trouble in other cases. And if it really helped, then IMHO that means your system might have been running right on the edge of dying from an OOM.
This is only the starting value used to calculate the low/min/high watermarks per zone/node; see /proc/zoneinfo for the effective outcome on a huge system. I have personally always been wary of large systems consuming memory as fast as they want and then hitting a sudden wall (= all those limits being low), and I often wished that at least low/high would not start out so small, e.g. so CPU cycles get dedicated to scanning and early swapping and memory can be discarded later if needed. Anyway, I'm starting to ramble - this isn't your problem .. :-)

This case is bad because there isn't much we can do to help here; let me outline the options I see:
- a) A battle of numbers: your system might simply exceed the defaults
- b) An actual bug in libnl; maybe a newer version has a fix to test
- c) The load is just too much for the system to handle
Let me go into my thoughts on each of those:

# (a) Battle of big numbers

The need for [4] was triggered by systems with many CPUs and many interfaces. You have many interfaces, and what counted as "many CPUs" in 2016 is normal now; you might have far more. In any case, peeking should fix that unless the initial response by the kernel exceeds the initial buffer (ref [1]).
So it's unlikely, but maybe the setup and size of your system are now enough that even the 128k of [4] doesn't suffice for the very first message, before peeking triggers reallocation.
Pro:
- If that were our theory, we could provide a test build with a bigger default buffer for you to try
Con:
- Then changing vm.min_free_kbytes should not have helped
- With peeking, the size shouldn't matter at all (just a slowdown if it isn't big enough), as libnl would realloc whenever a message doesn't fit

# (b) Bug in libnl

The wording is a bit weak: does setting the buffer + enabling peeking actually peek?
For example [5] could be read as "nl_socket_enable_msg_peek+nl_s...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

Something that quickly gave me the libc realloc calls, including their size argument, at relatively low overhead was:

$ git clone https://github.com/brendangregg/perf-tools.git
$ cd perf-tools
$ sudo chmod +x /lib/x86_64-linux-gnu/libc.so.6
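# %di/%si are realloc()'s first two args (ptr, size) in the x86-64 SysV ABI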
$ sudo ./bin/uprobe -H 'p:libc:realloc %di %si' | grep libvirtd

# you could do similar with perf trace looking at brk/mmap
$ sudo apt install linux-tools-common linux-tools-generic
$ perf trace

You might need to adapt this to your needs, e.g. restricting to your libvirt PID/thread instead of using grep (first check with the commands above how/if PIDs fluctuate throughout your load).
