LTC Test- Ubuntu18.04: Starting the guest with network filter defined will fail with "cause is unknown".

Bug #1758037 reported by bugproxy on 2018-03-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
High
Canonical Server Team
libvirt
Fix Committed
Undecided
libvirt (Ubuntu)
Low
Ubuntu on IBM Power Systems Bug Triage
Xenial
Undecided
Unassigned
Artful
Undecided
Unassigned
Bionic
Undecided
Unassigned

Bug Description

[Impact]

 * nwfilters were not usable if configured to use dhcp based learning

 * Fix by backporting upstream bug

[Test Case]

 * Add the following to the interface section of a guest description in
   libvirt:
     <filterref filter='clean-traffic'>
       <parameter name='CTRL_IP_LEARNING' value='dhcp'/>
     </filterref>
    Then start the guest.

    Bad case:
    error: Failed to start domain VM1
    error: An error occurred, but the cause is unknown

    Fixed:
    Guest starts and works.

[Regression Potential]

 * I thought a while on this. On first sight one might say there is a
   regression risk due to increasing the size of the buffer. This risk
   would arise on hyperscale environments where the memory consumption per
   guest would increase by 2*128Kb*#guest-interfaces (not much, but can
   sum up on MANY guests).
   But then I realized that this is only true for the use case using
   dhcpsnoop which is
   a) clearly not the most common case
   b) failing to work at all before this fix
   So there can't be anyone today with a working setup that then runs OOM,
   due to the setup either not using the feature (=no change) or failing
   missing this fix.
   So I actually think this mem consumption increase is not an issue in
   terms of SRU considerations.
   Due to that the only remaining regression would be users that had a
   self-built libpcap without TPACKET_V3 to drive a workload like the
   above, and even then only the rather small size bump is what changes.

[Other Info]

 * I have added this case and a few deeper checks on the created rules for
   iptables to the regression tests

---

== Comment: #2 - Mallesh N. Koti <email address hidden> - 2018-02-28 05:02:49 ==

Guest Xml

=======
ISSUE
=======
Defining a network filter and Starting a VM with this nwfiter in VM's xml is failing with "cause is unknown".

==================
Recreation Steps
==================

1. Define a network filter as:
  virsh nwfilter-define filter.xml

2. Add nwfilter in guest xml and start guest.
  virsh start VM1

It fails with :
# virsh start VM1
error: Failed to start domain VM1
error: An error occurred, but the cause is unknown

XML used for defining network filter:
```<?xml version='1.0' encoding='UTF-8'?>
<filter chain="root" name="clean-traffic" priority="">
  <uuid>11111111-b071-6127-b4ec-111111111111</uuid>
  <filterref filter="no-mac-spoofing" /><filterref filter="no-ip-spoofing" /><filterref filter="allow-incoming-ipv4" /><filterref filter="no-arp-spoofing" /><filterref filter="qemu-announce-self" /><rule action="accept" direction="out" priority="-650" statematch="None">
    <mac protocolid="ipv4" /></rule><rule action="accept" direction="inout" priority="-500" statematch="None">
    <mac protocolid="arp" /></rule></filter>
```

will be attaching the guest xml

The issue happens with Ubuntu 18.04 host - where not able to start the guest with network defined with value dhcp.

<parameter name='CTRL_IP_LEARNING' value='dhcp'/>
.
Found following commit is not there in 18.04 Ubuntu source. There could be some dependent commit too. we are facing some build issue and hence not able to verify it.
.
https://github.com/libvirt/libvirt/commit/e62cb4a9b78c7f4499a206635fb4f06e6ac627e5
.

Starting with F27, the Fedora-only patch that disabled TPACKET_V3 support was removed with the comment:

   Drop TPACKET_V3 patch as it should be fixed in kernel by now

This is apparently not the case since, on a host with F27 packages, including libpcap-1.8.1-6 and kernel 4.15.3-300, the previously-functional libvirt code that uses libpcap to watch for DHCP traffic now fails when pcap_setfilter() returns EBADF.

Here is the excerpt of libvirt code (from the file src/nwfilter/nwfilter_dhcpsnoop.c):

    [...]
    handle = pcap_create(ifname, pcap_errbuf);

    if (handle == NULL) {
        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                       _("pcap_create failed"));
        goto cleanup_nohandle;
    }

    if (pcap_set_snaplen(handle, PCAP_PBUFSIZE) < 0 ||
        pcap_set_buffer_size(handle, PCAP_BUFFERSIZE) < 0 ||
        pcap_activate(handle) < 0) {
        virReportError(VIR_ERR_INTERNAL_ERROR,
                       _("setup of pcap handle failed: %s"),
                       pcap_geterr(handle));
        goto cleanup;
    }

    if (pcap_compile(handle, &fp, ext_filter, 1, PCAP_NETMASK_UNKNOWN) != 0) {
        virReportError(VIR_ERR_INTERNAL_ERROR,
                       _("pcap_compile: %s"), pcap_geterr(handle));
        goto cleanup;
    }

    if (pcap_setfilter(handle, &fp) != 0) { <=== FAILURE HERE
        virReportError(VIR_ERR_INTERNAL_ERROR,
                       _("pcap_setfilter: %s"), pcap_geterr(handle));
        goto cleanup_freecode;
    }
    [...]

If I add the patch titled "pcap-linux: don't use TPACKETV3 for memory mmapped
 capture" back to the F27 build of libpcap (built locally with fedpkg) and install the resulting rpm, the same code magically begins to work.

I haven't checked rawhide, but I assume the behavior is the same.

I think Fedora needs to re-disable TPACKET_V3

TPACKET_V3 was enable because it is a vital feature for network packet analyzers. I don't want to disable it again, because it will trigger other bugs (usually it is about packet drops).

Can you recommend how we should modify the libvirt code that uses pcap to avoid this regression in pcap's behavior? (while maintaining compatibility with other/older distros that haven't enabled TPACKET_V3)

Switching the needinfo to the assignee(myself).

Could you please provide us with some steps to reproduce? Something about how to compile/run the code in order to trigger this bug, so that we can properly examine it.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-165229 severity-high targetmilestone-inin---
bugproxy (bugproxy) wrote : guest.xml

------- Comment (attachment only) From <email address hidden> 2018-03-22 07:42 EDT-------

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → libvirt (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Server Team (canonical-server)
tags: added: triage-g

TL;DR:
- test needs no change to filter
- did this work ever before (happend on 3.6 as well for me)?
- if you retest use >= 4.0.0-1ubuntu6
- did you see bad file descriptor messages, does it help you to see what we need?

After checking the default rules, the only thing your case adds tothe default "clean-traffic" rule is:
   <filterref filter="no-other-l2-traffic" />
This statement was in there by default on older versions IIRC.
But I realized this isn't needed - the default works as well to trigger the issue.

And on the guest the interface gets
 <filterref filter='clean-traffic'>
   <parameter name='CTRL_IP_LEARNING' value='dhcp'/>
 </filterref>

That guest change is enough to trigger the reported error.
I can trigger the same on 3.6.0-1ubuntu6.2 btw did this work for you in former releases or is this a new test?
This is interesting as the offending patch your fix is referring to lists 3.9

Did you see that message when you trigger the issue?
libvirtd[3593]: Warning: Kernel filter failed: Bad file descriptor
libvirtd[3593]: 2018-03-22 12:59:59.774+0000: 13506: error : virNWFilterSnoopDHCPOpen:1133 : internal error: pcap_setfilter: can't remove kernel filter: Bad file descriptor

Maybe that helps you to spot the right fix?

I can at least help you with the build issues.
I have realized your test is rather old 2018-02-28
I have packaged quite a lot of stable fixes in the meantime.
This does - among others - container the patch you referred to.

So if you retest with libvirt >= 4.0.0-1ubuntu6 you have the requested change.
Unfortunately in my sniff test this didn't help - but then I never used that in my environment, maybe it is good in your which might be prepared for this functionality.

Incomplete until retested with the version containing the suggested fix.

Changed in libvirt (Ubuntu):
status: New → Incomplete
Changed in ubuntu-power-systems:
status: New → Incomplete

If you have a system with libvirt and virt-manager installed, you can just create a virtual machine, then edit its configuration with "virsh edit" to att the following "<filterref>" section to its <interface>, then attempt to start the virtual machine:

    <interface type='network'>
      <source network='default'/>
      <filterref filter='clean-traffic'>
        <parameter name='CTRL_IP_LEARNING' value='dhcp'/>
        <parameter name='DHCPSERVER' value='192.168.122.1'/>
      </filterref>
    </interface>

If you've installed the debuginfo for libvirt, you can set a breakpoint on the function virNWFilterSNoopDHCPOpen() (which is the source of the bit of code I posted above).

I can provide more detailed step-by-step instructions if you need it, but didn't want to pollute the BZ with a long treatise on installing virt stuff if you already had it on.

If you need any assistance while doing this, you can find me (and several other libvirt developers) on IRC in the #virt channel on irc.oftc.net.

Created attachment 1413663
Libpcap patch that logs debug information and bypasses the problem

Thanks for the reproducer. Just for the record, it is necessary to install libvirt-daemon-config-nwfilter package before this starts to work. Then you can attach to running libvirtd using gdb, set the breakpoint and try to start the virtual machine.

The real name of the function is: virNWFilterSnoopDHCPOpen , it took me a while to notice the capital N in SNoop :)

The bug itself is triggered in the function:
pcap_setfilter_linux ->
pcap_setfilter_linux_common ->
reset_kernel_filter

where setsockopt is called with handle->fd as an argument, which is -1 in this case. I am not sure yet, where the -1 came from in this file descriptor, but surrounding this function with a condition bypasses the problem and my VM starts just fine. Though I don't know if the libvirt filter works, because I don't know what it is supposed to do. I attach a patch for libpcap, that can be used for debugging. It is not meant as a fix, rather for other maintainers to try it.

(In reply to Martin Sehnoutka from comment #7)
> Created attachment 1413663 [details]
> Libpcap patch that logs debug information and bypasses the problem
>
> Thanks for the reproducer. Just for the record, it is necessary to install
> libvirt-daemon-config-nwfilter package before this starts to work.

Interesting. That should have been installed as a dependency. I'll look into why it's missing. Was your test platform F28, or something else?

>
> The real name of the function is: virNWFilterSnoopDHCPOpen , it took me a
> while to notice the capital N in SNoop :)

Oops! Sorry about that - that will teach me to type instead of paste :-)

> I attach a
> patch for libpcap, that can be used for debugging. It is not meant as a fix,
> rather for other maintainers to try it.

For the sake of gathering information, I'll build with this patch applied and try it to see if libvirt's functionality is still there (what's supposed to happen is that libvirt will monitor all DHCP traffic, and grab the IP address of the guest out of the DHCP response, then use that IP to build additional iptables rules).

(In reply to Laine Stump from comment #8)
> Was your test platform F28, or something else?
>

Fedora 27 Workstation

------- Comment From <email address hidden> 2018-04-18 06:49 EDT-------
Hi Seeteena,

Retested in Ubuntu18.04, with libvirt level :
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1

But still the same issue.

Ok, so it fails for both of us.
I currently have not debugged the reasons yet.

Was that a test you ran before and it worked on older versions?

You initially suggested a fix that I made clear is already in, if you happen to know other fixes that could be needed here please let me know.

I tested this on Xenial, Artful and Bionic.

In Bionic and Artful it breaks as shown before:
  error: Failed to start domain bg
  error: An error occurred, but the cause is unknown
In Xenial as well - it just takes much longer.

So this is at least not a regression.

Note: the feature itself is rather old, since [1] (2012)

Still all I see is:
Apr 20 09:14:21 b libvirtd[4207]: Warning: Kernel filter failed: Bad file descriptor
Apr 20 09:14:21 b libvirtd[4207]: 4379: error : virNWFilterSnoopDHCPOpen:1130 : internal error: pcap_setfilter: can't remove kernel filter: Bad file descriptor

With debug enabled the log does not hold more info.

The code in that area prepares (compile) and sets the filters.
The handle is created with handle = pcap_create(ifname, pcap_errbuf);
And several operations succeed on that handle before the pcap_setfilter fails eventually.

I wondered if this might be due to my KVM-in-LXD setup and tried X/B on Bare metal, to the same effect/error.

For this case one would need to debug libvirtd to find what is going on, I'm not sure yet this works/worked anywere (since even latest release is broken atm).

For now I'm waiting on a confirmation if there is a known fix (or not) before considering debugging it deeper. If there is no known fix an outline of the important/urgency from the reporters POV would be nice. Given that this seems to never worked I'll rate it rather low for now.

[1]: https://www.redhat.com/archives/libvir-list/2012-May/msg01234.html

Changed in libvirt (Ubuntu):
status: Incomplete → Confirmed
importance: Undecided → Low
Changed in ubuntu-power-systems:
status: Incomplete → Confirmed
Download full text (8.0 KiB)

Based on "virNWFilterSnoopDHCPOpen:1130 : internal error: pcap_setfilter: can't remove kernel filter: Bad file descriptor"

Tracking some details in virNWFilterSnoopDHCPOpen with gdb and debug symbols.
pcap_setfilter being the failing call.

The stack to the breaking call looks pretty much as expected:
#0 0x00007fae3c457850 in pcap_setfilter () from /usr/lib/x86_64-linux-gnu/libpcap.so.0.8
#1 0x00007fae3c693e31 in virNWFilterSnoopDHCPOpen (ifname=0x7fae2c01f010 "vnet0", mac=mac@entry=0x7fae2c004fcf, filter=<optimized out>,
    dir=PCAP_D_IN) at ../../../src/nwfilter/nwfilter_dhcpsnoop.c:1128
#2 0x00007fae3c6954ea in virNWFilterDHCPSnoopThread (req0=0x7fae2c004f70) at ../../../src/nwfilter/nwfilter_dhcpsnoop.c:1387
#3 0x00007fae59fe5ad2 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#4 0x00007fae5968f6db in start_thread (arg=0x7fae0dd02700) at pthread_create.c:463
#5 0x00007fae593b888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The arguments seem reasonable:
virNWFilterSnoopDHCPOpen (ifname=0x7fae2c01f010 "vnet0", mac=mac@entry=0x7fae2c004fcf,
    filter=0x7fae3c6aaaea "dst port 67 and src port 68", dir=PCAP_D_IN)

pcap_compile does not complain and make a bpf program as it seems.
fp => {bf_len = 24, bf_insns = 0x1a042e0b60b57e00}
pcap_compile(handle, &fp, ext_filter, 1, PCAP_NETMASK_UNKNOWN)
fp => {bf_len = 28, bf_insns = 0x7fadf0254760}

ext_filter got generated as "dst port 67 and src port 68 and ether src 52:54:00:e7:04:a2"

The pcap handle looks at least not totally broken either:
p *p
$1 = {read_op = 0x7fae3c44d8c0 <pcap_read_linux>, next_packet_op = 0x0, fd = -1, selectable_fd = -1, bufsize = 576, buffer = 0x0, bp = 0x0,
  cc = 0, break_loop = 0, priv = 0x7fadf4020810, swapped = 0, rfile = 0x0, fddipad = 0, next = 0x0, version_major = 0, version_minor = 0,
  snapshot = 576, linktype = 1, linktype_ext = 0, tzoff = 0, offset = 6, activated = 1, oldstyle = 0, opt = {device = 0x7fadf40905a0 "vnet0",
    timeout = 0, buffer_size = 131072, promisc = 0, rfmon = 0, immediate = 0, tstamp_type = -1, tstamp_precision = 0}, pkt = 0x0,
  direction = PCAP_D_INOUT, bpf_codegen_flags = 1, fcode = {bf_len = 0, bf_insns = 0x0},
  errbuf = "can't mmap rx ring: Invalid argument", '\000' <repeats 220 times>, dlt_count = 0, dlt_list = 0x0, tstamp_type_count = 0,
  tstamp_type_list = 0x0, tstamp_precision_count = 0, tstamp_precision_list = 0x0, pcap_header = {ts = {tv_sec = 0, tv_usec = 0}, caplen = 0,
    len = 0}, activate_op = 0x7fae3c450e80 <pcap_activate_linux>, can_set_rfmon_op = 0x7fae3c44fd20 <pcap_can_set_rfmon_linux>,
  inject_op = 0x7fae3c450b40 <pcap_inject_linux>, setfilter_op = 0x7fae3c44e790 <pcap_setfilter_linux>,
  setdirection_op = 0x7fae3c44f2b0 <pcap_setdirection_linux>, set_datalink_op = 0x7fae3c44d390 <pcap_set_datalink_linux>,
  getnonblock_op = 0x7fae3c457730 <pcap_getnonblock_fd>, setnonblock_op = 0x7fae3c4577a0 <pcap_setnonblock_fd>,
  stats_op = 0x7fae3c44e860 <pcap_stats_linux>, oneshot_callback = 0x7fae3c456430 <pcap_oneshot>,
  cleanup_op = 0x7fae3c44f3c0 <pcap_cleanup_linux>}

pcap_setfilter fixes up the program format and sets can_filter_in_kernel which sho...

Read more...

After discussion with upstream on this we found that this also affects other Distributions.
TL;DR: when libpcap has TPACKET_V3 enabled there is a weird incompatibiity between libpcap the kernel & libvirt

Linking the BZ to this bug.

Hi,
I wanted to let you know that this problem is not only affecting Fedora.
I happened to debug [1] leading the same conclusions on the fd being -1 breaking the latter setsockopt.

I asked on #virt and fortunately danpb was aware of this bug and pointed me here.
Ubuntu and Debian have no disabled TPACKET_V3 (never had) so it seems about all releases I could test are affected by this bug.

Glad I found this bug to participate here and to be able to track your further insights as well.

[1]: https://bugs.launchpad.net/libvirt/+bug/1758037

Changed in libvirt:
importance: Unknown → Undecided
status: Unknown → Confirmed

Debugging pcap_activate and pcap_create with and without
  <parameter name='CTRL_IP_LEARNING' value='dhcp'/>

TL;DR:
- Bad Case: fd stays at -1 on pcap_activate
- Good case: pcap_activate sets a valid fd = 27

They use very different paths.
Bad Case:
1.
pcap_create(ifname, pcap_errbuf);
  ifname = "vnet0"
  pcap_errbuf is a local char* of [PCAP_ERRBUF_SIZE];
2. pcap_set_snaplen(handle, PCAP_PBUFSIZE)
   PCAP_PBUFSIZE = 576 by nwfilter_dhcpsnoop.c:237
3. pcap_set_buffer_size(handle, PCAP_BUFFERSIZE)
   PCAP_BUFFERSIZE (128 * 1024) by nwfilter_dhcpsnoop.c:259
4. pcap_activate(handle)

Good Case:
- doesn't even hit virNWFilterSnoopDHCPOpen.
- It reaches the pacp functions via:

#0 pcap_activate (p=p@entry=0x7f3234000bb0) at ./pcap.c:775
#1 0x00007f3290684b4c in pcap_open_live (device=device@entry=0x7f3280017768 "vnet0", snaplen=snaplen@entry=8192, promisc=promisc@entry=0,
    to_ms=to_ms@entry=500, errbuf=errbuf@entry=0x7f3284b9fb70 "") at ./pcap.c:840
#2 0x00007f32908cb575 in learnIPAddressThread (arg=0x7f3280017760) at ../../../src/nwfilter/nwfilter_learnipaddr.c:413
#3 0x00007f32adaa9ad2 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#4 0x00007f32ad1536db in start_thread (arg=0x7f3284ba0700) at pthread_create.c:463
#5 0x00007f32ace7c88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Maybe this does something different in between?

It does
1. pcap_create(device, errbuf);
same device string
same size of errbuf
2. pcap_set_snaplen(p, snaplen);
len is BUFSIZ which is stdio.h _IO_BUFSIZ (that is 8192 by
/usr/include/x86_64-linux-gnu/bits/_G_config.h:61:#define _G_BUFSIZ 8192)
3. pcap_set_promisc(p, promisc)
 with promisc being 0
4. pcap_set_timeout(p, to_ms)
 with timeout being 500
5. pcap_activate(p);
Before that p->oldstyle = 1; gets set.

Following comment for that assignment:
        /*
         * Mark this as opened with pcap_open_live(), so that, for
         * example, we show the full list of DLT_ values, rather
         * than just the ones that are compatible with capturing
         * when not in monitor mode. That allows existing applications
         * to work the way they used to work, but allows new applications
         * that know about the new open API to, for example, find out the
         * DLT_ values that they can select without changing whether
         * the adapter is in monitor mode or not.
         */

Could it be that the libvirt usage just is "oldstyle"?

I'm now trying to recreate that against a local build of master to patch the differences one by one hopefully finding the weak spot we could fix.

Ok, repro with local build works.

I tested the buffer sized, the other calls that pcap_open_live does and so on.
Eventually I minimized the changes and found the issue to go away with:

diff --git a/src/nwfilter/nwfilter_dhcpsnoop.c b/src/nwfilter/nwfilter_dhcpsnoop.c
index 6069e70..c12ba9c 100644
--- a/src/nwfilter/nwfilter_dhcpsnoop.c
+++ b/src/nwfilter/nwfilter_dhcpsnoop.c
@@ -1115,7 +1115,6 @@ virNWFilterSnoopDHCPOpen(const char *ifname, virMacAddr *mac,
     }

     if (pcap_set_snaplen(handle, PCAP_PBUFSIZE) < 0 ||
- pcap_set_buffer_size(handle, PCAP_BUFFERSIZE) < 0 ||
         pcap_activate(handle) < 0) {
         virReportError(VIR_ERR_INTERNAL_ERROR,
                        _("setup of pcap handle failed: %s"),

That was defined as:
  /*
   * libpcap 1.5 requires a 128kb buffer
   * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
   */
  # define PCAP_BUFFERSIZE (128 * 1024)

Maybe with more modern libpcap that is not needed or even wrong now.
I have 1.8.1 atm.

I realized that the path without:
  <parameter name='CTRL_IP_LEARNING' value='dhcp'/>
works but is totally different.
It uses pcap_open_live

So I compared our code in virNWFilterSnoopDHCPOpen with pcap_open_live.
We use different buffer sizes and don't set promisc and timeout.
But the minimal change I found that seems to make it work was:

--- a/src/nwfilter/nwfilter_dhcpsnoop.c
+++ b/src/nwfilter/nwfilter_dhcpsnoop.c
@@ -1115,7 +1115,6 @@ virNWFilterSnoopDHCPOpen(const char *ifname, virMacAddr *mac,
     }

     if (pcap_set_snaplen(handle, PCAP_PBUFSIZE) < 0 ||
- pcap_set_buffer_size(handle, PCAP_BUFFERSIZE) < 0 ||
         pcap_activate(handle) < 0) {
         virReportError(VIR_ERR_INTERNAL_ERROR,
                        _("setup of pcap handle failed: %s"),

The size for this was defined as:
  /*
   * libpcap 1.5 requires a 128kb buffer
   * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
   */
  # define PCAP_BUFFERSIZE (128 * 1024)

This is from [1], does all that from a libpcap experts POV make sense?
Would it be reasonable to drop this call these days or change the size?

[1]: https://libvirt.org/git/?p=libvirt.git;a=commit;h=49b59a151f60b0a178b023b727bac30f80bd6000

Added this summary to the BZ:
I realized that the path without:
  <parameter name='CTRL_IP_LEARNING' value='dhcp'/>
works but is totally different.
It uses pcap_open_live

So I compared our code in virNWFilterSnoopDHCPOpen with pcap_open_live.
We use different buffer sizes and don't set promisc and timeout.
But the minimal change I found that seems to make it work was:

--- a/src/nwfilter/nwfilter_dhcpsnoop.c
+++ b/src/nwfilter/nwfilter_dhcpsnoop.c
@@ -1115,7 +1115,6 @@ virNWFilterSnoopDHCPOpen(const char *ifname, virMacAddr *mac,
     }

     if (pcap_set_snaplen(handle, PCAP_PBUFSIZE) < 0 ||
- pcap_set_buffer_size(handle, PCAP_BUFFERSIZE) < 0 ||
         pcap_activate(handle) < 0) {
         virReportError(VIR_ERR_INTERNAL_ERROR,
                        _("setup of pcap handle failed: %s"),

The size for this was defined as:
  /*
   * libpcap 1.5 requires a 128kb buffer
   * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
   */
  # define PCAP_BUFFERSIZE (128 * 1024)

This is from [1], does all that from a libpcap experts POV make sense?
Would it be reasonable to drop this call these days or change the size?

[1]: https://libvirt.org/git/?p=libvirt.git;a=commit;h=49b59a151f60b0a178b023b727bac30f80bd6000

Looking at source the default size if you don't call pcap_set_buffer_size is *massively* larger than the PCAP_BUFFERSIZE constant libvirt has defined:

$ grep -r 'buffer_size =' *
pcap.c: p->opt.buffer_size = 0; /* use the platform's default */
pcap.c: p->opt.buffer_size = buffer_size;
pcap-linux.c: if (handle->opt.buffer_size == 0) {
pcap-linux.c: handle->opt.buffer_size = 2*1024*1024;
pcap-win32.c: if (p->opt.buffer_size == 0)
pcap-win32.c: p->opt.buffer_size = WIN32_DEFAULT_KERNEL_BUFFER_SIZE;

$ grep -r WIN32_DEFAULT_KERNEL_BUFFER_SIZE *
pcap-win32.c:#define WIN32_DEFAULT_KERNEL_BUFFER_SIZE 1000000
pcap-win32.c: * WIN32_DEFAULT_KERNEL_BUFFER_SIZE.
pcap-win32.c: p->opt.buffer_size = WIN32_DEFAULT_KERNEL_BUFFER_SIZE;

So libpcap uses 2 MB on Linux, or 1 MB on Windows.

We need to consider what maximum oldest libpcap we support to support is, to decide whether we can safely drop the pcap_Set_buffer_size call, vs increase its value.

I just tried setting the buffer size to 256k instead of 128k, and everything works. I posted a patch:

 https://www.redhat.com/archives/libvir-list/2018-April/msg02459.html

This should be safer than not setting the buffer size when using older libpcap versions that might have a very small default buffer size, and should place too much of a burden on hosts that use a new libpcap (even 1000 guests with CTRL_IP_LEARNING turned on would only use 128MB more memory than they do now).

Does this seem reasonable?

After discussion on-list, we decided that although we may need to increase the buffer size again sometime in the future, due to memory usage concerns it is still best to just set the buffer size to 256k, and include a lot of verbiage explaining the choice (and pointing out the probable source of failure if it happens again) in the code and in the git log. (See the follow-up comments to the V12 patch listed above).

To that end, I just pushed this patch upstream:

https://www.redhat.com/archives/libvir-list/2018-April/msg02635.html

commit ce5aebeacd10a1c15cb3ee46a59c8b5ff235589e
Author: Laine Stump <email address hidden>
Date: Wed Apr 25 17:12:03 2018 -0400

    nwfilter: increase pcap buffer size to be compatible with TPACKET_V3

I also pushed that to the v3.7-maint branch so that it will automatically go into the next maintenance release of libvirt 3.7 (used by Fedora 27), and would have done the same for v4.1-maint (which will be used by F28) but that branch hasn't been created yet.

(forgot to switch the component to libvirt)

Changed in libvirt:
status: Confirmed → Unknown

Ok, together we found a fix for this which now is integrated upstream.

commit ce5aebeacd10a1c15cb3ee46a59c8b5ff235589e
Author: Laine Stump <email address hidden>
Date: Wed Apr 25 17:12:03 2018 -0400

    nwfilter: increase pcap buffer size to be compatible with TPACKET_V3

Bionic+1 is not yet open, so atm it is a bit complex for a few days to push fixes.
Furthermore since this never worked I'm not quite sure we want/need to SRU it back to all releases - at least it can't be urgent IMHO.

I'd ask for the reporters opinion for it, my plan would be to:
1. wait until 18.10 release opened the Archive
2. fix in 18.10
3. SRU to 18.04, 17.10, 16.04

Would that be ok for you?

Changed in libvirt (Ubuntu):
status: Confirmed → Triaged

Started tests for a Cosmic upload and added SRU Template.

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 4.0.0-1ubuntu10

---------------
libvirt (4.0.0-1ubuntu10) cosmic; urgency=medium

  * Fix nwfilters that set CTRL_IP_LEARNING set to dhcp failing with "An error
    occurred, but the cause is unknown" due to a buffer being too small
    for pcap with TPACKET_V3 enabled (LP: #1758037)
    - debian/patches/ubuntu/lp-1758037-nwfilter-increase-pcap-buffer-size.patch

 -- Christian Ehrhardt <email address hidden> Wed, 09 May 2018 17:07:59 +0200

Changed in libvirt (Ubuntu):
status: Triaged → Fix Released
Changed in ubuntu-power-systems:
status: Confirmed → Fix Released

Regression- and Case-Tested once more from a ppa and being good.
Also pushed to ubuntu libvirt-maintainers git as a new cosmic-4.0 branch
Uploaded to Cosmic and completed, now considering SRUs.

Changed in libvirt (Ubuntu Xenial):
status: New → Triaged
Changed in libvirt (Ubuntu Artful):
status: New → Triaged
Changed in libvirt (Ubuntu Bionic):
status: New → Triaged

SRU test builds and tests from PPA worked, pushed for SRU review to X/A/B-unapproved.

Hello bugproxy, or anyone else affected,

Accepted libvirt into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/3.6.0-1ubuntu6.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Artful):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-artful
Changed in libvirt (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed-xenial
Timo Aaltonen (tjaalton) wrote :

Hello bugproxy, or anyone else affected,

Accepted libvirt into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.3.1-1ubuntu10.23 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed-bionic
Timo Aaltonen (tjaalton) wrote :

Hello bugproxy, or anyone else affected,

Accepted libvirt into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/4.0.0-1ubuntu8.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ubuntu-power-systems:
status: Fix Released → Fix Committed

Upgraded
- 1.3.1-1ubuntu10.23
- 3.6.0-1ubuntu6.7
- 4.0.0-1ubuntu8.1
without issues.

Confirmed the issue affected the target release before the proposed version:
e.g.:
root@b:~# virsh start b1
error: Failed to start domain b1
error: An error occurred, but the cause is unknown

Bionic and Artful with the fix resolved the error just as expected.

Xenial is a bit more tricky.
There is another issue in Xenial not this bug here but affecting the same use case.
You'll see
libvirtd[18352]: internal error: setup of pcap handle failed: socket: Permission denied
Due to
apparmor="DENIED" operation="create" namespace="root//lxd-x_<var-snap-lxd-common-lxd>" profile="/usr/sbin/libvirtd" pid=2952 comm="libvirtd" family="packet" sock_type="raw" protocol=768 requested_mask="create" denied_mask="create"

That is fixed in latter versions and considered not important for Xenial as it is a uncommon use case and can be changed due to config-files (apparmor rule). But for now that means on Xenial one has to allow those to get the use-case working which is required to check the fix of our change of the bug in discussion here.

With that in place Xenial works as well now, and in regard to the issue fixed in this bug here working.

Per the above setting all three releases to verified.

tags: added: verification-done verification-done-artful verification-done-bionic verification-done-xenial
removed: verification-needed verification-needed-artful verification-needed-bionic verification-needed-xenial

------- Comment From <email address hidden> 2018-05-11 06:22 EDT-------
(In reply to comment #33)
> Ok, together we found a fix for this which now is integrated upstream.
>
> commit ce5aebeacd10a1c15cb3ee46a59c8b5ff235589e
> Author: Laine Stump <email address hidden>
> Date: Wed Apr 25 17:12:03 2018 -0400
>
> nwfilter: increase pcap buffer size to be compatible with TPACKET_V3
>
> Bionic+1 is not yet open, so atm it is a bit complex for a few days to push
> fixes.
> Furthermore since this never worked I'm not quite sure we want/need to SRU
> it back to all releases - at least it can't be urgent IMHO.
>
> I'd ask for the reporters opinion for it, my plan would be to:
> 1. wait until 18.10 release opened the Archive
> 2. fix in 18.10
> 3. SRU to 18.04, 17.10, 16.04
>
> Would that be ok for you?

Yes. Thanks for the quick test results as well.

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 3.6.0-1ubuntu6.7

---------------
libvirt (3.6.0-1ubuntu6.7) artful; urgency=medium

  * Fix nwfilters that set CTRL_IP_LEARNING set to dhcp failing with "An error
    occurred, but the cause is unknown" due to a buffer being too small
    for pcap with TPACKET_V3 enabled (LP: #1758037)
    - debian/patches/ubuntu/lp-1758037-nwfilter-increase-pcap-buffer-size.patch

 -- Christian Ehrhardt <email address hidden> Fri, 11 May 2018 07:35:09 +0200

Changed in libvirt (Ubuntu Artful):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.23

---------------
libvirt (1.3.1-1ubuntu10.23) xenial; urgency=medium

  * Fix nwfilters that set CTRL_IP_LEARNING set to dhcp failing with "An error
    occurred, but the cause is unknown" due to a buffer being too small
    for pcap with TPACKET_V3 enabled (LP: #1758037)
    - debian/patches/ubuntu/lp-1758037-nwfilter-increase-pcap-buffer-size.patch

 -- Christian Ehrhardt <email address hidden> Fri, 11 May 2018 07:37:36 +0200

Changed in libvirt (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 4.0.0-1ubuntu8.1

---------------
libvirt (4.0.0-1ubuntu8.1) bionic; urgency=medium

  * Fix nwfilters that set CTRL_IP_LEARNING set to dhcp failing with "An error
    occurred, but the cause is unknown" due to a buffer being too small
    for pcap with TPACKET_V3 enabled (LP: #1758037)
    - debian/patches/ubuntu/lp-1758037-nwfilter-increase-pcap-buffer-size.patch

 -- Christian Ehrhardt <email address hidden> Fri, 11 May 2018 07:32:28 +0200

Changed in libvirt (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released

------- Comment From <email address hidden> 2018-06-04 05:13 EDT-------
(In reply to comment #42)
> (In reply to comment #33)
> > Ok, together we found a fix for this which now is integrated upstream.
> >
> > commit ce5aebeacd10a1c15cb3ee46a59c8b5ff235589e
> > Author: Laine Stump <email address hidden>
> > Date: Wed Apr 25 17:12:03 2018 -0400
> >
> > nwfilter: increase pcap buffer size to be compatible with TPACKET_V3
> >
> > Bionic+1 is not yet open, so atm it is a bit complex for a few days to push
> > fixes.
> > Furthermore since this never worked I'm not quite sure we want/need to SRU
> > it back to all releases - at least it can't be urgent IMHO.
> >
> > I'd ask for the reporters opinion for it, my plan would be to:
> > 1. wait until 18.10 release opened the Archive
> > 2. fix in 18.10
> > 3. SRU to 18.04, 17.10, 16.04
> >
> > Would that be ok for you?
>
> Yes. Thanks for the quick test results as well.

Since this has been tested and working fine, closing the bug now.

Thanks everyone!

tags: added: targetmilestone-inin1804
removed: targetmilestone-inin---

Because it isn't clear from Comment 14, I wanted to point out that the patch referenced there is not present upstream until libvirt-4.3.0. I had pushed it to the v4.1-maint branch at the end of April so that it will be included in an F28 build, but the Fedora builds have lately been based on the upstream mainline releases rather than the maintenance branches, so this bug is still present in F28 (and also F27)

libvirt-3.7.0-6.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-2b053454a4

Changed in libvirt:
status: Unknown → Fix Committed

libvirt-3.7.0-6.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-2b053454a4

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.