ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost

Bug #1683947 reported by Jay Vosburgh
This bug affects 4 people
Affects          Status         Importance   Assigned to    Milestone
linux (Ubuntu)   Invalid        Undecided    Jay Vosburgh
Yakkety          Fix Released   Undecided    Unassigned

Bug Description

SRU Justification:

Impact:

On the 4.8 kernel, iptables MASQUERADE over virtio_net causes packets to be dropped by the hypervisor (host), because improper flags are set based on the IP checksum state of the packet. The system performing MASQUERADE is the one affected by the bug.

Issue was introduced by

commit fd2a0437dc33b6425cabf74cc7fc7fdba6d5903b
Author: Mike Rapoport <email address hidden>
Date: Wed Jun 8 16:09:18 2016 +0300

    virtio_net: introduce virtio_net_hdr_{from,to}_skb

which first appears in v4.8-rc1

Fix:

Fixed upstream by

3e9e40e74753 virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb().
501db511397f virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit
6391a4481ba0 virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving

3e9e40e74753 first appears in v4.9-rc5 (and is a prerequisite only), the others in v4.10-rc4.
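
For illustration only, the flag selection implied by the commit titles above can be modelled in shell. This is a sketch, not kernel code: model_hdr_flags and its has_data_valid argument are hypothetical names chosen for this example, and the numeric values are assumed to mirror CHECKSUM_* in include/linux/skbuff.h and VIRTIO_NET_HDR_F_* in include/uapi/linux/virtio_net.h.

```shell
# Illustrative model only; names and structure are assumptions, not kernel code.
# skb->ip_summed states
CHECKSUM_NONE=0
CHECKSUM_UNNECESSARY=1
CHECKSUM_COMPLETE=2
CHECKSUM_PARTIAL=3
# virtio_net header flag bits
VIRTIO_NET_HDR_F_NEEDS_CSUM=1
VIRTIO_NET_HDR_F_DATA_VALID=2

# model_hdr_flags IP_SUMMED HAS_DATA_VALID
# Prints the header flags this model would choose for a packet.
# HAS_DATA_VALID=1 models a path that may report DATA_VALID;
# HAS_DATA_VALID=0 models a path that must never set it.
model_hdr_flags() {
    if [ "$1" -eq "$CHECKSUM_PARTIAL" ]; then
        echo "$VIRTIO_NET_HDR_F_NEEDS_CSUM"
    elif [ "$2" -eq 1 ] && [ "$1" -eq "$CHECKSUM_UNNECESSARY" ]; then
        echo "$VIRTIO_NET_HDR_F_DATA_VALID"
    else
        echo 0
    fi
}

# Affected transmit behaviour, modelled as HAS_DATA_VALID=1:
model_hdr_flags "$CHECKSUM_UNNECESSARY" 1   # prints 2 (DATA_VALID)
# Transmit behaviour with the fix, modelled as HAS_DATA_VALID=0:
model_hdr_flags "$CHECKSUM_UNNECESSARY" 0   # prints 0
```

In this model a forwarded packet with CHECKSUM_UNNECESSARY is the only case whose flags change between the two calls, which matches the dependency noted in the testcase below.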

Testcase:

Reproduction to date has been on GCE, although in principle it should manifest on any suitable topology using virtio_net. There is a dependency on the forwarded packets having skb->ip_summed == CHECKSUM_UNNECESSARY; not all incoming devices will have this property.

On GCE, the following steps will induce the issue on an affected kernel:

Set up a network:

% gcloud compute networks create nat-network --mode legacy --range 10.240.0.0/16
% gcloud compute firewall-rules create nat-network-allow-ssh --allow tcp:22 --network nat-network
% gcloud compute firewall-rules create nat-network-allow-internal --allow tcp:1-65535,udp:1-65535,icmp --source-ranges 10.240.0.0/16 --network nat-network

Set up an Ubuntu 16.04 NAT VM:

% gcloud compute instances create nat-gateway-16 --zone us-central1-a --network nat-network --can-ip-forward --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud --tags nat --metadata startup-script='sysctl -w net.ipv4.ip_forward=1 ; iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE'

Set up a route to use the 16.04 NAT:

% gcloud compute routes create no-ip-internet-route --network nat-network --destination-range 0.0.0.0/0 --next-hop-instance nat-gateway-16 --next-hop-instance-zone us-central1-a --tags no-ip --priority 800

Set up a simple test VM without an external IP address:

% gcloud compute instances create nat-client --zone us-central1-a --network nat-network --no-address --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud --tags no-ip --metadata startup-script='wget --timeout=5 https://github.com/GoogleCloudPlatform/compute-image-packages/archive/20170327.tar.gz'

Wait for it to boot... maybe 30 seconds or so.

Look for serial port output:

% gcloud compute instances get-serial-port-output nat-client --zone us-central1-a | grep startup-script

You will see that the connection to github never succeeds: it gets stuck at "Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113" and eventually times out. (Ignore the previous attempt from the successful 14.04-based NAT.)

Repeat the test by resetting the test client instance and watching the serial output:

% gcloud compute instances reset nat-client --zone us-central1-a

Wait a minute or so for the new boot, then check the serial port output as above.
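
The wait-and-check step can be automated with a small polling loop. This is a minimal sketch that reuses the gcloud command shown above; wait_for_startup_output, its prefixed variables, and the default of 12 tries 10 seconds apart are assumptions made for this example.

```shell
# wait_for_startup_output INSTANCE ZONE [TRIES] [DELAY_SECONDS]
# Hypothetical helper: polls the serial console until at least one
# startup-script line is present, then prints the matching lines.
# Returns 1 if no such line appears within TRIES attempts.
wait_for_startup_output() {
    wfs_instance=$1
    wfs_zone=$2
    wfs_tries=${3:-12}
    wfs_delay=${4:-10}
    wfs_i=0
    while [ "$wfs_i" -lt "$wfs_tries" ]; do
        # Same command as in the manual steps; errors are discarded so a
        # not-yet-ready instance simply counts as "no output yet".
        wfs_out=$(gcloud compute instances get-serial-port-output \
            "$wfs_instance" --zone "$wfs_zone" 2>/dev/null | grep startup-script)
        if [ -n "$wfs_out" ]; then
            printf '%s\n' "$wfs_out"
            return 0
        fi
        wfs_i=$((wfs_i + 1))
        sleep "$wfs_delay"
    done
    return 1
}
```

For example, `wait_for_startup_output nat-client us-central1-a` prints the startup-script lines once they exist. After a reset, output from an earlier boot may still be present in the serial log, so read the most recent lines rather than the first match.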

Jay Vosburgh (jvosburgh)
Changed in linux (Ubuntu):
assignee: nobody → Jay Vosburgh (jvosburgh)
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1683947

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jay Vosburgh (jvosburgh)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Jason A. Donenfeld (zx2c4) wrote :

Hey Jay,

I found this same issue here -- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1685416 -- when debugging WireGuard issues on GCE. I'm curious how you found it and what your debugging was like. Do you work for Google and could debug their virtio implementation? I spent a really long time just rebuilding things and tweaking stuff and following the skb all the way down to the output path. When I had nearly given up, I thought, "you know, maybe I really _should_ take a look at this virtio header stuff." After setting that flag back to zero, and seeing what other successful packets were doing, I had figured it out. At first I thought it was a real kernel bug, and then later saw it was a backporting issue and hence reported it. Anyway, really traumatic debugging blitz that extended through the night. I'm curious about your story...

Jason

Jay Vosburgh (jvosburgh) wrote :

Jason,

I work for Canonical; the issue came up with one of our customers.

FWIW, I debugged the issue by first using kprobes and ftrace on the kernel of a running instance to trace the packet path through the kernel. Once it seemed that the affected packets were not being dropped somewhere on the instance and that MASQUERADE appeared to be operating correctly, I did a git bisect of the kernel to isolate the actual commit that resolved the problem (as the 4.11 kernel did not suffer from the issue).
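
A bisect of this kind searches for the commit that fixed a problem rather than the one that introduced it. The following is a minimal sketch of how such a bisect could be started, assuming a kernel git tree and a git version that supports the --term-old and --term-new options; bisect_for_fix is a hypothetical helper and this is not necessarily the exact procedure used for this bug.

```shell
# bisect_for_fix TREE BROKEN_REV FIXED_REV
# Hypothetical helper: starts a bisect with custom terms so that the
# older, failing revision is "broken" and the newer, working one is "fixed".
bisect_for_fix() {
    bff_tree=$1
    bff_broken=$2
    bff_fixed=$3
    git -C "$bff_tree" bisect start --term-old=broken --term-new=fixed &&
    git -C "$bff_tree" bisect broken "$bff_broken" &&
    git -C "$bff_tree" bisect fixed "$bff_fixed"
}
```

For example, `bisect_for_fix ~/linux v4.8 v4.11` (path and tags chosen for illustration), then build and test each revision git checks out and mark it with `git bisect broken` or `git bisect fixed` until git reports the first fixed commit.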

Stefan Bader (smb)
Changed in linux (Ubuntu Yakkety):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-51.54

---------------
linux (4.8.0-51.54) yakkety; urgency=low

  * linux: 4.8.0-51.54 -proposed tracker (LP: #1686070)

  * [Hyper-V][SAUCE] pci-hyperv: Use only 16 bit integer for PCI domain
    (LP: #1684971)
    - SAUCE: pci-hyperv: Use only 16 bit integer for PCI domain

linux (4.8.0-50.53) yakkety; urgency=low

  * linux: 4.8.0-50.53 -proposed tracker (LP: #1685847)

  * ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost
    (LP: #1683947)
    - virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb().
    - virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit
    - virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving

 -- Kleber Sacilotto de Souza <email address hidden> Tue, 25 Apr 2017 13:08:56 +0200

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Confirmed → Invalid