lxc network.mtu setting not set consistently across hosts

Bug #1442257 reported by Edward Hope-Morley
This bug affects 6 people
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   High         Unassigned
1.23        Fix Released   High         Unassigned
1.24        Fix Released   High         Unassigned

Bug Description

We have noticed the following in one of our deployments. The issue is that Juju sets network.mtu for lxc containers to a value taken from "the first nic to come up that is not loopback" (from a #juju conversation). The motivation for Juju to provide this feature is not clear, but it is valuable and necessary for a specific reason IFF it is done properly.

Currently we are noticing UDP fragmentation in LXC containers, whereby an application inside the container expects to be able to transmit at 1500 (the default MTU) but, for an as-yet-unknown reason, LXC adds some extra bytes to each packet, resulting in fragmentation of packets close to 1500 bytes in size. So, while setting the lxc veth MTU is not the perfect solution, it is a solution, but only if done properly, which means doing it deterministically and consistently on all hosts. The second part of this is that if Juju is picking a nic at random to get its MTU, that nic may not have the correct MTU that we want to use. The solutions I propose are:

1. allow juju to accept an lxc-mtu config option

or

2. Always set lxc.network.mtu to something like 1600, which is sufficient to avoid the fragmentation we have seen.
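
For reference, a quick way to spot the inconsistency is to compare the value Juju wrote into each host's container configs (paths as in the pastes below) with the MTU actually in effect inside a container, e.g.:

# on each host: the veth MTU Juju put into every container config
grep -H 'lxc.network.mtu' /var/lib/lxc/*/config

# inside a container: the MTU actually in effect on eth0
ip link show eth0 | grep -o 'mtu [0-9]*'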

---- ---- ---- ---- ----

root@szg-dr-fdc-os-1:/var/lib/lxc# cat juju-trusty-lxc-template/config
# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu-cloud
# Parameters passed to the template: --debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.10.24.31:17070/environment/747fcd5f-9415-44d9-8d6f-3b57af854abc/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
# For additional config options, please look at lxc.container.conf(5)

# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu-cloud.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/lxc/juju-trusty-lxc-template/rootfs
lxc.mount = /var/lib/lxc/juju-trusty-lxc-template/fstab
lxc.utsname = juju-trusty-lxc-template
lxc.arch = amd64

# Network configuration
lxc.network.type = veth
lxc.network.hwaddr = 00:16:3e:7c:68:64
lxc.network.flags = up
lxc.network.link = juju-br0
lxc.network.mtu = 9180

root@szg-dr-fdc-os-3:~# cat /var/lib/lxc/juju-trusty-lxc-template/config
# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu-cloud
# Parameters passed to the template: --debug --userdata /var/lib/juju/containers/juju-trusty-lxc-template/cloud-init --hostid juju-trusty-lxc-template -r trusty -T https://10.10.24.31:17070/environment/747fcd5f-9415-44d9-8d6f-3b57af854abc/images/lxc/trusty/amd64/ubuntu-14.04-server-cloudimg-amd64-root.tar.gz
# For additional config options, please look at lxc.container.conf(5)

# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu-cloud.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/lxc/juju-trusty-lxc-template/rootfs
lxc.mount = /var/lib/lxc/juju-trusty-lxc-template/fstab
lxc.utsname = juju-trusty-lxc-template
lxc.arch = amd64

# Network configuration
lxc.network.type = veth
lxc.network.hwaddr = 00:16:3e:54:a4:76
lxc.network.flags = up
lxc.network.link = juju-br0
lxc.network.mtu = 1500

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24-alpha1
tags: added: addressability lxc network
Ian Booth (wallyworld)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Stéphane Graber (stgraber) wrote :

I'm not sure what you're seeing and what your environment is actually like, but veth devices DO NOT alter the packets in any way shape or form, so your assertion that LXC appends anything to the packet is just wrong.

Here is a proof, showing a simple test with the no-frag flag set (which is the easiest way to test these kinds of issues).

Sending a ping with a packet of exactly 1500 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1472 -c 5
PING 10.0.3.115 (10.0.3.115) 1472(1500) bytes of data.
1480 bytes from 10.0.3.115: icmp_seq=1 ttl=64 time=0.046 ms
1480 bytes from 10.0.3.115: icmp_seq=2 ttl=64 time=0.056 ms
1480 bytes from 10.0.3.115: icmp_seq=3 ttl=64 time=0.047 ms
1480 bytes from 10.0.3.115: icmp_seq=4 ttl=64 time=0.099 ms
1480 bytes from 10.0.3.115: icmp_seq=5 ttl=64 time=0.049 ms

--- 10.0.3.115 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3996ms
rtt min/avg/max/mdev = 0.046/0.059/0.099/0.021 ms

Sending a ping with a packet of exactly 1501 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1473 -c 5
PING 10.0.3.115 (10.0.3.115) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

--- 10.0.3.115 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

Sending a ping with a packet of exactly 1500 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1472 -c 5
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
1480 bytes from 8.8.8.8: icmp_seq=1 ttl=54 time=11.1 ms
1480 bytes from 8.8.8.8: icmp_seq=2 ttl=54 time=11.5 ms
1480 bytes from 8.8.8.8: icmp_seq=3 ttl=54 time=10.8 ms
1480 bytes from 8.8.8.8: icmp_seq=4 ttl=54 time=10.9 ms
1480 bytes from 8.8.8.8: icmp_seq=5 ttl=54 time=12.7 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 10.813/11.446/12.760/0.714 ms

Sending a ping with a packet of exactly 1501 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1473 -c 5
PING 8.8.8.8 (8.8.8.8) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

An ICMP echo request carries 28 bytes of header overhead (20 bytes of IPv4 header plus 8 bytes of ICMP header), so 1472 bytes of data results in exactly 1500 bytes on the network, which is the absolute maximum you can send without fragmentation. The -M do flag forces the packets to go unfragmented, causing the expected failure when reaching MTU+1.

tracepath can also be used to detect PMTU as can be seen here:
root@trusty01:/# tracepath -n vorash.stgraber.org
 1?: [LOCALHOST] pmtu 1500
 1: 10.0.3.1 ...


Revision history for this message
Robbie Williamson (robbiew) wrote :

Pasting in an extremely relevant comment received via email from Jay Vosburgh (jvosburgh):
===========================================================================
 I think there's some misunderstanding as to what's going on at
the low level; I've talked with dosaboy and wolsen about this, and am
hoping to look at some packet captures. The issue seems to be related
to the destination container receiving the packet after it has been
messed with by the iptables connection tracking logic (conntrack).
Since adjusting the interface MTU works around the problem, it seems
unlikely to be due to LXC itself.

 Some technical details for interested parties...

 There is not a 20 byte tracking overhead in the packet itself;
what appears to be happening is that corosync sends UDP datagrams that
exceed the MTU of the network (for reasons that aren't clear) by about
20 bytes, and the sending container correctly fragments them into two IP
fragments. The actual cutoff size to induce fragmentation is usually
1472; UDP datagrams larger than this will be fragmented for the usual
IPv4 case (because the MTU is 1500 and the IP + UDP headers usually
occupy 28 bytes).

 When those fragments reach the host, conntrack reassembles them
into one IP datagram (which is larger than the MTU); it does this
because the individual fragments generally cannot be tracked as they
lack the upper layer protocol information. This makes the datagram look
as if it was sent as a single, unfragmented, datagram when it really
wasn't.

 This "biggie" datagram is processed, and leaves the host, headed
for a destination container; at this time, it is refragmented so that
each piece is less than the egress interface MTU. For reasons currently
unknown, this UDP datagram (composed of two pieces) is lost or dropped.

 The workaround mentioned is to raise the MTU of the interface
leaving the host, destined for the container (host end of the veth
pair), as then the "large" datagram is not re-fragmented, but sent
whole. This works only for veth and only because its transmit
functionality is special and does not honor the receiver veth's MTU
(i.e., it will send packets that exceed the veth receiver's MTU, and
those packets will be successfully processed).
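
 As a concrete sketch of that workaround (the interface name is hypothetical; find the real host-side veth with "ip link", and 1600 is just the value floated in the bug description, not a confirmed figure):

ip link set dev vethXXXXXX mtu 1600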

 Mixing MTUs within a network like this is generally a risky
thing to do, so in my opinion this should really be considered as a
workaround and not a permanent fix.

 There are really two issues:

 1) Why does corosync send over-MTU datagrams?

 Corosync has some logic that appears to attempt to avoid doing
this, but appears to not be working.

 2) Why is the re-fragmented datagram either not processed or not
received by the destination container?

 This appears to be the real problem; in principle, a fragmented
UDP datagram should be received and processed by the destination
container regardless of size. Fragmentation can be a performance
concern, but for low volume traffic should function correctly.
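
 One way to look for the re-fragmented datagrams on the host bridge while investigating issue 2 (juju-br0 is taken from the configs above; 5405 is assumed here as corosync's default port, not something confirmed in this bug):

tcpdump -ni juju-br0 'udp port 5405 or (ip[6:2] & 0x3fff != 0)'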

Revision history for this message
Hua Zhang (zhhuabj) wrote :

It seems corosync has a setting, netmtu [1], to change the MTU value.

[1], https://pve.proxmox.com/wiki/Multicast_notes

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Joshua, corosync netmtu is currently supported in the hacluster charm, but we have yet to prove that it actually resolves the observed corosync issues. Regardless, Juju still needs fixing, because applying MTU inconsistently and non-deterministically across hosts is bad.
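
For anyone testing that route, the charm option would be set like any other charm config; the service name, option name and value below are assumptions to check against the charm's config.yaml rather than anything confirmed here:

juju set hacluster netmtu=1472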

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I still can't see what solution Juju should implement.
Will just hardcoding all container NICs' MTU to a given value (specified as an environment-level setting) suffice?

Revision history for this message
Robbie Williamson (robbiew) wrote :

Adding email comment from Robie Basak (racb):
=========================================
I can't comment on any specifics here, but I would like to point out
some essentials about MTU issues.

1. In general, if you have to tune the MTU of any particular interface
to fix an issue, you're applying a workaround that covers up some other
real issue that will bite you later. Tuning an MTU is often a sign that
pMTU discovery is broken. And if pMTU discovery has been broken somehow,
stuff *will* eventually break. I get the impression that this is what
has happened here.

1b. The only exceptions to "it should never be necessary to tune MTUs"
might be jumbo frames (layer 2 only) and/or tunnels. I think all
interfaces on a link that uses jumbo frames would need to be configured
with the same larger MTU, but I have little experience with this.

2. Layer 3 should always be able to sort itself out regardless of the
MTUs on underlying interfaces (whether that's layer 2 or a tunnel also
at layer 3). If everything is configured correctly, pMTU discovery
should Just Work. If you have to adjust MTUs to work around layer 3
issues, then either you have some layer 2 connectivity problem (check
each link can pass an MTU-sized frame) or you have layer 3 either
dropping or failing to generate packets that should be passed (eg. ICMP
frag needed packets).

3. Tunnels complicate this. I expect tunnels with a smaller MTU (which
is the common case) to either fragment on tunnel entry or honor DF and
return ICMP frag needed. You can use ping with -M and -s in various
combinations to verify this. It is awkward if the tunnel path pMTU
changes, but generally it remains constant on a private LAN so I'd
expect the tunnel interface MTU to be set correctly when the tunnel is
first set up[1]. So I have been under the impression that tunnels should
be able to do the right thing to make my points 1 and 2 hold, but I
could be wrong.
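
For instance (the target is a placeholder; these are the same -M do / -s combinations shown in the ping output above):

ping -M do -s 1472 -c 3 <remote-host>   # largest payload that fits a 1500-byte path MTU
ping -M do -s 1473 -c 3 <remote-host>   # should fail cleanly ("Message too long" or an ICMP frag-needed error)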

Summary: I'm deeply suspicious of any fix I see anywhere that involves
tuning MTUs. I think a real fix would fix the root cause, pMTU discovery
would Just Work and no MTUs would need tuning. If there's some reason
MTUs do absolutely have to be tuned to fix this kind of issue, I'd love
to see a detailed technical explanation as to why.

HTH,

Robie

[1] This is perhaps one place where I'll grant that maybe the MTU needs
to be set correctly if traffic between tunnel endpoints are known to
have a pMTU lower than the MTUs of the interfaces used between them.

Revision history for this message
Robbie Williamson (robbiew) wrote :

Adding a relevant comment received from Xiang Hui (xianghui):
==================================================
First, I am curious about why the corosync packets are fragmented at all with the default MTU of 1500 on IPv4.

Here is just a guess: link [1] may explain why fragmentation and reassembly of corosync packets broke the cluster. Corosync uses the Totem protocol, which communicates by imposing a consistent order on all messages sent or received by the members of a single ring, so if fragmentation happens it can break that ordering guarantee.

[1] The Totem Single-Ring Ordering and Membership Protocol
http://www.cs.jhu.edu/~yairamir/archive.html

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Retriaging to 1.24 to unblock 1.23.0 release - it seems there's no usable solution w.r.t. juju-core yet. Once we have it, we'll do a point-release (e.g. 1.23.1).

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Correction, retriaging to 1.23, as there's no 1.23.1 milestone yet.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Dimiter, can I suggest that you at least revert this broken feature until we figure out a better solution or a fix for our main corosync issue? Having Juju set network.mtu on lxc veths inconsistently across hosts is a real problem in itself, for which there is a simple solution: remove the feature altogether.

Revision history for this message
James Page (james-page) wrote :

AFAICT this issue appears to be isolated to LXC.

I spun up a 3-node PXC cluster with corosync/pacemaker managing the VIP under KVM containers on our OpenStack QA cloud in an effort to reproduce this problem on one of our automated testing platforms, but I have not been able to reproduce it after three days of testing. I've been actively rebooting units, moving resources around, unplugging network interfaces, etc., but the corosync/pacemaker cluster always restores back to a clean state with no split brain or wedging on the corosync/pacemaker stack.

All interfaces (and veths) are configured at the standard 1500 MTU. Instances are on different compute nodes, but we run jumbo frames on the hypervisor physical nics to avoid any packet fragmentation issues in the GRE overlay networks that we use on this cloud. Here's how things are connected in OpenStack nova/neutron:

KVM <-> [tap] <-> [bridge] <-> [veth-pair] <-> [OVS br-int] <-> [OVS br-tun] <-> {GRE tunnel}

This may or may not be related, but we also see problems restarting pacemaker/corosync under LXC (see bug 1439649) - another issue which I can't yet reproduce under KVM or on real hardware.
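
(For comparison on other environments, a crude way to dump the MTU of every interface and veth on a host; purely a convenience snippet, not part of the test described above:)

ip -o link show | awk '{print $2, $4, $5}' | sort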

Curtis Hovey (sinzui)
no longer affects: juju-core/1.22
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

As discussed with Ed, I'll be reverting the change that makes LXC NICs' MTU inherit the host's primary NIC's MTU value, and will instead add an "lxc-default-mtu" environment setting as a fallback. If the setting is set (it is not set by default), it will cause *all* containers' NICs' MTU values to be set to that value.
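
Once that lands, usage would presumably look something like the following (9000 is only an example value; whether the setting can be changed after bootstrap or has to go into environments.yaml up front should be checked against the release notes for your Juju version):

juju set-env lxc-default-mtu=9000
juju get-env lxc-default-mtu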

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

I'm resuming the work on this one, as described in my previous comment.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix proposed for 1.23 with https://github.com/juju/juju/pull/2365, will be forward ported to 1.24 and 1.25 (master).

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

The fix for 1.23 has landed; the one for 1.24 is proposed with https://github.com/juju/juju/pull/2366 (it also includes the overlooked https://github.com/juju/juju/pull/2190 needed for MAAS).

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Dimiter Naydenov (dimitern)
Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Fix for 1.24 landed. Port of the fix to 1.25 proposed with https://github.com/juju/juju/pull/2392

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

All done! I'd appreciate stakeholder feedback on how the new lxc-default-mtu setting works in real-life deployments.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: canonical-bootstack
Revision history for this message
Matt Rae (mattrae) wrote :

Setting lxc-default-mtu solved the issue I was having in this bug: https://bugs.launchpad.net/juju-core/+bug/1441319/comments/35

Changed in juju-core:
assignee: Dimiter Naydenov (dimitern) → nobody