Comment 1 for bug 1442257

Stéphane Graber (stgraber) wrote:

I'm not sure what you're seeing or what your environment actually looks like, but veth devices DO NOT alter packets in any way, shape, or form, so your assertion that LXC appends anything to the packet is simply wrong.

Here is proof: a simple test with the no-frag flag set, which is the easiest way to test these kinds of issues.

Sending a ping with a packet of exactly 1500 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1472 -c 5
PING 10.0.3.115 (10.0.3.115) 1472(1500) bytes of data.
1480 bytes from 10.0.3.115: icmp_seq=1 ttl=64 time=0.046 ms
1480 bytes from 10.0.3.115: icmp_seq=2 ttl=64 time=0.056 ms
1480 bytes from 10.0.3.115: icmp_seq=3 ttl=64 time=0.047 ms
1480 bytes from 10.0.3.115: icmp_seq=4 ttl=64 time=0.099 ms
1480 bytes from 10.0.3.115: icmp_seq=5 ttl=64 time=0.049 ms

--- 10.0.3.115 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3996ms
rtt min/avg/max/mdev = 0.046/0.059/0.099/0.021 ms

Sending a ping with a packet of exactly 1501 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1473 -c 5
PING 10.0.3.115 (10.0.3.115) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

--- 10.0.3.115 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

Sending a ping with a packet of exactly 1500 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1472 -c 5
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
1480 bytes from 8.8.8.8: icmp_seq=1 ttl=54 time=11.1 ms
1480 bytes from 8.8.8.8: icmp_seq=2 ttl=54 time=11.5 ms
1480 bytes from 8.8.8.8: icmp_seq=3 ttl=54 time=10.8 ms
1480 bytes from 8.8.8.8: icmp_seq=4 ttl=54 time=10.9 ms
1480 bytes from 8.8.8.8: icmp_seq=5 ttl=54 time=12.7 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 10.813/11.446/12.760/0.714 ms

Sending a ping with a packet of exactly 1501 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1473 -c 5
PING 8.8.8.8 (8.8.8.8) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms

An ICMP echo request carries 28 bytes of headers (20 bytes of IPv4 header plus 8 bytes of ICMP header), so 1472 bytes of data results in exactly 1500 bytes on the wire, the absolute maximum you can send without fragmentation. The -M do flag forbids fragmentation, causing the expected failure at MTU+1.
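The arithmetic above can be checked directly in the shell (the 20-byte figure assumes a plain IPv4 header with no IP options):

```shell
# Maximum ICMP payload for a 1500-byte MTU:
# 20 bytes IPv4 header (no options) + 8 bytes ICMP header = 28 bytes overhead
mtu=1500
overhead=$((20 + 8))
echo $((mtu - overhead))   # prints 1472
```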

tracepath can also be used to detect the path MTU (PMTU), as seen here:
root@trusty01:/# tracepath -n vorash.stgraber.org
 1?: [LOCALHOST] pmtu 1500
 1: 10.0.3.1 0.130ms
 1: 10.0.3.1 0.081ms
 2: 172.31.1.1 2.175ms
 3: 172.20.48.1 2.482ms
 4: 12.228.154.65 4.385ms
 5: 12.252.42.69 3.511ms asymm 6
 6: 12.122.139.190 12.503ms asymm 10
 7: 12.122.100.117 12.543ms asymm 9
 8: 12.123.16.109 11.312ms
 9: no reply
10: no reply
11: no reply
12: 198.27.73.183 54.201ms asymm 17
13: 198.27.73.1 54.690ms asymm 17
14: 198.27.73.96 56.945ms asymm 17
15: 192.99.34.219 54.308ms reached
     Resume: pmtu 1500 hops 15 back 19

This is what you see on a regular network where the MTU of the whole path is 1500, as expected.

Or:
root@trusty:~# tracepath vorash.stgraber.org
 1?: [LOCALHOST] pmtu 1500
 1: 10.0.3.1 0.158ms
 1: 10.0.3.1 0.116ms
 2: sateda.lan.mtl.stgraber.net 0.350ms
 3: sateda.lan.mtl.stgraber.net 0.418ms pmtu 1492
 3: lo-100.lns02.tor.packetflow.ca 19.801ms asymm 4
 4: ae0_2110-bdr04-tor.teksavvy.com 16.246ms
 5: no reply
 6: mtl-2-6k.qc.ca 21.662ms
 7: bhs-g1-6k.qc.ca 22.639ms
 8: bhs-3a-a9.qc.ca 22.984ms
 9: vorash.stgraber.org 22.322ms reached
     Resume: pmtu 1492 hops 9 back 9

And this is what you see on a connection where one hop (in this case a PPPoE link) reduces the MTU. Note that from this machine, the 1500-byte no-frag ping will fail, since the path MTU is that of the smallest link, here 1492.

root@trusty:~# ping -M do -s 1472 8.8.8.8 -c 5
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1492
ping: local error: Message too long, mtu=1492
ping: local error: Message too long, mtu=1492
ping: local error: Message too long, mtu=1492
ping: local error: Message too long, mtu=1492

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 4000ms

You'll note that the right MTU is properly reported even though the container and the rest of the network have an MTU of 1500 (as they should). And if you don't pass the no-frag flag, the client will fragment as expected:

root@trusty:~# ping -s 1472 8.8.8.8 -c 5
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
1480 bytes from 8.8.8.8: icmp_seq=1 ttl=58 time=14.6 ms
1480 bytes from 8.8.8.8: icmp_seq=2 ttl=58 time=14.5 ms
1480 bytes from 8.8.8.8: icmp_seq=3 ttl=58 time=14.5 ms
1480 bytes from 8.8.8.8: icmp_seq=4 ttl=58 time=14.5 ms
1480 bytes from 8.8.8.8: icmp_seq=5 ttl=58 time=14.4 ms

--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 14.484/14.543/14.614/0.159 ms

The cases in which MTU problems will typically hit you and cause packet loss are:
 1) Asymmetric MTU configuration on a router (your router thinks its upstream link is 1500 when it's really 1492, while the upstream side is set to 1492), which means you'll receive things fine, but any packet you send that is larger than 1492 bytes will be lost.
 2) A firewall blocking all ICMP, which means the client never receives the ICMP "fragmentation needed" message, so it can't do path MTU discovery and keeps sending packets that are too large and get dropped.
 3) You are within a single subnet (no router) with a lowered MTU and are trying to send packets larger than that MTU with the no-frag flag set.
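When you suspect case 1 or 2, you can binary-search the real path MTU yourself. Here is a rough sketch; the `probe` function below is a stand-in that simulates a 1492-byte bottleneck so the script runs anywhere, but in real use it would be something like `ping -M do -c 1 -s $((size - 28)) "$target" >/dev/null 2>&1`:

```shell
#!/bin/sh
# Binary-search the path MTU between the IPv4 minimum and the local link MTU.
LINK_MTU=1492   # simulated bottleneck (e.g. a PPPoE hop); an assumption for demo purposes

probe() {
    # Succeeds iff a packet of $1 bytes would fit through the bottleneck.
    # Real use: ping -M do -c 1 -s $(($1 - 28)) "$target" >/dev/null 2>&1
    [ "$1" -le "$LINK_MTU" ]
}

find_pmtu() {
    lo=576       # minimum IPv4 MTU, assumed to always work
    hi=1500      # local link MTU, upper bound
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid"; then lo=$mid; else hi=$((mid - 1)); fi
    done
    echo "$lo"
}

find_pmtu   # prints 1492 with the simulated bottleneck
```

With the real ping-based probe, the result should match the `mtu=...` value reported in the "Message too long" errors above.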

All of those boil down to network configuration errors. The usual rule (outside of special cases like jumbo frames on a storage network) is: use an MTU of 1500 everywhere, lower it only on the specific links that require it (PPPoE, GRE, tun, ...), and make sure the router handling that low-MTU link is configured to fragment and can send ICMP messages back to the client for no-frag packets (so the client learns the lower MTU for that target).
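In practice, on a Linux router that means something like the following (a sketch only; `ppp0` is a placeholder for whatever your low-MTU interface is called, and the MSS clamp is a common workaround for case 2 above, not something this bug requires):

```shell
# Lower the MTU only on the low-MTU link itself (ppp0 is an assumed name),
# NOT on the bridge or inside the containers:
ip link set dev ppp0 mtu 1492

# Optionally clamp TCP MSS to the path MTU on the router, so TCP still
# works even when ICMP "fragmentation needed" messages get filtered:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
```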

So, all that to say: there is no encapsulation going on with veth and a bridge, so there is absolutely no reason to raise or lower the MTU of lxcbr0 or of the containers. In fact, doing so would break IPv6 connectivity and just be harmful in general.