lxc network.mtu setting not set consistently across hosts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-core | Fix Released | High | Unassigned |
1.23 | Fix Released | High | Unassigned |
1.24 | Fix Released | High | Unassigned |
Bug Description
We have noticed the following in one of our deployments. The issue is that Juju sets network.mtu for LXC containers to a value taken from "the first nic to come up that is not loopback" (from a #juju conversation). The motivation for Juju to provide this feature is not clear, but it is valuable and necessary for a specific reason, if and only if it is done properly.
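A quick way to see which value Juju would pick up on a given host is to list every interface's MTU, for example with something like the following (the prompt is just a placeholder):

root@host:~# ip -o link | awk '{print $2, $5}'

A host whose first non-loopback NIC reports 9180 ends up with lxc.network.mtu = 9180 in its containers, while a host whose first NIC reports 1500 ends up with 1500, as the two container configs further down show.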
Currently we are noticing UDP fragmentation in LXC containers: an application inside the container expects to be able to transmit at 1500 bytes (the default MTU), but for an as yet unknown reason LXC adds some extra bytes to each packet, resulting in fragmentation of packets close to 1500 bytes in size. So, while setting the LXC veth MTU is not the perfect solution, it is a solution, but only if done properly, and that means doing it deterministically and consistently on all hosts. The second part is that if Juju picks a NIC at random to get its MTU, that NIC may not have the MTU we actually want to use. The solutions I propose are:
1. Allow Juju to accept an lxc-mtu config option (see the sketch after this list);
or
2. Always set lxc.network.mtu to something like 1600, which is sufficient to avoid any fragmentation as we have seen it.
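To illustrate option 1, a rough sketch of how such a setting might look in environments.yaml (lxc-mtu here is only the name proposed above, not an existing Juju option):

environments:
  maas:
    type: maas
    # ... usual provider settings ...
    lxc-mtu: 1600    # proposed/hypothetical: written out as lxc.network.mtu in every container config

Either way, every container would then get the same, explicitly chosen MTU regardless of which host NIC happens to come up first.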
---- ---- ---- ---- ----
root@szg-
# Template used to create this container: /usr/share/
# Parameters passed to the template: --debug --userdata /var/lib/
# For additional config options, please look at lxc.container.
# Common configuration
lxc.include = /usr/share/
# Container specific configuration
lxc.rootfs = /var/lib/
lxc.mount = /var/lib/
lxc.utsname = juju-trusty-
lxc.arch = amd64
# Network configuration
lxc.network.type = veth
lxc.network.hwaddr = 00:16:3e:7c:68:64
lxc.network.flags = up
lxc.network.link = juju-br0
lxc.network.mtu = 9180
root@szg-
# Template used to create this container: /usr/share/
# Parameters passed to the template: --debug --userdata /var/lib/
# For additional config options, please look at lxc.container.
# Common configuration
lxc.include = /usr/share/
# Container specific configuration
lxc.rootfs = /var/lib/
lxc.mount = /var/lib/
lxc.utsname = juju-trusty-
lxc.arch = amd64
# Network configuration
lxc.network.type = veth
lxc.network.hwaddr = 00:16:3e:54:a4:76
lxc.network.flags = up
lxc.network.link = juju-br0
lxc.network.mtu = 1500
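For reference, a one-liner like this, run on each host, makes the inconsistency easy to spot (assuming the default /var/lib/lxc container path used above):

root@host:~# grep -H 'lxc.network.mtu' /var/lib/lxc/*/config

It prints one lxc.network.mtu line per container: 9180 on the first host above, 1500 on the second.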
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.24-alpha1
tags: added: addressability lxc network

Changed in juju-core:
importance: High → Critical
no longer affects: juju-core/1.22

Changed in juju-core:
importance: Critical → High

Changed in juju-core:
milestone: 1.24-alpha1 → 1.25.0

Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Dimiter Naydenov (dimitern)

Changed in juju-core:
status: Fix Committed → Fix Released

tags: added: canonical-bootstack

Changed in juju-core:
assignee: Dimiter Naydenov (dimitern) → nobody
I'm not sure what you're seeing or what your environment actually looks like, but veth devices DO NOT alter the packets in any way, shape, or form, so your assertion that LXC appends anything to the packet is just wrong.
Here is proof, showing a simple test with the no-fragment flag set (which is the easiest way to test this kind of issue).
Sending a ping with a packet of exactly 1500 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1472 -c 5
PING 10.0.3.115 (10.0.3.115) 1472(1500) bytes of data.
1480 bytes from 10.0.3.115: icmp_seq=1 ttl=64 time=0.046 ms
1480 bytes from 10.0.3.115: icmp_seq=2 ttl=64 time=0.056 ms
1480 bytes from 10.0.3.115: icmp_seq=3 ttl=64 time=0.047 ms
1480 bytes from 10.0.3.115: icmp_seq=4 ttl=64 time=0.099 ms
1480 bytes from 10.0.3.115: icmp_seq=5 ttl=64 time=0.049 ms
--- 10.0.3.115 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3996ms
rtt min/avg/max/mdev = 0.046/0.059/0.099/0.021 ms
Sending a ping with a packet of exactly 1501 from one container to another (no NAT, same subnet, same bridge):
root@trusty01:/# ping -M do 10.0.3.115 -s 1473 -c 5
PING 10.0.3.115 (10.0.3.115) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
--- 10.0.3.115 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms
Sending a ping with a packet of exactly 1500 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1472 -c 5
PING 8.8.8.8 (8.8.8.8) 1472(1500) bytes of data.
1480 bytes from 8.8.8.8: icmp_seq=1 ttl=54 time=11.1 ms
1480 bytes from 8.8.8.8: icmp_seq=2 ttl=54 time=11.5 ms
1480 bytes from 8.8.8.8: icmp_seq=3 ttl=54 time=10.8 ms
1480 bytes from 8.8.8.8: icmp_seq=4 ttl=54 time=10.9 ms
1480 bytes from 8.8.8.8: icmp_seq=5 ttl=54 time=12.7 ms
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 10.813/11.446/12.760/0.714 ms
Sending a ping with a packet of exactly 1501 from one container to a public service (NAT):
root@trusty01:/# ping -M do 8.8.8.8 -s 1473 -c 5
PING 8.8.8.8 (8.8.8.8) 1473(1501) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 0 received, +5 errors, 100% packet loss, time 3999ms
An ICMP echo request carries 28 bytes of header overhead (a 20-byte IPv4 header plus an 8-byte ICMP header), so 1472 bytes of data results in exactly 1500 bytes on the network, which is the absolute maximum you can send without fragmentation. The -M do flag forces the packets to go unfragmented, causing the expected failure once the size reaches MTU+1.
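Spelled out, the arithmetic is: 1472 (ICMP payload) + 8 (ICMP header) + 20 (IPv4 header) = 1500 bytes on the wire, exactly the interface MTU; one more payload byte forces fragmentation, or, with -M do, an outright refusal to send.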
tracepath can also be used to detect PMTU as can be seen here:
root@trusty01:/# tracepath -n vorash.stgraber.org
1?: [LOCALHOST] pmtu 1500
1: 10.0.3.1 ...