Comment 2 for bug 1442257

Robbie Williamson (robbiew) wrote :

Pasting in an extremely relevant comment received via email from Jay Vosburgh (jvosburgh):
===========================================================================
 I think there's some misunderstanding as to what's going on at
the low level; I've talked with dosaboy and wolsen about this, and am
hoping to look at some packet captures. The issue seems to be related
to the destination container receiving the packet after it has been
messed with by the iptables connection tracking logic (conntrack).
Since adjusting the interface MTU works around the problem, it seems
unlikely to be due to LXC itself.

 Some technical details for interested parties...

 There is not a 20 byte tracking overhead in the packet itself;
what appears to be happening is that corosync sends UDP datagrams that
exceed the MTU of the network (for reasons that aren't clear) by about
20 bytes, and the sending container correctly fragments them into two IP
fragments. The actual cutoff size to induce fragmentation is usually
1472; UDP datagrams larger than this will be fragmented for the usual
IPv4 case (because the MTU is 1500 and the IP + UDP headers usually
occupy 28 bytes).
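
 The arithmetic above can be sketched in a few lines of Python. This
is only a model for illustration, assuming a 20 byte IPv4 header with no
options; the function names are made up for this example and do not
correspond to anything in corosync or the kernel.

```python
from __future__ import annotations

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def max_unfragmented_udp_payload(mtu: int) -> int:
    """Largest UDP payload that still fits in one IP packet."""
    return mtu - IPV4_HEADER - UDP_HEADER


def fragment_packet_sizes(udp_payload: int, mtu: int) -> list[int]:
    """Total size (IP header included) of each IP packet put on the wire."""
    ip_payload = UDP_HEADER + udp_payload
    if ip_payload + IPV4_HEADER <= mtu:
        return [ip_payload + IPV4_HEADER]
    # Every fragment except the last carries a multiple of 8 bytes of data.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    remaining = ip_payload
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk + IPV4_HEADER)
        remaining -= chunk
    return sizes


print(max_unfragmented_udp_payload(1500))   # 1472
print(fragment_packet_sizes(1472, 1500))    # [1500]
print(fragment_packet_sizes(1492, 1500))    # [1500, 40]
```

 A payload 20 bytes over the cutoff therefore produces one full-size
fragment followed by a very small second fragment.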

 When those fragments reach the host, conntrack reassembles them
into one IP datagram (which is larger than the MTU); it does this
because the individual fragments generally cannot be tracked as they
lack the upper layer protocol information. This makes the datagram look
as if it was sent as a single, unfragmented, datagram when it really
wasn't.

 This "biggie" datagram is processed, and leaves the host, headed
for a destination container; at this time, it is refragmented so that
each piece fits within the egress interface MTU. For reasons currently
unknown, this UDP datagram (composed of two pieces) is lost or dropped.

 The workaround mentioned is to raise the MTU of the interface
leaving the host, destined for the container (host end of the veth
pair), as then the "large" datagram is not re-fragmented, but sent
whole. This works only for veth and only because its transmit
functionality is special and does not honor the receiver veth's MTU
(i.e., it will send packets that exceed the veth receiver's MTU, and
those packets will be successfully processed).
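
 To make the reassemble-then-refragment path and the workaround
concrete, here is a rough Python model of what the host does to packet
sizes. It is a sketch under stated assumptions (20 byte IPv4 header, no
options), not kernel code; the helper names and the raised MTU value of
1600 are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

IPV4_HEADER = 20  # bytes, assuming no IP options


def reassemble(fragment_sizes: list[int]) -> int:
    """Size of the single IP datagram rebuilt from the given fragments."""
    data = sum(size - IPV4_HEADER for size in fragment_sizes)
    return data + IPV4_HEADER


def refragment(datagram_size: int, egress_mtu: int) -> list[int]:
    """Packet sizes leaving the host for a given egress interface MTU."""
    if datagram_size <= egress_mtu:
        return [datagram_size]  # sent whole, no fragmentation needed
    data = datagram_size - IPV4_HEADER
    per_fragment = (egress_mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    while data > 0:
        chunk = min(per_fragment, data)
        sizes.append(chunk + IPV4_HEADER)
        data -= chunk
    return sizes


incoming = [1500, 40]            # two fragments from the sending container
biggie = reassemble(incoming)    # 1520 bytes, larger than the 1500 MTU
print(refragment(biggie, 1500))  # [1500, 40]  (default: split again)
print(refragment(biggie, 1600))  # [1520]      (raised MTU: sent whole)
```

 With the raised MTU the model emits one over-sized packet, which is
the case that only works because of veth's transmit behaviour.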

 Mixing MTUs within a network like this is generally a risky
thing to do, so in my opinion this should really be considered as a
workaround and not a permanent fix.

 There are really two issues:

 1) Why does corosync send over-MTU datagrams?

 Corosync has some logic that appears to attempt to avoid doing
this, but it does not appear to be working.

 2) Why is the re-fragmented datagram either not processed or not
received by the destination container?

 This appears to be the real problem; in principle, a fragmented
UDP datagram should be received and processed by the destination
container regardless of size. Fragmentation can be a performance
concern, but for low volume traffic should function correctly.
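
 One way to check the second issue independently of corosync would be
a small UDP probe: send a datagram just over the 1472 byte cutoff and
see whether the receiver gets it whole. The following is a minimal
sketch, not a tested reproduction of this bug; the function names are
invented here, and the sender and receiver halves would need to run in
the two containers with a real address and port substituted.

```python
from __future__ import annotations

import socket


def send_probe(dest: tuple[str, int], size: int) -> None:
    """Send one UDP datagram carrying `size` bytes of payload to `dest`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(bytes(size), dest)


def wait_for_probe(receiver: socket.socket, size: int,
                   timeout: float = 2.0) -> bool:
    """Return True if a datagram of exactly `size` bytes arrives in time."""
    receiver.settimeout(timeout)
    try:
        data, _ = receiver.recvfrom(65535)
    except socket.timeout:
        return False
    return len(data) == size


def loopback_self_check(size: int) -> bool:
    """Run both halves in one process over 127.0.0.1 as a sanity check."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver:
        receiver.bind(("127.0.0.1", 0))  # let the OS pick a free port
        send_probe(receiver.getsockname(), size)
        return wait_for_probe(receiver, size)
```

 Probing with sizes on either side of the cutoff (for example 1472 and
1492 bytes) should show whether only the fragmented case is lost. Note
that the loopback self-check does not exercise fragmentation, since the
loopback MTU is normally far larger than 1500.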