Ubuntu
linux package

Multicast traffic not propagating correctly over Linux bridge

Bug #1402763 reported by James Page on 2014-12-15

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
juju-core	Won't Fix	Medium	Unassigned
linux (Ubuntu)	Confirmed	Undecided	Unassigned

Bug Description

There's a lot of supposition in the title of this bug, but it's currently my best guess.

In this deployment, I have a number of services running in LXC containers across multiple physical hosts. Each service is clustered across three units, all on separate physical hosts, using corosync and pacemaker. When using multicast for cluster communication, I occasionally see a container drop out of the cluster and apply its isolation response (shutting down all managed services); when using unicast, I have not yet seen the same problem.

LXC containers are bridged to the main physical network using a Linux bridge:

eth0 <-> juju-br0 <-> vethXXX <-> | vethXXX |

All MTUs are standard (1500).
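For anyone wanting to double-check the MTU claim on their own hosts, here is a minimal sketch that prints the MTU of a bridge and of every port attached to it, reading the standard sysfs layout. The function name `bridge_mtus` and the `SYSFS_NET` override are my own illustrative additions, not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Sketch: print "<interface> <mtu>" for a bridge and each of its ports.
# SYSFS_NET is a hypothetical override so the function can be pointed at
# a test directory; on a real host it defaults to /sys/class/net.
bridge_mtus() {
    bridge=$1
    net=${SYSFS_NET:-/sys/class/net}
    # A bridge device exposes its ports under <bridge>/brif/.
    [ -d "$net/$bridge/brif" ] || { echo "$bridge: not a bridge" >&2; return 1; }
    printf '%s %s\n' "$bridge" "$(cat "$net/$bridge/mtu")"
    for port in "$net/$bridge/brif"/*; do
        [ -e "$port" ] || continue      # bridge has no ports
        name=${port##*/}
        printf '%s %s\n' "$name" "$(cat "$net/$name/mtu")"
    done
}

# Example (on a host with the bridge from this report):
#   bridge_mtus juju-br0
```

A mismatch between the bridge, eth0 and the veth ports would show up as differing values in the second column.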

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: lxc 1.0.6-0ubuntu0.1
ProcVersionSignature: Ubuntu 3.13.0-40.69-generic 3.13.11.10
Uname: Linux 3.13.0-40-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.6
Architecture: amd64
Date: Mon Dec 15 17:20:08 2014
ProcEnviron:
TERM=screen-bce
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
SourcePackage: lxc
UpgradeStatus: No upgrade log present (probably fresh install)
defaults.conf:
lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Dec 17 10:16 seq
crw-rw---- 1 root audio 116, 33 Dec 17 10:16 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 14.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R610
Package: lxc 1.0.6-0ubuntu0.1
PackageArchitecture: amd64
PciMultimedia:

ProcCmdline: BOOT_IMAGE=/boot/vmlinuz-3.13.0-43-generic root=UUID=5a86874d-8bbd-4e7a-b73e-17c914de390b ro
ProcEnviron:
TERM=screen-bce
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-43-generic root=UUID=5a86874d-8bbd-4e7a-b73e-17c914de390b ro
ProcVersionSignature: Ubuntu 3.13.0-43.72-generic 3.13.11.11
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images trusty uec-images apparmor
Uname: Linux 3.13.0-43-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy libvirtd netdev plugdev sudo video
_MarkForUpload: True
defaults.conf:
lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
dmi.bios.date: 08/18/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 6.0.7
dmi.board.name: 0F0XJ6
dmi.board.vendor: Dell Inc.
dmi.board.version: A11
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr6.0.7:bd08/18/2011:svnDellInc.:pnPowerEdgeR610:pvr:rvnDellInc.:rn0F0XJ6:rvrA11:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R610
dmi.sys.vendor: Dell Inc.

James Page (james-page) wrote on 2014-12-15:

Dependencies.txt (6.1 KiB, text/plain; charset="utf-8")
KernLog.txt (7.0 KiB, text/plain; charset="utf-8")
RelatedPackageVersions.txt (208 bytes, text/plain; charset="utf-8")
lxc-net.default.txt (1.3 KiB, text/plain; charset="utf-8")
lxc.default.txt (430 bytes, text/plain; charset="utf-8")
lxcsyslog.txt (2.7 KiB, text/plain; charset="utf-8")

Launchpad Janitor (janitor) wrote on 2014-12-16:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxc (Ubuntu):
status:	New → Confirmed

James Page (james-page) wrote on 2014-12-16:

I've done a bit more testing, and here are my observations:

1) rebooting a container internally/stop-start externally using lxc-*

Works OK - container re-joins the corosync cluster just fine.

2) rebooting the physical host

Fails - containers never re-join the corosync cluster and perform the configured isolation response which is to stop everything.

That said, if I reconfigure the cluster to use a new multicast address, the cluster reforms OK.

So it would appear that cross-server multicast is not being restored on a physical server reboot; I guess this could be something to do with switch configuration as well.

tags:	added: smoosh

Brad Figg (brad-figg) wrote on 2014-12-16: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1402763

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

James Page (james-page) wrote on 2014-12-17:

Things appear to be somewhat random; however, setting the multicast_querier flag to 1 resulted in all my clusters springing back to life:

for i in `seq 0 10`; do juju ssh $i "echo -n 1 | sudo tee /sys/devices/virtual/net/juju-br0/bridge/multicast_querier"; done

Juju creates a bridge where this is not enabled by default.
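To check the flag on a single host before changing it, something along these lines may help. It is only a sketch: `querier_state`, `enable_querier` and the `SYSFS_NET` override are hypothetical names introduced here for illustration, and the sysfs path is the one used in the loop above.

```shell
#!/bin/sh
# Sketch: inspect and enable the bridge's IGMP querier on the local host.
# SYSFS_NET is a hypothetical override for trying this against a test
# directory; by default the real sysfs path from this report is used.

# Print the current multicast_querier value (0 or 1) for bridge $1.
querier_state() {
    cat "${SYSFS_NET:-/sys/devices/virtual/net}/$1/bridge/multicast_querier"
}

# Set multicast_querier to 1 only if it is not already set, then print
# the resulting value. Needs root on a real host.
enable_querier() {
    f="${SYSFS_NET:-/sys/devices/virtual/net}/$1/bridge/multicast_querier"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    if [ "$(cat "$f")" != 1 ]; then
        echo 1 > "$f"
    fi
    cat "$f"
}

# Example:
#   querier_state juju-br0
#   sudo sh -c '. ./querier.sh; enable_querier juju-br0'
```

The file name `querier.sh` in the example is likewise just a placeholder for wherever the functions are saved.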

Changed in linux (Ubuntu):
status:	Incomplete → Invalid

James Page (james-page) wrote on 2014-12-17:

For context:

http://en.wikipedia.org/wiki/IGMP_snooping#IGMP_querier

James Page (james-page) on 2014-12-17

Changed in linux (Ubuntu):
status:	Invalid → Incomplete

James Page (james-page) wrote on 2014-12-17:

I enabled the multicast_querier using a udev rule for juju-br0; however, physical host reboots still result in the impacted LXC containers not re-joining the cluster. Toggling the querier off and on again resolves the issue, so I'm guessing there is some sort of race.
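The off/on toggle described here can be sketched as a small function. This is a workaround sketch under my own assumptions, not a fix: the name `toggle_querier`, the optional delay argument and the `SYSFS_NET` override are all hypothetical, and the one-second default pause is a guess rather than a measured value.

```shell
#!/bin/sh
# Sketch: switch the bridge querier off, wait briefly, switch it back on.
# $1 = bridge name, $2 = pause in seconds (hypothetical default: 1).
# SYSFS_NET is a hypothetical override for testing against a fake tree.
toggle_querier() {
    f="${SYSFS_NET:-/sys/devices/virtual/net}/$1/bridge/multicast_querier"
    delay=${2:-1}
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo 0 > "$f"
    sleep "$delay"
    echo 1 > "$f"
    cat "$f"            # print the final state for confirmation
}

# Example (as root, after the host has finished booting):
#   toggle_querier juju-br0
```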

tags:	added: apparmor apport-collected
description:	updated

James Page (james-page) wrote on 2014-12-17: BootDmesg.txt

BootDmesg.txt (59.8 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: CRDA.txt

CRDA.txt (322 bytes, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: CurrentDmesg.txt

#10

CurrentDmesg.txt (2.5 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: Dependencies.txt

#11

Dependencies.txt (6.1 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: KernLog.txt

#12

KernLog.txt (23.9 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: Lspci.txt

#13

Lspci.txt (23.5 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: Lsusb.txt

#14

Lsusb.txt (498 bytes, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: ProcCpuinfo.txt

#15

ProcCpuinfo.txt (6.9 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: ProcInterrupts.txt

#16

ProcInterrupts.txt (4.8 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: ProcModules.txt

#17

ProcModules.txt (4.1 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: RelatedPackageVersions.txt

#18

RelatedPackageVersions.txt (208 bytes, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: UdevDb.txt

#19

UdevDb.txt (136.0 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: UdevLog.txt

#20

UdevLog.txt (288.9 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: WifiSyslog.txt

#21

WifiSyslog.txt (424.6 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: lxc-net.default.txt

#22

lxc-net.default.txt (1.3 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: lxc.default.txt

#23

lxc.default.txt (430 bytes, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17: lxcsyslog.txt

#24

lxcsyslog.txt (5.9 KiB, text/plain)

apport information

James Page (james-page) wrote on 2014-12-17:

#25

Even after toggling the querier off/on, multicast is still unreliable, with cluster members failing to transmit data successfully to each other.

Changed in linux (Ubuntu):
status:	Incomplete → New

James Page (james-page) wrote on 2014-12-17:

#26

Just to fill in a bit more detail, the physical hosts have multiple lxc containers (~3 each), any of which will be participating in a different multicast group; containers in the same group are spread across different physical hosts.

Brad Figg (brad-figg) wrote on 2014-12-17: Status changed to Confirmed

#27

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

Stéphane Graber (stgraber) wrote on 2014-12-17:

#28

Anything else that's special on that network, e.g. non-standard MTU?

Curtis Hovey (sinzui) on 2014-12-17

tags:	added: lxc network
Changed in juju-core:
status:	New → Triaged
importance:	Undecided → Medium

James Page (james-page) wrote on 2014-12-18:

#29

Stéphane

The network itself is pretty stock; the MTU on the switches is set to 9000; other than that, it's pretty much the standard Cisco configuration for the switch model.

I have the full details (but can't put them here :-)).

guigui (gfraysse-free) wrote on 2015-02-15:

#30

I have the same issue, and it is not Juju-related at all: I created a single LXC container on Ubuntu Precise.
Here is the value of multicast_querier:
cat /sys/devices/virtual/net/lxcbr0/bridge/multicast_querier
1

From the container, ping 239.255.255.250 (the UPnP address) does not work, but it works fine on the host. Otherwise the container network is fine.
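One thing that may help narrow this down is checking whether the group is actually joined on each side. The kernel lists joined IPv4 groups in /proc/net/igmp as hex, which on little-endian machines such as x86_64 appears byte-reversed relative to the dotted address. The helper names below (`igmp_hex`, `group_joined`) are hypothetical and the sketch assumes a little-endian host.

```shell
#!/bin/sh
# Sketch: look up a multicast group in /proc/net/igmp.
# Assumes a little-endian host, where the table shows the address bytes
# in reverse order (224.0.0.1 appears as 010000E0).

# Convert a dotted-quad IPv4 address to the hex form used in the table.
igmp_hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1           # split the address into its four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$4" "$3" "$2" "$1"
}

# Succeed if group $1 is listed in the table file $2
# (defaults to /proc/net/igmp).
group_joined() {
    grep -qw "$(igmp_hex "$1")" "${2:-/proc/net/igmp}"
}

# Example, run on the host and again inside the container:
#   group_joined 239.255.255.250 && echo joined || echo "not joined"
```

Note this only shows local group membership; it says nothing about whether the bridge forwards the traffic.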

Serge Hallyn (serge-hallyn) on 2015-02-26

Changed in lxc (Ubuntu):
importance:	Undecided → Medium

Stéphane Graber (stgraber) on 2015-11-09

no longer affects:

lxc (Ubuntu)

Anastasia (anastasia-macmood) on 2016-10-17

Changed in juju-core:
status:	Triaged → Won't Fix
