guest on bridge becomes inaccessible

Bug #1616949 reported by Scott Moser
This bug affects 1 person
Affects          Status   Importance  Assigned to  Milestone
linux (Ubuntu)   Expired  Medium      Unassigned

Bug Description

We have a "virtual MAAS" setup: a MAAS node on a Linux bridge (created by libvirt), plus a number of other nodes on that same bridge.

The MAAS node controls the other nodes, turning them on and off and generally working MAAS-like magic.

In this scenario on 16.04, when deploying 3 or so nodes at a time, the MAAS node becomes inaccessible. Because it provides the iSCSI root storage for the installs, the nodes being deployed fail as well.

This occurred with the xenial kernel (4.4).
Reverting to the 3.19 (linux-lts-vivid) kernel fixes the problem.
https://launchpad.net/ubuntu/+source/linux-lts-vivid
Specifically, the known-good kernel is:
$ dpkg-query --show | grep 'linux-.*3.19'
linux-image-3.19.0-66-generic 3.19.0-66.74~14.04.1
linux-image-extra-3.19.0-66-generic 3.19.0-66.74~14.04.1

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-34-generic 4.4.0-34.53
ProcVersionSignature: Ubuntu 3.19.0-66.74~14.04.1-generic 3.19.8-ckt22
Uname: Linux 3.19.0-66-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Aug 25 12:10 seq
 crw-rw---- 1 root audio 116, 33 Aug 25 12:10 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.1
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Aug 25 14:37:07 2016
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=/dev/mapper/mpath0-part2 ro console=hvc0
ProcLoadAvg: 0.40 0.28 0.46 1/1337 140794
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -1
ProcVersion: Linux version 3.19.0-66-generic (buildd@bos01-ppc64el-021) (gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) ) #74~14.04.1-Ubuntu SMP Tue Jul 19 19:54:47 UTC 2016
RelatedPackageVersions:
 linux-restricted-modules-3.19.0-66-generic N/A
 linux-backports-modules-3.19.0-66-generic N/A
 linux-firmware 1.157.3
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.667 GHz (cpu 120)
 max: 3.691 GHz (cpu 0)
 avg: 3.689 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT is off

Scott Moser (smoser) wrote :
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

We can perform a kernel bisect to identify the commit that introduced this bug. Can you first test the latest upstream stable 4.4 kernel to see if this bug is already fixed? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.19/

It might also be good to test the latest mainline kernel. It can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.8-rc3/
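
A bisect is not free here: every round means building a kernel, booting it on the MAAS host and trying to reproduce the failure by deploying several nodes. Because git bisect halves the remaining range each round, N candidate commits take about ceil(log2(N)) rounds. The helper below is a minimal POSIX shell sketch of that estimate; bisect_rounds is a hypothetical name, not an existing tool, and the sketch assumes the regression can be reproduced with upstream kernels between the v3.19 and v4.4 tags rather than only with Ubuntu-specific patches.

```shell
#!/bin/sh
# bisect_rounds N: hypothetical helper, not part of git.
# Prints ceil(log2(N)), roughly the number of build/boot/test rounds
# git bisect needs to narrow N candidate commits down to one.
bisect_rounds() {
  n=$1
  rounds=0
  span=1
  # Keep doubling the covered range until it reaches N; each doubling
  # corresponds to one halving step during the bisect.
  while [ "$span" -lt "$n" ]; do
    span=$((span * 2))
    rounds=$((rounds + 1))
  done
  echo "$rounds"
}

# Example: 1000 candidate commits need about 10 rounds.
bisect_rounds 1000
```

In an upstream kernel checkout, the candidate count could be taken from git rev-list --count v3.19..v4.4 and passed in as bisect_rounds "$(git rev-list --count v3.19..v4.4)". The bisect itself would then start with git bisect start v4.4 v3.19 (bad revision first, then good), with each candidate marked via git bisect good or git bisect bad after testing it on the virtual MAAS setup.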

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: performing-bisect
Curtis Hovey (sinzui)
tags: added: jujuqa
Scott Moser (smoser) wrote :

I wonder if anything has happened here.
This is still a pain; basically, bridges don't work well with the newer kernel.

Curtis's reappearance here is because one of their nodes was rebooted into the newer kernel, and their stack failed afterwards.

Joseph Salisbury (jsalisbury) wrote :

Which newer kernel are you referring to? Were you able to test either of the ones mentioned in comment #3?

The latest 4.4 stable kernel is now at 4.4.30:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.30/

The latest mainline kernel is 4.9-rc4:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc4/

Scott Moser (smoser) wrote :

Joseph,
I'm sorry, I had not seen the request to test. This is unfortunately non-trivial to test.

I agree that the next step is to check whether the issue is still present upstream, in the HWE kernel for 16.04, and in the current GA kernel for 16.04.

I don't expect to have any time to look at this in the near future; hopefully someone else can test. Right now, I know the users of this virtual MAAS have simply pinned the old kernel because they need the system up.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired