multipath errors on vivid and wily kernel?

Bug #1462530 reported by Scott Moser
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Oleg and I got changes into curtin to fix bug 1371634, so we can now boot from multipath iscsi devices on power.

It seems to go well enough with trusty+hwe-u, but when trying vivid, I am seeing errors.
Boot failed in one case.
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 8 16:05 seq
 crw-rw---- 1 root audio 116, 33 Jun 8 16:05 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.17.3-0ubuntu4
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 15.10
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=962acf9e-cd1c-49b0-9264-801271cd510f ro console=hvc0
ProcLoadAvg: 0.08 0.12 0.10 1/1126 3818
ProcLocks:
 1: FLOCK ADVISORY WRITE 2951 00:12:14557 0 EOF
 2: POSIX ADVISORY WRITE 2952 00:12:43306 0 EOF
 3: POSIX ADVISORY WRITE 2905 00:12:22701 0 EOF
 4: POSIX ADVISORY WRITE 2965 00:12:18505 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -1
ProcVersion: Linux version 3.19.0-20-generic (buildd@fisher03) (gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13) ) #20-Ubuntu SMP Fri May 29 10:03:56 UTC 2015
ProcVersionSignature: Ubuntu 3.19.0-20.20-generic 3.19.8
RelatedPackageVersions:
 linux-restricted-modules-3.19.0-20-generic N/A
 linux-backports-modules-3.19.0-20-generic N/A
 linux-firmware 1.144
RfKill: Error: [Errno 2] No such file or directory
Tags: wily uec-images
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
Uname: Linux 3.19.0-20-generic ppc64le
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.693 GHz (cpu 72)
 max: 3.695 GHz (cpu 87)
 avg: 3.694 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No firmware implementation of function
cpu_smt: SMT=8

Scott Moser (smoser) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1462530

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: vivid
Stefan Bader (smb) wrote : Re: multipath errors on vivid kernel?

At least this requires the output of "multipath -ll" and the iscsi setup. Apparently it's supporting ALUA, so the targets may be on a real storage server, but it would be better if you let us know instead of us guessing. Are the errors seen on every boot, or just the one time it did not boot?

Scott Moser (smoser) wrote :

marking confirmed... smb has some access now.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Scott Moser (smoser) wrote :

Attaching console log of a wily (3.19.0-20-generic) kernel

Scott Moser (smoser)
summary: - multipath errors on vivid kernel?
+ multipath errors on vivid and wily kernel?
Scott Moser (smoser) wrote : apport information

Attached files: CRDA.txt, CurrentDmesg.txt, DeviceTree.tar.gz, IwConfig.txt, JournalErrors.txt, Lspci.txt, ProcCpuinfo.txt, ProcInterrupts.txt, ProcMisc.txt, ProcModules.txt, ProcPpc64.tar.gz, UdevDb.txt, WifiSyslog.txt, nvram.gz

tags: added: apport-collected uec-images wily
description: updated
Scott Moser (smoser) wrote :

Additional info:
$ sudo multipath -ll
mpath2 (1IBM IPR-0 5EC2A900000000A0) dm-2 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:2:0 sdc 8:32 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:2:0 sdi 8:128 active ready running
mpath1 (1IBM IPR-0 5EC2A90000000060) dm-1 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:1:0 sdb 8:16 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:1:0 sdh 8:112 active ready running
mpath0 (1IBM IPR-0 5EC2A90000000080) dm-0 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:0:0 sda 8:0 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:0:0 sdg 8:96 active ready running
mpath5 (1IBM IPR-0 5EC2A90000000020) dm-5 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:5:0 sdf 8:80 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:5:0 sdl 8:176 active ready running
mpath4 (1IBM IPR-0 5EC2A900000000C0) dm-4 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:4:0 sde 8:64 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:4:0 sdk 8:160 active ready running
mpath3 (1IBM IPR-0 5EC2A90000000040) dm-3 IBM,IPR-0 5EC2A900
size=264G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=130 status=active
| `- 0:2:3:0 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 1:2:3:0 sdj 8:144 active ready running

After installation via maas, the system hung with the boot log attached as 'console boot from wily (boot hangs)'. I then did an ipmi reset and it came up this time.
The system is currently up in this kernel, and Stefan's ssh keys should be able to get in as the 'ubuntu' user.

tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Stefan Bader (smb) wrote :

From the data I can see that all 6 multipath disks are set to failover mode. In this particular case that happens because some implicit logic in multipath-tools detected that the scsi controller supports ALUA queries, and those queries reported that host0 has all disks in the active state. So all IO is going through that controller. The scsi disk devices seen through the other controller (host1) are only in standby.
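
For anyone who wants to check the same thing on their own system, here is a minimal Python sketch that summarizes "multipath -ll" output per map and reports which scsi hosts carry the active path group. The regular expressions are written against the output pasted earlier in this bug only; the function names are made up for illustration, and other multipath-tools versions may format the output differently.

```python
import re

# Map header, e.g.:
#   mpath2 (1IBM IPR-0 5EC2A900000000A0) dm-2 IBM,IPR-0 5EC2A900
MAP_RE = re.compile(r"^(?P<alias>\S+) \((?P<wwid>.+?)\) (?P<dm>dm-\d+) ")
# Path-group line, e.g.:
#   |-+- policy='round-robin 0' prio=130 status=active
GROUP_RE = re.compile(r"policy='[^']*' prio=(?P<prio>\d+) status=(?P<status>\w+)")
# Path line, e.g.:
#   | `- 0:2:2:0 sdc 8:32 active ready running
PATH_RE = re.compile(r"(?P<hctl>\d+:\d+:\d+:\d+) (?P<dev>\S+) \d+:\d+ ")


def parse_multipath_ll(text):
    """Return {alias: {"dm": str, "groups": [{"prio", "status", "paths"}]}}."""
    maps = {}
    current = None
    for line in text.splitlines():
        m = MAP_RE.match(line)
        if m:
            current = {"dm": m.group("dm"), "groups": []}
            maps[m.group("alias")] = current
            continue
        if current is None:
            continue
        g = GROUP_RE.search(line)
        if g:
            current["groups"].append(
                {"prio": int(g.group("prio")), "status": g.group("status"), "paths": []}
            )
            continue
        p = PATH_RE.search(line)
        if p and current["groups"]:
            # The first field of host:channel:target:lun is the scsi host number.
            host = int(p.group("hctl").split(":")[0])
            current["groups"][-1]["paths"].append((host, p.group("dev")))
    return maps


def active_hosts(maps):
    """Set of scsi host numbers that have a path in a group with status=active."""
    return {
        host
        for m in maps.values()
        for g in m["groups"]
        if g["status"] == "active"
        for host, _dev in g["paths"]
    }
```

Feeding it the output above should show every map with one prio=130 group on host 0 and one prio=10 group on host 1, which is the failover layout described here.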

The boot dmesg reports inconsistencies in the ext4 filesystem on dm-6, which, thanks to a (normally harmless) error message earlier in the log, can be translated into mpath1-part2. I am not sure, but on the system you gave me access to, the only alias with data was mpath0. So that may be a potential problem or just a different config. What cannot be seen is whether the installation+reboot had left the filesystem in a bad state.
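
On a running system the dm-N to alias translation does not have to be inferred from log messages. A small sketch, assuming the kernel exposes the device-mapper name at /sys/block/dm-N/dm/name; the function name and the sysfs_root parameter are hypothetical and exist only to make the helper testable:

```python
import os


def dm_name(dm_device, sysfs_root="/sys"):
    """Return the device-mapper name for a kernel name such as 'dm-6', or None."""
    path = os.path.join(sysfs_root, "block", dm_device, "dm", "name")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Device does not exist (or is not a device-mapper device).
        return None
```

Calling dm_name("dm-6") on the affected machine would show whether dm-6 really is mpath1-part2 at that moment. Note that dm-N numbers are assigned at device creation and can differ from one boot to the next.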

Stefan Bader (smb) wrote :

In order to get some more insight about when this fs corruption happens, please provide a console log of a provisioning run which on reboot into the prepared system sees those issues.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :

Well, I got what you asked for: a full install log (with curtin -vv install) and the console log of the first boot.
Unfortunately, there are no issues in this boot.

I would really think that I was making things up (in fact, I originally ignored the errors), but then Mike saw the same thing I saw (log https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371634/comments/18)

Stefan Bader (smb) wrote :

Yeah, I am not sure. Somehow it seems that when I install multipath-tools (+multipath-tools-boot) on an already set up power8, I get some errors on the console that appear to come from some device-mapper targets already being created. Odd that those do not seem to be in the log Mike posted. It should be similar, as curtin still has the root fs mounted when installing multipath-tools. It also sounds a bit bad to create the mappings on post-install. Not sure whether that is deliberate in a hook or a side effect of starting the daemon.

Scott Moser (smoser) wrote :

I'm copying Mike's log from bug 1371634 comment 21.

Note, this is a full installation log as Stefan asked for, and Mike is still seeing issues.

I have tried the latest curtin-common and python-curtin from the wily repositories running on 14.04.2:

#leftyfb@maaster[0]:~$ apt-cache policy curtin-common
curtin-common:
  Installed: 0.1.0~bzr214-0ubuntu1
  Candidate: 0.1.0~bzr214-0ubuntu1
  Version table:
 *** 0.1.0~bzr214-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages
        100 /var/lib/dpkg/status
     0.1.0~bzr201-0ubuntu1~14.04.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     0.1.0~bzr201-0ubuntu1~14.04.1~ppa0 0
        500 http://ppa.launchpad.net/maas-maintainers/stable/ubuntu/ trusty/main amd64 Packages
     0.1.0~bzr126-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages
#leftyfb@maaster[0]:~$ apt-cache policy python-curtin
python-curtin:
  Installed: 0.1.0~bzr214-0ubuntu1
  Candidate: 0.1.0~bzr214-0ubuntu1
  Version table:
 *** 0.1.0~bzr214-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages
        100 /var/lib/dpkg/status
     0.1.0~bzr201-0ubuntu1~14.04.1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     0.1.0~bzr201-0ubuntu1~14.04.1~ppa0 0
        500 http://ppa.launchpad.net/maas-maintainers/stable/ubuntu/ trusty/main amd64 Packages
     0.1.0~bzr126-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages
#leftyfb@maaster[0]:~$ apt-cache policy maas
maas:
  Installed: 1.7.5+bzr3369-0ubuntu1~trusty1
  Candidate: 1.7.5+bzr3369-0ubuntu1
  Version table:
     1.7.5+bzr3369-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages
 *** 1.7.5+bzr3369-0ubuntu1~trusty1 0
        500 http://ppa.launchpad.net/maas-maintainers/stable/ubuntu/ trusty/main amd64 Packages
        100 /var/lib/dpkg/status
     1.5.4+bzr2294-0ubuntu1.3 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
     1.5.4+bzr2294-0ubuntu1.2 0
        500 http://security.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
     1.5+bzr2252-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

The deployment never fully boots up. It never gets to a console or allows an ssh connection. Attached is a log of the deployment and first boot process to debug.

Scott Moser (smoser) wrote :

I collected the logs in comments 23 and 24 yesterday, and then walked away.
Looked at the console from that system today, and it shows:
[54550.928492] EXT4-fs error (device dm-9): htree_dirblock_to_tree:914: inode #7078166: block 28319821: comm updatedb.mlocat: bad entry in directory: directory entry across range - offset=0(0), inode=0, rec_len=98572, name_len=74

So it seems that even though I booted successfully, there are some errors; they just weren't hit during boot. The failure is somewhat arbitrary, so a system might seem OK or might not get through the first boot (as Mike has shown).
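
Since these errors can show up long after boot, it may help to scan a saved console log for them instead of watching the console. A rough Python sketch follows; the pattern is modelled only on the single EXT4-fs line quoted above, and the function name is made up, so other kernel message formats will likely need adjustments.

```python
import re

# Modelled on:
#   [54550.928492] EXT4-fs error (device dm-9): htree_dirblock_to_tree:914: inode #...
EXT4_ERR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] EXT4-fs error \(device (?P<dev>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): (?P<detail>.*)"
)


def ext4_errors(log_text):
    """Return one dict per EXT4-fs error line found in a console or dmesg log."""
    found = []
    for line in log_text.splitlines():
        m = EXT4_ERR_RE.search(line)
        if m:
            found.append(
                {
                    "time": float(m.group("ts")),   # seconds since boot
                    "device": m.group("dev"),       # e.g. dm-9
                    "function": m.group("func"),    # kernel function reporting it
                    "detail": m.group("detail"),
                }
            )
    return found
```

Grouping the results by "device" makes it easy to see whether the errors always land on the same dm device or move around between boots.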

Scott Moser (smoser)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Scott Moser (smoser) wrote :

So, just to catch this bug up with where Oleg and I are.
a.) curtin installs multipath-tools-boot into the target with /dev, /proc and /sys mounted. However, it has disabled services from starting through ChrootableTarget's allow_services=False [1]. That uses disable_services_in_root [2].
b.) the way /etc/multipath/bindings is normally created is through the multipath daemon running

So, since we've blocked 'b', the file never gets created and so it doesn't get collected into the initramfs.

To solve this, we'll need to run a blocking command with no other side effects in the chroot that creates /etc/multipath/bindings. Oleg found that this can be accomplished with 'multipath -r'.
    sudo sh -c 'rm -Rf /etc/multipath && multipath -r >/dev/null && ls -l /etc/multipath/bindings'

Last thing, a minor point, there is a race condition in the normal install path that could lead to initramfs not collecting /etc/multipath/bindings. This is because the daemon is started in the background, and is not guaranteed to create /etc/multipath/bindings before update-initramfs is run by the trigger.

--
[1] http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/curtin/util.py#L289
[2] http://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/curtin/util.py#L257
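
To make the ordering concrete, here is a minimal Python sketch of the idea: run the blocking 'multipath -r' in the target, then confirm /etc/multipath/bindings exists before anything rebuilds the initramfs. This is not curtin's actual code. It uses a plain chroot call instead of ChrootableTarget, and the function names, the run parameter and the timeout values are all assumptions for illustration.

```python
import os
import subprocess
import time


def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until path exists; return True if it appeared before the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def ensure_bindings(target, run=subprocess.check_call, timeout=30.0):
    """Create etc/multipath/bindings inside the chroot at target.

    'multipath -r' blocks until it is done, so the wait is normally a
    formality here; it matters in the daemon-based path where the file
    is written in the background.
    """
    run(["chroot", target, "multipath", "-r"])
    bindings = os.path.join(target, "etc/multipath/bindings")
    if not wait_for_file(bindings, timeout=timeout):
        raise RuntimeError("%s was not created" % bindings)
    # Only after this point is it safe to regenerate the initramfs
    # (for example with update-initramfs -u inside the chroot).
    return bindings
```

The same wait_for_file check would also paper over the race in the normal install path, although fixing the trigger ordering in the package is the better answer.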

Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :

I filed Debian bug https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=788841 about the race condition that I think exists.
