upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file system connection

Bug #1737033 reported by Benjamin Long on 2017-12-07
This bug affects 6 people
Affects                 Importance   Assigned to
ceph (Ubuntu)           Undecided    Unassigned
ceph (Ubuntu Xenial)    Undecided    Unassigned
linux (Ubuntu)          Medium       Unassigned
linux (Ubuntu Xenial)   Medium       Unassigned

Bug Description

After clients have upgraded to 4.4.0-103.126 they can no longer connect to the Ceph network.

Using the Grub menu to boot the previous kernel fixes the issue.

The error in dmesg is:

[ 46.811897] FS-Cache: Loaded
[ 46.843670] Key type ceph registered
[ 46.844177] libceph: loaded (mon/osd proto 15/24)
[ 46.863107] FS-Cache: Netfs 'ceph' registered for caching
[ 46.863116] ceph: loaded (mds proto 32)
[ 46.884392] libceph: client3354099 fsid 2efbeab1-4903-4c4c-8365-6778afecbcbd
[ 46.886856] libceph: mon0 10.10.2.111:6789 session established
[ 46.897487] ceph: problem parsing mds trace -5
[ 46.897491] ceph: mds parse_reply err -5
[ 46.897492] ceph: mdsc_handle_reply got corrupt reply mds0(tid:1)
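
The -5 in these messages is a negated errno: 5 is EIO (Input/output error). To pull just the failing lines out of a longer kernel log, something like the following sketch may help. `ceph_errors` is a hypothetical helper name, and the pattern is an assumption tuned to the three messages above rather than a complete list of CephFS client errors.

```shell
# Sketch only: filter kernel log text on stdin down to CephFS client errors.
# ceph_errors is a hypothetical name, not part of any Ceph tooling.
ceph_errors() {
    # Match "ceph: " or "libceph: " lines that mention a problem,
    # an error, or a corrupt reply.
    grep -E '(lib)?ceph: .*(problem|err|corrupt)'
}
```

Typical use would be `dmesg | ceph_errors`; informational lines such as "libceph: loaded" and "session established" are not matched.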

All clients are running ceph client version:
ii ceph-fs-common 10.2.9-0ubuntu0.16.04.1

Server nodes are running 10.2.6 packages as supplied by Ceph.

All 10.2.* versions are compatible. Using the previous kernel allows the connection to work.
---
ApportVersion: 2.20.1-0ubuntu2.14
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: lightdm 1310 F.... pulseaudio
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=9e3c38ba-10bd-4183-b073-68d9d9a30a9b
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 002: ID 80ee:0021 VirtualBox USB Tablet
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: innotek GmbH VirtualBox
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US
 SHELL=/bin/bash
ProcFB: 0 vboxdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-103-generic root=UUID=5a37e891-beb9-477f-b934-3c05651acf68 ro quiet splash
ProcVersionSignature: Ubuntu 4.4.0-103.126-generic 4.4.98
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-103-generic N/A
 linux-backports-modules-4.4.0-103-generic N/A
 linux-firmware 1.157.14
RfKill:

Tags: xenial xenial
Uname: Linux 4.4.0-103-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 12/01/2006
dmi.bios.vendor: innotek GmbH
dmi.bios.version: VirtualBox
dmi.board.name: VirtualBox
dmi.board.vendor: Oracle Corporation
dmi.board.version: 1.2
dmi.chassis.type: 1
dmi.chassis.vendor: Oracle Corporation
dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
dmi.product.name: VirtualBox
dmi.product.version: 1.2
dmi.sys.vendor: innotek GmbH

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1737033

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected xenial
description: updated

Benjamin Long (benjamin-long) wrote :

Please note that while the apport report is from a VirtualBox VM, this is happening across my entire network. All my workstations are having the same issue, and all are 'fixed' by booting with the previous kernel.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Po-Hsu Lin (cypressyew) wrote :

Hello,
for the "previous kernel", do you mean 4.4.0-102 or 4.4.0-101?
Thanks

Benjamin Long (benjamin-long) wrote :

4.4.0-101 is the one they upgraded from. I'm not seeing 4.4.0-102 anywhere. It's not in the available package list of the machine I'm in front of right now.

Po-Hsu Lin (cypressyew) wrote :

Good to know, thanks!
We do have some Ceph-related changes between 4.4.0-101 and 4.4.0-103:
https://launchpad.net/ubuntu/+source/linux/4.4.0-103.126

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key needs-bisect
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
David Coronel (davecore) wrote :

On the running system, after you roll back to the previous kernel, can you paste the /etc/fstab entries for the filesystems that failed to mount and the corresponding lines from mount -v? I assume the issues are with nfshome and officeshare:

cat /etc/fstab | grep -i -e nfshome -e officeshare -e office_share
mount -v | grep -i -e nfshome -e officeshare -e office_share

Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with a revert of commit ff467fd. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1737033

Can you test this kernel and see if it resolves this bug?

David Coronel (davecore) wrote :

Hi Joseph,

I'm able to reproduce Benjamin's original issue with kernel 4.4.0-103-generic #126-Ubuntu:

# mount -t ceph <ip>:6789:/ /mnt/mycephfs -o name=admin,secret=<redacted>
mount error 5 = Input/output error

I don't get this problem with 4.4.0-101-generic #124-Ubuntu. I'm working with Jay Vosburgh to test a build of kernel 4.4.0-103.126 without the patches from https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728739

David Coronel (davecore) wrote :

I forgot to mention that I was still able to reproduce the bug with the kernel you built with a revert of commit ff467fd from http://kernel.ubuntu.com/~jsalisbury/lp1737033

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Xenial):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu):
status: In Progress → Triaged
Changed in linux (Ubuntu Xenial):
status: In Progress → Triaged
Benjamin Long (benjamin-long) wrote :

David,

Here is the output you requested:

$ cat /etc/fstab | grep -i -e nfshome -e officeshare -e office_share
Hecate-A:6789,Hecate-B:6789:/nfshome /nfshome ceph name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime,_netdev 0 2
Hecate-A:6789,Hecate-B:6789:/officeshare /mnt/share/Office_Share ceph name=cephfs,secretfile=/etc/ceph/client.cephfs,noatime,_netdev 0 2

$ mount -v | grep -i -e nfshome -e officeshare -e office_share
10.10.2.111:6789,10.10.2.113:6789:/officeshare on /mnt/share/Office_Share type ceph (rw,noatime,name=cephfs,secret=<hidden>,acl)
10.10.2.111:6789,10.10.2.113:6789:/nfshome on /nfshome type ceph (rw,noatime,name=cephfs,secret=<hidden>,acl)
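
The grep above selects entries by share name. For readers whose shares are named differently, a possible alternative is to select fstab entries by filesystem type instead. This is a minimal sketch; `list_cephfs_mounts` is a hypothetical helper name, and it assumes the usual whitespace-separated fstab layout with the type in the third field.

```shell
# Sketch only: print "source mountpoint" for every CephFS entry in an fstab file.
# list_cephfs_mounts is a hypothetical name; the file defaults to /etc/fstab.
list_cephfs_mounts() {
    # Skip comment lines, then keep entries whose type (field 3) is "ceph".
    awk '$1 !~ /^#/ && $3 == "ceph" { print $1, $2 }' "${1:-/etc/fstab}"
}
```

Called with no argument it reads /etc/fstab; passing a path reads that file instead.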

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu Xenial):
status: New → Confirmed
Changed in ceph (Ubuntu):
status: New → Confirmed
tags: removed: kernel-da-key needs-bisect
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Benjamin Long (benjamin-long) wrote :

As soon as the package hits proposed, I'll test it. Will this bug be updated when it's ready?

Po-Hsu Lin (cypressyew) wrote :

Hi Benjamin,

Yes, we will keep this bug updated. The next SRU cycle is currently underway, and the package should be ready in the proposed pocket after 17-Dec.

next cycle: 08-Dec through 06-Jan
====================================================================
         08-Dec Last day for kernel commits for this cycle.
11-Dec - 16-Dec Kernel prep week.
17-Dec - 05-Jan Bug verification & Regression testing.
         08-Jan Release to -updates.

Benjamin Long (benjamin-long) wrote :

I noticed that kernel 4.4.0-104 is in proposed while poking around for something else. I've installed it on my virtualbox test mule and it's working. I can mount Ceph without any issues.

I'm going to test a few things, then push this out to my workstations using my in-house repo. I'll let you know if I have any other issues with it.

Benjamin Long (benjamin-long) wrote :

After briefly testing to make sure things looked sane, I pushed the update out via our internal repo. No problems so far, but I'll post here if anything comes up.

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

Please note that we are re-spinning Xenial kernels to release a fix for this issue faster, so the dates from comment #25 don't apply for this bug.

David Coronel (davecore) wrote :

I also confirm the kernel 4.4.0-104.127 in xenial-proposed fixes the issue. I am able to mount my CephFS filesystem normally. Benjamin also confirmed the kernel works for him in comments #26 and #27. I am changing the tag to verification-done-xenial. Thanks!

tags: added: verification-done-xenial
removed: verification-needed-xenial

Thanks @davecore and @benjamin-long for verifying the fix!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-104.127

---------------
linux (4.4.0-104.127) xenial; urgency=low

  * linux: 4.4.0-104.127 -proposed tracker (LP: #1737511)

  * upgrading linux-image package to 4.4.0-103.126 breaks Ceph network file
    system connection (LP: #1737033)
    - Revert "libceph: MOSDOpReply v7 encoding"
    - Revert "libceph: advertise support for TUNABLES5"
    - Revert "crush: decode and initialize chooseleaf_stable"
    - Revert "crush: add chooseleaf_stable tunable"
    - Revert "crush: ensure take bucket value is valid"
    - Revert "crush: ensure bucket id is valid before indexing buckets array"

 -- Kleber Sacilotto de Souza <email address hidden> Mon, 11 Dec 2017 12:20:36 +0100
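
Going by this report, 4.4.0-101 works, 4.4.0-103 fails, and 4.4.0-104 works again. A minimal sketch for flagging machines still running the affected build follows. `is_affected_kernel` is a hypothetical helper name, and matching every flavour of the 4.4.0-103 ABI (not only -generic) is an assumption; the thread itself only tested -generic.

```shell
# Sketch only: succeed if the given kernel release string is the build
# this bug reports as broken. is_affected_kernel is a hypothetical name.
is_affected_kernel() {
    case "$1" in
        4.4.0-103-*) return 0 ;;  # e.g. 4.4.0-103-generic
        *)           return 1 ;;  # anything else, including 4.4.0-101 and 4.4.0-104
    esac
}
```

For example: `is_affected_kernel "$(uname -r)" && echo "upgrade to 4.4.0-104 or later"`.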

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
James Page (james-page) on 2018-03-19
Changed in ceph (Ubuntu):
status: Confirmed → Invalid
Changed in ceph (Ubuntu Xenial):
status: Confirmed → Invalid