upgrading linux-image package to 4.15.0-66.75 breaks Ceph network file system clients

Bug #1849178 reported by Benjamin Long
This bug affects 14 people
Affects          Status        Importance  Assigned to  Milestone
linux (Ubuntu)   Invalid       Undecided   Unassigned
Bionic           Fix Released  Undecided   Unassigned

Bug Description

This occurs on both 18.04 and 16.04 with the HWE kernel.

After upgrading the Linux kernel image to 4.15.0-66, logging in as a user who has a cephfs home directory hangs. Dmesg reports the following:

[ 221.239709] general protection fault: 0000 [#1] SMP PTI
[ 221.239712] Modules linked in: ceph libceph libcrc32c fscache vboxsf(OE) snd_intel8x0 snd_ac97_codec crct10dif_pclmul crc32_pclmul ac97_bus ghash_clmulni_intel snd_pcm pcbc snd_seq_midi snd_seq_midi_event snd_rawmidi aesni_intel aes_x86_64 vboxvideo(OE) joydev snd_seq snd_seq_device snd_timer crypto_simd ttm glue_helper snd drm_kms_helper input_leds cryptd intel_rapl_perf soundcore serio_raw vboxguest(OE) drm fb_sys_fops syscopyarea sysfillrect sysimgblt video mac_hid sch_fq_codel parport_pc ppdev lp parport sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid ahci psmouse libahci i2c_piix4 e1000 pata_acpi
[ 221.239746] CPU: 1 PID: 1795 Comm: kworker/1:4 Tainted: G W OE 4.15.0-66-generic #75-Ubuntu
[ 221.239748] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 221.239759] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[ 221.239767] RIP: 0010:kmem_cache_alloc+0x81/0x1b0
[ 221.239769] RSP: 0018:ffffb208812d7ae0 EFLAGS: 00010202
[ 221.239770] RAX: 0efb8a01a897ba8a RBX: 0efb8a01a897bfea RCX: 000000000000001a
[ 221.239772] RDX: 0000000000000019 RSI: 0000000001400040 RDI: 00004929e00040d0
[ 221.239773] RBP: ffffb208812d7b10 R08: ffffd2087fd040d0 R09: ffff88de94ad6640
[ 221.239774] R10: 0000000000000000 R11: 00000000e3f68484 R12: 0000000001400040
[ 221.239775] R13: ffff88de962eb800 R14: 0efb8a01a897ba8a R15: ffff88de962eb800
[ 221.239777] FS: 0000000000000000(0000) GS:ffff88de9fd00000(0000) knlGS:0000000000000000
[ 221.239778] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 221.239779] CR2: 00005604162106a8 CR3: 00000000c5e0a004 CR4: 00000000000606e0
[ 221.239782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 221.239783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 221.239784] Call Trace:
[ 221.239792] ? ceph_alloc_inode+0x1d/0x3b0 [ceph]
[ 221.239798] ceph_alloc_inode+0x1d/0x3b0 [ceph]
[ 221.239801] alloc_inode+0x20/0x90
[ 221.239804] iget5_locked+0xea/0x1f0
[ 221.239809] ? ceph_d_init+0x45/0x60 [ceph]
[ 221.239813] ? ceph_ino_compare+0x30/0x30 [ceph]
[ 221.239817] ? ceph_mount+0x8a0/0x8a0 [ceph]
[ 221.239822] ceph_get_inode+0x36/0xc0 [ceph]
[ 221.239827] ceph_readdir_prepopulate+0x4e9/0xcb0 [ceph]
[ 221.239834] handle_reply+0x954/0xcc0 [ceph]
[ 221.239841] dispatch+0xcf/0xb40 [ceph]
[ 221.239843] ? __switch_to_asm+0x35/0x70
[ 221.239845] ? __switch_to_asm+0x41/0x70
[ 221.239847] ? __switch_to_asm+0x35/0x70
[ 221.239848] ? __switch_to_asm+0x41/0x70
[ 221.239850] ? __switch_to_asm+0x35/0x70
[ 221.239856] try_read+0x64a/0x11a0 [libceph]
[ 221.239858] ? __switch_to_asm+0x41/0x70
[ 221.239860] ? __switch_to_asm+0x35/0x70
[ 221.239861] ? __switch_to_asm+0x41/0x70
[ 221.239863] ? __switch_to_asm+0x35/0x70
[ 221.239864] ? __switch_to_asm+0x41/0x70
[ 221.239866] ? __switch_to_asm+0x35/0x70
[ 221.239871] ceph_con_workfn+0xda/0x610 [libceph]
[ 221.239874] process_one_work+0x1de/0x420
[ 221.239876] worker_thread+0x32/0x410
[ 221.239878] kthread+0x121/0x140
[ 221.239880] ? process_one_work+0x420/0x420
[ 221.239881] ? kthread_create_worker_on_cpu+0x70/0x70
[ 221.239883] ret_from_fork+0x35/0x40
[ 221.239885] Code: f4 5b 74 49 83 78 10 00 4d 8b 30 0f 84 00 01 00 00 4d 85 f6 0f 84 f7 00 00 00 49 63 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 40 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74
[ 221.239913] RIP: kmem_cache_alloc+0x81/0x1b0 RSP: ffffb208812d7ae0
[ 221.239914] ---[ end trace 35d882be2a72b80a ]---
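
To check several workstations for the same fault, a saved dmesg log can be scanned for it. The following is a minimal sketch, not a tool from this report: the function name `find_ceph_faults` is hypothetical, and the patterns assume only the line formats visible in the trace above.

```python
import re

# Patterns based on the trace format shown in this report.
_FAULT = re.compile(r"general protection fault")
_WORKQUEUE = re.compile(r"Workqueue:\s+(\S+)")
_RIP = re.compile(r"RIP:\s+(?:[0-9a-f]{4}:)?(\S+)")
_END = re.compile(r"---\[ end trace")


def find_ceph_faults(dmesg_text):
    """Return one dict per general protection fault raised on the ceph-msgr workqueue."""
    faults = []
    current = None  # fault currently being collected, if any

    def finish(fault):
        # Keep only faults attributed to the Ceph messenger workqueue.
        if fault is not None and fault["workqueue"] == "ceph-msgr":
            faults.append(fault)

    for line in dmesg_text.splitlines():
        if _FAULT.search(line):
            finish(current)  # a previous trace may have been cut off
            current = {"workqueue": None, "rip": None}
            continue
        if current is None:
            continue
        match = _WORKQUEUE.search(line)
        if match:
            current["workqueue"] = match.group(1)
        match = _RIP.search(line)
        if match and current["rip"] is None:
            current["rip"] = match.group(1)
        if _END.search(line):
            finish(current)
            current = None

    finish(current)  # handle a log that ends before the "end trace" marker
    return faults
```

Passing the text of a saved dmesg log to `find_ceph_faults` returns an empty list when no matching fault is present.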

This happens across all workstations on our network that are upgraded to this kernel.

This issue does not exist on 4.15.0-65, and downgrading to that kernel returns all affected workstations to usability.
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.7
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/pcmC0D1c', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/controlC0', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 002: ID 80ee:0021 VirtualBox USB Tablet
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: innotek GmbH VirtualBox
Package: linux-hwe
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 vboxdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-66-generic root=UUID=f6b704ab-dd37-49ba-ae14-151beb30b0f0 ro quiet splash
ProcVersionSignature: Ubuntu 4.15.0-66.75-generic 4.15.18
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-66-generic N/A
 linux-backports-modules-4.15.0-66-generic N/A
 linux-firmware 1.173.9
RfKill:

Tags: bionic
Uname: Linux 4.15.0-66-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 12/01/2006
dmi.bios.vendor: innotek GmbH
dmi.bios.version: VirtualBox
dmi.board.name: VirtualBox
dmi.board.vendor: Oracle Corporation
dmi.board.version: 1.2
dmi.chassis.type: 1
dmi.chassis.vendor: Oracle Corporation
dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
dmi.product.family: Virtual Machine
dmi.product.name: VirtualBox
dmi.product.version: 1.2
dmi.sys.vendor: innotek GmbH

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1849178

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Benjamin Long (benjamin-long) wrote :

Running apport-collect doesn't seem to want to complete. I'm getting dots across the terminal, and the load average on the machine is 4.0. It's on its 17th row of dots. :|

I'm going to try to run it again after a reboot, before the bug is triggered.

tags: added: apport-collected
description: updated
Benjamin Long (benjamin-long) wrote : apport information

Attached: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, Lspci.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux-hwe (Ubuntu):
status: New → Confirmed
Walter (wdoekes) wrote :

Related to https://lkml.org/lkml/2019/10/3/862 ? (That revert is _not_ in ubuntu linux 4.15.0-66.)

Po-Hsu Lin (cypressyew) wrote :

Hi wdoekes,
thanks for the research. That patch is indeed not in 4.15.0-66; it is now in 4.15.0-67.

Benjamin,
could you please give the Bionic kernel in -proposed (4.15.0-67) a try? It contains these potential fixes mentioned by wdoekes:

https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/commit/?h=master-next&id=30b79decb8a3a9e5a003669bfe7dea05cd807a53
https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/commit/?h=master-next&id=198f548fbb7956cb46b687a4b8a70171652efb99

Thanks.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
no longer affects: linux-hwe (Ubuntu)
no longer affects: linux-hwe (Ubuntu Bionic)
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Benjamin Long (benjamin-long) wrote :

Po-Hsu Lin,

The kernel from proposed (4.15.0-67) on bionic seems to be working without this issue. There's no panic in the logs, and I can log in as an end user with a cephfs home directory without a problem.
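
For anyone auditing a fleet, the version findings in this thread (4.15.0-65 works, 4.15.0-66 faults, 4.15.0-67 works) can be encoded in a small check. This is a sketch only: the names `parse_release` and `is_affected` are hypothetical, and it flags only the one ABI number this report identifies as broken. Other kernel versions were not tested here.

```python
import platform
import re

# From this report: 4.15.0-66 shows the fault; 4.15.0-65 and 4.15.0-67 do not.
AFFECTED_BASE = (4, 15, 0)
AFFECTED_ABI = 66


def parse_release(release):
    """Split a string such as '4.15.0-66-generic' into ((4, 15, 0), 66).

    Returns None when the string does not start with 'X.Y.Z-N'.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if not match:
        return None
    major, minor, patch, abi = (int(group) for group in match.groups())
    return (major, minor, patch), abi


def is_affected(release):
    """True only for the kernel ABI this report identifies as broken."""
    parsed = parse_release(release)
    if parsed is None:
        return False
    base, abi = parsed
    return base == AFFECTED_BASE and abi == AFFECTED_ABI


if __name__ == "__main__":
    running = platform.release()  # same value as `uname -r` on Linux
    print(running, "affected" if is_affected(running) else "not flagged")
```

Matching on the ABI number rather than the full package version keeps the check usable with both `uname -r` output (`4.15.0-66-generic`) and package versions (`4.15.0-66.75`).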

Po-Hsu Lin (cypressyew) wrote :

Thanks for the testing,
I will flip the bug status to fix-committed for now.

The proposed kernel is scheduled to be released on 11-Nov.

Changed in linux (Ubuntu Bionic):
status: Incomplete → Fix Committed
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released