Kernel log flood "ceph: Failed to find inode for 1"

Bug #1875884 reported by Michael Robertson
This bug affects 1 person
Affects                      Status        Importance  Assigned to  Milestone
linux (Ubuntu)               Invalid       Undecided   Unassigned
  Bionic                     Fix Released  Medium      Unassigned
linux-azure-4.15 (Ubuntu)
  Bionic                     Fix Released  Undecided   Unassigned

Bug Description

OS provided by AKS is currently Ubuntu 16.04.6 LTS, kernel 4.15.0-1077-azure.

Every block written by a k8s pod to a ceph CSI volume generates 2 warning lines in the node's system logs (kern.log, syslog, messages, warn):
"Apr 24 09:37:46 aks-<nodename> kernel: [242123.654538] ceph: Failed to find inode for 1"

Under production load, the node eventually succumbs to DiskPressure as the drive fills up. Performance is also noticeably degraded.
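
To gauge how fast the flood eats disk on a given node, the warning lines can be counted and a rough per-block cost estimated. The following is only a minimal sketch in Python, assuming the message format quoted above. The function names are hypothetical, and the assumption that every warning lands in all four listed log files is an illustration, not something measured in this report.

```python
import re

# Pattern for the warning quoted in this report; the trailing number is the inode.
FLOOD_RE = re.compile(r"ceph: Failed to find inode for \d+")


def count_flood_lines(lines):
    """Count log lines that contain the ceph quota warning."""
    return sum(1 for line in lines if FLOOD_RE.search(line))


def bytes_per_block(sample_line, lines_per_block=2, log_files=4):
    """Estimate the log bytes produced per block written.

    Assumption (not from the report): each warning is duplicated into every
    log file named above (kern.log, syslog, messages, warn). The +1 accounts
    for the trailing newline of each line.
    """
    line_size = len(sample_line.encode("utf-8")) + 1
    return line_size * lines_per_block * log_files


if __name__ == "__main__":
    sample = ("Apr 24 09:37:46 aks-<nodename> kernel: "
              "[242123.654538] ceph: Failed to find inode for 1")
    print(count_flood_lines([sample, "unrelated line", sample]))
    print(bytes_per_block(sample))
```

On a live node, passing an open file object such as `open("/var/log/kern.log", errors="replace")` to `count_flood_lines` should work, since the function only iterates over lines.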

Background here: https://tracker.ceph.com/issues/45283

Luis Hernandez indicates 4 commits relating to this issue, just 2 of which (marked "<==" below) have been backported to Ubuntu 16.
d557c48db730 ("ceph: quota: add counter for snaprealms with quota") <==
e3161f17d926 ("ceph: quota: cache inode pointer in ceph_snap_realm")
0eb6bbe4d9cf ("ceph: fix root quota realm check") <==
2596366907f8 ("ceph: don't check quota for snap inode")

Quoth Luis:
"I've done a quick test and, after compiling the bionic kernel 4.15.0-96.97 (the latest released), I can reproduce the issue. Cherry-picking the 2 missing commits (2596366907f8 and e3161f17d926) fixes it."

In my testing Ubuntu 18 does not exhibit the bug, but Azure support tells me it will be months before they make it GA in AKS.

Can we get those commits backported to Ubuntu 16?

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-azure (not installed)
ProcVersionSignature: Ubuntu 4.15.0-1077.82-azure 4.15.18
Uname: Linux 4.15.0-1077-azure x86_64
ApportVersion: 2.20.1-0ubuntu2.23
Architecture: amd64
Date: Wed Apr 29 12:45:12 2020
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-meta-azure
UpgradeStatus: No upgrade log present (probably fresh install)
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.23
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CRDA: N/A
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Microsoft Corporation Virtual Machine
Package: linux-azure-4.15
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 hyperv_fb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-1082-azure root=UUID=026eedeb-bd55-4fe3-9e65-0f5f76c13202 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 rootdelay=300
ProcVersionSignature: Ubuntu 4.15.0-1082.92~16.04.1-azure 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-1082-azure N/A
 linux-backports-modules-4.15.0-1082-azure N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial uec-images xenial uec-images
Uname: Linux 4.15.0-1082-azure x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 06/02/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 090007
dmi.board.name: Virtual Machine
dmi.board.vendor: Microsoft Corporation
dmi.board.version: 7.0
dmi.chassis.asset.tag: 7783-7084-3265-9085-8269-3286-77
dmi.chassis.type: 3
dmi.chassis.vendor: Microsoft Corporation
dmi.chassis.version: 7.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr090007:bd06/02/2017:svnMicrosoftCorporation:pnVirtualMachine:pvr7.0:rvnMicrosoftCorporation:rnVirtualMachine:rvr7.0:cvnMicrosoftCorporation:ct3:cvr7.0:
dmi.product.name: Virtual Machine
dmi.product.uuid: 2B0B428F-CF5B-3142-8500-028AFD70E74B
dmi.product.version: 7.0
dmi.sys.vendor: Microsoft Corporation

Marcelo Cerri (mhcerri) wrote :

Targeting bionic:linux-azure-4.15 since xenial:linux-azure is derived from it.

no longer affects: linux-meta-azure (Ubuntu)
Marcelo Cerri (mhcerri) wrote :

Michael,

Can you provide more details on how to reproduce the problem?

Thank you.

Michael Robertson (mpr1972) wrote :

Sure. In my case, I create an AKS cluster (which uses the Ubuntu 16 linux-azure image), install Rook v1.3.2, mount a CSI PVC volume, and watch the logs fill up. That is rather heavy. I went back and asked the ceph experts for a simpler reproduction testcase. Here goes:

# mount -t ceph <mon>:<port>:/ /mnt/ceph -o name=admin,secret=<my-secret>
# mkdir /mnt/ceph/quotadir
# setfattr -n ceph.quota.max_files -v 10 /mnt/ceph/quotadir
# umount /mnt/ceph
# mount -t ceph <mon>:<port>:/quotadir /mnt/ceph -o name=admin,secret=<my-secret> # <== Note the 'quotadir' here!!!
# touch /mnt/ceph/newfile

Prerequisite: a running ceph cluster with a <mon>:<port> available to mount, configured with the CSI secret.

Flemming Frandsen (flfr-stibo) wrote :

I'm also seeing this on Ubuntu 18.04.4 LTS with the 4.15.0-74-generic kernel.

Any idea if this problem has been fixed in the generic kernel yet?

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1875884

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Michael Robertson (mpr1972) wrote : apport information

Attached: AudioDevicesInUse.txt, CurrentDmesg.txt, Lspci.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected
description: updated

Michael Robertson (mpr1972) wrote :

I obeyed the bot and ran the collector. Dunno if I need to mess with the bug state, as it looks like "Confirmed" expects someone other than the reporter (that's me), and the more targeted sub-bug is still in "New".
Let me know if I need to do anything else.

Flemming Frandsen (flfr-stibo) wrote :

I have just reproduced the problem with Ubuntu 18.04.4 LTS using the latest released kernel: 4.15.0-99-generic

Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Bionic):
status: Incomplete → In Progress
Marcelo Cerri (mhcerri) wrote :
no longer affects: linux-azure-4.15 (Ubuntu)
Marcelo Cerri (mhcerri) wrote :

Since that's a generic issue I'm targeting the 4.15 bionic linux-generic kernel. That's the base of the 4.15 linux-azure kernels and that will cause them to receive the fix as well.

Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Michael Robertson (mpr1972) wrote :

Sorry Bot, I am unable to confirm this one myself. As far as I can tell, AKS images cannot upgrade the kernel from bionic-proposed directly. The standard image is still xenial, and the preview bionic images use kernel 5.3.x.x and do not have this issue.
Flemming, are you able to check this? Any help is appreciated.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-106.107

---------------
linux (4.15.0-106.107) bionic; urgency=medium

  * CVE-2020-0543
    - SAUCE: x86/cpu: Add a steppings field to struct x86_cpu_id
    - SAUCE: x86/cpu: Add 'table' argument to cpu_matches()
    - SAUCE: x86/speculation: Add Special Register Buffer Data Sampling (SRBDS)
      mitigation
    - SAUCE: x86/speculation: Add SRBDS vulnerability and mitigation documentation
    - SAUCE: x86/speculation: Add Ivy Bridge to affected list

linux (4.15.0-103.104) bionic; urgency=medium

  * bionic/linux: 4.15.0-103.104 -proposed tracker (LP: #1881272)

  * "BUG: unable to handle kernel paging request" when testing
    ubuntu_kvm_smoke_test.kvm_smoke_test with B-KVM in proposed (LP: #1881072)
    - KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
    - KVM: VMX: Mark RCX, RDX and RSI as clobbered in vmx_vcpu_run()'s asm blob

linux (4.15.0-102.103) bionic; urgency=medium

  * bionic/linux: 4.15.0-102.103 -proposed tracker (LP: #1878856)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * debian/scripts/file-downloader does not handle positive failures correctly
    (LP: #1878897)
    - [Packaging] file-downloader not handling positive failures correctly

  * Kernel log flood "ceph: Failed to find inode for 1" (LP: #1875884)
    - ceph: don't check quota for snap inode
    - ceph: quota: cache inode pointer in ceph_snap_realm

  * [UBUNTU 18.04] zpcictl --reset - contribution for kernel (LP: #1870320)
    - s390/pci: Recover handle in clp_set_pci_fn()
    - s390/pci: Fix possible deadlock in recover_store()

  * Bionic update: upstream stable patchset 2020-05-12 (LP: #1878256)
    - drm/edid: Fix off-by-one in DispID DTD pixel clock
    - drm/qxl: qxl_release leak in qxl_draw_dirty_fb()
    - drm/qxl: qxl_release leak in qxl_hw_surface_alloc()
    - drm/qxl: qxl_release use after free
    - btrfs: fix block group leak when removing fails
    - btrfs: fix partial loss of prealloc extent past i_size after fsync
    - mmc: sdhci-xenon: fix annoying 1.8V regulator warning
    - mmc: sdhci-pci: Fix eMMC driver strength for BYT-based controllers
    - ALSA: hda/realtek - Two front mics on a Lenovo ThinkCenter
    - ALSA: hda/hdmi: fix without unlocked before return
    - ALSA: pcm: oss: Place the plugin buffer overflow checks correctly
    - PM: ACPI: Output correct message on target power state
    - PM: hibernate: Freeze kernel threads in software_resume()
    - dm verity fec: fix hash block number in verity_fec_decode
    - RDMA/mlx5: Set GRH fields in query QP on RoCE
    - RDMA/mlx4: Initialize ib_spec on the stack
    - vfio: avoid possible overflow in vfio_iommu_type1_pin_pages
    - vfio/type1: Fix VA->PA translation for PFNMAP VMAs in vaddr_get_pfn()
    - iommu/qcom: Fix local_base status check
    - scsi: target/iblock: fix WRITE SAME zeroing
    - iommu/amd: Fix legacy interrupt remapping for x2APIC-enabled system
    - ALSA: opti9xx: shut up gcc-10 range warning
    - nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
    - dmaengine: dmatest: Fix iteration non-stop logic
    - selinux: properly handle multiple messages in ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure-4.15 - 4.15.0-1089.99

---------------
linux-azure-4.15 (4.15.0-1089.99) bionic; urgency=medium

  [ Ubuntu: 4.15.0-106.107 ]

  * CVE-2020-0543
    - SAUCE: x86/cpu: Add a steppings field to struct x86_cpu_id
    - SAUCE: x86/cpu: Add 'table' argument to cpu_matches()
    - SAUCE: x86/speculation: Add Special Register Buffer Data Sampling (SRBDS)
      mitigation
    - SAUCE: x86/speculation: Add SRBDS vulnerability and mitigation documentation
    - SAUCE: x86/speculation: Add Ivy Bridge to affected list

  [ Ubuntu: 4.15.0-103.104 ]

  * bionic/linux: 4.15.0-103.104 -proposed tracker (LP: #1881272)
  * "BUG: unable to handle kernel paging request" when testing
    ubuntu_kvm_smoke_test.kvm_smoke_test with B-KVM in proposed (LP: #1881072)
    - KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
    - KVM: VMX: Mark RCX, RDX and RSI as clobbered in vmx_vcpu_run()'s asm blob

linux-azure-4.15 (4.15.0-1084.94) bionic; urgency=medium

  * bionic/linux-azure-4.15: 4.15.0-1084.94 -proposed tracker (LP: #1878842)

  * Add support for Ambiq micro AM1805 RTC chip (LP: #1876667)
    - SAUCE: rtc: add am-1805 RTC driver

  * linux-azure: Enable FSGSBASE instructions to support SGX (LP: #1877425)
    - x86/entry: Add some paranoid entry/exit CR3 handling comments
    - x86/entry/64: Further improve paranoid_entry comments
    - x86/fsgsbase/64: Introduce FS/GS base helper functions
    - x86/fsgsbase/64: Make ptrace use the new FS/GS base helpers
    - x86/fsgsbase/64: Factor out FS/GS segment loading from __switch_to()
    - x86/segments/64: Rename the GDT PER_CPU entry to CPU_NUMBER
    - x86/vdso: Introduce helper functions for CPU and node number
    - x86/vdso: Initialize the CPU/node NR segment descriptor earlier
    - x86/segments: Introduce the 'CPUNODE' naming to better document the segment
      limit CPU/node NR trick
    - x86/fsgsbase/64: Clean up various details
    - x86/fsgsbase/64: Fix the base write helper functions
    - selftests/x86/fsgsbase: Test ptracer-induced GSBASE write
    - selftests/x86/fsgsbase: Test RD/WRGSBASE
    - selftests/x86/fsgsbase: Test ptracer-induced GSBASE write with FSGSBASE
    - selftests/x86/fsgsbase: Fix some test case bugs
    - Revert "x86/ptrace: Prevent ptrace from clearing the FS/GS selector" and fix
      the test
    - SAUCE: x86/ptrace: Prevent ptrace from clearing the FS/GS selector
    - SAUCE: selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base
      write
    - SAUCE: x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
    - SAUCE: x86/entry/64: Clean up paranoid exit
    - SAUCE: x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
    - SAUCE: x86/entry/64: Introduce the FIND_PERCPU_BASE macro
    - SAUCE: x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
    - SAUCE: x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
    - SAUCE: x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
    - SAUCE: x86/fsgsbase/64: Use FSGSBASE in switch_to() if available
    - SAUCE: x86/fsgsbase/64: Use FSGSBASE instructions on thread copy and ptrace
    - SAUCE: x86/speculation/swapgs: Check ...

Changed in linux-azure-4.15 (Ubuntu Bionic):
status: New → Fix Released
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
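
For anyone wanting to check whether a node's running kernel is at or past the released fix, here is a rough sketch in Python. It compares the ABI number in a `uname -r` string against the "Fix Released" versions recorded above (linux 4.15.0-106.107, linux-azure-4.15 4.15.0-1089.99). The function name and the idea of using the ABI number as a threshold are assumptions for illustration. The ceph commits appear in the 4.15.0-102.103 changelog, so some earlier -proposed builds may already carry the fix, and kernels outside the 4.15 series are simply reported as unknown.

```python
import re

# ABI thresholds taken from the "Fix Released" package versions in this report:
# linux 4.15.0-106.107 and linux-azure-4.15 4.15.0-1089.99.
FIXED_ABI = {"generic": 106, "azure": 1089}


def has_ceph_quota_fix(uname_release):
    """Return True or False for known 4.15 flavours, None if unrecognised.

    Hypothetical helper: expects a string shaped like the `uname -r` values
    quoted in this report, e.g. "4.15.0-1077-azure".
    """
    match = re.fullmatch(r"4\.15\.0-(\d+)-(\w+)", uname_release)
    if match is None or match.group(2) not in FIXED_ABI:
        return None
    return int(match.group(1)) >= FIXED_ABI[match.group(2)]


if __name__ == "__main__":
    for release in ("4.15.0-1077-azure", "4.15.0-99-generic",
                    "4.15.0-106-generic"):
        print(release, has_ceph_quota_fix(release))
```

Returning `None` rather than guessing keeps the check conservative for kernels this report does not cover.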