ocfs2-tools is causing kernel panics in Ubuntu Focal (Ubuntu-5.4.0-9.12)

Bug #1852122 reported by Iain Lane on 2019-11-11
Affects                  Status                   Importance  Assigned to          Milestone
OCFS2 Tools              Fix Released             Unknown
linux (Ubuntu)           Status tracked in Focal
  Eoan                   Fix Released             Medium      Unassigned
  Focal                  Fix Released             Medium      Unassigned
ocfs2-tools (Ubuntu)     Status tracked in Focal
  Eoan                   Invalid                  Medium      Rafael David Tinoco
  Focal                  Invalid                  Medium      Rafael David Tinoco

Bug Description

[Impact]

 * Unmounting an OCFS2 filesystem causes a kernel crash on any kernel that contains commit e581595ea29c (v5.3-rc1) but not its fix, b73eba2a867e (v5.5-rc5).
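
The version window in the impact statement can be turned into a rough first-pass check. The following is a minimal sketch, assuming upstream mainline numbering only: the function names are hypothetical, distro kernels may carry the fix as a backport (so a "possibly affected" answer is not proof of the bug), and 5.5 release candidates before rc5 are not handled.

```python
import re
from typing import Tuple

# Assumption: these bounds describe upstream mainline only.
# e581595ea29c landed in v5.3-rc1; the fix b73eba2a867e landed in v5.5-rc5.
FIRST_AFFECTED = (5, 3)
FIRST_FIXED = (5, 5)


def mainline_series(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '5.3.0-18-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def possibly_affected(release: str) -> bool:
    """True when the series lies in [5.3, 5.5); says nothing about backports."""
    return FIRST_AFFECTED <= mainline_series(release) < FIRST_FIXED
```

For example, `possibly_affected("5.3.0-18-generic")` returns `True`, matching the kernel in the log below. The only reliable confirmation is to check whether the running kernel's source actually contains the fix commit.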

[Test Case]

 * ocfs2_reproducer.sh (attached to this bug)

[Regression Potential]

 * It could cause ocfs2 issues, but I would guess the impact is low considering the number of ocfs2 users.
 * It's a straightforward fix identified by an upstream developer and a clean cherry-pick for us.

[Other Info]

 * Original description:

I noticed the tests for ocfs2-tools/1.8.6-1ubuntu1 were constantly retrying themselves. It's a feature we have so that transient / occasional failures are auto-retried, but it's misfiring here because we're not detecting that it's a consistent failure. That particular bug is fixed, but it means that ocfs2-tools is failing on ppc64el. Here's the important part of the log; the full output is attached.

[ 85.605738] BUG: Unable to handle kernel data access at 0x01744098
[ 85.605850] Faulting instruction address: 0xc000000000e81168
[ 85.605901] Oops: Kernel access of bad area, sig: 11 [#1]
[ 85.605970] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 85.606029] Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue iptable_mangle xt_TCPMSS xt_tcpudp bpfilter dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua vmx_crypto crct10dif_vpmsum sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq libcrc32c crc32c_vpmsum virtio_net virtio_blk net_failover failover
[ 85.606291] CPU: 0 PID: 1 Comm: systemd Not tainted 5.3.0-18-generic #19-Ubuntu
[ 85.606350] NIP: c000000000e81168 LR: c00000000054f240 CTR: 0000000000000000
[ 85.606410] REGS: c00000005a3e3700 TRAP: 0300 Not tainted (5.3.0-18-generic)
[ 85.606469] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28024448 XER: 00000000
[ 85.606531] CFAR: 0000701f9806f638 DAR: 0000000001744098 DSISR: 40000000 IRQMASK: 0
[ 85.606531] GPR00: 0000000000007374 c00000005a3e3990 c0000000019c9100 c00000004fe462a8
[ 85.606531] GPR04: c00000005856d840 000000000000000e 0000000074656772 c00000004fe4a568
[ 85.606531] GPR08: 0000000000000000 c000000058568004 0000000001744090 0000000000000000
[ 85.606531] GPR12: 00000000e8086002 c000000001d60000 00007fffddd522d0 0000000000000000
[ 85.606531] GPR16: 0000000000000000 0000000000000000 0000000000000000 c00000000755e07c
[ 85.606531] GPR20: c0000000598caca8 c00000005a3e3a58 0000000000000000 c000000058292f00
[ 85.606531] GPR24: c000000000eea710 0000000000000000 c00000005856d840 c00000000755e074
[ 85.606531] GPR28: 000000006518907d c00000005a3e3a68 c00000004fe4b160 00000000027c47b6
[ 85.607079] NIP [c000000000e81168] rb_insert_color+0x18/0x1c0
[ 85.607137] LR [c00000000054f240] ext4_htree_store_dirent+0x140/0x1c0
[ 85.607186] Call Trace:
[ 85.607208] [c00000005a3e3990] [c00000000054f158] ext4_htree_store_dirent+0x58/0x1c0 (unreliable)
[ 85.607279] [c00000005a3e39e0] [c000000000594cd8] htree_dirblock_to_tree+0x1b8/0x380
[ 85.607340] [c00000005a3e3b00] [c0000000005962c0] ext4_htree_fill_tree+0xc0/0x3f0
[ 85.607401] [c00000005a3e3c00] [c00000000054ebe4] ext4_readdir+0x814/0xce0
[ 85.607459] [c00000005a3e3d40] [c000000000472d6c] iterate_dir+0x1fc/0x280
[ 85.607511] [c00000005a3e3d90] [c0000000004746f0] ksys_getdents64+0xa0/0x1f0
[ 85.607572] [c00000005a3e3e00] [c000000000474868] sys_getdents64+0x28/0x130
[ 85.607622] [c00000005a3e3e20] [c00000000000b388] system_call+0x5c/0x70
[ 85.607672] Instruction dump:
[ 85.607703] 4082ffe8 4e800020 38600000 4e800020 60000000 60000000 e9230000 2c290000
[ 85.607764] 4182018c e9490000 71480001 4c820020 <e90a0008> 7c284840 2fa80000 4182006c
[ 85.607827] ---[ end trace cfc53af0f8d62cef ]---
[ 85.610600]
[ 86.611522] BUG: Unable to handle kernel data access at 0xc000030058567eff
[ 86.611604] Faulting instruction address: 0xc000000000403aa8
[ 86.611656] Oops: Kernel access of bad area, sig: 11 [#2]
[ 86.611697] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 86.611748] Modules linked in: ocfs2 quota_tr

Andreas Hasenack (ahasenack) wrote :

Reproduced. The left side shows the o2cb test being run, and the right side is dmesg. This is focal, not eoan as the screenshot says: I had to start with eoan and dist-upgrade to focal.

Changed in ocfs2-tools (Ubuntu):
status: New → Confirmed
status: Confirmed → Triaged
importance: Undecided → Medium

There was a really old thread with the exact same stack trace coming from Google's syzkaller fuzzing bot. It seems that the bot was right =o).

I have replied with this stack trace, telling them I have a way to reproduce the issue:

https://<email address hidden>/

and will monitor the thread.

Changed in ocfs2-tools (Ubuntu):
status: Triaged → Confirmed
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
summary: - ocfs2-tools autopkgtest is causing kernel panics on ppc64el
+ ocfs2-tools autopkgtest is causing kernel panics on service shutdown
Changed in ocfs2-tools (Ubuntu):
status: Confirmed → In Progress

I have tested the latest focal kernel and it looks like the kernel issue is still there. There is (most likely) an ongoing upstream thread about it, and I have explained how to reproduce the issue here:

https://github.com/markfasheh/ocfs2-tools/issues/45#issuecomment-572875062

The issue is happening on all arches currently in the autopkgtest infrastructure.

summary: - ocfs2-tools autopkgtest is causing kernel panics on service shutdown
+ ocfs2-tools is causing kernel panics in Ubuntu Focal (Ubuntu-5.4.0-9.12)
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
Changed in linux (Ubuntu Eoan):
status: New → In Progress
Changed in ocfs2-tools (Ubuntu Eoan):
status: New → In Progress
importance: Undecided → Medium
Changed in linux (Ubuntu Eoan):
importance: Undecided → Medium
Changed in ocfs2-tools (Ubuntu Eoan):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)

Waiting on upstream bug.

Changed in ocfs2-tools:
status: Unknown → New
tags: added: update-excuse
Changed in ocfs2-tools:
status: New → Fix Released

Thanks for checking this, Andreas! It looks like upstream has a fix for the issue and Debian has confirmed that it works. I'll verify that it fixes the issue here and then suggest the SRU to the kernel team so we can unblock ocfs2-tools.

Iain Lane (laney) wrote :

awesome, thanks for following up!

The upstream fix is likely this:

From b73eba2a867e10b9b4477738677341f3307c07bb Mon Sep 17 00:00:00 2001
From: Gang He <email address hidden>
Date: Sat, 4 Jan 2020 13:00:22 -0800
Subject: [PATCH] ocfs2: fix the crash due to call ocfs2_get_dlm_debug once
 less

Because ocfs2_get_dlm_debug() function is called once less here, ocfs2
file system will trigger the system crash, usually after ocfs2 file
system is unmounted.

This system crash is caused by a generic memory corruption, these crash
backtraces are not always the same, for exapmle,

    ocfs2: Unmounting device (253,16) on (node 172167785)
    general protection fault: 0000 [#1] SMP PTI
    CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    RIP: 0010:__kmalloc+0xa5/0x2a0
    Code: 00 00 4d 8b 07 65 4d 8b
    RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
    RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
    R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
    R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
    FS: 00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
    Call Trace:
     ext4_htree_store_dirent+0x35/0x100 [ext4]
     htree_dirblock_to_tree+0xea/0x290 [ext4]
     ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
     ext4_readdir+0x67c/0x9d0 [ext4]
     iterate_dir+0x8d/0x1a0
     __x64_sys_getdents+0xab/0x130
     do_syscall_64+0x60/0x1f0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f699d33a9fb

This regression problem was introduced by commit e581595ea29c ("ocfs: no
need to check return value of debugfs_create functions").

Link: http://<email address hidden>
Fixes: e581595ea29c ("ocfs: no need to check return value of debugfs_create functions")
Signed-off-by: Gang He <email address hidden>
Acked-by: Joseph Qi <email address hidden>
Cc: Mark Fasheh <email address hidden>
Cc: Joel Becker <email address hidden>
Cc: Junxiao Bi <email address hidden>
Cc: Changwei Ge <email address hidden>
Cc: Gang He <email address hidden>
Cc: Jun Piao <email address hidden>
Cc: <email address hidden> [5.3+]
Signed-off-by: Andrew Morton <email address hidden>
Signed-off-by: Linus Torvalds <email address hidden>

as reported in the upstream bug. I'm giving it a try so I can finally suggest it as an SRU to the kernel team.
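
The patch description says a "get" is called one time fewer than it should be, and that the result is generic memory corruption. The following toy model illustrates that class of bug in general terms. It is a sketch under stated assumptions, not the ocfs2 code: `RefCounted` and `mount_unmount_cycle` are hypothetical names, and the exact place where the kernel drops the reference is not modelled.

```python
class RefCounted:
    """Toy stand-in for a reference-counted object; not the ocfs2 structure."""

    def __init__(self):
        self.refs = 1        # the creator holds the first reference
        self.freed = False

    def get(self):
        if self.freed:
            raise RuntimeError("get on freed object")
        self.refs += 1

    def put(self):
        if self.freed:
            raise RuntimeError("put on freed object (memory corruption in C)")
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # in C the memory would be released here


def mount_unmount_cycle(obj, take_reference):
    """One hypothetical mount/unmount: optional get, unconditional put."""
    if take_reference:
        obj.get()
    obj.put()
```

With `take_reference=True` the count returns to 1 after every cycle. With `take_reference=False` the first cycle frees the object while its creator still expects it to be alive, and any later use touches freed memory. In a real kernel that memory may already belong to an unrelated subsystem, which would be consistent with the backtraces above ending in ext4 rather than ocfs2.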

Changed in ocfs2-tools (Ubuntu Eoan):
status: In Progress → Invalid
Changed in ocfs2-tools (Ubuntu Focal):
status: In Progress → Invalid

Yep, fixes the issue:

[ 25.249831] ocfs2: Registered cluster interface o2cb
[ 25.257661] OCFS2 User DLM kernel interface loaded
[ 25.265196] o2hb: Heartbeat mode set to local
[ 34.231435] o2dlm: Joining domain F67405C564CB4A7CAEBA6F6ACCA2C82F
[ 34.231436] (
[ 34.231437] 0
[ 34.231438] ) 1 nodes
[ 34.231825] JBD2: Ignoring recovery information on journal
[ 34.233058] ocfs2: Mounting device (7,0) on (node 0, slot 0) with ordered data mode.
[ 38.246001] o2dlm: Leaving domain F67405C564CB4A7CAEBA6F6ACCA2C82F
[ 38.247583] ocfs2: Unmounting device (7,0) on (node 0)
[ 50.998435] ocfs2: Unregistered cluster interface o2cb
[ 51.117395] ocfs2: Registered cluster interface o2cb
[ 51.124916] OCFS2 User DLM kernel interface loaded
[ 51.131192] o2hb: Heartbeat mode set to local
[ 59.999672] o2dlm: Joining domain 1D1FFAA94E654FE6B94AA0E44029CE9E
[ 59.999674] (
[ 59.999675] 0
[ 59.999676] ) 1 nodes
[ 60.000252] JBD2: Ignoring recovery information on journal
[ 60.001543] ocfs2: Mounting device (7,0) on (node 0, slot 0) with ordered data mode.
[ 64.043638] o2dlm: Leaving domain 1D1FFAA94E654FE6B94AA0E44029CE9E
[ 64.045484] ocfs2: Unmounting device (7,0) on (node 0)
[ 70.866574] random: crng init done
[ 70.866586] random: 7 urandom warning(s) missed due to ratelimiting

I'll propose the patch to the kernel team.
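
Telling a clean run like the one above from a crashing one can be scripted. This is a minimal sketch, assuming the dmesg output has been captured to a string; the marker strings are taken from the logs quoted in this bug, the function name is hypothetical, and the list of crash signatures is not exhaustive.

```python
# Crash signatures copied from the logs in this report (not a complete list).
CRASH_MARKERS = (
    "BUG: Unable to handle kernel",
    "Oops:",
    "general protection fault",
)
UNMOUNT_MARKER = "ocfs2: Unmounting device"


def crashes_after_unmount(dmesg_text):
    """Return crash lines that appear after the first OCFS2 unmount line."""
    seen_unmount = False
    hits = []
    for line in dmesg_text.splitlines():
        if UNMOUNT_MARKER in line:
            seen_unmount = True
        elif seen_unmount and any(marker in line for marker in CRASH_MARKERS):
            hits.append(line.strip())
    return hits
```

An empty list means no known crash signature followed an unmount. Because the corruption can surface late and in unrelated code, an empty result after a single short run is weak evidence, so repeating the mount/unmount cycle several times before checking is advisable.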

description: updated

The Eoan patch was acked. The Focal patch is already in the latest kernel version.

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Released
Marcelo Cerri (mhcerri) on 2020-01-29
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan

VERIFICATION:

(k)rafaeldtinoco@ocfs2:~$ apt-cache policy linux-image-5.4.0-14-generic
linux-image-5.4.0-14-generic:
  Installed: 5.4.0-14.17
  Candidate: 5.4.0-14.17
  Version table:
 *** 5.4.0-14.17 500
        500 http://br.archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
        100 /var/lib/dpkg/status

(k)rafaeldtinoco@ocfs2:~/scripts$ sudo ./ocfs2_reproducer.sh
=== dlmfs ===
ocfs2_dlmfs /dlm ocfs2_dlmfs rw,relatime 0 0
=== lsmod ===
ocfs2_stack_o2cb 16384 0
ocfs2_dlm 192512 1 ocfs2_stack_o2cb
ocfs2_nodemanager 196608 7 ocfs2_stack_o2cb,ocfs2_dlm,ocfs2_dlmfs
ocfs2_stackglue 20480 2 ocfs2_stack_o2cb,ocfs2_dlmfs
=== o2hbmonitor ===
1026 o2hbmonitor
=== o2cluster ===
o2cb,ocfs2,local
=== o2cb_ctl ===
cluster:
        name = ocfs2
        node_count = 1
        status = configured

=== losetup ===
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.192249 s, 1.1 GB/s
=== mkfs ===
mkfs.ocfs2 1.8.6
Cluster stack: o2cb
Cluster name: ocfs2
Stack Flags: 0x0
NOTE: Feature extended slot map may be enabled
Label:
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg append-dio
Block size: 1024 (10 bits)
Cluster size: 4096 (12 bits)
Volume size: 209715200 (51200 clusters) (204800 blocks)
Cluster groups: 7 (tail covers 5120 clusters, rest cover 7680 clusters)
Extent allocator size: 2097152 (1 groups)
Journal size: 4194304
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 0 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful

=== mount ===
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0 204800 14904 189896 8% /mnt
cleanup

tags: added: verification-done verification-done-eoan
removed: verification-needed-eoan

This bug was fixed in the package linux - 5.3.0-40.32

---------------
linux (5.3.0-40.32) eoan; urgency=medium

  * eoan/linux: 5.3.0-40.32 -proposed tracker (LP: #1861214)

  * No sof soundcard for 'ASoC: CODEC DAI intel-hdmi-hifi1 not registered' after
    modprobe sof (LP: #1860248)
    - ASoC: SOF: Intel: fix HDA codec driver probe with multiple controllers

  * ocfs2-tools is causing kernel panics in Ubuntu Focal (Ubuntu-5.4.0-9.12)
    (LP: #1852122)
    - ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less

  * QAT drivers for C3XXX and C62X not included as modules (LP: #1845959)
    - [Config] CRYPTO_DEV_QAT_C3XXX=m, CRYPTO_DEV_QAT_C62X=m and
      CRYPTO_DEV_QAT_DH895xCC=m

  * Eoan update: upstream stable patchset 2020-01-24 (LP: #1860816)
    - scsi: lpfc: Fix discovery failures when target device connectivity bounces
    - scsi: mpt3sas: Fix clear pending bit in ioctl status
    - scsi: lpfc: Fix locking on mailbox command completion
    - Input: atmel_mxt_ts - disable IRQ across suspend
    - f2fs: fix to update time in lazytime mode
    - iommu: rockchip: Free domain on .domain_free
    - iommu/tegra-smmu: Fix page tables in > 4 GiB memory
    - dmaengine: xilinx_dma: Clear desc_pendingcount in xilinx_dma_reset
    - scsi: target: compare full CHAP_A Algorithm strings
    - scsi: lpfc: Fix SLI3 hba in loop mode not discovering devices
    - scsi: csiostor: Don't enable IRQs too early
    - scsi: hisi_sas: Replace in_softirq() check in hisi_sas_task_exec()
    - powerpc/pseries: Mark accumulate_stolen_time() as notrace
    - powerpc/pseries: Don't fail hash page table insert for bolted mapping
    - powerpc/tools: Don't quote $objdump in scripts
    - dma-debug: add a schedule point in debug_dma_dump_mappings()
    - leds: lm3692x: Handle failure to probe the regulator
    - clocksource/drivers/asm9260: Add a check for of_clk_get
    - clocksource/drivers/timer-of: Use unique device name instead of timer
    - powerpc/security/book3s64: Report L1TF status in sysfs
    - powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning
    - ext4: update direct I/O read lock pattern for IOCB_NOWAIT
    - ext4: iomap that extends beyond EOF should be marked dirty
    - jbd2: Fix statistics for the number of logged blocks
    - scsi: tracing: Fix handling of TRANSFER LENGTH == 0 for READ(6) and WRITE(6)
    - scsi: lpfc: Fix duplicate unreg_rpi error in port offline flow
    - f2fs: fix to update dir's i_pino during cross_rename
    - clk: qcom: Allow constant ratio freq tables for rcg
    - clk: clk-gpio: propagate rate change to parent
    - irqchip/irq-bcm7038-l1: Enable parent IRQ if necessary
    - irqchip: ingenic: Error out if IRQ domain creation failed
    - fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned long
    - scsi: lpfc: fix: Coverity: lpfc_cmpl_els_rsp(): Null pointer dereferences
    - PCI: rpaphp: Fix up pointer to first drc-info entry
    - scsi: ufs: fix potential bug which ends in system hang
    - powerpc/pseries/cmm: Implement release() function for sysfs device
    - PCI: rpaphp: Don't rely on firmware feature to imply drc-info support
    - PCI: rpaphp: Annotate and corr...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released