kernel BUG at fs/ocfs2/alloc.c:1514

Bug #1818501 reported by Niklas Rother
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The current bionic kernel (4.15) contains a known bug in the OCFS2 distributed filesystem, which can cause all nodes (!) of a redundant cluster to crash. More information on this bug (including the patch) can be found here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=841144

This fix was included upstream in 4.16, so it is included in the HWE stack, but not in the GA kernel.
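
As a rough first check on whether a node runs a kernel from before the 4.16 fix, the release string can be compared against 4.16. This is only a sketch: the function name `kernel_older_than_4_16` is made up for illustration, and a distribution kernel may carry backported patches, so the version number alone does not prove that the fix is present or absent.

```shell
# Hypothetical helper (not part of any Ubuntu tool): succeeds (exit 0) when
# the given kernel release string, e.g. "4.15.0-45-generic", is older than 4.16.
kernel_older_than_4_16() {
    release="$1"
    major="${release%%.*}"        # text before the first dot
    rest="${release#*.}"          # text after the first dot
    minor="${rest%%.*}"           # text between the first and second dot
    minor="${minor%%[!0-9]*}"     # drop any non-numeric suffix such as "-rc1"
    [ "$major" -lt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -lt 16 ]; }
}

# Check the running kernel; the messages are illustrative only.
if kernel_older_than_4_16 "$(uname -r)"; then
    echo "kernel predates 4.16; the OCFS2 fix may be missing"
else
    echo "kernel is 4.16 or newer"
fi
```

The check only looks at the major and minor version, which matches the claim that the fix landed upstream in 4.16.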

In my opinion this is quite a severe bug, because it can bring down a whole redundant setup (this happened to us). The patch should be backported to 4.15.

# cat /proc/version_signature
Ubuntu 4.15.0-45.48-generic 4.15.18

# lsb_release -rd
Description: Ubuntu 18.04.2 LTS
Release: 18.04

Tags: bionic
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1818501

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic
Niklas Rother (nrother) wrote :

Attaching logs with apport-collect is not easily possible due to our security setup, but it should not be necessary because all the information can be found in the linked Debian bug report.

Here is our stack trace from when the bug occurred:

Mär 02 06:25:59 prometheus-lo kernel: ------------[ cut here ]------------
Mär 02 06:25:59 prometheus-lo kernel: kernel BUG at /build/linux-uQJ2um/linux-4.15.0/fs/ocfs2/alloc.c:1514!
Mär 02 06:25:59 prometheus-lo kernel: invalid opcode: 0000 [#1] SMP PTI
Mär 02 06:25:59 prometheus-lo kernel: Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 devlink ebtable_filter ebtables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
Mär 02 06:25:59 prometheus-lo kernel: xt_addrtype xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ib_iser rdma_cm iw_cm ib_cm ib_core
Mär 02 06:25:59 prometheus-lo kernel: CPU: 0 PID: 9345 Comm: kworker/0:1 Not tainted 4.15.0-45-generic #48-Ubuntu
Mär 02 06:25:59 prometheus-lo kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
Mär 02 06:25:59 prometheus-lo kernel: Workqueue: dio/dm-0 dio_aio_complete_work
Mär 02 06:25:59 prometheus-lo kernel: RIP: 0010:ocfs2_grow_tree+0x5e9/0x7e0 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: RSP: 0018:ffffbea20df37a28 EFLAGS: 00010246
Mär 02 06:25:59 prometheus-lo kernel: RAX: 0000000000000000 RBX: ffffbea20df37da0 RCX: ffffbea20df37bb8
Mär 02 06:25:59 prometheus-lo kernel: RDX: ffffbea20df37ac4 RSI: ffffbea20df37da0 RDI: ffff9679f54479f0
Mär 02 06:25:59 prometheus-lo kernel: RBP: ffffbea20df37a98 R08: 0000000000000000 R09: ffffbea20df37c58
Mär 02 06:25:59 prometheus-lo kernel: R10: ffffbea20df37b68 R11: 0000000000000030 R12: ffff9676ba5d95a0
Mär 02 06:25:59 prometheus-lo kernel: R13: ffff9679e321d0c0 R14: ffff9676ba5d95a0 R15: 0000000000000001
Mär 02 06:25:59 prometheus-lo kernel: FS: 0000000000000000(0000) GS:ffff9679ffc00000(0000) knlGS:0000000000000000
Mär 02 06:25:59 prometheus-lo kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mär 02 06:25:59 prometheus-lo kernel: CR2: 00007f1b137ba3cc CR3: 00000013a720a002 CR4: 00000000007626f0
Mär 02 06:25:59 prometheus-lo kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mär 02 06:25:59 prometheus-lo kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mär 02 06:25:59 prometheus-lo kernel: PKRU: 55555554
Mär 02 06:25:59 prometheus-lo kernel: Call Trace:
Mär 02 06:25:59 prometheus-lo kernel: ? ocfs2_set_buffer_uptodate+0x34/0x490 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ocfs2_split_and_insert+0x332/0x4d0 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ? ocfs2_read_blocks+0x304/0x600 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ocfs2_split_extent+0x3cb/0x530 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ? ocfs2_dinode_set_last_eb_blk+0x20/0x20 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ocfs2_change_extent_flag+0x25b/0x3e0 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ocfs2_mark_extent_written+0xad/0x1c0 [ocfs2]
Mär 02 06:25:59 prometheus-lo kernel: ocfs2...


Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

These seem to be the required commits:
3e10b793fc40dfdbe51762e0d084bd6f2c8acaaa ocfs2: budget for extent tree splits when adding refcount flag
06a70305812c3973c66824f26223656283c59b27 ocfs2: prohibit refcounted swapfiles
86544fbd853c49a9eccb3d0f4e7eb9317f3fccf9 ocfs2: add newlines to some error messages
84e40080bd6f363ddbcab75b04cb7bc742efbf12 ocfs2: convert inode refcount test to a helper

Kai-Heng Feng (kaihengfeng) wrote :

But these are already in v4.15. Has the commit that fixes the issue been identified?

Niklas Rother (nrother) wrote :

It should be these two:
63de8bd9328bf2a778fc277503da163ae3defa3c ocfs2: make metadata estimation accurate and clear
71a36944042b7d9dd71f6a5d1c5ea1c2353b5d42 ocfs2: try to reuse extent block in dealloc without meta_alloc

Kai-Heng Feng (kaihengfeng) wrote :
Niklas Rother (nrother) wrote :

Unfortunately, we don't have a dedicated test system here, and I don't really want to test this on the production system. I can confirm that upgrading to the HWE kernel (4.18, which contains the patches) seems to solve the problem.

Kai-Heng Feng (kaihengfeng) wrote :

Is there a reproducer?
