[Intel Ubuntu 18.04 Bug] Null pointer dereference, when disconnecting RAID rebuild target

Bug #1759279 reported by Pawel Baldysiak on 2018-03-27
This bug affects 1 person
Affects: linux (Ubuntu) | Importance: High | Assigned to: Joseph Salisbury
Affects: Bionic | Importance: High | Assigned to: Joseph Salisbury

Bug Description

The kernel panics with a NULL pointer dereference when a RAID10 rebuild target is disconnected during the rebuild. The issue is sporadic.

Steps to reproduce:
1) Create raid10 with mdadm
2) Wait for resync to end
3) Add spare drive
4) Fail one of the member drives
- The array becomes degraded and a rebuild to the spare from step 3 starts.
5) Disconnect the drive added in step 3 (the rebuild target) while the rebuild is running
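
For reference, the steps above can be sketched as a command sequence. This is only an illustration, not the exact procedure used in this report: the device names (/dev/md0, /dev/sdb through /dev/sdf) and the four-member layout are assumptions, and the report does not say how the drive was disconnected. The sysfs "delete" write in step 5 is a software stand-in that applies to SCSI/SATA disks only. The script prints the commands instead of running them.

```python
import shlex


def repro_commands(array, members, spare, victim):
    """Return the commands for the reproduction steps, in order.

    All device names are hypothetical placeholders supplied by the caller.
    """
    spare_name = spare.rsplit("/", 1)[-1]
    return [
        # 1) Create the RAID10 array
        ["mdadm", "--create", array, "--level=10",
         f"--raid-devices={len(members)}", *members],
        # 2) Block until the initial resync has finished
        ["mdadm", "--wait", array],
        # 3) Add the spare drive
        ["mdadm", array, "--add", spare],
        # 4) Fail a member; the rebuild to the spare starts
        ["mdadm", array, "--fail", victim],
        # 5) Assumption: emulate the disconnect by deleting the SCSI device
        ["sh", "-c", f"echo 1 > /sys/block/{spare_name}/device/delete"],
    ]


cmds = repro_commands(
    "/dev/md0",
    ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"],
    spare="/dev/sdf",
    victim="/dev/sdb",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

Since the issue is sporadic, a single pass through these steps may not trigger the panic.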

trace:
[ 1022.872118] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
[ 1022.881072] IP: raid10d+0xaec/0x1430 [raid10]
[ 1022.886071] PGD 0 P4D 0
[ 1022.889033] Oops: 0002 [#1] SMP PTI
[ 1022.893056] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic kvm_intel kvm snd_hda_intel snd_hda_codec irqbypass snd_hda_core ipmi_ssif intel_cstate joydev snd_hwdep intel_rapl_perf input_leds snd_pcm snd_timer ioatdma dca lpc_ich snd soundcore shpchp ipmi_si ipmi_devintf ipmi_msghandler tpm_crb acpi_pad acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc
[ 1022.973751] ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear dm_mirror dm_region_hash dm_log nouveau hid_generic mxm_wmi mgag200 video i2c_algo_bit usbhid ttm e1000e i40e crct10dif_pclmul hid ptp crc32_pclmul ghash_clmulni_intel pcbc drm_kms_helper aesni_intel syscopyarea aes_x86_64 raid1 sysfillrect crypto_simd sysimgblt glue_helper uas fb_sys_fops cryptd ahci vmd drm usb_storage pps_core libahci wmi
[ 1023.026580] CPU: 90 PID: 6373 Comm: md126_raid10 Not tainted 4.15.0-10-generic #11-Ubuntu
[ 1023.035831] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDTRL1.86B.0151.R03.1801050249 01/05/2018
[ 1023.046913] RIP: 0010:raid10d+0xaec/0x1430 [raid10]
[ 1023.052479] RSP: 0018:ffffb5178747bd70 EFLAGS: 00010246
[ 1023.058429] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff99b3bd0d5e20
[ 1023.066517] RDX: ffffffffc025f8c0 RSI: 0000000000000286 RDI: ffff99b7a1ed5c00
[ 1023.074605] RBP: ffffb5178747be90 R08: 0000000000000349 R09: 0000000000000000
[ 1023.082697] R10: ffffb5178747bd70 R11: 0000000000000365 R12: 0000000000000000
[ 1023.090790] R13: ffff99b3d97dbf70 R14: ffff99b3bd0d5e00 R15: ffff99b3bd0d5e00
[ 1023.098883] FS: 0000000000000000(0000) GS:ffff99b7ed880000(0000) knlGS:0000000000000000
[ 1023.108051] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1023.114602] CR2: 00000000000000f0 CR3: 00000001d0c0a004 CR4: 00000000007606e0
[ 1023.122707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1023.130804] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1023.138915] PKRU: 55555554
[ 1023.142086] Call Trace:
[ 1023.144964] ? __clear_rsb+0x15/0x3d
[ 1023.149105] ? __schedule+0x29f/0x8a0
[ 1023.153340] ? __clear_rsb+0x25/0x3d
[ 1023.157478] ? schedule+0x2c/0x80
[ 1023.161326] md_thread+0x129/0x170
[ 1023.165273] ? raid10_start_reshape+0x630/0x630 [raid10]
[ 1023.171349] ? md_thread+0x129/0x170
[ 1023.175484] ? wait_woken+0x80/0x80
[ 1023.179521] kthread+0x121/0x140
[ 1023.183267] ? find_pers+0x70/0x70
[ 1023.187204] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1023.192986] ? do_syscall_64+0x118/0x130
[ 1023.197505] ret_from_fork+0x35/0x40
[ 1023.201631] Code: e4 48 8b 57 48 0f 84 92 08 00 00 49 83 7c 24 48 00 0f 84 86 08 00 00 48 63 d8 48 c1 e3 05 48 85 d2 74 41 49 8b 46 08 48 8b 04 18 <f0> ff 80 f0 00 00 00 49 8b 46 08 48 8b 04 18 48 8b 40 30 48 8b
[ 1023.223245] RIP: raid10d+0xaec/0x1430 [raid10] RSP: ffffb5178747bd70
[ 1023.230490] CR2: 00000000000000f0
[ 1023.234340] ---[ end trace 12e1280fca9f2646 ]---
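
A rough reading of the trace, as a sketch rather than an authoritative analysis: the page-fault error code from the "Oops: 0002" line and the bytes at the <f0> marker in the "Code:" line can be decoded by hand. The helper names below are made up for this illustration. The bit meanings follow the x86 page-fault error code layout, and the instruction bytes f0 ff 80 f0 00 00 00 decode as a LOCK-prefixed 32-bit increment of memory at RAX + 0xf0.

```python
# x86 page-fault error code bits: (meaning if 0, meaning if 1)
PF_BITS = {
    0: ("not-present page", "protection violation"),
    1: ("read", "write"),
    2: ("kernel mode", "user mode"),
}


def decode_pf_error(code):
    """Describe the low three bits of a page-fault error code."""
    return [names[(code >> bit) & 1] for bit, names in sorted(PF_BITS.items())]


def faulting_bytes(code_line, count=7):
    """Return `count` bytes starting at the <xx> marker of an oops Code: line."""
    tokens = code_line.split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:start + count])


# Tail of the Code: line from the trace above
code_line = "48 8b 04 18 <f0> ff 80 f0 00 00 00 49 8b 46 08"
insn = faulting_bytes(code_line)

# f0 = LOCK prefix, ff /0 = INC r/m32, ModRM 0x80 = [rax + disp32]
disp32 = int.from_bytes(insn[3:7], "little")
rax = 0x0  # RAX value from the register dump

print(decode_pf_error(0x0002))  # ['not-present page', 'write', 'kernel mode']
print(hex(rax + disp32))        # 0xf0, the same address as CR2
```

This is consistent with an atomic increment of a field at offset 0xf0 through a pointer that was NULL when raid10d ran. Which structure and field that is cannot be determined from the trace alone.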

Additional information:
The following upstream patches solve the issue:

md: document lifetime of internal rdev pointer.
https://marc.info/?l=linux-raid&m=151761002007155&w=2
https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=f2785b527cda46314805123ddcbc871655b7c4c4

md: only allow remove_and_add_spares when no sync_thread running.
https://marc.info/?l=linux-raid&m=151761004007159&w=2
https://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git/commit/?h=for-next&id=39772f0a7be3b3dc26c74ea13fe7847fd1522c8b

information type: Public → Private
tags: added: kernel-da-key
Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: bionic
Changed in linux (Ubuntu Bionic):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commits f2785b527cda and 39772f0a7be. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1759279

Can you test this kernel and see if it resolves this bug?

Note, to test this kernel, you need to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Hello!

I can confirm that the issue does not occur on the given kernel (4.15.0-12-generic #13~lp1759279).

Thanks for your work!

information type: Private → Public
Seth Forshee (sforshee) on 2018-03-30
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-15.16

---------------
linux (4.15.0-15.16) bionic; urgency=medium

  * linux: 4.15.0-15.16 -proposed tracker (LP: #1761177)

  * FFe: Enable configuring resume offset via sysfs (LP: #1760106)
    - PM / hibernate: Make passing hibernate offsets more friendly

  * /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

  * Ubuntu18.04:POWER9:DD2.2 - Unable to start a KVM guest with default machine
    type(pseries-bionic) complaining "KVM implementation does not support
    Transactional Memory, try cap-htm=off" (kvm) (LP: #1752026)
    - powerpc: Use feature bit for RTC presence rather than timebase presence
    - powerpc: Book E: Remove unused CPU_FTR_L2CSR bit
    - powerpc: Free up CPU feature bits on 64-bit machines
    - powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2
    - powerpc/powernv: Provide a way to force a core into SMT4 mode
    - KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
    - KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
    - KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state

  * Important Kernel fixes to be backported for Power9 (kvm) (LP: #1758910)
    - powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

  * Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16
    namespaces (Bolt / NVMe) (LP: #1757497)
    - powerpc/64s: Fix lost pending interrupt due to race causing lost update to
      irq_happened

  * fwts-efi-runtime-dkms 18.03.00-0ubuntu1: fwts-efi-runtime-dkms kernel module
    failed to build (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

linux (4.15.0-14.15) bionic; urgency=medium

  * linux: 4.15.0-14.15 -proposed tracker (LP: #1760678)

  * [Bionic] mlx4 ETH - mlnx_qos failed when set some TC to vendor
    (LP: #1758662)
    - net/mlx4_en: Change default QoS settings

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * Bionic update to 4.15.15 stable release (LP: #1760585)
    - net: dsa: Fix dsa_is_user_port() test inversion
    - openvswitch: meter: fix the incorrect calculation of max delta_t
    - qed: Fix MPA unalign flow in case header is split across two packets.
    - tcp: purge write queue upon aborting the connection
    - qed: Fix non TCP packets should be dropped on iWARP ll2 connection
    - sysfs: symlink: export sysfs_create_link_nowarn()
    - net: phy: relax error checking when creating sysfs link netdev->phydev
    - devlink: Remove redundant free on error path
    - macvlan: filter out unsupported feature flags
    - net: ipv6: keep sk status consistent after datagram connect failure
    - ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
    - ipv6: sr: fix NULL pointer dereference when setting encap source address
    - ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
    - mlxsw: spectrum_buffers: Set a minimum quota for CPU port traffic
    - net: phy: Tell caller result ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
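
To check whether a Bionic system runs a kernel at or after the fixed release named above (linux 4.15.0-15.16), something like the following sketch could be used. It assumes the usual Ubuntu release string format "4.15.0-<ABI>-<flavour>" and only makes sense for the 4.15.0 series. The function name is hypothetical, and locally built test kernels (such as the 4.15.0-12 lp1759279 build above) are not recognised.

```python
import platform
import re

FIXED_ABI = 15  # from "linux - 4.15.0-15.16" in the changelog entry


def has_fix(release):
    """True if `release` is a 4.15.0-N kernel with N >= FIXED_ABI.

    Returns False for anything that does not parse as 4.15.0-N,
    including other kernel series, so False is not proof of a bug.
    """
    match = re.match(r"4\.15\.0-(\d+)", release)
    return bool(match) and int(match.group(1)) >= FIXED_ABI


print(has_fix("4.15.0-10-generic"))  # False: the kernel from the trace
print(has_fix("4.15.0-15-generic"))  # True: the fixed release
print(has_fix(platform.release()))   # result for the running kernel
```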