System hang with kernel traces while entering reboot process on a Disco ARM64 moonshot node

Bug #1859582 reported by Po-Hsu Lin
This bug affects 1 person
Affects         Status         Importance   Assigned to     Milestone
linux (Ubuntu)  Invalid        Undecided    Unassigned
Disco           Fix Released   Undecided    Marcelo Cerri

Bug Description

This issue occurs on a Moonshot node. The node had just been deployed by MAAS, but it never came back after the reboot command was issued.

Here is the console output:

[ OK ] Started Reboot.
[ OK ] Reached target Reboot.
[ 450.084108] kernel BUG at mm/slub.c:305!
[ 450.131160] Internal error: Oops - BUG: 0 [#1] SMP
[ 450.188640] Modules linked in: dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua gpio_keys_polled input_polldev crct10dif_ce xgene_rng mailbox_xgene_slimpro uio_pdrv_genirq uio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx4_ib ib_uverbs ib_core mlx4_en mlx4_core gpio_dwapb devlink ahci_xgene gpio_xgene_sb
[ 450.758357] Process shutdown (pid: 1, stack limit = 0x00000000cb9d7b95)
[ 450.837642] CPU: 2 PID: 1 Comm: shutdown Not tainted 5.0.0-38-generic #41-Ubuntu
[ 450.926416] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 450.999441] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 451.056823] pc : __slab_free+0x170/0x3d8
[ 451.103865] lr : kfree+0x1b4/0x1c8
[ 451.144549] sp : ffff000017c2b950
[ 451.184190] x29: ffff000017c2b950 x28: ffff00001173c708
[ 451.247825] x27: 0000000000000010 x26: ffff7e003e700220
[ 451.311461] x25: 0000000000000002 x24: ffff800f9c00a000
[ 451.375095] x23: ffff800ff6003800 x22: 0000000000000000
[ 451.438731] x21: 0000000080200008 x20: ffff800f9c00a000
[ 451.502365] x19: ffff7e003e700200 x18: 0000000000000000
[ 451.566001] x17: 0000000000000000 x16: 0000000000000000
[ 451.629637] x15: ffff000010fb7f30 x14: ffff800fab1ef390
[ 451.693271] x13: ffff800ff5a59020 x12: 0000000000000000
[ 451.756906] x11: ffff800ff5a58ff8 x10: 0000000000000040
[ 451.820542] x9 : ffff800fab1ef398 x8 : 0000000000000001
[ 451.884178] x7 : ffff800f9c00a000 x6 : 0000000000000001
[ 451.947812] x5 : 0000000000210d00 x4 : 0000000000000001
[ 452.011448] x3 : ffff800f9c00a000 x2 : 0000000000000000
[ 452.075083] x1 : 0000000040000000 x0 : 0000000000210d00
[ 452.138719] Call trace:
[ 452.167930] __slab_free+0x170/0x3d8
[ 452.210700] kfree+0x1b4/0x1c8
[ 452.247217] cm_remove_one+0x21c/0x2b0 [ib_cm]
[ 452.300540] ib_unregister_device+0x100/0x218 [ib_core]
[ 452.363237] mlx4_ib_remove+0x84/0x1f8 [mlx4_ib]
[ 452.418655] mlx4_remove_device+0xcc/0xf8 [mlx4_core]
[ 452.479272] mlx4_unregister_device+0x84/0x158 [mlx4_core]
[ 452.545104] mlx4_unload_one+0x88/0x2c8 [mlx4_core]
[ 452.603634] mlx4_shutdown+0x70/0x88 [mlx4_core]
[ 452.659025] pci_device_shutdown+0x44/0x88
[ 452.708159] device_shutdown+0x134/0x240
[ 452.755207] kernel_restart_prepare+0x44/0x50
[ 452.807469] kernel_restart+0x20/0x68
[ 452.851284] __se_sys_reboot+0x10c/0x230
[ 452.898228] __arm64_sys_reboot+0x24/0x30
[ 452.946216] el0_svc_common+0xa0/0x168
[ 452.991073] el0_svc_handler+0x38/0x78
[ 453.035930] el0_svc+0x8/0xc
[ 453.070357] Code: 8b020303 eb14031f 54fff921 d503201f (d4210000)
[ 453.143490] ---[ end trace 6177acd8b3b927ab ]---
[ 453.201910] printk: shutdown: 4 output lines suppressed due to ratelimiting
[ 453.285555] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 453.377460] Kernel Offset: disabled
[ 453.419286] CPU features: 0x000,20802000
[ 453.466230] Memory Limit: none
[ 453.502745] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
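For triage, it can help to reduce a console capture like the one above to the chain of functions and modules in the call trace. The following is a minimal sketch, not an official tool: `parse_call_trace` and `FRAME_RE` are hypothetical names, the regular expression assumes the exact line layout seen in this report, and the label "vmlinux" for frames without a module suffix is only a convention chosen here.

```python
import re

# Matches frames such as "[ 452.247217] cm_remove_one+0x21c/0x2b0 [ib_cm]".
# Assumption: frames follow the layout seen in this report's console output.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"              # printk timestamp
    r"(?P<func>[A-Za-z_][\w.]*)"        # symbol name
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"        # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"  # optional module in brackets
)


def parse_call_trace(console_text):
    """Return (function, module) pairs from the first 'Call trace:' block."""
    frames = []
    in_trace = False
    for line in console_text.splitlines():
        if not in_trace:
            # Skip everything until the "Call trace:" header line.
            in_trace = line.rstrip().endswith("Call trace:")
            continue
        match = FRAME_RE.match(line.strip())
        if match is None:
            break  # the first non-frame line ends the trace
        # Frames without a module suffix are labelled "vmlinux" here.
        frames.append((match.group("func"), match.group("module") or "vmlinux"))
    return frames


sample = """\
[ 452.138719] Call trace:
[ 452.167930] __slab_free+0x170/0x3d8
[ 452.210700] kfree+0x1b4/0x1c8
[ 452.247217] cm_remove_one+0x21c/0x2b0 [ib_cm]
[ 452.300540] ib_unregister_device+0x100/0x218 [ib_core]
[ 453.070357] Code: 8b020303 eb14031f 54fff921 d503201f (d4210000)
"""

for func, module in parse_call_trace(sample):
    print(f"{func:24s} {module}")
```

Run against the full trace, this shows the free that triggers the BUG being reached from `mlx4_shutdown` through `ib_unregister_device` and `cm_remove_one` in `ib_cm`.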

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1859582

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: disco
Po-Hsu Lin (cypressyew)
tags: added: arm64
Po-Hsu Lin (cypressyew) wrote : Re: Kernel panic while entering reboot process on a Disco ARM64 moonshot node

This can easily be reproduced on another moonshot node with the same 5.0.0-38 kernel (clean Disco deployment by MAAS).

The issue exists in the proposed 5.0.0-39 kernel as well.

[ OK ] Reached target Final Step.
[ OK ] Started Reboot.
[ OK ] Reached target Reboot.
         Stopping LVM2 metadata daemon...
[ 433.924174] kernel BUG at mm/slub.c:305!
[ 433.971224] Internal error: Oops - BUG: 0 [#1] SMP
[ 434.028703] Modules linked in: dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua gpio_keys_polled input_polldev mailbox_xgene_slimpro crct10dif_ce xgene_rng uio_pdrv_genirq uio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx4_ib ib_uverbs ib_core mlx4_en mlx4_core devlink gpio_dwapb ahci_xgene gpio_xgene_sb
[ 434.598420] Process shutdown (pid: 1, stack limit = 0x00000000024008b6)
[ 434.677705] CPU: 5 PID: 1 Comm: shutdown Not tainted 5.0.0-38-generic #41-Ubuntu
[ 434.766479] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 434.839504] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 434.896885] pc : __slab_free+0x170/0x3d8
[ 434.943928] lr : kfree+0x1b4/0x1c8
[ 434.984611] sp : ffff000017c2b950
[ 435.024253] x29: ffff000017c2b950 x28: ffff00001173c708
[ 435.087888] x27: 0000000000000010 x26: ffff7e003ea93020
[ 435.151524] x25: 0000000000000002 x24: ffff800faa4c2c00
[ 435.215159] x23: ffff800ff6003800 x22: 0000000000000000
[ 435.278794] x21: 000000008020000f x20: ffff800faa4c2c00
[ 435.342429] x19: ffff7e003ea93000 x18: 000000000000000c
[ 435.406063] x17: 0000000000000000 x16: 0000000000000000
[ 435.469700] x15: ffff000010fb7f30 x14: ffff800fab1ef390
[ 435.533334] x13: ffff800ff50ec760 x12: 0000000000000000
[ 435.596969] x11: ffff800ff50ec6d8 x10: 0000000000000040
[ 435.660605] x9 : ffff800fab1ef398 x8 : 0000000000000001
[ 435.724240] x7 : ffff800faa4c2c00 x6 : 0000000000000001
[ 435.787875] x5 : 0000000000210d00 x4 : 0000000000000001
[ 435.851510] x3 : ffff800faa4c2c00 x2 : 0000000000000000
[ 435.915146] x1 : 0000000040000000 x0 : 0000000000210d00
[ 435.978781] Call trace:
[ 436.007993] __slab_free+0x170/0x3d8
[ 436.050763] kfree+0x1b4/0x1c8
[ 436.087281] cm_remove_one+0x21c/0x2b0 [ib_cm]
[ 436.140602] ib_unregister_device+0x100/0x218 [ib_core]
[ 436.203300] mlx4_ib_remove+0x84/0x1f8 [mlx4_ib]
[ 436.258703] mlx4_remove_device+0xcc/0xf8 [mlx4_core]
[ 436.319322] mlx4_unregister_device+0x84/0x158 [mlx4_core]
[ 436.385154] mlx4_unload_one+0x88/0x2c8 [mlx4_core]
[ 436.443684] mlx4_shutdown+0x70/0x88 [mlx4_core]
[ 436.499075] pci_device_shutdown+0x44/0x88
[ 436.548208] device_shutdown+0x134/0x240
[ 436.595153] kernel_restart_prepare+0x44/0x50
[ 436.647415] kernel_restart+0x20/0x68
[ 436.691230] __se_sys_reboot+0x10c/0x230
[ 436.738173] __arm64_sys_reboot+0x24/0x30
[ 436.786161] el0_svc_common+0xa0/0x168
[ 436.831018] el0_svc_handler+0x38/0x78
[ 436.875876] el0_svc+0x8/0xc
[ 436.910304] Code: 8b020303 eb14031f 54fff921 d503201f (d4210000)
...


summary: - Kernel panic while entering reboot process on a Disco ARM64 moonshot
- node
+ System hand with kernel traces while entering reboot process on a Disco
+ ARM64 moonshot node
Changed in linux (Ubuntu Disco):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
summary: - System hand with kernel traces while entering reboot process on a Disco
+ System hang with kernel traces while entering reboot process on a Disco
ARM64 moonshot node
Marcelo Cerri (mhcerri) wrote :
Marcelo Cerri (mhcerri) wrote :

Test kernel for arm64 available at https://kernel.ubuntu.com/~mhcerri/lp1859582/

tags: added: patch
Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Disco):
status: Confirmed → Fix Committed
Po-Hsu Lin (cypressyew) wrote :

Tested the kernel from comment #4 on the affected moonshot node; it can be rebooted normally. (I can't activate the console this time, but at least it doesn't hang.)

syslog: http://paste.ubuntu.com/p/RpTsBv9DSs/

Another thing to note: MAAS fails to deploy the node with the Disco image (5.0.0-38) today because the node hangs on reboot. I need to power-cycle the node manually from the moonshot console to complete the deployment, which I don't recall being necessary yesterday.

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Marcelo Cerri (mhcerri)
Changed in linux (Ubuntu Disco):
assignee: nobody → Marcelo Cerri (mhcerri)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-disco' to 'verification-done-disco'. If the problem still exists, change the tag 'verification-needed-disco' to 'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-disco
Po-Hsu Lin (cypressyew) wrote :

Verified on moonshot node "" with the proposed kernel 5.0.0-40

[ OK ] Unmounted Mount unit for core, revision 8271.
[ OK ] Unmounted /boot.
[ OK ] Unmounted /run/snapd/ns/lxd.mnt.
         Unmounting /run/snapd/ns...
[ OK ] Unmounted /run/snapd/ns.
[ OK ] Stopped target Local File Systems (Pre).
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped Create System Users.
         Stopping Monitoring of LVM…meventd or progress polling...
[ OK ] Stopped target Swap.
         Deactivating swap /swap.img...
[ OK ] Stopped Monitoring of LVM2… dmeventd or progress polling.
[ OK ] Deactivated swap /swap.img.
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Reached target Shutdown.
[ OK ] Reached target Final Step.
[ OK ] Started Reboot.
[ OK ] Reached target Reboot.
         Stopping LVM2 metadata daemon...
[ 195.186614] reboot: Restarting system

Reboot process is good, thanks!
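A check like the one above can be scripted when verifying several nodes. This is only a sketch under stated assumptions: `classify_reboot` is a hypothetical helper, it expects the serial-console capture as a plain string, and the marker strings are the ones that appear in the console output in this report.

```python
def classify_reboot(console_text):
    """Classify a captured serial-console log of a reboot attempt.

    Returns "crashed" if a kernel BUG or panic line appears, "clean" if the
    kernel logged "reboot: Restarting system", and "unknown" otherwise.
    """
    # Marker strings taken from the console output in this bug report.
    crash_markers = ("kernel BUG at", "Kernel panic - not syncing")
    if any(marker in console_text for marker in crash_markers):
        return "crashed"
    if "reboot: Restarting system" in console_text:
        return "clean"
    return "unknown"


good_log = "[ OK ] Reached target Reboot.\n[ 195.186614] reboot: Restarting system\n"
bad_log = "[ OK ] Reached target Reboot.\n[ 450.084108] kernel BUG at mm/slub.c:305!\n"

print(classify_reboot(good_log))
print(classify_reboot(bad_log))
```

The crash markers are checked first so that a log containing both a BUG and a later restart line is still reported as a failure.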

tags: added: verification-done-disco
removed: verification-needed-disco
Po-Hsu Lin (cypressyew) wrote :

moonshot node "ms10-35-mcdivittB0"

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.0.0-40.44

---------------
linux (5.0.0-40.44) disco; urgency=medium

  * disco/linux: 5.0.0-40.44 -proposed tracker (LP: #1859724)

  * use-after-free in i915_ppgtt_close (LP: #1859522) // CVE-2020-7053
    - SAUCE: drm/i915: Fix use-after-free when destroying GEM context

  * CVE-2019-14615
    - drm/i915/gen9: Clear residual context state on context switch

  * System hang with kernel traces while entering reboot process on a Disco
    ARM64 moonshot node (LP: #1859582)
    - Revert "RDMA/cm: Fix memory leak in cm_add/remove_one"

linux (5.0.0-39.43) disco; urgency=medium

  * disco/linux: 5.0.0-39.43 -proposed tracker (LP: #1858547)

  * [Regression] usb usb2-port2: Cannot enable. Maybe the USB cable is bad?
    (LP: #1856608)
    - SAUCE: Revert "usb: handle warm-reset port requests on hub resume"

  * PAN is broken for execute-only user mappings on ARMv8 (LP: #1858815)
    - arm64: Revert support for execute-only user mappings

  * Fix unusable USB hub on Dell TB16 after S3 (LP: #1855312)
    - SAUCE: USB: core: Make port power cycle a seperate helper function
    - SAUCE: USB: core: Attempt power cycle port when it's in eSS.Disabled state

  * [sas-1126]scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()
    (LP: #1853992)
    - scsi: hisi_sas: Fix out of bound at debug_I_T_nexus_reset()

  * [sas-1126]scsi: hisi_sas: Assign NCQ tag for all NCQ commands (LP: #1853995)
    - scsi: hisi_sas: Assign NCQ tag for all NCQ commands

  * [sas-1126]scsi: hisi_sas: Fix the conflict between device gone and host
    reset (LP: #1853997)
    - scsi: hisi_sas: Fix the conflict between device gone and host reset

  * scsi: hisi_sas: Check sas_port before using it (LP: #1855952)
    - scsi: hisi_sas: Check sas_port before using it

  * CVE-2019-18885
    - btrfs: refactor btrfs_find_device() take fs_devices as argument
    - btrfs: merge btrfs_find_device and find_device

  * Integrate Intel SGX driver into linux-azure (LP: #1844245)
    - [Packaging] Add systemd service to load intel_sgx

  * [SRU][B/OEM-B/OEM-OSP1/D/E/F] Add LG I2C touchscreen multitouch support
    (LP: #1857541)
    - SAUCE: HID: multitouch: Add LG MELF0410 I2C touchscreen support

  * cifs: DFS Caching feature causing problems traversing multi-tier DFS setups
    (LP: #1854887)
    - cifs: Fix retrieval of DFS referrals in cifs_mount()

  * qede driver causes 100% CPU load (LP: #1855409)
    - qede: Handle infinite driver spinning for Tx timestamp.

  * [roce-1126]RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver
    (LP: #1853989)
    - RDMA/hns: Bugfix for slab-out-of-bounds when unloading hip08 driver
    - RDMA/hns: bugfix for slab-out-of-bounds when loading hip08 driver

  * [roce-1126]RDMA/hns: Fixs hw access invalid dma memory error (LP: #1853990)
    - RDMA/hns: Fixs hw access invalid dma memory error

  * [hns-1126]net: hns3: revert to old channel when setting new channel num fail
    (LP: #1853983)
    - net: hns3: revert to old channel when setting new channel num fail

  * [hns-1126]net: hns3: fix port setting handle for fibre port
    (LP: #1853984)
    - net: hns3: fix port setting handle for fibre...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released