boot failing - unable to handle kernel paging request - followed by oops

Bug #1863978 reported by Michiel Valee
This bug affects 3 people
Affects: linux-azure (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

After updating to the Ubuntu kernel 5.0.0-1032.34-azure (5.0.21), the VM can no longer boot.

It gets stuck logging "soft lockup - CPU#xxx stuck for xxs" against a random process (a different one on each startup).

Full logs below.

Relevant boot logs:
(...)
[ OK ] Started /var/lib/waagent/Microsoft.…0.0.7820/scripts/enableHandler.py.
[ 14.836987] PKCS#7 signature not signed with a trusted key
[ 14.856290] PKCS#7 signature not signed with a trusted key
[ 15.123087] BUG: unable to handle kernel paging request at 000000003fffffa9
[ 15.123905] #PF error: [normal kernel read fault]
[ 15.123905] PGD 800000084ab2c067 P4D 800000084ab2c067 PUD 8586cb067 PMD 0
[ 15.135982] Oops: 0000 [#1] SMP PTI
[ 15.135982] CPU: 4 PID: 3561 Comm: microsoft-depen Tainted: P OE 5.0.0-1032-azure #34-Ubuntu
[ 15.135982] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 15.135982] RIP: 0010:iterate_extants+0x155/0x380 [microsoft_dependency_agent]
[ 15.135982] Code: ab c0 4c 89 e7 e8 bb 9b 0c e8 49 8b 44 24 08 48 85 c0 74 71 4d 8b 7c 24 08 49 83 ef 68 75 0b eb 64 49 83 ea 68 4d 89 d7 74 5b <41> 0f b7 47 10 83 e0 f7 66 83 f8 02 75 44 41 80 bf 01 02 00 00 06
[ 15.171900] RSP: 0018:ffffbd46489cbb10 EFLAGS: 00010216
[ 15.171900] RAX: 0000000040000001 RBX: ffffbd46489cbb30 RCX: 0000000000000de9
[ 15.171900] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa9ae5280
[ 15.171900] RBP: ffffbd46489cbb80 R08: 0000000000000001 R09: 000000000000024f
[ 15.171900] R10: ffffbd46489cbb98 R11: 0000000000000000 R12: ffffffffa9ae5280
[ 15.171900] R13: ffffffffc0ab59d0 R14: 0000000000000000 R15: 000000003fffff99
[ 15.171900] FS: 00007f365b7e1440(0000) GS:ffff98541dd00000(0000) knlGS:0000000000000000
[ 15.171900] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.171900] CR2: 000000003fffffa9 CR3: 000000084c1c6002 CR4: 00000000003606e0
[ 15.171900] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.171900] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 15.171900] Call Trace:
[ 15.171900] ? bs_release+0x1b0/0x1b0 [microsoft_dependency_agent]
[ 15.171900] bs_open+0x2fd/0x580 [microsoft_dependency_agent]
[ 15.171900] ? bs_open+0x2fd/0x580 [microsoft_dependency_agent]
[ 15.171900] ? bs_write+0x90/0x90 [microsoft_dependency_agent]

(...)
[ 45.331952] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [nginx:4876]
[ 45.336115] Modules linked in: microsoft_dependency_agent(POE) bluechannel(POE) veth ipt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc xt_owner iptable_security aufs overlay xt_conntrack xt_tcpudp iptable_filter xt_nat iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter nls_iso8859_1 kvm_intel kvm irqbypass serio_raw hv_balloon joydev sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel hid_generic aes_x86_64 hid_hyperv crypto_simd cryptd hyperv_fb glue_helper cfbfillrect pata_acpi hyperv_keyboard hid hv_netvsc cfbimgblt cfbcopyarea hv_utils
[ 45.395981] CPU: 2 PID: 4876 Comm: nginx Tainted: P D OE 5.0.0-1032-azure #34-Ubuntu
[ 45.400558] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 45.408757] RIP: 0010:hv_qlock_wait+0x25/0x60
[ 45.411928] Code: 38 be 00 5d c3 0f 1f 44 00 00 55 48 89 e5 53 65 8b 05 2f 75 df 57 a9 00 00 10 00 75 0d 9c 5b fa 0f b6 07 40 38 c6 74 05 53 9d <5b> 5d c3 b9 f0 00 00 40 0f 32 0f 1f 44 00 00 eb ed 48 c1 e2 20 bf
[ 45.424104] RSP: 0018:ffffbd46493a3d60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[ 45.431986] RAX: 0000000000000000 RBX: 0000000000000246 RCX: 00000000400000f0
[ 45.437191] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffa9ae5280
[ 45.440905] RBP: ffffbd46493a3d68 R08: 0000000000000009 R09: 00000000000000d8
[ 45.447992] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98541dca2000
[ 45.452055] R13: 0000000000000001 R14: 0000000000000100 R15: 0000000000000000
[ 45.457386] FS: 00007faee231fd48(0000) GS:ffff98541dc80000(0000) knlGS:0000000000000000
[ 45.459920] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.469462] CR2: 00007faee132b000 CR3: 000000082287a004 CR4: 00000000003606e0
[ 45.474543] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 45.479975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 45.484781]
(...)
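
When reading traces like the two above, the RIP line shows which symbol was executing and, in square brackets, which loadable module it belongs to (no bracket means built-in kernel code). The following is a minimal sketch of pulling that out of a log line; the function name `faulting_location` and the regular expression are illustrative assumptions, not part of any kernel tooling, and the result only says where the fault occurred, not what caused it.

```python
import re

# Matches "RIP: <cs>:<symbol>+0x<off>/0x<len>" with an optional "[module]" suffix.
# The pattern is an assumption based on the log lines quoted in this report.
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)

def faulting_location(log_line):
    """Return (symbol, module) from a RIP line, or None if the line has no RIP entry.

    module is None when the symbol is not attributed to a loadable module.
    """
    m = RIP_RE.search(log_line)
    if m is None:
        return None
    return m.group("symbol"), m.group("module")

# Lines copied from the boot log above.
first_oops = "[ 15.135982] RIP: 0010:iterate_extants+0x155/0x380 [microsoft_dependency_agent]"
lockup = "[ 45.408757] RIP: 0010:hv_qlock_wait+0x25/0x60"

print(faulting_location(first_oops))  # ('iterate_extants', 'microsoft_dependency_agent')
print(faulting_location(lockup))      # ('hv_qlock_wait', None)
```

In this log the first oops is attributed to the out-of-tree microsoft_dependency_agent module, while the later soft lockup is reported in hv_qlock_wait, a lock-wait routine.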

Last successful boot log:
https://pastebin.com/uVgcQCUC

One of the failing boot logs after updating:
https://pastebin.com/D0jnbHDb

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Joseph Price (pricechild) wrote :

We are seeing something very similar on Ubuntu 16.04 machines upgraded to linux-image-4.15.0-1071-azure.

Joseph Price (pricechild) wrote :

@michielvalee my issues were caused by a bad kernel module installed by an Azure Extension agent. Given the timing, I'd recommend you check your subscription admin's email inbox for details if you were also affected.

Augustin (agjini) wrote :

We are also affected by this problem on 3 Ubuntu 18.04 VMs since kernel 5.0.0-1032-azure was installed (installed automatically after a reboot of the server).

The Linux VMs with an older kernel (those that have not been restarted) are not affected.

Augustin (agjini) wrote :

I've upgraded my system to linux-image-5.3.0-40-generic and it fixed the problem!
Here is my procedure:
- restore a backup from before the 5.0.0-1032-azure kernel update
- restart the vm
- sudo apt update
- sudo apt upgrade
- sudo apt install --install-recommends linux-generic-hwe-18.04
- restart the vm

After the restart, the running kernel is linux-image-5.3.0-40-generic.

The server can reboot and the application can restart normally without any CPU lockup issue.
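
After a procedure like the one above, it can be useful to confirm which kernel is actually running and whether the module named in the oops trace is loaded. The following is a minimal sketch under stated assumptions: `REPORTED_RELEASE` is the release string taken from the trace in this report, `boot_risk` and `loaded_modules` are hypothetical helper names, and a "no match" result is not a guarantee that a host is safe.

```python
import platform

# Kernel release from the failing trace in this report (as printed by `uname -r`).
REPORTED_RELEASE = "5.0.0-1032-azure"
# Module named on the RIP line of the first oops in this report.
SUSPECT_MODULE = "microsoft_dependency_agent"

def loaded_modules(proc_modules_text):
    """Parse /proc/modules content; the first field of each line is the module name."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}

def boot_risk(release, proc_modules_text):
    """Summarise whether a host matches the kernel/module combination in this report."""
    on_reported_kernel = release == REPORTED_RELEASE
    module_loaded = SUSPECT_MODULE in loaded_modules(proc_modules_text)
    if on_reported_kernel and module_loaded:
        return "matches report"
    if on_reported_kernel:
        return "reported kernel, module not loaded"
    if module_loaded:
        return "module loaded on a different kernel"
    return "no match"

if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            modules_text = f.read()
    except OSError:
        modules_text = ""  # not Linux, or /proc is unavailable
    release = platform.release()
    print(release, "->", boot_risk(release, modules_text))
```

The parsing is kept separate from the file read so the check can also be run against a saved copy of /proc/modules from a VM that is being restored.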
