divide error: 0000 with Ubuntu 18.04

Bug #1814628 reported by Tony
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

A host running lots of virtual machines crashed today with the following terminal output (in case it is of any help):

[3430458.298979] divide error: 0000 [#1] SMP PTI
[3430458.304484] Modules linked in: cpuid vhost_net vhost tap nf_conntrack_ipv6 nf_defrag_ipv6 xt_physdev br_netfilter ip_set nfnetlink kvm_intel kvm irqbypass rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace sunrpc cachefiles fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge ebtable_filter ebtables ip6table_filter ip6_tables devlink iptable_filter 8021q garp mrp stp llc bonding ipmi_ssif intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp dcdbas intel_cstate intel_rapl_perf ipmi_si mei_me ipmi_devintf lpc_ich mei shpchp ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi
[3430458.389747] scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear crct10dif_pclmul mgag200 crc32_pclmul i2c_algo_bit ghash_clmulni_intel ttm pcbc mxm_wmi drm_kms_helper bnx2x aesni_intel syscopyarea sysfillrect aes_x86_64 sysimgblt fb_sys_fops crypto_simd glue_helper ptp cryptd drm pps_core ahci mdio libahci libcrc32c wmi [last unloaded: irqbypass]
[3430458.440756] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-43-generic #46-Ubuntu
[3430458.450411] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.7.1 01/26/2018
[3430458.460008] RIP: 0010:update_group_capacity+0x19d/0x1f0
[3430458.467098] RSP: 0018:ffff91043ec03bd8 EFLAGS: 00010247
[3430458.474207] RAX: 00000000006314d6 RBX: 0000000000000001 RCX: 000000000000024d
[3430458.483490] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000022880
[3430458.492769] RBP: ffff91043ec03c00 R08: ffff9104376dc940 R09: 000000001dcd6500
[3430458.502071] R10: ffff91043ec00000 R11: 0000000000000001 R12: ffff910437625000
[3430458.511391] R13: ffff9104376dc940 R14: 0000000000000000 R15: ffff91043ec03c30
[3430458.520724] FS: 0000000000000000(0000) GS:ffff91043ec00000(0000) knlGS:0000000000000000
[3430458.531161] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3430458.538994] CR2: 000000010960b008 CR3: 0000003f8300a002 CR4: 00000000003626e0
[3430458.548416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3430458.557858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3430458.567298] Call Trace:
[3430458.571507] <IRQ>
[3430458.575253] find_busiest_group+0xff/0x9c0
[3430458.581327] load_balance+0x164/0x9c0
[3430458.586928] ? sched_clock+0x9/0x10
[3430458.592347] ? sched_clock_cpu+0x11/0xb0
[3430458.598249] rebalance_domains+0x245/0x2d0
[3430458.604372] run_rebalance_domains+0x139/0x1f0
[3430458.610901] __do_softirq+0xe4/0x2bb
[3430458.616459] irq_exit+0xb8/0xc0
[3430458.621538] scheduler_ipi+0xea/0x130
[3430458.627201] smp_reschedule_interrupt+0x39/0xe0
[3430458.633860] reschedule_interrupt+0x84/0x90
[3430458.640138] </IRQ>
[3430458.644096] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
[3430458.651289] RSP: 0018:ffffa90100107e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
[3430458.661431] RAX: ffff91043ec22880 RBX: 000c2ffbd05252a7 RCX: 000000000000001f
[3430458.671114] RDX: 000c2ffbd05252a7 RSI: ffffff1e5f04e459 RDI: 0000000000000000
[3430458.680819] RBP: ffffa90100107ea8 R08: 00000000ffffffff R09: 0000000000000d17
[3430458.690557] R10: ffffa90100107e38 R11: 0000000000000eca R12: ffffc900ff401f58
[3430458.700303] R13: 0000000000000004 R14: ffffffff9bb71d18 R15: 0000000000000000
[3430458.710075] ? cpuidle_enter_state+0x97/0x2f0
[3430458.716723] cpuidle_enter+0x17/0x20
[3430458.722455] call_cpuidle+0x23/0x40
[3430458.728041] do_idle+0x18c/0x1f0
[3430458.733312] cpu_startup_entry+0x73/0x80
[3430458.739293] start_secondary+0x1ab/0x200
[3430458.745229] secondary_startup_64+0xa5/0xb0
[3430458.751415] Code: 34 10 48 8b 96 38 0a 00 00 48 8b 86 30 0a 00 00 48 8b b6 a0 09 00 00 49 d1 e9 48 29 d6 ba 00 00 00 00 48 0f 48 f2 31 d2 44 01 ce <48> f7 f6 48 3d ff 03 00 00 77 0c ba 00 04 00 00 48 29 c2 48 0f
[3430458.775397] RIP: update_group_capacity+0x19d/0x1f0 RSP: ffff91043ec03bd8
[3430458.784442] ---[ end trace c1ae6368b9740f8a ]---
[3430458.866540] Kernel panic - not syncing: Fatal exception in interrupt
[3430458.866591] Kernel Offset: 0x19600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[3430458.954898] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
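
For readers triaging the trace: in the `Code:` line, the byte wrapped in `<...>` marks the instruction at the faulting RIP. The sketch below is only an illustration, not part of the original report. It assumes standard x86-64 encoding (`REX.W F7 /6` is the unsigned 64-bit `DIV`) and uses a trimmed copy of the first register dump and the `Code:` bytes from above; the helper names are hypothetical. It shows that the faulting instruction divides by RSI, and that RSI is zero in the dump, which is consistent with the "divide error" message. It does not explain why the divisor became zero.

```python
import re

# Trimmed excerpt of the oops above: part of the first register dump and the
# tail of the Code: line (bytes before "31 d2" omitted for brevity).
OOPS = """\
RAX: 00000000006314d6 RBX: 0000000000000001 RCX: 000000000000024d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000022880
Code: 31 d2 44 01 ce <48> f7 f6 48 3d ff 03 00 00
"""

# Register numbering used by the ModRM r/m field (without REX.B extension).
REG64 = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def parse_registers(text):
    """Return {register name: value} for every 'Rxx: <16 hex digits>' pair."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the Code: bytes starting at the one marked with <..>."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [int(b.strip("<>"), 16) for b in code[start:]]


def decode_div(insn):
    """Decode only the register form of 'REX.W F7 /6' (64-bit unsigned DIV).

    Returns the divisor register name, or None for anything else.
    """
    rex, opcode, modrm = insn[:3]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if rex != 0x48 or opcode != 0xF7 or mod != 3 or reg != 6:
        return None
    return REG64[rm]


regs = parse_registers(OOPS)
divisor = decode_div(faulting_bytes(OOPS))
print(divisor, hex(regs[divisor]))  # prints: RSI 0x0
```

The decoder deliberately handles a single instruction form; a real disassembler (for example `objdump` on the matching kernel image) would be the way to confirm the surrounding instructions.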

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1814628

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic

Tony (tony-sarajarvi) wrote :

The host was inaccessible as it crashed.

Brad Figg (brad-figg)
tags: added: bjf-tracking

Terry Rudd (terrykrudd) wrote :

Tony, Thank you for reporting this issue.
1) Can you please provide any information related to the kernel that you are using and conditions prior to this crash?
   a) what kernel are you running? What has changed in the last week?
   b) do you have updated Dell System firmware?
   c) do you have any hardware errors logged?
2) Have you rebooted the system?
   a) does it cleanly pass selftest?
   b) does it boot single user mode?
   c) when you let it boot without interruption, at which init stage does the failure (if repeatable) occur?
   d) if it passes init stage 4 and the virtualized environment comes up, is it stable?

Please note: It is difficult to determine the root cause without logs, dmesg output, etc., or without knowing the condition of this system prior to the crash.

Tony (tony-sarajarvi) wrote :

We are running MAAS here, and that server was deployed approximately a month ago.

1a)
[ 0.000000] Linux version 4.15.0-45-generic (buildd@lgw01-amd64-031) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 (Ubuntu 4.15.0-45.48-generic 4.15.18)
[ 0.000000] Command line: BOOT_IMAGE=ubuntu/amd64/generic/bionic/daily/boot-kernel nomodeset ro root=squash:http://10.223.0.5:5248/images/ubuntu/amd64/generic/bionic/daily/squashfs ip=::::finer-panda:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:{'datasource_list': ['MAAS']}end_cc cloud-config-url=http://10.223.0.5:5240/MAAS/metadata/latest/by-id/yx6d6w/?op=get_preseed apparmor=0 log_host=10.223.0.5 log_port=514 --- elevator=noop console=tty1 console=ttyS0,115200 initrd=ubuntu/amd64/generic/bionic/daily/boot-initrd BOOTIF=01-f4-e9-d4-b6-5d-70

1b) Last check for them was a couple of months ago
1c) iDRAC says no.
2a) Say what? I tried starting it up and the shell appeared in the console. Then it shut itself down again before I had time to log in. Could it be MAAS shutting it down?
2b) And what's that?
2c) Don't see any failures. First it boots up:
Ubuntu 18.04.1 LTS finer-panda ttyS0

finer-panda login: [ 50.049021] cloud-init[2188]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Fri, 08 Feb 2019 07:21:19 +0000. Up 48.36 seconds.
[ 51.015952] cloud-init[2270]: Powering node off.
ci-info: no authorized ssh keys fingerprints found for user ubuntu.

Then promptly after that:
[ 51.169438] cloud-init[2270]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:final' at Fri, 08 Feb 2019 07:21:22 +0000. Up 50.67 seconds.
[ 51.169580] cloud-init[2270]: ci-info: no authorized ssh keys fingerprints found for user ubuntu.
[ 51.169694] cloud-init[2270]: Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Fri, 08 Feb 2019 07:21:23 +0000. Datasource DataSourceMAAS [http://10.223.0.5:5240/MAs
         Stopping Authorization Manager...
[ OK ] Stopped target Timers.
[ OK ] Stopped Daily apt upgrade and clean activ Stopping Accounts Service...
[ OK ] Stopped target Host and Network Name Lookups.
         Stopping Read required files in advance...
         Stopping OpenBSD Secure Shell server...
         Stopping LVM2 PV scan on device 8:2...
[ OK ] Unmounted /var/lib/lxcfs.
etc..

2d) This is a host running dozens of virtual machines that are started and shut down daily. It was stable until this day, and all the 32 exactly identical hosts are still up and running, and have been for a month.

I'll try to find out what causes it to shut down instead of staying up.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired