4.12.0-11-generic - crashing in infrastructure on i386 openvswitch tests

Bug #1712831 reported by Christian Ehrhardt 
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Triaged
High
Unassigned
Artful
Triaged
High
Unassigned

Bug Description

Hi,
this seems to me to be a kernel crash of some sorts.
Somewhat in the spirit of older bugs:
- bug 1630940
- bug 1630578

Xnox asked me to look into a hang on openvswitch dep8 tests.
What I found initially was in the log just
  "ERROR: Removing temporary files on testbed timed out"
That message brought me to the two bugs above.

But in there I read that this was the infra running dep8 crashing.
So for a better bug report I tried to reproduce locally and that actually seems to work very reliable.

To reproduce do:
$ autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
$ pull-lp-source openvswitch
$ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.8.0~git20170809.7aa47a19d-0ubuntu1.dsc -- qemu ~/work/autopkgtest-artful-i386.img
# This guest currently will crash after a while of testing

But with that running you can attach to the console and monitor of that guest.
For example:
$ sudo nc -U /tmp/autopkgtest-virt-qemu.*/ttyS0

That gave me a crash on the hang which kind of matches the older bugs, here the console:
[ 54.256253] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 54.257156] IP: add_grec+0x28/0x440
[ 54.257553] *pdpt = 000000001a869001 *pde = 0000000000000000
[ 54.257555]
[ 54.258338] Oops: 0000 [#1] SMP
[ 54.258638] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache ppdev kvm_intel kvm irqbypass joydev input_leds serio_raw 9pnet_virtio parport_pc 9pnet parport qemu_fw_cfg i2c_piix4 mac_hid ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse virtio_blk virtio_net floppy pata_acpi
[ 54.261891] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.12.0-11-generic #12-Ubuntu
[ 54.262715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 54.263610] task: c7b622c0 task.stack: c7b5a000
[ 54.264039] EIP: add_grec+0x28/0x440
[ 54.264378] EFLAGS: 00010202 CPU: 0
[ 54.264711] EAX: 00000000 EBX: dd062540 ECX: 00000006 EDX: dd062540
[ 54.265308] ESI: dd1e6e00 EDI: dd1e6e00 EBP: dbcc5f30 ESP: dbcc5ef0
[ 54.265793] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 54.266297] CR0: 80050033 CR2: 00000000 CR3: 1e930d80 CR4: 000006f0
[ 54.266885] Call Trace:
[ 54.267120] <SOFTIRQ>
[ 54.267349] mld_ifc_timer_expire+0xfe/0x250
[ 54.267754] ? mld_dad_timer_expire+0x50/0x50
[ 54.268173] call_timer_fn+0x30/0x120
[ 54.268524] ? mld_dad_timer_expire+0x50/0x50
[ 54.268942] ? mld_dad_timer_expire+0x50/0x50
[ 54.269364] run_timer_softirq+0x3c5/0x420
[ 54.269760] ? __softirqentry_text_start+0x8/0x8
[ 54.270198] __do_softirq+0xa9/0x245
[ 54.270539] ? __softirqentry_text_start+0x8/0x8
[ 54.270976] do_softirq_own_stack+0x24/0x30
[ 54.271373] </SOFTIRQ>
[ 54.271611] irq_exit+0xad/0xb0
[ 54.271913] smp_apic_timer_interrupt+0x38/0x50
[ 54.272344] apic_timer_interrupt+0x39/0x40
[ 54.272745] EIP: native_safe_halt+0x5/0x10
[ 54.273139] EFLAGS: 00000246 CPU: 0
[ 54.273624] EAX: 00000000 EBX: 00000000 ECX: 00000001 EDX: 00000000
[ 54.274000] ESI: 00000000 EDI: c7b622c0 EBP: c7b5bf10 ESP: c7b5bf10
[ 54.274361] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 54.274687] default_idle+0x1c/0xf0
[ 54.274896] arch_cpu_idle+0xe/0x10
[ 54.275286] default_idle_call+0x1e/0x30
[ 54.275787] do_idle+0x145/0x1c0
[ 54.276092] cpu_startup_entry+0x65/0x70
[ 54.276441] rest_init+0x62/0x70
[ 54.276718] start_kernel+0x3be/0x3d7
[ 54.276974] i386_start_kernel+0x9d/0xa1
[ 54.277260] startup_32_smp+0x16b/0x16d
[ 54.277509] Code: 00 00 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 8b 42 10 f6 42 48 08 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 89 45 c8 89 f0 0f 85 b4 02 00 00 8b
[ 54.279314] EIP: add_grec+0x28/0x440 SS:ESP: 0068:dbcc5ef0
[ 54.279829] CR2: 0000000000000000
[ 54.280143] ---[ end trace 3164b1c0dd7745bc ]---
[ 54.280550] Kernel panic - not syncing: Fatal exception in interrupt
[ 54.281078] Kernel Offset: 0x6000000 from 0xc1000000 (relocation range: 0xc0000000-0xdfbfdfff)
[ 54.281797] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

But since this is lovely qemu the machine isn't as dead as real HW now, so via the monitor
$ sudo nc -U /tmp/autopkgtest-virt-qemu.*/monitor
I took a dump.

Fetching the i386 debug kernel shows me I can load that in crash.

But I'd leave the authoritative look what happened to the kernel Team. So I shared via fileshare.
Please load with debug-kernel of linux-image-4.12.0-11-generic-dbgsym.
I have issues properly loading that in crash as this is a i386 artful on a 64bit KVM, so the format of the kdump generated by qemu is x86_64 and something seems to disagree.
But since I have simple repro steps above I hope you can crash and analyze it the way you want/need.

In case you still want my dump fetch it from https://private-fileshare.canonical.com/~paelzer/ovs-dep8-i386.vmcore.kdump.zlib
I also have raw (non kdump style formats if needed).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The last two known good runs on LP infra also were on the same kernel 4.12.0.11.12, no idea yet what has changed.
Maybe the openvswitch upload itself a while ago - but that worked as well on LP infra ... ?

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1712831

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: artful
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Bot,
this issue does not need the usual logs IMHO, reproducible as instructed "on demand" - setting back to new.

Changed in linux (Ubuntu):
status: Incomplete → New
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1712831

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

ok, every bot is different, this one wants confirmed to be silent.
Setting that ...

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc6

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
tags: added: kernel-key
removed: kernel-da-key
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Installed requested mainline in my autopkgtest environment created as described above.
Booting correctly by default and verified to boot into this by default.

Now kicking the autopkgtest that was hanging before, but still running into the crash.
On the serial console (connect as described above) it pups up very similar than before:

[ 36.480026] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 36.480873] IP: add_grec+0x28/0x440
[ 36.481301] *pdpt = 000000001d25b001 *pde = 0000000000000000
[ 36.481302]
[ 36.482198] Oops: 0000 [#1] SMP
[ 36.482590] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache ppdev kvm_intel kvm irqbypass input_leds 9pnet_virtio 9pnet joydev serio_raw i2c_piix4 parport_pc mac_hid parport qemu_fw_cfg ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse floppy virtio_blk virtio_net pata_acpi
[ 36.486878] CPU: 0 PID: 2173 Comm: systemd-udevd Tainted: G W 4.13.0-041300rc6-generic #201708201831
[ 36.488220] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 36.489414] task: db04eb40 task.stack: dd274000
[ 36.489988] EIP: add_grec+0x28/0x440
[ 36.490470] EFLAGS: 00010202 CPU: 0
[ 36.490912] EAX: 00000000 EBX: dd1bbd20 ECX: 00000006 EDX: dd1bbd20
[ 36.491682] ESI: dd1ef500 EDI: dd1ef500 EBP: dbcc5f30 ESP: dbcc5ef0
[ 36.492450] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 36.493135] CR0: 80050033 CR2: 00000000 CR3: 1e123640 CR4: 000006f0
[ 36.493917] Call Trace:
[ 36.494224] <SOFTIRQ>
[ 36.494526] mld_ifc_timer_expire+0xfe/0x250
[ 36.495066] ? mld_dad_timer_expire+0x50/0x50
[ 36.495616] call_timer_fn+0x30/0x120
[ 36.496070] ? mld_dad_timer_expire+0x50/0x50
[ 36.496612] ? mld_dad_timer_expire+0x50/0x50
[ 36.497144] run_timer_softirq+0x3c5/0x420
[ 36.497685] ? __softirqentry_text_start+0x8/0x8
[ 36.498259] __do_softirq+0xa9/0x245
[ 36.498717] ? __softirqentry_text_start+0x8/0x8
[ 36.499286] do_softirq_own_stack+0x24/0x30
[ 36.499797] </SOFTIRQ>
[ 36.500110] irq_exit+0xad/0xb0
[ 36.500516] smp_apic_timer_interrupt+0x38/0x50
[ 36.501101] apic_timer_interrupt+0x39/0x40
[ 36.501623] EIP: 0x469ba7
[ 36.501948] EFLAGS: 00000292 CPU: 0
[ 36.502385] EAX: bfa2e000 EBX: 004a8bf4 ECX: 00000030 EDX: 0047f836
[ 36.503157] ESI: bfa2e09c EDI: bfa2e000 EBP: bfa2f0b8 ESP: bfa2dff4
[ 36.503923] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 36.504606] Code: 00 00 00 3e 8d 74 26 00 55 89 e5 57 56 53 89 c6 83 ec 34 89 4d e8 65 a1 14 00 00 00 89 45 f0 31 c0 8b 42 10 f6 42 48 08 89 45 cc <8b> 00 c7 45 ec 00 00 00 00 89 45 c8 89 f0 0f 85 b4 02 00 00 8b
[ 36.506481] EIP: add_grec+0x28/0x440 SS:ESP: 0068:dbcc5ef0
[ 36.507009] CR2: 0000000000000000
[ 36.507330] ---[ end trace f0cddc0c09ae15d4 ]---
[ 36.507773] Kernel panic - not syncing: Fatal exception in interrupt
[ 36.508476] Kernel Offset: 0x16000000 from 0xc1000000 (relocation range: 0xc0000000-0xdfbfdfff)
[ 36.509286] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Do you know if this is a regression? Was there a prior kernel version that did not exhibit this bug?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

There could be two potential changes that caused this regression, looking at [1]:

It either is the newer OVS 2.8 or the Kernel 4.12.
There are two good tests with OVS-2.8 in a row.
2017-08-10 16:37:12 UTC
2017-08-14 11:20:32 UTC

Given the current rate of success (5/24) it is rather unlikely that this case had the issue present.
Then on "Published on 2017-08-16" we got kernel 4.12 [2] and since then the issues occur.

So no hard guarantee, but all data points to 4.12 to be the regressing change.

[1]: http://autopkgtest.ubuntu.com/packages/openvswitch/artful/i386
[2]: https://launchpad.net/ubuntu/+source/linux/4.12.0-11.12

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

You can add/remove kernels to the guest by mounting them like via qemu-nbd.
To properly install boot it directly in kvm and then run dpkg.

That way I tested the mainline kernel above and the same way I switched back to [1].
With that I got correctly the older kernel:
  autopkgtest [13:48:11]: testbed running kernel: Linux 4.11.0-13-generic #19-Ubuntu SMP Thu Aug 3 15:13:04 UTC 2017

With that 3/3 runs with OVS 2.8 are good, so I think we can conclude this is a regression due to the kernel 4.12 update.

[1]: https://launchpad.net/ubuntu/+source/linux/4.11.0-13.19/+build/13193104

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can you see if this bug exists in 4.12 final:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12/

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I feel a bit like a retry bot - no offense - but you do realized that I posted steps to reproduce on the initial post already right?

Maybe it helps to document the steps to exchange a kernel.
1. you have an image created like
  autopkgtest-buildvm-ubuntu-cloud -a i386 -r artful -s 10G
2. you mount that to deploy a kernel
  sudo qemu-nbd --connect=/dev/nbd0 ~/work/autopkgtest-artful-i386.img
  sudo mount /dev/nbd0p1 /mnt/test/
  # do your stuff, like wget a kernel to there
  sudo umount /mnt/test
3. you need to install the kernel in a virtual env to get the devices right
  $ kvm -m 512 -hda /home/paelzer/work/autopkgtest-artful-i386.img
  use ubuntu:ubuntu to log in and install via sudo
  then shut the guest down
4. run the test as usual
 $ autopkgtest --apt-upgrade --shell --no-built-binaries openvswitch_2.8.0~git20170809.7aa47a19d-0ubuntu1.dsc -- qemu ~/work/autopkgtest-artful-i386.img

That done show that 4.12 final is just as broken as the others since 4.12

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Download full text (3.8 KiB)

Rechecked today on artful:
OVS: 2.8.0-0ubuntu2
Kernel: 4.13.0-16-generic
Based on the cloud image of today, which is post release.

Still crashing with the simple steps to reproduce I listed above:
[ 37.370757] IP: add_grec+0x28/0x440
[ 37.371002] *pdpt = 000000001dacc001 *pde = 0000000000000000
[ 37.371004]
[ 37.371499] Oops: 0000 [#1] SMP
[ 37.371757] Modules linked in: veth openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c 9p fscache kvm_intel ppdev kvm irqbypass 9pnet_virtio joydev input_leds serio_raw 9pnet parport_pc parport mac_hid i2c_piix4 qemu_fw_cfg ip_tables x_tables autofs4 btrfs xor raid6_pq psmouse virtio_blk virtio_net pata_acpi floppy
[ 37.375047] CPU: 0 PID: 2148 Comm: iperf Tainted: G W 4.13.0-16-generic #19-Ubuntu
[ 37.375872] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[ 37.376775] task: dd9c3a80 task.stack: ddb1e000
[ 37.377206] EIP: add_grec+0x28/0x440
[ 37.377548] EFLAGS: 00010202 CPU: 0
[ 37.377880] EAX: 00000000 EBX: dd9e8540 ECX: 00000006 EDX: dd9e8540
[ 37.378470] ESI: ddacee00 EDI: ddacee00 EBP: db8c5f30 ESP: db8c5ef0
[ 37.379063] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 37.379572] CR0: 80050033 CR2: 00000000 CR3: 1efb1080 CR4: 000006f0
[ 37.380167] Call Trace:
[ 37.380404] <SOFTIRQ>
[ 37.380632] mld_ifc_timer_expire+0xfe/0x250
[ 37.381037] ? mld_dad_timer_expire+0x50/0x50
[ 37.381451] call_timer_fn+0x30/0x120
[ 37.381800] ? mld_dad_timer_expire+0x50/0x50
[ 37.382214] ? mld_dad_timer_expire+0x50/0x50
[ 37.382627] run_timer_softirq+0x3c5/0x420
[ 37.383015] ? __softirqentry_text_start+0x8/0x8
[ 37.383452] __do_softirq+0xa9/0x245
[ 37.383822] ? __softirqentry_text_start+0x8/0x8
[ 37.384260] do_softirq_own_stack+0x24/0x30
[ 37.384655] </SOFTIRQ>
[ 37.384892] irq_exit+0xad/0xb0
[ 37.385191] smp_apic_timer_interrupt+0x38/0x50
[ 37.385620] apic_timer_interrupt+0x39/0x40
[ 37.386017] EIP: __copy_user_ll+0x3e/0xf0
[ 37.386395] EFLAGS: 00010246 CPU: 0
[ 37.386726] EAX: b6700b50 EBX: 00000000 ECX: 000013d0 EDX: dd1d0c00
[ 37.387315] ESI: dd1d30c0 EDI: b6703010 EBP: ddb1fd34 ESP: ddb1fd28
[ 37.387908] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 37.388417] ? tcp_rcv_established+0x3b/0x6e0
[ 37.388831] copyout+0x21/0x30
[ 37.389123] copy_page_to_iter+0x1c6/0x3a0
[ 37.389512] skb_copy_datagram_iter+0x133/0x250
[ 37.389940] tcp_recvmsg+0x219/0xb40
[ 37.390277] inet_recvmsg+0x4a/0xc0
[ 37.390606] sock_recvmsg+0x36/0x40
[ 37.390934] SYSC_recvfrom+0xd0/0x150
[ 37.391280] ? set_next_entity+0xb7/0x2c0
[ 37.391654] ? pick_next_task_fair+0x523/0x5d0
[ 37.392075] SyS_socketcall+0x2c6/0x670
[ 37.392436] ? _copy_to_user+0x21/0x30
[ 37.392789] ? put_timespec64+0x34/0x60
[ 37.393147] ? SyS_clock_gettime+0x5b/0xa0
[ 37.393530] do_fast_syscall_32+0x71/0x150
[ 37.393915] entry_SYSENTER_32+0x4e/0x7c
[ 37.394280] EIP: 0xb7fb0cf9
[ 37.394543] EFLAGS: 00000293 CPU: 0
[ 37.394870] EAX: ffffffda EBX: 0000000a ECX: b5efe25c EDX: 00000000
[ 37.395448] ESI: b7...

Read more...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

ping - still broken as of today in Bionic when updating openvswitch.
Asking for a bump of the force-badtest for now ...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.