Boot crash in xen_send_IPI_one

Bug #1649821 reported by Ross Lagerwall on 2016-12-14
This bug affects 1 person
Affects         Status        Importance  Assigned to
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Xenial          Fix Released  Undecided   Ross Lagerwall

Bug Description

Roughly once every hundred boots, Ubuntu 16.04 running as a Xen HVM guest crashes early in boot with the following stack trace:

  kernel BUG at /build/linux-Ay7j_C/linux-4.4.0/drivers/xen/events/events_base.c:1210!
  invalid opcode: 0000 [#1] SMP
  ...
  RIP: 0010:[<ffffffff814c97c9>] [<ffffffff814c97c9>] xen_send_IPI_one+0x59/0x60
  ...
  Call Trace:
   [<ffffffff8102be9e>] xen_qlock_kick+0xe/0x10
   [<ffffffff810cabc2>] __pv_queued_spin_unlock+0xb2/0xf0
   [<ffffffff810ca6d1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
   [<ffffffff81052936>] ? check_tsc_warp+0x76/0x150
   [<ffffffff81052aa6>] check_tsc_sync_source+0x96/0x160
   [<ffffffff81051e28>] native_cpu_up+0x3d8/0x9f0
   [<ffffffff8102b315>] xen_hvm_cpu_up+0x35/0x80
   [<ffffffff8108198c>] _cpu_up+0x13c/0x180
   [<ffffffff81081a4a>] cpu_up+0x7a/0xa0
   [<ffffffff81f80dfc>] smp_init+0x7f/0x81
   [<ffffffff81f5a121>] kernel_init_freeable+0xef/0x212
   [<ffffffff81817f30>] ? rest_init+0x80/0x80
   [<ffffffff81817f3e>] kernel_init+0xe/0xe0
   [<ffffffff8182488f>] ret_from_fork+0x3f/0x70
   [<ffffffff81817f30>] ? rest_init+0x80/0x80
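
For anyone comparing this trace against another report, a minimal Python sketch for splitting trace lines into fields follows. The regular expression and the function name `parse_trace_line` are illustrative assumptions, not part of any kernel tooling. Entries prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so the sketch flags them as unreliable.

```python
import re

# Matches entries such as "[<ffffffff8102be9e>] xen_qlock_kick+0xe/0x10".
# A "?" before the symbol marks an entry the unwinder could not verify.
TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_trace_line(line):
    """Return a dict describing one trace entry, or None if the line has none."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    return {
        "address": int(match.group("addr"), 16),
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "reliable": match.group("unreliable") is None,
    }


if __name__ == "__main__":
    sample = [
        "RIP: 0010:[<ffffffff814c97c9>] [<ffffffff814c97c9>] xen_send_IPI_one+0x59/0x60",
        " [<ffffffff8102be9e>] xen_qlock_kick+0xe/0x10",
        " [<ffffffff810ca6d1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20",
    ]
    for text in sample:
        print(parse_trace_line(text))
```

Comparing only the reliable symbols is usually enough to tell whether two reports hit the same code path.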

This is fixed by the following commit:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/xen?id=707e59ba494372a90d245f18b0c78982caa88e48

Unfortunately this wasn't backported to Linux 4.4. Can you please include this in the next Ubuntu 16.04 kernel release?
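
The commit itself is not quoted in this report, so the following is only a toy Python model of the failure mode the trace suggests, not the kernel code. It assumes the BUG fires because a spinlock kick is sent to a CPU whose kick IRQ has not been set up yet; the class `KickerTable`, its methods, and the `-1` sentinel are all hypothetical names chosen for illustration.

```python
UNINITIALIZED_IRQ = -1  # assumed sentinel for "no IRQ bound yet"


class KickerTable:
    """Toy per-CPU table of spinlock kick IRQs (hypothetical model)."""

    def __init__(self, num_cpus):
        # Every CPU starts without a kick IRQ until it is brought up.
        self.irq = [UNINITIALIZED_IRQ] * num_cpus

    def send_ipi(self, cpu):
        irq = self.irq[cpu]
        if irq < 0:
            # Stands in for the kernel BUG reported in the trace.
            raise RuntimeError("BUG: IPI sent to CPU %d with no IRQ" % cpu)
        return irq

    def kick_unguarded(self, cpu):
        # Kicks unconditionally, so it can race with CPU bring-up.
        return self.send_ipi(cpu)

    def kick_guarded(self, cpu):
        # Skips the kick while the target CPU has no IRQ yet.
        if self.irq[cpu] == UNINITIALIZED_IRQ:
            return None
        return self.send_ipi(cpu)


if __name__ == "__main__":
    table = KickerTable(num_cpus=2)
    table.irq[0] = 53  # CPU 0 is up; CPU 1 is still booting
    print(table.kick_guarded(1))
    try:
        table.kick_unguarded(1)
    except RuntimeError as err:
        print(err)
```

The point of the model is only that an early-return check on the target CPU's IRQ turns a fatal assertion into a skipped kick; whether that matches the real patch line for line should be checked against the commit.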

Thanks
Ross
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Dec 14 11:48 seq
 crw-rw---- 1 root audio 116, 33 Dec 14 11:48 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
HibernationDevice: RESUME=UUID=637c02d7-6135-4ecc-8273-b014d21cf217
IwConfig: Error: [Errno 2] No such file or directory
Lsusb:
 Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
 Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: Xen HVM domU
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-53-generic root=UUID=6748a82f-b075-401e-ac74-0ebe865003f2 ro console=hvc0 console=tty0
ProcVersionSignature: Ubuntu 4.4.0-53.74-generic 4.4.30
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-53-generic N/A
 linux-backports-modules-4.4.0-53-generic N/A
 linux-firmware 1.157.5
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.4.0-53-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 12/07/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.7.1-xs131890-d
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.7.1-xs131890-d:bd12/07/2016:svnXen:pnHVMdomU:pvr4.7.1-xs131890-d:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.7.1-xs131890-d
dmi.sys.vendor: Xen
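
The `ProcVersionSignature` field above packs the Ubuntu package version, the kernel flavour, and the upstream stable version into one string. A small Python sketch for splitting it is shown below; the function name `parse_version_signature` and the regular expression are assumptions that only cover signatures shaped like the one in this report.

```python
import re

# Expected shape: "<distro> <package version>-<flavour> <upstream version>",
# e.g. "Ubuntu 4.4.0-53.74-generic 4.4.30".
SIGNATURE_RE = re.compile(
    r"^(?P<distro>\S+) "
    r"(?P<package>\d+\.\d+\.\d+-\d+\.\d+)-(?P<flavour>\S+) "
    r"(?P<upstream>\d+\.\d+\.\d+)$"
)


def parse_version_signature(signature):
    """Split a version signature into its parts, or return None if it differs."""
    match = SIGNATURE_RE.match(signature.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    print(parse_version_signature("Ubuntu 4.4.0-53.74-generic 4.4.30"))
```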


This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1649821

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ross Lagerwall (rosslagerwall) wrote :

FWIW, here is the full boot log (it crashes very early):
[ 0.014281] Freeing SMP alternatives memory: 28K (ffffffff820b2000 - ffffffff820b9000)
[ 0.021560] ftrace: allocating 31878 entries in 125 pages
[ 0.248098] smpboot: Max logical packages: 15
[ 0.248123] smpboot: APIC(0) Converting physical 0 to logical package 0
[ 0.248140] smpboot: APIC(2) Converting physical 2 to logical package 1
[ 0.248671] x2apic: IRQ remapping doesn't support X2APIC mode
[ 0.249178] Switched APIC routing to physical flat.
[ 0.251027] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[ 0.296235] clocksource: xen: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.296440] installing Xen timer for CPU 0
[ 0.296824] smpboot: CPU0: Intel(R) Xeon(R) CPU X3430 @ 2.40GHz (family: 0x6, model: 0x1e, stepping: 0x5)
[ 0.296897] cpu 0 spinlock event irq 53
[ 0.296916] Performance Events: unsupported p6 CPU model 30 no PMU driver, software events only.
[ 0.297819] NMI watchdog: disabled (cpu0): hardware events not enabled
[ 0.297841] NMI watchdog: Shutting down hard lockup detector on all cpus
[ 0.297919] installing Xen timer for CPU 1
[ 0.297997] x86: Booting SMP configuration:
[ 0.298010] .... node #0, CPUs: #1
[ 0.008000] calibrate_delay_direct() dropping max bogoMips estimate 3 = 11283848
[ 0.384000] ------------[ cut here ]------------
[ 0.384000] kernel BUG at /build/linux-Ay7j_C/linux-4.4.0/drivers/xen/events/events_base.c:1210!
[ 0.384000] invalid opcode: 0000 [#1] SMP
[ 0.384000] Modules linked in:
[ 0.384000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-21-generic #37-Ubuntu
[ 0.384000] Hardware name: Xen HVM domU, BIOS 4.6.1-xs124820 04/20/2016
[ 0.384000] task: ffff88003d758000 ti: ffff88003d760000 task.ti: ffff88003d760000
[ 0.384000] RIP: 0010:[<ffffffff814c97c9>] [<ffffffff814c97c9>] xen_send_IPI_one+0x59/0x60
[ 0.384000] RSP: 0000:ffff88003d763d30 EFLAGS: 00010086
[ 0.384000] RAX: ffff88003da522fc RBX: 00000005528d31c0 RCX: 0000000000000001
[ 0.384000] RDX: ffff88003da57840 RSI: 0000000000000003 RDI: 00000000ffffffff
[ 0.384000] RBP: ffff88003d763d30 R08: 0000000000000100 R09: ffff88003f7c7900
[ 0.384000] R10: ffff88003da4a080 R11: ffff88003da4a060 R12: 000000000000a90d
[ 0.384000] R13: 000000055285a88a R14: 0000000000000001 R15: ffffffff820d3db0
[ 0.384000] FS: 0000000000000000(0000) GS:ffff88003da00000(0000) knlGS:0000000000000000
[ 0.384000] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.384000] CR2: ffff880002200000 CR3: 0000000001e0a000 CR4: 00000000000006f0
[ 0.384000] Stack:
[ 0.384000] ffff88003d763d40 ffffffff8102be9e ffff88003d763de0 ffffffff810cabc2
[ 0.384000] ffffffff810ca6d1 ffff88003da4a060 ffff88003da4a080 0000000000000001
[ 0.384000] ffff88003da4a080 ffffffff820d3db0 0000000000080000 0000000500000000
[ 0.384000] Call Trace:
[ 0.384000] [<ffffffff8102be9e>] xen_qlock_kick+0xe/0x10
[ 0.384000] [<ffffffff810cabc2>] __pv_queued_spin_unlock+0xb2/0xf0
[ 0.384000] [<ffffffff810ca6d1>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[ 0.38400...
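
Note that this log was captured on 4.4.0-21-generic, while the apport data in the description comes from 4.4.0-53-generic. A rough Python sketch for pulling the BUG location and the running kernel release out of an oops header is given below. The names `summarize_oops`, `BUG_RE`, and `CPU_RE` are illustrative, and the pattern only handles the "Not tainted" form seen here.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
# Only the "Not tainted" form seen in this log is handled.
CPU_RE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Not tainted (?P<release>\S+)"
)


def summarize_oops(lines):
    """Collect the BUG location and kernel release from oops header lines."""
    summary = {}
    for line in lines:
        bug = BUG_RE.search(line)
        if bug:
            summary["path"] = bug.group("path")
            summary["line"] = int(bug.group("line"))
        cpu = CPU_RE.search(line)
        if cpu:
            summary["cpu"] = int(cpu.group("cpu"))
            summary["pid"] = int(cpu.group("pid"))
            summary["comm"] = cpu.group("comm")
            summary["release"] = cpu.group("release")
    return summary


if __name__ == "__main__":
    log = [
        "[ 0.384000] kernel BUG at /build/linux-Ay7j_C/linux-4.4.0/drivers/xen/events/events_base.c:1210!",
        "[ 0.384000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-21-generic #37-Ubuntu",
    ]
    print(summarize_oops(log))
```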


apport information

tags: added: apport-collected xenial
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Ross Lagerwall (rosslagerwall) wrote :

I've run the required command and set the status to Confirmed. The logs won't contain much useful information because the crash happens very early in boot.

Anyway, the following commit should fix the issue:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/x86/xen?id=707e59ba494372a90d245f18b0c78982caa88e48

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Ross Lagerwall (rosslagerwall)
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Ross Lagerwall (rosslagerwall) wrote :

> It seems like an obvious fix. Both call traces look identical. I can
> certainly produce a test kernel and get the reporter to produce some
> results.

Just to be clear, both call traces look identical because I wrote that patch to fix this particular issue, but unfortunately didn't get it backported to the stable releases.

I'm not sure why we haven't seen it on trusty. Perhaps some initialization ordering or kernel config options changed which exposed the race.

Luis Henriques (henrix) on 2016-12-15
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Ross Lagerwall (rosslagerwall) wrote :

I'm not able to test this for a couple of weeks due to the holiday period.

Po-Hsu Lin (cypressyew) wrote :

Hello Ross,
could you help us verify this fix once you're back?
Thanks!

Ross Lagerwall (rosslagerwall) wrote :

Yeah I will get it done now.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-59.80
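
To check whether an installed kernel package is at or past the version named above, a simple numeric comparison is enough for strings of this shape. The Python sketch below is an assumption-laden shortcut, not full Debian version ordering (it ignores epochs, tildes, and letters), and the names `parse_package_version` and `has_fix` are hypothetical.

```python
def parse_package_version(version):
    """Turn a string like '4.4.0-59.80' into a tuple of integers."""
    upstream, _, revision = version.partition("-")
    fields = upstream.split(".") + revision.split(".")
    return tuple(int(field) for field in fields)


def has_fix(installed, fixed="4.4.0-59.80"):
    """True when the installed package version is at least the fixed one."""
    return parse_package_version(installed) >= parse_package_version(fixed)


if __name__ == "__main__":
    for candidate in ("4.4.0-53.74", "4.4.0-59.80"):
        print(candidate, has_fix(candidate))
```

For anything beyond this simple form, `dpkg --compare-versions` is the safer tool.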

---------------
linux (4.4.0-59.80) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1654282

  * [2.1.1] MAAS has nvme0n1 set as boot disk, curtin fails (LP: #1651602)
    - (fix) nvme: only require 1 interrupt vector, not 2+

linux (4.4.0-58.79) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1651402

  * Support ACPI probe for IIO sensor drivers from ST Micro (LP: #1650123)
    - SAUCE: iio: st_sensors: match sensors using ACPI handle
    - SAUCE: iio: st_accel: Support sensor i2c probe using acpi
    - SAUCE: iio: st_pressure: Support i2c probe using acpi
    - [Config] CONFIG_HTS221=m, CONFIG_HTS221_I2C=m, CONFIG_HTS221_SPI=m

  * Fix channel data parsing in ST Micro sensor IIO drivers (LP: #1650189)
    - SAUCE: iio: common: st_sensors: fix channel data parsing

  * ST Micro lng2dm 3-axis "femto" accelerometer support (LP: #1650112)
    - SAUCE: iio: st-accel: add support for lis2dh12
    - SAUCE: iio: st_sensors: support active-low interrupts
    - SAUCE: iio: accel: Add support for the h3lis331dl accelerometer
    - SAUCE: iio: st_sensors: verify interrupt event to status
    - SAUCE: iio: st_sensors: support open drain mode
    - SAUCE: iio:st_sensors: fix power regulator usage
    - SAUCE: iio: st_sensors: switch to a threaded interrupt
    - SAUCE: iio: accel: st_accel: Add lis3l02dq support
    - SAUCE: iio: st_sensors: fix scale configuration for h3lis331dl
    - SAUCE: iio: accel: st_accel: add support to lng2dm
    - SAUCE: iio: accel: st_accel: inline per-sensor data
    - SAUCE: Documentation: dt: iio: accel: add lng2dm sensor device binding

  * ST Micro hts221 relative humidity sensor support (LP: #1650116)
    - SAUCE: iio: humidity: add support to hts221 rh/temp combo device
    - SAUCE: Documentation: dt: iio: humidity: add hts221 sensor device binding
    - SAUCE: iio: humidity: remove
    - SAUCE: iio: humidity: Support acpi probe for hts211

  * crypto : tolerate new crypto hardware for z Systems (LP: #1644557)
    - s390/zcrypt: Introduce CEX6 toleration

  * Acer, Inc ID 5986:055a is useless after 14.04.2 installed. (LP: #1433906)
    - uvcvideo: uvc_scan_fallback() for webcams with broken chain

  * vmxnet3 driver could causes kernel panic with v4.4 if LRO enabled.
    (LP: #1650635)
    - vmxnet3: segCnt can be 1 for LRO packets

  * system freeze when swapping to encrypted swap partition (LP: #1647400)
    - mm, oom: rework oom detection
    - mm: throttle on IO only when there are too many dirty and writeback pages

  * Kernel Fixes to get TCMU File Backed Optical to work (LP: #1646204)
    - target/user: Use sense_reason_t in tcmu_queue_cmd_ring
    - target/user: Return an error if cmd data size is too large
    - target/user: Fix comments to not refer to data ring
    - SAUCE: (no-up) target/user: Fix use-after-free of tcmu_cmds if they are
      expired

  * CVE-2016-9756
    - KVM: x86: drop error recovery in em_jmp_far and em_ret_far

  * Dell Precision 5520 & 3520 freezes at login screent (LP: #1650054)
    - ACPI / blacklist: add _REV quirks for Dell Precision 5520 and 3520

  * CVE-2016-979...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Ross Lagerwall (rosslagerwall) wrote :

I have tested the proposed kernel and it works for me. Thanks!
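
Because the crash only showed up roughly once per hundred boots, a handful of clean boots says little about whether a kernel is fixed. As a back-of-the-envelope Python sketch, assuming boots are independent and the 1% rate from the description is accurate (both are assumptions), the number of consecutive clean boots needed for a given confidence can be estimated as follows. The function name `clean_boots_needed` is illustrative.

```python
import math


def clean_boots_needed(crash_rate, confidence):
    """Smallest n such that (1 - crash_rate)**n <= 1 - confidence.

    In words: how many crash-free boots in a row would be unlikely
    (at the given confidence) if the bug were still present.
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


if __name__ == "__main__":
    print(clean_boots_needed(crash_rate=0.01, confidence=0.95))  # prints 299
```

So with a 1-in-100 crash rate, about 300 clean boots are needed before a still-broken kernel becomes an unlikely explanation at the 95% level.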

tags: added: verification-done-xenial
removed: verification-needed-xenial