Hard lockup after 4 hours of uptime

Bug #1668356 reported by Olivier Louvignes
This bug affects 3 people
Affects          Status   Importance  Assigned to  Milestone
linux (Ubuntu)   Triaged  High        Unassigned
  Xenial         Triaged  High        Unassigned

Bug Description

We have recently deployed Intel NUC6i5 devices in our stores as POS displays and are encountering unexplained freezes on several of them.

Strangely, every affected device freezes exactly 4 hours after boot.

We have seen this exact issue on more than 20 devices (out of over 100), with parts from different batches, and every one of them froze exactly 4 hours after boot. It is not reliably reproducible, though: a given device does not freeze every day. Some devices play videos with mpv, while others run Chromium. They run non-stop, but all of them reboot daily at 05:05.
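The 4-hour pattern can be checked against logs with simple arithmetic: with a daily reboot at 05:05, a freeze exactly 4 hours after boot should show up around 09:05. The following is a minimal sketch, assuming you can recover the boot time and the timestamp of the last log line written before each freeze (for example from a persistent journal). The helper names and the 2-minute tolerance are hypothetical, not part of any Ubuntu tool.

```python
from datetime import datetime, timedelta

# Hypothetical helper: uptime at the last log line written before the freeze.
def uptime_at_freeze(boot: datetime, last_log: datetime) -> timedelta:
    if last_log < boot:
        raise ValueError("last log entry precedes boot time")
    return last_log - boot

# Hypothetical check: did the freeze happen within `tolerance` of the 4-hour mark?
def matches_four_hour_pattern(boot: datetime, last_log: datetime,
                              tolerance: timedelta = timedelta(minutes=2)) -> bool:
    return abs(uptime_at_freeze(boot, last_log) - timedelta(hours=4)) <= tolerance

# Example: daily reboot at 05:05, so the expected freeze time is 09:05.
boot = datetime(2017, 2, 27, 5, 5)
print(boot + timedelta(hours=4))  # 2017-02-27 09:05:00
print(matches_four_hour_pattern(boot, datetime(2017, 2, 27, 9, 4, 30)))  # True
```

Because the last log line is usually written slightly before the actual hang, a small tolerance below the 4-hour mark is expected.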

This looks like it might be related to https://lkml.org/lkml/2015/6/11/787

That issue seems to have been fixed and backported already: https://lkml.org/lkml/2015/10/17/259

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-64-generic 4.4.0-64.85
ProcVersionSignature: Ubuntu 4.4.0-64.85-generic 4.4.44
Uname: Linux 4.4.0-64-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.4
Architecture: amd64
Date: Mon Feb 27 21:06:44 2017
InstallationDate: Installed on 2016-06-08 (264 days ago)
InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Release amd64 (20160420.1)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 8087:0a2b Intel Corp.
 Bus 001 Device 002: ID 067b:2303 Prolific Technology, Inc. PL2303 Serial Port
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-64-generic.efi.signed root=UUID=1f60f9c9-bbf1-45df-bdc3-9b4da883839e ro quiet splash net.ifnames=0 vt.handoff=7
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.board.name: NUC6i5SYB

Olivier Louvignes (olouvignes) wrote :

DHCPREQUEST of 10.34.242.77 on eth0 to 10.32.65.65 port 67 (xid=0x30fdf923)
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810fa7e3>] timecounter_read+0x13/0x60
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: rfcomm ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc aufs pl2303 usbserial bnep arc4 snd_hda_codec_hdmi snd_soc_skl snd_soc_skl_ipc snd_hda_ext_core snd_soc_sst_ipc snd_hda_codec_realtek snd_soc_sst_dsp snd_hda_codec_generic nls_iso8859_1 snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine dw_dmac_core snd_hda_intel iwlmvm snd_hda_codec snd_hda_core intel_rapl 8250_dw snd_hwdep mac80211 x86_pkg_temp_thermal intel_powerclamp coretemp snd_pcm kvm_intel kvm snd_seq_midi snd_seq_midi_event irqbypass crct10dif_pclmul crc32_pclmul iwlwifi ghash_clmulni_intel snd_rawmidi aesni_intel snd_seq aes_x86_64 lrw gf128mul glue_helper cfg80211 snd_seq_device ablk_helper cryptd snd_timer snd soundcore idma64
 virt_dma shpchp ir_lirc_codec ir_xmp_decoder lirc_dev ir_mce_kbd_decoder ir_sharp_decoder intel_lpss_pci ir_sanyo_decoder btusb ir_sony_decoder hci_uart btrtl ir_jvc_decoder ir_rc6_decoder btbcm btqca ir_rc5_decoder btintel ir_nec_decoder bluetooth mei_me rc_rc6_mce ite_cir rc_core intel_lpss_acpi intel_lpss mei acpi_pad mac_hid acpi_als kfifo_buf industrialio ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables parport_pc ppdev sunrpc lp parport autofs4 i915_bpo intel_ips
 i2c_algo_bit drm_kms_helper syscopyarea e1000e sysfillrect sysimgblt fb_sys_fops ptp sdhci_pci ahci drm pps_core sdhci libahci video pinctrl_sunrisepoint i2c_hid pinctrl_intel hid fjes
CPU: 3 PID: 15471 Comm: kworker/3:0 Not tainted 4.4.0-64-generic #85-Ubuntu
Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
Workqueue: events e1000e_systim_overflow_work [e1000e]
task: ffff880031f32d00 ti: ffff8800350e8000 task.ti: ffff8800350e8000
RIP: 0010:[<ffffffff810fa7e3>] [<ffffffff810fa7e3>] timecounter_read+0x13/0x60
RSP: 0018:ffff8800350ebdb0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff8800353ab7a0 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffff8800350ebdf8 RDI: 0000000000000000
RBP: ffff8800350ebdb8 R08: ffff88016ed965c0 R09: 0000000000000000
R10: 000000010035ffff R11: 0000000000000001 R12: ffff8800353ab780
R13: ffff8800350ebdf8 R14: 0000000000000246 R15: ffff8800353ab6d0
FS: 0000000000000000(0000) GS:ffff88016ed80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000002e0a000 CR4: 00000000003406e0
Stack:
 ffff8800353ab7d0 ffff8800350ebde8 ffffffffc014d36e ffff8800353ab6d0
 ffff88016ed965c0 ffff88016ed9af00 00000000000000c0 ffff8800350ebe18
 ffffffffc014d521 ffffffff81837e26 ffff88016ed9af00 00000000a91221c0
Call Trace:
 [<ffffffffc01...

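The most informative header lines in this oops are the faulting instruction (`IP:`) and the `Workqueue:` line: together they show the NULL pointer dereference happened in `timecounter_read` while a kworker was running the e1000e driver's `e1000e_systim_overflow_work` work item. Below is a minimal triage sketch for pulling those fields out of a saved oops, assuming the log is plain text in the format shown above; `summarize_oops` is a hypothetical helper, not a kernel or Ubuntu tool.

```python
import re

# Hypothetical helper: extract the key header fields from a kernel oops text.
def summarize_oops(text: str) -> dict:
    summary = {}
    # "BUG: unable to handle kernel NULL pointer dereference at (null)"
    m = re.search(r"^BUG: (.+)$", text, re.MULTILINE)
    if m:
        summary["bug"] = m.group(1)
    # "IP: [<ffffffff810fa7e3>] timecounter_read+0x13/0x60" (symbol+offset/size)
    m = re.search(r"^IP: \[<[0-9a-f]+>\] (\S+)", text, re.MULTILINE)
    if m:
        summary["faulting_ip"] = m.group(1)
    # "CPU: 3 PID: 15471 Comm: kworker/3:0 ..." (task that was running)
    m = re.search(r"^CPU: \d+ PID: \d+ Comm: (\S+)", text, re.MULTILINE)
    if m:
        summary["comm"] = m.group(1)
    # "Workqueue: events e1000e_systim_overflow_work [e1000e]"
    m = re.search(r"^Workqueue: (\S+) (\S+)(?: \[(\S+)\])?", text, re.MULTILINE)
    if m:
        summary["workqueue"] = m.group(1)
        summary["work_fn"] = m.group(2)
        summary["module"] = m.group(3)  # None when no [module] tag is present
    return summary

oops = """BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810fa7e3>] timecounter_read+0x13/0x60
CPU: 3 PID: 15471 Comm: kworker/3:0 Not tainted 4.4.0-64-generic #85-Ubuntu
Workqueue: events e1000e_systim_overflow_work [e1000e]
"""

print(summarize_oops(oops)["work_fn"])  # e1000e_systim_overflow_work
```

Running it over oopses collected from several devices is one way to check whether they all share the same `work_fn` and `faulting_ip`.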

Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

The lkml thread you referenced in the bug description was for commit 37b12910dd11d9ab969f2c310dc9160b7f3e3405. That commit landed upstream in v4.3-rc1, so it is already in the 4.4-based Xenial kernel.

Did this issue start happening after a recent upgrade, or after applying updates?

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? See https://wiki.ubuntu.com/KernelMainlineBuilds for instructions. Please test the latest v4.10 kernel [0].

If this bug is fixed in the mainline kernel, please add the tag: 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10

Olivier Louvignes (olouvignes) wrote :

It's hard to tell regarding updates. I'd say it started in early January (though we did not pay much attention at first). Our players have unattended security upgrades enabled, so a kernel upgrade that landed in January might have introduced a regression. As far as I know, we never encountered this issue in 2016.

Do you think 4.10 would be safe to use in production? I'm a bit afraid of (further) breaking production machines. Thanks!

Jay (jayanth-k) wrote :

Hi,

We can confirm that this issue still persists as of this update: 4.4.0-103-generic #126-Ubuntu SMP Mon Dec 4 16:23:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

It only happens after a soft reboot, i.e. 'sudo reboot'.

We are not sure whether there is a workaround for it.
