ThunderX: soft lockup in cursor_timer_handler()

Bug #1574814 reported by dann frazier on 2016-04-25
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned
Xenial
Undecided
Unassigned

Bug Description

I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
  http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso

During boot I observed the following lockup on the serial console:

[ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
[ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
[ 84.922718] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio_cavium(E)
[ 84.922749]
[ 84.922754] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G E 4.4.0-21-generic #37-Ubuntu
[ 84.922757] Hardware name: Cavium ThunderX CN88XX board (DT)
[ 84.922761] task: ffff801f6c9d4100 ti: ffff801f6c9e8000 task.ti: ffff801f6c9e8000
[ 84.922771] PC is at cursor_timer_handler+0x30/0x58
[ 84.922775] LR is at cursor_timer_handler+0x30/0x58
[ 84.922778] pc : [<ffff8000004ec4f0>] lr : [<ffff8000004ec4f0>] pstate: 00400145
[ 84.922781] sp : ffff801f6c9ebc20
[ 84.922784] x29: ffff801f6c9ebc20 x28: ffff8000f94398d8
[ 84.922789] x27: ffff801f6c9ebd00 x26: ffff801f7b3bebb8
[ 84.922793] x25: ffff801f6c9e8000 x24: ffff800000e5ec00
[ 84.922798] x23: ffff801f667d9800 x22: ffff8000004ec4c0
[ 84.922802] x21: 0000000000000100 x20: ffff8000f94398d8
[ 84.922807] x19: ffff8000f9439800 x18: 0000ffffc76a5358
[ 84.922811] x17: 0000ffff97bbd2a8 x16: ffff8000002a5040
[ 84.922816] x15: 000000003e4cf1e0 x14: 0000000000000008
[ 84.922820] x13: 0000000000000000 x12: 003d090000000000
[ 84.922824] x11: 00000000003d0900 x10: ffff80000090f200
[ 84.922829] x9 : 00003d0900000000 x8 : 000000000000000e
[ 84.922833] x7 : ffff801f7b3c5008 x6 : 00000000ffffffff
[ 84.922837] x5 : 0000000000000000 x4 : 0000000000000001
[ 84.922842] x3 : 0000000000000000 x2 : ffff801f6c899e05
[ 84.922846] x1 : ffff801f667d99e0 x0 : 0000000000000000
[ 84.922850]
[ 101.008387] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
[ 101.180375] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
[ 101.342677] random: nonblocking pool is initialized

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

The traceback isn't always the same - here it was stuck in mod_timer():

[ 180.912269] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [swapper/9:0]
[ 180.912293] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio_cavium(E)
[ 180.912294]
[ 180.912298] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G E 4.4.0-21-generic #37-Ubuntu
[ 180.912299] Hardware name: Cavium ThunderX CN88XX board (DT)
[ 180.912302] task: ffff801f6c9d0000 ti: ffff801f6c9cc000 task.ti: ffff801f6c9cc000
[ 180.912308] PC is at _raw_spin_unlock_irqrestore+0x2c/0x38
[ 180.912312] LR is at mod_timer+0x12c/0x270
[ 180.912314] pc : [<ffff8000008f5dc4>] lr : [<ffff800000131d2c>] pstate: 60400145
[ 180.912315] sp : ffff801f6c9cfba0
[ 180.912318] x29: ffff801f6c9cfba0 x28: ffff801f662982d8
[ 180.912320] x27: ffff801f6c9cfd00 x26: ffff801f7b369bb8
[ 180.912323] x25: ffff800000d69b80 x24: 0000000000000000
[ 180.912325] x23: ffff801f7b369b80 x22: ffff800000d8a000
[ 180.912328] x21: ffff801f7b369b80 x20: ffff801f7b369b80
[ 180.912330] x19: 0000000000000140 x18: 0000000000000069
[ 180.912332] x17: 0000000000468078 x16: ffff8000000c5b98
[ 180.912335] x15: 0000ffff7b779000 x14: 0000000000000008
[ 180.912337] x13: 0000000000000000 x12: 003d090000000000
[ 180.912340] x11: 00000000003d0900 x10: ffff80000090f200
[ 180.912342] x9 : 00003d0900000000 x8 : 000000000000000d
[ 180.912344] x7 : ffff801f7b3b4008 x6 : 0000000000000000
[ 180.912347] x5 : 0000000000000000 x4 : 0000000000000001
[ 180.912349] x3 : ffff801f6c9d0000 x2 : 00000000ffff8afb
[ 180.912351] x1 : 0000000000000140 x0 : 000000000000927c
[ 180.912352]

dann frazier (dannf) wrote :

Note that, while I haven't been able to reproduce this by booting into Ubuntu from disk, I can easily reproduce it by booting d-i from disk - specifically, the linux and initrd.gz here:

http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/ubuntu-installer/arm64/

dann frazier (dannf) wrote :

I have been able to reproduce a soft lockup using a pure upstream kernel (post-4.5 git, after the thunderx pcie drivers were merged) and the d-i initrd mentioned above. I've attached a log of one of those boots.

I used the Ubuntu config with the following exceptions:

CONFIG_DRM=y and CONFIG_DRM_AST=y - because, so far, I've only seen this with vga enabled.
CONFIG_KVM=n - because some commits fail to boot due to a panic in armpmu code, and KVM won't build w/ ARMPMU diabled.
(and just pressed enter during make oldconfig for everything else).

I could not reproduce with 4.6.0-rc1, so I thought this would be a good bisect candidate. Unfortunately, that went off into the weeds for a couple of reasons - vga doesn't work with some changesets, so I had to skip those, and I also suspect that this issue does not appear in every boot of a broken kernel, leading to false positive results.

It can be triggered 100% by running 'tcpdump -I ethX'.

dann frazier (dannf) wrote :

On Fri, Apr 29, 2016 at 2:06 AM, Ming Lei <email address hidden> wrote:
> It can be triggered 100% by running 'tcpdump -I ethX'.

Thanks Ming. I let that run for a few hours, but was unable to
reproduce. Are you seeing the same traceback along w the softlockup
msg?

   -dann

Ming Lei (tom-leiming) wrote :
Download full text (3.1 KiB)

On Tue, May 3, 2016 at 10:35 AM, dann frazier
<email address hidden> wrote:
> On Fri, Apr 29, 2016 at 2:06 AM, Ming Lei <email address hidden> wrote:
>> It can be triggered 100% by running 'tcpdump -I ethX'.
>
> Thanks Ming. I let that run for a few hours, but was unable to
> reproduce. Are you seeing the same traceback along w the softlockup
> msg?

Yes, it can be always triggered by 'tcpdump -I eth7' on cvm13.

>
> -dann
>
> --
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1574814
>
> Title:
> ThunderX: soft lockup in cursor_timer_handler() Edit
>
> Status in linux package in Ubuntu:
> Confirmed
>
> Bug description:
> I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
> http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso
>
> During boot I observed the following lockup on the serial console:
>
> [ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
> [ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
> [ 84.922718] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio_cavium(E)
> [ 84.922749]
> [ 84.922754] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G E 4.4.0-21-generic #37-Ubuntu
> [ 84.922757] Hardware name: Cavium ThunderX CN88XX board (DT)
> [ 84.922761] task: ffff801f6c9d4100 ti: ffff801f6c9e8000 task.ti: ffff801f6c9e8000
> [ 84.922771] PC is at cursor_timer_handler+0x30/0x58
> [ 84.922775] LR is at cursor_timer_handler+0x30/0x58
> [ 84.922778] pc : [<ffff8000004ec4f0>] lr : [<ffff8000004ec4f0>] pstate: 00400145
> [ 84.922781] sp : ffff801f6c9ebc20
> [ 84.922784] x29: ffff801f6c9ebc20 x28: ffff8000f94398d8
> [ 84.922789] x27: ffff801f6c9ebd00 x26: ffff801f7b3bebb8
> [ 84.922793] x25: ffff801f6c9e8000 x24: ffff800000e5ec00
> [ 84.922798] x23: ffff801f667d9800 x22: ffff8000004ec4c0
> [ 84.922802] x21: 0000000000000100 x20: ffff8000f94398d8
> [ 84.922807] x19: ffff8000f9439800 x18: 0000ffffc76a5358
> [ 84.922811] x17: 0000ffff97bbd2a8 x16: ffff8000002a5040
> [ 84.922816] x15: 000000003e4cf1e0 x14: 0000000000000008
> [ 84.922820] x13: 0000000000000000 x12: 003d090000000000
> [ 84.922824] x11: 00000000003d0900 x10: ffff80000090f200
> [ 84.922829] x9 : 00003d0900000000 x8 : 000000000000000e
> [ 84.922833] x7 : ffff801f7b3c5008 x6 : 00000000ffffffff
> [ 84.922837] x5 : 0000000000000000 x4 : 0000000000000001
> [ 84.922842] x3 : 0000000000000000 x2 : ffff801f6c899e05
> [ 84.922846] x1 : ffff801f667d99e0 x0 : 0000000000000000
> [ 84.922850]
> [ 101.008387] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
> [ 101.180375] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
> [ 101.342677] random: nonblocking pool is initialized
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1574814/+s...

Read more...

Ming,
The "-I" option of tcpdump is monitoring mode typically applicable only to wifi interfaces. So even if you run it on Thunder's NIC interfaces it will return saying that this is not supported.

And BTW, what is eth7 ?
From 16.04 release all interfaces are coming up as "enP2xxxxx"

Download full text (3.5 KiB)

On Tue, May 3, 2016 at 1:14 PM, Radha Mohan Chintakuntla
<email address hidden> wrote:
> Ming,
> The "-I" option of tcpdump is monitoring mode typically applicable only to wifi interfaces. So even if you run it on Thunder's NIC interfaces it will return saying that this is not supported.
>

Even without the '-I', the issue still can be triggered.

> And BTW, what is eth7 ?
> >From 16.04 release all interfaces are coming up as "enP2xxxxx"

Yeah, maybe this box isn't shipped with 16.04, and its kernel is

ubuntu@arm64:~$ uname -a
Linux arm64 4.2.0 #2 SMP Mon Dec 14 04:01:19 CST 2015 aarch64 aarch64
aarch64 GNU/Linux

but the log is very similar with Dann's report:

[337056.617650] PC is at _raw_spin_unlock_irqrestore+0x30/0x40
[337056.617657] LR is at mod_timer+0x110/0x238

>
> --
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1574814
>
> Title:
> ThunderX: soft lockup in cursor_timer_handler() Edit
>
> Status in linux package in Ubuntu:
> Confirmed
>
> Bug description:
> I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
> http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso
>
> During boot I observed the following lockup on the serial console:
>
> [ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
> [ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
> [ 84.922718] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio_cavium(E)
> [ 84.922749]
> [ 84.922754] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G E 4.4.0-21-generic #37-Ubuntu
> [ 84.922757] Hardware name: Cavium ThunderX CN88XX board (DT)
> [ 84.922761] task: ffff801f6c9d4100 ti: ffff801f6c9e8000 task.ti: ffff801f6c9e8000
> [ 84.922771] PC is at cursor_timer_handler+0x30/0x58
> [ 84.922775] LR is at cursor_timer_handler+0x30/0x58
> [ 84.922778] pc : [<ffff8000004ec4f0>] lr : [<ffff8000004ec4f0>] pstate: 00400145
> [ 84.922781] sp : ffff801f6c9ebc20
> [ 84.922784] x29: ffff801f6c9ebc20 x28: ffff8000f94398d8
> [ 84.922789] x27: ffff801f6c9ebd00 x26: ffff801f7b3bebb8
> [ 84.922793] x25: ffff801f6c9e8000 x24: ffff800000e5ec00
> [ 84.922798] x23: ffff801f667d9800 x22: ffff8000004ec4c0
> [ 84.922802] x21: 0000000000000100 x20: ffff8000f94398d8
> [ 84.922807] x19: ffff8000f9439800 x18: 0000ffffc76a5358
> [ 84.922811] x17: 0000ffff97bbd2a8 x16: ffff8000002a5040
> [ 84.922816] x15: 000000003e4cf1e0 x14: 0000000000000008
> [ 84.922820] x13: 0000000000000000 x12: 003d090000000000
> [ 84.922824] x11: 00000000003d0900 x10: ffff80000090f200
> [ 84.922829] x9 : 00003d0900000000 x8 : 000000000000000e
> [ 84.922833] x7 : ffff801f7b3c5008 x6 : 00000000ffffffff
> [ 84.922837] x5 : 0000000000000000 x4 : 0000000000000001
> [ 84.922842] x3 : 0000000000000000 x2 : ffff801f6c899e05
> [ 84.922846] x1 : ffff801f667d99e0 x0 : 0000000000000000
> ...

Read more...

I tried PXE based install (just booting till the installer does network init) multiple times with Cavium UEFI and did not see this issue, but as soon as I tried AMI UEFI first time the "NMI softlockup ..." appeared after sometime while still in opening screen of "debian-installer".

dann frazier (dannf) wrote :

fyi, I've found that removing the ast kernel module from the di initrd seems to avoid the issue.

I think AMI UEFI has the aspeed vga driver but Cavium UEFI doesn't. so this might be causing issues ?

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
dann frazier (dannf) wrote :

I created a perf timechart (attached) to see if that would illuminate the problem. This is taken from the installer environment for the first 30s.

In this boot, the soft lockup occurred on CPU 11:
  [ 32.815351] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [swapper/11:0]

This blocked period appears to be visible from the chart, but it isn't clear to me what it was executing at the time.

dann frazier (dannf) wrote :

I captured a perf data file, starting just before the NMI kicked in - in this case, it again occurred on CPU 11. I found that CPU 11 had spent a lot of its time in cursor_timer_handler():

- 16.92% swapper [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore ▒
   - _raw_spin_unlock_irqrestore ▒
      + 16.87% mod_timer ▒
      + 0.05% cursor_timer_handler ▒
- 12.15% swapper [kernel.kallsyms] [k] queue_work_on ▒
   - queue_work_on ▒
      + 12.00% cursor_timer_handler ▒
      + 0.15% call_timer_fn ▒
+ 10.98% swapper [kernel.kallsyms] [k] run_timer_softirq ▒
- 2.23% swapper [kernel.kallsyms] [k] mod_timer ▒
   - mod_timer ▒
      + 1.97% cursor_timer_handler ▒
      + 0.26% call_timer_fn

Looking at the profile of the other CPUs, I found one that was interesting - CPU #12 - which appears to be the one actually running the cursor update code CPU 11 is scheduling:

- 42.18% kworker/u96:2 [kernel.kallsyms] [k] ww_mutex_unlock ▒
   - ww_mutex_unlock ▒
      - 40.70% ast_dirty_update ▒
           ast_imageblit ▒
           soft_cursor ▒
           bit_cursor ▒
           fb_flashcursor ▒
           process_one_work ▒
           worker_thread ▒
           kthread ▒
           ret_from_fork ▒
      + 1.48% ast_imageblit ▒
- 40.15% kworker/u96:2 [kernel.kallsyms] [k] __memcpy_toio ▒
   - __memcpy_toio ▒
      + 31.54% ast_dirty_update ▒
      + 8.61% ast_imageblit

I wonder if this path is blocking, preventing the timer handler on CPU #11 from rescheduling. Indeed, when I time how long the handler takes from start to finish, I am seeing occasional times around ~.1s.

dann frazier (dannf) wrote :

I used ftrace to do some duration measuring of the timer function fb_flashcursor(). I noticed several places where this timer takes around 98 ms to complete. This time seems to be due to multiple calls to __memcpy_toio() in ast_dirty_update():

 for (i = y; i <= y2; i++) {
  /* assume equal stride for now */
  src_offset = dst_offset = i * afbdev->afb.base.pitches[0] + (x * bpp);
  memcpy_toio(bo->kmap.virtual + src_offset, afbdev->sysram + src_offset, (x2 - x + 1) * bpp);

My theory is that this is causing mod_timer() to block on the other CPU, resulting in the soft lockup.

Also - I built a custom d-i using pristine 4.6-rc7, and I am able to easily reproduce this. I think the next step here is to report this to upstream.

Download full text (4.0 KiB)

On Fri, May 13, 2016 at 7:22 AM, dann frazier
<email address hidden> wrote:
> I used ftrace to do some duration measuring of the timer function
> fb_flashcursor(). I noticed several places where this timer takes around
> 98 ms to complete. This time seems to be due to multiple calls to
> __memcpy_toio() in ast_dirty_update():
>
> for (i = y; i <= y2; i++) {
> /* assume equal stride for now */
> src_offset = dst_offset = i * afbdev->afb.base.pitches[0] + (x * bpp);
> memcpy_toio(bo->kmap.virtual + src_offset, afbdev->sysram + src_offset, (x2 - x + 1) * bpp);
>
>
> My theory is that this is causing mod_timer() to block on the other CPU, resulting in the soft lockup.
>
> Also - I built a custom d-i using pristine 4.6-rc7, and I am able to
> easily reproduce this. I think the next step here is to report this to
> upstream.

Hi Dann,

Andrew asked me to take a look at the issue, and from my tracing,
most of times, the cpu 'hangs' in the following line of code:

__mod_timer():
....
out_unlock:
     spin_unlock_irqrestore(&base->lock, flags);

If I added two trace points around the above line, most of times only
the trace point before the line is dumped, and the one after the line
can't dumped.

Thanks,

>
> --
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1574814
>
> Title:
> ThunderX: soft lockup in cursor_timer_handler() Edit
>
> Status in linux package in Ubuntu:
> Confirmed
>
> Bug description:
> I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
> http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso
>
> During boot I observed the following lockup on the serial console:
>
> [ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
> [ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
> [ 84.922718] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio_cavium(E)
> [ 84.922749]
> [ 84.922754] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G E 4.4.0-21-generic #37-Ubuntu
> [ 84.922757] Hardware name: Cavium ThunderX CN88XX board (DT)
> [ 84.922761] task: ffff801f6c9d4100 ti: ffff801f6c9e8000 task.ti: ffff801f6c9e8000
> [ 84.922771] PC is at cursor_timer_handler+0x30/0x58
> [ 84.922775] LR is at cursor_timer_handler+0x30/0x58
> [ 84.922778] pc : [<ffff8000004ec4f0>] lr : [<ffff8000004ec4f0>] pstate: 00400145
> [ 84.922781] sp : ffff801f6c9ebc20
> [ 84.922784] x29: ffff801f6c9ebc20 x28: ffff8000f94398d8
> [ 84.922789] x27: ffff801f6c9ebd00 x26: ffff801f7b3bebb8
> [ 84.922793] x25: ffff801f6c9e8000 x24: ffff800000e5ec00
> [ 84.922798] x23: ffff801f667d9800 x22: ffff8000004ec4c0
> [ 84.922802] x21: 0000000000000100 x20: ffff8000f94398d8
> [ 84.922807] x19: ffff8000f9439800 x18: 0000ffffc76a5358
> [ 84.922811] x17: 0000ffff97bbd2a8 x16: ffff8000002a5040
> ...

Read more...

dann frazier (dannf) on 2016-05-16
summary: - ThunderX: soft lockup in cursor_timer_handler() Edit
+ ThunderX: soft lockup in cursor_timer_handler()
Ming Lei (tom-leiming) wrote :
Download full text (4.9 KiB)

On Mon, May 16, 2016 at 5:25 PM, Ming Lei <email address hidden> wrote:
> On Fri, May 13, 2016 at 7:22 AM, dann frazier
> <email address hidden> wrote:
>> I used ftrace to do some duration measuring of the timer function
>> fb_flashcursor(). I noticed several places where this timer takes around
>> 98 ms to complete. This time seems to be due to multiple calls to
>> __memcpy_toio() in ast_dirty_update():
>>
>> for (i = y; i <= y2; i++) {
>> /* assume equal stride for now */
>> src_offset = dst_offset = i * afbdev->afb.base.pitches[0] + (x * bpp);
>> memcpy_toio(bo->kmap.virtual + src_offset, afbdev->sysram + src_offset, (x2 - x + 1) * bpp);
>>
>>
>> My theory is that this is causing mod_timer() to block on the other CPU, resulting in the soft lockup.
>>
>> Also - I built a custom d-i using pristine 4.6-rc7, and I am able to
>> easily reproduce this. I think the next step here is to report this to
>> upstream.
>
> Hi Dann,
>
> Andrew asked me to take a look at the issue, and from my tracing,
> most of times, the cpu 'hangs' in the following line of code:
>
> __mod_timer():
> ....
> out_unlock:
> spin_unlock_irqrestore(&base->lock, flags);
>
> If I added two trace points around the above line, most of times only
> the trace point before the line is dumped, and the one after the line
> can't dumped.

Looks the above issue is caused by passing 'jiffies' to mod_timer,
and 'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
when the issue happened.

The following patch(workaround) can make the issue disappeared:

diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
index 6e92917..5e880ee 100644
--- a/drivers/video/console/fbcon.c
+++ b/drivers/video/console/fbcon.c
@@ -1095,6 +1095,8 @@ static void fbcon_init(struct vc_data *vc, int init)
                con_copy_unimap(vc, svc);

        ops = info->fbcon_par;
+ if (vc->vc_cur_blink_ms)
+ vc->vc_cur_blink_ms = 125;
        ops->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms);
        p->con_rotate = initial_rotation;
        set_blitting_type(vc, info);

Thanks,

>
> Thanks,
>
>>
>> --
>> You received this bug notification because you are subscribed to linux
>> in Ubuntu.
>> https://bugs.launchpad.net/bugs/1574814
>>
>> Title:
>> ThunderX: soft lockup in cursor_timer_handler() Edit
>>
>> Status in linux package in Ubuntu:
>> Confirmed
>>
>> Bug description:
>> I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
>> http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso
>>
>> During boot I observed the following lockup on the serial console:
>>
>> [ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
>> [ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
>> [ 84.922718] Modules linked in: hid_generic(E) usbhid(E) hid(E) usb_storage(E) mdio_thunder(E) nicvf(E) ast(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ttm(E) drm(E) nicpf(E) thunder_bgx(E) mdio...

Read more...

Ming Lei (tom-leiming) wrote :
Download full text (5.1 KiB)

On Tue, May 17, 2016 at 12:12 PM, Ming Lei <email address hidden> wrote:
> On Mon, May 16, 2016 at 5:25 PM, Ming Lei <email address hidden> wrote:
>> On Fri, May 13, 2016 at 7:22 AM, dann frazier
>> <email address hidden> wrote:
>>> I used ftrace to do some duration measuring of the timer function
>>> fb_flashcursor(). I noticed several places where this timer takes around
>>> 98 ms to complete. This time seems to be due to multiple calls to
>>> __memcpy_toio() in ast_dirty_update():
>>>
>>> for (i = y; i <= y2; i++) {
>>> /* assume equal stride for now */
>>> src_offset = dst_offset = i * afbdev->afb.base.pitches[0] + (x * bpp);
>>> memcpy_toio(bo->kmap.virtual + src_offset, afbdev->sysram + src_offset, (x2 - x + 1) * bpp);
>>>
>>>
>>> My theory is that this is causing mod_timer() to block on the other CPU, resulting in the soft lockup.
>>>
>>> Also - I built a custom d-i using pristine 4.6-rc7, and I am able to
>>> easily reproduce this. I think the next step here is to report this to
>>> upstream.
>>
>> Hi Dann,
>>
>> Andrew asked me to take a look at the issue, and from my tracing,
>> most of times, the cpu 'hangs' in the following line of code:
>>
>> __mod_timer():
>> ....
>> out_unlock:
>> spin_unlock_irqrestore(&base->lock, flags);
>>
>> If I added two trace points around the above line, most of times only
>> the trace point before the line is dumped, and the one after the line
>> can't dumped.
>
> Looks the above issue is caused by passing 'jiffies' to mod_timer,
> and 'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
> when the issue happened.
>
> The following patch(workaround) can make the issue disappeared:
>
> diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
> index 6e92917..5e880ee 100644
> --- a/drivers/video/console/fbcon.c
> +++ b/drivers/video/console/fbcon.c
> @@ -1095,6 +1095,8 @@ static void fbcon_init(struct vc_data *vc, int init)
> con_copy_unimap(vc, svc);
>
> ops = info->fbcon_par;
> + if (vc->vc_cur_blink_ms)

oops, it should be 'if (!vc->vc_cur_blink_ms)'

> + vc->vc_cur_blink_ms = 125;
> ops->cur_blink_jiffies = msecs_to_jiffies(vc->vc_cur_blink_ms);
> p->con_rotate = initial_rotation;
> set_blitting_type(vc, info);
>
> Thanks,
>
>>
>> Thanks,
>>
>>>
>>> --
>>> You received this bug notification because you are subscribed to linux
>>> in Ubuntu.
>>> https://bugs.launchpad.net/bugs/1574814
>>>
>>> Title:
>>> ThunderX: soft lockup in cursor_timer_handler() Edit
>>>
>>> Status in linux package in Ubuntu:
>>> Confirmed
>>>
>>> Bug description:
>>> I booted a Cavium ThunderX crb1s 2.0 system using the netboot mini iso via virtual media:
>>> http://ports.ubuntu.com/ubuntu-ports/dists/xenial/main/installer-arm64/20101020ubuntu451/images/netboot/mini.iso
>>>
>>> During boot I observed the following lockup on the serial console:
>>>
>>> [ 28.128327] usb 1-1.1: reset high-speed USB device number 3 using xhci_hcd
>>> [ 84.912299] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [swapper/14:0]
>>> [ 84.922718] Modules li...

Read more...

Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
dann frazier (dannf) wrote :

I rebuilt d-i against the kernel in proposed and booted it a couple of times. I have not reproduced this issue, so marking it verified.

~ # uname -a
Linux cvm3 4.4.0-25-generic #44-Ubuntu SMP Fri Jun 10 18:15:04 UTC 2016 aarch64 GNU/Linux
~ # dmesg | grep -i soft
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-di root=UUID=bc611283-1f38-4993-a2b9-883922c7ed1f ro earlycon=pl011,0x87e024000000 hardlockup_all_cpu_backtrace=1 softlockup_all_cpu_backtrace=1 earlycon=pl011,0x87e024000000 apt-setup/proposed=true
[ 0.000000] software IO TLB [mem 0xfbfed000-0xfffed000] (64MB) mapped at [ffff8000fb1ed000-ffff8000ff1ecfff]
[ 0.229859] CPU features: detected feature: Software prefetching using PRFM
[ 174.913338] xor: measuring software checksum speed

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (24.1 KiB)

This bug was fixed in the package linux - 4.4.0-25.44

---------------
linux (4.4.0-25.44) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1591289

  * Xenial update to v4.4.13 stable release (LP: #1590455)
    - MIPS64: R6: R2 emulation bugfix
    - MIPS: math-emu: Fix jalr emulation when rd == $0
    - MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
    - MIPS: Don't unwind to user mode with EVA
    - MIPS: Avoid using unwind_stack() with usermode
    - MIPS: Fix siginfo.h to use strict posix types
    - MIPS: Fix uapi include in exported asm/siginfo.h
    - MIPS: Fix watchpoint restoration
    - MIPS: Flush highmem pages in __flush_dcache_page
    - MIPS: Handle highmem pages in __update_cache
    - MIPS: Sync icache & dcache in set_pte_at
    - MIPS: ath79: make bootconsole wait for both THRE and TEMT
    - MIPS: Reserve nosave data for hibernation
    - MIPS: Loongson-3: Reserve 32MB for RS780E integrated GPU
    - MIPS: Use copy_s.fmt rather than copy_u.fmt
    - MIPS: Fix MSA ld_*/st_* asm macros to use PTR_ADDU
    - MIPS: Prevent "restoration" of MSA context in non-MSA kernels
    - MIPS: Disable preemption during prctl(PR_SET_FP_MODE, ...)
    - MIPS: ptrace: Fix FP context restoration FCSR regression
    - MIPS: ptrace: Prevent writes to read-only FCSR bits
    - MIPS: Fix sigreturn via VDSO on microMIPS kernel
    - MIPS: Build microMIPS VDSO for microMIPS kernels
    - MIPS: lib: Mark intrinsics notrace
    - MIPS: VDSO: Build with `-fno-strict-aliasing'
    - affs: fix remount failure when there are no options changed
    - ASoC: ak4642: Enable cache usage to fix crashes on resume
    - Input: uinput - handle compat ioctl for UI_SET_PHYS
    - ARM: mvebu: fix GPIO config on the Linksys boards
    - ARM: dts: at91: fix typo in sama5d2 PIN_PD24 description
    - ARM: dts: exynos: Add interrupt line to MAX8997 PMIC on exynos4210-trats
    - ARM: dts: imx35: restore existing used clock enumeration
    - ath9k: Add a module parameter to invert LED polarity.
    - ath9k: Fix LED polarity for some Mini PCI AR9220 MB92 cards.
    - ath10k: fix debugfs pktlog_filter write
    - ath10k: fix firmware assert in monitor mode
    - ath10k: fix rx_channel during hw reconfigure
    - ath10k: fix kernel panic, move arvifs list head init before htt init
    - ath5k: Change led pin configuration for compaq c700 laptop
    - hwrng: exynos - Fix unbalanced PM runtime put on timeout error path
    - rtlwifi: rtl8723be: Add antenna select module parameter
    - rtlwifi: btcoexist: Implement antenna selection
    - rtlwifi: Fix logic error in enter/exit power-save mode
    - rtlwifi: pci: use dev_kfree_skb_irq instead of kfree_skb in
      rtl_pci_reset_trx_ring
    - aacraid: Relinquish CPU during timeout wait
    - aacraid: Fix for aac_command_thread hang
    - aacraid: Fix for KDUMP driver hang
    - hwmon: (ads7828) Enable internal reference
    - mfd: intel-lpss: Save register context on suspend
    - mfd: intel_soc_pmic_core: Terminate panel control GPIO lookup table
      correctly
    - PM / Runtime: Fix error path in pm_runtime_force_resume()
    - cpuidle: Indicate when a device has been unregiste...

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (26.1 KiB)

This bug was fixed in the package linux - 4.4.0-28.47

---------------
linux (4.4.0-28.47) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1595874

  * Linux netfilter local privilege escalation issues (LP: #1595350)
    - netfilter: x_tables: don't move to non-existent next rule
    - netfilter: x_tables: validate targets of jumps
    - netfilter: x_tables: add and use xt_check_entry_offsets
    - netfilter: x_tables: kill check_entry helper
    - netfilter: x_tables: assert minimum target size
    - netfilter: x_tables: add compat version of xt_check_entry_offsets
    - netfilter: x_tables: check standard target size too
    - netfilter: x_tables: check for bogus target offset
    - netfilter: x_tables: validate all offsets and sizes in a rule
    - netfilter: x_tables: don't reject valid target size on some architectures
    - netfilter: arp_tables: simplify translate_compat_table args
    - netfilter: ip_tables: simplify translate_compat_table args
    - netfilter: ip6_tables: simplify translate_compat_table args
    - netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
    - netfilter: x_tables: do compat validation via translate_table
    - netfilter: x_tables: introduce and use xt_copy_counters_from_user

  * Linux netfilter IPT_SO_SET_REPLACE memory corruption (LP: #1555338)
    - netfilter: x_tables: validate e->target_offset early
    - netfilter: x_tables: make sure e->next_offset covers remaining blob size
    - netfilter: x_tables: fix unconditional helper

linux (4.4.0-27.46) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1594906

  * Support Edge Gateway's Bluetooth LED (LP: #1512999)
    - Revert "UBUNTU: SAUCE: Bluetooth: Support for LED on Marvell modules"

linux (4.4.0-26.45) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1594442

  * linux: Implement secure boot state variables (LP: #1593075)
    - SAUCE: UEFI: Add secure boot and MOK SB State disabled sysctl

  * failures building userspace packages that include ethtool.h (LP: #1592930)
    - ethtool.h: define INT_MAX for userland

linux (4.4.0-25.44) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1591289

  * Xenial update to v4.4.13 stable release (LP: #1590455)
    - MIPS64: R6: R2 emulation bugfix
    - MIPS: math-emu: Fix jalr emulation when rd == $0
    - MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
    - MIPS: Don't unwind to user mode with EVA
    - MIPS: Avoid using unwind_stack() with usermode
    - MIPS: Fix siginfo.h to use strict posix types
    - MIPS: Fix uapi include in exported asm/siginfo.h
    - MIPS: Fix watchpoint restoration
    - MIPS: Flush highmem pages in __flush_dcache_page
    - MIPS: Handle highmem pages in __update_cache
    - MIPS: Sync icache & dcache in set_pte_at
    - MIPS: ath79: make bootconsole wait for both THRE and TEMT
    - MIPS: Reserve nosave data for hibernation
    - MIPS: Loongson-3: Reserve 32MB for RS780E integrated GPU
    - MIPS: Use copy_s.fmt rather than copy_u.fmt
    - MIPS: Fix MSA ld_*/st_* asm macros to use PTR_ADDU
    - MIPS: Prevent "restoration" of MSA c...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released

What is the exact commit that went into to fix this?
I am looking at drivers/video/console/fbcon.c at "https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/xenial/" and I don't see the fix here.

dann frazier (dannf) wrote :

@Radha If you are still seeing this bug, can you attach a console log?

Here's the commit that was added in xenial to address this:

commit 51b7d7350e526bde77f458fb23481e7c5d312d91
Author: David Daney <email address hidden>
Date: Tue May 24 12:55:54 2016 +0800

    UBUNTU: SAUCE: tty: vt: Fix soft lockup in fbcon cursor blink timer.

    BugLink: http://bugs.launchpad.net/bugs/1574814

    We are getting somewhat random soft lockups with this signature:

    [ 86.992215] [<fffffc00080935e0>] el1_irq+0xa0/0x10c
    [ 86.997082] [<fffffc000841822c>] cursor_timer_handler+0x30/0x54
    [ 87.002991] [<fffffc000810ec44>] call_timer_fn+0x54/0x1a8
    [ 87.008378] [<fffffc000810ef88>] run_timer_softirq+0x1c4/0x2bc
    [ 87.014200] [<fffffc000809077c>] __do_softirq+0x114/0x344
    [ 87.019590] [<fffffc00080af45c>] irq_exit+0x74/0x98
    [ 87.024458] [<fffffc00080fac20>] __handle_domain_irq+0x98/0xfc
    [ 87.030278] [<fffffc000809056c>] gic_handle_irq+0x94/0x190

    This is caused by the vt visual_init() function calling into
    fbcon_init() with a vc_cur_blink_ms value of zero. This is a
    transient condition, as it is later set to a non-zero value. But, if
    the timer happens to expire while the blink rate is zero, it goes into
    an endless loop, and we get soft lockup.

    The fix is to initialize vc_cur_blink_ms before calling the con_init()
    function.

    Tested-by: Ming Lei <email address hidden>
    Acked-by: Pavel Machek <email address hidden>
    Acked-by: Scot Doyle <email address hidden>
    Signed-off-by: David Daney <email address hidden>
    Cc: <email address hidden>
    Reference: https://lkml.org/lkml/2016/5/17/455
    Cc: Ming Lei <email address hidden>
    Acked-by: Tim Gardner <email address hidden>
    Signed-off-by: Kamal Mostafa <email address hidden>

@Dann, my bad. the installer kernel is from older 16.04 release, that was before this commit. I don't see this issue with latest installer kernel.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers