KPTI support for arm64 systems

Bug #1749040 reported by dann frazier
| Affects | Status | Importance | Assigned to | Milestone |
| linux (Ubuntu) | Fix Released | Critical | Unassigned | |
| Artful | Fix Released | Critical | Unassigned | |
| Bionic | Fix Released | Critical | Unassigned | |

Bug Description

While regression testing the current linux-hwe proposed kernel (4.13.0-33.36~16.04.1), I found that it fails to boot on a Cavium ThunderX CRB. I've rebooted twice since upgrading from the current -updates kernel, and it's failed to boot both times, with different failure modes.

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1749040

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
dann frazier (dannf) wrote : Re: fails to boot on Cavium ThunderX CRB

I did test w/ the latest artful upload (35.39), and the problem is not reproducing. I don't see anything obvious in the git log that explains it. I'll try rebuilding it in a xenial environment on the off-chance it is a toolchain issue.

dann frazier (dannf) wrote :

4.13.0-33.36 from artful boots fine while 4.13.0-33.36~16.04.1 from xenial does not - so possibly a toolchain-related issue.

dann frazier (dannf)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
importance: Undecided → Critical
Paolo Pisati (p-pisati) wrote :

In the meantime linux-generic-hwe-16.04 has moved to 4.13.0.35.54

$ rmadison -s xenial-proposed linux-generic-hwe-16.04
 linux-generic-hwe-16.04 | 4.13.0.35.54 | xenial-proposed | amd64, arm64, armhf, i386, ppc64el, s390x

The board you tested and lundmark are apparently the same:

$ grep "Board Model" console.log lundmark.log
console.log:Board Model: crb-1s
lundmark.log:Board Model: crb-1s
$ grep "SKU" console.log lundmark.log
console.log:SKU: CN8890-2000BG2601-AAP-Y-G
lundmark.log:SKU: CN8890-2000BG2601-AAP-Y-G
$ grep "Machine model" console.log lundmark.log
console.log:[ 0.000000] Machine model: cavium,thunder-88xx
lundmark.log:[ 0.000000] Machine model: cavium,thunder-88xx

and everything works fine on lundmark:

ubuntu@lundmark:~$ uname -a
Linux lundmark 4.13.0-35-generic #39~16.04.1-Ubuntu SMP Mon Feb 12 15:03:44 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

FWIW, I tried the same kernel in an Artful installation (thus compiled with the artful toolchain), and everything works fine there too.

Paolo Pisati (p-pisati) wrote :

We are using slightly different toolchains:

$ grep gcc console.log lundmark.log
console.log:[ 0.000000] Linux version 4.13.0-33-generic (buildd@bos02-arm64-023) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.8)) #36~16.04.1-Ubuntu SMP Wed Feb 7 23:37:06 UTC 2018 (Ubuntu 4.13.0-33.36~16.04.1-generic 4.13.13)
lundmark.log:[ 0.000000] Linux version 4.13.0-35-generic (buildd@bos02-arm64-029) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)) #39~16.04.1-Ubuntu SMP Mon Feb 12 15:03:44 UTC 2018 (Ubuntu 4.13.0-35.39~16.04.1-generic 4.13.13)

Your toolchain is 5.4.0-6ubuntu1~16.04.8, mine is 5.4.0-6ubuntu1~16.04.9.

Although the difference shouldn't impact arm64:

gcc-5 (5.4.0-6ubuntu1~16.04.9) xenial-security; urgency=medium

  * Revert retpoline changes of ppc64el as per the recommendation from
    Bill Schmidt of IBM.
    - ppc-add-mspeculate-indirect-jumps: drop.

we should try building the failing kernel (4.13.0-33.36~16.04.1) with the 5.4.0-6ubuntu1~16.04.9 toolchain and see what happens.
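
When comparing several boot logs, the toolchain package version can be pulled straight out of the kernel banner instead of being read by eye. A minimal sketch, assuming the banner carries an "(Ubuntu/Linaro <version>)" string as in the two logs quoted above; toolchain_version is a hypothetical helper name, not an existing tool:

```sh
# Print the GCC package version found in a kernel "Linux version" banner.
# Reads log text on stdin. Assumes the "(Ubuntu/Linaro <version>)" form
# seen in console.log and lundmark.log; other banner formats print nothing.
toolchain_version() {
    grep -o 'Ubuntu/Linaro [^)]*' | head -n 1 | cut -d ' ' -f 2
}
```

Usage would look like "toolchain_version < console.log", run once per log, so that the two results can be compared directly.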

dann frazier (dannf) wrote :

You are correct that seuss and lundmark should be identical. I rebuilt the 4.13.0-35.39 kernel in xenial to see if the issue follows the toolchain (see table below), but I failed to notice the slight difference in versions. Here's the current testing summary - next I'll rebuild 4.13.0-33.36~16.04.1 with the 5.4.0-6ubuntu1~16.04.*9* toolchain as you suggested.

Kernel | Toolchain (GCC) | Result
4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.8 | Fails
4.13.0-33.36 | 7.2.0-8ubuntu3.1 | OK
4.13.0-35.39 | 7.2.0-8ubuntu3.2 | OK
4.13.0-35.39 | 5.4.0-6ubuntu1~16.04.9 | OK

dann frazier (dannf) wrote :

A couple more tests - rebuilding the failing kernel w/ the updated GCC also shows a failure. Rebuilding the newer artful-proposed kernel (35.39) w/ the same toolchain does not show a failure.
Here's the current status:

| Kernel | Toolchain (GCC) | Result |
| 4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.8 | Fails |
| 4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.9 | Fails |
| 4.13.0-33.36 | 7.2.0-8ubuntu3.1 | OK |
| 4.13.0-35.39 | 7.2.0-8ubuntu3.2 | OK |
| 4.13.0-35.39 | 5.4.0-6ubuntu1~16.04.9 | OK |
| 4.13.0-35.39~16.04.1 | 5.4.0-6ubuntu1~16.04.9 | OK |
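
One way to read the table above is to ask which factor shows both passing and failing rows: a factor with mixed results cannot explain the failure on its own. The following is only a sketch of that check, assuming rows in the pipe-delimited form used above; mixed_results is a hypothetical helper name.

```sh
# Print every value in the chosen column that has both "OK" and "Fails"
# rows. Column 2 is the kernel, column 3 the toolchain (column 1 is the
# empty field before the leading "|"). Rows with any other result text,
# such as a header row, are ignored.
mixed_results() {
    awk -F '|' -v col="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF >= 5 {
            key = trim($col); res = trim($4)
            if (res == "OK") ok[key] = 1
            else if (res == "Fails") bad[key] = 1
        }
        END { for (k in ok) if (k in bad) print k }
    ' | sort
}
```

Fed the six rows above, "mixed_results 3" prints only 5.4.0-6ubuntu1~16.04.9, while "mixed_results 2" prints nothing: in this data the failure tracks the kernel package version rather than the compiler.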

dann frazier (dannf)
Changed in linux (Ubuntu Artful):
status: New → In Progress
importance: Undecided → Critical
Changed in linux (Ubuntu):
status: Confirmed → New
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

Changed in linux (Ubuntu):
status: New → Incomplete
Kleber Sacilotto de Souza (kleber-souza) wrote : Re: fails to boot on Cavium ThunderX CRB

Hi,

We have reverted some arm64 patches that were causing boot issues so the system should be able to boot now. Could you please verify if you can boot the system with the latest Artful kernel on -proposed?

Thank you.

dann frazier (dannf) wrote : Re: [Bug 1749040] Re: fails to boot on Cavium ThunderX CRB

On Tue, Feb 20, 2018 at 4:05 AM, Kleber Sacilotto de Souza
<email address hidden> wrote:
> Hi,
>
> We have reverted some arm64 patches that were causing boot issues so the
> system should be able to boot now. Could you please verify if you can
> boot the system with the latest Artful kernel on -proposed?

LGTM:
ubuntu@grotrian:~$ cat /proc/version
Linux version 4.13.0-36-generic (buildd@bos02-arm64-021) (gcc version
5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9))
#40~16.04.1-Ubuntu SMP Fri Feb 16 23:26:28 UTC 2018

dann frazier (dannf) wrote : Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

Paolo has respun a new KPTI backport:
  https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport

A linux/artful test build is available at:
  ppa:p-pisati/arm64-kpti-backport

And a linux-hwe/xenial test build is available at:
  ppa:dannf/kpti

We're now in the process of regression testing across platforms, using the Ubuntu cert tests.

summary: - fails to boot on Cavium ThunderX CRB
+ KPTI-enabled kernel fails to boot on Cavium ThunderX CRB
dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

1 unexpected failure on the HP m400 - disk/disk_stress_ng_sda. Needs investigation.

dann frazier (dannf) wrote :

The failure mentioned in Comment #15 is a test suite bug - see LP: #1751167. I applied a hot-fix, re-ran the test, and it passed.

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
Manoj Iyer (manjo) wrote :

On QTI QDF2400 I notice errors from xhci-hcd like:

awrep6 login: [18851.092328] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18851.308402] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18851.530820] usb 5-1.4: device not accepting address 6, error -22
[18852.128633] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18852.343093] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.

and OOM kills like:

[20783.778840] Out of memory: Kill process 28248 (stress-ng) score 1075 or sacrifice child
[20783.785910] Killed process 28248 (stress-ng) total-vm:8354000kB, anon-rss:6247552kB, file-rss:320kB, shmem-rss:64kB
[20947.665955] Out of memory: Kill process 28310 (stress-ng) score 1066 or sacrifice child
[20947.673020] Killed process 28310 (stress-ng) total-vm:7347264kB, anon-rss:3564612kB, file-rss:356kB, shmem-rss:64kB

I plan to reboot the system and re-run and report back here.

dann frazier (dannf) wrote : Re: [Bug 1749040] Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

On Fri, Feb 23, 2018 at 9:24 AM, Manoj Iyer <email address hidden> wrote:
> On QTI QDF2400 I notice errors from xhci-hcd like:
>
> awrep6 login: [18851.092328] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18851.308402] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18851.530820] usb 5-1.4: device not accepting address 6, error -22
> [18852.128633] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18852.343093] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.

Is that also seen when running the current linux-hwe kernel?

> and OOM kills like:
>
> [20783.778840] Out of memory: Kill process 28248 (stress-ng) score 1075 or sacrifice child
> [20783.785910] Killed process 28248 (stress-ng) total-vm:8354000kB, anon-rss:6247552kB, file-rss:320kB, shmem-rss:64kB
> [20947.665955] Out of memory: Kill process 28310 (stress-ng) score 1066 or sacrifice child
> [20947.673020] Killed process 28310 (stress-ng) total-vm:7347264kB, anon-rss:3564612kB, file-rss:356kB, shmem-rss:64kB

I think that's normal, as long as the test passes.

 -dann

Manoj Iyer (manjo) wrote : Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

Test results from D05 Hisilicon board.

dann frazier (dannf) wrote :

The virt test for D05 failing is unexpected. It looks like the updated kernel fails to boot as a guest on HiSilicon D05 systems, and the crash is in code that the KPTI patches introduced.

dann frazier (dannf) wrote :

I'm able to reproduce the failure with an upstream kernel (4.16-rc3+ @ 6f70eb2b00eb4), running on both the host and guest.

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :
summary: - KPTI-enabled kernel fails to boot on Cavium ThunderX CRB
+ KPTI support for arm64 systems
Paolo Pisati (p-pisati) wrote :

No need to install 4.16-rc3 in both host & guest: I can reproduce it
on d05-6 using 4.13.0-36-generic #40~16.04.1 on host (no KPTI
patchset) and 4.16.0-rc3+ on guest.

Unfortunately defconfig boots fine, so there's something in that .config that trips it.
If trying to reproduce it, disable CONFIG_DEBUG_INFO to avoid building the kernel modules with debug info and ending up with more than 1 GB of them.

dann frazier (dannf) wrote :

Right - I should've mentioned that - the issue follows the guest kernel for me as well. I tested w/ latest upstream for both just in case the guest failure is somehow a side effect of a host kernel bug.

Also, I should have mentioned that this is an intermittent failure for me. I sometimes have to reboot the guest several times to observe the failure. The test also seems sensitive to changes in the test setup - for example, bumping up the guest cpu count to ~50 made the issue go into hiding.
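
Because the failure is intermittent, a single clean boot proves little, and a small retry wrapper makes the "reboot until it fails" procedure repeatable. This is a sketch only: first_failure is a hypothetical helper, and the command it runs (for example a script that boots the guest and checks that it comes up) is not part of this report and has to be supplied by the tester.

```sh
# Run a command up to $1 times and stop at the first failing attempt.
# Returns 1 and reports the attempt number if the command ever fails,
# otherwise returns 0 after the last attempt.
first_failure() {
    max_attempts=$1; shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if ! "$@"; then
            echo "failed on attempt $attempt"
            return 1
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
}
```

A run without a failure only shows that the bug did not trigger in that many attempts, which matters here given how sensitive the failure is to the test setup.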

Paolo Pisati (p-pisati) wrote :

Yep, if it boots fine, I usually retry 2 or 3 times, and it has happened that it failed only on the second or third attempt.

Anyhow, now I'm down to this delta .config (the initial one had ~2k entries), so the bug appears to be ACPI related:

-------------------------------------------------------------------------------
CONFIG_ACPI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_SEA=y
CONFIG_ACPI_BGRT=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_CCA_REQUIRED=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_GENERIC_GSI=y
CONFIG_ACPI_GTDT=y
CONFIG_ACPI_HED=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_ACPI_IORT=y
CONFIG_ACPI_MCFG=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_PCI_SLOT=y
-------------------------------------------------------------------------------

How to reproduce it from d05-6 - 192.168.122.14 is my kvm instance:

$ make defconfig
$ cat $abovedelta >> .config
$ make olddefconfig
$ make -j65
$ scp arch/arm64/boot/Image ubuntu@192.168.122.14:.
$ ssh ubuntu@192.168.122.14

and inside the kvm instance:

$ sudo cp Image /boot/
and when rebooting, press Esc at the grub menu and select the "Image" entry previously created[*]

*: in /boot/grub/grub.cfg create a copy of the default Ubuntu menu entry, but use /boot/Image instead of /boot/vmlinuz-x.y.z, remove "quiet splash" from the default arguments and rename it 'Image'
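
When following the steps above, it may be worth checking that every option from the delta is still set after "make olddefconfig", since Kconfig does not keep options whose dependencies are unmet. A minimal sketch, assuming the delta was saved to a file (called delta.config here purely for illustration); not_in is a hypothetical helper name:

```sh
# Print the lines of file $2 that do not appear verbatim in file $1.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
# grep exits non-zero when nothing is printed, i.e. when nothing is missing.
not_in() {
    grep -vxFf "$1" "$2"
}
```

Running "not_in .config delta.config" after "make olddefconfig" lists any delta options that did not survive. With the arguments swapped and two full configs, the same helper lists what one config sets that the other does not, which is one way to arrive at a delta like the one above.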

Paolo Pisati (p-pisati) wrote :

Found && fixed.

https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport

I pushed two fixes on top of it:

1) 'syscalls: Use CHECK_DATA_CORRUPTION for addr_limit_user_check' fixes a FTBFS for armhf
2) 'arm64: mm: fix thinko in non-global page table attribute check' fixes this kvm boot race

We probably need to test it again on all arm64 boards now...

dann frazier (dannf) wrote : Re: [Bug 1749040] Re: KPTI support for arm64 systems

On Wed, Feb 28, 2018 at 10:12 AM, Paolo Pisati
<email address hidden> wrote:
> Found && fixed.
>
> https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport
>
> I pushed two fixes on top of it:
>
> 1) 'syscalls: Use CHECK_DATA_CORRUPTION for addr_limit_user_check' fixes a FTBFS for armhf
> 2) 'arm64: mm: fix thinko in non-global page table attribute check' fixes this kvm boot race
>
> We probably need to test it again on all arm64 boards now...

OK - I'll get a PPA kernel built and we'll begin another test cycle.

dann frazier (dannf) wrote :

Refreshed kernel now in ppa:dannf/kpti

dann frazier (dannf) wrote :
Paolo Pisati (p-pisati) wrote :

I see 5 failures marked as blocker there: 4 couldn't complete due to the test environment, and 1[*], apparently, is a CONFIG issue (so something that predates this patchset).

Let's see what the other boards report.

1: https://launchpadlibrarian.net/359043692/submission_2018-03-01T05.58.16.640032.html#7-10-log

Manoj Iyer (manjo) wrote :

Hisilicon D05-0 testing.

Manoj Iyer (manjo) wrote :

cert testing on QDF2400

dann frazier (dannf) wrote :

On Thu, Mar 1, 2018 at 8:50 AM, Paolo Pisati <email address hidden> wrote:
> I see 5 failures marked as blocker there: 4 couldn't complete due to the
> test environment, and 1[*], apparently, is a CONFIG issue (so something
> that predates this patchset).

Sorry - I didn't have time to analyze it last night when I posted. All
5 are expected:
  - disk/disk_stress_ng_sda is a test bug (LP: #1751167)
  - ethernet/multi_iperf3_* requires a client/server setup we don't
have in place
  - miscellanea/bmc_info - this system doesn't have a conventional BMC
  - miscellanea/efi_boot_mode - this system boots w/ u-boot
  - miscellanea/ipmi_test - this system doesn't have a conventional BMC

dann frazier (dannf) wrote :

D05 & QDF2400 errors are all expected, thanks Manoj!

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

I've seen a couple of crashes on a ThunderX CRB1S system, though at this point I have no reason to believe they are related to KPTI. This system happened to be installed with LVM + crypted home, and we haven't run the cert tests on such a config before. After the remaining tests complete, I plan to re-run with the pre-kpti kernel on this config, as well as running cert on the "usual" config w/ a kpti kernel. I was only able to capture the console output of one of the crashes; I'll attach it here.

I also have a cert run still in-progress from a Gigabyte R120. This system is very similar to the CRB1S, and has not seen any problems yet.

Manoj Iyer (manjo) wrote :

Merlin system cert results.

Paolo Pisati (p-pisati) wrote :

If the bug in #39 is not related to kpti, can you spawn it into a separate LP bug and add a reproducer?

Thanks.

dann frazier (dannf) wrote :

On Fri, Mar 2, 2018 at 4:48 AM, Paolo Pisati <email address hidden> wrote:
> If the bug in #39 is not related to kpti, can you spawn it into a
> separate LP bug and add a reproducer?

If it is shown to not be related to KPTI after I run those further
tests, that is the plan.

  -dann

dann frazier (dannf) wrote :

crb1s results - all expected failures except disk/disk_stress_ng_dm-1 which is what was running when the panic mentioned in comment #39 occurred.

dann frazier (dannf) wrote :

All failures expected.

Paolo Pisati (p-pisati) wrote :

Is there a way for me to manually run disk/disk_stress_ng_dm-1?

dann frazier (dannf) wrote :

On Mon, Mar 5, 2018 at 2:34 AM, Paolo Pisati <email address hidden> wrote:
> Is there a way for me to manually run disk/disk_stress_ng_dm-1?

I've just started a run on the server/config that failed the test, but
with a pre-kpti kernel, to see if it follows kpti or the
system/config. The command it generated was:

  timeout -s 14 1200 stress-ng --aggressive --verify --timeout 240 \
    --temp-path /tmp/disk_stress_ng_7a89ec14-42ad-4a5c-ae7d-41b3293cd7ee \
    --aio 0 --hdd-opts dsync --readahead-bytes 16M -k
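
In the generated command, the outer "timeout -s 14 1200" acts as a backstop around stress-ng's own 240-second limit: signal 14 is SIGALRM on Linux, and GNU timeout exits with status 124 when its deadline expires. For manual re-runs, a wrapper can make that distinction visible. This is a sketch assuming GNU coreutils timeout; run_with_deadline is a hypothetical helper name.

```sh
# Run a command under a deadline (seconds) and say how it ended.
# Assumes GNU timeout: exit status 124 means the deadline expired and the
# command was sent signal 14 (SIGALRM), as in the generated cert command.
run_with_deadline() {
    deadline=$1; shift
    status=0
    timeout -s 14 "$deadline" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${deadline}s"
    else
        echo "exited with status $status"
    fi
    return "$status"
}
```

For example, "run_with_deadline 1200 stress-ng ..." with the arguments above would separate a hung stressor (timed out) from a stressor that ran to completion and failed its own --verify checks (non-zero exit).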

dann frazier (dannf) wrote :

I was able to reproduce a crash on a crypted LVM system w/ the pre-KPTI kernel. LP: #1753489.

The backtrace isn't identical, but it does show that crypted LVM was fragile even before the KPTI patches.

Paolo Pisati (p-pisati) wrote :

Good, any other pending tests?

If not, I'll move forward and send the above v2 arm64 kpti patchset.

dann frazier (dannf) wrote :

On Mon, Mar 5, 2018 at 7:47 AM, Paolo Pisati <email address hidden> wrote:
> Good, any other pending tests?

Yes, just one more. I'm going to run the same test on the kpti kernel
on a different system of the same config, but *without* lvm crypt.
Starting that now.

dann frazier (dannf) wrote :
Attachment: kern.log (1.7 MiB, text/x-log)

On Mon, Mar 5, 2018 at 9:06 AM, dann frazier <email address hidden> wrote:
> On Mon, Mar 5, 2018 at 7:47 AM, Paolo Pisati <email address hidden> wrote:
>> Good, any other pending tests?
>
> Yes, just one more. I'm going to run the same test on the kpti kernel
> on a different system of the same config, but *without* lvm crypt.
> Starting that now.

Unfortunately, that locked up. Not in the disk test, but while running:
stress-ng -k --aggressive --verify --timeout 1560 --brk 0
(I kicked off a full cert re-run)

Those stress-ng processes are stuck blocked on I/O, and commands like
'ps -ef' hang.
The system is otherwise idle.

dann frazier (dannf) wrote :

I rebooted the machine (lundmark) and restarted the test mentioned in the previous comment. This time the system crashed w/ a Synchronous External Abort:

[23243.094384] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000ffffb0f74f68

This suggests a non-software issue. I'll seek another CRB machine and retest.
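
The value in parentheses in that message is the exception syndrome (ESR), and splitting it into fields shows why it reads as a memory error rather than a software fault. A minimal sketch, assuming the ARMv8 ESR_ELx layout as I understand it from the Arm architecture manual (EC in bits 31:26, IL in bit 25, and for aborts a fault status code in bits 5:0); decode_esr is a hypothetical helper name:

```sh
# Split an ARMv8 ESR_ELx value into its top-level fields.
# Assumed layout:
#   EC  = bits [31:26]  exception class
#   IL  = bit  [25]     instruction length
#   FSC = bits [5:0]    fault status code (meaningful for aborts only)
decode_esr() {
    esr=$(($1))
    printf 'EC=0x%02x IL=%d FSC=0x%02x\n' \
        $(( (esr >> 26) & 0x3f )) $(( (esr >> 25) & 1 )) $(( esr & 0x3f ))
}
```

"decode_esr 0x86000018" prints EC=0x21 IL=1 FSC=0x18. Under the assumed layout, EC 0x21 is an instruction abort and FSC 0x18 is the "synchronous parity or ECC error" code the kernel printed, which is consistent with suspecting the hardware.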

dann frazier (dannf) wrote :
dann frazier (dannf) wrote :

I've reproduced the CRB1S panic with the current (non-kpti) kernel and without crypt lvm (standard MAAS install) and reported bug 1754053 to track it. We can rule that out as being caused by kpti.

dann frazier (dannf)
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Released
Changed in linux (Ubuntu Bionic):
status: Incomplete → Fix Committed
status: Fix Committed → Fix Released