Bug #1749040 “KPTI support for arm64 systems” : Artful (17.10) : Bugs : linux package : Ubuntu

dann frazier (dannf) wrote on 2018-02-13:

#1

console.log (84.9 KiB, text/plain)

dann frazier (dannf) wrote on 2018-02-13:

#2

console.log.2 (99.2 KiB, text/plain)

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2018-02-13: Missing required logs.

#3

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1749040

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

dann frazier (dannf) wrote on 2018-02-13: Re: fails to boot on Cavium ThunderX CRB

#4

I did test w/ the latest artful upload (35.39), and the problem is not reproducing. I don't see anything obvious in the git log that explains it. I'll try rebuilding it in a xenial environment on the off-chance it is a toolchain issue.

dann frazier (dannf) wrote on 2018-02-13:

#5

4.13.0-33.36 from artful boots fine while 4.13.0-33.36~16.04.1 from xenial does not - so possibly a toolchain-related issue.

dann frazier (dannf) on 2018-02-13

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
importance:	Undecided → Critical

Paolo Pisati (p-pisati) wrote on 2018-02-13:

#6

In the meantime linux-generic-hwe-16.04 has moved to 4.13.0.35.54:

$ rmadison -s xenial-proposed linux-generic-hwe-16.04
linux-generic-hwe-16.04 | 4.13.0.35.54 | xenial-proposed | amd64, arm64, armhf, i386, ppc64el, s390x

The board you tested and lundmark are apparently the same:

$ grep "Board Model" console.log lundmark.log
console.log:Board Model: crb-1s
lundmark.log:Board Model: crb-1s
$ grep "SKU" console.log lundmark.log
console.log:SKU: CN8890-2000BG2601-AAP-Y-G
lundmark.log:SKU: CN8890-2000BG2601-AAP-Y-G
$ grep "Machine model" console.log lundmark.log
console.log:[ 0.000000] Machine model: cavium,thunder-88xx
lundmark.log:[ 0.000000] Machine model: cavium,thunder-88xx

and everything works fine on lundmark:

ubuntu@lundmark:~$ uname -a
Linux lundmark 4.13.0-35-generic #39~16.04.1-Ubuntu SMP Mon Feb 12 15:03:44 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

FWIW, i tried the same kernel in an Artful installation (thus compiled with the artful toolchain), and everything works fine there too.

Paolo Pisati (p-pisati) wrote on 2018-02-13:

#7

We are using slightly different toolchains:

$ grep gcc console.log lundmark.log
console.log:[ 0.000000] Linux version 4.13.0-33-generic (buildd@bos02-arm64-023) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.8)) #36~16.04.1-Ubuntu SMP Wed Feb 7 23:37:06 UTC 2018 (Ubuntu 4.13.0-33.36~16.04.1-generic 4.13.13)
lundmark.log:[ 0.000000] Linux version 4.13.0-35-generic (buildd@bos02-arm64-029) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)) #39~16.04.1-Ubuntu SMP Mon Feb 12 15:03:44 UTC 2018 (Ubuntu 4.13.0-35.39~16.04.1-generic 4.13.13)

Your toolchain is 5.4.0-6ubuntu1~16.04.8, mine is 5.4.0-6ubuntu1~16.04.9.

Although the difference shouldn't impact arm64:

gcc-5 (5.4.0-6ubuntu1~16.04.9) xenial-security; urgency=medium

  * Revert retpoline changes of ppc64el as per the recommendation from
    Bill Schmidt of IBM.
    - ppc-add-mspeculate-indirect-jumps: drop.

we should try to build that kernel with that toolchain and see what happens.
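
As a hedged illustration of the toolchain comparison in this comment, the sketch below pulls the kernel release and the GCC package version out of the two banner lines quoted above. The regular expression and the `toolchains` helper are assumptions made for this example, not part of any Ubuntu tooling; the banner strings are copied from the grep output and truncated after the build date.

```python
import re

# Kernel banner lines as quoted in the grep output above
# (the prefix before the first colon is the log file name).
BANNERS = [
    "console.log:[ 0.000000] Linux version 4.13.0-33-generic (buildd@bos02-arm64-023) "
    "(gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.8)) "
    "#36~16.04.1-Ubuntu SMP Wed Feb 7 23:37:06 UTC 2018",
    "lundmark.log:[ 0.000000] Linux version 4.13.0-35-generic (buildd@bos02-arm64-029) "
    "(gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)) "
    "#39~16.04.1-Ubuntu SMP Mon Feb 12 15:03:44 UTC 2018",
]

# Hypothetical pattern: log name, kernel release, and the distro GCC package version.
BANNER_RE = re.compile(
    r"^(?P<log>[^:]+):.*Linux version (?P<kernel>\S+) .*"
    r"\(Ubuntu/Linaro (?P<gcc_pkg>[^)]+)\)"
)


def toolchains(lines):
    """Map log file name -> (kernel release, gcc package version)."""
    found = {}
    for line in lines:
        match = BANNER_RE.search(line)
        if match:
            found[match.group("log")] = (match.group("kernel"), match.group("gcc_pkg"))
    return found


result = toolchains(BANNERS)
for log, (kernel, gcc_pkg) in result.items():
    print(f"{log}: kernel {kernel}, gcc package {gcc_pkg}")
```

The two package versions differ only in the final digit, which is easy to miss when reading the raw banners.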

dann frazier (dannf) wrote on 2018-02-13:

#8

You are correct that seuss and lundmark should be identical. I rebuilt the 4.13.0-35.39 kernel in xenial to see if the issue follows the toolchain (see table below), but I failed to notice the slight difference in versions. Here's the current testing summary - next I'll rebuild 4.13.0-33.36~16.04.1 with the 5.4.0-6ubuntu1~16.04.*9* toolchain as you suggested.

4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.8 | Fails
4.13.0-33.36 | 7.2.0-8ubuntu3.1 | OK
4.13.0-35.39 | 7.2.0-8ubuntu3.2 | OK
4.13.0-35.39 | 5.4.0-6ubuntu1~16.04.9 | OK

dann frazier (dannf) wrote on 2018-02-13:

#9

4.13.0-33.36~16.04.1 built w/ gcc 5.4.0-6ubuntu1~16.04.9 console log (84.7 KiB, text/plain)

A couple more tests - rebuilding the failing kernel w/ the updated GCC also shows a failure. Rebuilding the newer artful-proposed kernel (35.39) w/ the same toolchain does not show a failure.
Here's the current status:

| 4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.8 | Fails |
| 4.13.0-33.36~16.04.1 | 5.4.0-6ubuntu1~16.04.9 | Fails |
| 4.13.0-33.36 | 7.2.0-8ubuntu3.1 | OK |
| 4.13.0-35.39 | 7.2.0-8ubuntu3.2 | OK |
| 4.13.0-35.39 | 5.4.0-6ubuntu1~16.04.9 | OK |
| 4.13.0-35.39~16.04.1 | 5.4.0-6ubuntu1~16.04.9 | OK |
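
One way to read the table above is to group the outcomes by kernel version and by toolchain. The following is a minimal sketch with the rows copied from the table; `outcomes_by` is a hypothetical helper named only for this example.

```python
from collections import defaultdict

# (kernel version, gcc package version, boot result) rows from the table above.
RESULTS = [
    ("4.13.0-33.36~16.04.1", "5.4.0-6ubuntu1~16.04.8", "Fails"),
    ("4.13.0-33.36~16.04.1", "5.4.0-6ubuntu1~16.04.9", "Fails"),
    ("4.13.0-33.36", "7.2.0-8ubuntu3.1", "OK"),
    ("4.13.0-35.39", "7.2.0-8ubuntu3.2", "OK"),
    ("4.13.0-35.39", "5.4.0-6ubuntu1~16.04.9", "OK"),
    ("4.13.0-35.39~16.04.1", "5.4.0-6ubuntu1~16.04.9", "OK"),
]


def outcomes_by(rows, column):
    """Group the set of observed results by the given column (0 = kernel, 1 = gcc)."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row[column]].add(row[2])
    return dict(grouped)


by_kernel = outcomes_by(RESULTS, 0)
by_gcc = outcomes_by(RESULTS, 1)

# Toolchains that produced both a failing and a working kernel.
mixed_gcc = sorted(gcc for gcc, seen in by_gcc.items() if len(seen) > 1)
# Kernels that failed with every toolchain they were built with.
failing_kernels = sorted(k for k, seen in by_kernel.items() if seen == {"Fails"})

print("toolchains with mixed results:", mixed_gcc)
print("kernels that always fail:", failing_kernels)
```

With these rows, the 5.4.0-6ubuntu1~16.04.9 toolchain shows up with both results, while 4.13.0-33.36~16.04.1 is the only kernel that fails with every toolchain tried. That is consistent with the failure following the kernel source rather than the compiler, though six data points are not proof.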

dann frazier (dannf) on 2018-02-15

Changed in linux (Ubuntu Artful):
status:	New → In Progress
importance:	Undecided → Critical
Changed in linux (Ubuntu):
status:	Confirmed → New

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2018-02-15: Missing required logs.

#10

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1749040

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Kleber Sacilotto de Souza (kleber-souza) wrote on 2018-02-20: Re: fails to boot on Cavium ThunderX CRB

#11

Hi,

We have reverted some arm64 patches that were causing boot issues so the system should be able to boot now. Could you please verify if you can boot the system with the latest Artful kernel on -proposed?

Thank you.

dann frazier (dannf) wrote on 2018-02-20: Re: [Bug 1749040] Re: fails to boot on Cavium ThunderX CRB

#12

On Tue, Feb 20, 2018 at 4:05 AM, Kleber Sacilotto de Souza
<email address hidden> wrote:
> Hi,
>
> We have reverted some arm64 patches that were causing boot issues so the
> system should be able to boot now. Could you please verify if you can
> boot the system with the latest Artful kernel on -proposed?

LGTM:
ubuntu@grotrian:~$ cat /proc/version
Linux version 4.13.0-36-generic (buildd@bos02-arm64-021) (gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9)) #40~16.04.1-Ubuntu SMP Fri Feb 16 23:26:28 UTC 2018

dann frazier (dannf) wrote on 2018-02-22: Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

#13

Paolo has respun a new KPTI backport:
https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport

A linux/artful test build is available at:
ppa:p-pisati/arm64-kpti-backport

And a linux-hwe/xenial test build is available at:
ppa:dannf/kpti

We're now in the process of regression testing across platforms, using the Ubuntu cert tests.

summary:

- fails to boot on Cavium ThunderX CRB
+ KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

dann frazier (dannf) wrote on 2018-02-22:

#14

Test results from a Cavium Sabre ThunderX2 system. All failures are expected. (2.3 MiB, text/html)

dann frazier (dannf) wrote on 2018-02-22:

#15

Test results from an HP m400 (X-Gene) system (1.1 MiB, text/html)

1 unexpected failure on the HP m400 - disk/disk_stress_ng_sda. Needs investigation.

dann frazier (dannf) wrote on 2018-02-23:

#16

The failure mentioned in Comment #15 is a test suite bug - see LP: #1751167. I applied a hot-fix, re-ran the test, and it passed.

dann frazier (dannf) wrote on 2018-02-23:

#17

Cavium CRB (ThunderX) results (1.7 MiB, text/html)

dann frazier (dannf) wrote on 2018-02-23:

#18

Gigabyte R120 (ACPI mode ThunderX) (1.8 MiB, text/html)

Manoj Iyer (manjo) wrote on 2018-02-23:

#19

submission_2018-02-23T16.15.23.403621.html (1.7 MiB, text/html)

On QTI QDF2400 I notice errors from xhci-hcd like:

awrep6 login: [18851.092328] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18851.308402] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18851.530820] usb 5-1.4: device not accepting address 6, error -22
[18852.128633] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
[18852.343093] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.

and OOM kills like:

[20783.778840] Out of memory: Kill process 28248 (stress-ng) score 1075 or sacrifice child
[20783.785910] Killed process 28248 (stress-ng) total-vm:8354000kB, anon-rss:6247552kB, file-rss:320kB, shmem-rss:64kB
[20947.665955] Out of memory: Kill process 28310 (stress-ng) score 1066 or sacrifice child
[20947.673020] Killed process 28310 (stress-ng) total-vm:7347264kB, anon-rss:3564612kB, file-rss:356kB, shmem-rss:64kB

I plan to reboot the system and re-run and report back here.

dann frazier (dannf) wrote on 2018-02-23: Re: [Bug 1749040] Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

#20

On Fri, Feb 23, 2018 at 9:24 AM, Manoj Iyer <email address hidden> wrote:
> On QTI QDF2400 I notice errors from xhci-hcd like:
>
> awrep6 login: [18851.092328] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18851.308402] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18851.530820] usb 5-1.4: device not accepting address 6, error -22
> [18852.128633] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.
> [18852.343093] xhci-hcd QCOM8041:02: ERROR: unexpected setup address command completion code 0x11.

Is that also seen when running the current linux-hwe kernel?

> and OOM kills like:
>
> [20783.778840] Out of memory: Kill process 28248 (stress-ng) score 1075 or sacrifice child
> [20783.785910] Killed process 28248 (stress-ng) total-vm:8354000kB, anon-rss:6247552kB, file-rss:320kB, shmem-rss:64kB
> [20947.665955] Out of memory: Kill process 28310 (stress-ng) score 1066 or sacrifice child
> [20947.673020] Killed process 28310 (stress-ng) total-vm:7347264kB, anon-rss:3564612kB, file-rss:356kB, shmem-rss:64kB

I think that's normal, as long as the test passes.

-dann

Manoj Iyer (manjo) wrote on 2018-02-26: Re: KPTI-enabled kernel fails to boot on Cavium ThunderX CRB

#21

D05 HiSilicon (1.5 MiB, text/html)

Test results from D05 Hisilicon board.

dann frazier (dannf) wrote on 2018-02-26:

#22

The virt test for D05 failing is unexpected. It looks like the updated kernel fails to boot as a guest on HiSilicon D05 systems, and the crash is in code that the KPTI patches introduced.

dann frazier (dannf) wrote on 2018-02-27:

#23

I'm able to reproduce the failure with an upstream kernel (4.16-rc3+ @ 6f70eb2b00eb4), running on both the host and guest.

dann frazier (dannf) wrote on 2018-02-27:

#24

4.16-rc3+ guest boot log (17.5 KiB, text/plain)

dann frazier (dannf) wrote on 2018-02-27:

#25

config used to build 4.16-rc3 kernel (213.0 KiB, text/plain)

summary:

- KPTI-enabled kernel fails to boot on Cavium ThunderX CRB
+ KPTI support for arm64 systems

Paolo Pisati (p-pisati) wrote on 2018-02-27:

#26

No need to install 4.16-rc3 in both host & guest: i can reproduce it
on d05-6 using 4.13.0-36-generic #40~16.04.1 on host (no KPTI
patchset) and 4.16.0-rc3+ on guest.

Unfortunately defconfig boots fine, so there's something in that .config that trips it.
If you try to reproduce it, disable CONFIG_DEBUG_INFO to avoid building the kernel modules with debug info and ending up with more than 1 GB of modules.

dann frazier (dannf) wrote on 2018-02-27:

#27

Right - I should've mentioned that - the issue follows the guest kernel for me as well. I tested w/ latest upstream for both just in case the guest failure is somehow a side effect of a host kernel bug.

Also, I should have mentioned that this is an intermittent failure for me. I sometimes have to reboot the guest several times to observe the failure. The test also seems sensitive to changes in the test setup - for example, bumping up the guest cpu count to ~50 made the issue go into hiding.
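
Because the failure is intermittent, a single clean boot says little. A minimal sketch of the "reboot until it fails" approach follows; `boot_once` is a hypothetical callable standing in for whatever actually boots the guest and reports success, and the attempt limit is an arbitrary choice for the example.

```python
def first_failing_attempt(boot_once, max_attempts=5):
    """Return the 1-based attempt on which boot_once() first reported a failed boot.

    Returns None if every attempt booted cleanly.
    """
    for attempt in range(1, max_attempts + 1):
        if not boot_once():
            return attempt
    return None


# Stand-in for a real guest boot: two clean boots, then a failure.
simulated_boots = iter([True, True, False, True, True])
attempt = first_failing_attempt(lambda: next(simulated_boots))
print(attempt)
```

A `None` result only means the failure was not observed within the chosen number of attempts, not that the kernel is good.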

Paolo Pisati (p-pisati) wrote on 2018-02-27:

#28

Yep, if it boots fine, i usually retry 2 or 3 times, and it has happened that it failed only the second or the third time i tried.

Anyhow, now i'm down to this delta .config (the initial one had ~2k entries), so the bug appears to be ACPI related:

-------------------------------------------------------------------------------
CONFIG_ACPI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_SEA=y
CONFIG_ACPI_BGRT=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_CCA_REQUIRED=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_GENERIC_GSI=y
CONFIG_ACPI_GTDT=y
CONFIG_ACPI_HED=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_ACPI_IORT=y
CONFIG_ACPI_MCFG=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_PCI_SLOT=y
-------------------------------------------------------------------------------

How to reproduce it from d05-6 - 192.168.122.14 is my kvm instance:

$ make defconfig
$ cat $abovedelta >> .config
$ make olddefconfig
$ make -j65
$ scp arch/arm64/boot/Image ubuntu@192.168.122.14:.
$ ssh ubuntu@192.168.122.14

and inside the kvm instance:

$ sudo cp Image /boot/
and when rebooting, press Esc at the grub menu, select the "Image" entry previously created[*]

*: in /boot/grub/grub.cfg create a copy of the default Ubuntu instance, but use /boot/Image instead of /boot/vmlinuz-x.y.z, remove "quiet splash" from the default argument and rename it 'Image'
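
The reduction from a large config difference to the short ACPI delta above can be sketched as a comparison of two `.config` files. This is only an illustration: `parse_config` and `config_delta` are hypothetical helpers, the two sample configs are toy inputs rather than the real files, and `# CONFIG_FOO is not set` lines are deliberately ignored.

```python
def parse_config(text):
    """Return {symbol: value} for the CONFIG_*=value lines of a kernel .config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            symbol, value = line.split("=", 1)
            options[symbol] = value
    return options


def config_delta(baseline_text, failing_text):
    """List the options set in the failing config that differ from the baseline."""
    baseline = parse_config(baseline_text)
    failing = parse_config(failing_text)
    return sorted(
        f"{symbol}={value}"
        for symbol, value in failing.items()
        if baseline.get(symbol) != value
    )


# Toy inputs for illustration only, not the configs attached to this bug.
BASELINE = """
CONFIG_ARM64=y
# CONFIG_ACPI_DOCK is not set
"""
FAILING = """
CONFIG_ARM64=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_FAN=y
"""

delta = config_delta(BASELINE, FAILING)
print("\n".join(delta))
```

Halving such a delta and rebuilding after each step is one way to narrow it further, at the cost of several boots per step given the intermittent nature of the failure.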

Paolo Pisati (p-pisati) wrote on 2018-02-28:

#29

Found && fixed.

https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport

I pushed two fixes on top of it:

1) 'syscalls: Use CHECK_DATA_CORRUPTION for addr_limit_user_check' fixes a FTBFS for armhf
2) 'arm64: mm: fix thinko in non-global page table attribute check' fixes this kvm boot race

We probably need to test it again on all arm64 boards now...

dann frazier (dannf) wrote on 2018-02-28: Re: [Bug 1749040] Re: KPTI support for arm64 systems

#30

On Wed, Feb 28, 2018 at 10:12 AM, Paolo Pisati
<email address hidden> wrote:
> Found && fixed.
>
> https://git.launchpad.net/~p-pisati/ubuntu/+source/linux/log/?h=artful-master-next-arm64-kpti-414-backport
>
> I pushed two fixes on top of it:
>
> 1) 'syscalls: Use CHECK_DATA_CORRUPTION for addr_limit_user_check' fixes a FTBFS for armhf
> 2) 'arm64: mm: fix thinko in non-global page table attribute check' fixes this kvm boot race
>
> We probably need to test it again on all arm64 boards now...

OK - I'll get a PPA kernel built and we'll begin another test cycle.

dann frazier (dannf) wrote on 2018-03-01:

#31

Refreshed kernel now in ppa:dannf/kpti

dann frazier (dannf) wrote on 2018-03-01:

#32

Updated test results from an HP m400 (X-Gene) system (1.1 MiB, text/html)

Paolo Pisati (p-pisati) wrote on 2018-03-01:

#33

I see 5 failures marked as blocker there: 4 couldn't complete due to the test environment, and 1[*], apparently, is a CONFIG issue (so something that predates this patchset).

Let's see what the other boards report.

*: https://launchpadlibrarian.net/359043692/submission_2018-03-01T05.58.16.640032.html#7-10-log

Manoj Iyer (manjo) wrote on 2018-03-01:

#34

cert testing on Hisilicon (1.4 MiB, text/html)

Hisilicon D05-0 testing.

Manoj Iyer (manjo) wrote on 2018-03-01:

#35

cert testing on QDF2400 (1.5 MiB, text/html)

cert testing on QDF2400

dann frazier (dannf) wrote on 2018-03-01:

#36

On Thu, Mar 1, 2018 at 8:50 AM, Paolo Pisati <email address hidden> wrote:
> I see 5 failures marked as blocker there: 4 couldn't complete due to the
> test environment, and 1[*], apparently, is a CONFIG issue (so something
> that predates this patchset).

Sorry - I didn't have time to analyze it last night when I posted. All
5 are expected:
  - disk/disk_stress_ng_sda is a test bug (LP: #1751167)
  - ethernet/multi_iperf3_* requires a client/server setup we don't have in place
  - miscellanea/bmc_info - this system doesn't have a conventional BMC
  - miscellanea/efi_boot_mode - this system boots w/ u-boot
  - miscellanea/ipmi_test - this system doesn't have a conventional BMC

dann frazier (dannf) wrote on 2018-03-01:

#37

D05 & QDF2400 errors are all expected, thanks Manoj!

dann frazier (dannf) wrote on 2018-03-02:

#38

sabre test results - all errors expected (2.1 MiB, text/html)

dann frazier (dannf) wrote on 2018-03-02:

#39

thunderx panic console log (90.8 KiB, text/plain)

I've seen a couple of crashes on a ThunderX CRB1S system - though at this point I don't have reason to believe it is related to KPTI. This system happened to be installed with LVM + crypted home, and we haven't run the cert tests on such a config before. After the remaining tests complete, I plan to re-run with the pre-kpti kernel on this config, as well as running cert on the "usual" config w/ a kpti kernel. I was only able to capture the console of one of the crashes; I'll attach it here.

I also have a cert run still in-progress from a Gigabyte R120. This system is very similar to the CRB1S, and has not seen any problems yet.

Manoj Iyer (manjo) wrote on 2018-03-02:

#40

results of cert on merlin (1.3 MiB, text/html)

Merlin system cert results.

Paolo Pisati (p-pisati) wrote on 2018-03-02:

#41

If the bug in #39 is not related to kpti, can you spawn it into a separate LP bug and add a reproducer?

Thanks.

dann frazier (dannf) wrote on 2018-03-02:

#42

On Fri, Mar 2, 2018 at 4:48 AM, Paolo Pisati <email address hidden> wrote:
> If the bug in #39 is not related to kpti, can you spawn it into a
> separate LP bug and add a reproducer?

If it is shown to not be related to KPTI after I run those further
tests, that is the plan.

-dann

dann frazier (dannf) wrote on 2018-03-03:

#43

results from thunderx crb1s (1.4 MiB, text/html)

crb1s results - all expected failures except disk/disk_stress_ng_dm-1 which is what was running when the panic mentioned in comment #39 occurred.

dann frazier (dannf) wrote on 2018-03-03:

#44

Gigabyte R120 (ACPI mode ThunderX) results (1.6 MiB, text/html)

All failures expected.

Paolo Pisati (p-pisati) wrote on 2018-03-05:

#45

Is there a way for me to manually run disk/disk_stress_ng_dm-1?

dann frazier (dannf) wrote on 2018-03-05:

#46

On Mon, Mar 5, 2018 at 2:34 AM, Paolo Pisati <email address hidden> wrote:
> Is there a way for me to manually run disk/disk_stress_ng_dm-1?

I've just started a run on the server/config that failed the test, but
with a pre-kpti kernel, to see if it follows kpti or the
system/config. The command it generated was:

timeout -s 14 1200 stress-ng --aggressive --verify --timeout 240 \
    --temp-path /tmp/disk_stress_ng_7a89ec14-42ad-4a5c-ae7d-41b3293cd7ee \
    --aio 0 --hdd-opts dsync --readahead-bytes 16M -k

dann frazier (dannf) wrote on 2018-03-05:

#47

I was able to reproduce a crash on a crypted LVM system w/ the pre-KPTI kernel. LP: #1753489.

The backtrace isn't identical, but it does show that crypted LVM was fragile even before the KPTI patches.

Paolo Pisati (p-pisati) wrote on 2018-03-05:

#48

Good, any other pending tests?

If not, i'll move forward and send the above v2 arm64 kpti patchset.

dann frazier (dannf) wrote on 2018-03-05:

#49

On Mon, Mar 5, 2018 at 7:47 AM, Paolo Pisati <email address hidden> wrote:
> Good, any other pending tests?

Yes, just one more. I'm going to run the same test on the kpti kernel
on a different system of the same config, but *without* lvm crypt.
Starting that now.

dann frazier (dannf) wrote on 2018-03-06:

#50

kern.log (1.7 MiB, text/x-log; charset="US-ASCII"; name="kern.log")

On Mon, Mar 5, 2018 at 9:06 AM, dann frazier <email address hidden> wrote:
> On Mon, Mar 5, 2018 at 7:47 AM, Paolo Pisati <email address hidden> wrote:
>> Good, any other pending tests?
>
> Yes, just one more. I'm going to run the same test on the kpti kernel
> on a different system of the same config, but *without* lvm crypt.
> Starting that now.

Unfortunately, that locked up. Not in the disk test, but while running:
stress-ng -k --aggressive --verify --timeout 1560 --brk 0
(I kicked off a full cert re-run)

Those stress-ng processes are stuck, blocked on I/O, and commands like
'ps -ef' hang.
The system is otherwise idle.

dann frazier (dannf) wrote on 2018-03-07:

#51

I rebooted the machine (lundmark) and restarted the test mentioned in the previous comment; this time the system crashed w/ a Synchronous External Abort:

[23243.094384] Synchronous External Abort: synchronous parity or ECC error (0x86000018) at 0x0000ffffb0f74f68

This suggests a non-software issue. I'll seek another CRB machine and retest.
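
For readers unfamiliar with the value in parentheses: it is the exception syndrome the kernel printed alongside the abort. A minimal sketch of splitting it into fields follows, assuming the standard ARMv8-A ESR_ELx layout (exception class in bits 31:26, instruction-length bit 25, fault status code in the low six bits of the syndrome); `decode_esr` is a hypothetical helper, not a kernel interface.

```python
def decode_esr(esr):
    """Split a 32-bit ARMv8-A exception syndrome value into a few of its fields."""
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,   # instruction length bit, bit [25]
        "fsc": esr & 0x3F,         # fault status code, bits [5:0]
    }


# Value taken from the log line above.
fields = decode_esr(0x86000018)
print(f"EC={fields['ec']:#x} IL={fields['il']} FSC={fields['fsc']:#x}")
# prints "EC=0x21 IL=1 FSC=0x18"
```

The fault status code 0x18 is the one the kernel labelled "synchronous parity or ECC error" in the log line, which fits the reading that this particular crash points at hardware rather than at the KPTI patches.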

dann frazier (dannf) wrote on 2018-03-07:

#52

combined kern.log/serial log of lundmark SEA crash (1.5 MiB, text/plain)

dann frazier (dannf) wrote on 2018-03-07:

#53

I've reproduced the CRB1S panic with the current (non-kpti) kernel and without crypt lvm (standard MAAS install) and reported bug 1754053 to track it. We can rule that out as being caused by kpti.

dann frazier (dannf) on 2018-03-23

Changed in linux (Ubuntu Artful):
status:	In Progress → Fix Released
Changed in linux (Ubuntu Bionic):
status:	Incomplete → Fix Committed
status:	Fix Committed → Fix Released

	Status	Importance	Assigned to
linux (Ubuntu)	Fix Released	Critical	Unassigned
Artful	Fix Released	Critical	Unassigned
Bionic	Fix Released	Critical	Unassigned
