[Hyper-V] 16.04 kexec-tools doesn't match linux-azure

Bug #1712867 reported by Joshua R. Poulson on 2017-08-24
This bug affects 2 people
Affects                 Importance  Assigned to
crash (Ubuntu)          High        Marcelo Cerri
  Xenial                High        Marcelo Cerri
kexec-tools (Ubuntu)    High        Marcelo Cerri
  Xenial                High        Marcelo Cerri
linux-azure (Ubuntu)    High        Marcelo Cerri
  Xenial                High        Marcelo Cerri

Bug Description

[Impact]

Currently it's not possible to use the kdump functionality in xenial when running the linux-azure kernel. The problem is actually caused by several factors:

1. kexec fails to parse /proc/kcore and thus fails to load the crash kernel. That's similar to bug #1713940 and it's related to 4.10+ kernels.

2. When the crash kernel boots, a bug in KASLR causes it to crash at a very early stage. To the user, it looks as if the system simply rebooted after the crash.

3. Currently in azure, crashkernel=128M is not enough to boot and run the dump procedure with 4.11+ kernels. That value needs to be increased for kdump to succeed.

4. After the vmcore is dumped, the current version of crash in xenial is not able to parse it. All the necessary fixes are already upstream and can be backported.
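
To check how much memory a given kernel command line would reserve for the crash kernel (item 3 above), the crashkernel= value can be resolved against the machine's RAM. The following is a minimal sketch, assuming the documented kernel syntax (a plain size, or a comma-separated list of start-[end]:size ranges, optionally followed by @offset). The function names and the example command lines are hypothetical and are not taken from this bug.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def _size(text):
    """Convert a string such as '128M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def crashkernel_reservation(cmdline, ram_bytes):
    """Return the bytes reserved by crashkernel= for a machine with
    ram_bytes of memory, or 0 if nothing would be reserved."""
    value = None
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            value = token[len("crashkernel="):]  # keep the last occurrence
    if value is None:
        return 0
    value = value.split("@", 1)[0]  # the optional @offset is ignored here
    if ":" not in value:
        return _size(value)  # simple form: crashkernel=192M
    for entry in value.split(","):
        span, size = entry.split(":", 1)
        start, _, end = span.partition("-")
        low = _size(start)
        high = _size(end) if end else None  # an open-ended range has no upper bound
        if ram_bytes >= low and (high is None or ram_bytes < high):
            return _size(size)
    return 0


if __name__ == "__main__":
    # Illustrative command line only; read /proc/cmdline on a real system.
    example = "ro quiet crashkernel=384M-2G:128M,2G-:256M"
    print(crashkernel_reservation(example, 4 << 30) >> 20, "MiB")
```

On a running system the first argument would come from /proc/cmdline, and the result can be compared with the size that turned out to be sufficient in testing.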

[Test Case]

1. Install the linux-azure kernel in an azure instance (although it's possible to run linux-azure on bare metal or kvm, the KASLR issue is only triggered in azure).

2. Follow the instructions in https://help.ubuntu.com/lts/serverguide/kernel-crash-dump.html to set up kdump and manually trigger a crash using /proc/sysrq-trigger.

The vmcore must be generated and it should be possible to inspect it using crash.

3. Perform these same tests for the linux-generic kernel, on each supported architecture.
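
Before triggering the crash in step 2, it can help to confirm that a crash kernel is actually loaded. The sketch below reads the kernel's kexec_crash_loaded and kexec_crash_size files under /sys/kernel and, only when explicitly asked, writes "c" to /proc/sysrq-trigger. The function names and the confirm guard are hypothetical conveniences; the paths are parameters so the code can be exercised against a scratch directory.

```python
from pathlib import Path


def kdump_status(sys_kernel="/sys/kernel"):
    """Report whether a crash kernel is loaded and how much memory is reserved."""
    base = Path(sys_kernel)
    loaded = (base / "kexec_crash_loaded").read_text().strip() == "1"
    reserved = int((base / "kexec_crash_size").read_text().strip())
    return {"crash_kernel_loaded": loaded, "reserved_bytes": reserved}


def trigger_crash(sysrq_trigger="/proc/sysrq-trigger", confirm=False):
    """Write 'c' to the sysrq trigger, which crashes the running kernel.

    Needs root on a real system. The confirm flag exists only to avoid
    crashing a machine by accident.
    """
    if not confirm:
        raise RuntimeError("refusing to crash the machine without confirm=True")
    Path(sysrq_trigger).write_text("c")


if __name__ == "__main__":
    status = kdump_status()
    print(status)
    if not status["crash_kernel_loaded"]:
        print("no crash kernel loaded; a crash would not produce a vmcore")
```

If crash_kernel_loaded is false, the load step failed and triggering a crash will not produce a vmcore.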

[Regression Potential]

Since both kexec-tools and crash are being changed to support 4.10+ kernels, it's very important that they continue to handle 4.4 kernels properly.

The same steps above can be used to test linux-generic for regressions.

[Other Info]

Original description:

--8<--
Because the linux-azure kernel is based on 4.11, kexec on 16.04 gives the following error:
kdump-tools[1436]: ELF core (kcore) parse failed

Perhaps the artful kexec-tools should be backported?
--8<--
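
For anyone who wants to look at what kexec has to parse, the sketch below reads the ELF64 program headers of a kcore-style image and lists the PT_LOAD segments. It is only an illustration of the file layout, not the kexec-tools code. Treating an all-ones p_paddr as "invalid" is an assumption based on the title of the build_mem_phdrs() fix referenced later in this report, and the function name is hypothetical.

```python
import struct

# ELF64 little-endian file header and program header layouts.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")
PT_LOAD = 1
INVALID_PADDR = 0xFFFFFFFFFFFFFFFF  # assumption: sentinel for "no physical address"


def load_segments(data):
    """Return the PT_LOAD program headers found in an ELF64 image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    header = EHDR.unpack_from(data, 0)
    e_phoff, e_phentsize, e_phnum = header[5], header[9], header[10]
    segments = []
    for index in range(e_phnum):
        offset = e_phoff + index * e_phentsize
        (p_type, _flags, _offset, p_vaddr,
         p_paddr, _filesz, p_memsz, _align) = PHDR.unpack_from(data, offset)
        if p_type != PT_LOAD:
            continue
        segments.append({
            "vaddr": p_vaddr,
            "paddr": p_paddr,
            "memsz": p_memsz,
            "paddr_valid": p_paddr != INVALID_PADDR,
        })
    return segments
```

On a real system the input would be a prefix of /proc/kcore read as root, large enough to cover the whole program header table; a shorter buffer makes the unpacking fail with struct.error.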

Joshua R. Poulson (jrp) on 2017-08-24
Changed in linux-azure (Ubuntu):
status: New → Confirmed
Marcelo Cerri (mhcerri) on 2017-08-24
Changed in linux-azure (Ubuntu):
importance: Undecided → High
assignee: nobody → Marcelo Cerri (mhcerri)
Marcelo Cerri (mhcerri) on 2017-08-30
Changed in linux-azure (Ubuntu):
status: Confirmed → In Progress
Marcelo Cerri (mhcerri) on 2017-08-31
Changed in linux-azure (Ubuntu Xenial):
assignee: nobody → Marcelo Cerri (mhcerri)
status: New → In Progress
importance: Undecided → High
Marcelo Cerri (mhcerri) wrote :

Porting the artful kexec-tools to xenial fixes the kcore parse failure and doesn't cause any regressions when used with the regular xenial kernel.

However, although "kdump-config load" doesn't fail anymore, the dump is not generated when using linux-azure. The artful kexec-tools also requires kdump-tools to be ported from artful. Because of that, I backported the necessary fixes into the xenial kexec-tools in order to keep the same kdump-tools version. The result was the same as using artful kexec-tools/kdump-tools and the dump wasn't generated when using linux-azure.

Further investigation is still necessary.

Joshua R. Poulson (jrp) wrote :

Any news? Anything I should help test?

Marcelo Cerri (mhcerri) on 2017-09-08
Changed in kexec-tools (Ubuntu):
status: New → Confirmed
Changed in kexec-tools (Ubuntu Xenial):
status: New → Confirmed
Changed in makedumpfile (Ubuntu Xenial):
status: New → Confirmed
Changed in makedumpfile (Ubuntu):
status: New → Confirmed
Marcelo Cerri (mhcerri) on 2017-09-08
no longer affects: makedumpfile (Ubuntu)
no longer affects: makedumpfile (Ubuntu Xenial)
Marcelo Cerri (mhcerri) wrote :

I was able to properly dump a vmcore with the following steps:

1. kexec-tools patches with:
   - ed15ba1b9977 build_mem_phdrs(): check if p_paddr is invalid
   - 9f62cbddddfc kexec/arch/i386: Add support for KASLR memory randomization
   - dbb99d938810 kexec-tools/x86: get_kernel_vaddr_and_size off-by-one fix

2. crashkernel increased to 192M

3. linux-azure patched with:
   - da63b6b20077 x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails

4. Forced storvsc instead of ata_piix.

With those changes, I was able to consistently dump vmcore images without any issues using several types of azure instances.

Besides that, the crash utility also needs to be updated. Currently the xenial version of crash is not able to parse a linux-azure vmcore, but the artful version is.

I'm running some tests with the artful crash utility and the linux-generic kernel to check if it's viable to bring that version to xenial.

Changed in kexec-tools (Ubuntu):
status: Confirmed → In Progress
Changed in kexec-tools (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in kexec-tools (Ubuntu):
importance: Undecided → High
Changed in kexec-tools (Ubuntu Xenial):
importance: Undecided → High
Changed in kexec-tools (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
Marcelo Cerri (mhcerri) on 2017-09-08
Changed in kexec-tools (Ubuntu Xenial):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in crash (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Marcelo Cerri (mhcerri)
Joshua R. Poulson (jrp) wrote :

Looks good! Glad we're making progress.

Marcelo Cerri (mhcerri) on 2017-09-08
Changed in crash (Ubuntu):
status: Confirmed → In Progress
Marcelo Cerri (mhcerri) wrote :

I was able to make the xenial version of crash work with both the linux-generic and linux-azure kernels by applying the following upstream changes:

Commit: 7e0cb8b516788c7ba1ef9f32556df347ba0da187
Fix for Linux commit 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd, which rewrote the X86_64 switch_to() code by embedding the __switch_to() call inside a new __switch_to_asm() assembly code ENTRY() function. Without the patch, the message "crash: cannot determine thread return address" gets displayed during initialization, and the "bt" command shows frame #0 starting at "schedule" instead of "__schedule". (<email address hidden>)

Commit: 63f7707d2b534bab2a18c52db41daae7e9c5e505
Fix for the "ps -t" option in 3.17 and later kernels that contain commit ccbf62d8a284cf181ac28c8e8407dd077d90dd4b, which changed the task_struct.start_time member from a struct timespec to a u64. Without the patch, the "RUN TIME" value is nonsensical. (<email address hidden>)

Commit: c1eb2b99e2d9201583aac5a664126d83039bddff
Fix for the "irq -s" option for Linux 4.2 and later kernels. Without the patch, the irq_chip.name string (e.g. "IO-APIC", "PCI-MSI", etc.) is missing from the display. (<email address hidden>)

Commit: 76a71fed90c6304110dbce61d6c833543f2f1ac8
Improvement of the accuracy of the allocated objects count for each kmem_cache shown by "kmem -s" in kernels configured with CONFIG_SLUB. Without the patch, the values under the ALLOCATED column may be too large because cached per-cpu objects are counted as allocated. (<email address hidden>)

Commit: 569002249b1d57162a1e94f529d295828d4e0253
When reading a task's task_struct.flags field, check for its size, which was changed from an unsigned long to an unsigned int. (<email address hidden>)

Commit: 10192898cf59b7b4bb102ef39c72ab65bd401471
Fix for Linux 4.8-rc1 commit 500462a9de657f86edaa102f8ab6bff7f7e43fc2, in which Thomas Gleixner redesigned the kernel timer mechanism to switch to a non-cascading wheel. Without the patch, the "timer" command fails with the message "timer: zero-size memory allocation! (called from <address>)" (<email address hidden>)

Commit: df08978f31ba39e94b3096804f4e0776373c8b53
Improvement of the "dev -d" option to display I/O statistics for disks whose device driver uses the blk-mq interface. Currently "dev -d" always displays 0 in all fields for the blk-mq disk because blk-mq does not increment/decrement request_list.count[2] on I/O creation and I/O completion. The following values are used in blk-mq in such situations:
- I/O creation: blk_mq_ctx.rq_dispatched[2]
- I/O completion: blk_mq_ctx.rq_completed[2]
So, we can get the counter of in-progress I/Os as follows:
in progress I/Os == rq_dispatched - rq_completed
This patch displays the result of the above calculation for the disk. It determines that the device driver uses blk-mq if request_queue.mq_ops is not NULL. The "DRV" field is displayed as "N/A(MQ)" if the value for in-flight in the device driver does not exist for blk-mq. (<email address hidden>)

Commit: db552975315fec06a957c937803935d8fbddfd2d
Introduction of a new "bt -v" option that checks the kernel stack of all tasks for evidence of stack overflows. It does so by verifyi...
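
As a small worked illustration of the "dev -d" calculation quoted above (in progress I/Os == rq_dispatched - rq_completed), the sketch below subtracts the two-element counters element by element. Summing over several blk_mq_ctx instances is an assumption made here for illustration, and the function name and input shape are hypothetical; this is not the crash implementation.

```python
def in_progress_ios(ctx_counters):
    """Return the two in-progress I/O counts for a disk.

    ctx_counters is an iterable of (rq_dispatched, rq_completed) pairs,
    one pair per blk_mq_ctx, where each member is a 2-element sequence
    mirroring the rq_dispatched[2] and rq_completed[2] arrays.
    """
    totals = [0, 0]
    for dispatched, completed in ctx_counters:
        for slot in range(2):
            totals[slot] += dispatched[slot] - completed[slot]
    return totals


if __name__ == "__main__":
    # Made-up counter values for two contexts.
    print(in_progress_ios([([10, 4], [8, 4]), ([3, 1], [3, 0])]))
```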


Marcelo Cerri (mhcerri) wrote :

The following packages/versions are available in the PPA https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/azure/ for testing:

- crash - 7.1.4-1ubuntu4.2~rc1
- kexec-tools - 1:2.0.10-1ubuntu2.4~rc1
- linux-azure - 4.11.0-1010.10~rc2
- linux-meta-azure - 4.11.0.1010.10~rc1

Joshua R. Poulson (jrp) wrote :

We are testing and will report soon.

Marcelo Cerri (mhcerri) on 2017-09-14
description: updated

Hello Joshua, or anyone else affected,

Accepted kexec-tools into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/kexec-tools/1:2.0.10-1ubuntu2.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
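
The tag change requested above amounts to swapping one per-series tag for another. A minimal sketch of that bookkeeping, assuming the tags are handled as a plain set of strings (the function name is hypothetical and nothing here talks to Launchpad):

```python
def apply_verification_result(tags, series, fixed):
    """Return a new tag set with verification-needed-<series> replaced by
    verification-done-<series> or verification-failed-<series>."""
    needed = f"verification-needed-{series}"
    outcome = f"verification-{'done' if fixed else 'failed'}-{series}"
    updated = set(tags)
    updated.discard(needed)  # harmless if the tag was already removed
    updated.add(outcome)
    return updated


if __name__ == "__main__":
    current = {"verification-needed", "verification-needed-xenial"}
    print(sorted(apply_verification_result(current, "xenial", fixed=True)))
```

Only the per-series tag is touched, as in the request above; the generic verification-needed tag is left for whoever manages it.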

description: updated
description: updated
Changed in kexec-tools (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Stefan Bader (smb) wrote :

Verified that kexec-tools 1:2.0.10-1ubuntu2.4 on s390x still works (with 4.4 kernel).

Steve Langasek (vorlon) wrote :

Hello Joshua, or anyone else affected,

Accepted crash into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/crash/7.1.4-1ubuntu4.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in crash (Ubuntu Xenial):
status: New → Fix Committed
Stefan Bader (smb) wrote :

Verified on Azure with the 4.11 kernel that the current version (updates) fails to load the kdump kernel and that the new one from proposed loads it successfully.

Marcelo Cerri (mhcerri) wrote :

I also verified that the current version fails to load the kdump kernel with linux-azure (4.11.0-1009.9) and that the version in -proposed fixes the problem. The version in -proposed also works as expected with the 4.4 xenial kernel running in azure and in kvm.

Stefan Bader (smb) wrote :

Additionally, I installed the new crash version on s390x. I can still load dumps produced before the update as well as dumps produced after it (just in case crash is in any way used to produce the dumps, which I do not think is true).

Marcelo Cerri (mhcerri) on 2017-09-15
Changed in kexec-tools (Ubuntu):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu):
status: In Progress → Fix Committed
Marcelo Cerri (mhcerri) wrote :

I have verified kexec-tools and crash with the 4.4 kernel on a ppc64el machine and both packages are working as expected.

Steve Langasek (vorlon) on 2017-09-16
Changed in crash (Ubuntu):
status: In Progress → Fix Released
Changed in kexec-tools (Ubuntu):
status: Fix Committed → Fix Released
Marcelo Cerri (mhcerri) on 2017-09-16
tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package kexec-tools - 1:2.0.10-1ubuntu2.4

---------------
kexec-tools (1:2.0.10-1ubuntu2.4) xenial; urgency=medium

  [Marcelo Henrique Cerri]
  * [Hyper-V] 16.04 kexec-tools doesn't match linux-azure (LP: #1712867)
    - [PATCH] build_mem_phdrs(): check if p_paddr is invalid
    - [PATCH] kexec-tools/x86: get_kernel_vaddr_and_size off-by-one fix
    - Increase crashkernel size to 256M for machines with 2G or more of memory.

 -- Stefan Bader <email address hidden> Tue, 12 Sep 2017 09:12:32 +0200

Changed in kexec-tools (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for kexec-tools has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package crash - 7.1.4-1ubuntu4.2

---------------
crash (7.1.4-1ubuntu4.2) xenial; urgency=medium

  [ Marcelo Henrique Cerri ]
  * [Hyper-V] 16.04 kexec-tools doesn't match linux-azure (LP: #1712867)
    d/p/0011-Fix-for-Linux-commit-0100301bfdf56a2a370c7157b5ab0fb.patch
    d/p/0012-Fix-for-the-ps-t-option-in-3.17-and-later-kernels-th.patch
    d/p/0013-Fix-for-the-irq-s-option-for-Linux-4.2-and-later-ker.patch
    d/p/0014-Improvement-of-the-accuracy-of-the-allocated-objects.patch
    d/p/0015-When-reading-a-task-s-task_struct.flags-field-check-.patch
    d/p/0016-Fix-for-Linux-4.8-rc1-commit-500462a9de657f86edaa102.patch
    d/p/0017-Improvement-of-the-dev-d-option-to-display-I-O-stati.patch
    d/p/0018-Introduction-of-a-new-bt-v-option-that-checks-the-ke.patch
    d/p/0019-Fix-for-Linux-4.9-rc1-commits-15f4eae70d365bba26854c.patch
    d/p/0020-Fix-for-Linux-4.10-commit-7fd8329ba502ef76dd91db561c.patch
    d/p/0021-Prepare-for-the-kernel-s-taint_flag.true-and-taint_f.patch
    d/p/0022-Prevent-the-livepatch-taint-flag-check-during-the-sy.patch

 -- Stefan Bader <email address hidden> Tue, 12 Sep 2017 09:15:25 +0200

Changed in crash (Ubuntu Xenial):
status: Fix Committed → Fix Released
Marcelo Cerri (mhcerri) on 2017-09-19
Changed in crash (Ubuntu):
status: Fix Released → Fix Committed
Changed in crash (Ubuntu Xenial):
importance: Undecided → High
assignee: nobody → Marcelo Cerri (mhcerri)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.11.0-1011.11

---------------
linux-azure (4.11.0-1011.11) xenial; urgency=low

  * linux-azure: 4.11.0-1011.11 -proposed tracker (LP: #1718265)

  * KVP scripts location for linux-azure image (LP: #1718264)
    - SAUCE: azure: hv_kvp_daemon: search for HV scripts in /usr/sbin/

  * [linux-azure] RTC options not present in kernel config (LP: #1718262)
    - [Config] azure: Enable RTC

linux-azure (4.11.0-1010.10) xenial; urgency=low

  * linux-azure: 4.11.0-1010.10 -proposed tracker (LP: #1717616)

  * linux-azure: persistent memory is not working (LP: #1715755)
    - ext4: fix fault handling when mounted with -o dax,ro
    - [Config] azure: CONFIG_ND_BLK=y
    - [Config] azure: CONFIG_ACPI_NFIT=y

  * [Hyper-V] 16.04 kexec-tools doesn't match linux-azure (LP: #1712867)
    - x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails

 -- Marcelo Henrique Cerri <email address hidden> Tue, 19 Sep 2017 15:47:22 -0300

Changed in linux-azure (Ubuntu Xenial):
status: Fix Committed → Fix Released