kernel BUG at /build/buildd/linux-3.13.0/mm/page_alloc.c:968

Bug #1497428 reported by Dan Streetman on 2015-09-18
This bug affects 3 people
Affects          Importance   Assigned to
linux (Ubuntu)   High         Unassigned
Trusty           High         Unassigned

Bug Description

The kernel triggers a BUG when it finds it is in move_freepages() but the start and end pfns for the move are in different zones.

Dan Streetman (ddstreet) on 2015-09-18
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
Dan Streetman (ddstreet) on 2015-09-18
tags: added: trusty

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1497428

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
assignee: Dan Streetman (ddstreet) → nobody
importance: Undecided → Low
Dan Streetman (ddstreet) on 2015-09-21
Changed in linux (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Dan Streetman (ddstreet)
tags: added: needs-apport-collect
removed: sts
Dan Streetman (ddstreet) wrote :

Chris, this bug is for a Canonical STS issue I'm debugging. I'll add more details as I get them.

tags: added: sts
removed: needs-apport-collect

Dan Streetman, ah, never heard of STS so my bad on zapping the tag. Would it be possible to perform an apport-collect on a reference computer this is reproducible with?

Otherwise, nobody can really contribute to this given the current level of detail provided.

Dan Streetman (ddstreet) wrote :

No, I can't add any details just yet, I don't have direct access to the failing system, but I'm working with the reporter to debug it. This bug is currently just a placeholder so I can provide a debug ppa, pad.lv/ppa/ddstreet/lp1497428. It's okay that nobody else can help debug yet, because I'm debugging it :-)

When I have more details I can share, I will add them to the bug. It's quite possible this only requires a backport to trusty from vivid, but I just don't know yet.

no longer affects: linux-lts-trusty (Ubuntu)
Dan Streetman (ddstreet) on 2015-09-22
Changed in linux (Ubuntu Trusty):
assignee: nobody → Dan Streetman (ddstreet)
status: New → In Progress
dave.muysson (dave-muysson) wrote :

Dan, I have run into this issue 4 times over the past few months, on two separate servers running 3.13. I captured the kernel trace output of each occurrence and can post them here if it would help. I have attached the latest one, but there are 3 others I can provide as well.

Environment:
AWS EC2 Virtual Instance: r3.large
Ubuntu lts-trusty 3.13.0-53-generic (and) 3.13.0-45-generic.

Dan Streetman (ddstreet) wrote :

Hi Dave,

Are you able to reproduce the bug? The trace by itself isn't terribly helpful; all it really says is that the pageblock spans zones, which means the move_freepages_block() logic for detecting that failed for some reason. I have a debug kernel ppa here:
pad.lv/ppa/ddstreet/lp1497428

that includes additional debug output if the problem happens (it should also prevent the BUG()). If you can use that kernel to trigger this and send the resulting debug output, it would help very much :-)

When the problem reproduces, in the system log you should see:
page_zone(start_page) !=page_zone(end_page)

and more debug following that. It should not trigger BUG() though, so you may need to check the logs periodically.

Thanks!

Dan,

  I haven’t tried to directly reproduce the bug, but I have a few ideas. If I can free up some time I’ll see if I can reproduce it.

Dan,

  Just checking in with a status update. Our main system experiencing the issue is a production system, so loading the custom kernel wasn’t an option at the time. I have since created a clone of our production server and am trying to reproduce the issue now.

  I will let you know once the issue reoccurs on our cloned environment.

Nelson Elhage (nelhage) wrote :

Hey,

We're also seeing this issue on a production system, and have been hitting it around once a week for a while now. We may be able to boot that test kernel for experimentation purposes – would that still be useful?

Dan Streetman (ddstreet) wrote :

Yep it would definitely be useful to see a repro with the test/debug kernel, thanks!

Diego Andres (drabaioli) wrote :

Hi, I recently had the same issue on an AWS EC2 r3.large instance.
The system log is attached. Hope that helps!

Diego Andres (drabaioli) wrote :

Also, here's some system config that might have some influence on the crash (in particular Transparent Huge Pages):
(cannot attach more than one file):

/etc/rc.local:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
exit 0

Dan Streetman (ddstreet) wrote :

Diego, thanks, although the log doesn't provide any new info, and it's doubtful this is related to hugepages.

Dan Streetman (ddstreet) wrote :

For reference, here's a pasted sample of the Oops (taken from Diego's log above):

[415478.493013] ------------[ cut here ]------------
[415478.496056] kernel BUG at /build/buildd/linux-3.13.0/mm/page_alloc.c:968!
[415478.496056] invalid opcode: 0000 [#1] SMP
[415478.496056] Modules linked in: dm_crypt syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul crc32_pclmul serio_raw ghash_clmulni_intel isofs aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd floppy psmouse ixgbevf
[415478.496056] CPU: 1 PID: 11213 Comm: htop Not tainted 3.13.0-48-generic #80-Ubuntu
[415478.496056] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/06/2015
[415478.496056] task: ffff880037758000 ti: ffff8803c9dbe000 task.ti: ffff8803c9dbe000
[415478.496056] RIP: 0010:[<ffffffff81154714>] [<ffffffff81154714>] move_freepages+0x104/0x110
[415478.496056] RSP: 0018:ffff8803c9dbfbd0 EFLAGS: 00010006
[415478.496056] RAX: ffff8803e08fb000 RBX: 0000000000000000 RCX: 0000000000000001
[415478.496056] RDX: ffffea000f827fc0 RSI: ffffea000f820000 RDI: ffff8803e08fbf00
[415478.496056] RBP: ffff8803c9dbfbd8 R08: ffff8803e08fbf00 R09: 0000000000000000
[415478.496056] R10: 0000000000000000 R11: ffffea000f820920 R12: ffffea000f820900
[415478.496056] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000014
[415478.496056] FS: 00007f18e59b2740(0000) GS:ffff8803e0420000(0000) knlGS:0000000000000000
[415478.496056] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[415478.496056] CR2: 00007f18e59bb000 CR3: 00000001456aa000 CR4: 00000000001406e0
[415478.496056] Stack:
[415478.496056] ffffffff81154793 ffff8803c9dbfc50 ffffffff8115620b ffff8803a9b7f400
[415478.496056] ffff8801576ba128 ffff8803e08fbff0 0000000000000001 ffffea000f820920
[415478.496056] ffff8803e08fbf00 0000000200000001 0000000000000000 0000000000000011
[415478.496056] Call Trace:
[415478.496056] [<ffffffff81154793>] ? move_freepages_block+0x73/0x80
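
As a rough cross-check of the register dump above, the page pointers passed to move_freepages() can be converted back to PFNs. This is only a sketch: it assumes the x86_64 calling convention (RSI and RDX holding the second and third arguments, start_page and end_page), a vmemmap base of 0xffffea0000000000 and a 64-byte struct page. None of those values are stated in this report.

```python
# Assumptions (not stated in the bug report): x86_64 sparsemem-vmemmap layout
# with the vmemmap base at 0xffffea0000000000 and sizeof(struct page) == 64.
VMEMMAP_BASE = 0xFFFFEA0000000000
STRUCT_PAGE_SIZE = 64
PAGEBLOCK_NR_PAGES = 512  # 2M pageblock / 4K pages


def page_to_pfn(page_addr: int) -> int:
    """Convert a struct page address from the Oops into a page frame number."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


# RSI and RDX from the Oops: assumed to be start_page and end_page.
start_pfn = page_to_pfn(0xFFFFEA000F820000)
end_pfn = page_to_pfn(0xFFFFEA000F827FC0)

print(hex(start_pfn), hex(end_pfn))
print(end_pfn - start_pfn + 1 == PAGEBLOCK_NR_PAGES)
```

Under those assumptions the two pointers describe exactly one pageblock, PFNs 0x3e0800 through 0x3e09ff.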

Dan Streetman (ddstreet) wrote :

To clarify the bug, a bit of background is needed first (specific numbers apply only to this situation).

The kernel refers to all pages under a single PMD (midlevel page table) as a "pageblock". It's the same size as a hugepage, 2M. The function triggering the BUG(), move_freepages(), expects the start and end pages to be inside the same zone; that isn't the case here, so the BUG() is triggered. One function up, move_freepages_block() is where the start and end PFNs are set; it takes one page and calculates the start and end PFNs of the aligned pageblock that contains the provided page. It then verifies that both PFNs are inside the original page's zone, and passes the start/end pages to move_freepages().

The problem is that the zone's PFN range is wrong. In this particular case, the zone's memory ends in the middle of a pageblock, which is unusual. So when move_freepages_block() checks if the end PFN of the pageblock is inside the zone (i.e. < zone end PFN), it *should* fail, and cause the function to return. However, it doesn't fail, meaning the zone's end PFN is wrong, and when move_freepages() checks the page_zone() of the start and end pages, they don't match - because the end page isn't valid - and the BUG() is triggered.
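
The check described above can be modelled in a few lines. This is a simplified Python sketch of the logic as described in this comment, not the kernel source: the Zone class, its method name, the return strings and the example zone values are all illustrative, and the separate handling of the start PFN is left out.

```python
from dataclasses import dataclass

PAGEBLOCK_NR_PAGES = 512  # 2M pageblock with 4K pages


@dataclass
class Zone:
    # Illustrative stand-in for the relevant struct zone fields.
    zone_start_pfn: int
    spanned_pages: int

    def spans_pfn(self, pfn: int) -> bool:
        end = self.zone_start_pfn + self.spanned_pages
        return self.zone_start_pfn <= pfn < end


def move_freepages_block(zone: Zone, pfn: int) -> str:
    """Model of the zone-boundary check for the pageblock containing pfn."""
    start_pfn = pfn & ~(PAGEBLOCK_NR_PAGES - 1)
    end_pfn = start_pfn + PAGEBLOCK_NR_PAGES - 1
    if not zone.spans_pfn(end_pfn):
        return "skipped"  # pageblock crosses the zone end, nothing is moved
    return "moved"  # the real code would now call move_freepages()


# Hypothetical zone ending in the middle of a pageblock (end pfn 0x3e0900).
correct = Zone(zone_start_pfn=0x100000, spanned_pages=0x3E0900 - 0x100000)
# Same zone, but with spanned_pages too large by half a pageblock.
wrong = Zone(zone_start_pfn=0x100000, spanned_pages=0x3E0A00 - 0x100000)

print(move_freepages_block(correct, 0x3E0890))
print(move_freepages_block(wrong, 0x3E0890))
```

With the correct span the pageblock is skipped. With the inflated span the check passes, and that is the path on which move_freepages() would then find start and end pages in different zones.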

In my testing, if I manually limit memory to end in the middle of a pageblock, the zone's end PFN is correctly set, so it seems that something is changing the zone PFN range (specifically the zone's spanned_pages value) at runtime - or the particular environment for this bug is different than my test setup and is getting the zone end PFN wrong somehow. I'm going to create a debug module that will jprobe these functions to check for this condition, and then print debug output and avoid the BUG().

As a workaround for this, if the amount of memory is set so that it ends at a multiple of the pageblock size (512 4k pages == 2M), this bug should not happen. On x86, the boot mem= param sets the maximum address, which should allow changing the zone's end pfn to be aligned with pageblock; e.g. if the dmesg e820 output lists the last line of the memory ranges as:

[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000003e08fffff] usable

then the last valid address is 0x3e08fffff (PFN 0x3e08ff), so the zone end address (1 more than the last valid address) is 0x3e0900000 (end PFN 0x3e0900), which isn't a multiple of the pageblock size (2M):

$ echo $[ 0x3e0900000 % (2 * 1024 * 1024) ]
1048576

In this example case, restricting the last 1M of memory by setting mem=0x3e0800000 should work around this bug - although since I can't reproduce it yet, I've no way to verify the workaround; and it may simply cause the bug to appear at a different location.
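
The same arithmetic can be wrapped in a small helper to derive a candidate mem= value from the last usable e820 address. This is a sketch that assumes the 2M pageblock size discussed in this report; the function name is made up for illustration, and like the workaround itself it is untested against the actual bug.

```python
PAGEBLOCK_BYTES = 2 * 1024 * 1024  # 2M pageblock, as in this report


def aligned_mem_limit(last_usable_addr: int) -> int:
    """Round the end of usable memory down to a pageblock boundary."""
    end = last_usable_addr + 1  # first address past the e820 range
    return end - (end % PAGEBLOCK_BYTES)


# Last usable address from the e820 line quoted above.
limit = aligned_mem_limit(0x3E08FFFFF)
print(f"mem={limit:#x}")
```

If the end of memory is already on a pageblock boundary, the helper returns it unchanged.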

Nelson Elhage (nelhage) wrote :

Hi @ddstreet, thanks for the update.

We unfortunately weren't able to reproduce this on your test kernel, and have since moved to a newer kernel version for other reasons.

However, I can confirm that on the affected machine type, and only on the affected machine type, we see a memory range in `/proc/iomem` that does not end on a multiple of 2M. Again, we do not see this on any other machines.

I've attached `/proc/iomem` and `/proc/zoneinfo` from an affected machine (currently running an LTS backport kernel:
Linux [redacted] 3.19.0-33-generic #38~14.04.1-Ubuntu SMP Fri Nov 6 18:17:28 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
)

I'm pretty sure they show the kind of mismatch you're talking about; iomem shows

100000000-f0fffefff : System RAM
f0ffff000-f0fffffff : RAM buffer
f10000000-f17ffffff : System RAM

But the “Node 0 Normal” zone, judging from (start_pfn, start_pfn+spanned) spans 100000000-f18000000. The machine in question is a c4.8xlarge in EC2.

Nelson Elhage (nelhage) wrote :
Dan Streetman (ddstreet) wrote :

The newer kernel may have some change/fix that prevents this bug, as I haven't seen any reports of this (from google, at least) on any other kernel. Plus, there is the unusual requirement that the memory end at an address that is not a multiple of 2M.

> But the “Node 0 Normal” zone, judging from (start_pfn, start_pfn+spanned) spans 100000000-f18000000.
> The machine in question is a c4.8xlarge in EC2.

Awesome, I'll set up a vm with that flavor to see if I can reproduce this, or at least reproduce the problematic zone setup. Thanks!

Chris J Arges (arges) on 2015-12-07
tags: added: kernel-key
Changed in linux (Ubuntu):
importance: Low → High
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
Dan Streetman (ddstreet) wrote :

I booted a c4.8xlarge flavor AWS instance and got the same memory/numa layout as comment 16. To clarify though, the /proc/iomem output isn't representative of the actual memory layout; specifically, the layout is:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000000f0fffefff] usable

and the SRAT divides it into 2 nodes as:

[929310.710905] SRAT: Node 0 PXM 0 [mem 0x00000000-0xefffffff]
[929310.710906] SRAT: Node 0 PXM 0 [mem 0x100000000-0x778efffff]
[929310.710907] SRAT: Node 1 PXM 1 [mem 0x778f00000-0xf0fffffff]

so the node ranges are set up as:

[929310.854161] On node 0 totalpages: 7769757
[929310.854162] DMA zone: 64 pages used for memmap
[929310.854162] DMA zone: 21 pages reserved
[929310.854163] DMA zone: 3997 pages, LIFO batch:0
[929310.854196] mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096
[929310.854265] DMA32 zone: 15296 pages used for memmap
[929310.854266] DMA32 zone: 978944 pages, LIFO batch:31
[929310.854299] mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
[929310.869608] Normal zone: 106044 pages used for memmap
[929310.869611] Normal zone: 6786816 pages, LIFO batch:31
[929310.869647] mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 7835392
[929310.975013] On node 1 totalpages: 7958783
[929310.975018] Normal zone: 124356 pages used for memmap
[929310.975019] Normal zone: 7958783 pages, LIFO batch:31
[929310.975055] mminit::memmap_init Initialising map node 1 zone 2 pfns 7835392 -> 15794175

node 0 DMA and DMA32 ranges are normal, ending at 0x1000 and 0x100000, respectively. The Normal zone for node 0 ends at 0x778f00, and the Normal zone for node 1 ends at 0xf0ffff. Since PAGE_SHIFT is 12 and pageblock_order (with this system config) is (21 - 12 = 9):

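node 0 Normal zone ends at pfn 0x778f00, which is 1M-aligned but falls in the middle of a pageblock (0x778f00 % 0x200 = 0x100), while node 1 Normal zone ends 1 page short of a pageblock boundary (0xf0ffff % 0x200 = 0x1ff).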

Preliminary note: the SRAT table seems to be incorrect; it spans node 1 all the way to 0xf0fffffff, but e820 memory, and the node 1 Normal zone, only reach 0xf0fffefff.
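
The size of that SRAT/e820 discrepancy can be checked with a short calculation. This sketch uses only the two end addresses quoted above and assumes 4K pages and 512-page pageblocks, as elsewhere in this report; the helper name is illustrative.

```python
PAGE_SIZE = 4096
PAGEBLOCK_NR_PAGES = 512

srat_node1_last = 0xF0FFFFFFF  # SRAT: Node 1 PXM 1 [mem ...-0xf0fffffff]
e820_last = 0xF0FFFEFFF        # BIOS-e820: [mem ...-0x0000000f0fffefff] usable


def end_pfn(last_addr: int) -> int:
    """First PFN past a range, given the range's last byte address."""
    return (last_addr + 1) // PAGE_SIZE


missing_pages = end_pfn(srat_node1_last) - end_pfn(e820_last)
print(hex(end_pfn(e820_last)), hex(end_pfn(srat_node1_last)), missing_pages)
print(end_pfn(srat_node1_last) % PAGEBLOCK_NR_PAGES)
print(end_pfn(e820_last) % PAGEBLOCK_NR_PAGES)
```

The two tables differ by exactly one page: the SRAT end falls on a pageblock boundary, while the e820 end is one page short of it.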

dave.muysson (dave-muysson) wrote :

Dan,

  Not sure if this will help or not, but of the 8+ servers we have using the r3.large instance type, the only two that have encountered the issue were running MongoDB on them, launched using the numactl tool with the --interleave=all option set.

Here's the exact launch command used:

exec start-stop-daemon --start --quiet --chuid mongodb --make-pidfile --pidfile /var/run/mongodb.pid --exec /usr/bin/numactl -- --interleave=all /usr/bin/mongod --config /etc/mongodb.conf

  I won't pretend to know how numactl interleaves the memory across the nodes, but I can't help but think high memory usage on these nodes combined with forced interleaving might be why we hit this issue?

  After weeks of stress testing with your custom kernel, I have yet to hit this issue again. The synthetic environment I'm using probably isn't enough to hit this bug. Hopefully your testing with the c4.8xlarge is more helpful.

Dan Streetman (ddstreet) wrote :

> I won't pretend to know how numactl interleaves the memory across the nodes,
> but I can't help but think high memory usage on these nodes combined with
> forced interleaving might be why we hit this issue?

The numactl interleaving just causes memory to be allocated from all nodes on a round-robin basis; I don't think that would cause this, other than through mongod simply using a whole lot of memory.

> After weeks of stress testing with your custom kernel, I have yet to hit this issue again

So, the custom kernel actually bypasses the BUG() call, and logs debug output instead - have you checked your logs to see if there are any relevant messages? You would see output like:

page_zone(start_page) !=page_zone(end_page)

and more debug following it; you can search/grep the logs for "move_freepages".

Dan Streetman (ddstreet) wrote :

Attached is a kernel module that adds debug output for this mm BUG(). This module is for kernel 3.13.0-71-generic only.

Dan Streetman (ddstreet) wrote :

Could anyone who is seeing this problem on the 3.13.0-71-generic kernel please load the module attached above? It initially checks the node/zone start/end locations for validity, and then checks every time move_freepages is called; if it detects that the BUG() will be hit, it prints out debug about the current node/zone values. It doesn't prevent the BUG(), though, so you'll know when the problem reproduces.

Dan Streetman (ddstreet) wrote :

BTW, I've only seen this situation - with a node end pfn not on a pageblock boundary - happen with the AWS flavors "c4.8xlarge" and "m4.10xlarge". If anyone else sees this bug anywhere besides those Amazon AWS instances, please let me know.

Matt Wilson (msw-amazon) wrote :

Dan,

This BUG_ON has been demoted to only trigger when DEBUG_VM is set in upstream:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=97ee4ba7cbd30f1858f0d16911e042737c53f2ef

I'm looking into why there's a one page difference between the E820 tables and SRAT. You're right that there seems to be an off-by-one in one or the other.

Dan Streetman (ddstreet) wrote :

Matt,

I think it's fine that upstream has demoted the BUG_ON, as I haven't heard anyone report this with a kernel later than 3.13; I assume whatever is causing it is fixed in later kernels.

At this point there's not much more I can do, as I can't reproduce it and don't have much debug info on exactly what/why the zone's pfn range becomes incorrect. If you or anyone has any info on how to reproduce this, or you have a system where it's reproducible, please let me know.

tags: added: kernel-da-key
removed: kernel-key
Matt W (wise) wrote :

I can't be sure that we ran into the exact same bug, but Amazon seems to think we may have. I can't find the beginning of the console log, but here's a mid-point that shows the hang:

Host Type: Amazon EC2 r3.8xlarge
OS: Ubuntu 14.04.5
Kernel: 3.13.0-93-generic
Networking: Intel Enhanced Networking driver 2.16.4 (ixgbevf)

Workload: Postgres running with most of the system's memory, but Apache Flume was going a bit haywire at the time, taking ~20-30% of the available CPU (using Oracle Java 7).

[27484.664087] Code: cc cc cc b8 1c 00 00 00 0f 01 c1 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc b8 1d 00 00 00 0f 01 c1 <c3> cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[27504.324077] BUG: soft lockup - CPU#2 stuck for 22s! [java:62266]
[27504.324077] Modules linked in: ipt_REJECT xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_conntrack nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables bcache dm_crypt syscopyarea
[27504.344088] BUG: soft lockup - CPU#3 stuck for 22s! [java:62269]
[27504.344088] Modules linked in: ipt_REJECT xt_multiport nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_conntrack nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables bcache dm_crypt syscopyarea sysfillrect sysimgblt fb_sys_fops serio_raw isofs raid10 raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy ixgbevf(OX)
[27504.344088] CPU: 3 PID: 62269 Comm: java Tainted: G D OX 3.13.0-93-generic #140-Ubuntu
[27504.344088] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016
[27504.344088] task: ffff883c70a31800 ti: ffff8837955e0000 task.ti: ffff8837955e0000
[27504.344088] RIP: 0010:[<ffffffff810013a8>] [<ffffffff810013a8>] xen_hypercall_sched_op+0x8/0x20
[27504.344088] RSP: 0000:ffff8837955e1c60 EFLAGS: 00000202
[27504.344088] RAX: 0000000000000000 RBX: ffff8837955e1c40 RCX: 00000000fffffffa
[27504.344088] RDX: 0000000000000000 RSI: ffff8837955e1c70 RDI: 0000000000000003
[27504.344088] RBP: ffff8837955e1c90 R08: ffff881e1980f800 R09: ffff881e19400470
[27504.344088] R10: 0000000000000019 R11: 8000001161833966 R12: 0000000000001000
[27504.344088] R13: ffff883c70778340 R14: 0000000000000000 R15: 0000000000000000
[27504.344088] FS: 00007f0b9c7d7700(0000) GS:ffff881e19c60000(0000) knlGS:0000000000000000
[27504.344088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27504.344088] CR2: 000000070809b000 CR3: 00000013bdca0000 CR4: 00000000001406e0
[27504.344088] Stack:
[27504.344088] ffffffff81438b2e 0000003b810c1ec7 ffff8837955e1c6c ffffffff00000001
[27504.344088] 0000000000000000 ffff881e19c6afe0 ffff8837955e1ca0 ffffffff8143aab0
[27504.344088] ffff8837955e1ce8 ffffffff81011fa3 0000000000000213 00003eba955e1d40
[27504.344088] Call Trace:
[27504.344088] [<ffffffff81438b2e>] ? xen_poll_irq_timeout+0x3e/0x50
[27504.344088] [<ffffffff8143aab0>] xen_poll_irq+0x10/0x20
[27504.344088] [<ffffffff81011fa3>] xen_lock_spinning+0xa3/0x100
[27504.344088] [<ffffffff81011e61>] __raw_callee_save_xen_lock_spinning+0...

Dan Streetman (ddstreet) wrote :

That is very definitely not the same bug as this bug.

Craig Watcham (craig-watcham) wrote :

Looks like this has been corrected, from a recent c4.8xl launch:
--
[ 0.000000] Linux version 3.13.0-71-generic (buildd@lgw01-09) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #114-Ubuntu SMP Tue Dec 1 02:34:22 UTC 2015 (Ubuntu 3.13.0-71.114-generic 3.13.11-ckt29)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.13.0-71-generic root=UUID=0fbd65e4-a082-4f5a-8392-49add7329657 ro console=tty1 console=ttyS0
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000000f0fffffff] usable
--

Debug module output (for reference):
--
[6409565.542433] lp1497428: module verification failed: signature and/or required key missing - tainting kernel
[6409565.565990] pageblock_nr_pages 0x200
[6409565.568026] node 0 zone 0 info:
[6409565.569761] node 0 zone 0 provided page pfn 0xfff valid 1 present 1
[6409565.572417] node 0 zone 0 start pfn 0xe00 valid 1 present 1
[6409565.574936] node 0 zone 0 end pfn 0xfff valid 1 present 1
[6409565.577453] node 0 zone 0 page ffffea000003ffc0 start_page ffffea0000038000 end_page ffffea000003ffc0
[6409565.581422] node 0 zone 0 spans: provided pfn 1 start pfn 1 end pfn 1
[6409565.584305] node 0 zone 0 start pfn 0x1 spanned pages 0xfff end pfn 0x1000
[6409565.587379] node 0 zone 0 present pages 0xf9d managed pages 0xf88
[6409565.590125] node 0 start pfn 0x1 end pfn 0xf18000
[6409565.592238] node 0 normal pageblock multiple
[6409565.594218] node 0 zone 1 info:
[6409565.595756] node 0 zone 1 provided page pfn 0xfffff valid 0 present 0
[6409565.598642] node 0 zone 1 start pfn 0xffe00 valid 0 present 0
[6409565.601218] node 0 zone 1 end pfn 0xfffff valid 0 present 0
[6409565.603587] node 0 zone 1 page ffffea0003ffffc0 start_page ffffea0003ff8000 end_page ffffea0003ffffc0
[6409565.607428] node 0 zone 1 spans: provided pfn 1 start pfn 1 end pfn 1
[6409565.610296] node 0 zone 1 start pfn 0x1000 spanned pages 0xff000 end pfn 0x100000
[6409565.613730] node 0 zone 1 present pages 0xef000 managed pages 0xea2d3
[6409565.616697] node 0 start pfn 0x1 end pfn 0xf18000
[6409565.619146] node 0 normal pageblock multiple
[6409565.621321] node 0 zone 2 info:
[6409565.623000] node 0 zone 2 provided page pfn 0xf17fff valid 1 present 1
[6409565.625955] node 0 zone 2 start pfn 0xf17e00 valid 1 present 1
[6409565.628808] node 0 zone 2 end pfn 0xf17fff valid 1 present 1
[6409565.631467] node 0 zone 2 page ffffea003c5fffc0 start_page ffffea003c5f8000 end_page ffffea003c5fffc0
[6409565.635675] node 0 zone 2 spans: provided pfn 1 start pfn 1 end pfn 1
[6409565.638791] node 0 zone 2 start pfn 0x100000 spanned pages 0xe18000 end pfn 0xf18000
[6409565.642571] node 0 zone ...

Craig Watcham (craig-watcham) wrote :

m4.10xl also looks good:

[7399124.202570] lp1497428: module verification failed: signature and/or required key missing - tainting kernel
[7399124.223762] pageblock_nr_pages 0x200
[7399124.226007] node 0 zone 0 info:
[7399124.227849] node 0 zone 0 provided page pfn 0xfff valid 1 present 1
[7399124.231173] node 0 zone 0 start pfn 0xe00 valid 1 present 1
[7399124.234120] node 0 zone 0 end pfn 0xfff valid 1 present 1
[7399124.237107] node 0 zone 0 page ffffea000003ffc0 start_page ffffea0000038000 end_page ffffea000003ffc0
[7399124.241777] node 0 zone 0 spans: provided pfn 1 start pfn 1 end pfn 1
[7399124.245160] node 0 zone 0 start pfn 0x1 spanned pages 0xfff end pfn 0x1000
[7399124.248725] node 0 zone 0 present pages 0xf9d managed pages 0xf88
[7399124.251852] node 0 start pfn 0x1 end pfn 0x2818000
[7399124.254459] node 0 normal pageblock multiple
[7399124.256792] node 0 zone 1 info:
[7399124.258611] node 0 zone 1 provided page pfn 0xfffff valid 0 present 0
[7399124.261926] node 0 zone 1 start pfn 0xffe00 valid 0 present 0
[7399124.264980] node 0 zone 1 end pfn 0xfffff valid 0 present 0
[7399124.267859] node 0 zone 1 page ffffea0003ffffc0 start_page ffffea0003ff8000 end_page ffffea0003ffffc0
[7399124.272375] node 0 zone 1 spans: provided pfn 1 start pfn 1 end pfn 1
[7399124.275627] node 0 zone 1 start pfn 0x1000 spanned pages 0xff000 end pfn 0x100000
[7399124.279624] node 0 zone 1 present pages 0xef000 managed pages 0xea2d3
[7399124.283191] node 0 start pfn 0x1 end pfn 0x2818000
[7399124.286066] node 0 normal pageblock multiple
[7399124.288556] node 0 zone 2 info:
[7399124.290734] node 0 zone 2 provided page pfn 0x2817fff valid 1 present 1
[7399124.294354] node 0 zone 2 start pfn 0x2817e00 valid 1 present 1
[7399124.297599] node 0 zone 2 end pfn 0x2817fff valid 1 present 1
[7399124.300895] node 0 zone 2 page ffffea00a05fffc0 start_page ffffea00a05f8000 end_page ffffea00a05fffc0
[7399124.305733] node 0 zone 2 spans: provided pfn 1 start pfn 1 end pfn 1
[7399124.309124] node 0 zone 2 start pfn 0x100000 spanned pages 0x2718000 end pfn 0x2818000
[7399124.313552] node 0 zone 2 present pages 0x1310000 managed pages 0x12bf8e5
[7399124.317134] node 0 start pfn 0x1 end pfn 0x2818000
[7399124.319881] node 0 normal pageblock multiple
[7399124.322452] node 1 zone 2 info:
[7399124.324461] node 1 zone 2 provided page pfn 0x280ffff valid 1 present 1
[7399124.328042] node 1 zone 2 start pfn 0x280fe00 valid 1 present 1
[7399124.331418] node 1 zone 2 end pfn 0x280ffff valid 1 present 1
[7399124.334599] node 1 zone 2 page ffffea00a03fffc0 start_page ffffea00a03f8000 end_page ffffea00a03fffc0
[7399124.339371] node 1 zone 2 spans: provided pfn 1 start pfn 1 end pfn 1
[7399124.342829] node 1 zone 2 start pfn 0x1410000 spanned pages 0x1400000 end pfn 0x2810000
[7399124.347266] node 1 zone 2 present pages 0x1400000 managed pages 0x13af8ab
[7399124.350781] node 1 start pfn 0x1410000 end pfn 0x2810000
[7399124.353662] node 1 normal pageblock multiple

Dan Streetman (ddstreet) on 2017-10-13
Changed in linux (Ubuntu Trusty):
assignee: Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu):
assignee: Dan Streetman (ddstreet) → nobody
Changed in linux (Ubuntu Trusty):
status: In Progress → Triaged
Changed in linux (Ubuntu):
status: In Progress → Triaged