KVM and CFS bandwidth control causes kernel crashes (oops)

Bug #1458045 reported by Mohammed Naser on 2015-05-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
High
Unassigned

Bug Description

We've seen this crash at least 3 times when we start setting CPU limits using `cgroups`. It makes using CPU limits impossible, causing instabilities in the operating system. Finally, after installing linux-crashdump, we got a full copy of the crash message.

========================================================
[146055.357476] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.361890] PGD 0
[146055.364131] Oops: 0000 [#1] SMP
[146055.366475] Modules linked in: vhost_net vhost macvtap macvlan act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm crct10dif_pclmul crc32_pclmul
[146055.388889] ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core megaraid_sas libahci mdio [last unloaded: ipmi_si]
[146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted 3.16.0-37-generic #51~14.04.1-Ubuntu
[146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4 08/28/2014
[146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti: ffff883a1c168000
[146055.421817] RIP: 0010:[<ffffffff810a7d31>] [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.428079] RSP: 0018:ffff883a1c16bce8 EFLAGS: 00010092
[146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000044aa200
[146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff883ffedf3140
[146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09: 0000000000000001
[146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12: ffff883ffedf3140
[146055.460968] R13: 000000000000001f R14: 0000000000000001 R15: ffff883ffedf30c0
[146055.467722] FS: 00007f404919d700(0000) GS:ffff883ffede0000(0000) knlGS:ffff880002380000
[146055.474756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4: 00000000001427e0
[146055.489134] Stack:
[146055.496412] 0000000000000000 ffff883ffedf3140 000000000000001f ffff883a1c16bd68
[146055.504053] ffffffff810af2f8 ffff883ffedf3140 00000000000130c0 ffff883fcab69e90
[146055.511786] ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895 ffff883fcab6a3c8
[146055.519551] Call Trace:
[146055.527330] [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
[146055.535292] [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
[146055.543379] [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
[146055.551519] [<ffffffff81768afb>] __schedule+0x11b/0x7a0
[146055.559722] [<ffffffff81769579>] _cond_resched+0x29/0x40
[146055.568020] [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
[146055.576509] [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
[146055.585045] [<ffffffff81156952>] ? perf_event_context_sched_in+0xa2/0xc0
[146055.593771] [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
[146055.602531] [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
[146055.611413] [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
[146055.620339] [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
[146055.629396] [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
[146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48 8b 80
[146055.657833] RIP [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.667524] RSP <ffff883a1c16bce8>
[146055.677082] CR2: 0000000000000038
========================================================

I've found the following "potential" fix that doesn't seem to have every made it through: https://lkml.org/lkml/2015/4/7/611

In addition, I have a 12GB dump file generated by linux-crashdump, please let me know if there's anything I can do with it which can help troubleshoot this issue.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1458045

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: utopic
Mohammed Naser (mnaser) on 2015-05-22
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Mohammed Naser (mnaser) wrote :
Download full text (4.5 KiB)

Just wanted to add this as it might be useful (I'm trying to do troubleshooting):

crash> dis -rl ffffffff810af2f8
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4745
0xffffffff810af280 <pick_next_task_fair>: nopl 0x0(%rax,%rax,1)
0xffffffff810af285 <pick_next_task_fair+5>: push %rbp
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4746
0xffffffff810af286 <pick_next_task_fair+6>: lea 0x80(%rdi),%rax
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4745
0xffffffff810af28d <pick_next_task_fair+13>: mov %rsp,%rbp
0xffffffff810af290 <pick_next_task_fair+16>: push %r15
0xffffffff810af292 <pick_next_task_fair+18>: mov %rdi,%r15
0xffffffff810af295 <pick_next_task_fair+21>: push %r14
0xffffffff810af297 <pick_next_task_fair+23>: push %r13
0xffffffff810af299 <pick_next_task_fair+25>: push %r12
0xffffffff810af29b <pick_next_task_fair+27>: push %rbx
0xffffffff810af29c <pick_next_task_fair+28>: sub $0x30,%rsp
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4746
0xffffffff810af2a0 <pick_next_task_fair+32>: mov %rax,-0x58(%rbp)
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4745
0xffffffff810af2a4 <pick_next_task_fair+36>: mov %rsi,-0x48(%rbp)
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 6787
0xffffffff810af2a8 <pick_next_task_fair+40>: mov $0x130c0,%rax
0xffffffff810af2af <pick_next_task_fair+47>: mov %rax,-0x50(%rbp)
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4753
0xffffffff810af2b3 <pick_next_task_fair+51>: mov 0x90(%r15),%edi
0xffffffff810af2ba <pick_next_task_fair+58>: test %edi,%edi
0xffffffff810af2bc <pick_next_task_fair+60>: je 0xffffffff810af318 <pick_next_task_fair+152>
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4756
0xffffffff810af2be <pick_next_task_fair+62>: mov -0x48(%rbp),%rax
0xffffffff810af2c2 <pick_next_task_fair+66>: mov 0x60(%rax),%rax
0xffffffff810af2c6 <pick_next_task_fair+70>: cmp $0xffffffff818147e0,%rax
0xffffffff810af2cc <pick_next_task_fair+76>: je 0xffffffff810af790 <pick_next_task_fair+1296>
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/sched.h: 1160
0xffffffff810af2d2 <pick_next_task_fair+82>: mov -0x48(%rbp),%rsi
0xffffffff810af2d6 <pick_next_task_fair+86>: mov %r15,%rdi
0xffffffff810af2d9 <pick_next_task_fair+89>: callq *0x38(%rax)
0xffffffff810af2dc <pick_next_task_fair+92>: mov -0x58(%rbp),%r12
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4835
0xffffffff810af2e0 <pick_next_task_fair+96>: xor %esi,%esi
0xffffffff810af2e2 <pick_next_task_fair+98>: mov %r12,%rdi
0xffffffff810af2e5 <pick_next_task_fair+101>: callq 0xffffffff810a66e0 <pick_next_entity>
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4836
0xffffffff810af2ea <pick_next_task_fair+106>: mov %r12,%rdi
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4835
0xffffffff810af2ed <pick_next_task_fair+109>: mov %rax,%rbx
/build/buildd/linux-lts-utopic-3.16.0/kernel/sched/fair.c: 4836
0xffffffff810af2f0 <pick_next_task_fair+112>: mov %rax,%rsi
0xffffffff810a...

Read more...

Mohammed Naser (mnaser) wrote :

I suspect the following has occured:

https://github.com/torvalds/linux/blob/v3.16/kernel/sched/fair.c#L4829
 if (!cfs_rq->nr_running)
  goto idle;

 put_prev_task(rq, prev);

 do {
  se = pick_next_entity(cfs_rq, NULL);
  set_next_entity(cfs_rq, se);
  cfs_rq = group_cfs_rq(se);
 } while (cfs_rq);

We can see that `nr_running` is actually set to 0 when checking the dumps

crash> cfs_rq.nr_running ffff883ffedf3140
  nr_running = 0

When following the call order of `put_prev_task`, it actually calls the class-specific `put_prev_task_fair`. This involves calling `put_prev_entity`. As part of that function, it does the following:

 /* throttle cfs_rqs exceeding runtime */
 check_cfs_rq_runtime(cfs_rq);

If we start tracing on what `check_cfs_rq_runtime` does, we can see that it calls `throttle_cfs_rq` which actually manipulates the `nr_running`. If `nr_running` was not 0 but was set to 0 during `put_prev_task`, the system crashes as it attempts to pick an entity (but cannot find any).

This is my first time debugging Kernel issues so this is just what I think so far, but comments are welcome..

Mohammed Naser (mnaser) wrote :

There is currently a pending patch for this issue, I'll update the bug when it lands

http://lkml.iu.edu/hypermail/linux/kernel/1505.3/01029.html
http://lkml.iu.edu/hypermail/linux/kernel/1505.3/01030.html

Thank you

tags: added: bios-outdated-1.2.10
Chris J Arges (arges) wrote :

@mnaser,
I haven't had a ton of time to look into this yet.
Have you tried the following to see if you still get an issue?
1) Running a 4.0+ kernel
2) Patching with https://lkml.org/lkml/2015/4/7/611

Also it may be helpful to know how you are configuring cgroups for your qemu processes in order to run into this issue.
Thanks,

Mohammed Naser (mnaser) wrote :

Chris,

There's been a few other proposed patches, but my biggest issue now is trying to replicate it over here before trying to look at patches, or else there's no way for me to say "this works".

I'm working on something to try and replicate it today and I'll keep this bug updated.. I hope it goes through well.

Thanks
Mohammed

Changed in linux (Ubuntu):
importance: Undecided → High

Mohammed Naser, did you personally test the patch in #7 and confirm it fixes your issue?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers