cpu frequency scaling causes multiple vcpu guests to panic

Bug #246175 reported by Steven Wagner
This bug affects 1 person
Affects: kvm (Ubuntu)
Status: Incomplete
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Binary package hint: kvm

As soon as I turned on cpu frequency scaling, one of my guest domains that was using multiple vcpus suffered a kernel panic.

The dmesg on the host server had repeated lines of this:
[1041427.963898] vcpu not ready for apic_round_robin
[1041427.963900] vcpu not ready for apic_round_robin
[1041427.963902] vcpu not ready for apic_round_robin

The Kernel panic on the guest said:
smp_apic_timer_interrupt...
apic_timer_interrupt...
EIP: insert_work
Kernel panic - not syncing: Fatal exception in interrupt.

I set all the guest domains to use only a single vcpu, and left cpu frequency scaling enabled on the host server. This appears to avoid the issue for the time being.

Changed in kvm:
importance: Undecided → Medium
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Hi,

Can you post the output of:
 $ cat /proc/cpuinfo
 $ lshw -class cpu

We think that your processor is an AMD rev F, which doesn't have a constant TSC frequency, so guests can't see that the host processor frequency has changed. This would be a hardware problem, not a software problem.
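
For anyone who wants to check this on their own host, here is a minimal sketch. It assumes the relevant indicator is the constant_tsc entry on the "flags" line of /proc/cpuinfo (the Intel outputs posted later in this bug list it, the AMD output does not). The function name cpus_missing_flag is made up for illustration.

```python
def cpus_missing_flag(cpuinfo_text, flag="constant_tsc"):
    """Return the processor numbers whose 'flags' line lacks the given flag.

    cpuinfo_text is the full text of /proc/cpuinfo (or a pasted copy of it).
    """
    missing = []
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        if key == "processor":
            current = int(value.strip())
        elif key == "flags" and flag not in value.split():
            missing.append(current)
    return missing
```

On a host, pass it the contents of /proc/cpuinfo, e.g. `cpus_missing_flag(open("/proc/cpuinfo").read())`. An empty list means every listed processor advertises the flag. It says nothing about whether the guest is affected, only what the host CPU reports.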

:-Dustin

Changed in kvm:
status: New → Incomplete
Parag Warudkar (parag-warudkar) wrote :

This happens to me with Jaunty on an Intel Xeon CPU - have never seen it before with Intrepid kernel.
/proc/cpuinfo :

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 6
cpu MHz : 1998.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 4667.02
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power m...

Parag Warudkar (parag-warudkar) wrote :

Mainline kernel 2.6.29 does not seem to have this problem - the same VM runs ok on 2.6.29, while on the Jaunty kernel it gets stuck with the "vcpu not ready for apic_round_robin" error.

BTW, I am not using multiple VCPUs - single cpu Intrepid x86 vm.

Dustin Kirkland  (kirkland) wrote : Re: [Bug 246175] Re: cpu frequency scaling causes multiple vcpu guests to panic

Interesting, thanks for the report...

I'm going to keep my finger on the pulse of this.

I'm seeing some strange behavior on my Jaunty systems, when
cpu-freq-scaling is enabled. I'm still trying to nail it down ...

:-Dustin

agent 8131 (agent-8131) wrote :

I believe I have just run into this bug. I figured now that 9.04 has entered RC status I would upgrade a test server to figure out what problems I would run into if and when upgrading. All of my domains only had 1 VCPU so I don't think that is an issue. The problem is frequency scaling. When frequency scaling was enabled my domains were crashing randomly. Once disabled everything has been stable (at least so far). This is clearly a regression from 8.10 and it would be good to either fix it or make sure that people considering upgrading are aware of the problem. I know those running servers will probably be more cautious in upgrading but I suspect desktop and laptop users might be bitten by this and complain. It would be good to try to isolate this issue and determine to what degree the hardware, kvm, and the Linux kernel contribute or interact to cause it. It would also be good to determine if there are any other workarounds, such as using different kernel timers or options, that would enable users to continue to use cpu frequency scaling.

kvm 1:84+dfsg-0ubuntu10
linux-image-2.6.28-11-server 2.6.28-11.41

Here's my cpu data:

# lshw -class cpu
  *-cpu
       description: CPU
       product: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
       vendor: Advanced Micro Devices [AMD]
       physical id: 4
       bus info: cpu@0
       version: AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
       slot: Socket AM2
       size: 2400MHz
       capacity: 3800MHz
       width: 64 bits
       clock: 200MHz
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp x86-64 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy cpufreq

# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping : 2
cpu MHz : 2400.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 4799.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 75
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping : 2
cpu MHz : 2400.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 4799.56
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 6...

Dustin Kirkland  (kirkland) wrote :

Describe "crashing randomly"...

Is the guest kernel oopsing or panicking?

Is kvm on the host simply segfaulting?

I need a bit more information here to go on.

:-Dustin

agent 8131 (agent-8131) wrote :

Crashing randomly = kernel panics. My apologies for not being more specific. I could turn frequency scaling back on to get the specific messages if you think that would help.

agent 8131 (agent-8131) wrote :

This is a kernel panic I've seen at startup. I assume it's related to this issue but it could be another bug. I've seen lots of errors using libvirt and kvm under jaunty that make me think there might be multiple bugs that need addressing before it can be considered ready for use.

Dustin Kirkland  (kirkland) wrote :

parag & agent 8131-

Could you perhaps try the kvm-source built modules, if you aren't already?

If this is fixed in the upstream kernel, there's a good chance it's fixed in the kvm-source built module, as it's quite a bit more modern than the source in the Ubuntu jaunty kernel.

:-Dustin

agent 8131 (agent-8131) wrote :

Thanks Dustin, I'd say that's looking positive. I used kvm-source, reloaded the modules, and switched the scaling governor to ondemand (none of which required a reboot so it should be easy for others to test as well). Before it was extremely rare for a virtual machine to make it through boot. So far I've been able to boot multiple virtual machines without problem. It's still too early to tell for sure but for the virtual machines to actually start without problem is a definite improvement. I'm also seeing fewer errors in the logs. A few days of continuous operation should give a good indication of the effectiveness of this solution.
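
For others testing this, it helps to confirm which governor each core is actually using before and after the switch. A minimal sketch follows, assuming the usual cpufreq sysfs layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor). The function name read_governors and its sysfs_root parameter are made up for illustration.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (e.g. 'cpu0') to its current scaling governor.

    CPUs without a cpufreq directory (scaling not available) are simply absent
    from the result.
    """
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors
```

Calling `read_governors()` with no arguments reads the live host. An empty result suggests that no cpufreq driver is loaded, in which case frequency scaling is not in play at all.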

Dustin Kirkland  (kirkland) wrote :

agent-

Excellent! I'm going to leave the bug 'incomplete' for now, awaiting
feedback in, say, a week?

:-Dustin

agent 8131 (agent-8131) wrote :

That sounds like a good plan. If this is a solution to this bug I suspect it's worth considering whether kvm should depend on kvm-source instead of just suggesting it.

Alvin (alvind) wrote :

Just adding a different CPU to this bug:

$ sudo lshw -class cpu
  *-cpu:0
       description: CPU
       product: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
       vendor: Intel Corp.
       physical id: 4
       bus info: cpu@0
       version: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
       serial: To Be Filled By O.E.M.
       slot: CPU 0
       size: 2500MHz
       capacity: 3400MHz
       width: 64 bits
       clock: 1333MHz
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
  *-cpu:1
      [...identical...]

The randomly crashing virtual machine has 2 vcpus.

Alvin (alvind) wrote :

I no longer think the CPU model is relevant to this bug. I just copied the virtual machine to another host. (Intel(R) Xeon(R) CPU E5335 @ 2.00GHz). It didn't take long to panic.

Alvin (alvind) wrote :

Tried with 1 vcpu instead of 2. Same error (vcpu not ready for apic_round_robin)

Is there something else I can try without installing other packages?

Alvin (alvind) wrote :

Can someone give me some pointers?

- I would like to test different scaling governors, but due to bug 351159 this is impossible. Is this important?
- How can I see the kvm version? (What's the version of kvm in kvm-source, and if they are both installed, what version is in use?) It looks like they are both kvm-84.

A bit of background information:
The host and guest OS (both Jaunty) were running fine for a few weeks, until we started testing production on the guest. (parsing of lots of XML files that reside on an NFS4 file server.)
While parsing, the guest will freeze. (after 10 minutes, or a few hours) At that moment, the host will log a lot of 'vcpu not ready for apic_round_robin' messages.
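
To put a number on "a lot", the host log can be summarized with a short script. This is a minimal sketch that assumes the lines look exactly like the ones quoted in the bug description ("[timestamp] vcpu not ready for apic_round_robin"). The function name summarize_apic_warnings is made up for illustration.

```python
import re

# Matches lines such as "[1041427.963898] vcpu not ready for apic_round_robin"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+vcpu not ready for apic_round_robin\s*$")


def summarize_apic_warnings(dmesg_text):
    """Return (count, first_timestamp, last_timestamp) for the warning lines.

    Timestamps are the kernel's seconds-since-boot values as floats; both are
    None when no matching line is found.
    """
    stamps = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    if not stamps:
        return (0, None, None)
    return (len(stamps), stamps[0], stamps[-1])
```

Feed it the saved output of dmesg from the host. Comparing the first and last timestamps against the time the guest froze shows whether the messages start before or after the freeze.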

Anthony Liguori (anthony-codemonkey) wrote :

Parag,

Can you open a separate bug report for the error you saw on the Intel system? An Intel processor would not suffer from this particular bug.

Steven Wagner (stevenwagner) wrote :

The error "vcpu not ready for apic_round_robin" can also occur if a guest is timing out for other reasons, such as not having enough RAM allocated. I have examples of this occurring on a hardy 8.04 LTS system with KVM 84.
