Comment 20 for bug 1011792

Stefan Bader (smb) wrote :

Thanks, Mike, for the details. Just to make sure: did you collect the info from the same instance that locked up (either before or after a reboot)? That would ensure that the information about the host really belongs to the host where the problem happened.

As for more details, nothing right now, as we first need to understand more about the problem. But to get a general feeling:
- looking at the same instance type, does it happen on all of them sooner or later or are there exceptions?
- did the same kind of workload run without issues in a previous Ubuntu release, or were those new projects
  starting with Oneiric/Precise?
- Probably more for Matt: are there issues with other Linux distros running comparable kernel versions?
- It might be worth trying a kernel from mainline (http://kernel.ubuntu.com/~kernel-ppa/mainline/).
  Right now I would probably go for a generic 64-bit 3.5.2 and maybe a 3.6-rc2 kernel. Not sure whether
  update-grub in Precise already picks up the generic kernel, so one might need to fiddle with /boot/grub/menu.cfg
  manually after installing the packages.

As Matt wrote above, when looking at the traces in a bit more detail, some CPUs are stuck entering the hypervisor call to wait for a spinlock, while others seem to have come out of that and are trying to wake up some waiters.
@Matt, when you produce those CPU stack traces, how do you do that? Is that from a dump or by somehow tapping into the still running instance?
Right now it is hard to say whether this is a real deadlock or a problem with the delivery of the spinlock event (either the wrong CPU was notified or for some reason the event never happened). For the deadlock case, the types of locks involved can probably be found by checking the backtrace of every CPU, though that could be hard when it comes to locks of individual structures/devices. The event delivery case is not easy to get hold of either.
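
As a rough illustration of the per-CPU backtrace check mentioned above, the Python sketch below sorts CPUs into those waiting for a spinlock and those trying to wake waiters. Everything in it is an assumption for illustration: the input format (a dict mapping CPU number to a list of function names from its backtrace), the helper name classify_cpus, and the symbol names used to recognise each path, which would need to be replaced with whatever the real traces show.

```python
# Hypothetical helper, not part of any existing tool.
# ASSUMPTION: these symbol names mark the waiting and waking paths;
# replace them with the names that actually appear in the traces.
WAIT_SYMBOLS = {"xen_spin_lock_slow", "xen_poll_irq"}
WAKE_SYMBOLS = {"xen_spin_unlock_slow"}


def classify_cpus(traces):
    """Group CPU numbers by the spinlock path seen in their backtrace.

    traces: dict mapping CPU number -> list of function names (assumed format).
    Returns a dict with the keys "waiting", "waking" and "other".
    """
    groups = {"waiting": [], "waking": [], "other": []}
    for cpu in sorted(traces):
        frames = set(traces[cpu])
        if frames & WAIT_SYMBOLS:
            groups["waiting"].append(cpu)
        elif frames & WAKE_SYMBOLS:
            groups["waking"].append(cpu)
        else:
            groups["other"].append(cpu)
    return groups


if __name__ == "__main__":
    # Made-up example data, not taken from this bug report.
    example = {
        0: ["xen_poll_irq", "xen_spin_lock_slow", "_raw_spin_lock"],
        1: ["xen_spin_unlock_slow", "_raw_spin_unlock"],
        2: ["cpu_idle"],
    }
    print(classify_cpus(example))
```

This only sorts the CPUs. Telling a real deadlock from a lost wakeup would still require looking at which lock each waiting CPU is after.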

Our best chance would be to re-create this on an isolated test system. For that I would need some relatively simple-to-follow steps that allow me to create the workload that is causing the issue. Still, I only have an 8-core machine, which I would have to overcommit to 16, and whether that gives the same results...