a bad AMI can hang an entire compute node

Bug #960276 reported by Nick Moffitt
This bug affects 2 people
Affects        Status        Importance  Assigned to  Milestone
nova (Ubuntu)  Fix Released  Critical    Unassigned

Bug Description

Booting the attached image (and others) causes the entire compute node to hang between the image booting and networking being configured. The running instance does produce a console ring buffer output file (albeit a problematic one: it often looks as though the instance never got a proper root filesystem, and there are lots of "NO PTY" errors), but the instance is unpingable.

The only way to terminate these instances is to restart nova-compute so that it will collect amqp messages again, and then send the terminate request. This seems suspiciously like the compute code is blocking in a libvirt call of some sort.

The same cluster booted an older Oneiric image with no problems whatsoever.

This can effectively DoS an entire OpenStack installation through nothing more than running instances.

Attached is the amd64 image from http://cloud-images.ubuntu.com/precise/20120319/ which exhibited this problem in our rc1 cloud.

Tags: canonistack
James Troup (elmo)
tags: added: canonistack
Nick Moffitt (nick-moffitt) wrote :

This does not cause libvirtd to hang, by the way. "sudo virsh list" does fine, and I'm able to kill instances manually with virsh destroy.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Changed in nova (Ubuntu):
importance: Undecided → Critical
Adam Gandelman (gandelman-a) wrote :

We've been carrying a nova patch (libvirt-use-console-pipe.patch) to resolve a possible DoS in Bug #832507. I've confirmed that this patch introduces a deadlock somewhere when the serial console gets spammed. Running 'dd if=/dev/urandom of=/dev/ttyS0 bs=1024 count=1500' from within the instance is enough to basically lock nova-compute until the KVM process is killed or nova-compute is restarted.

We either need to fix this patch ASAP or back it out in favor of a different solution for the original Bug #832507. This patch constitutes the biggest delta we maintain across any OpenStack component, and maintaining it so far has required a great deal of effort. The regression it introduces is worse than the original bug, IMO.
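
For illustration only (this is not the code from libvirt-use-console-pipe.patch, and the actual deadlock site has not been identified in this report): one general way a console pipe can stall a process is that a pipe holds only a limited amount of unread data. Once that buffer is full and nothing drains it, a blocking write on the other end never returns. The Python sketch below uses a hypothetical helper, fill_pipe_nonblocking, to measure how many bytes an unread pipe accepts before a write would block. It assumes a POSIX system and Python 3.5 or later.

```python
import os


def fill_pipe_nonblocking(chunk_size=1024):
    """Write to a pipe that nobody reads until the kernel buffer is full.

    Hypothetical helper for illustration. Returns the number of bytes the
    pipe accepted before a further write would have blocked.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode lets us observe the "would block" condition
    # instead of actually hanging, as a blocking writer would.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                # The buffer is full and no reader is draining it.
                # A blocking write would stall at this point.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    capacity = fill_pipe_nonblocking()
    print("pipe accepted %d bytes before a write would block" % capacity)
```

On a typical Linux system the default pipe capacity is 64 KiB, far less than the 1500 KiB written by the dd command above, so any reader that stops draining such a pipe would leave the writer stuck well before dd finishes.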

Changed in nova (Ubuntu):
status: Confirmed → Triaged
Chuck Short (zulcss)
Changed in nova (Ubuntu):
status: Triaged → Fix Committed
Chuck Short (zulcss)
Changed in nova (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
James Page (james-page) wrote :

Fixed in 2012.1~rc2-0ubuntu1
