Insufficient resources (memory) cause instance to remain in pending state forever

Bug #698336 reported by Tushar Patil
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Brian Lamar
Milestone: 2012.1

Bug Description

I launched an instance knowing that the compute node did not have sufficient RAM available. Errors show up in nova-compute.log and the instance remains in the pending state forever.

nova-compute.log
-------------------------------------------------------------
DEBUG:root:instance 1: starting...
DEBUG:root:Running cmd (subprocess): ifconfig vlan101
DEBUG:root:Result was 1
DEBUG:root:Starting VLAN inteface vlan101
DEBUG:root:Running cmd (subprocess): sudo vconfig set_name_type VLAN_PLUS_VID_NO_PAD
DEBUG:root:Running cmd (subprocess): sudo vconfig add eth0 101
DEBUG:root:Running cmd (subprocess): sudo ifconfig vlan101 up
DEBUG:root:Running cmd (subprocess): ifconfig br101
DEBUG:root:Result was 1
DEBUG:root:Starting Bridge interface for vlan101
DEBUG:root:Running cmd (subprocess): sudo brctl addbr br101
DEBUG:root:Running cmd (subprocess): sudo brctl setfd br101 0
DEBUG:root:Running cmd (subprocess): sudo brctl stp br101 off
DEBUG:root:Running cmd (subprocess): sudo brctl addif br101 vlan101
DEBUG:root:Running cmd (subprocess): sudo ifconfig br101 up
DEBUG:root:Running cmd (subprocess): sudo iptables --delete FORWARD --in-interface br101 -j ACCEPT
DEBUG:root:Result was 1
DEBUG:root:Running cmd (subprocess): sudo iptables -I FORWARD --in-interface br101 -j ACCEPT
DEBUG:root:Running cmd (subprocess): sudo iptables --delete FORWARD --out-interface br101 -j ACCEPT
DEBUG:root:Result was 1
DEBUG:root:Running cmd (subprocess): sudo iptables -I FORWARD --out-interface br101 -j ACCEPT
DEBUG:root:instance instance-1165315330: starting toXML method
DEBUG:root:instance instance-1165315330: finished toXML method
DEBUG:root:Running cmd (subprocess): mkdir -p /home/tpatil/nova/nova/..//instances/instance-1165315330/
DEBUG:root:Running cmd (subprocess): chmod 0777 /home/tpatil/nova/nova/..//instances/instance-1165315330/
INFO:root:instance instance-1165315330: Creating image
DEBUG:root:Running cmd (subprocess): /usr/bin/curl --fail --silent http://10.2.3.150:3333/_images/ami-tiny/image -H "Date: Thu, 06 Jan 2011 21:49:39 GMT" -H "Authorization: AWS admin:admin:tg82sggZV1c1dPYNPK3cL49wDX0=" -o /home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw
DEBUG:root:Running cmd (subprocess): /usr/bin/curl --fail --silent http://10.2.3.150:3333/_images/aki-lucid/image -H "Date: Thu, 06 Jan 2011 21:49:41 GMT" -H "Authorization: AWS admin:admin:sOXhKpkknKQk4aJ+LKf1BMyY+Qo=" -o /home/tpatil/nova/nova/..//instances/instance-1165315330/kernel
DEBUG:root:Running cmd (subprocess): /usr/bin/curl --fail --silent http://10.2.3.150:3333/_images/ari-lucid/image -H "Date: Thu, 06 Jan 2011 21:49:41 GMT" -H "Authorization: AWS admin:admin:jnZiGj8T1zmMh9JyZxbzde+UUeo=" -o /home/tpatil/nova/nova/..//instances/instance-1165315330/ramdisk
INFO:root:instance instance-1165315330: injecting key into image ami-tiny
DEBUG:root:Running cmd (subprocess): sudo losetup --find --show /home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw
DEBUG:root:Running cmd (subprocess): sudo tune2fs -c 0 -i 0 /dev/loop2
DEBUG:root:Running cmd (subprocess): sudo mount /dev/loop2 /tmp/tmpIejhuu
DEBUG:root:Running cmd (subprocess): sudo mkdir -p /tmp/tmpIejhuu/root/.ssh
DEBUG:root:Running cmd (subprocess): sudo chown root /tmp/tmpIejhuu/root/.ssh
DEBUG:root:Running cmd (subprocess): sudo chmod 700 /tmp/tmpIejhuu/root/.ssh
DEBUG:root:Running cmd (subprocess): sudo tee -a /tmp/tmpIejhuu/root/.ssh/authorized_keys
DEBUG:root:Running cmd (subprocess): sudo umount /dev/loop2
DEBUG:root:Running cmd (subprocess): rmdir /tmp/tmpIejhuu
DEBUG:root:Running cmd (subprocess): sudo losetup --detach /dev/loop2
DEBUG:root:Running cmd (subprocess): dd if=/dev/zero of=/home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw count=1 seek=20971519 bs=512
DEBUG:root:Running cmd (subprocess): e2fsck -fp /home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw
DEBUG:root:Result was 1
DEBUG:root:Running cmd (subprocess): resize2fs /home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw
DEBUG:root:Running cmd (subprocess): dd if=/dev/zero of=/home/tpatil/nova/nova/..//instances/instance-1165315330/disk count=1 seek=62 bs=512
DEBUG:root:Running cmd (subprocess): parted --script /home/tpatil/nova/nova/..//instances/instance-1165315330/disk mklabel msdos
DEBUG:root:Running cmd (subprocess): dd if=/home/tpatil/nova/nova/..//instances/instance-1165315330/disk-raw of=/home/tpatil/nova/nova/..//instances/instance-1165315330/disk bs=268435456 conv=notrunc,fsync oflag=append
DEBUG:root:Running cmd (subprocess): parted --script /home/tpatil/nova/nova/..//instances/instance-1165315330/disk mkpart primary 63s 20971582s
DEBUG:root:Running cmd (subprocess): dd if=/dev/zero of=/home/tpatil/nova/nova/..//instances/instance-1165315330/disk count=1 seek=188743742 bs=512
DEBUG:root:Running cmd (subprocess): parted --script /home/tpatil/nova/nova/..//instances/instance-1165315330/disk mkpartfs primary ext2 20971583s 188743742s
libvir: QEMU error : internal error process exited while connecting to monitor: qemu: at most 2047 MB RAM can be simulated
ERROR:root:Uncaught exception
Traceback (most recent call last):
  File "/home/tpatil/nova/nova/exception.py", line 83, in _wrap
    return f(*args, **kw)
  File "/home/tpatil/nova/nova/virt/libvirt_conn.py", line 355, in spawn
    self._conn.createXML(xml, 0)
  File "/usr/lib/python2.6/dist-packages/libvirt.py", line 1289, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error process exited while connecting to monitor: qemu: at most 2047 MB RAM can be simulated

ERROR:root:instance instance-1165315330: Failed to spawn
Traceback (most recent call last):
  File "/home/tpatil/nova/nova/compute/manager.py", line 146, in run_instance
    self.driver.spawn(instance_ref)
  File "/home/tpatil/nova/nova/exception.py", line 89, in _wrap
    raise Error(str(e))
Error: internal error process exited while connecting to monitor: qemu: at most 2047 MB RAM can be simulated

libvir: QEMU error : Domain not found: no domain with matching name 'instance-1165315330'

Thierry Carrez (ttx) wrote:

With the current code, the instance apparently goes to the "failed to spawn" state. Are we supposed to recover from that? How do we handle the case where the scheduler sends the request to a host that is already full (vCPU, memory, ...)?
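
For illustration only, a minimal sketch of the kind of memory check a scheduler could apply before dispatching. The host dicts and the 'free_ram_mb' key are assumptions for the example, not nova's actual scheduler code:

    # Illustrative sketch only -- not nova's actual scheduler code.  Each
    # host is modelled as a plain dict with a 'free_ram_mb' key.

    def hosts_with_enough_ram(hosts, requested_ram_mb):
        """Return only the hosts that can fit an instance of this size."""
        return [h for h in hosts if h['free_ram_mb'] >= requested_ram_mb]

    # Example: only host2 can fit a 2048 MB instance.
    hosts = [{'name': 'host1', 'free_ram_mb': 512},
             {'name': 'host2', 'free_ram_mb': 8192}]
    print(hosts_with_enough_ram(hosts, 2048))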

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Brian Lamar (blamar) wrote:

Currently it seems the compute manager throws away VMs that are in a failed state, which is even worse than what you described a month ago.

The '_poll_instance_states' method in nova/compute/manager.py will set the power_state of any VM which:
    + has a database entry
    + does not have a corresponding VM instance on the hypervisor
...to power_state.SHUTOFF.

The side effect is that every time the compute manager polls for instance status, it sets all power_state.FAILED VMs to power_state.SHUTOFF in the database and then calls self.db.instance_destroy, effectively removing all history of the failed VM (sketched below).

In an async system like nova, the failed VM should stick around IMO until it has been explicitly deleted. No automatic retries or anything else for the time being. That way a scheduler can get the 202 Accepted response, and then poll GET /servers/<id> until it sees a successful/error status.
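
To make that reconciliation path concrete, here is a minimal, hypothetical sketch of the behaviour described above and the guard the fix would add. The names (poll_instance_states, set_power_state, destroy_instance, the FAILED/SHUTOFF strings) are illustrative stand-ins, not the real nova API:

    # Hypothetical, simplified model of the reconciliation described above.
    # The names (poll_instance_states, set_power_state, destroy_instance,
    # the FAILED/SHUTOFF strings) are stand-ins, not the real nova API.

    FAILED = 'failed'
    SHUTOFF = 'shutoff'

    def poll_instance_states(db_instances, hypervisor_vm_names, db):
        """db_instances: instance records (dicts) from the database.
        hypervisor_vm_names: names of VMs the hypervisor actually knows.
        db: stand-in exposing set_power_state() and destroy_instance().
        """
        for instance in db_instances:
            if instance['name'] in hypervisor_vm_names:
                continue  # Backing VM exists; nothing to reconcile.
            # Current behaviour: a DB entry with no backing VM is marked
            # SHUTOFF and destroyed, which also erases failed spawns.
            # Proposed behaviour: skip instances already marked FAILED so
            # the failure stays visible until explicitly deleted.
            if instance['power_state'] == FAILED:
                continue
            db.set_power_state(instance['id'], SHUTOFF)
            db.destroy_instance(instance['id'])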

What my branch/fix covers:
    1) Ensures '_poll_instance_states' does not clear out VMs that didn't spawn correctly (a sketch of the intended shape follows the next list)
    2) Prettifies libvirt createXML error

What it doesn't cover:
    1) A method of getting the error back to the user/scheduler. You'll be able to see that the VM failed, but without looking at the logs it will be impossible to tell why. This should be covered by the 'error-codes' blueprint, which will hopefully land in Diablo.
    2) Automated tests for a scenario such as this
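
For illustration, a minimal sketch of the shape item 1 of the fix could take: record an error state when spawn fails instead of leaving the instance pending. This is a hypothetical sketch, not the merged patch; 'driver', 'db' and the ERROR marker are stand-ins:

    # Hypothetical sketch, not the merged patch.  'driver', 'db' and the
    # ERROR marker stand in for the real compute driver, database layer
    # and error state.

    ERROR = 'error'

    def run_instance(driver, db, instance):
        """Spawn an instance; on failure record an error state instead of
        leaving it stuck in 'pending' or letting a periodic poll destroy it.
        """
        try:
            driver.spawn(instance)
        except Exception:
            # Persist the failure so a user or scheduler polling
            # GET /servers/<id> can see that the spawn went wrong.
            db.set_power_state(instance['id'], ERROR)
            raise  # Still surface the original error in the logs.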

Thierry Carrez (ttx)
Changed in nova:
assignee: nobody → Brian Lamar (blamar)
status: Confirmed → In Progress
Thierry Carrez (ttx) wrote:

@Brian: are you actually working on this? Or should we unassign you to let someone else take it?

Brian Lamar (blamar)
Changed in nova:
assignee: Brian Lamar (blamar) → nobody
Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Confirmed
Aaron Lee (aaron-lee) wrote:

Is this still relevant? We recently added functionality to mark an instance that failed to start due to insufficient memory as being in the ERROR state. Also, _poll_instance_states no longer seems to exist.

Openstack Gerrit (openstack-gerrit) wrote: A change has been merged to openstack/nova

Reviewed: https://review.openstack.org/679
Committed: http://github.com/openstack/nova/commit/981f52794ed41b6f25dfc4a25b4b736e8f030a0f
Submitter: Jenkins
Branch: master

 status fixcommitted
 done

commit 981f52794ed41b6f25dfc4a25b4b736e8f030a0f
Author: Brian Lamar <email address hidden>
Date: Mon Sep 26 22:02:34 2011 -0400

    Set error state on spawn error + integration test.

    This branch should at least be considered a partial fix for bug 698336.

    (Update) Grammar fix.
    (Update) PEP8 fix.
    (Update) Merged with origin/master
    (Update) Fixed test (oops!) thanks comstud!

    Change-Id: I10d607fd40953e334670cc39040a9a00ff6919ac

Changed in nova:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → essex-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Mark McLoughlin (markmc)
Changed in nova:
assignee: nobody → Brian Lamar (blamar)
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-1 → 2012.1