Instance is rescheduled infinitely when boot fails

Bug #1326207 reported by Alex Xu
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Alex Xu
Milestone: 2014.2

Bug Description

When I boot an instance that fails because the image's virtual size is bigger than the flavor's root disk size, nova begins to reschedule the instance infinitely.

$ glance index
ID Name Disk Format Container Format Size
------------------------------------ ------------------------------ -------------------- -------------------- --------------
0d366543-54de-48dd-8f49-d62cac6e7c6b Fedora-x86_64-20-20131211.1-sd qcow2 bare 214106112

$ nova boot --flavor 1 --image 0d366543-54de-48dd-8f49-d62cac6e7c6b --nic net-id=c2d921a0-1474-4731-a8b5-c1cc8b9069f4 vm1
+--------------------------------------+------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | - |
| OS-EXT-SRV-ATTR:hypervisor_hostname | - |
| OS-EXT-SRV-ATTR:instance_name | instance-0000002b |
| OS-EXT-STS:power_state | 0 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | - |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| adminPass | 8a7ru8z7bUQ6 |
| config_drive | |
| created | 2014-06-04T03:24:38Z |
| flavor | m1.tiny (1) |
| hostId | |
| id | cd5a4327-d3ee-45bb-95fa-da7817b3b0b9 |
| image | Fedora-x86_64-20-20131211.1-sda (0d366543-54de-48dd-8f49-d62cac6e7c6b) |
| key_name | - |
| metadata | {} |
| name | vm1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | BUILD |
| tenant_id | ac99aeb3fbb24494b3bbb68a8e94cbff |
| updated | 2014-06-04T03:24:38Z |
| user_id | 2c08e17d6fb54dea96cc6d70ba0ee75f |
+--------------------------------------+------------------------------------------------------------------------+

The first error, at the compute node:

2014-06-04 11:24:58.121 ERROR nova.virt.libvirt.imagebackend [req-45fbcfec-7440-4193-9caa-53b9322e440c admin admin] /opt/stack/data/nova/instances/_base/d1a823d9305ec292d5f3e13da6755ee8f057b01e virtual size 2147483648 larger than flavor root disk size 1073741824
2014-06-04 11:24:58.121 ERROR nova.compute.manager [req-45fbcfec-7440-4193-9caa-53b9322e440c admin admin] [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] Instance failed to spawn
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] Traceback (most recent call last):
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 2059, in _build_resources
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] yield resources
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 1962, in _build_and_run_instance
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] block_device_info=block_device_info)
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/driver.py", line 2281, in spawn
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] admin_pass=admin_password)
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/driver.py", line 2655, in _create_image
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] project_id=instance['project_id'])
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 192, in cache
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] *args, **kwargs)
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 386, in create_image
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] self.verify_base_size(base, size)
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 243, in verify_base_size
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] raise exception.FlavorDiskTooSmall()
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] FlavorDiskTooSmall: Flavor's disk is too small for requested image.
2014-06-04 11:24:58.121 TRACE nova.compute.manager [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9]
2014-06-04 11:24:58.122 ERROR root [req-45fbcfec-7440-4193-9caa-53b9322e440c admin admin] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 2059, in _build_resources\n yield resources\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 1962, in _build_and_run_instance\n block_device_info=block_device_info)\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/driver.py", line 2281, in spawn\n admin_pass=admin_password)\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/driver.py", line 2655, in _create_image\n project_id=instance[\'project_id\'])\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 192, in cache\n *args, **kwargs)\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 386, in create_image\n self.verify_base_size(base, size)\n', ' File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/virt/libvirt/imagebackend.py", line 243, in verify_base_size\n raise exception.FlavorDiskTooSmall()\n', "FlavorDiskTooSmall: Flavor's disk is too small for requested image.\n"]

The second error:

2014-06-04 11:24:58.174 DEBUG nova.compute.utils [req-45fbcfec-7440-4193-9caa-53b9322e440c admin admin] [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] 'NetworkInfo' object has no attribute 'wait' from (pid=26712) notify_about_instance_usage /media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/utils.py:291
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] Traceback (most recent call last):
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 1966, in _build_and_run_instance
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] network_info=network_info)
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] self.gen.throw(type, value, traceback)
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] File "/media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py", line 2065, in _build_resources
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] network_info.wait(do_raise=False)
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] AttributeError: 'NetworkInfo' object has no attribute 'wait'
2014-06-04 11:24:58.174 TRACE nova.compute.utils [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9]
2014-06-04 11:24:58.175 DEBUG nova.compute.manager [req-45fbcfec-7440-4193-9caa-53b9322e440c admin admin] [instance: cd5a4327-d3ee-45bb-95fa-da7817b3b0b9] Build of instance cd5a4327-d3ee-45bb-95fa-da7817b3b0b9 was re-scheduled: 'NetworkInfo' object has no attribute 'wait' from (pid=26712) do_build_and_run_instance /media/soulxu1404/da43e419-d334-4dae-9274-04ce554137d9/opt/stack/nova/nova/compute/manager.py:1895

The third problem is that nova reschedules the instance infinitely.

For the first error, the nova API checks the image size against the flavor's disk size, but that image size is the actual (on-disk) size, which differs from the virtual size. The libvirt driver checks the virtual size.
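
As an editorial illustration, here is a minimal sketch of the distinction, assuming qemu-img is available on the host; the helper names are hypothetical, not nova's actual code:

    # Hypothetical sketch: the API effectively compares the on-disk
    # (actual) size, while the libvirt driver compares the virtual size.
    import json
    import subprocess

    def disk_sizes(path):
        """Return (actual_size, virtual_size) in bytes via qemu-img."""
        out = subprocess.check_output(
            ['qemu-img', 'info', '--output=json', path])
        info = json.loads(out)
        return info['actual-size'], info['virtual-size']

    def fits_flavor(path, flavor_root_gb):
        actual, virtual = disk_sizes(path)
        # A 214 MB qcow2 file can declare a 2 GB virtual size, so a check
        # against 'actual' passes even though the driver later rejects
        # the virtual size with FlavorDiskTooSmall.
        return virtual <= flavor_root_gb * 1024 ** 3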

The image properties are shown below:
$ glance show 01efb526-8046-40d0-b276-ba32e570c965
URI: http://cloudcontroller:9292/v1/images/01efb526-8046-40d0-b276-ba32e570c965
Id: 01efb526-8046-40d0-b276-ba32e570c965
Public: Yes
Protected: No
Name: cirros-0.3.2-x86_64-uec
Status: active
Size: 25165824
Disk format: ami
Container format: ami
Minimum Ram Required (MB): 0
Minimum Disk Required (GB): 0
Owner: ac99aeb3fbb24494b3bbb68a8e94cbff
Property 'kernel_id': efa41ea7-22f0-4e08-82b7-74eb0368b28f
Property 'ramdisk_id': eb736431-a77b-4994-ad2b-e090c590e57a
Created at: 2014-05-27T07:34:27
Updated at: 2014-05-27T07:34:27

The nova API checks the 'Minimum Disk Required' value, so the user should specify 'Minimum Disk Required' for the image. For that reason I don't think this part is a bug in nova.
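
Roughly, the API-side check looks like the following sketch (the exception and image-dict layout are simplified illustrations, not nova's exact code):

    # Hypothetical sketch of the min_disk validation done in the API.
    class FlavorDiskTooSmall(Exception):
        pass

    def validate_min_disk(image, flavor):
        # min_disk is in GB; when it is left at 0 (as in the image
        # above), the API cannot reject an oversized virtual disk and
        # the failure only surfaces later in the libvirt driver.
        if flavor['root_gb'] < image.get('min_disk', 0):
            raise FlavorDiskTooSmall()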

For the second error, the network_info object is used in the wrong way; this should be fixed.
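
The straightforward guard, sketched below, only calls wait() on the async wrapper type (the import path is assumed from nova.network.model of that era):

    # Sketch: NetworkInfoAsyncWrapper has wait(); the plain NetworkInfo
    # returned on a reschedule (network already allocated) does not.
    from nova.network.model import NetworkInfoAsyncWrapper

    def finish_network_allocation(network_info):
        if isinstance(network_info, NetworkInfoAsyncWrapper):
            # Make sure the async allocation has completed.
            network_info.wait(do_raise=False)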

For the third error, after the change that moved building instances to the conductor, the retry info is missing. We should fix this as well.
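
Conceptually, the retry bookkeeping the conductor needs looks like this sketch (the names follow the scheduler's populate_retry convention but are simplified here):

    # Hypothetical sketch of retry tracking in filter_properties.
    class MaxRetriesExceeded(Exception):
        pass

    def populate_retry(filter_properties, instance_uuid, max_attempts):
        if max_attempts == 1:
            return  # rescheduling disabled; leave no retry info behind
        retry = filter_properties.setdefault(
            'retry', {'num_attempts': 0, 'hosts': []})
        retry['num_attempts'] += 1
        if retry['num_attempts'] > max_attempts:
            raise MaxRetriesExceeded(
                'Exceeded %d scheduling attempts for instance %s'
                % (max_attempts, instance_uuid))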

Alex Xu (xuhj)
Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/97727

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/97728

Revision history for this message
Mark McLoughlin (markmc) wrote :

Just from reading the tracebacks above, the two issues I see are:

1) FlavorDiskTooSmall raised by driver.spawn() is going to result in a reschedule attempt which cannot succeed

2) In the exception handling of _build_resources() we do

                # Make sure the async call finishes
                if network_info is not None:
                    network_info.wait(do_raise=False)

    and get:

      'NetworkInfo' object has no attribute 'wait'

    We obviously expect it to be a NetworkInfoAsyncWrapper but it isn't for some reason.

Revision history for this message
Mark McLoughlin (markmc) wrote :

Ok, I see the infinite rescheduling issue now too - with the new build_instances() code path, we're not keeping track of retry attempts

Changed in nova:
importance: Undecided → Critical
Revision history for this message
Andrew Laski (alaski) wrote :

The FlavorDiskTooSmall issue is being tracked at https://bugs.launchpad.net/nova/+bug/1325670

Looking at the other issue and fix now.

Revision history for this message
Mark McLoughlin (markmc) wrote :

Note that https://review.openstack.org/97728 should be backported to stable/icehouse

tags: added: icehouse-rc-potential
Revision history for this message
Andrew Laski (alaski) wrote :

Icehouse should be unaffected since build_and_run_instance wasn't used there.

Revision history for this message
Mark McLoughlin (markmc) wrote :

Icehouse's build_and_run_instance() will be used during an upgrade where you have an Icehouse compute and Juno conductor

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/98309

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/97727
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9639b55b614082206207e67af7dde7af85a9b614
Submitter: Jenkins
Branch: master

commit 9639b55b614082206207e67af7dde7af85a9b614
Author: He Jie Xu <email address hidden>
Date: Wed Jun 4 14:56:33 2014 +0800

    Fix infinite instance rescheduling due to missing retry info

    The logic for building an instance has moved from the scheduler to
    the conductor, but the 'retry' info is still populated in the
    scheduler manager and cannot be passed back through the rpc call.
    This patch moves the retry-checking logic into the conductor and
    keeps the old 'schedule_run_instance' behaviour the same as before.

    Change-Id: I4efafe815e8623881e6c037eb9d04c0dbb7052a2
    Partial-Bug: #1326207

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/98309
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=24a04c405ab2c98e52ea1edf8775489907526c6d
Submitter: Jenkins
Branch: master

commit 24a04c405ab2c98e52ea1edf8775489907526c6d
Author: He Jie Xu <email address hidden>
Date: Fri Jun 6 14:57:47 2014 +0800

    Check the network_info object type before invoking the wait function

    When an instance is scheduled to a host and it is not the first
    scheduling attempt, the instance may already have its network
    allocated. In that case the code returns a NetworkInfo object, not a
    NetworkInfoAsyncWrapper, but the checking code assumes it is always
    a NetworkInfoAsyncWrapper.

    Change-Id: I987902fc47e509e4fd08f762d66b16c09eaf5008
    Partial-Bug: #1326207

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/97728
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ea3e03a626118fbee1626eb1a17224629299757f
Submitter: Jenkins
Branch: master

commit ea3e03a626118fbee1626eb1a17224629299757f
Author: He Jie Xu <email address hidden>
Date: Wed Jun 4 16:51:41 2014 +0800

    Disable rescheduling an instance when there is no retry info

    Setting scheduler_max_attempts=1 disables rescheduling, and compute
    previously would not reschedule an instance in that case. But
    build_and_run_instance did not honor this behavior, causing nova to
    reschedule the instance infinitely. This patch fixes the problem so
    that build_and_run_instance will not reschedule an instance without
    retry info.

    Change-Id: I52d6569c1731b5fd5a10f39ff04a40b43d3857a1
    Partial-Bug: #1326207
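
A sketch of the guard this commit describes (the state handling is illustrative, not nova's exact code):

    # Hypothetical sketch: without a 'retry' dict in filter_properties,
    # a build failure must not trigger a reschedule.
    def handle_build_failure(instance, filter_properties):
        if not filter_properties.get('retry'):
            # scheduler_max_attempts=1 (or retry info lost): go to
            # ERROR instead of looping forever.
            instance.vm_state = 'error'
            instance.save()
            return
        # Otherwise hand the request back to the conductor to pick
        # another host.
        ...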

Revision history for this message
Michael Still (mikal) wrote :

Alex, what further work (if any) is needed for this bug?

Revision history for this message
Alex Xu (xuhj) wrote :

Michael, there isn't any more work for this bug.

Andrew Laski (alaski)
Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2