nova-compute crashes upon startup if an instance failed to start previously

Bug #1076309 reported by Johnny Boy
28
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Critical
Jian Wen

Bug Description

While trying to get my 2nd compute machine up and running, I noticed that it keeps crashing if an instance failed to launch because of RPC communication timeouts.
When the compute restarts and tries to do something about this instance, it crashes with this log:

012-11-08 11:25:44 AUDIT nova.service [-] Starting compute node (version 2012.2-LOCALBRANCH:LOCALREVISION)
2012-11-08 11:25:44 DEBUG nova.virt.libvirt.driver [-] Connecting to libvirt: qemu:///system from (pid=1638) _get_connection /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:340
2012-11-08 11:25:45 DEBUG nova.utils [req-bdd9dcc1-9849-4b4f-a7f5-ba8fd84cc843 None None] backend <module 'nova.db.sqlalchemy.api' from '/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.pyc'> from (pid=1638) __get_backend /usr/lib/python2.7/dist-packages/nova/utils.py:494
2012-11-08 11:25:47 DEBUG nova.compute.manager [req-bdd9dcc1-9849-4b4f-a7f5-ba8fd84cc843 None None] [instance: 2544284a-8abf-4d71-ba5e-2366ad3369c4] Checking state from (pid=1638) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:334
2012-11-08 11:25:47 DEBUG nova.compute.manager [req-bdd9dcc1-9849-4b4f-a7f5-ba8fd84cc843 None None] [instance: 2544284a-8abf-4d71-ba5e-2366ad3369c4] Current state is 0, state in DB is 0. from (pid=1638) init_host /usr/lib/python2.7/dist-packages/nova/compute/manager.py:288
2012-11-08 11:25:47 CRITICAL nova [-] list index out of range
2012-11-08 11:25:47 TRACE nova Traceback (most recent call last):
2012-11-08 11:25:47 TRACE nova File "/usr/bin/nova-compute", line 48, in <module>
2012-11-08 11:25:47 TRACE nova service.wait()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/service.py", line 659, in wait
2012-11-08 11:25:47 TRACE nova _launcher.wait()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/service.py", line 192, in wait
2012-11-08 11:25:47 TRACE nova super(ServiceLauncher, self).wait()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/service.py", line 162, in wait
2012-11-08 11:25:47 TRACE nova service.wait()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2012-11-08 11:25:47 TRACE nova return self._exit_event.wait()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2012-11-08 11:25:47 TRACE nova return hubs.get_hub().switch()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2012-11-08 11:25:47 TRACE nova return self.greenlet.switch()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
2012-11-08 11:25:47 TRACE nova result = function(*args, **kwargs)
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/service.py", line 132, in run_server
2012-11-08 11:25:47 TRACE nova server.start()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/service.py", line 398, in start
2012-11-08 11:25:47 TRACE nova self.manager.init_host()
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 296, in init_host
2012-11-08 11:25:47 TRACE nova self.driver.plug_vifs(instance, legacy_net_info)
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 437, in plug_vifs
2012-11-08 11:25:47 TRACE nova self.vif_driver.plug(instance, (network, mapping))
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/vif.py", line 111, in plug
2012-11-08 11:25:47 TRACE nova return self._get_configurations(instance, network, mapping)
2012-11-08 11:25:47 TRACE nova File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/vif.py", line 68, in _get_configurations
2012-11-08 11:25:47 TRACE nova conf.add_filter_param("IP", mapping['ips'][0]['ip'])
2012-11-08 11:25:47 TRACE nova IndexError: list index out of range
2012-11-08 11:25:47 TRACE nova

Terminating this failed instance and then restarting compute works as a fix.

Revision history for this message
Johnny Boy (balboah) wrote :

Oh and this is ubuntu 12.04 running Folsom

Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → Critical
Jian Wen (wenjianhn)
Changed in nova:
assignee: nobody → Jian Wen (wenjianhn)
Revision history for this message
Jian Wen (wenjianhn) wrote :

I need more information to locate the problem.

Could you provide the RPC communication timeouts LOG and the result of the following command?
echo "select * from instance_info_caches where instance_uuid=INSTANCE_UUID \G" | mysql -uroot -p<PASSWORD> nova

Michael Still (mikal)
Changed in nova:
status: Triaged → Incomplete
Revision history for this message
Johnny Boy (balboah) wrote :

There are one machine that runs "all" services e.g. networking, scheduler, horizon etc.
And the one failing is another physical machine with only compute.

The compute log:
https://dl.dropbox.com/u/2468164/compute.log

The controller/horizon:
https://dl.dropbox.com/u/2468164/controller.logs

Your mysql query (different instance id than the older logs attached but always the same error):
*************************** 1. row ***************************
   created_at: 2012-10-29 15:15:16
   updated_at: 2012-10-29 15:17:30
   deleted_at: 2012-10-29 15:17:30
      deleted: 1
           id: 8
 network_info: [{"network": {"bridge": "br101", "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "192.168.101.10"}], "version": 4, "meta": {"dhcp_server": "192.168.101.3"}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "10.0.1.1"}], "routes": [], "cidr": "192.168.101.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.101.1"}}, {"ips": [], "version": null, "meta": {"dhcp_server": "192.168.101.3"}, "dns": [], "routes": [], "cidr": null, "gateway": {"meta": {}, "version": null, "type": "gateway", "address": null}}], "meta": {"tenant_id": null, "multi_host": true, "should_create_bridge": true, "bridge_interface": "br101"}, "id": "3156d03a-e634-4d43-a920-13f854c4d9de", "label": "private"}, "meta": {}, "id": "c745c364-c332-4ffd-bad9-e0c51acc271e", "address": "fa:16:3e:00:73:fb"}]
instance_uuid: 980777ee-a5ec-4963-9517-eb8f74c52cfc

Jian Wen (wenjianhn)
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
Jian Wen (wenjianhn) wrote :

Your configuration is wrong.
To run in mulit-host mode, each compute host must run the following services:
nova-compute, nova-network, nova-api-metadata or nova-api

Reference:
http://docs.openstack.org/folsom/openstack-compute/admin/content/existing-ha-networking-options.html

I reproduced the bug by using a correct multi-host mode configuration.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

balboah,

can you please confirm if you fix the configuration per @wenjianhn's last comment work for you?

-- dims

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Jian Wen (wenjianhn) wrote :

The root cause of the bug is that conf.add_filter_param("IP", mapping['ips'][0]['ip']) is called against
an instance that doesn't get a fixed ip yet.
The network_info cache is not empty because the nova-network on the controller allocated mac address for the instance.
The mysql query result balboah provides can't reproduce the bug because it's ips field is not empty.

During fixing this bugs, I produced other 2 bugs which distracted my attention.
I will add comment on how to reproduce later.

Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/18116

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/18116
Committed: http://github.com/openstack/nova/commit/51546e79f366d14f25bb5a933a3ab0ad033e7a3d
Submitter: Jenkins
Branch: master

commit 51546e79f366d14f25bb5a933a3ab0ad033e7a3d
Author: Jian Wen <email address hidden>
Date: Wed Dec 12 22:14:18 2012 +0800

    libvirt: Make vif_driver.plug() returns None

    The return value of vif_driver.plug() was used only when nova wrote the
    libvirt.xml file on initial boot of the instance. Make it returns None,
    and add get_config() to get the configurations if needed.

    If LibvirtBridgeDriver is used when the nova-comptue inits, we call
    vif_driver.plug() without get the VIF configurations. By doing this
    plug doesn't raise if the instance has not yet got a fixed-ip. Fixes
    bug 1076309.

    Change-Id: I0fafb3d98e3139bce0ef2cee04cab428a0621696

Changed in nova:
status: In Progress → Fix Committed
tags: added: folsom-backport-potential
Revision history for this message
Johnny Boy (balboah) wrote :

Sorry for the late update.
But I never intended to have multi_host=true and it's not set in the configuration. Even setting it to false i nova.conf doesn't make a difference to the database result.

Revision history for this message
Jian Wen (wenjianhn) wrote :

Johnny,
I saw your question about multi_host on launchpad.
Anyway, both this bug and its fix have nothing to do with multi_host.

Revision history for this message
Johnny Boy (balboah) wrote :

Thanks.

The configuration problem in the end for me was that I thought multi_host could be configured by the nova.conf. But I had to update the setting for my network that I already created in the database.

After updating the network table I don't get the RPC timeout anymore.

Revision history for this message
xsited (xsited) wrote :

I have run into this issue and does not have anything to do multi_host and using the patches against Folsom do not seem suitable. Is there an update on any possible backports or workarounds? thx.

Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-2
status: Fix Committed → Fix Released
Revision history for this message
xsited (xsited) wrote :

A quick follow up. I was able to back myself out of this particular issue by deleting all the networks created when I used the 'a range of networks' syntax. I then created each network one by one and associated a tenant. I was then able to start a VM properly and move on from there. I continued to use --multi_host=T .

https://answers.launchpad.net/nova/+question/219375

Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-2 → 2013.1
Revision history for this message
Tamale (uictamale) wrote :

I had the same problem, but my resolution was to add a new quota type of fixed_ips to the quotas table. Found here:

https://privatecloudforums.rackspace.com/viewtopic.php?f=4&t=409

Revision history for this message
Tamale (uictamale) wrote :

Full query for posterity:

mysql> INSERT INTO quotas (created_at, updated_at, deleted_at, deleted, project_id, resource, hard_limit) VALUES (NOW(), NULL, NULL, 0, '<tenant-id>', 'fixed_ips', 100);

Revision history for this message
Jian Wen (wenjianhn) wrote :

@xsited, Vish backported similar function in commit d5e7f551

@Tamale, could you file a upstream bug?

tags: removed: folsom-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.