nova-compute does not start after upgrade from juno->kilo if there are boot-from-volume servers running

Bug #1445021 reported by Sean Dague
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: High
Assigned to: Nikola Đipanov

Bug Description

Running: nova master in grenade tests

Relevant job that triggers this:

http://logs.openstack.org/91/173791/11/check/check-grenade-dsvm/fc725f5/

This patch attempted to test the survivability of a "boot from volume" server over the course of the upgrade; however, when we tried to do this, a lot of tests failed.

It turns out that libvirt's device scan actually fails in this situation after boot:

http://logs.openstack.org/91/173791/11/check/check-grenade-dsvm/fc725f5/logs/new/screen-n-cpu.txt.gz#_2015-04-16_11_39_05_009

2015-04-16 11:39:05.009 ERROR nova.openstack.common.threadgroup [req-b09699d4-5d28-4eeb-a09c-412f48da3d68 None None] Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-f1015aa4-1998-47c1-8ce6-625ca0fa2b8c-lun-1
Exit code: 1
Stdout: u''
Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-f1015aa4-1998-47c1-8ce6-625ca0fa2b8c-lun-1: No such device or address\n'
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup Traceback (most recent call last):
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/openstack/common/threadgroup.py", line 145, in wait
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup x.wait()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/openstack/common/threadgroup.py", line 47, in wait
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return self.thread.wait()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 175, in wait
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return self._exit_event.wait()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return hubs.get_hub().switch()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return self.greenlet.switch()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup result = function(*args, **kwargs)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/openstack/common/service.py", line 497, in run_service
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup service.start()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/service.py", line 183, in start
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup self.manager.pre_start_hook()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/compute/manager.py", line 1288, in pre_start_hook
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup self.update_available_resource(nova.context.get_admin_context())
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/compute/manager.py", line 6237, in update_available_resource
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup rt.update_available_resource(context)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 376, in update_available_resource
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup resources = self.driver.get_available_resource(self.nodename)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4908, in get_available_resource
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup disk_over_committed = self._get_disk_over_committed_size_total()
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6109, in _get_disk_over_committed_size_total
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup self._get_instance_disk_info(dom.name(), xml))
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6062, in _get_instance_disk_info
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup dk_size = lvm.get_volume_size(path)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/virt/libvirt/lvm.py", line 172, in get_volume_size
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup run_as_root=True)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/virt/libvirt/utils.py", line 55, in execute
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return utils.execute(*args, **kwargs)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/opt/stack/new/nova/nova/utils.py", line 206, in execute
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup return processutils.execute(*cmd, **kwargs)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py", line 233, in execute
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup cmd=sanitized_cmd)
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup ProcessExecutionError: Unexpected error while running command.
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup Command: sudo nova-rootwrap /etc/nova/rootwrap.conf blockdev --getsize64 /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-f1015aa4-1998-47c1-8ce6-625ca0fa2b8c-lun-1
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup Exit code: 1
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup Stdout: u''
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup Stderr: u'blockdev: cannot open /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.2010-10.org.openstack:volume-f1015aa4-1998-47c1-8ce6-625ca0fa2b8c-lun-1: No such device or address\n'
2015-04-16 11:39:05.009 13951 TRACE nova.openstack.common.threadgroup
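
Worth noting about the shape of this failure: nothing between lvm.get_volume_size and the service threadgroup catches the ProcessExecutionError, so the wait() on the service thread re-raises it and the whole nova-compute process exits. A toy illustration of that propagation (no nova code involved, just the eventlet pattern visible in the traceback):

    # Toy illustration: an exception raised from a startup hook propagates
    # through the green thread and kills the process, the same shape as
    # the traceback above.
    import eventlet

    def pre_start_hook():
        # stands in for the ProcessExecutionError from blockdev
        raise RuntimeError("blockdev failed")

    def run_service():
        pre_start_hook()  # nothing along the way catches the error...

    gt = eventlet.spawn(run_service)
    gt.wait()  # ...so wait() re-raises it here and the service dies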

However, the device path still exists:
devstack> ls -al /dev/disk/by-path/ip-10.42.0.70:3260-iscsi-iqn.2010-10.org.openstack:volume-38709eda-da3a-46fa-9607-d2992d7ed1fa-lun-1
lrwxrwxrwx 1 root root 9 Apr 16 13:29 /dev/disk/by-path/ip-10.42.0.70:3260-iscsi-iqn.2010-10.org.openstack:volume-38709eda-da3a-46fa-9607-d2992d7ed1fa-lun-1 -> ../../sdb

> ls -l /dev/sdb
brw-rw---- 1 libvirt-qemu kvm 8, 16 Apr 16 13:31 /dev/sdb

> sudo blockdev --info /dev/sdb
blockdev: cannot open /dev/sdb: No such device or address
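
Note why the listing above succeeds while blockdev fails: the udev symlink and the device node still exist, so path lookups work, but actually opening the device requires talking to it, and with the iSCSI target gone the kernel returns ENXIO ("No such device or address"). A minimal sketch of that distinction (not nova code; /dev/sdb is the device from the listing above):

    # Minimal sketch: the device node still exists, but opening it fails
    # with ENXIO once the iSCSI target behind it is unreachable.
    import errno
    import os

    path = "/dev/sdb"  # device node from the listing above

    print(os.path.exists(path))  # True: the directory entry is still there

    try:
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)
    except OSError as e:
        # the same errno blockdev reports as "No such device or address"
        print(e.errno == errno.ENXIO)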

Sean Dague (sdague)
Changed in nova:
importance: Undecided → Critical
tags: added: kilo-backport-potential
Revision history for this message
Sean Dague (sdague) wrote :

Ok, so... this turned out to be mostly an issue with the test environment:

- stable/juno devstack also shut down tgt (the iSCSI target daemon)

- the new service bring-up order in grenade brings up nova first, cinder second

Nova failed to come up under this circumstance for a boot-from-volume server (though, curiously enough, not for an attached volume).

Removing the extra tgt shutdown gets the tests to pass in grenade. However, nova-compute should never crash on startup, regardless of the state of the world. It also turns out that restarting tgt later makes the guest work perfectly well.
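
One concrete reading of "should never crash on startup": the disk sizing that runs from pre_start_hook could tolerate an unreadable device instead of letting the ProcessExecutionError propagate. A rough sketch of that defensive pattern (hypothetical helper names, not the actual nova patch):

    # Rough sketch (hypothetical, not the actual fix): skip devices that
    # cannot be sized instead of letting one failure kill nova-compute.
    import logging
    import subprocess

    LOG = logging.getLogger(__name__)

    def get_volume_size(path):
        # the same operation nova runs via rootwrap: blockdev --getsize64
        out = subprocess.check_output(["blockdev", "--getsize64", path])
        return int(out.strip())

    def disk_over_committed_total(disk_paths):
        total = 0
        for path in disk_paths:
            try:
                total += get_volume_size(path)
            except (subprocess.CalledProcessError, OSError):
                # e.g. an iSCSI volume whose target (tgt) is down: skip it
                # rather than aborting the whole resource update
                LOG.warning("could not get size of %s, skipping", path)
        return total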

Changed in nova:
importance: Critical → High
Revision history for this message
Sean Dague (sdague) wrote :

Moving to High, as I think this is no longer a blocker; however, it is still worth fixing.

Changed in nova:
status: New → Confirmed
Revision history for this message
Sean Dague (sdague) wrote :

Here is the reproduce scenario:

Start a new default devstack (I used Ubuntu 14.04, though the distro probably doesn't matter for this).

Run these commands:

create a bootable volume

> openstack volume create --image cirros-0.3.2-x86_64-uec --size 1 cinder_volume

wait for the volume to reach the bootable state, and capture its id

> openstack volume show cinder_volume

boot a server from that volume

> openstack server create --volume $id --flavor m1.tiny boot_from_vol_server --wait

once the server has gone active, do the following

> service tgt stop

(under Fedora it's tgtd, iirc)

ensure that tgt is down, then reattach to the devstack screen session:

> screen -rd

cycle through the windows until you get to the n-cpu window, then:

- ctrl-C to kill it

- up arrow, enter to try to restart it.

n-cpu will stack trace trying to run blockdev --getsize64 on the volume device, and will crash out.

However, if you restart tgt, the guests will keep chugging along happily. In the absence of a working iSCSI connection, the kernel treats the guest the same way it treats an NFS hard-mounted root that has lost network connectivity: I/O blocks until the connection comes back. So this isn't a fatal condition for the system, and hence shouldn't be a fatal startup condition for nova-compute.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/174836

Changed in nova:
assignee: nobody → Nikola Đipanov (ndipanov)
status: Confirmed → In Progress
Changed in nova:
assignee: Nikola Đipanov (ndipanov) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/174836
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=833357301bc80a27422f7bf081fae2d3da730a24
Submitter: Jenkins
Branch: master

commit 833357301bc80a27422f7bf081fae2d3da730a24
Author: Nikola Dipanov <email address hidden>
Date: Fri Apr 17 12:49:13 2015 +0100

    libvirt: make _get_instance_disk_info conservative

    We want to make sure we never try to get the size of an attached volume
    when doing _get_instance_disk_info (as this can cause issues when
    gathering available resources).

    libvirt's get_available_resource will call upon it to determine the
    available disk size on the compute node, but does so without providing
    information about block devices. This makes _get_instance_disk_info make
    incorrect guesses as to which devices are volumes.

    This patch makes _get_instance_disk_info more conservative about its
    guesses when it cannot reasonably determine whether a device is a
    volume or not.

    Change-Id: Ifb20655c32896b640672917e3840add81b136780
    Partial-bug: #1445021
    Partial-Bug: #1371677
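
Reading the commit message, the "conservative" behavior amounts to: when no block device info is available, a <disk type='block'> device cannot safely be assumed to be a local LVM disk, so it is skipped rather than sized with blockdev. A simplified sketch of that logic (illustrative only; the real change lives in nova/virt/libvirt/driver.py):

    # Simplified sketch of the "conservative guess" described in the
    # commit message (illustrative only, not the actual patch).
    from lxml import etree

    def local_disk_paths(domain_xml, known_volume_paths=None):
        """Collect disk paths that are safe to size with blockdev."""
        known_volume_paths = set(known_volume_paths or [])
        doc = etree.fromstring(domain_xml)
        paths = []
        for disk in doc.findall(".//devices/disk"):
            source = disk.find("source")
            if source is None:
                continue
            path = source.get("file") or source.get("dev")
            if path in known_volume_paths:
                continue  # a known attached volume: never size it
            if disk.get("type") == "block":
                # Without block device info we cannot prove this block
                # device is a local disk rather than an attached volume,
                # so be conservative and skip it too.
                continue
            paths.append(path)
        return paths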

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/179500

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/kilo)

Change abandoned by Jay Bryant (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/179500
Reason: No one has been begging for this. Abandoning.

Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Nikola Đipanov (ndipanov)
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

I believe that this has the same root cause as https://bugs.launchpad.net/nova/+bug/1416132

I have added more info there and expect the discussion to continue there, in which case we may want to keep only one report and mark the other as a duplicate.
