[mos] nova-compute fails to start due to major Ceph vs. Nova problems

Bug #1335628 reported by Joshua Dotson
This bug affects 3 people
Affects                     Status         Importance   Assigned to       Milestone
Fuel for OpenStack          Fix Released   Critical     MOS Nova
  5.0.x                     Fix Released   Critical     MOS Nova
Mirantis OpenStack          Fix Released   Critical     Roman Podoliaka
  5.0.x                     Fix Released   Critical     Roman Podoliaka
  5.1.x                     Fix Released   Critical     Roman Podoliaka

Bug Description

I have just deployed using the 5.0.1 pre-release ISO I built yesterday. I built the ISO using a procedure similar to what is described here: https://bugs.launchpad.net/fuel/+bug/1335484

My deployment was successful. It was H/A with Ceph backing as much as possible, Ceilometer, KVM, three controllers, and four compute nodes. The deployment was Ubuntu-based. All nodes have 256 GB of RAM and 12 recent physical Xeon cores. Disks are one 240 GB SSD and two 3 TB HDDs.

All health checks that involve starting an instance fail. Nova-network shows as down on the Horizon status page. All other checks seem okay, even H/A.

Here is /var/log/nova/nova-compute.log from one of the Hypervisors: http://paste.openstack.org/show/85120/

This seems like the same bug: https://<email address hidden>/msg01083.html

I don't see anything running on localhost:5672.

Let me know if I can provide more detail. I'll be working on this more than 12 hours a day until I have a working cloud.

Thanks,
Joshua

Tags: ceph nova
Revision history for this message
Joshua Dotson (tns9) wrote :

Retrying my paste because of an internal server error on the paste domain: http://paste.openstack.org/show/85121/

Revision history for this message
Joshua Dotson (tns9) wrote :
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Thanks for the issue, Joshua. Please also attach a diagnostic snapshot.

Changed in fuel:
importance: Undecided → Medium
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 5.0.1
Revision history for this message
Joshua Dotson (tns9) wrote :

I think the localhost:5672 error was a false positive, because the timestamp places it during the initial deployment. The /etc/nova/nova.conf file looks good; it points at three remote RabbitMQ hosts.
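
For reference, the relevant transport settings look roughly like this (the host addresses and credentials below are placeholders, not copied from my actual config):

    [DEFAULT]
    rabbit_hosts=192.168.0.3:5672,192.168.0.4:5672,192.168.0.5:5672
    rabbit_ha_queues=True
    rabbit_userid=nova
    rabbit_password=<password>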

I found this in the compute node's /var/log/nova-all.log:
http://paste.openstack.org/show/85132/

It seems there is a Python bug somewhere in Nova. On the other hand, perhaps something in my nova.conf is triggering it:

http://paste.openstack.org/show/85133/

I'm trying to find a snapshot option. Please advise.

Revision history for this message
Joshua Dotson (tns9) wrote :

My problem has something to do with these lines of /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:

http://paste.openstack.org/show/85138/

I did some checking. I thought maybe it should read as follows, but I get another error:

        elif CONF.libvirt.images_type == 'rbd':
            info = LibvirtDriver._get_rbd_driver().get_pool_info()

The error:

<179>Jun 30 02:16:41 node-5 nova-nova.openstack.common.threadgroup ERROR: 'RADOSClient' object has no attribute 'get_cluster_stats'

...
  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/rbd_utils.py", line 246, in get_pool_info
    stats = client.get_cluster_stats()
AttributeError: 'RADOSClient' object has no attribute 'get_cluster_stats'

I think I'm going to take a look at /usr/lib/python2.7/dist-packages/nova/virt/libvirt/rbd_utils.py...

There are a couple of code bugs here. In addition, I think it's possible that the correct Ceph driver is missing on the compute nodes. Different Ceph drivers appear to be installed on the controller than on the compute nodes.

Here is as far as I got on the rbd_utils.py corrections (note the self.pool):

    def get_pool_info(self):
        with RADOSClient(self.pool) as client:
            stats = client.get_cluster_stats()
            return {'total': stats['kb'] * units.Ki,
                    'free': stats['kb_avail'] * units.Ki,
                    'used': stats['kb_used'] * units.Ki}
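
As a sanity check that the cluster stats call really lives on the librados handle itself rather than on Nova's wrapper, something like this works from a python shell on a compute node (the conffile path below is just the default Ceph config location; adjust for your environment):

    import rados

    # Connect to the cluster directly through the librados binding.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    # Returns a dict with 'kb', 'kb_avail', 'kb_used' and 'num_objects'.
    print cluster.get_cluster_stats()
    cluster.shutdown()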

Joshua Dotson (tns9)
summary: - nova-compute receives connection refused error on localhost
+ nova-compute fails to start due to major Ceph vs. Nova problems
Changed in fuel:
status: New → Confirmed
importance: Medium → Critical
milestone: 5.0.1 → 5.1
assignee: Fuel Library Team (fuel-library) → Dmitry Borodaenko (dborodaenko)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote : Re: nova-compute fails to start due to major Ceph vs. Nova problems

Nova packages for 5.0.1 and 5.1 have to be updated with a newer version of the rbd-ephemeral-clone patch series.

Icehouse version:
https://github.com/angdraug/nova/commits/rbd-ephemeral-clone-stable-icehouse

Juno version:
https://review.openstack.org/102064

Specifically, the code snippet quoted by Joshua should be:

    def get_pool_info(self):
        with RADOSClient(self) as client:
            stats = client.cluster.get_cluster_stats()
            return {'total': stats['kb'] * units.Ki,
                    'free': stats['kb_avail'] * units.Ki,
                    'used': stats['kb_used'] * units.Ki}

The stack trace happens because the call should be client.cluster.get_cluster_stats(), not client.get_cluster_stats() as above.
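
For context, RADOSClient in rbd_utils.py is just a thin context manager around the librados connection; roughly (a simplified sketch, not the exact patched code), it looks like this:

    class RADOSClient(object):
        """Context manager wrapping a librados connection."""
        def __init__(self, driver, pool=None):
            self.driver = driver
            # The rados.Rados handle is stored on self.cluster, which is
            # why cluster-wide stats are read via client.cluster.
            self.cluster, self.ioctx = driver._connect_to_rados(pool)

        def __enter__(self):
            return self

        def __exit__(self, type_, value, traceback):
            self.driver._disconnect_from_rados(self.cluster, self.ioctx)

That is also why the constructor is passed the driver itself (RADOSClient(self)) rather than a pool name.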

Changed in fuel:
assignee: Dmitry Borodaenko (dborodaenko) → MOS Nova (mos-nova)
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Joshua, please confirm whether the fix from comment #6 works for you.

Please also elaborate on your observation about different driver versions between the controller and compute nodes: we ship the same package set for nodes in all roles and combinations thereof, so I'd like to know how there could be any inconsistency.

Revision history for this message
Joshua Dotson (tns9) wrote :

Dmitry,

I can confirm that the following seems to work:

@3810 in /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:

            info = LibvirtDriver._get_rbd_driver().get_pool_info()

@245 in /usr/lib/python2.7/dist-packages/nova/virt/libvirt/rbd_utils.py:

            stats = client.cluster.get_cluster_stats()

How should I correct this across my cluster? Is there any built-in automation to help me? Must I rebuild?

About the difference in drivers, I based that on a recursive grep for get_cluster_stats. I observed occurrences of that string in more files on the controller than on the compute nodes. I can provide a paste of this observation if you would like.

Thanks! I thought I was stuck forever.

Now I need to figure out why the Heat-related health checks aren't succeeding. I guess Heat isn't installed? I wonder why.

-Joshua

tags: added: ceph
tags: added: nova
Changed in mos:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Roman Podoliaka (rpodolyaka)
Changed in mos:
milestone: none → 5.1
Revision history for this message
Joshua Dotson (tns9) wrote :

I'd like to build a new 5.0.1 ISO and reinstall my environment (72 nodes) once this is patched. Is there any estimated time of arrival for this fix making it into a normal 5.0.1 build? If it lands in the next day or so, I can do a real-world test of it. Otherwise, it seems I'll need to fix this by hand.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Changed in fuel:
importance: Critical → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Lowered the priority for 5.1 as we haven't created a Nova branch for it yet.

Revision history for this message
Joshua Dotson (tns9) wrote :

I cannot see the content on gerrit.mirantis.com. Is there any public mirror of such content? Is the content APL 2.0?

Thank you,
Joshua

Revision history for this message
Egor Kotko (ykotko) wrote :

{"build_id": "2014-07-03_12-42-15", "mirantis": "yes", "build_number": "89", "ostf_sha": "d0fe60e0eba61685008b86d101f459fc2d3bb654", "nailgun_sha": "5c18e962d85b878e53ff6eb6eeeb14658814c5b8", "production": "docker", "api": "1.0", "fuelmain_sha": "1072bc723d14427d5fdc24662ffe1af0641e0d9a", "astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff", "release": "5.0.1", "fuellib_sha": "385d713b569bc0633e695b44ff7eedf3417f0575"}

I have reproduced this bug.
In my case the steps were:
1. Create an Ubuntu multi-node environment with nova-network FlatDHCP, 1 controller + Ceph, 2 computes + Ceph, and Ceph as the backend for volumes, images, and ephemeral volumes.
2. Create an instance.

The instance was created in the Error state, and the nova-compute service was reported as down (XXX).

Revision history for this message
Egor Kotko (ykotko) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

ISO #89 contains an outdated python-nova DEB package; this is fixed in openstack-ci/fuel-5.0.1/2012.1.1 HEAD.

Interestingly, the RPM python-nova package in the same ISO actually contains the latest code.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Joshua,

gerrit.mirantis.com is a private repo; when the Ubuntu packages are ready, they will show up here:
http://fuel-repository.mirantis.com/fwm/5.1/ubuntu/pool/main/

The ISO make scripts download packages from there as well, so as soon as you see updated nova packages at the link above, your next ISO rebuild should pull the updates.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Correction: for 5.0.1, the URL is http://fuel-repository.mirantis.com/fwm/5.0.1/ubuntu/pool/main/

I just checked and it also doesn't have the fix yet (so Roman's comment #15 is still true).

Changed in fuel:
status: Confirmed → Triaged
Changed in mos:
importance: High → Critical
Changed in fuel:
importance: High → Critical
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Nova is broken without this fix; it must go into 5.0.1 and 5.1 ASAP.

Revision history for this message
Egor Kotko (ykotko) wrote :

{u'build_id': u'2014-07-08_13-57-45', u'ostf_sha': u'09b6bccf7d476771ac859bb3c76c9ebec9da9e1f', u'build_number': u'107', u'api': u'1.0', u'nailgun_sha': u'c0082e3a0e8544bad7bd45c15c5dd8632ea045b5', u'production': u'docker', u'mirantis': u'yes', u'fuelmain_sha': u'b0f5151d12751b9b55dcd69bd1445d0d480012d6', u'astute_sha': u'a4edb51661f50c66e247e0b8d00f2d01e0658fe6', u'release': u'5.0.1', u'fuellib_sha': u'd4cb36208efaf51a7c0ca012fa63d596d4ee2e29'}

Revision history for this message
Joshua Dotson (tns9) wrote :

The fixed package seems to be present in: http://fuel-repository.mirantis.com/fwm/5.0.1/ubuntu/pool/main/python-nova_1:2014.1.1.fuel5.1~mira19_all.deb

Thanks all! I'll rebuild my ISO today and see where it takes me.

Dmitry Ilyin (idv1985)
summary: - nova-compute fails to start due to major Ceph vs. Nova problems
+ [mos] nova-compute fails to start due to major Ceph vs. Nova problems
Revision history for this message
OSCI Robot (oscirobot) wrote :

Package nova has been built from changeset:
DEB Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-5.1-stable/ubuntu

Revision history for this message
OSCI Robot (oscirobot) wrote :
Changed in fuel:
status: Fix Committed → In Progress
Changed in mos:
status: Fix Committed → In Progress
Revision history for this message
OSCI Robot (oscirobot) wrote :

Package nova has been built from changeset: http://gerrit.mirantis.com/18857
RPM Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-5.1-stable/centos

Changed in mos:
status: Fix Released → In Progress
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/5.1.x
Revision history for this message
Alexander Gubanov (ogubanov) wrote :

Verified this on the latest 5.1 ISO

Changed in mos:
status: Fix Committed → Fix Released
Changed in fuel:
status: Fix Committed → Fix Released