Mounting of ceph-backed cinder volumes is broken after Ocata upgrade

Bug #1697782 reported by Chris Martin
This bug affects 7 people
Affects: OpenStack-Ansible
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Environment: multi-node greenfield OSA Newton deploy, upgraded to Ocata (via documented manual upgrade process).

After the Ocata upgrade, I could not launch volume-backed instances or attach cinder volumes to instances. This may ultimately be a cinder or nova bug, but I'm starting here because we may need to implement a workaround in OSA regardless.

For context, this was changed in Ocata (per release notes):
> When making connections to Ceph-backed volumes via the Libvirt driver, the auth values
> (rbd_user, rbd_secret_uuid) are now pulled from the backing cinder.conf rather than
> nova.conf. The nova.conf values are only used if set and the cinder.conf values are not
> set, but this fallback support is considered accidental and will be removed in the Nova
> 16.0.0 Pike release.

When I try to launch an RBD volume-backed instance, this happens (fuller log output at http://paste.openstack.org/show/612469/):

```
2017-06-13 13:01:21.823 55184 ERROR nova.compute.manager [instance: 676a5ee0-a462-4095-a401-24a53056415c]   File "/openstack/venvs/nova-15.1.4/lib/python2.7/site-packages/nova/virt/libvirt/config.py", line 765, in format_dom
2017-06-13 13:01:21.823 55184 ERROR nova.compute.manager [instance: 676a5ee0-a462-4095-a401-24a53056415c] uuid=self.auth_secret_uuid))
...
2017-06-13 13:01:21.823 55184 ERROR nova.compute.manager [instance: 676a5ee0-a462-4095-a401-24a53056415c] TypeError: Argument must be bytes or unicode, got 'NoneType'
```

Basically, self.auth_secret_uuid is supposed to contain the UUID of the libvirt secret holding the cephx key, but here nova-compute is getting `None` from the Cinder connection info -- `netdisk_properties['secret_uuid']`, line 65 of nova/virt/libvirt/volume/net.py. If I insert some debugging code, we can see that the netdisk_properties object contains the following (note `u'secret_uuid': None`):

netdisk_properties:
{u'secret_type': u'ceph', u'name': u'volumes/volume-dd2aa145-ea90-4264-b10d-3b3c84d58c99', u'encrypted': False, u'cluster_name': u'ceph', u'secret_uuid': None, u'qos_specs': None, u'hosts': [u'192.168.1.8'], u'volume_id': u'dd2aa145-ea90-4264-b10d-3b3c84d58c99', u'auth_enabled': True, u'access_mode': u'rw', u'auth_username': u'cinder', u'ports': [u'6789']}

It appears that Cinder isn't providing Nova with a secret_uuid. cinder.conf supports an `rbd_secret_uuid` option in the RBD backend section (`[rbd]` here), but it was not set in cinder.conf in any of the Cinder containers. I tried setting it in the cinder volume containers and restarting them, but it made no difference.

As a workaround, I modified /openstack/venvs/nova-15.1.4/lib/python2.7/site-packages/nova/virt/libvirt/volume/net.py to use the `rbd_secret_uuid` from nova.conf, the old (Newton) way. This does the trick (around line 56):

```
    def _set_auth_config_rbd(self, conf, netdisk_properties):
        # Relevant comment trimmed for length
        auth_enabled = netdisk_properties.get('auth_enabled')
        if auth_enabled:
            conf.auth_username = netdisk_properties['auth_username']
            # Comment out the following line and insert the subsequent line
            # conf.auth_secret_uuid = netdisk_properties['secret_uuid']
            conf.auth_secret_uuid = CONF.libvirt.rbd_secret_uuid
```

Has anyone else encountered this issue?

I'm guessing one of these is happening, in decreasing order of likelihood:
1. OSA isn't configuring Cinder with everything it needs to pass the secret_uuid to Nova when a volume is mounted
2. Cinder has a bug and doesn't send the secret_uuid to Nova during the volume mount process
3. Nova has a bug: it receives the secret_uuid but doesn't parse or process it correctly

Thank you!

Revision history for this message
Johan (johan-o) wrote :

I'm experiencing the same error with Newton and Ocata. For Newton, I know the issue has occurred since tag 14.2.3.

Here is the changelog for 14.2.3, which has the same Ceph change you mentioned for Ocata ("Removed dependency for cinder_backends_rbd_inuse in nova.conf"): https://docs.openstack.org/releasenotes/openstack-ansible/newton.html#id5

I'm currently testing on a new deploy of Ocata tag 15.2.5, and as you said, cinder.conf does not contain `rbd_secret_uuid`. Once I provided that value, everything worked again, so I did not need to change the net.py file. Thanks for the workaround.

Revision history for this message
Johan (johan-o) wrote :

Sorry, correction: the current test environment is on tag 15.1.5.

Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

Just to confirm: your cinder-volume host/container doesn't have the rbd_user and rbd_secret_uuid vars specified in cinder.conf under the backend section, but they are in your backends configuration in openstack_user_config.yml (e.g. https://docs.openstack.org/developer/openstack-ansible-os_cinder/configure-cinder.html#using-ceph-for-cinder-backups)?

Or are the values in cinder.conf, but it's still failing?

Revision history for this message
Chris Martin (6-chris-z) wrote :

As deployed by OSA, cinder.conf in my cinder volume containers contains rbd_user but not rbd_secret_uuid or anything like it.

```
[rbd]
rbd_ceph_conf=/etc/ceph/ceph.conf
volume_backend_name=rbd
volume_driver=cinder.volume.drivers.rbd.RBDDriver
volume_group=cinder_volumes_ceph
rbd_pool=volumes
rbd_user=cinder
```

I added rbd_secret_uuid and restarted the containers, but unlike Johan above, it didn't all just start working for me. (Perhaps I did something wrong!)

Regarding the backends configuration, under https://docs.openstack.org/developer/openstack-ansible-os_cinder/configure-cinder.html#using-ceph-for-cinder-backups, we're using the second example (cephx authentication) as also demonstrated here (https://www.openstackfaq.com/openstack-ansible-ceph/). From openstack_user_config.yml (three hosts are configured like this):
```
  jet01:
    ip: 172.29.236.11
    container_vars:
      cinder_storage_availability_zone: cinder_jetstream_test_ceph
      cinder_default_availability_zone: cinder_jetstream_test_ceph
      cinder_backends:
        limit_container_types: cinder_volume
        rbd:
          volume_group: cinder_volumes_ceph
          volume_driver: cinder.volume.drivers.rbd.RBDDriver
          volume_backend_name: rbd
          rbd_pool: volumes
          rbd_ceph_conf: /etc/ceph/ceph.conf
          rbd_user: cinder
```

Incidentally, nova.conf on the compute nodes *does* have both an rbd_user and an rbd_secret_uuid set.

I upgraded from tag 14.2.4 (Newton) to 15.1.4 (Ocata).

Revision history for this message
Johan (johan-o) wrote :

For me the issue, as Andy pointed out, was that
rbd_secret_uuid: "{{ cinder_ceph_client_uuid }}"
was missing from openstack_user_config.yml.
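
For example, taking Chris's backend definition above, the missing line belongs inside the rbd backend entry, roughly like this (a sketch only; the host and values are from his config, and cinder_ceph_client_uuid is the OSA variable I used for the cephx secret UUID):
```
  jet01:
    ip: 172.29.236.11
    container_vars:
      cinder_backends:
        limit_container_types: cinder_volume
        rbd:
          volume_group: cinder_volumes_ceph
          volume_driver: cinder.volume.drivers.rbd.RBDDriver
          volume_backend_name: rbd
          rbd_pool: volumes
          rbd_ceph_conf: /etc/ceph/ceph.conf
          rbd_user: cinder
          # the line that was missing for me:
          rbd_secret_uuid: "{{ cinder_ceph_client_uuid }}"
```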

So the problem was that we stuck with a config that had worked before, but after the upgrade that config no longer worked.

I also have the ceph config added to storage-infra_hosts instead of storage_hosts, if that even makes a difference.

For Chris it seems weird that the fallback mechanism in _set_auth_config_rbd in nova/virt/libvirt/volume/net.py isn't being used. Also, in my limited experience, variables aren't easily removed from the openstack_inventory.json file, so maybe you need to clean up the nova RBD/Ceph values there? (But that might also break things.)

Revision history for this message
Paulo Matias (paulo-matias) wrote :

Johan's solution worked for me. But please note that operations such as rebooting or migrating an instance that was created before adding rbd_secret_uuid to the config will continue to cause the error.

In order to "fix" already-existing instances, one needs to manipulate the nova database directly, replacing the null "secret_uuid" value inside the connection_info field of each affected block_device_mapping row with the correct one, e.g.

update block_device_mapping set connection_info='{..., "secret_uuid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", ...}' where instance_uuid='yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy';
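
If you prefer not to hand-edit the JSON string, the same fix can be scripted; here is a rough sketch (my own illustration only, assuming PyMySQL and direct access to the nova database -- credentials and UUIDs are placeholders, and there may be one row per attached volume):
```
# Sketch only: patch secret_uuid inside block_device_mapping.connection_info.
# Assumes PyMySQL and direct access to the nova DB; credentials are placeholders.
import json
import pymysql

SECRET_UUID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"    # your cephx secret UUID
INSTANCE_UUID = "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy"  # instance to fix

conn = pymysql.connect(host="localhost", user="nova", password="secret", db="nova")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, connection_info FROM block_device_mapping "
            "WHERE instance_uuid = %s AND deleted = 0 "
            "AND connection_info IS NOT NULL",
            (INSTANCE_UUID,),
        )
        for row_id, connection_info in cur.fetchall():
            info = json.loads(connection_info)
            # The secret_uuid normally lives under the 'data' key; fall back to
            # the top level if the stored structure differs.
            data = info.get("data", info)
            if data.get("secret_uuid") is None:
                data["secret_uuid"] = SECRET_UUID
                cur.execute(
                    "UPDATE block_device_mapping "
                    "SET connection_info = %s WHERE id = %s",
                    (json.dumps(info), row_id),
                )
    conn.commit()
finally:
    conn.close()
```
As with the raw SQL, back up the database before touching it.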

Revision history for this message
Darin Arrick (darinavbt) wrote :

I think I've also run into this. I deployed a MAAS+Autopilot cloud a couple of weeks ago. Everything seems to work except attaching volumes. I spoke with David B. from Canonical, and he suggested I post here as well.

Environment: new deployment, based on https://www.ubuntu.com/download/cloud/autopilot
"juju status" on controller: https://pastebin.com/Rk6pAGFG
nova-compute.log from the compute node in question: https://pastebin.com/XHsZkVxG

Two things:
1) How do I prove that my issue is this bug? The lack of rbd_secret_uuid somewhere?
2) What's the workaround/fix? My deployment is new and strictly for testing at this point, so I can do whatever is needed.

Revision history for this message
David Britton (dpb) wrote :

FYI -- looks very similar to this issue.

https://bugs.launchpad.net/charm-nova-compute/+bug/1671422

Changed in openstack-ansible:
assignee: nobody → Logan V (loganv)
Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

Logan will try to reproduce this. Thanks everyone.

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

A few ceph changes were recently merged and prevented Logan from triaging this. It's postponed to next week. Sorry for the inconvenience.

Revision history for this message
Logan V (loganv) wrote :

I deployed a ceph scenario AIO using the ocata SHA aab48ca0f1b71d527d02d75130d02b1ae9a83661 (HEAD of stable/ocata as of now). On the built AIO, I created an instance and a volume, and attached the volume to the instance successfully. nova-compute.log shows that cinder sent the secret_uuid as we expect it to:
2017-07-25 02:15:28.856 145764 DEBUG cinderclient.v2.client [req-69c07112-327b-407f-811a-55a74aefab41 62e26f6ccae04fd391daba6236482eb3 c425b84638eb49849fb6d9412f731ceb - - -] RESP: [200] X-Compute-Request-Id: req-0ffb133c-ebbd-4d5a-945e-a1f51079fba2 Content-Type: application/json Content-Length: 452 X-Openstack-Request-Id: req-0ffb133c-ebbd-4d5a-945e-a1f51079fba2 Date: Tue, 25 Jul 2017 02:15:28 GMT
RESP BODY: {"connection_info": {"driver_volume_type": "rbd", "data": {"secret_type": "ceph", "name": "volumes/volume-e6966d84-7681-4f65-ad2e-573dd4cd1ff9", "encrypted": false, "cluster_name": "ceph", "secret_uuid": "a91b8ca1-be1c-4387-918d-ec3636ad3212", "qos_specs": null, "auth_enabled": true, "hosts": ["172.29.237.153"], "volume_id": "e6966d84-7681-4f65-ad2e-573dd4cd1ff9", "discard": true, "access_mode": "rw", "auth_username": "cinder", "ports": ["6789"]}}}
 _http_log_response /openstack/venvs/nova-15.1.7/lib/python2.7/site-packages/keystoneauth1/session.py:395
2017-07-25 02:15:28.856 145764 DEBUG cinderclient.v2.client [req-69c07112-327b-407f-811a-55a74aefab41 62e26f6ccae04fd391daba6236482eb3 c425b84638eb49849fb6d9412f731ceb - - -] POST call to cinderv2 for http://172.29.236.100:8776/v2/c425b84638eb49849fb6d9412f731ceb/volumes/e6966d84-7681-4f65-ad2e-573dd4cd1ff9/action used request id req-0ffb133c-ebbd-4d5a-945e-a1f51079fba2 request /openstack/venvs/nova-15.1.7/lib/python2.7/site-packages/keystoneauth1/session.py:640
2017-07-25 02:15:28.866 145764 DEBUG nova.virt.libvirt.driver [req-69c07112-327b-407f-811a-55a74aefab41 62e26f6ccae04fd391daba6236482eb3 c425b84638eb49849fb6d9412f731ceb - - -] [instance: ee29a93c-94a0-4dfd-853a-3c7a9446f8f5] Attempting to attach volume e6966d84-7681-4f65-ad2e-573dd4cd1ff9 with discard support enabled to an instance using an unsupported configuration. target_bus = virtio. Trim commands will not be issued to the storage device. _check_discard_for_attach_volume /openstack/venvs/nova-15.1.7/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:1188
2017-07-25 02:15:28.872 145764 DEBUG nova.virt.libvirt.guest [req-69c07112-327b-407f-811a-55a74aefab41 62e26f6ccae04fd391daba6236482eb3 c425b84638eb49849fb6d9412f731ceb - - -] attach device xml: <disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
  <source protocol="rbd" name="volumes/volume-e6966d84-7681-4f65-ad2e-573dd4cd1ff9">
    <host name="172.29.237.153" port="6789"/>
  </source>
  <auth username="cinder">
    <secret type="ceph" uuid="a91b8ca1-be1c-4387-918d-ec3636ad3212"/>
  </auth>
  <target bus="virtio" dev="vdb"/>
  <serial>e6966d84-7681-4f65-ad2e-573dd4cd1ff9</serial>
</disk>
 attach_device /openstack/venvs/nova-15.1.7/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:308

Once that worked, I decided to try creating an RBD volume-backed instance, which Chris mentioned was broken above. This completed s...


Revision history for this message
Chris Martin (6-chris-z) wrote :

Logan, sorry to hear you couldn't reproduce. One difference is that you deployed an AIO, whereas I have a multi-node cluster with compute nodes separate from the infrastructure nodes.

I'm certain all cinder and nova services were restarted after the Ocata upgrade, because I performed a rolling restart of all the hosts just to ensure everything was restarted.

Revision history for this message
Daniel Marks (d3n14l) wrote :

Logan, I am also experiencing the issue after the *upgrade* to Ocata; you did a fresh Ocata install. I started with a 14.1.0 multi-node HA deployment and upgraded to 15.1.6. My main problem was that the Newton deployment did not write the secret uuid to cinder.conf, and so it was not written after the upgrade to Ocata either. That resulted in the error above. Something must have changed in the handling of the involved variables: the secret uuid is still in our nova configuration files, and Newton took it from there, while Ocata does not.

Revision history for this message
Logan V (loganv) wrote :

@Chris: yes, understood -- the AIO is not an upgrade from Newton, and also not a full prod environment, but it's worth mentioning that this apparently isn't reproducible on a greenfield Ocata install. Regardless of the env size, this seems like a pretty straightforward API interaction to debug: either cinder is populating a secret_uuid from the backend config in its reply to nova, or it is not.

It makes sense that it wouldn't work if the rbd_secret_uuid adjustment to the cinder backends did not take place, but as you said, you added it to your config and it still doesn't work. On a second read-through of your bug report, I think we need to start debugging in cinder, in cinder/volume/drivers/rbd.py inside initialize_connection(). Is nova getting a None for 'secret_uuid' because there's nothing in cinder's self.configuration.rbd_secret_uuid?
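
A quick standalone check along these lines (my own sketch, nothing cinder-specific beyond the option names; the path is the usual default, adjust for your cinder-volume container) would at least show whether the deployed cinder.conf carries the value for each RBD backend:
```
# Sketch: report whether each RBD backend section in cinder.conf carries
# rbd_secret_uuid. Run it inside the cinder-volume container/host.
try:
    import configparser                   # Python 3
except ImportError:
    import ConfigParser as configparser   # Python 2 (Ocata venvs)

CINDER_CONF = "/etc/cinder/cinder.conf"   # adjust if your deployment differs

cfg = configparser.RawConfigParser()
cfg.read(CINDER_CONF)

for section in cfg.sections():
    if (cfg.has_option(section, "volume_driver")
            and "rbd" in cfg.get(section, "volume_driver").lower()):
        if cfg.has_option(section, "rbd_secret_uuid"):
            print("[%s] rbd_secret_uuid = %s"
                  % (section, cfg.get(section, "rbd_secret_uuid")))
        else:
            print("[%s] rbd_secret_uuid is MISSING" % section)
```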

I'm just having a hard time figuring out how, other than the necessary config change + service restart, this would be N->O upgrade related. Maybe rbd_secret_uuid is configured in the wrong section of cinder.conf? It needs to be added to each RBD backend section, not [DEFAULT] or anywhere else.
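
In other words, something like this (a sketch based on Chris's backend section above; the UUID is a placeholder for the cephx secret UUID):
```
[rbd]
rbd_ceph_conf=/etc/ceph/ceph.conf
volume_backend_name=rbd
volume_driver=cinder.volume.drivers.rbd.RBDDriver
volume_group=cinder_volumes_ceph
rbd_pool=volumes
rbd_user=cinder
rbd_secret_uuid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```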

@Daniel: yes, the configuration changed in Ocata. The ceph auth params are now passed over the API call from cinder when volume attachment occurs. Once you add rbd_user and rbd_secret_uuid to your cinder RBD backend configuration (the backends are configured by the cinder_backends dict in your OSA config), does it work? Please see the reno at https://docs.openstack.org/releasenotes/openstack-ansible/ocata.html#id16 for more info on this change.

Changed in openstack-ansible:
assignee: Logan V (loganv) → Jesse Pretorius (jesse-pretorius)
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

FYI we now have continuous testing for this issue in the ceph scenario: https://review.openstack.org/#/q/Icc7cf75fcdc6bae3058a91ae7fb57fa3424246f0,n,z

The tests are passing, so I'm not sure what the difference is that causes this to fail in the affected environments; we'd need to identify that.

Changed in openstack-ansible:
assignee: Jesse Pretorius (jesse-pretorius) → nobody
Changed in openstack-ansible:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for openstack-ansible because there has been no activity for 60 days.]

Changed in openstack-ansible:
status: Incomplete → Expired
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Bug 1809454 might be of interest.
