shared_targets_online_data_migration fails when cinder-volume service not running

Bug #1806156 reported by iain MacDonnell
Affects: Cinder
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

The shared_targets_online_data_migration online data migration fails if the cinder-volume service is not running, because the migration depends on an RPC response from the service to get_capabilities().

online_data_migrations are required to be capable of completing even if services are not online [1]. Otherwise several upgrade scenarios cannot be completed successfully.

The original/approved spec for shared_targets [2] described the migration being implemented thusly:

"During init of the volume service, we’ll query the backend capabilities for the setting “shared_targets”, if the value from the stats structure is False we’ll check any volumes that are set as True and update them accordingly. This way we migrated everything and set it to true, then on service init we verify that the setting is actually correct."

but apparently it was decided to not implement this. Had the spec been implemented as written, we wouldn't have this problem.

[1] https://github.com/openstack/cinder/commit/2cd5957c5e891f0bc5cf57253c0f5e18b330e954
[2] https://specs.openstack.org/openstack/cinder-specs/specs/queens/add-shared-targets-to-volume-ref.html
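
For illustration only, here is a rough sketch of what the spec-described approach could look like, with the check performed during cinder-volume startup using the driver's locally available capability report instead of an RPC call. The names and call signatures are assumptions based on typical Cinder driver/object APIs, not the actual implementation:

# Illustrative sketch only (not actual Cinder code): the spec's idea of
# correcting shared_targets during cinder-volume init, when the backend
# capabilities are available locally and no RPC round-trip is needed.
# (Assumes `from cinder import objects`; `self` is the volume manager.)
def _sync_shared_targets_on_init(self, ctxt):
    stats = self.driver.get_volume_stats(refresh=True) or {}
    if stats.get('shared_targets', True):
        return  # backend uses shared targets; nothing to correct
    # Backend does not use shared targets: flip any volumes on this host
    # that are still marked shared_targets=True.
    for vol in objects.VolumeList.get_all_by_host(ctxt, self.host):
        if vol.shared_targets:
            vol.shared_targets = False
            vol.save()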

Revision history for this message
iain MacDonnell (imacdonn) wrote :

My upgrade scenario (using my bespoke ansible-based deployment automation) involves shutting down all services, installing new code, then invoking the upgrade procedures for each component - e.g. apply any needed config changes, then "db sync", then start the service, then do online_data_migrations. I handle cinder-volume separately from the core cinder services (api, scheduler), because I run multiple instances of cinder-volume for different storage backends (I think of it almost more like an agent than a service), on separate control VMs. I do not update and start up the cinder-volume services until later, so they are not running when I would normally do the online_data_migrations for cinder, and the migration fails. I'm working around this for now by just not doing the online_data_migrations during the cinder upgrade - fortunately it seems that none have been added since Queens, so I don't actually need any ... for now.

I don't know how other distributions handle this, so maybe this issue only affects me right now.

I suppose I could rework my playbook to delay online_data_migrations until after cinder-volume services have been updated and restarted... but it seems I really shouldn't have to.

I haven't looked into FFU in any detail yet, but I keep hearing mention of it, and suspect it will become more prevalent in the future. I believe that it requires migrations to be capable of completing when services are not online.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

I'm catching the same problem even when all cinder-volumes are running: http://paste.openstack.org/show/737164/

So I'm not sure whether it's related to the non-running cinder-volumes.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Forgot to mention the trace from the logfile:

Dec 12 20:06:42 uacloud-mgmt02-cinder-api-container-86e8e87e cinder-manage[23905]:
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage [req-3a8a32be-e65d-4f90-b0b4-beca10d13ac9 - - - - -] Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID 76ff5fe3a1f44c0f8dc14fc28336bcb7
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage Traceback (most recent call last):
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/cinder/cmd/manage.py", line 351, in _run_migration
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/cinder/cmd/manage.py", line 128, in shared_targets_online_data_migration
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage non_shared_hosts, total_vols_to_update = _get_non_shared_target_hosts(ctxt)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/cinder/cmd/manage.py", line 111, in _get_non_shared_target_hosts
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/cinder/volume/rpcapi.py", line 341, in get_capabilities
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage return cctxt.call(ctxt, 'get_capabilities', discover=discover)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage retry=self.retry)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage retry=retry)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage call_monitor_timeout, retry=retry)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage call_monitor_timeout)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage message = self.waiters.get(msg_id, timeout=timeout)
2018-12-12 20:06:42.981 23905 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-12-12 20:06:42.981 23905 E...


Revision history for this message
iain MacDonnell (imacdonn) wrote :

The migration requires an RPC response from all of the hosts that have cinder-volume services. If it times out waiting for that response for any reason, the migration fails. The cinder-volume service being down is certainly one cause; there may be others. The point of this bug is that the migration should not assume that it can get a response from the service.
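
For what it's worth, one possible shape of a mitigation (an illustrative sketch only, not the actual Cinder code; the helper name is invented) would be to treat a messaging timeout as "no answer" and let the migration skip that host rather than abort:

import oslo_messaging

def get_capabilities_or_none(rpcapi, ctxt, host):
    """Hypothetical helper: fetch a host's capabilities over RPC, but
    return None instead of raising if the cinder-volume service for that
    host never replies."""
    try:
        return rpcapi.get_capabilities(ctxt, host, True)
    except oslo_messaging.MessagingTimeout:
        # No reply: the service is down, unreachable, or will never answer
        # (e.g. an "rbd:..." pseudo-host). The caller can skip this host
        # instead of failing the whole migration.
        return None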

Jay Bryant (jsbryant)
Changed in cinder:
status: New → Confirmed
Revision history for this message
Chris Martin (6-chris-z) wrote :

Also hitting this problem, using OpenStack-Ansible to upgrade from Queens to Rocky. Has anyone identified a workaround yet?

Revision history for this message
iain MacDonnell (imacdonn) wrote :

I don't know how OpenStack-Ansible sequences the upgrade, so I can't confirm whether it's susceptible to this bug, or how much you can manipulate it, but it appears there were no new online migrations added in Rocky, so (if you can find a way) you can probably skip the step entirely for a Queens->Rocky upgrade. You could run it manually afterwards if you want to be sure (once all of the services are back up).

Revision history for this message
Thiago Martins (martinx) wrote :

Hey guys,

I believe that I'm facing this very same problem while trying to deploy a fresh OpenStack Rocky using "openstack-ansible" project.

The "os-cinder-install.yml playbook fails, with the following error:

http://paste.openstack.org/show/744145

Basically, the following command fails:

`/openstack/venvs/cinder-18.1.3/bin/cinder-manage db online_data_migrations`

"Error attempting to run shared_targets_online_data_migration"

I checked and the cinder-volume is running inside of its LXC container as expected (I believe).

This isn't an upgrade from Queens! It's a fresh Rocky install, on top of freshly installed Ubuntu 18.04 (everything deployed and redeployed via MaaS).

Cheers!

Revision history for this message
iain MacDonnell (imacdonn) wrote :

Did you check cinder-manage.log ?

Revision history for this message
Thiago Martins (martinx) wrote :

Looks like after this systemd thing the logs were moved to God knows where...

There are no cinder-*.log files inside of the cinder-api containers, nor on /openstack/logs/* on the container's host, nor on the logs_hosts.

So... No, I didn't. :-/

I can manually run `cinder-manage db online_data_migrations` if it helps to get more info.

Revision history for this message
Thiago Martins (martinx) wrote :

Believe it or not, I just executed that command and it worked! LOL

I'm running the "os-cinder-install.yml" playbook again just to make sure that it's working.

Next step, try to create a Cinder volume and see if it hits Ceph.

Best!

Revision history for this message
iain MacDonnell (imacdonn) wrote :

Can't really help with the logging issue, although maybe you could try this and see if you can spot anything: journalctl --since '2019-01-28 23:12'

Looking at your ansible output again, it seems it took only 3 seconds to fail (if I'm reading it right), so it seems unlikely to be an RPC timeout.

Running it manually will at least tell you that your database is OK... it doesn't exactly tell you why it failed during the deployment, though (since the circumstances are different when you're doing it manually).

Revision history for this message
Thiago Martins (martinx) wrote :

I just reinstalled the whole thing from scratch, after going to Ubuntu MaaS and "Release -> Deploy" and then running `openstack-ansible setup-everything.yml` again.

I know that my fresh Ceph cluster doesn't come up HEALTHY after `openstack-ansible ceph-install.yml`, but I decided to proceed anyway, just to test it again as it was.

It fails at the same spot!

I noticed that when Ceph is in bad shape and I try to create a Cinder volume on it, the "cinder-volume" agent turns itself "DOWN", while the cinder-volume process is still running inside of the cinder container!

So this means that cinder-volume isn't in an "up" state, and I believe that it's triggering this bug as well!

What am I going to do next? The same thing I did before!

Which was to bring Ceph to a HEALTHY state and try `openstack-ansible os-cinder-install.yml` again!

Revision history for this message
Cody (codylab) wrote :

I also hit the same error while running OSA 18.1.3 on freshly built bare metals running Ubuntu 18.04 LTS.

I also attempted to run '/openstack/venvs/cinder-18.1.3/bin/cinder-manage db online_data_migrations' from within the cinder-api-container, but got the same error: https://pasted.tech/l/0826e4774b12bc29

Unlike @martinx's case, my Ceph cluster was in a HEALTHY state when the deployment failed.

Has anyone got a fix or workaround for this?

Revision history for this message
Ian Kumlien (pomac) wrote :

We also just hit this, and I'm wondering how this would work when cinder-manage service list returns a rbd:<poolname>@<config-definition> - what host should that contact?
Or is it all going through rabbitmq, and those hosts will just accept it?

Revision history for this message
Ian Kumlien (pomac) wrote :

Actually, we have all cinder services running but are still hitting this, just FYI.

Revision history for this message
Ian Kumlien (pomac) wrote :

Looking at it a bit more yields the following:
1. It can't handle rbd:
2. The volumes in ceph have shared_targets set
3. It's "capabilities = rpcapi.get_capabilities(" on line 111 that fails to get information somehow...

A quick hack gets information about the storage backend just fine.

It looks like this should have
if service.host[:4] == "rbd:":
    next
---

Revision history for this message
iain MacDonnell (imacdonn) wrote :

If possible, please post what you see in cinder-manage.log when this happens....

Revision history for this message
Cody (codylab) wrote :

The error is gone after I set the cinder service to be containerized:

# /etc/openstack_deploy/env.d/cinder-volume.yml
container_skel:
  cinder_volumes_container:
    properties:
      is_metal: false

Tested on OSA 18.1.3 with Ceph integration (stable-3.1).

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

I still catch this problem even with containerized cinder-volume on a brand-new setup of OSA 18.1.3.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

@Cody: Probably you've got need_online_data_migrations cached in your ansible_facts, so you're not facing the problem anymore.

@imacdonn I've posted my log in one of the first messages.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Btw, I started facing this issue with the following patch https://review.openstack.org/#/c/614617/ - I didn't test whether rolling it back helps, as I suspect that before this patch the migrations may simply not have worked at all (or only partially).

Revision history for this message
iain MacDonnell (imacdonn) wrote :

Yeah, well, prior to that fix, the migration could have been failing silently, and you would have been unaware.

Revision history for this message
Eric Smith (stephen-e-smith) wrote :

Did anyone identify a workaround for this? I verified that all Cinder services are up and RabbitMQ is working fine; I'm wondering if the migration is maybe trying to do something that I can skip or don't care about? I see the comment about line 111 (I assume this is from /usr/lib/python2.7/site-packages/cinder/cmd/manage.py) from Iain, but it didn't work for me.

information type: Public → Public Security
Revision history for this message
Jeremy Stanley (fungi) wrote :

I've now read through this entire report and it's still unclear to me why you suspect this indicates an exploitable security vulnerability in Cinder. I'm switching the bug type back to a normal "Public" state, but if you do have some reason for the OpenStack VMT to triage it as a report of a suspected security vulnerability, please provide a clear explanation in a comment so we can better classify this. Thanks!

information type: Public Security → Public
Revision history for this message
Eric Smith (stephen-e-smith) wrote :

It was accidental.

Revision history for this message
Ian Kumlien (pomac) wrote :

Sorry, my code should have said continue, I think - too many programming languages ;)

In cinder/cmd/manage.py
def _get_non_shared_target_hosts(ctxt):

Original:
    services = objects.ServiceList.get_all_by_topic(ctxt,
                                                    constants.VOLUME_TOPIC)
    for service in services:
        capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
---

Modified:
    services = objects.ServiceList.get_all_by_topic(ctxt,
                                                    constants.VOLUME_TOPIC)
    for service in services:
        if service.host[:4] == "rbd:":
            continue
        capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
---

Basically, skip all "unknown hosts with rbd: prefix" since they don't actually exist ;)

Revision history for this message
Ian Kumlien (pomac) wrote :

Someone asked for the error log, so here it is:

2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage [req-073ba3bc-6bba-4b39-9648-2b44c26ca1db - - - - -] Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID 61ccbe4e84f743dcbbb5d3bd0994a130
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage Traceback (most recent call last):
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 355, in _run_migration
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 129, in shared_targets_online_data_migration
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage non_shared_hosts, total_vols_to_update = _get_non_shared_target_hosts(ctxt)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 111, in _get_non_shared_target_hosts
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/volume/rpcapi.py", line 341, in get_capabilities
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage return cctxt.call(ctxt, 'get_capabilities', discover=discover)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage retry=self.retry)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage retry=retry)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage call_monitor_timeout, retry=retry)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage call_monitor_timeout)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage message = self.waiters.get(msg_id, timeout=timeout)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage 'to message ID %s' % msg_id)
2019-04-13 14:39:10.986 116 ERROR cinder.cmd.manage MessagingTimeout: Timed out waiting for a reply to message ID 61ccbe4e84f743dcbbb5d3bd0994a130

Revision history for this message
Ian Kumlien (pomac) wrote :

Gnn... formatting... here is a link:
http://paste.openstack.org/show/749273/

Revision history for this message
Jie Li (ramboman) wrote :

I also hit this problem. When we configure backend_host (not host) for cinder-volume HA and then execute the CLI "cinder-manage db online_data_migrations", we get the following error: "Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID".

Revision history for this message
ilian dimov (iliandimov80) wrote :

We are facing the same problem.

--------------------------
Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID 20ce42e8bb794b3f8be826e5e79779cb
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage Traceback (most recent call last):
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/cinder/cmd/manage.py", line 351, in _run_migration
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/cinder/cmd/manage.py", line 128, in shared_targets_online_data_migration
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage non_shared_hosts, total_vols_to_update = _get_non_shared_target_hosts(ctxt)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/cinder/cmd/manage.py", line 111, in _get_non_shared_target_hosts
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/cinder/volume/rpcapi.py", line 341, in get_capabilities
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage return cctxt.call(ctxt, 'get_capabilities', discover=discover)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage retry=self.retry)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage retry=retry)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage call_monitor_timeout, retry=retry)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage call_monitor_timeout)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage message = self.waiters.get(msg_id, timeout=timeout)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.5.dev7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage 'to message ID %s' % msg_id)
2019-04-17 17:22:21.607 285 ERROR cinder.cmd.manage MessagingTimeout: Timed out waiting for a reply to message ID ...


Revision history for this message
Vladislav Naydenov (ssap) wrote :

We have the same problem. Please provide a solution.

Revision history for this message
Henry Bonath (hbonath) wrote :

I am also getting hit by this bug, using OSA 18.1.4 Rocky.
With my deployment, I am simply re-running playbooks on a fully working system.

Taking what @Dmitriy Rabotjagov said about the need_online_data_migrations Ansible fact, I attempted to set all of mine to False, and re-running the os-cinder-install.yml playbook failed in the same spot. I didn't go as far as to comment out the TASK [Perform online data migrations] - but I suppose that may be another way to bypass this when using OSA.

Unless there is a workaround, we definitely seem to be stuck.

Revision history for this message
Chenjun Shen (cshen) wrote :

We're also hit by this bug, using OSA 18.1.5 Rocky, while running 'openstack-ansible setup-openstack.yml' during a major upgrade from Queens to Rocky.

Revision history for this message
f-ender (f-ender) wrote :

We've hit the same problem with 18.1.6 on Ubuntu 18.04.2 on a fresh install.

Is there any known workaround?

Revision history for this message
Chenjun Shen (cshen) wrote :

Whoever faces the same problem, please have a look at https://bugs.launchpad.net/cinder/+bug/1821207

It seemed to be volumes in ERROR state that caused the issue.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

No, the problem exists even when all cinder services are up (see my previous paste from 2018-12-12).

Revision history for this message
iain MacDonnell (imacdonn) wrote :

If the migration fails when the cinder-volume services are online, it's failing for some other reason, so not (directly) related to this bug.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

So should I submit another bug report? I'm almost 100% sure that all the people who said they're facing this bug in the context of OSA have running cinder-volumes, like I do. And this looks like something that might be reproducible...

Revision history for this message
iain MacDonnell (imacdonn) wrote :

Looking at your paste and log from 2018-12-12, the services were last updated at 19:57 but the failure occurred at 20:06, so it's not clear whether the services were actually operational at the time of the failure.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

OK, I've placed a fresh paste with clearer timings: http://paste.openstack.org/show/754329/

Revision history for this message
Erik McCormick (emccormickva) wrote :

Just came here to say I hit this as well in kolla-ansible during an upgrade from Queens to Rocky. All services were online and seemed to be working properly, but I hit the same RPC messaging timeout. I've commented out the migration in the playbook just to get by it, but I'm sure it'll rear its ugly head again come Stein upgrade time. I'm happy to share any logs if needed, but they mainly look like the above.

Revision history for this message
Chenjun Shen (cshen) wrote :

Today we did another Queens -> Rocky upgrade, and we hit the problem again.
This time, there were no ERROR state volumes in the db.

Revision history for this message
Chenjun Shen (cshen) wrote :

When we hit this problem, I also observed an RPC message timeout in the logfile.

2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage [req-ed246967-33af-4f36-950b-a3fb935dee99 - - - - -] Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID fcd7e35f688c497691a4f7f1333ee732
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage Traceback (most recent call last):
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/cinder/cmd/manage.py", line 351, in _run_migration
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/cinder/cmd/manage.py", line 128, in shared_targets_online_data_migration
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage non_shared_hosts, total_vols_to_update = _get_non_shared_target_hosts(ctxt)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/cinder/cmd/manage.py", line 111, in _get_non_shared_target_hosts
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/cinder/volume/rpcapi.py", line 341, in get_capabilities
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage return cctxt.call(ctxt, 'get_capabilities', discover=discover)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage retry=self.retry)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage retry=retry)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage call_monitor_timeout, retry=retry)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage call_monitor_timeout)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage message = self.waiters.get(msg_id, timeout=timeout)
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage File "/openstack/venvs/cinder-18.1.8/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-07-25 13:19:20.607 24665 ERROR cinder.cmd.manage 'to message ID ...


Revision history for this message
iain MacDonnell (imacdonn) wrote :

I'm not sure about all of the scenarios where the symptom is being observed, but the bottom line here is that migrations are supposed to be capable of completing when services are not running, and there will never be an RPC response if services are not running, so the way that the migration is currently implemented is fundamentally flawed.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

TBH it would be great to prioritize this bug somehow, since a lot of people are facing it and there's not even a good workaround.

Revision history for this message
Georgina Shippey (gshippey) wrote :

Ran into this issue as well on Stein 19.0.1dev14, Bionic 18.04, while performing a minor version upgrade.

ERROR cinder.cmd.manage [req-6ec0adcc-de8f-4b49-a39e-d874b3eb99f3 - - - - -] Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID 8801c0bf4ec14b1b939c673ff6e6c16b
ERROR cinder.cmd.manage Traceback (most recent call last):
ERROR cinder.cmd.manage File "/openstack/venvs/cinder-19.0.1.dev14/lib/python2.7/site-packages/cinder/cmd/manage.py", line 354, in _run_migration
...

Revision history for this message
Andreas Krebs (wintamute) wrote :

Also ran into this issue when upgrading Rocky to Stein using kolla-ansible; all cinder-volume containers were up and running.
I then checked the database via

openstack volume service list

and found some cinder-volume entries with state 'down' for no longer existing hostnames.
After removing all entries with state 'down' via

cinder-manage service remove cinder-volume $old_hostname

from inside the cinder-api container, the task completed successfully.

Revision history for this message
HT (h5t4) wrote :

We upgraded openstack-ansible from Rocky to Stein and cinder-manage also fails (see below for a fix/workaround).

~# cinder-manage --debug db online_data_migrations --max_count 1 ; echo $?
Error attempting to run volume_service_uuids_online_data_migration
+--------------------------------------------+--------------+-----------+
| Migration | Total Needed | Completed |
+--------------------------------------------+--------------+-----------+
| attachment_specs_online_data_migration | 0 | 0 |
| backup_service_online_migration | 0 | 0 |
| service_uuids_online_data_migration | 0 | 0 |
| shared_targets_online_data_migration | 0 | 0 |
| volume_service_uuids_online_data_migration | 0 | 0 |
+--------------------------------------------+--------------+-----------+
Some migrations failed unexpectedly. Check log for details.
2

~# journalctl --since "1 hour ago" | grep cinder.cmd.manage | grep Key
Aug 07 13:59:26 ik01-cinder-api-container-ebba9597 cinder-manage[326]: 2019-08-07 13:59:26.692 326 ERROR cinder.cmd.manage [req-d0a34065-a288-4b2b-9fad-1b915c1f4eea - - - - -] Error attempting to run volume_service_uuids_online_data_migration: KeyError: u'ik03@rbd-ec'

In our case the database was inconsistent, e.g. the 'service_uuid' field was sometimes NULL:
MariaDB [cinder]> select id,volume_type_id,status,service_uuid from volumes where host like '%ik03@rbd-ec%' and status!='deleted' and service_uuid is NULL;

Our workaround/fix was:
MariaDB [cinder]> update volumes set service_uuid='81d03e44-21ab-41ea-81b6-a3b5aac8a3b6' where host like '%ik03@rbd-ec%' and status!='deleted' and service_uuid is NULL;

~# /openstack/venvs/cinder-19.0.0/bin/cinder-manage --debug db online_data_migrations ; echo $?
Running batches of 50 until complete.
49 rows matched query volume_service_uuids_online_data_migration, 49 migrated
+--------------------------------------------+--------------+-----------+
| Migration | Total Needed | Completed |
+--------------------------------------------+--------------+-----------+
| attachment_specs_online_data_migration | 0 | 0 |
| backup_service_online_migration | 0 | 0 |
| service_uuids_online_data_migration | 0 | 0 |
| shared_targets_online_data_migration | 0 | 0 |
| volume_service_uuids_online_data_migration | 49 | 49 |
+--------------------------------------------+--------------+-----------+
0

Hope it helps someone save a few hours.

Revision history for this message
Chenjun Shen (cshen) wrote :

Yesterday we ran a FRESH install of openstack-ansible 18.1.8 (Rocky), but it unfortunately failed at the same place, in the cinder online_data_migrations.

It seems that it always fails in _get_non_shared_target_hosts(ctxt), which needs RPC communication.
http://paste.openstack.org/show/780650/

Although, according to the explanation, an online data migration shouldn't need any RPC messages. So I would suggest having a look at the code (_get_non_shared_target_hosts) to see why it needs RPC communication. I think that could be the breaking point.

Revision history for this message
Chenjun Shen (cshen) wrote :

I agree with what iain MacDonnell (imacdonn) explained in the ticket description.

The online data migration seems to be contradictory: on one side, it is supposed to complete when the cinder service is not running; on the other side, it needs an RPC response to learn the capabilities of the cinder hosts. So, as Iain said, the migration as currently implemented is fundamentally flawed.

A workaround could be this patch: https://bugs.launchpad.net/cinder/+bug/1837703
There is NO need to do the online data migration for Rocky, Stein and Train.

Revision history for this message
Eric Miller (erickmiller) wrote :

I ran into a similar situation as comment 48:
https://bugs.launchpad.net/cinder/+bug/1806156/comments/48

This was caused by some failed volume backups, which created "backup-vol-<uuid>" volumes that were detached.

Once these volumes were deleted, all rows in the volumes table that had NULL service_uuid values were marked as deleted.

The problem created the following entries in the cinder-manage.log file:

2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage [req-380963c9-fa87-4cba-91c1-e52f445db0f8 - - - - -] Error attempting to run volume_service_uuids_online_data_migration: KeyError: u'compute004@<redacted>'
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage Traceback (most recent call last):
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 354, in _run_migration
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/db/api.py", line 109, in volume_service_uuids_online_data_migration
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage return IMPL.volume_service_uuids_online_data_migration(context, max_count)
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1012, in wrapper
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage return fn(*args, **kwargs)
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 640, in volume_service_uuids_online_data_migration
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage v['service_uuid'] = svc_map[host[0]]
2019-10-06 19:10:48.277 12 ERROR cinder.cmd.manage KeyError: u'compute004@<redacted>'

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Wondering if that has been fixed with https://review.opendev.org/#/c/748482/

Revision history for this message
iain MacDonnell (imacdonn) wrote :

I doubt it; that change doesn't look related.

Revision history for this message
yule sun (syle87) wrote :

Hello everyone,
    I have the same problem when I use kolla-ansible to upgrade from Queens to Rocky. I deleted the volumes with NULL uuid from the database, and I also checked the cinder services in the cluster. Everything seems to be working fine, but when I do the upgrade I still get this error.
    Does anybody know how to fix it? I need your help.

2022-09-29 09:19:33.719 326 INFO cinder.rpc [req-e70a1e11-6cb2-4ee9-acf5-9725bc6a4f7b - - - - -] Automatically selected cinder-volume objects version 1.27 as minimum service version.
2022-09-29 09:19:33.728 326 INFO cinder.rpc [req-e70a1e11-6cb2-4ee9-acf5-9725bc6a4f7b - - - - -] Automatically selected cinder-volume RPC version 3.15 as minimum service version.
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage [req-e70a1e11-6cb2-4ee9-acf5-9725bc6a4f7b - - - - -] Error attempting to run shared_targets_online_data_migration: MessagingTimeout: Timed out waiting for a reply to message ID 0e25f193c9da43b987cbc5b894434220
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage Traceback (most recent call last):
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 351, in _run_migration
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage found, done = migration_meth(ctxt, count)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 128, in shared_targets_online_data_migration
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage non_shared_hosts, total_vols_to_update = _get_non_shared_target_hosts(ctxt)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/cmd/manage.py", line 111, in _get_non_shared_target_hosts
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage capabilities = rpcapi.get_capabilities(ctxt, service.host, True)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/cinder/volume/rpcapi.py", line 341, in get_capabilities
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage return cctxt.call(ctxt, 'get_capabilities', discover=discover)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage retry=self.retry)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage retry=retry)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 645, in send
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage call_monitor_timeout, retry=retry)
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
2022-09-29 09:20:35.486 326 ERROR cinder.cmd.manage call_monitor_timeout)
2022-09-29 09:20:35.486 326 E...


Revision history for this message
yule sun (syle87) wrote :

Great, I solved this problem. The steps to resolve it were:
1. Find the volume ids with service_uuid=NULL that were deleted from volumes at that time.
2. Clear the volume information from volume_glance_metadata.
3. Clean up these volumes from ceph's volumes pool.
Then executing `docker exec -it cinder_api cinder-manage --debug db online_data_migrations` reports no error.
