test_live_block_migration fails in gate-grenade-dsvm-neutron-multinode-live-migration-nv with "Shared storage live-migration requires either shared storage or boot-from-volume with no local disks." since ~8/18/2017

Bug #1720191 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

The gate-grenade-dsvm-neutron-multinode-live-migration-nv job has been failing at about 100% since August 18:

http://graphite.openstack.org/render/?from=-2160hours&height=500&until=now&width=800&bgcolor=ffffff&fgcolor=000000&yMax=100&yMin=0&vtitle=Failure%20Rate%20in%20Percent&title=Test%20failure%20rates%20over%20last%202160%20hours%20%2812%20hour%20rolling%20average%29&drawNullAsZero=true&&target=lineWidth(color(alias(movingAverage(asPercent(transformNull(stats_counts.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.FAILURE),transformNull(sum(stats_counts.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.{SUCCESS,FAILURE}))),%2712hours%27),%20%27gate-grenade-dsvm-neutron-multinode-live-migration-nv%20%28check%29%27),%27ff0000%27),1)

With this failure:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/subnode-2/screen-n-cpu.txt.gz?level=TRACE#_Sep_26_14_28_11_637958

Sep 26 14:28:11.637958 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server [None req-b8c949a1-606a-48dc-88df-f848ac421d75 tempest-LiveMigrationRemoteConsolesV26Test-1985366765 tempest-LiveMigrationRemoteConsolesV26Test-1985366765] Exception during message handling: InvalidSharedStorage: ubuntu-xenial-2-node-rax-ord-11140716-924370 is not on shared storage: Shared storage live-migration requires either shared storage or boot-from-volume with no local disks.
Sep 26 14:28:11.638227 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Sep 26 14:28:11.638476 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
Sep 26 14:28:11.638699 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Sep 26 14:28:11.638980 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
Sep 26 14:28:11.639207 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Sep 26 14:28:11.639413 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
Sep 26 14:28:11.639617 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Sep 26 14:28:11.639823 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/exception_wrapper.py", line 76, in wrapped
Sep 26 14:28:11.639981 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
Sep 26 14:28:11.640148 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Sep 26 14:28:11.640311 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server self.force_reraise()
Sep 26 14:28:11.640449 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Sep 26 14:28:11.640609 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Sep 26 14:28:11.640863 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/exception_wrapper.py", line 67, in wrapped
Sep 26 14:28:11.641005 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
Sep 26 14:28:11.641157 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/compute/utils.py", line 874, in decorated_function
Sep 26 14:28:11.641301 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Sep 26 14:28:11.641460 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/compute/manager.py", line 217, in decorated_function
Sep 26 14:28:11.641601 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
Sep 26 14:28:11.641731 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
Sep 26 14:28:11.641848 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server self.force_reraise()
Sep 26 14:28:11.641982 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
Sep 26 14:28:11.642132 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
Sep 26 14:28:11.642291 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/compute/manager.py", line 205, in decorated_function
Sep 26 14:28:11.642474 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
Sep 26 14:28:11.642659 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/compute/manager.py", line 5380, in check_can_live_migrate_source
Sep 26 14:28:11.642827 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server block_device_info)
Sep 26 14:28:11.643055 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 6001, in check_can_live_migrate_source
Sep 26 14:28:11.643637 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[32137]: ERROR oslo_messaging.rpc.server raise exception.InvalidSharedStorage(reason=reason, path=source)

Looking at the tests that fail and pass:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/console.html#_2017-09-26_14_28_03_894125

Anything using block_migration=auto via the 2.25 microversion passes, while the tests on microversions older than 2.25, which send an explicit block_migration flag, fail in this mixed-compute environment (one Pike node, one Queens node).

In this failure case, the compute raising the error is the older (n-1) Pike node:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/subnode-2/screen-n-cpu.txt.gz

Sep 26 14:09:42.382331 ubuntu-xenial-2-node-rax-ord-11140716-924370 nova-compute[23418]: INFO nova.service [-] Starting compute node (version 16.0.1)

So we might be sending something from the queens compute that the pike compute doesn't understand.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at the nova commits between August 16 and August 23, I don't really see anything related to this.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This grenade change to support grenade runs from pike to queens merged on August 16:

https://github.com/openstack-dev/grenade/commit/eeed61bcff8865c4ea499d5136af82298ca7a829

But I'm not sure why that would cause any issues here, especially since some of the tests pass and some fail.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I noticed this in the console log output of the failed run:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/console.html#_2017-09-26_13_39_42_639694

2017-09-26 13:39:42.639694 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:setup_localrc:L570: localrc_set devstack.local.conf.base USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION True

It looks like devstack-gate sets that for multinode jobs:

https://github.com/openstack-infra/devstack-gate/blob/3d915fe3aa193fcdbb17b7e4af86691bcf4322f7/devstack-vm-gate.sh#L570

localrc_set "$localrc_file" "USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION" "True"
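
(localrc_set just records the key/value pair in the generated local.conf fragment; roughly equivalent to the following, simplifying what the real helper does:)

    # Simplified equivalent of the localrc_set call above; the real helper also
    # takes care of placing the line in the [[local|localrc]] section.
    echo "USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION=True" >> devstack.local.conf.base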

But that's not what is getting set in tempest.conf on the queens side:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/new/tempest_conf.txt.gz

block_migration_for_live_migration = False

But it is set to True on the pike side:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/old/tempest_conf.txt.gz

block_migration_for_live_migration = True
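
For context, that devstack variable only reaches tempest.conf through the tempest configuration step, which devstack runs with something roughly like the following (a simplified sketch using devstack's iniset helper, not the exact upstream code):

    # Simplified sketch: map the devstack variable into tempest.conf, falling
    # back to False if the variable isn't visible when tempest is configured.
    USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION=${USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION:-False}
    iniset "$TEMPEST_CONFIG" compute-feature-enabled \
        block_migration_for_live_migration "$USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION"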

--

The API is running on the Queens side, so it's passing block_migration=False:

2017-09-26 14:28:50.220536 | 2017-09-26 14:28:50.220 | 2017-09-26 14:28:15,412 18955 INFO [tempest.lib.common.rest_client] Request (LiveMigrationTest:test_live_block_migration): 400 POST http://10.211.2.15/compute/v2.1/servers/137eba6c-78eb-4ba1-aa85-f019579f1bb0/action 1.185s
2017-09-26 14:28:50.222127 | 2017-09-26 14:28:50.221 | 2017-09-26 14:28:15,413 18955 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-09-26 14:28:50.223560 | 2017-09-26 14:28:50.223 | Body: {"os-migrateLive": {"disk_over_commit": false, "block_migration": false, "host": "ubuntu-xenial-2-node-rax-ord-11140716-924370"}}
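
For comparison, tests pinned to microversion 2.25 or later can delegate the decision to the source compute instead of relying on that tempest.conf flag; a hedged sketch of such a request (the URL, token, and server UUID are placeholders; 2.25 also drops disk_over_commit):

    # Illustrative request only; with microversion >= 2.25 the caller may send
    # block_migration="auto" and let the source compute decide.
    curl -s -X POST "http://10.211.2.15/compute/v2.1/servers/$SERVER_UUID/action" \
        -H "X-Auth-Token: $TOKEN" \
        -H "Content-Type: application/json" \
        -H "X-OpenStack-Nova-API-Version: 2.25" \
        -d '{"os-migrateLive": {"block_migration": "auto", "host": null}}'

That matches the pattern above: the block_migration=auto tests pass while the older-microversion tests hit the shared-storage check.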

So there is something about how that flag gets set in local.conf which isn't working, and it might actually be related to the grenade change:

https://github.com/openstack-dev/grenade/commit/eeed61bcff8865c4ea499d5136af82298ca7a829

Revision history for this message
Matt Riedemann (mriedem) wrote :

This change was made in devstack-gate on 8/23:

https://github.com/openstack-infra/devstack-gate/commit/d236e4ca7d1ec03a5de665050c38ad3ada285e4f

But it doesn't look too related.

Revision history for this message
Matt Riedemann (mriedem) wrote :

local.conf for the old pike node is getting the variable set from devstack-gate:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/old/local_conf.txt.gz

USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION=True

So is the local.conf on the new queens node:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/new/local.conf.txt.gz

USE_BLOCK_MIGRATION_FOR_LIVE_MIGRATION=True

But ^ isn't getting set in tempest.conf on the new side:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/new/tempest_conf.txt.gz

block_migration_for_live_migration = False

Looks like grenade copies the tempest.conf from old to new here:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/grenade.sh.txt.gz#_2017-09-26_14_23_46_489

Revision history for this message
Matt Riedemann (mriedem) wrote :

OK, so when we're configuring nova for rbd, the compute service is running with "['--config-file', '/etc/nova/nova.conf']" because the job is configured for singleconductor mode in grenade via:

https://github.com/openstack-dev/grenade/commit/eeed61bcff8865c4ea499d5136af82298ca7a829

But the post-test hook is configuring nova-cpu.conf instead:

http://logs.openstack.org/87/463987/20/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/ae8875f/logs/subnode-2/etc/nova/nova-cpu.conf.txt.gz

That file isn't read in singleconductor mode, so the compute isn't configured properly.
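
A minimal sketch of the kind of check the ceph setup needs, assuming the usual devstack variable names (this is not the literal patch):

    # In singleconductor mode the compute service only reads /etc/nova/nova.conf,
    # so the rbd options have to land there rather than in nova-cpu.conf.
    if [[ "${CELLSV2_SETUP:-superconductor}" == "singleconductor" ]]; then
        NOVA_CPU_CONF=/etc/nova/nova.conf
    else
        NOVA_CPU_CONF=/etc/nova/nova-cpu.conf
    fi
    # Example of the sort of option the hook writes for rbd-backed ephemeral disks:
    iniset "$NOVA_CPU_CONF" libvirt images_type rbd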

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/508271

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/508271
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3e3309672f399be1bc3558e260104a2db2595970
Submitter: Zuul
Branch: master

commit 3e3309672f399be1bc3558e260104a2db2595970
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 28 14:52:56 2017 -0400

    Fix live migration grenade ceph setup

    Grenade runs in singleconductor mode for queens
    as of change:

      If4c82ca12fe7b8b1ca7cfd8181d24dbd8dad3baa

    However, the nova configuration during ceph setup
    was using NOVA_CPU_CONF which is /etc/nova/nova-cpu.conf,
    which is not what we want when configuring nova.conf
    for the compute service in singleconductor mode.

    Devstack has similar logic for stuff like this, so
    we just have to handle it here since we're in a special
    snowflake.

    The stable/queens systemd stuff is all removed too since
    we run with systemd on the old pike side and can restart
    services properly with systemd on the new queens side
    during grenade live migration CI job runs.

    Change-Id: Iccb8eb55a5cc2a3d08e7fd6e31c89b3b5f8d0c70
    Closes-Bug: #1720191

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b2

This issue was fixed in the openstack/nova 17.0.0.0b2 development milestone.
