Error 500 trying to migrate an instance after wrong request_spec
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | | High | Matt Riedemann |
Ocata | | High | Matt Riedemann |
Pike | | High | Matt Riedemann |
Queens | | High | Matt Riedemann |
Rocky | | High | Matt Riedemann |
Stein | | High | Matt Riedemann |
Bug Description
We started an instance last Wednesday, and the compute node where it ran failed (maybe a hardware issue?). Since the networking looked wrong (i.e. missing network interfaces), I tried to migrate the instance.
According to Matt, it looks like the request_spec entry for the instance is wrong:
<mriedem> my guess is something like this happened: 1. create server in a group, 2. cold migrate the server which fails on host A and does a reschedule to host B which maybe also fails (would be good to know if previous cold migration attempts failed with reschedules), 3. try to cold migrate again which fails with the instance_group.uuid thing
<mriedem> the reschedule might be the key b/c like i said conductor has to rebuild a request spec and i think that's probably where we're doing a partial build of the request spec but missing the group uuid
Here's what I had in my nova_api DB:
{
    "nova_object.name": "RequestSpec",
    "nova_object.namespace": "nova",
    "nova_object.data": {
        "ignore_hosts": null,
        "requested_destination": ...,
        "instance_uuid": "...",
        "num_instances": 1,
        "image": {
            "nova_object.name": "ImageMeta",
            "nova_object.namespace": "nova",
            "nova_object.data": {
                "min_disk": 40,
                "min_ram": 0,
                ...
            },
            "nova_object.changes": [
                "min_ram",
                "min_disk"
            ]
        },
        "availability_zone": ...,
        "flavor": {
            "nova_object.name": "Flavor",
            "nova_object.namespace": "nova",
            "nova_object.data": {
                "id": 28,
                "name": "cpu2-ram6-disk40",
                "root_gb": 40,
                "vcpus": 2,
                "disabled": false,
                "flavorid": "e29f3ee9-",
                "deleted": false,
                "swap": 0,
                ...
            },
            ...
        },
        "force_hosts": null,
        "retry": null,
        "instance_group": {
            "nova_object.name": "InstanceGroup",
            "nova_object.namespace": "nova",
            "nova_object.data": {
                "members": null,
                "hosts": null,
                "policy": "anti-affinity"
            },
            "nova_object.changes": [
                "policy",
                "members",
                "hosts"
            ]
        },
        "scheduler_hints": {
            "group": [
            ]
        },
        "limits": {
            "nova_object.name": "SchedulerLimits",
            "nova_object.namespace": "nova",
            "nova_object.data": {
                "disk_gb": null,
                "vcpu": null
            },
            "nova_object.changes": [
                "disk_gb",
                "vcpu",
                ...
            ]
        },
        "force_nodes": null,
        "project_id": "1bf4dbb3d2c746",
        "user_id": "255cca4584c24b",
        "numa_topology": ...,
        "is_bfv": false,
        "pci_requests": {
            "nova_object.name": "InstancePCIRequests",
            "nova_object.namespace": "nova",
            "nova_object.data": {
                "requests": []
            },
            ...
        }
    },
    "nova_object.changes": [
        "ignore_hosts",
        "requested_destination",
        "num_instances",
        "image",
        "availability_zone",
        "instance_uuid",
        "flavor",
        "scheduler_hints",
        "pci_requests",
        "instance_group",
        "limits",
        "project_id",
        "user_id",
        "numa_topology",
        "is_bfv",
        "retry"
    ]
}
Matt Riedemann (mriedem) wrote: | #1 |
Matt Riedemann (mriedem) wrote: | #2 |
We can see here that the instance_group entry in the request spec is clearly missing the uuid even though there is a group scheduler hint:
"instance_
"
"
"
"members": null,
"hosts": null,
"policy": "anti-affinity"
},
"
"
"policy",
"members",
"hosts"
]
},
"scheduler_
"group": [
]
},
So I'm thinking we're somehow hitting this:
https:/
saving that, and then hitting this which triggers the error:
https:/
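For illustration, here is a minimal, standalone sketch of the failure mechanics. The InstanceGroup and LoadError classes below are hypothetical stand-ins for nova.objects.InstanceGroup and oslo.versionedobjects lazy-loading behavior, not the real code:

    class LoadError(Exception):
        pass

    class InstanceGroup:
        # Stand-in for nova.objects.InstanceGroup: accessing a field that was
        # never set behaves like obj_load_attr failing with "unable to load".
        def __init__(self, **fields):
            self._fields = fields

        def __getattr__(self, name):
            try:
                return self._fields[name]
            except KeyError:
                raise LoadError('Object action obj_load_attr failed because: '
                                'unable to load %s' % name)

    # A reschedule rebuilds the group from legacy filter_properties, which
    # carry hosts/members/policy but not the group uuid.
    group = InstanceGroup(hosts=None, members=None, policy='anti-affinity')

    # Loading the persisted spec later re-reads the group by uuid (the second
    # link above), which is what blows up as an HTTP 500:
    try:
        print('reloading group', group.uuid)
    except LoadError as exc:
        print('migrate fails with:', exc)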
Fix proposed to branch: master
Review: https:/
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) wrote: | #4 |
This might explain what's happening during a cold migration.
Conductor creates a legacy filter_properties dict here:
https:/
If the spec has an instance_group it will call here:
https:/
and _to_legacy_group_info returns:
    return {'group_updated': True,
            'group_hosts': set(request_spec.instance_group.hosts),
            'group_policies': request_spec.instance_group.policies}
Note there is no group_uuid.
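For comparison, a sketch of how the follow-up change noted below ("Set/get group uuid when transforming RequestSpec to/from filter_properties") closes this hole; illustrative only, not the exact nova diff:

    def _to_legacy_group_info(request_spec):
        group = request_spec.instance_group
        info = {'group_updated': True,
                'group_hosts': set(group.hosts),
                'group_policies': group.policies}
        # The fix also round-trips the uuid so a RequestSpec rebuilt from
        # these filter_properties can still load the group from the DB.
        if getattr(group, 'uuid', None):
            info['group_uuid'] = group.uuid
        return info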
Those filter_properties are passed to the prep_resize method on the dest compute:
https:/
zigo said he hit this:
https:/
(10:03:07 AM) zigo: 2019-05-28 15:02:35.534 30706 ERROR nova.compute.
which will trigger a reschedule here:
https:/
The _reschedule_resize_or_reschedule method is called here:
https:/
Note that in Rocky the RequestSpec is not passed back to conductor on the reschedule, only the filter_properties:
https:/
We only started passing the RequestSpec from compute to conductor on reschedule starting in Stein: https:/
Without the request spec we get here in conductor:
https:/
Note that we pass in the filter_properties but no instance_group to RequestSpec.from_components.
And because there is no instance_group but there are filter_properties, we call _populate_group_info:
https:/
Which means we get into this block that sets the RequestSpec.instance_group without a uuid:
https:/
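Roughly, that block does the following; this is a sketch reusing the stand-in InstanceGroup from the earlier example, not the actual RequestSpec._populate_group_info code:

    def _populate_group_info(spec, filter_properties):
        # Legacy filter_properties carry group hosts/policies but no uuid,
        # so the InstanceGroup built here has no uuid field set. Once this
        # spec is persisted, the corruption sticks until repaired.
        if filter_properties.get('group_updated'):
            spec.instance_group = InstanceGroup(
                hosts=filter_properties.get('group_hosts'),
                policies=filter_properties.get('group_policies'))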
Then we eventually RPC cast off to prep_resize on the next host to try for the cold migration and save the request_spec changes here:
https:/
Which is how later attempts to use that request spec to migrate the instance blow up when loading it from the DB, because spec.instance_group.uuid is not set.
Changed in nova:
importance: Undecided → High
Matt Riedemann (mriedem) wrote: | #5 |
This goes back to ocata because of this change:
https:/
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote: | #7 |
Related fix proposed to branch: master
Review: https:/
Related fix proposed to branch: stable/stein
Review: https:/
Related fix proposed to branch: stable/rocky
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit c96c7c5e13bde39
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
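In outline, the recreate test exercises the scenario from comment #4. The method and helper names below are placeholders, not the actual functional test code:

    def test_cold_migrate_reschedule_blows_up(self):
        # 1. Boot a server in an anti-affinity group.
        server = self._boot_server_in_group(policy='anti-affinity')

        # 2. Make the first cold migration fail on the dest host so compute
        #    casts back to conductor with only filter_properties (pre-Stein),
        #    and conductor persists a RequestSpec whose instance_group has
        #    no uuid field.
        self._fail_prep_resize_once()
        self.api.post_server_action(server['id'], {'migrate': None})

        # 3. The next migration loads the corrupted spec from the DB and the
        #    API returns a 500 (ObjectActionError: unable to load uuid).
        ex = self.assertRaises(client.OpenStackApiException,
                               self.api.post_server_action,
                               server['id'], {'migrate': None})
        self.assertEqual(500, ex.response.status_code)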
Related fix proposed to branch: stable/queens
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit da453c2bfe86ab7
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
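The shape of the workaround, as a hedged sketch (the real change hooks into loading the spec from the DB; the helper name here is illustrative):

    def _group_uuid_from_hints(spec):
        # The original boot request recorded the group in the scheduler
        # hints, e.g. "scheduler_hints": {"group": ["<uuid>"]}, even when a
        # rebuilt instance_group lost its uuid field. Recover it from there
        # before lazy-loading the group from the DB.
        hints = getattr(spec, 'scheduler_hints', None) or {}
        group_hint = hints.get('group')
        return group_hint[0] if group_hint else None

Note that, as comments #31-#33 below point out, specs whose scheduler_hints were also lost during a reschedule cannot be repaired this way.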
Changed in nova:
status: In Progress → Fix Released
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6b4f89ab59a9ed6
Author: Matt Riedemann <email address hidden>
Date: Fri May 31 15:26:24 2019 -0400
Set/get group uuid when transforming RequestSpec to/from filter_properties
As a follow-up to change I20981c987549ee,
we can avoid having an incomplete InstanceGroup by updating
the _to_legacy_group_info and _populate_group_info methods to pass
the group uuid to/from the filter_properties.
Change-Id: I164a6dee1e92a6
Related-Bug: #1830747
Fix proposed to branch: stable/stein
Review: https:/
Fix proposed to branch: stable/rocky
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 8478a754802e29d
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
(cherry picked from commit c96c7c5e13bde39
tags: added: in-stable-stein
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 8569eb9b4fb905c
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
(cherry picked from commit da453c2bfe86ab7
Fix proposed to branch: stable/queens
Review: https:/
Related fix proposed to branch: stable/pike
Review: https:/
Fix proposed to branch: stable/pike
Review: https:/
Related fix proposed to branch: stable/ocata
Review: https:/
Fix proposed to branch: stable/ocata
Review: https:/
This issue was fixed in the openstack/nova 19.0.1 release.
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit a0a187c9bb9bef1
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
NOTE(mriedem): The ComputeTaskAPI change is dropped
in this backport because it is not needed before Stein. Also, the
PlacementFixture setup differs in Rocky.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
(cherry picked from commit c96c7c5e13bde39
(cherry picked from commit 8478a754802e29d
tags: added: in-stable-rocky
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 9fed1803b4d6b27
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
(cherry picked from commit da453c2bfe86ab7
(cherry picked from commit 8569eb9b4fb905c
Thomas Goirand (thomas-goirand) wrote: | #26 |
FYI, the patched version of Nova (Rocky) just reached Debian Buster.
This issue was fixed in the openstack/nova 18.2.1 release.
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 581df2c98676b67
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
NOTE(mriedem): In this version we have to request a specific port
to avoid a NetworkAmbiguous failure when creating the server.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
(cherry picked from commit c96c7c5e13bde39
(cherry picked from commit 8478a754802e29d
(cherry picked from commit a0a187c9bb9bef1
tags: added: in-stable-queens
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 20b90f2e26e6a46
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Conflicts:
NOTE(mriedem): The conflict is due to not having change
Ib33719a4b9 in Queens.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
(cherry picked from commit da453c2bfe86ab7
(cherry picked from commit 8569eb9b4fb905c
(cherry picked from commit 9fed1803b4d6b27
This issue was fixed in the openstack/nova 17.0.11 release.
Yang Youseok (ileixe) wrote: | #31 |
FYI, we recently encountered this bug and found there is no scheduler_hints in the request_spec (we are using Ocata and the server group was created in the past)...
Yang Youseok (ileixe) wrote: | #32 |
I think the scheduler_hints no longer exist after the RequestSpec was overwritten during a reschedule. In that case, the VM cannot be recovered even after this workaround patch is applied.
Arvydas O. (zebediejus) wrote: | #33 |
As Yang noticed, the uuid is missing in scheduler_hints as well as in instance_groups. After our upgrade from Mitaka (previously upgraded from Liberty) to Rocky we had the same problem. Not only was cold migration/resize failing, but also "nova-manage db online_data_migrations".
As a workaround we solved it by inserting scheduler_hints directly into the DB (use at your own risk):
--UPDATE nova_api.request_specs rs
INNER JOIN nova.instances ins ON ins.uuid = rs.instance_uuid
INNER JOIN nova_api.instance_group_member igm ON igm.instance_uuid = ins.uuid
INNER JOIN nova_api.instance_groups ig ON ig.id = igm.group_id
SET rs.spec = REPLACE(rs.spec, ', "scheduler_hints": {}},', CONCAT(', "scheduler_hints": {"group": ["', ig.uuid, '"]}},'))
WHERE ins.deleted = 0 AND rs.spec LIKE '%"policies"%' AND rs.spec NOT LIKE '%"uuid":%' AND ins.uuid = '3cdd744e-'
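Before running an UPDATE like that, a read-only pre-check along these lines can list the affected rows first. This is a sketch; the PyMySQL usage and connection details are assumptions:

    import json
    import pymysql  # assumed available; any MySQL client works

    conn = pymysql.connect(host='localhost', user='nova', password='...',
                           database='nova_api')
    with conn.cursor() as cur:
        cur.execute('SELECT instance_uuid, spec FROM request_specs')
        for instance_uuid, blob in cur.fetchall():
            data = json.loads(blob).get('nova_object.data', {})
            group = (data.get('instance_group') or {}).get('nova_object.data', {})
            hints = data.get('scheduler_hints') or {}
            # Corrupted: a group policy is recorded, but neither the group
            # nor the scheduler hints carry the group uuid.
            has_policy = 'policy' in group or 'policies' in group
            if has_policy and 'uuid' not in group and not hints.get('group'):
                print('corrupted request_spec for instance', instance_uuid)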
This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 09ec97b95b19a42
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
NOTE(mriedem): In this version we have to use the MediumFakeDriver
because change I12de2e19502259 is not
in Pike, so resizing to the same host does not work with the
SmallFakeDriver.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
(cherry picked from commit c96c7c5e13bde39
(cherry picked from commit 8478a754802e29d
(cherry picked from commit a0a187c9bb9bef1
(cherry picked from commit 581df2c98676b67
tags: added: in-stable-pike
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 79cc08642172a3d
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
(cherry picked from commit da453c2bfe86ab7
(cherry picked from commit 8569eb9b4fb905c
(cherry picked from commit 9fed1803b4d6b27
(cherry picked from commit 20b90f2e26e6a46
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit c41fe944dbf554e
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 13:59:20 2019 -0400
Add regression recreate test for bug 1830747
Before change I4244f7dd8fe745 in Stein, when
a cold migration would reschedule to conductor it would not send the
RequestSpec, only the filter_properties. The filter_properties contain
a primitive version of the instance group information from the RequestSpec
for things like the group members, hosts and policies, but not the uuid.
When conductor is trying to reschedule the cold migration without a
RequestSpec, it builds a RequestSpec from the components it has, like the
filter_properties. This results in a RequestSpec with an instance_group
field set but with no uuid field in the RequestSpec.instance_group.
That RequestSpec gets persisted and then, because of change
Ie70c77db75 in Rocky, later attempts to load the
RequestSpec from the database will fail because of the missing
RequestSpec.instance_group.uuid.
The test added here recreates the pre-Stein scenario which could still
be a problem (on master) for any corrupted RequestSpecs for older
instances.
NOTE(mriedem): In this version we have to use the FakeDriver because the
MediumFakeDriver would fail the
DiskFilter since we are using placement during scheduling.
Change-Id: I05700c97f756ed
Related-Bug: #1830747
(cherry picked from commit c96c7c5e13bde39
(cherry picked from commit 8478a754802e29d
(cherry picked from commit a0a187c9bb9bef1
(cherry picked from commit 581df2c98676b67
(cherry picked from commit 09ec97b95b19a42
tags: added: in-stable-ocata
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 3390c7af7ac7741
Author: Matt Riedemann <email address hidden>
Date: Tue May 28 11:24:11 2019 -0400
Workaround missing RequestSpec.instance_group.uuid
It's clear that we could have a RequestSpec.instance_group
without a uuid field if the InstanceGroup is set from the
_populate_group_info method which is used for
legacy translation of request specs using legacy filter
properties dicts.
To workaround the issue, we look for the group scheduler hint
to get the group uuid before loading it from the DB.
The related functional regression recreate test is updated
to show this solves the issue.
Change-Id: I20981c987549ee
Closes-Bug: #1830747
(cherry picked from commit da453c2bfe86ab7
(cherry picked from commit 8569eb9b4fb905c
(cherry picked from commit 9fed1803b4d6b27
(cherry picked from commit 20b90f2e26e6a46
(cherry picked from commit 79cc08642172a3d
This is the error by the way:
http://paste.openstack.org/show/752159/
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi [req-1ca4c1d0-9f6f-4a04-860d-1d3d03a0d063 9fb4630d74ad49e8ac9f4e8a72b8cafb 504ea0a356ca4066aaa617daff869463 - default default] Unexpected exception in API method: nova.exception.ObjectActionError: Object action obj_load_attr failed because: unable to load uuid
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/api/openstack/wsgi.py", line 801, in wrapped
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return f(*args, **kwargs)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/api/validation/__init__.py", line 110, in wrapper
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return func(*args, **kwargs)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/api/openstack/compute/migrate_server.py", line 56, in _migrate
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     host_name=host_name)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 205, in inner
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return function(self, context, instance, *args, **kwargs)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 213, in _wrapped
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return fn(self, context, instance, *args, **kwargs)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 153, in inner
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return f(self, context, instance, *args, **kw)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 3516, in resize
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     context, instance.uuid)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     result = fn(cls, context, *args, **kwargs)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/objects/request_spec.py", line 531, in get_by_instance_uuid
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     return cls._from_db_object(context, cls(), db_spec)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/nova/objects/request_spec.py", line 510, in _from_db_object
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi     context, spec.instance_group.uuid)
2019-05-28 13:40:16.610 159865 ERROR nova.api.openstack.wsgi   File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 67, in getter
20...