TypeError in _get_inventory_and_update_provider_generation

Bug #1767139 reported by Shaun Crampton
This bug affects 4 people
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Medium      Matt Riedemann
  Pike                    Fix Committed  Medium      Oliver Walsh
  Queens                  Fix Committed  Medium      Oliver Walsh

Bug Description

Description
===========

Bringing up a new cluster as part of our CI after switching from 16.1.0 to 16.1.1 on CentOS, I'm seeing this error on some computes:

2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager Traceback (most recent call last):
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6752, in update_available_resource_for_node
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager rt.update_available_resource(context, nodename)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 704, in update_available_resource
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager self._update_available_resource(context, resources)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager return f(*args, **kwargs)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 728, in _update_available_resource
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager self._init_compute_node(context, resources)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 585, in _init_compute_node
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager self._update(context, cn)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 886, in _update
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager inv_data,
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 64, in set_inventory_for_provider
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager inv_data,
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager return getattr(self.instance, __name)(*args, **kwargs)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 789, in set_inventory_for_provider
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager self._update_inventory(rp_uuid, inv_data)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 56, in wrapper
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager return f(self, *a, **k)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 675, in _update_inventory
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager if self._update_inventory_attempt(rp_uuid, inv_data):
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 562, in _update_inventory_attempt
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager curr = self._get_inventory_and_update_provider_generation(rp_uuid)
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 546, in _get_inventory_and_update_provider_generation
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager if server_gen != my_rp['generation']:
2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager TypeError: 'NoneType' object has no attribute '__getitem__'

The error appears to persist for the lifetime of a single nova-compute run.
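
For reference, the TypeError at the bottom of the trace is simply what Python 2 raises when None is subscripted, i.e. my_rp is None rather than a provider dict. A minimal sketch:

    # Minimal sketch: subscripting None, as happens on the my_rp['generation']
    # line in _get_inventory_and_update_provider_generation.
    my_rp = None
    try:
        my_rp['generation']
    except TypeError as e:
        # Python 2: 'NoneType' object has no attribute '__getitem__'
        # Python 3 would phrase it: 'NoneType' object is not subscriptable
        print(e)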

Steps to reproduce
==================

The nodes were started by our CI infrastructure: 3 computes and a single control node. In roughly 50% of cases, one of the computes comes up in this bad state.

Expected result
===============

A working cluster, with all 3 computes joining successfully.

Actual result
=============

At least one of the 3 compute nodes fails to join the cluster: it is not picked up by discover_hosts, and I see the above stack trace repeated in the nova-compute logs.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

$ rpm -qa | grep nova
python-nova-16.1.1-1.el7.noarch
openstack-nova-common-16.1.1-1.el7.noarch
python2-novaclient-9.1.1-1.el7.noarch
openstack-nova-api-16.1.1-1.el7.noarch
openstack-nova-compute-16.1.1-1.el7.noarch

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

$ rpm -qa | grep kvm
libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64
qemu-kvm-common-ev-2.9.0-16.el7_4.14.1.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.14.1.x86_64

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

Not sure

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

Neutron with Calico (I work on Calico; this is our CI system)

Tags: placement
Revision history for this message
bodhi W (bodhi05) wrote :

I also encountered this problem.

Revision history for this message
Shaun Crampton (fasaxc) wrote :

bodhi05, did you mark the bug as affecting you using the button above? That would give the issue more weight.

Revision history for this message
Guang Yee (guang-yee) wrote :

We also ran into this bug. The weird part was that the error went away after restarting the nova-compute process. I wonder if there was a race condition.

tags: added: placement
Revision history for this message
Shaun Crampton (fasaxc) wrote :

I think I've found a workaround for my set-up: the error seems to happen if the compute starts before the control node is ready. If I delay installing the compute until the control node is fully up, things are more reliable.
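
A hypothetical helper illustrating that workaround in code: poll the placement endpoint until it answers before starting nova-compute. The URL, port, and timings are assumptions about the deployment, and the requests library is assumed to be available:

    import time

    import requests

    # Assumption: placement listens here (8778 is the usual RDO default).
    PLACEMENT_URL = 'http://controller:8778/'

    def wait_for_placement(timeout=300, interval=5):
        """Block until the placement API answers HTTP, or time out."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                # Any HTTP response means the service is listening; the
                # version document at the API root typically needs no token.
                requests.get(PLACEMENT_URL, timeout=5)
                return True
            except requests.RequestException:
                time.sleep(interval)
        return False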

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1765568 is related to this. We started seeing something similar in Ocata CI once this merged:

https://review.openstack.org/#/c/525309/

So we had to revert it here:

https://review.openstack.org/#/c/562739/

That likely means https://review.openstack.org/#/c/524618/ could have caused some regression in stable/pike, and https://review.openstack.org/#/c/524618/ was released in 16.1.1 which is the version you're reporting this against.

Looking at the code in 16.1.1, I can't really see how this is happening, because the related methods ensure the resource provider exists in placement first and then put it into the self._resource_providers cache, which is what this code is using when it fails.

If you can attach your full nova-compute log when this fails that would help us debug this.

Revision history for this message
Matt Riedemann (mriedem) wrote :

To be clear, when you say,

"It seems that this error happens if the compute starts before the control node is ready."

Does that include the placement service on the control node? If placement isn't running when nova-compute starts, nova-compute won't be able to create the compute node resource provider in the placement service, and that appears to be what is failing.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I know what the problem is:

(9:59:34 AM) mriedem: set_inventory_for_provider -> _ensure_resource_provider -> _create_resource_provider -> safe_connect returns None because it can't talk to placement yet
(9:59:41 AM) mriedem: https://review.openstack.org/#/c/524618/2/nova/scheduler/client/report.py@516
(9:59:44 AM) mriedem: so we put None in the cache
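
For reference, @safe_connect in nova/scheduler/client/report.py catches connection failures, logs a warning, and falls through, so the wrapped call implicitly returns None. A simplified sketch of the pattern and of how that None then reaches the cache (not the exact nova code; the builtin ConnectionError stands in for the keystoneauth1 exceptions nova actually catches):

    import functools

    def safe_connect(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except ConnectionError:
                # Swallow the failure; falling through means the wrapped
                # call returns None.
                pass
        return wrapper

    class Client(object):
        def __init__(self):
            self._resource_providers = {}

        @safe_connect
        def _create_resource_provider(self, uuid):
            # Simulates placement being down at nova-compute startup.
            raise ConnectionError('placement is not up yet')

        def _ensure_resource_provider(self, uuid):
            # Pre-fix behaviour: the None return value is cached blindly,
            # and later code subscripts it and hits the TypeError above.
            self._resource_providers[uuid] = self._create_resource_provider(uuid)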

Changed in nova:
status: New → Triaged
assignee: nobody → Matt Riedemann (mriedem)
importance: Undecided → High
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/566096

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Shaun Crampton (fasaxc) wrote :

Great, glad you spotted it. It's a bit awkward for me to collect a log now that I've worked around the problem; let me know if you still need it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/566161

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/566164

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/566096
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=80a001989351d3d427c204c8c06cfacc964f2a35
Submitter: Zuul
Branch: master

commit 80a001989351d3d427c204c8c06cfacc964f2a35
Author: Matt Riedemann <email address hidden>
Date: Thu May 3 11:21:47 2018 -0400

    Handle @safe_connect returns None side effect in _ensure_resource_provider

    Change I0c4ca6a81f213277fe7219cb905a805712f81e36 added more error
    handling to the _ensure_resource_provider flow but didn't account
    for @safe_connect returning None when calling _create_resource_provider
    in the case that nova-compute is started before placement is running.
    If that happens, we fail with a TypeError during the nova-compute
    startup because we put None in the resource provider cache and then
    later blindly try to use it because the compute node resource provider
    uuid is in the cache, but mapped to None.

    This adds the None check back in _ensure_resource_provider and if
    None is returned from _create_resource_provider we raise the same
    exception that _create_resource_provider would raise if it couldn't
    create the provider.

    Change-Id: If9e1581db9c1ae14340b787d03c815d243d5a50c
    Closes-Bug: #1767139
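
The shape of the fix, per the commit message above, in a simplified standalone sketch (names are local to the sketch, not the verbatim nova diff; the merged change raises the same nova exception that _create_resource_provider raises when creation fails):

    class ResourceProviderCreationFailed(Exception):
        """Local stand-in for the nova exception of the same name."""

    _resource_providers = {}

    def create_resource_provider(uuid, name):
        # Stand-in for the @safe_connect-wrapped call: returns None when
        # placement is unreachable.
        return None

    def ensure_resource_provider(uuid, name=None):
        rp = create_resource_provider(uuid, name or uuid)
        if rp is None:
            # The fix: raise here instead of caching None for later
            # lookups to trip over; a later periodic task can retry.
            raise ResourceProviderCreationFailed(name or uuid)
        _resource_providers[uuid] = rp
        return rp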

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/566161
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f95a10b26eb67796a4de32c32afe85c8f6a77048
Submitter: Zuul
Branch: stable/queens

commit f95a10b26eb67796a4de32c32afe85c8f6a77048
Author: Matt Riedemann <email address hidden>
Date: Thu May 3 11:21:47 2018 -0400

    Handle @safe_connect returns None side effect in _ensure_resource_provider

    Change I0c4ca6a81f213277fe7219cb905a805712f81e36 added more error
    handling to the _ensure_resource_provider flow but didn't account
    for @safe_connect returning None when calling _create_resource_provider
    in the case that nova-compute is started before placement is running.
    If that happens, we fail with a TypeError during the nova-compute
    startup because we put None in the resource provider cache and then
    later blindly try to use it because the compute node resource provider
    uuid is in the cache, but mapped to None.

    This adds the None check back in _ensure_resource_provider and if
    None is returned from _create_resource_provider we raise the same
    exception that _create_resource_provider would raise if it couldn't
    create the provider.

    Change-Id: If9e1581db9c1ae14340b787d03c815d243d5a50c
    Closes-Bug: #1767139
    (cherry picked from commit 80a001989351d3d427c204c8c06cfacc964f2a35)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/566164
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d75b995d03df18164e6338d8b30a7729b52d2133
Submitter: Zuul
Branch: stable/pike

commit d75b995d03df18164e6338d8b30a7729b52d2133
Author: Matt Riedemann <email address hidden>
Date: Thu May 3 11:21:47 2018 -0400

    Handle @safe_connect returns None side effect in _ensure_resource_provider

    Change I0c4ca6a81f213277fe7219cb905a805712f81e36 added more error
    handling to the _ensure_resource_provider flow but didn't account
    for @safe_connect returning None when calling _create_resource_provider
    in the case that nova-compute is started before placement is running.
    If that happens, we fail with a TypeError during the nova-compute
    startup because we put None in the resource provider cache and then
    later blindly try to use it because the compute node resource provider
    uuid is in the cache, but mapped to None.

    This adds the None check back in _ensure_resource_provider and if
    None is returned from _create_resource_provider we raise the same
    exception that _create_resource_provider would raise if it couldn't
    create the provider.

    Conflicts:

    nova/scheduler/client/report.py: entire method conflicted as
    provider_tree changes have not been backported.

    nova/tests/unit/scheduler/client/test_report.py - didn't conflict but
    referenced non-existent methods. Re-implemented based on
    test_ensure_resource_provider_create_fail changing mocked
    create_resource_provider to return None instead of raising an exception.

    Change-Id: If9e1581db9c1ae14340b787d03c815d243d5a50c
    Closes-Bug: #1767139
    (cherry picked from commit 80a001989351d3d427c204c8c06cfacc964f2a35)
    (cherry picked from commit f95a10b26eb67796a4de32c32afe85c8f6a77048)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.4

This issue was fixed in the openstack/nova 17.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.3

This issue was fixed in the openstack/nova 16.1.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b2

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.
