Ironic: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' - during select_destinations()

Bug #1654102 reported by Matt Riedemann
This bug affects 1 person
Affects                     Status         Importance  Assigned to       Milestone
OpenStack Compute (nova)    Fix Released   Medium      Matt Riedemann
  Newton                    Fix Committed  Medium      Roman Podoliaka

Bug Description

This looks like a duplicate of bug 1610679 but that's for the non-Ironic scheduler host manager, this is for the Ironic host manager.

Seen here:

http://logs.openstack.org/20/407220/2/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/0779f94/logs/screen-n-sch.txt.gz#_2017-01-04_19_05_16_398

2017-01-04 19:05:16.398 20709 DEBUG oslo_concurrency.lockutils [req-6ce318a2-53e5-4b80-84a5-f896ab48d627 tempest-BaremetalMultitenancy-1820943769 tempest-BaremetalMultitenancy-1820943769] Lock "(u'ubuntu-xenial-2-node-rax-ord-6478210', u'21da4933-a128-45f6-a765-7e6bc071e0f3')" released by "nova.scheduler.host_manager._locked_update" :: held 0.002s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
2017-01-04 19:05:16.398 20709 DEBUG oslo_concurrency.lockutils [req-6ce318a2-53e5-4b80-84a5-f896ab48d627 tempest-BaremetalMultitenancy-1820943769 tempest-BaremetalMultitenancy-1820943769] Lock "(u'ubuntu-xenial-2-node-rax-ord-6478210-359401', u'00d62acd-6f3b-4cb8-9668-12517c84b3b9')" acquired by "nova.scheduler.host_manager._locked_update" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2017-01-04 19:05:16.398 20709 DEBUG nova.scheduler.host_manager [req-6ce318a2-53e5-4b80-84a5-f896ab48d627 tempest-BaremetalMultitenancy-1820943769 tempest-BaremetalMultitenancy-1820943769] Update host state from compute node: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='',created_at=2017-01-04T19:02:54Z,current_workload=None,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=10,free_disk_gb=None,free_ram_mb=None,host='ubuntu-xenial-2-node-rax-ord-6478210-359401',host_ip=10.210.37.31,hypervisor_hostname='00d62acd-6f3b-4cb8-9668-12517c84b3b9',hypervisor_type='ironic',hypervisor_version=1,id=8,local_gb=10,local_gb_used=0,memory_mb=384,memory_mb_used=0,metrics=None,numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.0,running_vms=None,service_id=None,stats={cpu_arch='x86_64'},supported_hv_specs=[HVSpec],updated_at=2017-01-04T19:04:51Z,uuid=521cf775-5a16-4111-bf26-c30fb6725716,vcpus=1,vcpus_used=0) _locked_update /opt/stack/new/nova/nova/scheduler/host_manager.py:168
2017-01-04 19:05:16.400 20709 DEBUG oslo_concurrency.lockutils [req-6ce318a2-53e5-4b80-84a5-f896ab48d627 tempest-BaremetalMultitenancy-1820943769 tempest-BaremetalMultitenancy-1820943769] Lock "(u'ubuntu-xenial-2-node-rax-ord-6478210-359401', u'00d62acd-6f3b-4cb8-9668-12517c84b3b9')" released by "nova.scheduler.host_manager._locked_update" :: held 0.002s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server [req-6ce318a2-53e5-4b80-84a5-f896ab48d627 tempest-BaremetalMultitenancy-1820943769 tempest-BaremetalMultitenancy-1820943769] Exception during message handling
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 218, in inner
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/manager.py", line 84, in select_destinations
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server dests = self.driver.select_destinations(ctxt, spec_obj)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 51, in select_destinations
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server selected_hosts = self._schedule(context, spec_obj)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 96, in _schedule
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server hosts = self._get_all_host_states(elevated)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 134, in _get_all_host_states
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server return self.host_manager.get_all_host_states(context)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/host_manager.py", line 610, in get_all_host_states
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server self._get_instance_info(context, compute))
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/host_manager.py", line 180, in update
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server return _locked_update(self, compute, service, aggregates, inst_dict)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/host_manager.py", line 169, in _locked_update
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server self._update_from_compute_node(compute)
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server File "/opt/stack/new/nova/nova/scheduler/ironic_host_manager.py", line 46, in _update_from_compute_node
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server self.free_disk_mb = compute.free_disk_gb * 1024
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server
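
For reference, the failing statement is the free_disk_mb calculation in IronicNodeState._update_from_compute_node: the freshly created compute node record still has free_disk_gb=None when the scheduler reads it. A minimal sketch of the failure mode and the kind of guard that avoids it (a simplified illustration using a fake object, not the actual nova code):

    # Simplified illustration only, not the actual nova code.
    class FakeComputeNode(object):
        """Stand-in for the ComputeNode record the scheduler sees mid-race."""
        free_disk_gb = None  # not yet populated by the resource tracker
        hypervisor_hostname = '00d62acd-6f3b-4cb8-9668-12517c84b3b9'

    compute = FakeComputeNode()

    try:
        free_disk_mb = compute.free_disk_gb * 1024
    except TypeError as e:
        # TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
        print("host state update blows up: %s" % e)

    # A defensive guard simply skips the update until the field is populated:
    if compute.free_disk_gb is not None:
        free_disk_mb = compute.free_disk_gb * 1024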

Right before the scheduler failure I see this in the n-cpu logs:

http://logs.openstack.org/20/407220/2/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/0779f94/logs/screen-n-cpu.txt.gz#_2017-01-04_19_03_55_573

2017-01-04 19:03:55.573 28458 INFO nova.compute.manager [req-a0b07a5e-2c5a-4550-a0b5-6d1fe2b1d9e0 - -] Deleting orphan compute node 1 hypervisor host is 00d62acd-6f3b-4cb8-9668-12517c84b3b9, nodes are set([u'74396195-3bd0-4bc3-9f51-2f9d9ff32616', u'dd8dcc23-04f4-4718-a079-e1b77d70bad9', u'36827858-740c-4f1d-a0e9-17339f441338', u'86939305-5678-4796-a250-76295faf1890', u'0f813337-25e3-4c12-a2b0-cf2222f3cc7f', u'21da4933-a128-45f6-a765-7e6bc071e0f3'])

I'm not really sure what's going on with the compute node record, though, whether it's getting deleted or something else, because once the resource tracker creates the compute node record it should set free_disk_gb shortly thereafter and then update the record in the database. There is definitely a window of time between when the compute node record is first created and when it's first updated, but that doesn't appear to be the case here, since the updated_at field is set:

updated_at=2017-01-04T19:04:51Z

But the failure is about 25 seconds after that.

This also seems to be happening quite a bit since January 2nd 2017:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_update_from_compute_node%5C%22%20AND%20message%3A%5C%22TypeError%3A%20unsupported%20operand%20type(s)%20for%20*%3A%20'NoneType'%20and%20'int'%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually, looking at logstash, the failures are mostly in the non-Ironic jobs which use keystone v3. Those are failing because we create the compute_nodes record in the database from the resource tracker in the nova-compute service, but then fail to update the compute node record, which is what sets the free_disk_gb value:

http://logs.openstack.org/65/416765/1/check/gate-tempest-dsvm-neutron-identity-v3-only-full-ubuntu-xenial-nv/faf1363/logs/screen-n-cpu.txt.gz#_2017-01-04_23_15_06_674

And it looks like that's happening because of some keystone v3 auth issue when talking to the placement service.

As for the January 2nd start date, ~January 3rd is when we made the placement-api run in all jobs in devstack:

https://review.openstack.org/#/c/412537/

Which would include the keystone v3 API job.

Why we're hitting the auth issue with keystone v3 I don't know, but it happens when we're trying to talk to the placement API for the compute node's resource provider, which should be handled by a safe_connect decorator...

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is the auth failure on the keystone side:

http://logs.openstack.org/65/416765/1/check/gate-tempest-dsvm-neutron-identity-v3-only-full-ubuntu-xenial-nv/faf1363/logs/apache/keystone.txt.gz#_2017-01-04_23_15_06_670

2017-01-04 23:15:06.670 3310 WARNING keystone.auth.plugins.core [req-38ce6df3-1dca-4b66-a680-d2691bb4955f - - - - -] Could not find user: placement
2017-01-04 23:15:06.670 3310 WARNING keystone.common.wsgi [req-38ce6df3-1dca-4b66-a680-d2691bb4955f - - - - -] Authorization failed. The request you have made requires authentication. from 10.18.76.58

We don't handle Unauthorized in safe_connect in the nova scheduler report client; we probably should.
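
Roughly what that could look like (an illustrative sketch using keystoneauth1 exception classes, not the real safe_connect in nova's scheduler report client):

    # Illustrative sketch only, not the actual nova decorator.
    import functools

    from keystoneauth1 import exceptions as ks_exc
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def safe_connect(f):
        """Swallow 'placement is unreachable or unusable' errors.

        Catching Unauthorized here would cover the failure in this bug,
        where the 'placement' keystone user is missing and every request
        to the placement API comes back 401.
        """
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except ks_exc.EndpointNotFound:
                LOG.warning('Placement API endpoint not found.')
            except ks_exc.ConnectFailure:
                LOG.warning('Placement API service is not responding.')
            except ks_exc.Unauthorized:
                LOG.warning('Placement API authentication failed; '
                            'check the placement service user.')
        return wrapper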

Also, it looks like this devstack patch will fix the placement stuff for keystone v3:

https://review.openstack.org/#/c/416599/

Revision history for this message
Matt Riedemann (mriedem) wrote :

I see that we check for Unauthorized responses in the nova-status upgrade check command, but not in safe_connect in the scheduler report client, so we need to handle that.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've created bug 1654107 to handle the Unauthorized exception separately from the Ironic host manager update race.

Changed in nova:
importance: Medium → Low
Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/416822

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Shivanand Tendulker (stendulker) wrote :

This issue is seen in Ironic HPE CI jobs
https://15.184.0.52/28/372428/18/check/gate-tempest-ironic-ilo-driver-iscsi_ilo/3d53e55/

See relevant portions of n-sch.log, n-cpu.log and key.log below.

n-sch.log
----------
2017-01-03 23:35:05.933 25579 DEBUG oslo_concurrency.lockutils [req-7ac1f5ec-68a1-47e1-b4a6-0e850839da16 tempest-BaremetalBasicOps-1350158805 tempest-BaremetalBasicOps-1350158805] Lock "(u'jenkins-slave1', u'c25e2761-3b47-4dee-9737-8bb99c469a4b')" acquired by "nova.scheduler.host_manager._locked_update" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
2017-01-03 23:35:05.935 25579 DEBUG nova.scheduler.host_manager [req-7ac1f5ec-68a1-47e1-b4a6-0e850839da16 tempest-BaremetalBasicOps-1350158805 tempest-BaremetalBasicOps-1350158805] Update host state from compute node: ComputeNode(cpu_allocation_ratio=16.0,cpu_info='',created_at=2017-01-03T23:32:43Z,current_workload=None,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=99,free_disk_gb=None,free_ram_mb=None,host='jenkins-slave1',host_ip=10.13.120.197,hypervisor_hostname='c25e2761-3b47-4dee-9737-8bb99c469a4b',hypervisor_type='ironic',hypervisor_version=1,id=1,local_gb=99,local_gb_used=0,memory_mb=16384,memory_mb_used=0,metrics=None,numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.0,running_vms=None,service_id=None,stats={boot_mode='bios',cpu_arch='x86_64'},supported_hv_specs=[HVSpec],updated_at=2017-01-03T23:34:44Z,uuid=93e176d6-dca2-46c4-bd41-b41570e61e2c,vcpus=2,vcpus_used=0) _locked_update /opt/stack/nova/nova/scheduler/host_manager.py:168
2017-01-03 23:35:05.938 25579 DEBUG oslo_concurrency.lockutils [req-7ac1f5ec-68a1-47e1-b4a6-0e850839da16 tempest-BaremetalBasicOps-1350158805 tempest-BaremetalBasicOps-1350158805] Lock "(u'jenkins-slave1', u'c25e2761-3b47-4dee-9737-8bb99c469a4b')" released by "nova.scheduler.host_manager._locked_update" :: held 0.005s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server [req-7ac1f5ec-68a1-47e1-b4a6-0e850839da16 tempest-BaremetalBasicOps-1350158805 tempest-BaremetalBasicOps-1350158805] Exception during message handling
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 155, in _process_incoming
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 222, in dispatch
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 192, in _do_dispatch
2017-01-03 23:35:05.938 25579 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-01-03 23:35:05.938 25579 ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/416822
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2f1245a56c238646a056a20caa543f2a254b83d7
Submitter: Jenkins
Branch: master

commit 2f1245a56c238646a056a20caa543f2a254b83d7
Author: Matt Riedemann <email address hidden>
Date: Wed Jan 4 22:06:23 2017 -0500

    Fix TypeError in _update_from_compute_node race

    A compute node record is created in the database from the
    _init_compute_node method in the resource tracker in the
    nova-compute service. It is created without the updated_at
    or free_disk_gb fields being set.

    Shortly after it's created, update_resource_stats() is called
    on the scheduler report client which calls compute_node.save()
    which calls the update_compute_node() method in the DB API which
    *always* updates the updated_at field even if nothing else
    was changed.

    So at this point, we have a compute node with updated_at set
    but not free_disk_gb. The free_disk_gb field gets set by the
    _update_usage_from_instances() method in the resource tracker
    and then that value is later saved off when the compute node
    is updated in the database on a second call to the
    update_resource_stats() method in the scheduler report client.
    At that point the free_disk_gb field is set on the compute
    node record.

    There is a race in between the compute node create and initial
    update but before free_disk_gb is set where the scheduler host
    manager can be attempting to update the HostState object from
    the same compute node when selecting host destinations during
    a server build. If that picks up the compute node before its
    free_disk_gb field is set it will result in a TypeError when
    trying to multiply None * 1024.

    Change 36a0ba9c8141b445f2c6bfc093fde4cc98d229b2 was an earlier
    attempt at fixing this bug and shortened the race window but
    didn't handle the problem where updated_at is set but free_disk_gb
    is not yet.

    This change builds on that by simply checking for the thing
    the scheduler host manager actually cares about, which is the
    free_disk_gb field.

    Closes-Bug: #1654102
    Closes-Bug: #1610679

    Change-Id: I37b75dabb3ea7ec2d5678550d9aff30b1a0c29e6

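In other words, rather than inferring readiness from updated_at, the host manager just checks the field it is about to consume. Roughly the shape of that guard, using a hypothetical helper (an illustrative sketch of the approach described in the commit message, not the literal patch):

    # Illustrative sketch of the guard described in the commit message,
    # not the literal nova patch.
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def host_state_ready(compute):
        """Return True if this compute node record can be scheduled to.

        The resource tracker creates the record first and only populates
        free_disk_gb on a later update, so check the field the scheduler
        actually multiplies rather than relying on updated_at.
        """
        if compute.free_disk_gb is None:
            LOG.warning('Ignoring compute node %s: free_disk_gb is not set '
                        'yet; the resource tracker has not finished its '
                        'first update.', compute.hypervisor_hostname)
            return False
        return True
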
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b3

This issue was fixed in the openstack/nova 15.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/427080

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/427080
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e30b75097840019c38e0619e70924ddc9f9487a0
Submitter: Jenkins
Branch: stable/newton

commit e30b75097840019c38e0619e70924ddc9f9487a0
Author: Matt Riedemann <email address hidden>
Date: Wed Jan 4 22:06:23 2017 -0500

    Fix TypeError in _update_from_compute_node race

    A compute node record is created in the database from the
    _init_compute_node method in the resource tracker in the
    nova-compute service. It is created without the updated_at
    or free_disk_gb fields being set.

    Shortly after it's created, update_resource_stats() is called
    on the scheduler report client which calls compute_node.save()
    which calls the update_compute_node() method in the DB API which
    *always* updates the updated_at field even if nothing else
    was changed.

    So at this point, we have a compute node with updated_at set
    but not free_disk_gb. The free_disk_gb field gets set by the
    _update_usage_from_instances() method in the resource tracker
    and then that value is later saved off when the compute node
    is updated in the database on a second call to the
    update_resource_stats() method in the scheduler report client.
    At that point the free_disk_gb field is set on the compute
    node record.

    There is a race in between the compute node create and initial
    update but before free_disk_gb is set where the scheduler host
    manager can be attempting to update the HostState object from
    the same compute node when selecting host destinations during
    a server build. If that picks up the compute node before its
    free_disk_gb field is set it will result in a TypeError when
    trying to multiply None * 1024.

    Change 36a0ba9c8141b445f2c6bfc093fde4cc98d229b2 was an earlier
    attempt at fixing this bug and shortened the race window but
    didn't handle the problem where updated_at is set but free_disk_gb
    is not yet.

    This change builds on that by simply checking for the thing
    the scheduler host manager actually cares about, which is the
    free_disk_gb field.

    Closes-Bug: #1654102
    Closes-Bug: #1610679

    (cherry-picked from 2f1245a56c238646a056a20caa543f2a254b83d7)
    Change-Id: I37b75dabb3ea7ec2d5678550d9aff30b1a0c29e6

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.4

This issue was fixed in the openstack/nova 14.0.4 release.
