NULL free_disk_gb causes scheduler failure

Bug #1517770 reported by James Dennis
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Low
aginwala

Bug Description

It appears a race exists between nova-scheduler and the compute manager when a ComputeNode entry is created for the first time.

The following log messages were noticed after multiple transient failures to create a VM on a newly deployed single-node system.

2015-11-03 18:41:27.886 13735 WARNING nova.scheduler.host_manager [req-dd2b0758-78a4-4a67-90c8-9586d4d55489 db30a70a389548ed916f52d2f5c25544 617c3194750f44cfa1e9a747b2ac36f5 - - -] Host zs-zhost1 has more disk space than database expected (13119gb > Nonegb)
2015-11-03 18:41:27.904 13783 WARNING nova.scheduler.utils [req-dd2b0758-78a4-4a67-90c8-9586d4d55489 db30a70a389548ed916f52d2f5c25544 617c3194750f44cfa1e9a747b2ac36f5 - - -] Failed to compute_task_build_instances: unsupported operand type(s) for *: 'NoneType' and 'int'
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
executor_callback))
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
executor_callback)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
result = func(ctxt, **new_args)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 142, in inner
return func(*args, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 86, in select_destinations
filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 67, in select_destinations
filter_properties)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 131, in _schedule
hosts = self._get_all_host_states(elevated)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py", line 176, in _get_all_host_states
return self.host_manager.get_all_host_states(context)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/host_manager.py", line 552, in get_all_host_states
host_state = self.host_state_cls(host, node, compute=compute)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/host_manager.py", line 309, in host_state_cls
return HostState(host, node, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/host_manager.py", line 157, in __init__
self.update_from_compute_node(compute)
File "/usr/lib/python2.7/dist-packages/nova/scheduler/host_manager.py", line 202, in update_from_compute_node
free_disk_mb = free_gb * 1024
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
2015-11-03 18:41:27.907 13783 WARNING nova.scheduler.utils [req-dd2b0758-78a4-4a67-90c8-9586d4d55489 db30a70a389548ed916f52d2f5c25544 617c3194750f44cfa1e9a747b2ac36f5 - - -] [instance: bd6bb6a7-e917-4ce7-b207-817144ac7853] Setting instance to ERROR state.

I believe that during the execution of resource_tracker._update_available_resource() for a new node, the period between the initial insert of the ComputeNode entry in _init_compute_node() and the call to _update() leaves a ComputeNode with a NULL free_disk_gb for a small window of time.
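A minimal defensive guard would avoid the crash during that window. This is only a sketch, not the actual nova code or the proposed fix: the function name mirrors update_from_compute_node from the traceback, but the HostState/ComputeNode objects here are stand-ins built with SimpleNamespace.

```python
from types import SimpleNamespace

def update_from_compute_node(host_state, compute):
    """Update a scheduler host state from a ComputeNode record (sketch).

    free_disk_gb can be None for a just-created ComputeNode row, because
    the resource tracker inserts the row before _update() fills in the
    usage fields.  Guard against that instead of multiplying None.
    """
    free_gb = compute.free_disk_gb
    if free_gb is None:
        # Resource data not populated yet; treat the host as having no
        # free disk so the scheduler skips it this cycle rather than
        # raising TypeError on None * 1024.
        free_gb = 0
    host_state.free_disk_mb = free_gb * 1024

# Stand-in for a ComputeNode row caught in the race window:
node = SimpleNamespace(free_disk_gb=None)
hs = SimpleNamespace()
update_from_compute_node(hs, node)
```

With this guard the freshly inserted node simply looks full for one scheduling cycle instead of breaking select_destinations for every request.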

Commit 6aa36ab seems likely to have exposed this more widely.

Versions (Kilo):
ii nova-common 1:2015.1.1-0ubuntu1~cloud2 all OpenStack Compute - common files
ii nova-compute 1:2015.1.1-0ubuntu1~cloud2 all OpenStack Compute - compute node base
ii nova-compute-kvm 1:2015.1.1-0ubuntu1~cloud2 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 1:2015.1.1-0ubuntu1~cloud2 all OpenStack Compute - compute node libvirt support
ii python-nova 1:2015.1.1-0ubuntu1~cloud2 all OpenStack Compute Python libraries
ii python-novaclient 1:2.22.0-0ubuntu1~cloud0 all client library for OpenStack Compute API

Tags: scheduler
Revision history for this message
aginwala (aginwala-f) wrote :

Is it Kilo or Juno?

Revision history for this message
James Dennis (jtmes) wrote :

Kilo

description: updated
Revision history for this message
aginwala (aginwala-f) wrote :

Can you log in to the DB directly and confirm the values? Also, which flavors are you using? Can you also share the CLI boot commands you are running?

Revision history for this message
James Dennis (jtmes) wrote :

The condition only occurs for a small period of time so it hasn't been possible to log in and check the DB. I followed the code path back from:

WARNING nova.scheduler.host_manager [req-dd2b0758-78a4-4a67-90c8-9586d4d55489 db30a70a389548ed916f52d2f5c25544 617c3194750f44cfa1e9a747b2ac36f5 - - -] Host zs-zhost1 has more disk space than database expected (13119gb > Nonegb)

to conclude that free_disk_gb must have been NULL in the DB. If needed, I could add an arbitrary time.sleep where I believe the race occurs to confirm with more certainty.
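The suspected race can be modeled in a few lines. This is a toy reproduction, not nova code: a dict stands in for the ComputeNode row, one thread plays the resource tracker (insert, delay, update), and the other plays the scheduler reading during the gap, which hits exactly the TypeError from the traceback above. The sleeps stand in for the arbitrary time.sleep mentioned above to widen the window.

```python
import threading
import time

compute_node = {}
errors = []

def resource_tracker():
    compute_node['free_disk_gb'] = None   # _init_compute_node(): initial insert, usage fields NULL
    time.sleep(0.2)                       # injected delay widens the race window
    compute_node['free_disk_gb'] = 13119  # _update(): usage fields finally populated

def scheduler():
    time.sleep(0.1)                       # reads while the window is still open
    try:
        free_disk_mb = compute_node['free_disk_gb'] * 1024
    except TypeError as exc:
        # "unsupported operand type(s) for *: 'NoneType' and 'int'"
        errors.append(str(exc))

t1 = threading.Thread(target=resource_tracker)
t2 = threading.Thread(target=scheduler)
t1.start(); t2.start()
t1.join(); t2.join()
```

In the real system the window is tiny, which is why the failure is transient and only shows up occasionally on freshly deployed nodes.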

The API data POSTed to nova to create the instance was:

{
  "server": {
    "name": "auto-0",
    "imageRef": "a7cb14bf-2bc0-40a9-89f1-f7e2cb05b91f",
    "block_device_mapping_v2": [
      {
        "boot_index": "0",
        "uuid": "8c1d9ce3-0138-474a-857f-6aa13acb4bc9",
        "device_name": "/dev/sda",
        "disk_bus": "virtio",
        "source_type": "volume",
        "destination_type": "volume",
        "delete_on_termination": true
      }
    ],
    "flavorRef": "1",
    "networks": [
      {
        "uuid": "28957e78-8d5b-400e-81d0-2a379b6c7274"
      }
    ],
    "security_groups": [
      {
        "name": "Basic"
      }
    ]
  }
}

The flavors are:
+-----------+----+-----------+-------+------+-------------+----------+-------------+---------+--------------+----------+-----------+
| name | id | memory_mb | vcpus | swap | vcpu_weight | flavorid | rxtx_factor | root_gb | ephemeral_gb | disabled | is_public |
+-----------+----+-----------+-------+------+-------------+----------+-------------+---------+--------------+----------+-----------+
| m1.medium | 1 | 4096 | 2 | 0 | NULL | 3 | 1 | 40 | 0 | 0 | 1 |
| m1.tiny | 2 | 512 | 1 | 0 | NULL | 1 | 1 | 1 | 0 | 0 | 1 |
| m1.large | 3 | 8192 | 4 | 0 | NULL | 4 | 1 | 80 | 0 | 0 | 1 |
| m1.xlarge | 4 | 16384 | 8 | 0 | NULL | 5 | 1 | 160 | 0 | 0 | 1 |
| m1.small | 5 | 2048 | 1 | 0 | NULL | 2 | 1 | 20 | 0 | 0 | 1 |
+-----------+----+-----------+-------+------+-------------+----------+-------------+---------+--------------+----------+-----------+

Revision history for this message
aginwala (aginwala-f) wrote :

Before you start booting any instance, can you please check what is in the compute_node table in the DB while the compute service is running?

Revision history for this message
aginwala (aginwala-f) wrote :

Just need to make sure that your compute node has enough resources.

Revision history for this message
James Dennis (jtmes) wrote :

This was discovered by an automated test that runs on the same platform multiple times and passes in the majority of runs, so there are definitely enough resources on the machine in question.

tags: added: scheduler
aginwala (aginwala-f)
Changed in nova:
status: New → Confirmed
assignee: nobody → aginwala (aginwala-f)
importance: Undecided → Medium
Revision history for this message
aginwala (aginwala-f) wrote :

It is in Kilo and also exists in Liberty.

Changed in nova:
status: Confirmed → In Progress
aginwala (aginwala-f)
Changed in nova:
importance: Medium → Low
aginwala (aginwala-f)
Changed in nova:
milestone: none → ongoing
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/250767

Revision history for this message
Chris Dent (cdent) wrote :

This is perhaps related to https://bugs.launchpad.net/nova/+bug/1515870 which has its own (fairly different) fix at: https://review.openstack.org/#/c/261145/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by aginwala (<email address hidden>) on branch: master
Review: https://review.openstack.org/250767
Reason: This patch is no longer needed since someone is already optimizing this at https://review.openstack.org/#/c/261145/

Revision history for this message
aginwala (aginwala-f) wrote :

This bug is a duplicate of https://bugs.launchpad.net/nova/+bug/1515870 . A fix has been proposed and will be merged soon.

Changed in nova:
status: In Progress → Confirmed
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Marking this one as a duplicate, as there is already an open change against 1515870, even though that one is older.
