nova-scheduler fails when running out of disk space

Bug #1630658 reported by Magnus Lööf
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

When launching an instance, if Nova determines that a compute node is out of disk space, other hypervisors are not considered for scheduling.

Steps to reproduce
==================

OpenStack deployed using local storage on two nodes (using Kolla), each with a 50 GB disk in my lab.
Launch instances until the disk is nearly full, for example two CentOS instances with a flavor that has a 20 GB disk.
Launch another instance with a 20 GB disk flavor (a rough command sketch follows below).
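
Roughly, the commands were as follows; the image and flavor names are the ones from my lab (m1.small has a 20 GB root disk by default), so treat this only as a sketch:

   # fill the first node's local disk with two 20 GB instances
   nova boot --image CentOS-7-x86_64-GenericCloud-1608.qcow2 --flavor m1.small one
   nova boot --image CentOS-7-x86_64-GenericCloud-1608.qcow2 --flavor m1.small two

   # this one should be scheduled onto the second node, but scheduling fails instead
   nova boot --image CentOS-7-x86_64-GenericCloud-1608.qcow2 --flavor m1.small three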

Expected Result
===============
Nova schedules the "other instance" on the second compute host where there are sufficient resources.

Actual Result
=============
Nova fails with "Not Enough Hosts"

[req-a822f4c7-1788-4ff3-b03f-91db6d722937 ec4d0ac3f3684b1db4fa82449091339a df00326cc81048c28466529f82f4cf1a - - -] Filtering removed all hosts for the request with instance ID 'fb1ee3fe-3230-4d3a-a6d2-7b4a5bae1c09'. Filter results: ['RetryFilter: (start: 4, end: 4)', 'AvailabilityZoneFilter: (start: 4, end: 4)', 'RamFilter: (start: 4, end: 4)', 'DiskFilter: (start: 4, end: 2)', 'ComputeFilter: (start: 2, end: 0)']
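
The filters named in the log are the scheduler's filter chain from nova.conf. For reference, a minimal sketch of the Mitaka-era option (the exact filter list Kolla ships may differ):

   [DEFAULT]
   # each filter further narrows the list of candidate hosts, in order
   scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter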

Environment
===========

OpenStack Version: mitaka, installed from Kolla stable/mitaka

Hypervisor: KVM Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 GNU/Linux

Storage: Local

Networking: Neutron

Revision history for this message
Steven Dake (sdake) wrote :

I have confirmed this to be the case, at least with the default Kolla configs in the operator's environment (pristine upstream Newton). If the nova scheduler fails to schedule to the first node, it gives up on the other nodes.

Revision history for this message
Sean Dague (sdague) wrote :

The scheduler logs suggest there are 4 hosts in your environment. Half of them were thrown out because they had insufficient disk; however, the remaining 2 were thrown out because they didn't have enough CPU capacity. Are you sure there was enough compute capacity on any of the nodes?
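
For reference, per-host capacity can be checked from the API side, e.g. (hypervisor IDs as used later in this report):

   openstack hypervisor list
   # look at vcpus / vcpus_used, free_ram_mb and free_disk_gb
   openstack hypervisor show 3
   openstack hypervisor show 5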

Changed in nova:
status: New → Incomplete
Revision history for this message
Sean Dague (sdague) wrote :

I'm also really curious about the fact that the number of nodes the scheduler is considering doesn't seem to match what you said you provided to Kolla. Nova thinks there are 4; the bug says 2.

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

Sorry guys. I edited the "hypervisor" list because I had two more Hyper-V hosts that were not part of the equation, in an effort to simplify the problem description. Bad move.

Anyway. I believe the logs show the following:

1. There are - at least ;-) - two hypervisors with sufficient disk space:

   openstack hypervisor show 3:
   vcpus_used 0
   free_disk_gb 49

   openstack hypervisor show 5:
   vcpus_used 0
   free_disk_gb 49

2. It is possible to schedule servers to both hypervisors:

   7 x nova boot --image cirros --flavor m1.tiny

   openstack hypervisor show 3:
   vcpus_used 5
   free_disk_gb 45

   openstack hypervisor show 5:
   vcpus_used 3
   free_disk_gb 46

3. There is disk space remaining on control02 (id 5) after trying to schedule:

   nova boot --image CentOS-7-x86_64-GenericCloud-1608.qcow2 --flavor m1.small three

   openstack hypervisor show 5:
   vcpus_used 0
   free_disk_gb 49

4. Nova fails to schedule, even though there is sufficient disk on control02 (id 5):

  'DiskFilter: (start: 4, end: 2)'

  The DiskFilter *should* have returned '3' (both Hyper-V hosts had sufficient disk, plus control02)

Revision history for this message
Magnus Lööf (magnus-loof) wrote :

This problem may not occur often in production, because in production you would normally have shared storage of some kind. But for people doing PoCs or trying to learn, this could trip them up.

Changed in nova:
status: Incomplete → New
Revision history for this message
Sean Dague (sdague) wrote :

There was still no information provided about why the CPU filter threw out the rest of the nodes. That is the important bit we need in order to move forward.

As such, I'm closing this as Invalid. If someone is able to enable more detailed debugging in the scheduler to figure out which hosts get dropped by the Disk and CPU filters, and for what reason, and provide that, please do and we can reopen.
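
If it helps, a minimal sketch of how to get that debug output, assuming a Kolla-style deployment (config and log paths may differ in other setups):

   # /etc/nova/nova.conf on the scheduler node
   [DEFAULT]
   debug = True

   # restart nova-scheduler, retry the boot, then inspect the filter decisions
   grep -E 'DiskFilter|ComputeFilter' /var/log/kolla/nova/nova-scheduler.log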

Changed in nova:
status: New → Invalid