If a rebuild is refused by the scheduler, the instance's imageref is not rolled back

Bug #1744325 reported by Artom Lifshitz on 2018-01-19
This bug affects 1 person
Affects | Importance | Assigned to
OpenStack Compute (nova) | High | int32bit
Newton | Undecided | Unassigned
Ocata | High | melanie witt
Pike | High | melanie witt

Bug Description

Description
===========

Since the fix for CVE-2017-16239, rebuilds go through the scheduler. If the scheduler refuses a rebuild with a new image because of filter constraints (for example, IsolatedHostsFilter), the instance's imageref is still set to the new image and is never rolled back.
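
The ordering problem can be modelled with a minimal Python sketch. Everything in it (the `Instance` class, `schedule`, `rebuild`, the short image names) is a hypothetical stand-in and not nova's actual code; it only illustrates that the image reference is updated before the scheduler is consulted and that nothing restores it on refusal.

```python
from dataclasses import dataclass


class NoValidHost(Exception):
    """Hypothetical stand-in for the scheduler's refusal."""


@dataclass
class Instance:
    image_ref: str
    vm_state: str = "active"


def schedule(image_ref, allowed_images):
    # Toy scheduler: refuse any image the only available host does not accept.
    if image_ref not in allowed_images:
        raise NoValidHost("No valid host was found.")


def rebuild(instance, new_image_ref, allowed_images):
    # The new image is recorded on the instance first ...
    instance.image_ref = new_image_ref
    try:
        # ... and only then is the scheduler asked whether it is acceptable.
        schedule(new_image_ref, allowed_images)
    except NoValidHost:
        # Behaviour described in this report: the refusal is only logged,
        # and image_ref and vm_state are left untouched.
        return False
    return True


server = Instance(image_ref="cirros")
accepted = rebuild(server, "centos", allowed_images={"cirros"})
print(accepted, server.image_ref, server.vm_state)
```

In this model the rebuild is refused, yet the instance still reports the new image and an unchanged state, which mirrors what the reproduction steps show through the API.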

Steps to reproduce
==================

1. Configure IsolatedHostsFilter:

   [filter_scheduler]
   enabled_filters = [...],IsolatedHostsFilter
   isolated_images = 41d3e5ca-14cf-436c-9413-4826b5c8bdb1
   isolated_hosts = ubuntu
   restrict_isolated_hosts_to_isolated_images = true

2. Have two images, one isolated and one not:

   $ openstack image list

     8d0581a5-ed9d-4b98-a766-a41efbc99929 | centos | active
     41d3e5ca-14cf-436c-9413-4826b5c8bdb1 | cirros-0.3.5-x86_64-disk | active

     cirros is the isolated one

3. Have only one hypervisor (the isolated one):

   $ openstack hypervisor list

     ubuntu | QEMU | 192.168.100.194 | up

4. To confirm, boot a centos (non-isolated) image, expecting it to be refused by the scheduler:

   $ openstack server create \
     --image 8d0581a5-ed9d-4b98-a766-a41efbc99929 \
     --flavor m1.nano \
     centos-test-expect-fail

   $ openstack server list

     centos-test-expect-fail | ERROR | | centos | m1.nano

5. Boot a cirros (isolated) image:

   $ openstack server create \
     --image 41d3e5ca-14cf-436c-9413-4826b5c8bdb1 \
     --flavor m1.nano \
     cirros-test-expect-success

   $ openstack server list

     cirros-test-expect-success | ACTIVE | [...] | cirros-0.3.5-x86_64-disk | m1.nano

6. Rebuild the cirros instance with centos:

   $ nova --debug rebuild cirros-test-expect-success centos

     DEBUG (session:722) POST call to compute for
     http://192.168.100.194/compute/v2.1/servers/d9d98bf7-623e-4587-b82c-06f36abf59cb/action
     used request id req-c234346a-6e05-47cf-a0cd-45f89d11e15d

7. Observe the rebuild being refused in the conductor:

   WARNING nova.conductor.manager
   [None req-c234346a-6e05-47cf-a0cd-45f89d11e15d demo admin]
   [instance: d9d98bf7-623e-4587-b82c-06f36abf59cb]
   No valid host found for rebuild: NoValidHost_Remote:
   No valid host was found. There are not enough hosts available.

8. Observe the API is showing the new centos image for the instance:

   $ nova show cirros-test-expect-success

     [...]
     image | centos (8d0581a5-ed9d-4b98-a766-a41efbc99929)
     [...]
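
Why the scheduler accepts cirros but refuses centos in the steps above can be sketched as a small predicate. This is a simplified model of the isolation rule, assuming the host and image IDs from the configuration in step 1; the function name `host_passes` and its signature are illustrative, and the sketch leaves out cases such as an empty `isolated_images` list, so it should not be read as nova's filter implementation.

```python
CIRROS = "41d3e5ca-14cf-436c-9413-4826b5c8bdb1"  # isolated image from step 1
CENTOS = "8d0581a5-ed9d-4b98-a766-a41efbc99929"  # non-isolated image


def host_passes(host, image_id, isolated_hosts, isolated_images,
                restrict_isolated_hosts_to_isolated_images=True):
    """Simplified model of the isolation rule (illustrative only)."""
    host_isolated = host in isolated_hosts
    image_isolated = image_id in isolated_images
    if restrict_isolated_hosts_to_isolated_images:
        # Isolated hosts take only isolated images, and isolated images
        # go only to isolated hosts.
        return host_isolated == image_isolated
    # Without the restriction, isolated images still need an isolated host,
    # but isolated hosts may also run other images.
    return (not image_isolated) or host_isolated


cirros_ok = host_passes("ubuntu", CIRROS, {"ubuntu"}, {CIRROS})
centos_ok = host_passes("ubuntu", CENTOS, {"ubuntu"}, {CIRROS})
print(cirros_ok, centos_ok)
```

With `ubuntu` as the only hypervisor, no host passes for the centos image, so both the boot in step 4 and the rebuild in step 6 end with no valid host.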

Expected result
===============

Some indication that the rebuild was refused, or at least rolling back the instance's imageref.

Actual result
=============

No indication that the rebuild was refused, and worse, we now have a wrong imageref for the instance.

Environment
===========

1. Exact version of OpenStack you are running. See the following

   This was picked up by QE for stable/pike, and is still present in master,
   and presumably in all versions affected by the CVE fix, including newton,
   which is now EOL.

2. Which hypervisor did you use?
   libvirt+kvm

Sylvain Bauza (sylvain-bauza) wrote :

That's IMHO a very critical bug that we need to tackle ASAP

Changed in nova:
status: New → Confirmed
importance: Undecided → Critical
tags: added: scheduler
int32bit (int32bit) on 2018-01-22
Changed in nova:
assignee: nobody → int32bit (int32bit)
int32bit (int32bit) wrote :

I think we also need to restore the original image metadata and set the server status to ERROR.

Fix proposed to branch: master
Review: https://review.openstack.org/536268

Changed in nova:
status: Confirmed → In Progress
Matt Riedemann (mriedem) on 2018-01-22
tags: added: queens-rc-potential
Matt Riedemann (mriedem) wrote :

I wouldn't say this is critical. It's a change, yes, but it's similar to what we always had as a bug when trying to rebuild a volume-backed instance with a different image up until Queens when we finally made that an error in the API because it's not supported in the compute service. I think we can just mark the instance ERROR in this case and let the user rebuild the instance with a valid image to get it out of ERROR state, but just leave the instance properties as the API set them without trying to get the conductor manager code to roll everything back.

Changed in nova:
importance: Critical → High
Changed in nova:
assignee: int32bit (int32bit) → melanie witt (melwitt)
melanie witt (melwitt) on 2018-01-23
Changed in nova:
assignee: melanie witt (melwitt) → int32bit (int32bit)
Changed in nova:
assignee: int32bit (int32bit) → melanie witt (melwitt)
Matt Riedemann (mriedem) on 2018-01-23
Changed in nova:
assignee: melanie witt (melwitt) → int32bit (int32bit)
Changed in nova:
assignee: int32bit (int32bit) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2018-01-23
Changed in nova:
assignee: Matt Riedemann (mriedem) → int32bit (int32bit)
Matt Riedemann (mriedem) wrote :

This bug exists in newton but newton is EOL upstream so it won't be fixed there.

Reviewed: https://review.openstack.org/536268
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d03a890a34f632adc9a19a33c8a5aebbccec50e4
Submitter: Zuul
Branch: master

commit d03a890a34f632adc9a19a33c8a5aebbccec50e4
Author: int32bit <email address hidden>
Date: Mon Jan 22 17:05:53 2018 +0800

    Set server status to ERROR if rebuild failed

    Currently there is no indication that the rebuild was refused,
    and worse, we may have a wrong imageref for the instance.

    This patch set the instance to ERROR status if rebuild failed in the
    scheduling stage. The user can rebuild the instance with valid image
    to get it out of ERROR state and reset with right instance metadata and
    properties.

    Closes-Bug: 1744325

    Change-Id: Ibb7bee15a3d4ee6f0ef53ba12e8b41f65a1fe999

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 17.0.0.0b3 development milestone.
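
The behaviour described in the commit message above can be sketched as follows. This is an illustrative model, not the merged patch: `Instance`, `rebuild_with_fix`, and the `scheduler` callable are hypothetical names, and the sketch assumes that a later successful rebuild returns the instance to an active state.

```python
from dataclasses import dataclass


class NoValidHost(Exception):
    """Hypothetical stand-in for the scheduler's refusal."""


@dataclass
class Instance:
    image_ref: str
    vm_state: str = "active"


def rebuild_with_fix(instance, new_image_ref, scheduler):
    # The requested image is still recorded before scheduling.
    instance.image_ref = new_image_ref
    try:
        scheduler(new_image_ref)
    except NoValidHost:
        # No rollback is attempted; the failure is made visible instead.
        instance.vm_state = "error"
        return False
    # Assumption: a successful rebuild leaves the instance active again.
    instance.vm_state = "active"
    return True


def only_cirros(image_ref):
    # Toy scheduler that accepts a single image.
    if image_ref != "cirros":
        raise NoValidHost("No valid host was found.")


server = Instance(image_ref="cirros")
first = rebuild_with_fix(server, "centos", only_cirros)
state_after_refusal = (server.image_ref, server.vm_state)
second = rebuild_with_fix(server, "cirros", only_cirros)
state_after_retry = (server.image_ref, server.vm_state)
print(first, state_after_refusal, second, state_after_retry)
```

The design choice follows the discussion in the comments: the image reference is not rolled back, but the ERROR state signals the refusal, and rebuilding with a valid image resets both the state and the image reference.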

Reviewed: https://review.openstack.org/536897
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=22a39b8f9b76d5f330a3f185e7d40f208a6a47f3
Submitter: Zuul
Branch: stable/pike

commit 22a39b8f9b76d5f330a3f185e7d40f208a6a47f3
Author: int32bit <email address hidden>
Date: Mon Jan 22 17:05:53 2018 +0800

    Set server status to ERROR if rebuild failed

    Currently there is no indication that the rebuild was refused,
    and worse, we may have a wrong imageref for the instance.

    This patch set the instance to ERROR status if rebuild failed in the
    scheduling stage. The user can rebuild the instance with valid image
    to get it out of ERROR state and reset with right instance metadata and
    properties.

    Closes-Bug: 1744325

    Conflicts:
        nova/tests/unit/conductor/test_conductor.py

    NOTE(melwitt): The conflict was because the select_destinations method
    has additional parameters in Queens that don't exist in Pike.

    Change-Id: Ibb7bee15a3d4ee6f0ef53ba12e8b41f65a1fe999
    (cherry picked from commit d03a890a34f632adc9a19a33c8a5aebbccec50e4)

Reviewed: https://review.openstack.org/536904
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=83fd8ac0bfd3d9e5dd4ad48bfcb845d80095b845
Submitter: Zuul
Branch: stable/ocata

commit 83fd8ac0bfd3d9e5dd4ad48bfcb845d80095b845
Author: int32bit <email address hidden>
Date: Mon Jan 22 17:05:53 2018 +0800

    Set server status to ERROR if rebuild failed

    Currently there is no indication that the rebuild was refused,
    and worse, we may have a wrong imageref for the instance.

    This patch set the instance to ERROR status if rebuild failed in the
    scheduling stage. The user can rebuild the instance with valid image
    to get it out of ERROR state and reset with right instance metadata and
    properties.

    Closes-Bug: 1744325

    Conflicts:
        nova/conductor/manager.py
        nova/tests/functional/regressions/test_bug_1713783.py
        nova/tests/functional/test_servers.py
        nova/tests/unit/conductor/test_conductor.py

    NOTE(melwitt): The conflicts were because of log translation, the fact
    that the regression test for bug 1713783 doesn't exist in Ocata, there
    are additional rebuild functional tests in Pike that don't exist in
    Ocata, and the select_destinations method has additional parameters in
    Pike that don't exist in Ocata.

    Change-Id: Ibb7bee15a3d4ee6f0ef53ba12e8b41f65a1fe999
    (cherry picked from commit d03a890a34f632adc9a19a33c8a5aebbccec50e4)
    (cherry picked from commit 22a39b8f9b76d5f330a3f185e7d40f208a6a47f3)

This issue was fixed in the openstack/nova 16.1.0 release.

This issue was fixed in the openstack/nova 15.1.1 release.
