metadata-api fails to get availability zone for instances created before pike

Bug #1768876 reported by Belmiro Moreira on 2018-05-03
This bug affects 2 people
Affects                     Importance  Assigned to
OpenStack Compute (nova)    High        Surya Seetharaman
  Pike                      High        Matt Riedemann
  Queens                    High        Matt Riedemann

Bug Description

Can't get AVZ for old instances:

curl http://169.254.169.254/latest/meta-data/placement/availability-zone
None

This is because the up-call to the nova_api DB was removed in commit 9f7bac2,
and old instances may not have the AVZ defined.
Previously, the instance's AVZ was only set if the user explicitly requested one.

Changed in nova:
assignee: nobody → Surya Seetharaman (tssurya)
tags: added: api api-ref metadata
Matt Riedemann (mriedem) wrote :

Yeah, I'm not sure what "We store the availability_zone on the instance now" is specifically referring to, i.e. which change made that statement true.

tags: added: cells
removed: api-ref
Matt Riedemann (mriedem) wrote :

Found it: this was the change, https://review.openstack.org/#/c/446053/, in Pike.

I'm not sure what to do here. I guess if instance.get('availability_zone') returns None, we have to fall back to trying to do the up-call.
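A minimal sketch of the fallback described above, in Python (nova's language). The dict-shaped instance and the injected `lookup_host_az` callable are hypothetical stand-ins: `lookup_host_az` models the up-call that maps a compute host to its AZ and is not a real nova API.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the up-call: maps a compute host name to its AZ.
HostAzLookup = Callable[[str], Optional[str]]


def get_instance_az(instance: dict,
                    lookup_host_az: HostAzLookup) -> Optional[str]:
    """Return the instance's AZ, falling back to a host-based lookup.

    Pike and later set availability_zone at scheduling time; older
    instances booted without an explicit AZ may still store None.
    """
    az = instance.get('availability_zone')
    if az is not None:
        # Fast path: the value is already stored on the instance.
        return az
    host = instance.get('host')
    if host is None:
        # Never scheduled to a host, so there is nothing to look up.
        return None
    # Slow path: the (hypothetical) up-call for pre-Pike instances.
    return lookup_host_az(host)


host_azs = {'compute1': 'az1', 'compute2': 'az2'}
new_inst = {'host': 'compute1', 'availability_zone': 'az1'}
old_inst = {'host': 'compute2', 'availability_zone': None}
print(get_instance_az(new_inst, host_azs.get))  # az1
print(get_instance_az(old_inst, host_azs.get))  # az2
```

The trade-off is that every metadata request for such an instance pays for the extra lookup until the stored value is populated.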

summary: - Old instances can get AVZ from metadata
+ metadata-api fails to get availability zone for instances created before pike
Changed in nova:
status: New → Triaged
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

Alternatively, we need an online data migration to set the instance.availability_zone for those that don't yet have it.
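A minimal sketch of such a migration, assuming nova's batched online-migration convention of processing at most `max_count` records per call and returning a `(found, done)` pair. The in-memory list of dicts, the function name, and the `lookup_host_az` callable are hypothetical; a real migration would query and update the cell database.

```python
from typing import Callable, Dict, List, Tuple


def populate_missing_availability_zones(
        instances: List[Dict], max_count: int,
        lookup_host_az: Callable[[str], str]) -> Tuple[int, int]:
    """Set availability_zone on up to max_count instances that lack it.

    Returns (found, done): how many candidates this batch examined and
    how many were updated. Instances with no host are skipped because
    there is no host to derive an AZ from.
    """
    candidates = [inst for inst in instances
                  if inst.get('availability_zone') is None
                  and inst.get('host')]
    batch = candidates[:max_count]
    done = 0
    for inst in batch:
        # lookup_host_az must always return a zone name (e.g. a default
        # zone for hosts outside any AZ aggregate); otherwise the same
        # rows would be found again on every run.
        inst['availability_zone'] = lookup_host_az(inst['host'])
        done += 1
    return len(batch), done


host_azs = {'c1': 'az1'}


def lookup(host: str) -> str:
    # Assumption: 'nova' is used as the default zone for unaggregated hosts.
    return host_azs.get(host, 'nova')


instances = [{'host': 'c1', 'availability_zone': None},
             {'host': 'c2', 'availability_zone': None},
             {'host': None, 'availability_zone': None}]
# Run small batches until nothing is left to migrate.
while True:
    found, done = populate_missing_availability_zones(instances, 1, lookup)
    if found == 0:
        break
```

Batching keeps each call short, so operators can run it repeatedly while the deployment stays online.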

Matt Riedemann (mriedem) wrote :

So before that change in Pike, instance.availability_zone was set to CONF.default_schedule_zone, which defaults to None:

https://docs.openstack.org/nova/pike/configuration/config.html#DEFAULT.default_schedule_zone

And reading the help on that, it sounds like we may have regressed some functionality (or contract) in Pike; see:

"None, which means that the instance can move from one availability zone to another during its lifetime if it is moved from one compute node to another."

So I think that means that before Pike, I could create an instance without a specific AZ and the operator could then live migrate it to any other compute node regardless of AZ, but after Pike the operator can only live migrate the instance to a compute node in the same AZ, which would be a regression. It should be fairly easy to reproduce this with an in-tree functional test that has two compute nodes in separate AZs: create an instance without a specific AZ, then try to live migrate it. If that works in Ocata but not after Ocata, it's a regression.

Matt Riedemann (mriedem) wrote :

It turns out there wasn't a regression with respect to setting instance.availability_zone in Pike; see https://review.openstack.org/567701.

Surya Seetharaman (tssurya) wrote :

Okay, so this means that if req_spec.availability_zone is None, the documented behaviour still holds: "None, which means that the instance can move from one availability zone to another during its lifetime if it is moved from one compute node to another."
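The rule above can be modelled as a simplified per-host filter check. This is a hypothetical sketch, not nova's AvailabilityZoneFilter source: `requested_az` stands in for RequestSpec.availability_zone and `host_az` for the candidate host's zone.

```python
from typing import Optional


def az_filter_passes(requested_az: Optional[str], host_az: str) -> bool:
    """Simplified AZ filter check for one candidate host.

    requested_az models RequestSpec.availability_zone (what the user asked
    for at boot), not instance.availability_zone (where it landed).
    """
    if requested_az is None:
        # No AZ requested: every host passes, so the instance can move
        # between zones on a later move operation.
        return True
    return requested_az == host_az
```

Because the check keys on the request rather than the instance, storing the landing zone on the instance does not by itself pin the instance to that zone.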

Since the default before Pike, when no AZ was specified, was None, we will go ahead with an online data migration tool to populate instance.availability_zone where it is None.

Fix proposed to branch: master
Review: https://review.openstack.org/567878

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Surya Seetharaman (tssurya) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/567878
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b4c38c04177ff194d05368cd4aff69958075167
Submitter: Zuul
Branch: master

commit 6b4c38c04177ff194d05368cd4aff69958075167
Author: Surya Seetharaman <email address hidden>
Date: Fri May 11 17:12:34 2018 +0200

    Metadata-API fails to retrieve avz for instances created before Pike

    In Pike (through change: I8d426f2635232ffc4b510548a905794ca88d7f99)
    we started setting instance.availability_zone at scheduling time by
    calculating the avz of the host into which the instance was scheduled.
    After this change was introduced, the metadata request for the avz
    on the instance (through change: I73c3b10e52ab4cfda9dacc0c0ba92d1fcb60bcc9)
    started using instance.get('availability_zone') instead of doing the upcall.
    However, this would return None for instances older than Pike whose
    availability_zone was not specified at boot time, as it would be set to
    CONF.default_schedule_zone, whose default value is None.

    This patch adds an online_migration tool to populate missing
    instance.availability_zone values.

    Change-Id: I2a1d81bfeb1ea006c16d8f403e045e9acedcbe57
    Closes-Bug: #1768876
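The root cause described in the commit message can be modelled with two small functions, one per release line. This is a simplified, hypothetical sketch rather than nova's code: `DEFAULT_SCHEDULE_ZONE` stands in for CONF.default_schedule_zone at its default value, and `selected_host_az` for the AZ of the host the scheduler picked.

```python
from typing import Optional

# Stand-in for CONF.default_schedule_zone at its default value.
DEFAULT_SCHEDULE_ZONE: Optional[str] = None


def az_stored_before_pike(requested_az: Optional[str]) -> Optional[str]:
    # Before Pike: the requested AZ, or the configured default (None).
    if requested_az is not None:
        return requested_az
    return DEFAULT_SCHEDULE_ZONE


def az_stored_since_pike(selected_host_az: str) -> str:
    # Pike and later: derived from the host chosen at scheduling time.
    return selected_host_az


# An instance booted without an AZ onto a host in zone 'az1':
print(az_stored_before_pike(None))   # None
print(az_stored_since_pike('az1'))   # az1
```

Only instances matching the first case, created before Pike without a requested AZ, are left with None for the metadata API to report.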

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) on 2018-05-30
Changed in nova:
assignee: Matt Riedemann (mriedem) → Surya Seetharaman (tssurya)

Reviewed: https://review.openstack.org/571317
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0a481a52929626c5fab8fd6fd50cca6882db3bd9
Submitter: Zuul
Branch: stable/queens

commit 0a481a52929626c5fab8fd6fd50cca6882db3bd9
Author: Surya Seetharaman <email address hidden>
Date: Fri May 11 17:12:34 2018 +0200

    Metadata-API fails to retrieve avz for instances created before Pike

    In Pike (through change: I8d426f2635232ffc4b510548a905794ca88d7f99)
    we started setting instance.availability_zone at scheduling time by
    calculating the avz of the host into which the instance was scheduled.
    After this change was introduced, the metadata request for the avz
    on the instance (through change: I73c3b10e52ab4cfda9dacc0c0ba92d1fcb60bcc9)
    started using instance.get('availability_zone') instead of doing the upcall.
    However, this would return None for instances older than Pike whose
    availability_zone was not specified at boot time, as it would be set to
    CONF.default_schedule_zone, whose default value is None.

    This patch adds an online_migration tool to populate missing
    instance.availability_zone values.

    Change-Id: I2a1d81bfeb1ea006c16d8f403e045e9acedcbe57
    Closes-Bug: #1768876
    (cherry picked from commit 6b4c38c04177ff194d05368cd4aff69958075167)

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.

Reviewed: https://review.openstack.org/571320
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=487c6dd778312780740d8cb9a0b51f3fa177c1c4
Submitter: Zuul
Branch: stable/pike

commit 487c6dd778312780740d8cb9a0b51f3fa177c1c4
Author: Surya Seetharaman <email address hidden>
Date: Fri May 11 17:12:34 2018 +0200

    Metadata-API fails to retrieve avz for instances created before Pike

    In Pike (through change: I8d426f2635232ffc4b510548a905794ca88d7f99)
    we started setting instance.availability_zone at scheduling time by
    calculating the avz of the host into which the instance was scheduled.
    After this change was introduced, the metadata request for the avz
    on the instance (through change: I73c3b10e52ab4cfda9dacc0c0ba92d1fcb60bcc9)
    started using instance.get('availability_zone') instead of doing the upcall.
    However, this would return None for instances older than Pike whose
    availability_zone was not specified at boot time, as it would be set to
    CONF.default_schedule_zone, whose default value is None.

    This patch adds an online_migration tool to populate missing
    instance.availability_zone values.

    Conflicts:
          nova/cmd/manage.py
          nova/tests/functional/db/test_instance.py

    NOTE(mriedem): The conflicts are due to the following changes which
    were added in Queens:

      I6db4eb46df0d7ec025b969a46621823957503958

      I5b4b235b88367c361d38371d430d67ff583a906c

      I4b33751b6793f60c6f2703c379c36387c49d866d

    Change-Id: I2a1d81bfeb1ea006c16d8f403e045e9acedcbe57
    Closes-Bug: #1768876
    (cherry picked from commit 6b4c38c04177ff194d05368cd4aff69958075167)
    (cherry picked from commit 0a481a52929626c5fab8fd6fd50cca6882db3bd9)

This issue was fixed in the openstack/nova 16.1.5 release.

This issue was fixed in the openstack/nova 17.0.6 release.

Reviewed: https://review.openstack.org/567701
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0ed68c76fa8a84d1d5f0ab945e34c8e16341d627
Submitter: Zuul
Branch: master

commit 0ed68c76fa8a84d1d5f0ab945e34c8e16341d627
Author: Matt Riedemann <email address hidden>
Date: Thu May 10 19:27:36 2018 -0400

    Update instance.availability_zone during live migration

    While triaging bug 1768876 there was some concern
    that change I8d426f2635232ffc4b510548a905794ca88d7f99
    in Pike had regressed some behavior where a user that
    does not explicitly request a specific AZ during server
    create is then later restricted to only move operations
    within that same AZ.

    This test shows that it is not a regression because the
    AvailabilityZoneFilter looks at RequestSpec.availability_zone
    rather than instance.availability_zone, so the instance
    is free to be moved across zones.

    As a result of the test, however, it was noticed that
    the instance.availability_zone isn't updated during live
    migration once the destination host is selected. The other
    move operations like unshelve, evacuate and cold migrate
    all update the instance.availability_zone, so this copies
    the same logic.

    Change-Id: I9f73c237923fdcbf4096edc5aedd2c968d4b893e
    Closes-Bug: #1771860
    Related-Bug: #1768876
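A minimal sketch of the behaviour this commit adds: once the live-migration destination is chosen, derive that host's AZ and store it on the instance, as the other move operations already do. The `Instance` dataclass, the `update_az_for_destination` helper, and the `lookup_host_az` callable are hypothetical; this is not nova's conductor code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instance:
    uuid: str
    availability_zone: Optional[str]


def update_az_for_destination(instance: Instance, dest_host: str,
                              lookup_host_az: Callable[[str], str]) -> bool:
    """Record the destination host's AZ on the instance.

    Returns True if the value changed, so the caller knows to persist it.
    """
    dest_az = lookup_host_az(dest_host)
    if instance.availability_zone == dest_az:
        return False
    instance.availability_zone = dest_az
    return True


host_azs = {'compute1': 'az1', 'compute2': 'az2'}
inst = Instance(uuid='abc', availability_zone='az1')
changed = update_az_for_destination(inst, 'compute2', host_azs.__getitem__)
print(changed, inst.availability_zone)  # True az2
```

Returning a changed flag lets the caller skip a database write when the destination is in the same zone.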

Reviewed: https://review.openstack.org/643173
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=133763d3582c2e85e4e5962b542294135d1a7f4c
Submitter: Zuul
Branch: stable/rocky

commit 133763d3582c2e85e4e5962b542294135d1a7f4c
Author: Matt Riedemann <email address hidden>
Date: Thu May 10 19:27:36 2018 -0400

    Update instance.availability_zone during live migration

    While triaging bug 1768876 there was some concern
    that change I8d426f2635232ffc4b510548a905794ca88d7f99
    in Pike had regressed some behavior where a user that
    does not explicitly request a specific AZ during server
    create is then later restricted to only move operations
    within that same AZ.

    This test shows that it is not a regression because the
    AvailabilityZoneFilter looks at RequestSpec.availability_zone
    rather than instance.availability_zone, so the instance
    is free to be moved across zones.

    As a result of the test, however, it was noticed that
    the instance.availability_zone isn't updated during live
    migration once the destination host is selected. The other
    move operations like unshelve, evacuate and cold migrate
    all update the instance.availability_zone, so this copies
    the same logic.

    Conflicts:
          nova/tests/unit/conductor/tasks/test_live_migrate.py

    NOTE(mriedem): The conflict is due to not having change
    I8e47cac8bab50a086b98f37c2f9f659b10009cf1 in Rocky.
    Also note that the func_fixtures import in the functional
    test was changed since it was added in Stein with change
    Idaed39629095f86d24a54334c699a26c218c6593.

    Change-Id: I9f73c237923fdcbf4096edc5aedd2c968d4b893e
    Closes-Bug: #1771860
    Related-Bug: #1768876
    (cherry picked from commit 0ed68c76fa8a84d1d5f0ab945e34c8e16341d627)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/647623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ee07c1c67d4ee6b146783c40826c4a28bd7367c3
Submitter: Zuul
Branch: stable/queens

commit ee07c1c67d4ee6b146783c40826c4a28bd7367c3
Author: Matt Riedemann <email address hidden>
Date: Thu May 10 19:27:36 2018 -0400

    Update instance.availability_zone during live migration

    While triaging bug 1768876 there was some concern
    that change I8d426f2635232ffc4b510548a905794ca88d7f99
    in Pike had regressed some behavior where a user that
    does not explicitly request a specific AZ during server
    create is then later restricted to only move operations
    within that same AZ.

    This test shows that it is not a regression because the
    AvailabilityZoneFilter looks at RequestSpec.availability_zone
    rather than instance.availability_zone, so the instance
    is free to be moved across zones.

    As a result of the test, however, it was noticed that
    the instance.availability_zone isn't updated during live
    migration once the destination host is selected. The other
    move operations like unshelve, evacuate and cold migrate
    all update the instance.availability_zone, so this copies
    the same logic.

    Change-Id: I9f73c237923fdcbf4096edc5aedd2c968d4b893e
    Closes-Bug: #1771860
    Related-Bug: #1768876
    (cherry picked from commit 0ed68c76fa8a84d1d5f0ab945e34c8e16341d627)
    (cherry picked from commit 133763d3582c2e85e4e5962b542294135d1a7f4c)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/647630
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d6832b0e070cebbe48a71a46d88b9412375989c1
Submitter: Zuul
Branch: stable/pike

commit d6832b0e070cebbe48a71a46d88b9412375989c1
Author: Matt Riedemann <email address hidden>
Date: Thu May 10 19:27:36 2018 -0400

    Update instance.availability_zone during live migration

    While triaging bug 1768876 there was some concern
    that change I8d426f2635232ffc4b510548a905794ca88d7f99
    in Pike had regressed some behavior where a user that
    does not explicitly request a specific AZ during server
    create is then later restricted to only move operations
    within that same AZ.

    This test shows that it is not a regression because the
    AvailabilityZoneFilter looks at RequestSpec.availability_zone
    rather than instance.availability_zone, so the instance
    is free to be moved across zones.

    As a result of the test, however, it was noticed that
    the instance.availability_zone isn't updated during live
    migration once the destination host is selected. The other
    move operations like unshelve, evacuate and cold migrate
    all update the instance.availability_zone, so this copies
    the same logic.

    Conflicts:
          nova/conductor/tasks/live_migrate.py
          nova/tests/unit/conductor/tasks/test_live_migrate.py

    NOTE(mriedem): The conflicts were due to changes
    Idad5cdbb2c5647c469e4ad5e9393564255df0f7f and
    I9068a5a5b47cef565802a6d58f37777464644100 in Queens.
    The _wait_for_migration_status interface also changed
    with I752617066bb2167b49239ab9d17b0c89754a3e12 in Queens.

    Change-Id: I9f73c237923fdcbf4096edc5aedd2c968d4b893e
    Closes-Bug: #1771860
    Related-Bug: #1768876
    (cherry picked from commit 0ed68c76fa8a84d1d5f0ab945e34c8e16341d627)
    (cherry picked from commit 133763d3582c2e85e4e5962b542294135d1a7f4c)
    (cherry picked from commit ee07c1c67d4ee6b146783c40826c4a28bd7367c3)

tags: added: in-stable-pike