ironic: moving node to maintenance makes it unusable afterwards

Bug #1839560 reported by Mohammed Naser
Affects                    Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   High        Matt Riedemann
Rocky                      Fix Committed  High        Matt Riedemann
Stein                      Fix Committed  High        Matt Riedemann

Bug Description

If you use the Ironic API to set a node into maintenance (for whatever reason), it will no longer be included in the list of available nodes reported to Nova.

When Nova periodically refreshes its resources, it finds that the node is no longer in the list of available nodes and deletes its ComputeNode record from the database.

Once you take the node out of maintenance and Nova attempts to create the ComputeNode record again, the create fails with a duplicate UUID error, because the old record was only soft-deleted and still carries the same UUID.
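
To make the failure concrete, here is a minimal, self-contained sketch (plain sqlite3, not nova's actual schema or code) of why the soft delete doesn't help: the unique index covers only the uuid column, so the soft-deleted row still blocks re-inserting the same UUID.

    # Minimal sketch, NOT nova code: a unique index on uuid alone means the
    # soft-deleted row still owns the uuid, so re-creating the node fails.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes ("
        "  id INTEGER PRIMARY KEY,"
        "  uuid TEXT,"
        "  deleted INTEGER NOT NULL DEFAULT 0)")
    conn.execute("CREATE UNIQUE INDEX compute_nodes_uuid_idx ON compute_nodes (uuid)")

    node_uuid = "77788ad5-f1a4-46ac-8132-2d88dbd4e594"  # uuid seen in the traces below
    conn.execute("INSERT INTO compute_nodes (uuid) VALUES (?)", (node_uuid,))

    # Node goes into ironic maintenance; nova "soft deletes" the record (row stays).
    conn.execute("UPDATE compute_nodes SET deleted = id WHERE uuid = ?", (node_uuid,))

    try:
        # Node comes back; nova tries to create the ComputeNode record again.
        conn.execute("INSERT INTO compute_nodes (uuid) VALUES (?)", (node_uuid,))
    except sqlite3.IntegrityError as exc:
        print("duplicate uuid rejected:", exc)  # analogue of the DBDuplicateEntry below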

ref:
https://github.com/openstack/nova/commit/9f28727eb75e05e07bad51b6eecce667d09dfb65
- this made ComputeNode.uuid match the bare metal node UUID

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L8304-L8316
- this soft-deletes the ComputeNode record when the node is not in the driver's list of active nodes
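
Paraphrasing the behaviour of that manager.py code (a sketch, not the actual nova source; names are illustrative): any ComputeNode record whose node is no longer reported by the driver is treated as an orphan and (soft) deleted by the periodic resource update.

    # Hedged paraphrase of the orphan cleanup described above; illustrative only.
    def _delete_orphan_compute_nodes(compute_nodes_in_db, reported_nodenames):
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in reported_nodenames:
                # Same situation as the "Deleting orphan compute node ..." trace below:
                # an ironic node in maintenance simply drops out of the reported set.
                print("Deleting orphan compute node %s hypervisor host is %s"
                      % (cn.id, cn.hypervisor_hostname))
                cn.destroy()  # soft delete: the row is kept, only the deleted flag is set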

traces:
2019-08-08 17:20:13.921 6379 INFO nova.compute.manager [req-c71e5c81-eb34-4f72-a260-6aa7e802f490 - - - - -] Deleting orphan compute node 31 hypervisor host is 77788ad5-f1a4-46ac-8132-2d88dbd4e594, nodes are set([u'6d556617-2bdc-42b3-a3fe-b9218a1ebf0e', u'a634fab2-ecea-4cfa-be09-032dce6eaf51', u'2dee290d-ef73-46bc-8fc2-af248841ca12'])
...
2019-08-08 22:21:25.284 82770 WARNING nova.compute.resource_tracker [req-a58eb5e2-9be0-4503-bf68-dff32ff87a3a - - - - -] No compute node record for ctl1-xxxx:77788ad5-f1a4-46ac-8132-2d88dbd4e594: ComputeHostNotFound_Remote: Compute host ctl1-xxxx could not be found.
....
Remote error: DBDuplicateEntry (pymysql.err.IntegrityError) (1062, u"Duplicate entry '77788ad5-f1a4-46ac-8132-2d88dbd4e594' for key 'compute_nodes_uuid_idx'")
....

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: New → Triaged
importance: Undecided → High
tags: added: compute ironic
Revision history for this message
Matt Riedemann (mriedem) wrote :

There are some ideas about hard-deleting the compute node records when they are (soft) deleted, but only for ironic nodes. That gets messy though (the delete path is called from lots of places, e.g. when a nova-compute service record is deleted), so it's probably easiest to just revert this:

https://review.opendev.org/#/c/571535/

Note you'd also have to revert this to avoid conflicts:

https://review.opendev.org/#/c/611162/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675496

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/675705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/676507

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676509

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/675705
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
Submitter: Zuul
Branch: master

commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
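
As an illustration of that test setup (class and method names here are assumptions, not the actual test driver), a fake virt driver only needs to report a stable UUID per node and be able to stop reporting a node to mimic ironic maintenance:

    # Illustrative only; the real functional test driver differs in detail.
    import uuid


    class PredictableUUIDDriver:
        """Fake virt driver reporting a fixed UUID per node, like the ironic driver."""

        def __init__(self, nodenames):
            self._uuids = {name: str(uuid.uuid4()) for name in nodenames}
            self._hidden = set()  # nodes "in maintenance", not reported

        def get_available_nodes(self, refresh=False):
            return [n for n in self._uuids if n not in self._hidden]

        def get_node_uuid(self, nodename):
            return self._uuids[nodename]  # always the same value for a given node

        def set_maintenance(self, nodename, enabled=True):
            (self._hidden.add if enabled else self._hidden.discard)(nodename)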

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/675496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8b007266f438ec0a5a797d05731cce6f2b155f4c
Submitter: Zuul
Branch: master

commit 8b007266f438ec0a5a797d05731cce6f2b155f4c
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
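
In outline (a sketch of the approach the commit message describes, not the merged patch; the helper names are hypothetical), the DB-layer create now falls back to restoring the soft-deleted row:

    # Sketch of the described approach; _find_soft_deleted/_undelete are hypothetical helpers.
    from oslo_db import exception as db_exc  # DBDuplicateEntry comes from oslo.db


    def compute_node_create(context, values, _create, _find_soft_deleted, _undelete):
        try:
            return _create(context, values)
        except db_exc.DBDuplicateEntry:
            # A soft-deleted compute_nodes row with this uuid still holds the
            # unique index slot; un-delete it instead of inserting a new row.
            existing = _find_soft_deleted(context, values['uuid'])
            return _undelete(context, existing)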

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676513

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/676507
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e7109d43d6d6db0f9db18976585f3334d2be72bf
Submitter: Zuul
Branch: stable/stein

commit e7109d43d6d6db0f9db18976585f3334d2be72bf
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    NOTE(mriedem): Since change I2cf2fcbaebc706f897ce5dfbff47d32117064f9c
    is not in Stein this backport needs to modify the test to use
    the global set_nodes/restore_nodes which means we can remove some
    of the startup hackery at the beginning of the test. Also, since
    FakeDriver.set_nodes does not exist in Stein we have to modify the
    FakeDriver._nodes variable directly (the global doesn't affect that
    after startup).

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
    (cherry picked from commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/676509
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1b021665281b74c865d3571fc90772b52d70e467
Submitter: Zuul
Branch: stable/stein

commit 1b021665281b74c865d3571fc90772b52d70e467
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
    (cherry picked from commit 8b007266f438ec0a5a797d05731cce6f2b155f4c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/676513
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ecd1e046214e087dd484359f256386a3e8962ec1
Submitter: Zuul
Branch: stable/rocky

commit ecd1e046214e087dd484359f256386a3e8962ec1
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    NOTE(mriedem): Since change Idaed39629095f86d24a54334c699a26c218c6593
    is not in Rocky the PlacementFixture still comes from nova_fixtures.

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
    (cherry picked from commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411)
    (cherry picked from commit e7109d43d6d6db0f9db18976585f3334d2be72bf)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/676514
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ce94844fa6a43368445182bb086876874256197
Submitter: Zuul
Branch: stable/rocky

commit 9ce94844fa6a43368445182bb086876874256197
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
    (cherry picked from commit 8b007266f438ec0a5a797d05731cce6f2b155f4c)
    (cherry picked from commit 1b021665281b74c865d3571fc90772b52d70e467)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.3

This issue was fixed in the openstack/nova 19.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.3

This issue was fixed in the openstack/nova 18.2.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/707886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/707887

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/707886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/707887
Reason: As mentioned in the commit message of the previous patch (which I completely missed), this is only applicable to >= Rocky.
