ironic: moving node to maintenance makes it unusable afterwards

Bug #1839560 reported by Mohammed Naser
Affects                    Status         Importance  Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   High        Matt Riedemann
Rocky                      Fix Committed  High        Matt Riedemann
Stein                      Fix Committed  High        Matt Riedemann

Bug Description

If you use the Ironic API to set a node into maintenance (for whatever reason), it will no longer be included in the list of available nodes reported to Nova.

When Nova periodically refreshes its resources, it finds that the node is no longer in the list of available nodes and deletes its ComputeNode record from the database.

Once you take the node out of maintenance and Nova attempts to create the ComputeNode record again, the create fails with a duplicate UUID error, because the old record was only soft-deleted and still carries the same UUID.
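
To make the failure concrete, here is a minimal, self-contained sketch (plain sqlite3, not nova's actual schema or code) of why the soft delete doesn't help: the unique index covers only the uuid column, so the soft-deleted row still blocks re-inserting the same UUID.

    # Minimal sketch, NOT nova code: a unique index on uuid alone means the
    # soft-deleted row still owns the uuid, so re-creating the node fails.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes ("
        "  id INTEGER PRIMARY KEY,"
        "  uuid TEXT,"
        "  deleted INTEGER NOT NULL DEFAULT 0)")
    conn.execute("CREATE UNIQUE INDEX compute_nodes_uuid_idx ON compute_nodes (uuid)")

    node_uuid = "77788ad5-f1a4-46ac-8132-2d88dbd4e594"  # uuid seen in the traces below
    conn.execute("INSERT INTO compute_nodes (uuid) VALUES (?)", (node_uuid,))

    # Node goes into ironic maintenance; nova "soft deletes" the record (row stays).
    conn.execute("UPDATE compute_nodes SET deleted = id WHERE uuid = ?", (node_uuid,))

    try:
        # Node comes back; nova tries to create the ComputeNode record again.
        conn.execute("INSERT INTO compute_nodes (uuid) VALUES (?)", (node_uuid,))
    except sqlite3.IntegrityError as exc:
        print("duplicate uuid rejected:", exc)  # analogue of the DBDuplicateEntry below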

ref:
https://github.com/openstack/nova/commit/9f28727eb75e05e07bad51b6eecce667d09dfb65
- this made ComputeNode.uuid match the bare metal node UUID

https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L8304-L8316
- this soft-deletes the ComputeNode record when the node is not in the driver's list of active nodes
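
Paraphrasing the behaviour of that manager.py code (a sketch, not the actual nova source; names are illustrative): any ComputeNode record whose node is no longer reported by the driver is treated as an orphan and (soft) deleted by the periodic resource update.

    # Hedged paraphrase of the orphan cleanup described above; illustrative only.
    def _delete_orphan_compute_nodes(compute_nodes_in_db, reported_nodenames):
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in reported_nodenames:
                # Same situation as the "Deleting orphan compute node ..." trace below:
                # an ironic node in maintenance simply drops out of the reported set.
                print("Deleting orphan compute node %s hypervisor host is %s"
                      % (cn.id, cn.hypervisor_hostname))
                cn.destroy()  # soft delete: the row is kept, only the deleted flag is set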

traces:
2019-08-08 17:20:13.921 6379 INFO nova.compute.manager [req-c71e5c81-eb34-4f72-a260-6aa7e802f490 - - - - -] Deleting orphan compute node 31 hypervisor host is 77788ad5-f1a4-46ac-8132-2d88dbd4e594, nodes are set([u'6d556617-2bdc-42b3-a3fe-b9218a1ebf0e', u'a634fab2-ecea-4cfa-be09-032dce6eaf51', u'2dee290d-ef73-46bc-8fc2-af248841ca12'])
...
2019-08-08 22:21:25.284 82770 WARNING nova.compute.resource_tracker [req-a58eb5e2-9be0-4503-bf68-dff32ff87a3a - - - - -] No compute node record for ctl1-xxxx:77788ad5-f1a4-46ac-8132-2d88dbd4e594: ComputeHostNotFound_Remote: Compute host ctl1-xxxx could not be found.
....
Remote error: DBDuplicateEntry (pymysql.err.IntegrityError) (1062, u"Duplicate entry '77788ad5-f1a4-46ac-8132-2d88dbd4e594' for key 'compute_nodes_uuid_idx'")
....

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: New → Triaged
importance: Undecided → High
tags: added: compute ironic
Revision history for this message
Matt Riedemann (mriedem) wrote :

There are some ideas about hard-deleting the compute node records when they are (soft) deleted, but only for ironic nodes. That gets messy though (the delete path is called from lots of places, e.g. when a nova-compute service record is deleted), so it's probably easiest to just revert this:

https://review.opendev.org/#/c/571535/

Note you'd also have to revert this to avoid conflicts:

https://review.opendev.org/#/c/611162/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675496

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/675705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/676507

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/676509

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/675705
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
Submitter: Zuul
Branch: master

commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
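
As an illustration of that test setup (class and method names here are assumptions, not the actual test driver), a fake virt driver only needs to report a stable UUID per node and be able to stop reporting a node to mimic ironic maintenance:

    # Illustrative only; the real functional test driver differs in detail.
    import uuid


    class PredictableUUIDDriver:
        """Fake virt driver reporting a fixed UUID per node, like the ironic driver."""

        def __init__(self, nodenames):
            self._uuids = {name: str(uuid.uuid4()) for name in nodenames}
            self._hidden = set()  # nodes "in maintenance", not reported

        def get_available_nodes(self, refresh=False):
            return [n for n in self._uuids if n not in self._hidden]

        def get_node_uuid(self, nodename):
            return self._uuids[nodename]  # always the same value for a given node

        def set_maintenance(self, nodename, enabled=True):
            (self._hidden.add if enabled else self._hidden.discard)(nodename)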

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/675496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8b007266f438ec0a5a797d05731cce6f2b155f4c
Submitter: Zuul
Branch: master

commit 8b007266f438ec0a5a797d05731cce6f2b155f4c
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
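
In outline (a sketch of the approach the commit message describes, not the merged patch; the helper names are hypothetical), the DB-layer create now falls back to restoring the soft-deleted row:

    # Sketch of the described approach; _find_soft_deleted/_undelete are hypothetical helpers.
    from oslo_db import exception as db_exc  # DBDuplicateEntry comes from oslo.db


    def compute_node_create(context, values, _create, _find_soft_deleted, _undelete):
        try:
            return _create(context, values)
        except db_exc.DBDuplicateEntry:
            # A soft-deleted compute_nodes row with this uuid still holds the
            # unique index slot; un-delete it instead of inserting a new row.
            existing = _find_soft_deleted(context, values['uuid'])
            return _undelete(context, existing)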

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676513

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/676514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/676507
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e7109d43d6d6db0f9db18976585f3334d2be72bf
Submitter: Zuul
Branch: stable/stein

commit e7109d43d6d6db0f9db18976585f3334d2be72bf
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    NOTE(mriedem): Since change I2cf2fcbaebc706f897ce5dfbff47d32117064f9c
    is not in Stein this backport needs to modify the test to use
    the global set_nodes/restore_nodes which means we can remove some
    of the startup hackery at the beginning of the test. Also, since
    FakeDriver.set_nodes does not exist in Stein we have to modify the
    FakeDriver._nodes variable directly (the global doesn't affect that
    after startup).

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
    (cherry picked from commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/676509
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1b021665281b74c865d3571fc90772b52d70e467
Submitter: Zuul
Branch: stable/stein

commit 1b021665281b74c865d3571fc90772b52d70e467
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
    (cherry picked from commit 8b007266f438ec0a5a797d05731cce6f2b155f4c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/676513
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ecd1e046214e087dd484359f256386a3e8962ec1
Submitter: Zuul
Branch: stable/rocky

commit ecd1e046214e087dd484359f256386a3e8962ec1
Author: Matt Riedemann <email address hidden>
Date: Fri Aug 9 17:24:07 2019 -0400

    Add functional regression recreate test for bug 1839560

    This adds a functional test which recreates bug 1839560
    where the driver reports a node, then no longer reports
    it so the compute manager deletes it, and then the driver
    reports it again later (this can be common with ironic
    nodes as they undergo maintenance). The issue is that since
    Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
    ironic node uuid is re-used for the compute node uuid but
    there is a unique constraint on the compute node uuid so
    when trying to create the compute node once the ironic node
    is available again, the compute node create fails with a
    duplicate entry error due to the duplicate uuid. To recreate
    this in the functional test, a new fake virt driver is added
    which provides a predictable uuid per node like the ironic
    driver. The test also shows that archiving the database is
    a way to workaround the bug until it's properly fixed.

    NOTE(mriedem): Since change Idaed39629095f86d24a54334c699a26c218c6593
    is not in Rocky the PlacementFixture still comes from nova_fixtures.

    Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
    Related-Bug: #1839560
    (cherry picked from commit 89dd74ac7f1028daadf86cb18948e27fe9d1d411)
    (cherry picked from commit e7109d43d6d6db0f9db18976585f3334d2be72bf)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/676514
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ce94844fa6a43368445182bb086876874256197
Submitter: Zuul
Branch: stable/rocky

commit 9ce94844fa6a43368445182bb086876874256197
Author: Matt Riedemann <email address hidden>
Date: Mon Aug 12 14:39:16 2019 -0400

    Restore soft-deleted compute node with same uuid

    There is a unique index on the compute_nodes.uuid column which
    means we can't have more than one compute_nodes record in the
    same DB with the same UUID even if one is soft deleted because
    the deleted column is not part of that unique index constraint.

    This is a problem with ironic nodes where the node is 1:1 with
    the compute node record, and when a node is undergoing maintenance
    the driver doesn't return it from get_available_nodes() so the
    ComputeManager.update_available_resource periodic task (soft)
    deletes the compute node record, but when the node is no longer
    under maintenance in ironic and the driver reports it, the
    ResourceTracker._init_compute_node code will fail to create the
    ComputeNode record again because of the duplicate uuid.

    This change handles the DBDuplicateEntry error in compute_node_create
    by finding the soft-deleted compute node with the same uuid and
    simply updating it to no longer be (soft) deleted.

    Closes-Bug: #1839560

    Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
    (cherry picked from commit 8b007266f438ec0a5a797d05731cce6f2b155f4c)
    (cherry picked from commit 1b021665281b74c865d3571fc90772b52d70e467)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.3

This issue was fixed in the openstack/nova 19.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.3

This issue was fixed in the openstack/nova 18.2.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/707886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/707887

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/707886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/707887
Reason: As mentioned in the commit message of the previous patch (which I completely missed), this is only applicable to >= Rocky.
