Baremetal recreation may fool the heat stack.

Bug #1298465 reported by jan grant
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
tripleo   Expired   Medium       Unassigned

Bug Description

We've seen a problem (reproducible in our lab) when using the devtest scripts on baremetal.

The situation is this: a running (virtualised) seed booting a (single-node) undercloud on real tin, which in turn boots a (multiple-node) overcloud.

Running "devtest.sh --trash-my-machine -c" causes the heat stack to get confused.

(We think "-c" is critical because it makes the devtest script move on to the heat stack-create step quickly; without it, the intervening DIB step takes long enough that this bug doesn't get triggered.)

What appears to happen is this: the running undercloud has a running o-c-c which is polling for metadata. It gets refreshed metadata from heat as the stack-create happens. o-c-c runs to completion on the node and posts success to its wait condition in the heat stack on the seed. All this happens before the seed has a chance to reboot and refresh the node.
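
The ordering described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names, the node name "undercloud-0" and the example URL are all hypothetical, and nothing here calls Heat or Nova. It only shows that if metadata is published before the node is re-imaged, the old image is the one that signals the new wait condition.

```python
from dataclasses import dataclass, field


@dataclass
class WaitCondition:
    """Stands in for a Heat wait condition reachable via a signed URL."""
    signed_url: str
    signals: list = field(default_factory=list)


@dataclass
class MetadataService:
    """Toy metadata endpoint: serves whatever was last published for a node."""
    by_node: dict = field(default_factory=dict)

    def publish(self, node, metadata):
        self.by_node[node] = metadata

    def fetch(self, node):
        return self.by_node.get(node)


@dataclass
class Node:
    name: str
    image: str  # which deployment's image is currently running on the node

    def poll_and_apply(self, metadata_service, wait_conditions):
        """One polling pass of the config agent on whatever image is running."""
        metadata = metadata_service.fetch(self.name)
        if metadata is None:
            return
        # The agent trusts the signed URL because it arrived via metadata.
        wait_conditions[metadata["wait_url"]].signals.append(self.image)


def simulate(reboot_before_poll):
    """Return which image(s) signalled the new stack's wait condition."""
    service = MetadataService()
    node = Node(name="undercloud-0", image="old")  # hypothetical node name
    new_wc = WaitCondition(signed_url="https://heat.example/signed/new")
    wait_conditions = {new_wc.signed_url: new_wc}

    # stack-create publishes metadata for the new instance straight away.
    service.publish(node.name, {"wait_url": new_wc.signed_url})

    if reboot_before_poll:
        node.image = "new"  # node was power-cycled and re-imaged first
    node.poll_and_apply(service, wait_conditions)
    return new_wc.signals
```

Calling simulate(False) models the failing order (the old image signals the new wait condition); simulate(True) models the intended order.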

It's not clear whether this is fundamentally down to nova baremetal populating the metadata for the new instance too early. Perhaps it should wait until the image loader is able to start imaging the node?

jan grant (jan-grant) wrote :

(Whether omitting the "-c" flag really changes the timing enough to prevent this, or we just got lucky, I don't know.)

This obviously doesn't show up if you run heat stack-delete between runs; however, with partially working underclouds (or overclouds), if the stack-delete wedges, or the seed is unable to complete it for some other reason, the state of the still-running baremetal node may compromise this process.

Changed in tripleo:
status: New → Confirmed
James Polley (tchaypo)
Changed in tripleo:
status: Confirmed → Triaged
importance: Undecided → Medium
Robert Collins (lifeless) wrote :

I'm going to disagree that this is confirmed - the heat URLs are signed, so there's no way the running machine should be able to pick up details from the new cluster. It would be a huge security hole if it could :).

What do you mean by 'confused'?

Changed in tripleo:
status: Triaged → Incomplete
jan grant (jan-grant) wrote :

We'll dig into this today. By "confused" I mean that template wait conditions are posted prematurely. We've seen a few other issues around misbehaving PXE deployments onto real tin - we're going to drill into these and see if we can bottom them out.

jan grant (jan-grant) wrote :

Just to note: we've confirmed this. The heat URLs are signed, yes, but they're supplied via metadata. An old, still-running BM machine can pick up that metadata via the 169.254.169.254 address.

Basically, if the BM machine is not powered down before heat tries to bring it up, then it can easily go wrong - depending on timing. (But timing isn't that critical - the metadata is posted long before nova-baremetal gets around to rebooting the host.)
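
One possible guard, sketched here purely as an illustration, is to make sure the node reports "off" before stack-create is started. The function name ensure_powered_off and the get_power_state and power_off callables are hypothetical; they would have to be supplied by the caller (for example as thin wrappers around whatever power-management tool is in use), and this is not something the devtest scripts are known to do.

```python
import time


def ensure_powered_off(get_power_state, power_off, timeout=60.0, interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Return True once the node reports "off", False if the timeout expires.

    get_power_state and power_off are hypothetical caller-supplied callables;
    get_power_state is assumed to return the string "on" or "off".
    """
    if get_power_state() == "off":
        return True
    power_off()  # request power-off once, then poll for the state change
    deadline = clock() + timeout
    while clock() < deadline:
        if get_power_state() == "off":
            return True
        sleep(interval)
    return False
```

A wrapper script could call this for each baremetal node and only proceed to heat stack-create if it returns True, so that an old running image never sees the new metadata.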

Launchpad Janitor (janitor) wrote :

[Expired for tripleo because there has been no activity for 60 days.]

Changed in tripleo:
status: Incomplete → Expired