[RFE] introduce "inspect wait" provision state

Bug #1725211 reported by milan k
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
Wishlist
Kaifeng Wang
Ironic Inspector
Fix Released
Wishlist
Kaifeng Wang

Bug Description

The in-band inspection is an asynchronous process that isn't currently protected by a timeout but rather relies on the Inspector inspect interface's periodic task[1] to eventually set the appropriate node state.

This however has the downside of "locking" the node in the inspecting state for periods indefinitely long[2].

Another disadvantage of current implementation is the absence of an operator interface to abort the running introspection and a tricky way of wrapping the asynchronous nature of the ironic-inspector's abort REST API endpoint[3].

Last but not least, this approach isn't future-proof: we want to eventually handle the node inspection state solely thru ironic[4].

Let's therefore have a new passive state in the ironic state machine, the inspect-wait state, protected by a configurable timeout, that:

  * node state is navigated to, from the inspecting state,
    upon the _start_introspection method ends[5]
  * upon a (periodically-checked) timeout happened, the node state is navigated into the
    inspect-failed state (and the node is, if configured so, powered off)
  * upon the abort request, the node is kept in the same inspect-wait state till either:
    * the (current) periodic task[2] moves the node state to the inspect-failed state
    * the inspection timeout checking periodic task moves the node state into the
      inspect-failed state

[1] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/inspector.py#L121,#L135
[2] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/inspector.py#L168,#L176
[3] https://review.openstack.org/#/c/482867/16/specs/approved/inspection-abort.rst
[4] https://etherpad.openstack.org/p/inspector-queens-virtual-ptg
[5] https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/inspector.py#L137

Revision history for this message
Dmitry Tantsur (divius) wrote :

I'm good with this. Note that it will need a fix in ironic-inspector to treat "inspect wait" as a valid state OR we'll have to guard this new state with a microversion.

summary: - introduce an inspect-wait state
+ [RFE] introduce "inspect wait" provision state
Changed in ironic:
status: New → Confirmed
importance: Undecided → Wishlist
tags: added: rfe
Revision history for this message
milan k (vetrisko) wrote :

Good point, I'll report that briefly on inspector!
Thanks!

Revision history for this message
Sam Betts (sambetts) wrote :

Thanks Milan sounds like a good plan!

Revision history for this message
milan k (vetrisko) wrote :

Before inspection of a node can be triggered, inspector checks the node is in a valid provisioning state[1]. Since ironic will eventually adopt the inspect-wait state[2].

Let's add it to the list of valid provisioning states[3] to avoid inspection from breaking.

[1] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/introspect.py#L47
[2] https://bugs.launchpad.net/ironic/+bug/1725211
[3] https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/common/ironic.py#L29

milan k (vetrisko)
Changed in ironic-inspector:
status: New → Confirmed
Dmitry Tantsur (divius)
Changed in ironic-inspector:
importance: Undecided → Wishlist
Revision history for this message
Ruby Loo (rloo) wrote :

Hi Milan, thx for the RFE.

When providing links to code, it would be good to provide a link to an 'instance' of the file, not latest, cuz that could change and then your line numbers won't make any sense.

wrt "This however has the downside of "locking" the node in the inspecting state for periods indefinitely long[2]." I could be wrong, but I took a quick look at [2] and it only does a write lock when updating the node; is that what you mean by 'indefinitely long' or am I missing something? Specifically, i think it is this line [Z] that does a write lock.

You're going to need to update our state diagram [Y]. It isn't clear to me what/how you transition from inspect to "inspect wait" back to inspect or "inspect fail". Is it clear to the folks that know how inspecting works?

Your description mentions 'abort': "upon the abort request, the node is kept in the same inspect-wait state till either". Why, if the user is requesting an abort, do you have to keep it in inspect-wait til ... ?? (Again, I don't know much about the inspect stuff). Also, maybe the 'abort' feature should be a separate RFE? Well, there is a proposal for that, so are you just commenting in this, about how the abort might work with the new state, or is the intention for the abort to be implemented here?

Have you thought about whether we need an API version to hide/reveal this new state?
- regardless of the API version, I'm assuming that you want to put a node into this new 'inspect wait' state
- if so, let's say we add this new state in version 1.40. If the user is using version 1.39 and the node is in 'inspect wait' state, do we show that, or do we show 'inspect'? I think we probably need to show the node in inspect, not 'inspect wait', since that's the whole idea about our wonderful API versions.

In general, I'm good with an 'inspect wait' state; it makes it more consistent with the rest of our states. If your description works for the folks that are more familiar with inspect and for the folks that will be reviewing this code, then it is fine with me.

If you can rewrite/answer all my questions in your one description, then I am fine with no spec. I don't want people combing through these comments, trying to figure out how this is meant to work.

[Y]: https://docs.openstack.org/ironic/latest/contributor/states.html

[Z] https://github.com/openstack/ironic/blob/7337fefd978b14d8a5e04314565e696e86879f21/ironic/drivers/modules/inspector.py#L185

Revision history for this message
milan k (vetrisko) wrote :
Download full text (3.4 KiB)

Hi Ruby!

Thanks for the proofreading!
TL;DR: let's have a (separate?) spec

> When providing links to code, it would be good to provide a link to an 'instance' of the file, not latest, cuz that could change and then your line numbers won't make any sense.
True, next time I'm using some release tag, sorry.

>> wrt "This however has the downside of "locking" the node in the inspecting state for periods indefinitely long[2]."
> I could be wrong, but I took a quick look at [2] and it only does a write lock when updating the node; is that what you mean by 'indefinitely long' or am I missing something? Specifically, i think it is this line [Z] that does a write lock.
I'm rather referring to: the node provisioning state need not ever change. I might be wrong here; my assumption is based just on reading through this inspector module.

> Your description mentions
>> 'abort': "upon the abort request, the node is kept in the same inspect-wait state till either".
> Why, if the user is requesting an abort, do you have to keep it in inspect-wait til ... ??
The inspector response is async on the abort call. I though it would be better not to assume inspector was able to handle the request right away (as would have been implied by having advanced the node provision state to inspect-failed). I wanted the inspect-wait instead to be indifferent to the abort call while still relaying the abort request to inspector to handle. In the end of the day, inspector would propagate the inspection failure back through the periodic sync. Should the timeout kick in, ironic would set inspect-failed state on its own. But I see this might be rather inspector-specific (and complicated). I haven't thought about aborting vendors' out-of-band inspection call-flow with regards to the state machine.

> (Again, I don't know much about the inspect stuff). Also, maybe the 'abort' feature should be a separate RFE? Well, there is a proposal for that, so are you just commenting in this, about how the abort might work with the new state, or is the intention for the abort to be implemented here?
Originally I thought these were separate issues and inspect-wait might have been beneficial on its own, as a separate RFE (mind the timeout protection and the "state-locking" issue). I'm not against merging with the spec in question[A] or having a separate one, what ever makes it smoother to adopt ;)

> Have you thought about whether we need an API version to hide/reveal this new state?
I haven't but I believe it should be exposed

> - regardless of the API version, I'm assuming that you want to put a node into this new 'inspect wait' state
hmmm... I thought one would rather put the node into inspecting state, the state machine would go to the inspect-wait on its own?

> - if so, let's say we add this new state in version 1.40. If the user is using version 1.39 and the node is in 'inspect wait' state, do we show that, or do we show 'inspect'? I think we probably need to show the node in inspect, not 'inspect wait', since that's the whole idea about our wonderful API versions.
I agree, not very nice IMO but I guess no other way, just show the "fake" state.

> If you can rewrite/answer all my ques...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-specs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/515657

Changed in ironic:
assignee: nobody → milan k (vetrisko)
status: Confirmed → In Progress
Ruby Loo (rloo)
tags: added: needs-spec
Revision history for this message
Kaifeng Wang (kaifeng) wrote :

Greetings guys, mind I ask if this work will be continued?

When I proposed the spec of inspection abort[1], milan said it's better to have a deploy wait state, so I was expecting continuing my spec after this one is done. If no one is working on it, maybe I can take it over.

[1] https://bugs.launchpad.net/ironic/+bug/1703089

Revision history for this message
Dmitry Tantsur (divius) wrote :

Milan is no longer working in OpenStack, so this can be taken over.

Changed in ironic:
assignee: milan k (vetrisko) → nobody
Kaifeng Wang (kaifeng)
Changed in ironic:
assignee: nobody → Kaifeng Wang (kaifeng)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-specs (master)

Reviewed: https://review.openstack.org/515657
Committed: https://git.openstack.org/cgit/openstack/ironic-specs/commit/?id=b2e9b6dac6c7a2f4c0755a94bdb008317c34a9df
Submitter: Zuul
Branch: master

commit b2e9b6dac6c7a2f4c0755a94bdb008317c34a9df
Author: dparalen <email address hidden>
Date: Fri Oct 27 10:04:54 2017 +0200

    Introduce inspect wait state

    The in-band inspection is an asynchronous process isn't currently
    handled through a "waiting" state. This is a discrepancy from the
    rest of the asynchronous ironic states and may comprise a problem
    for future features such as aborting the introspection or merging
    ironic and ironic-inspector.

    The spec adds a new inspect wait state to ironic state machine to
    keep consistent with other asynchronous states.

    Change-Id: I6fe864079d85fe7e7ba9650a51628e88e8a662cc
    Partial-Bug: #1725211

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic-specs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/553822

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic-specs (master)

Reviewed: https://review.openstack.org/553822
Committed: https://git.openstack.org/cgit/openstack/ironic-specs/commit/?id=c48069bea7c925847291d62490ae67a2020ae736
Submitter: Zuul
Branch: master

commit c48069bea7c925847291d62490ae67a2020ae736
Author: Kaifeng Wang <email address hidden>
Date: Fri Mar 16 22:05:09 2018 +0800

    Follow up of inspect wait state spec

    The comments raised at patch [1] needs to be addressed:

    1. The document inconsistency about default value of
       ``[conductor]inspect_wait_timeout``.
    2. The deprecation of returning ``INSPECTING`` from inspect_hardware
       of inspect interface.

    [1] https://review.openstack.org/#/c/515657/

    Change-Id: Idec0e7be483d9617318006a5592cbc0502eafaf2
    Related-Bug: #1725211

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-inspector (master)

Fix proposed to branch: master
Review: https://review.openstack.org/555202

Changed in ironic-inspector:
assignee: nobody → Kaifeng Wang (kaifeng)
status: Confirmed → In Progress
Dmitry Tantsur (divius)
tags: added: rfe-approved
removed: rfe
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-inspector (master)

Reviewed: https://review.openstack.org/555202
Committed: https://git.openstack.org/cgit/openstack/ironic-inspector/commit/?id=f257c571771a024cd36f792e80e77f4483c4162c
Submitter: Zuul
Branch: master

commit f257c571771a024cd36f792e80e77f4483c4162c
Author: Kaifeng Wang <email address hidden>
Date: Thu Mar 22 16:16:34 2018 +0800

    Add 'inspect wait' as a valid state

    ironic-inspector checks node provision state before starting hardware
    introspection, to allow inspection for node at inspect wait state,
    this state has to be added to ironic-inspector.

    Story: #1725211
    Task: #11372
    Change-Id: I89d9dfb85e191e781d869374911950d322fc227e
    Partial-Bug: #1725211

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/555610
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=6df82ee2bcc702168e80ec7c81e5573d766f9c71
Submitter: Zuul
Branch: master

commit 6df82ee2bcc702168e80ec7c81e5573d766f9c71
Author: Kaifeng Wang <email address hidden>
Date: Thu Mar 22 17:35:00 2018 +0800

    Implementation of inspect wait state

    This patch provides implementations to feature of adding inspect wait state.

    Changes covered in this patch:

    * Added state and transitions, state diagram regenerated.
    * inspector and oneview inspect interface now return INSPECTWAIT instead of
      INSPECTING. Move node to inspect wait if inspect interface returns
      INSPECTING or INSPECTWAIT.
    * Add a timeout option to conductor, and a periodic task to check timeout
      in the inspect wait state.

    Story: #1725211
    Task: #10630

    Partial-Bug: #1725211
    Change-Id: Ie76bfdad5966014a4dae826919ff5705462c743b

Changed in ironic:
status: In Progress → Fix Released
Kaifeng Wang (kaifeng)
Changed in ironic-inspector:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.