stack and baremetal workbooks are miss-using retry and continue-on

Bug #1755754 reported by Dougal Matthews
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Dmitry Tantsur

Bug Description

Using this code from the baremetal.yaml file to demonstrate the issue. At the time of reporting I count 6 places this wont work as expected.

      wait_for_provision_state:
        action: ironic.node_get node_id=<% $.node_uuid %>
        timeout: 1200 #20 minutes
        retry:
          delay: 3
          count: 400
          continue-on: <% task().result.provision_state != $.target_state %>

We are using the retry Mistral policy to wait for an node to reach a target state. However, this is actually a misuse and it wont work as we expect. "retry" and everything under it (delay, count, continue-on, break-on etc.) will only ever be triggered if the action fails. In this case, we expect ironic.node_get will not fail - we just want to check different properties. Since it never fails, the retry policy is never triggered.

So, unless the API call fails for a different reason, the aciton will only be called once and the task will complete.

The easiest way I can see to get the behaviour we want is with a small utility workflow. Essentially we need to wrap the action and make it error if the behaviour is something we want to trigger retrying. So if the action doesn't reach one of our end states, then we consider it to have failed.

    verify_provision_state:
      input:
        - node_uuid
        - end_states # both the target state and the error states. Then continue-on and break-on can decide what to do
      tasks:
        check_provision_state:
          action: ironic.node_get node_id=<% $.node_uuid %>
          on-complete:
            - fail: <% task().result.provision_state not in $.end_states %>

So the above example would become something like this...

      wait_for_provision_state:
        workflow: verify_provision_state node_id=<% $.node_uuid %> end_states=<% [$.target_state, "error"] %>
        timeout: 1200 #20 minutes
        retry:
          delay: 3
          count: 400
          continue-on: <% task().result.provision_state != $.target_state %>
          break-on: <% task().result.provision_state in $.error_states %>

note; error_states isn't in master yet, but is based on a wip patch: https://review.openstack.org/#/c/552554/

Changed in tripleo:
assignee: nobody → Dmitry Tantsur (divius)
status: Confirmed → In Progress
Revision history for this message
Dougal Matthews (d0ugal) wrote :

Marked as invalid. Turned out I *still* didn't understand this. We need to document this better in Mistral, perhaps with examples.

Changed in tripleo:
status: In Progress → Invalid
Revision history for this message
Brad P. Crochet (brad-9) wrote :

Use break-on if you expect your action to be in ERROR and it may eventually go to SUCCESS. Use continue-on if your action will always be in SUCCESS, and you have some other trigger as to whether to continue or not.

I think this works as it is supposed to. The docs are just unclear as to whether they should be used together.

Changed in tripleo:
status: Invalid → In Progress
Ryan Brady (rbrady)
Changed in tripleo:
status: In Progress → Invalid
Changed in tripleo:
status: Invalid → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/552554
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=d97cd4a00556457914b783282c544b5233af0260
Submitter: Zuul
Branch: master

commit d97cd4a00556457914b783282c544b5233af0260
Author: Dmitry Tantsur <email address hidden>
Date: Wed Mar 14 14:20:15 2018 +0100

    Fix error handling in set_provision_state/set_power_state workflows

    Currently these workflows succeed in any case, since we don't have
    any condition to fail. This change makes them fail if the resulting
    state does not match the expected one.

    It also handles the case when a node goes into one of the failure
    states, so that we don't wait until timeout. Proper error message
    is returned to avoid confusing operators.

    Finally, it reduces the traffic between mistral and ironic by only
    requesting the required fields.

    Partial-Bug: #1755754
    Closes-Bug: #1667776
    Change-Id: Ice19306d4c4a2080b0337bc02a6ccee4a81411b5

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/559314

Changed in tripleo:
milestone: rocky-1 → rocky-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/559314
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=27b2ef1d72365c105be7bc03eb1f67986688dfe7
Submitter: Zuul
Branch: stable/queens

commit 27b2ef1d72365c105be7bc03eb1f67986688dfe7
Author: Dmitry Tantsur <email address hidden>
Date: Wed Mar 14 14:20:15 2018 +0100

    Fix error handling in set_provision_state/set_power_state workflows

    Currently these workflows succeed in any case, since we don't have
    any condition to fail. This change makes them fail if the resulting
    state does not match the expected one.

    It also handles the case when a node goes into one of the failure
    states, so that we don't wait until timeout. Proper error message
    is returned to avoid confusing operators.

    Finally, it reduces the traffic between mistral and ironic by only
    requesting the required fields.

    Conflicts:
     workbooks/baremetal.yaml

    Partial-Bug: #1755754
    Closes-Bug: #1667776
    Change-Id: Ice19306d4c4a2080b0337bc02a6ccee4a81411b5
    (cherry picked from commit d97cd4a00556457914b783282c544b5233af0260)

tags: added: in-stable-queens
Dougal Matthews (d0ugal)
Changed in tripleo:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.