default api loop count / intervals can't cope with 40 machine clusters

Bug #1319858 reported by Robert Collins
This bug affects 2 people
Affects: Ironic
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

We have ironic configured with 5 workers (as currently recommended), and the ironic driver configured with only the following set:

[ironic]
admin_username = ironic
admin_password = ....
admin_url = http://127.0.0.1:35357/v2.0
admin_tenant_name = service

Deleting a cluster of 40 machines consistently throws 5-10 of them into an error state from the API delete loop.

The defaults are 5 retries, 2 seconds apart, i.e. a 10-second budget. With 5 workers, even if each request took only 1 second, we'd spend 8 seconds (40 machines / 5 workers) deleting the cluster, so anything slower pushes requests at the back of the queue past the retry budget.
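
To make the failure mode concrete, here is a minimal sketch of this kind of fixed-budget retry loop (the constants mirror the defaults described above; the helper itself is illustrative, not the actual driver code):

    import time

    # Defaults as described above: 5 retries, 2 seconds apart, i.e. a
    # 10-second budget per request before a hard failure is reported.
    API_MAX_RETRIES = 5
    API_RETRY_INTERVAL = 2  # seconds

    def call_with_retries(request, max_retries=API_MAX_RETRIES,
                          interval=API_RETRY_INTERVAL):
        """Retry an API call, failing hard once the budget is exhausted.

        With 40 machines queued behind 5 workers, requests near the back
        of the queue wait close to the whole budget before they are even
        serviced, so a mere queue backlog gets reported as a hard failure.
        """
        for attempt in range(max_retries):
            try:
                return request()
            except Exception:  # illustrative; real code catches less broadly
                if attempt == max_retries - 1:
                    raise  # budget exhausted: reported as a hard failure
                time.sleep(interval)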

I suspect this falls in the 'async apis are needed mmmkay' category, but it would be good to handle larger setups more tolerantly by default.

For now I suggest changing the defaults to 60 retries. In future, maybe we could move the unassociation into Ironic: start with a power off (and wait for that), then unplug, etc., and let ironic do the later work on its own.
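
For anyone wanting to apply that workaround locally, the relevant knobs (assuming the option names used by the nova ironic driver, api_max_retries and api_retry_interval) would go in the same [ironic] section shown above:

    # assumption: option names as used by the nova ironic driver
    [ironic]
    api_max_retries = 60
    api_retry_interval = 2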

This bug is separate from the failing-to-cleanup-task-state bug, but certainly exacerbates it ;)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93731

Changed in ironic:
assignee: nobody → Robert Collins (lifeless)
status: New → In Progress
Changed in ironic:
assignee: Robert Collins (lifeless) → Chris Jones (cmsj)
Changed in ironic:
assignee: Chris Jones (cmsj) → Robert Collins (lifeless)
Dmitry Tantsur (divius)
Changed in ironic:
importance: Undecided → High
Changed in ironic:
assignee: Robert Collins (lifeless) → Ruby Loo (rloo)
Ruby Loo (rloo)
Changed in ironic:
assignee: Ruby Loo (rloo) → Robert Collins (lifeless)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/93731
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=70f518322872a464628e47c70e9d0f0faf2399a3
Submitter: Jenkins
Branch: master

commit 70f518322872a464628e47c70e9d0f0faf2399a3
Author: Robert Collins <email address hidden>
Date: Fri May 16 02:27:07 2014 +1200

    Allow more time for API requests to be completed

    We currently poll for success, but physical machines can take a while
    to respond to IPMI etc. The current defaults reliably fail to complete
    with 5 workers (which is the desired default). Until we have event
    based logic from Ironic -> Nova we need to try harder, or be
    unreliable. 60 retries of 2 seconds each means 2 minutes before a
    truly hard fail (as opposed to queue backlog) is reported. Note that
    deploys can starve deletes at the moment, so this fix is not enough to
    stop spurious reports of problems.

    Change-Id: I917a4430e5e77936deb82343ff057b9adc9822ec
    Partial-Bug: #1319858

aeva black (tenbrae)
Changed in ironic:
milestone: none → juno-3
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

Bumping to ironic-next, as the actual fix to this bug will require implementing callbacks to nova.
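
To illustrate the direction (purely a hypothetical sketch: nova's os-server-external-events API is real, but this helper, the event name, and the wiring are illustrative, not ironic code):

    import requests

    def notify_nova(nova_url, token, server_uuid, event_name):
        """Hypothetical push-style callback from ironic to nova.

        Instead of nova polling ironic against a fixed retry budget,
        ironic would tell nova when a node finishes an operation. Nova's
        os-server-external-events endpoint only accepts specific event
        names, so this is a sketch of the shape, not a working integration.
        """
        payload = {"events": [{"name": event_name,
                               "server_uuid": server_uuid}]}
        resp = requests.post(
            "%s/os-server-external-events" % nova_url,
            json=payload,
            headers={"X-Auth-Token": token},
        )
        resp.raise_for_status()
        return resp.json()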

Changed in ironic:
milestone: juno-3 → next
Revision history for this message
Dmitry Tantsur (divius) wrote :

Hi Robert! I'm assuming you're not working on this bug right now, correct? Please feel free to reassign at any moment.

Changed in ironic:
status: In Progress → Confirmed
assignee: Robert Collins (lifeless) → nobody
Revision history for this message
John Stafford (john-stafford) wrote :

The callbacks will require a spec in the L (Liberty) cycle.

Changed in ironic:
importance: High → Wishlist
Dmitry Tantsur (divius)
Changed in ironic:
milestone: next → none
Changed in ironic:
assignee: nobody → Galyna Zholtkevych (gzholtkevych)
Revision history for this message
Galyna Zholtkevych (gzholtkevych) wrote :

Hello,
I am trying to figure out how to fix this.

First question: the discussion suggests you are trying to delete instances. Is that right? If so, I think this is better fixed in the nova project.

Second question: is this bug still valid? Can you still reproduce it? It may already have been fixed in nova.

Revision history for this message
Galyna Zholtkevych (gzholtkevych) wrote :

I am asking about nova because the proposed patch no longer exists in ironic.

Revision history for this message
Galyna Zholtkevych (gzholtkevych) wrote :

Robert Collins,

Please provide exact steps to reproduce this.

I cannot figure them out, and this is an old bug; it may already have been fixed.

Changed in ironic:
assignee: Galyna Zholtkevych (gzholtkevych) → nobody
Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

I don't think we are going to get additional details from Robert at this point on reproducing this issue, given that he has not worked on OpenStack for some time. It seems it was simply a bulk delete from a failed TripleO build.

While the essence of Robert's change has been moved into Nova, I believe the proper action at this point is to mark it as incomplete in order to allow the bug to expire, since we can't verify that the issue is still present given how nova currently interacts with ironic.

If anyone else is encountering this issue, then we can always open a new bug.

Changed in ironic:
status: Confirmed → Incomplete
importance: Wishlist → Low
importance: Low → Undecided
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Ironic because there has been no activity for 60 days.]

Changed in ironic:
status: Incomplete → Expired