Unreliable Ironic operations due to difficult-to-identify environmental conditions

Bug #1651127 reported by Justin Kilpatrick
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
Ironic    Expired   Undecided    Unassigned
tripleo   Invalid   Medium       Unassigned

Bug Description

I have a 24-node Super Micro 'microcloud' that is being used for bare metal CI. While using it I noticed that introspection, and to a lesser degree overcloud deployments, are very unreliable.

I checked the Juniper switch configuration and the lab conditions and found no visible issues, so I finally sat down and wrote a tool specifically to extract data about this problem.

http://i.imgur.com/9mBGar9.png

First I'll describe the test. I have a tool that issues introspection commands for all 5 nodes individually with a 30-second delay between them, then waits 15 minutes for the nodes to complete introspection. The best time I have seen is in the 3-minute range; I've never seen an introspection of 10 minutes or more come back as successful on this hardware, even when I tested with a longer timeout. If a node fails to introspect, the failure count is incremented and the test is tried again. Note that if a node were to fail twice in a row it would look the same as two different nodes failing on the same round; I'm working on a way to visualize the data to avoid that ambiguity.
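For reference, a minimal sketch of that kind of test loop in Python follows; the node UUIDs, delays, and the exact `openstack baremetal introspection` invocations here are assumptions for illustration, not the actual tool.

#!/usr/bin/env python3
# Rough sketch of the introspection stress test described above.
# Assumptions: node UUIDs are passed on the command line, the
# `openstack baremetal introspection start/status` commands from
# python-ironic-inspector-client are installed, and the standard
# OSC `-f value -c <column>` output options work for the status command.
import subprocess
import sys
import time

START_DELAY = 30          # seconds between issuing introspection commands
ROUND_TIMEOUT = 15 * 60   # give each round 15 minutes to finish
POLL_INTERVAL = 30        # how often to re-check introspection status

def cli(*args):
    out = subprocess.run(("openstack",) + args,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def run_round(nodes):
    # Stagger the start commands by 30 seconds, as in the test description.
    for node in nodes:
        cli("baremetal", "introspection", "start", node)
        time.sleep(START_DELAY)

    pending = set(nodes)
    deadline = time.time() + ROUND_TIMEOUT
    while pending and time.time() < deadline:
        for node in list(pending):
            finished = cli("baremetal", "introspection", "status", node,
                           "-f", "value", "-c", "finished")
            if finished.lower() == "true":
                # A fuller check would also look at the "error" field here.
                pending.discard(node)
        time.sleep(POLL_INTERVAL)

    # Anything still pending after the timeout counts as a failed introspection.
    return {node: 1 for node in pending}

if __name__ == "__main__":
    print(run_round(sys.argv[1:]))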

That graph covers about 500 tests, which is at least 2,500 introspection events, more including the retries. Each data point is an average over 3 hours, which is 6 rounds or 30 introspections.
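For what it's worth, the bucketing behind each data point could be reproduced with something as simple as the sketch below; the CSV layout (timestamp, duration, success flag) is an assumption about the raw data, not its actual format.

import csv
from collections import defaultdict

WINDOW = 3 * 3600  # one data point per 3-hour window (6 rounds = 30 introspections)

durations = defaultdict(list)   # window start -> successful introspection durations
failures = defaultdict(int)     # window start -> failed introspections

# Assumed columns: unix timestamp, duration in seconds, success flag (1/0).
with open("introspection_results.csv") as fh:
    for ts, duration, ok in csv.reader(fh):
        bucket = int(float(ts)) // WINDOW * WINDOW
        if ok == "1":
            durations[bucket].append(float(duration))
        else:
            failures[bucket] += 1

for bucket in sorted(set(durations) | set(failures)):
    avg = sum(durations[bucket]) / len(durations[bucket]) if durations[bucket] else 0
    print(bucket, round(avg, 1), failures[bucket])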

You'll notice the results for Newton and Mitaka are the same. Finally, look at the same graphs over a period where I shut testing down and then brought it back up a few days later.

http://i.imgur.com/gYpGODs.png

Clearly the issue is still active and is driven by lab conditions that don't involve actual connectivity over the introspection network or the local switch config.

This is a placeholder bug for when I figure out what is going on here. While it might not be directly an Ironic issue, it's the sort of problem operators will spend weeks figuring out, so it's more than worth it to document a solution once one is found.

I'm happy to share raw data if anyone is interested, or to take suggestions for test design.

description: updated
Revision history for this message
John Trowbridge (trown) wrote :

I added tripleo to this bug, because this is being used as justification for a workaround in tripleo-quickstart: https://review.openstack.org/403677

I think rather than putting that logic in tripleo-quickstart, we should allow for retries via the mistral workflow for introspection.

Changed in tripleo:
importance: Undecided → Medium
status: New → Triaged
milestone: none → pike-1
description: updated
description: updated
Revision history for this message
Jay Faulkner (jason-oldos) wrote :

IME, these kinds of inspection failures are usually caused by an overloaded TFTP server (particularly when trying to boot multiple things in parallel).

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

I had a similar conversation with some operators in #openstack-ironic a while ago. Thanks for reminding me I should post it here.

Nov 01 13:14:17 <JayF> dtantsur: particularly at one point when we had our http and tftp servers co-located on the same box. The http sessions would choke out tftp at high scale
Nov 01 13:15:06 <JayF> dtantsur: I got $20 that says if you move http and tftp to different boxes (tftp can colocate with dhcp, but not http), you scale up 10x higher without issue
Nov 01 13:18:21 <JayF> Seriously though, colocating TFTP and HTTP is a scaling nightmare
Nov 01 13:18:32 <JayF> because TCP connections from the HTTP server choke out TFTP and break PXE booting
Nov 01 13:18:53 <JayF> and overloaded server running TFTP leads to full timeouts and failures
Nov 01 13:20:26 <JayF> we saw cleaning max out at about 10-20 machines at a time until we split tftp and http servers

The issue is that with TripleO we can't really separate those processes out to different machines; a multi-machine undercloud is a hard sell. So we need to find some other way of addressing the problem, or even just of testing the theory, since I have no idea how to move undercloud processes to different machines to test the idea.
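One thing that might at least gather evidence for or against the theory on a single-box undercloud is counting TFTP transfers and failures in the dnsmasq log during a round of introspection. A rough sketch follows, assuming dnsmasq's usual "dnsmasq-tftp ... sent / failed sending" syslog lines; the log path and exact message format will differ per deployment.

import sys
from collections import Counter

counts = Counter()
# Assumed log location and message format; adjust for the actual deployment
# (e.g. journal output for the inspector dnsmasq service).
with open(sys.argv[1]) as log:   # e.g. /var/log/messages
    for line in log:
        if "dnsmasq-tftp" not in line:
            continue
        if "failed sending" in line:
            counts["tftp_failed"] += 1
        elif "sent " in line:
            counts["tftp_sent"] += 1

print(dict(counts))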

Revision history for this message
Miles Gould (mgould) wrote :

We can't really diagnose this on the Ironic side without more information about what exactly is failing in introspection. On the TripleO side, are you able to verify that the problem is TFTP/HTTP contention, as suggested in #3?

Changed in tripleo:
status: Triaged → Incomplete
Changed in ironic:
status: New → Incomplete
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

I still need to test that theory. How would I go about running TFTP/HTTP from different machines while still running TripleO?

Since I couldn't figure the above out I've been developing more detailed tests/metrics to run starting some time next week.

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

https://bugs.launchpad.net/tripleo/+bug/1672854

The above bug is related. Clearly TFTP limits are the problem above 20-30 nodes; the nodes I am seeing this bug on are much less beefy Supermicro blades, so it's possible I'm seeing the same problem just sooner. I'll be interested to test the fix with larger numbers of nodes and see if it reduces failures.

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

http://git.openstack.org/cgit/openstack/puppet-ironic/tree/templates/inspector_dnsmasq_tftp.erb#n6

It seems that DHCP leases are only 29 seconds for inspecting nodes. derekh identified this and speculates that it's causing spurious failures when the IP pool is exhausted and IPs are reassigned.
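For reference, the lease time is the third field of dnsmasq's dhcp-range option, so a longer lease in the generated config would look roughly like the line below (the addresses are placeholders, not the template's actual values):

# dhcp-range=<start>,<end>,<lease time>; e.g. a 10-minute lease instead of tens of seconds
dhcp-range=192.168.24.100,192.168.24.120,10m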

Changed in tripleo:
milestone: pike-1 → pike-2
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

s/fixed/mitigated

Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → queens-1
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Ironic because there has been no activity for 60 days.]

Changed in ironic:
status: Incomplete → Expired
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Incomplete → Invalid