I have a 24-node Super Micro 'microcloud' that is being used for bare metal CI. While using it I noticed that introspection and, to a lesser degree, overcloud deployments are very unreliable.
I checked the Juniper switch configuration and the lab conditions and found no visible issues, so I finally sat down and wrote a tool specifically to extract data about this issue.
http://i.imgur.com/9mBGar9.png
First I'll describe the test. I have a tool that issues introspection commands for all 5 nodes individually, with a 30 second delay, and then waits 15 minutes for the nodes to complete introspection. The ideal time I have seen is in the 3 minute range, and when I tested with a longer timeout I never saw a 10+ minute introspection come back as successful on this hardware. If a node fails to introspect, the failure count is incremented and the test is tried again. If a node were to fail twice in a row, it would look the same as if two different nodes had failed in the same round; I'm working on a way to visualize the data to avoid that.
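For anyone who wants to reproduce this, one round of the test described above can be sketched roughly as follows. This is not the actual tool: `run_round`, `start_introspection` and `is_finished` are hypothetical names, the two callbacks stand in for whatever client calls you use to talk to ironic-inspector, and the 10 second poll interval is an assumption.

```python
import time


def run_round(nodes, start_introspection, is_finished, stagger=30.0,
              timeout=15 * 60, poll=10.0, sleep=time.sleep,
              clock=time.monotonic):
    """Run one test round and return the set of nodes that failed.

    start_introspection(node) and is_finished(node) are hypothetical
    callbacks supplied by the caller; they are not a real client API.
    """
    # Issue the introspection command for each node individually,
    # waiting `stagger` seconds between consecutive nodes.
    for index, node in enumerate(nodes):
        if index:
            sleep(stagger)
        start_introspection(node)

    # Wait up to `timeout` seconds for every node to finish.
    deadline = clock() + timeout
    pending = set(nodes)
    while pending and clock() < deadline:
        pending = {node for node in pending if not is_finished(node)}
        if pending:
            sleep(poll)

    # Anything still pending at the deadline counts as a failure.
    return pending
```

The failure count for a round is then `len(run_round(...))`. Passing `sleep` and `clock` in as parameters is only there so the loop can be exercised without real waiting.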
So that graph covers about 500 tests, which is a minimum of 2,500 introspection events, and more once the retries are included. Each data point is an average over 3 hours, which is 6 rounds or 30 introspections.
You'll notice the results for Newton and Mitaka are the same. Finally, look at the same graphs over a time period where I stopped testing and then resumed it a few days later.
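To make the averaging concrete, here is a minimal sketch of how per-round results could be grouped into 3 hour data points. It assumes each round is recorded as a (Unix timestamp, failure count) pair and that the plotted value is the mean failure count per round; both are assumptions for illustration, and `bucket_average` is a hypothetical helper, not part of the actual tool.

```python
from collections import defaultdict

BUCKET_SECONDS = 3 * 60 * 60  # one data point per 3 hours


def bucket_average(rounds):
    """Average failure counts over 3 hour windows.

    rounds: iterable of (unix_timestamp, failure_count) pairs, one per
    test round (an assumed record format).
    Returns {bucket_start_timestamp: mean failures per round}.
    """
    buckets = defaultdict(list)
    for timestamp, failures in rounds:
        # Round the timestamp down to the start of its 3 hour window.
        start = int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
        buckets[start].append(failures)
    return {start: sum(values) / len(values)
            for start, values in sorted(buckets.items())}
```

With rounds about 30 minutes apart, each window holds 6 rounds, and at 5 nodes per round that is the 30 introspections per data point mentioned above.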
http://i.imgur.com/gYpGODs.png
Clearly the issue is active and based on lab conditions which don't involve actual connectivity over the introspection network or the local switch config.
This is a placeholder bug for when I figure out whatever is going on here. While it might not be directly an ironic issue, it's the sort of problem operators will spend weeks figuring out, so it's more than worth it to document a solution once one is found.
I'm happy to share raw data if anyone is interested, or to take suggestions for test design.
I added tripleo to this bug because this is being used as justification for a workaround in tripleo-quickstart: https://review.openstack.org/403677
I think rather than putting that logic in tripleo-quickstart, we should allow for retries via the mistral workflow for introspection.