Unreliable Ironic operations due to difficult-to-identify environmental conditions

Bug #1651127 reported by Justin Kilpatrick
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
Ironic    Expired   Undecided    Unassigned
tripleo   Invalid   Medium       Unassigned

Bug Description

I have a 24-node Super Micro 'microcloud' that is being used for bare metal CI. While using it I noticed that introspection, and to a lesser degree overcloud deployments, are very unreliable.

I checked the Juniper switch configuration and the lab conditions and found no visible issues, so I finally sat down and wrote a tool specifically to extract data about this problem.

http://i.imgur.com/9mBGar9.png

First I'll describe the test. I have a tool that issues introspection commands for all 5 nodes individually with a 30-second delay between them, then waits 15 minutes for the nodes to complete introspection. The best time I have seen is in the 3-minute range; I've never seen an introspection of 10 minutes or more come back as successful on this hardware, even when I tested with a longer timeout. If a node fails to introspect, the failure count is incremented and the test is tried again. Note that if a node were to fail twice in a row it would look the same as two different nodes failing on the same round; I'm working on a way to visualize the data to avoid that ambiguity.
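For reference, a minimal sketch of that kind of test loop in Python follows; the node UUIDs, delays, and the exact `openstack baremetal introspection` invocations here are assumptions for illustration, not the actual tool.

#!/usr/bin/env python3
# Rough sketch of the introspection stress test described above.
# Assumptions: node UUIDs are passed on the command line, the
# `openstack baremetal introspection start/status` commands from
# python-ironic-inspector-client are installed, and the standard
# OSC `-f value -c <column>` output options work for the status command.
import subprocess
import sys
import time

START_DELAY = 30          # seconds between issuing introspection commands
ROUND_TIMEOUT = 15 * 60   # give each round 15 minutes to finish
POLL_INTERVAL = 30        # how often to re-check introspection status

def cli(*args):
    out = subprocess.run(("openstack",) + args,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def run_round(nodes):
    # Stagger the start commands by 30 seconds, as in the test description.
    for node in nodes:
        cli("baremetal", "introspection", "start", node)
        time.sleep(START_DELAY)

    pending = set(nodes)
    deadline = time.time() + ROUND_TIMEOUT
    while pending and time.time() < deadline:
        for node in list(pending):
            finished = cli("baremetal", "introspection", "status", node,
                           "-f", "value", "-c", "finished")
            if finished.lower() == "true":
                # A fuller check would also look at the "error" field here.
                pending.discard(node)
        time.sleep(POLL_INTERVAL)

    # Anything still pending after the timeout counts as a failed introspection.
    return {node: 1 for node in pending}

if __name__ == "__main__":
    print(run_round(sys.argv[1:]))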

That graph covers about 500 tests, which is at least 2,500 introspection events, more including the retries. Each data point is an average over 3 hours, which is 6 rounds or 30 introspections.
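For what it's worth, the bucketing behind each data point could be reproduced with something as simple as the sketch below; the CSV layout (timestamp, duration, success flag) is an assumption about the raw data, not its actual format.

import csv
from collections import defaultdict

WINDOW = 3 * 3600  # one data point per 3-hour window (6 rounds = 30 introspections)

durations = defaultdict(list)   # window start -> successful introspection durations
failures = defaultdict(int)     # window start -> failed introspections

# Assumed columns: unix timestamp, duration in seconds, success flag (1/0).
with open("introspection_results.csv") as fh:
    for ts, duration, ok in csv.reader(fh):
        bucket = int(float(ts)) // WINDOW * WINDOW
        if ok == "1":
            durations[bucket].append(float(duration))
        else:
            failures[bucket] += 1

for bucket in sorted(set(durations) | set(failures)):
    avg = sum(durations[bucket]) / len(durations[bucket]) if durations[bucket] else 0
    print(bucket, round(avg, 1), failures[bucket])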

You'll notice the results for Newton and Mitaka are the same. Finally, look at the same graphs over a period where I shut testing down and then brought it back up a few days later.

http://i.imgur.com/gYpGODs.png

Clearly the issue is still active and is driven by lab conditions that don't involve actual connectivity over the introspection network or the local switch config.

This is a placeholder bug for when I figure out what is going on here. While it might not be directly an Ironic issue, it's the sort of problem operators will spend weeks figuring out, so it's more than worth it to document a solution once one is found.

I'm happy to share raw data if anyone is interested, or to take suggestions for test design.

description: updated
Revision history for this message
John Trowbridge (trown) wrote :

I added tripleo to this bug, because this is being used as justification for a workaround in tripleo-quickstart: https://review.openstack.org/403677

I think rather than putting that logic in tripleo-quickstart, we should allow for retries via the mistral workflow for introspection.

Changed in tripleo:
importance: Undecided → Medium
status: New → Triaged
milestone: none → pike-1
description: updated
description: updated
Revision history for this message
Jay Faulkner (jason-oldos) wrote :

IME, these kinds of inspection failures are usually caused by an overloaded TFTP server (particularly when trying to boot multiple things in parallel).

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

I had a similar conversation with some operators in #openstack-ironic a while ago. Thanks for reminding me I should post it here.

Nov 01 13:14:17 <JayF> dtantsur: particularly at one point when we had our http and tftp servers co-located on the same box. The http sessions would choke out tftp at high scale
Nov 01 13:15:06 <JayF> dtantsur: I got $20 that says if you move http and tftp to different boxes (tftp can colocate with dhcp, but not http), you scale up 10x higher without issue
Nov 01 13:18:21 <JayF> Seriously though, colocating TFTP and HTTP is a scaling nightmare
Nov 01 13:18:32 <JayF> because TCP connections from the HTTP server choke out TFTP and break PXE booting
Nov 01 13:18:53 <JayF> and overloaded server running TFTP leads to full timeouts and failures
Nov 01 13:20:26 <JayF> we saw cleaning max out at about 10-20 machines at a time until we split tftp and http servers

The issue is that with TripleO we can't really separate those processes out to different machines; a multi-machine undercloud is a hard sell. So we need to find some other way of addressing the problem, or even just of testing the theory, since I have no idea how to move undercloud processes to different machines to test the idea.
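One thing that might at least gather evidence for or against the theory on a single-box undercloud is counting TFTP transfers and failures in the dnsmasq log during a round of introspection. A rough sketch follows, assuming dnsmasq's usual "dnsmasq-tftp ... sent / failed sending" syslog lines; the log path and exact message format will differ per deployment.

import sys
from collections import Counter

counts = Counter()
# Assumed log location and message format; adjust for the actual deployment
# (e.g. journal output for the inspector dnsmasq service).
with open(sys.argv[1]) as log:   # e.g. /var/log/messages
    for line in log:
        if "dnsmasq-tftp" not in line:
            continue
        if "failed sending" in line:
            counts["tftp_failed"] += 1
        elif "sent " in line:
            counts["tftp_sent"] += 1

print(dict(counts))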

Revision history for this message
Miles Gould (mgould) wrote :

We can't really diagnose this on the Ironic side without more information about what exactly is failing in introspection. On the TripleO side, are you able to verify that the problem is TFTP/HTTP contention, as suggested in #3?

Changed in tripleo:
status: Triaged → Incomplete
Changed in ironic:
status: New → Incomplete
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

I still need to test that theory. How would I go about running TFTP/HTTP from different machines while still running TripleO?

Since I couldn't figure the above out I've been developing more detailed tests/metrics to run starting some time next week.

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

https://bugs.launchpad.net/tripleo/+bug/1672854

The above bug is related. Clearly TFTP limits are the problem above 20-30 nodes; the nodes I am seeing this bug on are much less beefy Supermicro blades, so it's possible I'm seeing the same problem just sooner. I'll be interested to test the fix with larger numbers of nodes and see if it reduces failures.

Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

http://git.openstack.org/cgit/openstack/puppet-ironic/tree/templates/inspector_dnsmasq_tftp.erb#n6

It seems that DHCP leases are only 29 seconds for inspecting nodes. derekh identified this and speculates that it's causing spurious failures when the IP pool is exhausted and IPs are reassigned.
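For reference, the lease time is the third field of dnsmasq's dhcp-range option, so a longer lease in the generated config would look roughly like the line below (the addresses are placeholders, not the template's actual values):

# dhcp-range=<start>,<end>,<lease time>; e.g. a 10-minute lease instead of tens of seconds
dhcp-range=192.168.24.100,192.168.24.120,10m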

Changed in tripleo:
milestone: pike-1 → pike-2
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :
Revision history for this message
Justin Kilpatrick (jkilpatr) wrote :

s/fixed/mitigated

Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → queens-1
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Ironic because there has been no activity for 60 days.]

Changed in ironic:
status: Incomplete → Expired
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Incomplete → Invalid