TripleO Undercloud Ironic can not pxe effectively beyond 20 nodes
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| puppet-ironic | Fix Released | Undecided | Derek Higgins | |
| tripleo | Fix Released | Medium | Derek Higgins | |
Bug Description
Ironic can be deployed in a scale-out HA configuration, but TripleO deploys only a single instance on its undercloud to be used for Overcloud deployment, which is why this is being filed against TripleO.
This is a serious bottleneck for large deployments: the single xinetd tftp server located on the undercloud is incapable of serving ramdisks to more than about 20 nodes at a time.
The following data was gathered on a 47 node cloud using a tool that issues introspection operations in configurable batches. The failure count is the number of nodes that failed to load the introspection ramdisk in the process of introspecting all nodes. The median failure count across 21 attempts (or 987 individual introspections) with batches of 16 is zero. The median failure count across 22 attempts with batches of 32 is 19 failures before all 47 nodes successfully introspected.
http://
This shows up in overcloud deployments too. If max concurrent builds is set to a value exceeding 20 or so, you run into the same issue.
This acts as a scaling bottleneck for TripleO. If you have a 500 node cloud (a serious near term goal), you will need to issue 32 batches of 16 introspection operations taking about 480 seconds each, which is roughly 4 and a quarter hours just to introspect. You also need to write a script that issues these operations in batches, since the documented workflow "openstack baremetal introspection bulk start" or "openstack overcloud node introspect --all-manageable --provide" is pretty much guaranteed to fail on any cloud with more than 30 or so nodes: 25 attempts, zero successes.
Then, for deploying an overcloud, let's say you did 32 batches of 16 using stack updates to make sure the deployment doesn't fail: that's 64 continuous hours of stack updates to get an overcloud totally deployed.
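The batch arithmetic above can be checked with a short sketch. The function name `batched_duration` is made up for illustration. The 7200 seconds per stack update is not stated in the report; it is only what 64 hours over 32 batches implies.

```python
import math


def batched_duration(nodes, batch_size, seconds_per_batch):
    """Return (number of batches, total seconds) for sequential batches."""
    batches = math.ceil(nodes / batch_size)
    return batches, batches * seconds_per_batch


# Introspection: 500 nodes, batches of 16, about 480 seconds per batch.
batches, total = batched_duration(500, 16, 480)
print(batches, round(total / 3600, 2))  # 32 batches, 4.27 hours

# Deployment: 64 hours over 32 batches implies about 7200 seconds per batch
# (an inferred figure, not one measured in this report).
batches, total = batched_duration(500, 16, 7200)
print(batches, total / 3600)  # 32 batches, 64.0 hours
```

With the figures quoted here, introspection alone comes to a little over four hours, and the cost grows linearly with node count as long as the batch size stays capped near 16.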
This bug can be addressed either by optimizing performance of the current driver, providing a method to scale out Ironic using a TripleO undercloud, or automatic batching for deployment operations.
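As a rough illustration of the "automatic batching" option, the sketch below splits a node list into fixed-size batches and retries only the nodes that failed. It is not TripleO or Ironic code: `run_in_batches`, `chunked`, and the `operation` callback are hypothetical names, and the callback is assumed to start the operation for one batch, wait for it to finish, and return the nodes that failed.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    nodes: Sequence[str],
    batch_size: int,
    operation: Callable[[Sequence[str]], List[str]],
    max_rounds: int = 3,
) -> List[str]:
    """Run `operation` over `nodes` in batches, retrying failed nodes.

    `operation` is a hypothetical callback that processes one batch and
    returns the nodes that failed. The return value is the list of nodes
    still failing after `max_rounds` passes.
    """
    pending = list(nodes)
    for _ in range(max_rounds):
        if not pending:
            break
        failed: List[str] = []
        for batch in chunked(pending, batch_size):
            failed.extend(operation(batch))
        pending = failed
    return pending
```

A caller would pass something like `run_in_batches(node_ids, 16, introspect_batch)`, where `introspect_batch` is whatever wrapper issues the introspection for one batch. The batch size of 16 matches the largest size that showed a median of zero failures in the data above.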
| description: | updated |
| description: | updated |
| Changed in puppet-ironic: | |
| assignee: | nobody → Derek Higgins (derekh) |
| status: | New → In Progress |
For a start, we should switch from tftp to http(s) (iPXE). Does Ironic support that? Then we could just scale out webservers.