TripleO Undercloud Ironic can not pxe effectively beyond 20 nodes

Bug #1672854 reported by Justin Kilpatrick
This bug affects 2 people
Affects        Status        Importance  Assigned to    Milestone
puppet-ironic  Fix Released  Undecided   Derek Higgins
tripleo        Fix Released  Medium      Derek Higgins

Bug Description

Ironic can be deployed in a scale-out HA configuration, but TripleO deploys only a single instance on its undercloud to be used for Overcloud deployment. That's why this is being filed against TripleO.

This is a serious bottleneck for large deployments: the single xinetd tftp server on the undercloud is incapable of serving ramdisks to more than about 20 nodes at a time.

The following data was gathered on a 47 node cloud using a tool that issues introspection operations in configurable batches. Failure count is the number of nodes that failed to load the introspection ramdisk in the process of introspecting all nodes. The median failure count across 21 attempts (or 987 individual introspections) with batches of 16 is zero. The median failure count across 22 attempts with batches of 32 is 19 failures before all 47 nodes successfully introspected.

http://elk.browbeatproject.org:8080/goto/7ed385f27192ac178c48b09c461a2c9f

This shows up in overcloud deployments too. If max concurrent builds is set to some value exceeding 20 or so you run into the same issue.

This acts as a scaling bottleneck for TripleO. If you have a 500 node cloud (a serious near term goal) you will need to issue 32 batches of 16 introspection operations taking about 480 seconds each; that's more than 4 hours just to introspect. On top of that, you need to write a script that issues these operations in batches, since the documented workflow "openstack baremetal introspection bulk start" or "openstack overcloud node introspect --all-manageable --provide" is pretty much guaranteed to fail on any cloud with more than 30 or so nodes: 25 attempts, zero successes.

Then for deploying an overcloud, let's say you did 32 batches of 16 using stack updates to make sure the deployment doesn't fail. That's 64 continuous hours of stack updates to get an overcloud totally deployed.
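
The estimates above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 7200 seconds per deployment batch is an assumption inferred from "64 hours over 32 batches", not a measured figure.

```python
import math


def batches_needed(nodes: int, batch_size: int) -> int:
    """Number of batches required to cover every node."""
    return math.ceil(nodes / batch_size)


def total_hours(nodes: int, batch_size: int, seconds_per_batch: float) -> float:
    """Wall-clock hours if batches run strictly one after another."""
    return batches_needed(nodes, batch_size) * seconds_per_batch / 3600


if __name__ == "__main__":
    # Introspection: 500 nodes, batches of 16, about 480 seconds per batch.
    print(batches_needed(500, 16))
    print(round(total_hours(500, 16, 480), 2))
    # Deployment: assumed 2 hours (7200 s) per batch of 16.
    print(total_hours(500, 16, 7200))
```

With these inputs the introspection estimate comes out a little over four and a quarter hours, and the deployment estimate at 64 hours.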

This bug can be addressed either by optimizing performance of the current driver, providing a method to scale out Ironic using a TripleO undercloud, or automatic batching for deployment operations.
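
As a rough illustration of the "automatic batching" option, here is a minimal sketch of a batching loop with retries. Everything in it is hypothetical: `introspect_batch` stands in for whatever actually triggers introspection (it is not an Ironic or TripleO API) and is assumed to return the nodes that failed in that batch.

```python
from typing import Callable, Iterable, List, Sequence, Set


def chunked(items: Sequence[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def introspect_in_batches(
    nodes: Sequence[str],
    batch_size: int,
    introspect_batch: Callable[[List[str]], Iterable[str]],
    max_rounds: int = 3,
) -> Set[str]:
    """Introspect `nodes` in batches, retrying failures in later rounds.

    `introspect_batch` is a hypothetical callable that introspects one
    batch and returns the names of the nodes that failed.
    Returns the set of nodes still failing after `max_rounds` rounds.
    """
    pending = list(nodes)
    for _ in range(max_rounds):
        if not pending:
            break
        failed: List[str] = []
        for batch in chunked(pending, batch_size):
            failed_here = set(introspect_batch(batch))
            # Keep only failures that belong to this batch, in order.
            failed.extend(node for node in batch if node in failed_here)
        pending = failed
    return set(pending)
```

Retrying only the failed nodes means a single bad node does not force the whole run to be repeated, which is the failure tolerance the bulk workflow lacks.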

Bogdan Dobrelya (bogdando) wrote :

For a start, we should switch from tftp to http(s) (iPXE). Does Ironic support that? Then we could just scale out webservers.

Justin Kilpatrick (jkilpatr) wrote :

From what I understand the issue is that the out of band interfaces (DRAC in this case) are rather specific about how you serve them images. Supposedly you can host images over http using a feature called virtual media, but not all out of band interfaces support this, and an Ironic driver that uses virtual media doesn't exist upstream even for DRAC (which does support it).

Miles Gould (mgould) wrote :

Bogdan: that's already the default behaviour. But first we need to transfer the iPXE binary via TFTP; as I understand it, that's the stage that's failing (jkilpatr, can you confirm?).

Miles Gould (mgould) wrote :

Justin: we don't need virtual media to transfer the introspection ramdisk over HTTP. We can transfer the (50-500kB) iPXE firmware over TFTP, then tell iPXE to fetch the IPA kernel and ramdisk over HTTP, and finally chainload into IPA. This should already be the default behaviour in TripleO.

Justin Kilpatrick (jkilpatr) wrote :

Ah thanks Miles, I was confused. Yes, we never get past transferring the tiny iPXE firmware. If you watch the console you can see it trying to pxe and just falling right through, saying it timed out on the image. No ramdisk boot output or anything; you never even leave the BIOS on the node.

Miles Gould (mgould) wrote :

I think this will be hard to fix within ironic(-inspector). A bit of context: here's how ironic-inspector introspects a node, given the TripleO defaults.

1) Ironic-inspector updates the undercloud's firewall settings so the node can see inspector's DHCP server but not neutron-dhcp-agent's.
2) Ironic-inspector tells ironic to set the node's boot mode to PXE (this is hardcoded in Inspector).
3) Ironic-inspector tells ironic to reboot the node.
4) The node reboots and contacts inspector-dnsmasq to ask for a PXE-boot image.
5) If the node is already running iPXE (some network cards have it built-in), go to 9. Steps 5, 6 and 10 are specified in the file /etc/ironic-inspector/dnsmasq.conf, which is generated from a template by puppet-ironic: https://github.com/openstack/puppet-ironic/blob/master/templates/inspector_dnsmasq_http.erb.
6) Inspector-dnsmasq tells the node to fetch iPXE over TFTP.
7) The node makes a TFTP request to the undercloud; this is handled by xinetd, which hands it off to tftpd (called in.tftpd on Red Hat-like systems). This is configured via /etc/xinetd.d/tftp, which is installed from an operating system package along with tftpd.
8) The node PXE-boots into iPXE.
9) The node makes a DHCP request to the undercloud, identifying itself as iPXE.
10) Inspector-dnsmasq tells the node to fetch the file inspector.ipxe over HTTP.
11) iPXE fetches inspector.ipxe from the undercloud's Apache instance, which serves it up from the filesystem.
12) inspector.ipxe contains HTTP URLs for the Ironic Python Agent kernel and ramdisk. These URLs are generated at `openstack undercloud install` time by puppet-ironic: https://github.com/openstack/puppet-ironic/blob/master/templates/inspector_ipxe.erb.
13) iPXE fetches IPA over HTTP, and then boots into it.
14) IPA introspects the node and reports back to Ironic.
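
The two DHCP exchanges in the steps above (6 and 10) boil down to one decision: a client that is not yet running iPXE is pointed at the iPXE binary over TFTP, and a client that is already iPXE is pointed at inspector.ipxe over HTTP. A minimal sketch of that decision follows. It is not how dnsmasq is implemented; `BootReply`, `next_boot_step`, the base URL, and the default binary name are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootReply:
    protocol: str  # "tftp" or "http"
    filename: str  # file or URL the node is told to fetch


def next_boot_step(
    client_is_ipxe: bool,
    http_base_url: str,
    ipxe_binary: str = "ipxe.efi",  # assumed name; depends on firmware type
) -> BootReply:
    """Pick what the DHCP server should hand to a booting node."""
    if client_is_ipxe:
        # Step 10: iPXE is running, so the rest can happen over HTTP.
        return BootReply("http", f"{http_base_url.rstrip('/')}/inspector.ipxe")
    # Step 6: plain PXE firmware first has to pull iPXE itself over TFTP.
    return BootReply("tftp", ipxe_binary)


if __name__ == "__main__":
    base = "http://undercloud.example"  # hypothetical address
    print(next_boot_step(False, base))
    print(next_boot_step(True, base))
```

Only the first branch touches TFTP, which is why everything from step 9 onward can in principle be scaled like ordinary HTTP traffic.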

Some things to note:

a) Currently, inspector has no ability to provide the IPA image using virtual media, even if the node is being managed by an Ironic driver that does support virtual media. A patch to fix this (https://review.openstack.org/#/c/239729/) exists, but it hasn't been touched since late 2015.
b) The bottleneck is in steps 5-7, "fetch iPXE over TFTP". We don't even need to wait for IPA to boot before we can start PXE-booting the next batch of nodes, we only need to wait for iPXE to boot, because everything after that stage (step 9 above) happens over HTTP. If we can manage that, pipelining should perform much better.
c) Unfortunately, steps 5-13 are invisible to both ironic and ironic-inspector. inspector-dnsmasq knows that we've reached step 9, and Apache knows we've reached step 11, but they don't report this fact back to ironic.
d) There are AFAICT two xinetd settings that control number of simultaneous connections, "instances" and "wait". We're leaving "instances" as the default (infinity), and setting "wait" to "yes", i.e. only service one connection at a time; this is typical for UDP services like TFTP (see https://linux.die.net/man/5/xinetd.conf), and in jkilpatr's tests setting wait=no made TFTP totally break.
e) dnsmasq does have the ability to provide round-robin load-balancing of TFTP servers: search for "Instead of an ...

description: updated
Miles Gould (mgould) wrote :

Further to point d) in comment 6: I've tried requesting 100 copies of ipxe.efi over tftp on an undercloud node (also running the tftp client on the undercloud, with 0.2ms of latency added to the loopback interface using tc), and running the requests in parallel *did* produce a substantial speedup:

$ sudo tc qdisc add dev lo root netem delay 0.2ms
$ time ./runN -j 1 tftp localhost -c get ipxe.efi -- $(seq 1 100)

real 0m57.654s
user 0m1.343s
sys 0m3.350s
$ time ./runN -j 100 tftp localhost -c get ipxe.efi -- $(seq 1 100)

real 0m0.837s
user 0m1.480s
sys 0m4.469s

So I think lack of concurrency in tftpd is not the problem after all.
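
For anyone wanting to repeat this kind of test without the runN helper, here is a rough Python equivalent. It is a sketch under assumptions: runN's exact semantics aren't shown in this report, so `run_n` simply runs the same command a fixed number of times with a cap on how many run at once.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def run_n(command: Sequence[str], copies: int, jobs: int) -> List[int]:
    """Run `command` `copies` times, at most `jobs` at once; return exit codes."""
    def run_once(_: int) -> int:
        completed = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return completed.returncode

    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_once, range(copies)))


def timed_run(command: Sequence[str], copies: int, jobs: int) -> Tuple[float, List[int]]:
    """Return (elapsed seconds, exit codes) for one run_n invocation."""
    started = time.monotonic()
    codes = run_n(command, copies, jobs)
    return time.monotonic() - started, codes
```

Calling `timed_run(["tftp", "localhost", "-c", "get", "ipxe.efi"], copies=100, jobs=1)` and then again with `jobs=100` would mirror the two measurements above, where the parallel run was about 69 times faster (57.654 s against 0.837 s). Exit codes are returned as-is; whether the tftp client reports a failed transfer through its exit code is not something this sketch assumes.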

description: updated
Derek Higgins (derekh) wrote :

The problem here is that the default dhcp range available to inspector is
192.168.24.100 - 192.168.24.120
After 20 nodes inspector no longer has IPs available to hand out so PXE booting fails.

After looking at the setup where this problem occurred, the default IP range had been changed to
192.0.2.101 - 192.0.2.250
which should have been enough to avoid the problem.

However, a bug in the undercloud installation process (in the puppet-ironic module) meant that openstack-ironic-inspector-dnsmasq wasn't restarted, even though the range in undercloud.conf was increased and the undercloud reinstalled.
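
A quick way to see how many leases each range can hold is to count the addresses between its bounds. The sketch below assumes both bounds are inclusive; `dhcp_range_size` is an illustrative helper, not part of any OpenStack tool.

```python
import ipaddress


def dhcp_range_size(first: str, last: str) -> int:
    """Count the addresses in an inclusive IPv4 range."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("range end precedes range start")
    return int(end) - int(start) + 1


if __name__ == "__main__":
    # Default inspector range quoted above.
    print(dhcp_range_size("192.168.24.100", "192.168.24.120"))
    # Enlarged range from the affected setup.
    print(dhcp_range_size("192.0.2.101", "192.0.2.250"))
```

The default range works out to 21 addresses, which lines up with failures starting at around 20 nodes, while the enlarged range holds 150 and would have comfortably covered the 47 node cloud had dnsmasq picked it up.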

Derek Higgins (derekh) wrote :
Changed in tripleo:
assignee: nobody → Derek Higgins (derekh)
importance: Undecided → Medium
status: New → Triaged
Derek Higgins (derekh)
Changed in puppet-ironic:
assignee: nobody → Derek Higgins (derekh)
status: New → In Progress
Justin Kilpatrick (jkilpatr) wrote : Re: [Bug 1672854] Re: TripleO Undercloud Ironic can not pxe effectively beyond 20 nodes

http://elk.browbeatproject.org/goto/356e6a39f9253dc071fb402349bac7b1

Dug around in the data some and found this result, which shows an
average failure rate of 1 for batches of 32 after dnsmasq was bounced
manually.

This bug can be resolved then. Although we still need some sort of
batching and failure tolerance since even that 1 failure would cause
the default bulk introspection workflow to fail.

On Thu, Mar 23, 2017 at 8:04 AM, Derek Higgins
<email address hidden> wrote:
> Patch to puppet-ironic
> https://review.openstack.org/#/c/449101/
>
> ** Changed in: tripleo
> Assignee: (unassigned) => Derek Higgins (derekh)
>
> ** Changed in: tripleo
> Importance: Undecided => Medium
>
> ** Changed in: tripleo
> Status: New => Triaged

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-ironic (master)

Reviewed: https://review.openstack.org/449101
Committed: https://git.openstack.org/cgit/openstack/puppet-ironic/commit/?id=ff48eb8f73100e409789546c21a2bc415dbb56c5
Submitter: Jenkins
Branch: master

commit ff48eb8f73100e409789546c21a2bc415dbb56c5
Author: Derek Higgins <email address hidden>
Date: Thu Mar 23 11:16:22 2017 +0000

    Restart dnsmasq when config changes

    Change-Id: I77374ca0b9262bd5fb627909e6fcfe73d3cf615b
    Closes-bug: #1672854

Changed in puppet-ironic:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-ironic (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/449575

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-ironic (stable/ocata)

Reviewed: https://review.openstack.org/449575
Committed: https://git.openstack.org/cgit/openstack/puppet-ironic/commit/?id=d8242246409bbff9a644b6173aa95ccfb9606c42
Submitter: Jenkins
Branch: stable/ocata

commit d8242246409bbff9a644b6173aa95ccfb9606c42
Author: Derek Higgins <email address hidden>
Date: Thu Mar 23 11:16:22 2017 +0000

    Restart dnsmasq when config changes

    Change-Id: I77374ca0b9262bd5fb627909e6fcfe73d3cf615b
    Closes-bug: #1672854
    (cherry picked from commit ff48eb8f73100e409789546c21a2bc415dbb56c5)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-ironic (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/449688

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-ironic (stable/newton)

Reviewed: https://review.openstack.org/449688
Committed: https://git.openstack.org/cgit/openstack/puppet-ironic/commit/?id=d4f7805ea2d30c805bd2d8a3c0f829381ed76a69
Submitter: Jenkins
Branch: stable/newton

commit d4f7805ea2d30c805bd2d8a3c0f829381ed76a69
Author: Derek Higgins <email address hidden>
Date: Thu Mar 23 11:16:22 2017 +0000

    Restart dnsmasq when config changes

    Change-Id: I77374ca0b9262bd5fb627909e6fcfe73d3cf615b
    Closes-bug: #1672854
    (cherry picked from commit ff48eb8f73100e409789546c21a2bc415dbb56c5)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-ironic 11.0.0

This issue was fixed in the openstack/puppet-ironic 11.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-ironic 10.4.1

This issue was fixed in the openstack/puppet-ironic 10.4.1 release.

Emilien Macchi (emilienm) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in tripleo:
assignee: Derek Higgins (derekh) → nobody
Derek Higgins (derekh) wrote :

This was fixed in puppet-ironic

Changed in tripleo:
status: Triaged → Fix Released
assignee: nobody → Derek Higgins (derekh)
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-ironic 9.6.0

This issue was fixed in the openstack/puppet-ironic 9.6.0 release.
