Only 100 of 200 nodes booted successfully with Ubuntu-based bootstrap

Bug #1481721 reported by Alexei Sheplyakov
Affects             Status   Importance  Assigned to  Milestone
Fuel for OpenStack  Invalid  High        MOS Scale

Bug Description

The other 100 nodes fail to PXE boot.

Presumably the link gets saturated by HTTP traffic, so DHCP requests time out due to a high collision rate.

Tags: scale
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/209486

Changed in fuel:
status: New → In Progress
Alexander Gordeev (a-gordeev) wrote :

Could it be related to the size of the bootstrap image?

What are the actual image sizes?

Alexei Sheplyakov (asheplyakov) wrote :

> Could it be related to the size of the bootstrap image?

The Ubuntu-based bootstrap image is slightly smaller than the CentOS one:

[root@fuel /]# ls -lRh /var/www/nailgun/bootstrap/
/var/www/nailgun/bootstrap/:
total 233M
-rwxr-xr-x. 1 root root 228M Aug 4 08:22 initramfs.img
-rwxr-xr-x. 1 root root 4.7M Aug 4 08:22 linux
drwxr-xr-x. 2 root root 4.0K Aug 4 11:26 ubuntu

/var/www/nailgun/bootstrap/ubuntu:
total 229M
-rwxr-xr-x 1 root root 16M Aug 4 11:24 initramfs.img
-rwxr-xr-x 1 root root 5.6M Jul 29 12:35 linux
-rwxr-xr-x 1 root root 209M Aug 4 11:26 root.squashfs

The problem has nothing to do with the image size.
The tftp server we use, tftpd-hpa, is extremely dumb and spawns a process to handle each client,
so it's unable to saturate a 10Gb link. On the other hand, nginx uses proper I/O multiplexing (and TCP)
and is much more efficient, so it can easily saturate the link. By the way, this is why astute limits
the number of nodes being provisioned concurrently [1].

[1] https://github.com/stackforge/fuel-astute/blob/master/lib/astute/config.rb#L81
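
The idea of capping concurrent provisioning can be sketched as simple batching. This is only an illustration in Python, not astute's actual code (astute is written in Ruby); the function name `batches`, the node names, and the batch size of 50 are hypothetical and are not taken from the linked config.

```python
from typing import Iterator, List, Sequence


def batches(nodes: Sequence[str], max_concurrent: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most max_concurrent nodes."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be positive")
    for start in range(0, len(nodes), max_concurrent):
        yield list(nodes[start:start + max_concurrent])


# Hypothetical node names; 200 matches the node count in this report.
nodes = [f"node-{i}" for i in range(1, 201)]

for batch in batches(nodes, max_concurrent=50):
    # A real orchestrator would start provisioning this batch here and
    # wait for it to finish before moving on to the next one.
    print(f"provisioning {len(batch)} nodes: {batch[0]} .. {batch[-1]}")
```

With a cap like this, at most one batch of nodes downloads images at any moment, which bounds the load on the provisioning link.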

Alexander Gordeev (a-gordeev) wrote :

> The tftp server we use, tftpd-hpa, is extremely dumb and spawns a process to handle each client,
> so it's unable to saturate a 10Gb link.

The root cause is that we use TFTP servers to distribute files that are large by TFTP standards.

Ideally, we should send only the initial PXE bootloader and configs through TFTP. Their total size is less than 100 KB per node.
The images would then be downloaded through HTTP. For example, iPXE is able to work with HTTP: http://ipxe.org/
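
A rough estimate shows why large files are a poor fit for TFTP. Plain TFTP (RFC 1350) is lock-step: the server sends one data block and waits for its ACK, so a single client gets at most one block per round trip. The sketch below assumes the default 512-byte block size, a hypothetical round-trip time of 0.5 ms, and the 228M initramfs size from the listing above; the actual block size negotiated in this deployment is not known from this report.

```python
def tftp_min_transfer_seconds(size_bytes: int,
                              block_size: int = 512,
                              rtt_seconds: float = 0.0005) -> float:
    """Lower bound on lock-step TFTP transfer time for one client.

    RFC 1350 ends a transfer with a block shorter than block_size, so a
    file whose size is an exact multiple still needs one extra
    (zero-length) block.
    """
    blocks = size_bytes // block_size + 1
    return blocks * rtt_seconds


def line_rate_seconds(size_bytes: int, bits_per_second: float) -> float:
    """Time to push the same bytes at full link speed, ignoring overhead."""
    return size_bytes * 8 / bits_per_second


image_size = 228 * 1024 * 1024  # 228M initramfs.img from the listing

print(f"TFTP, 512-byte blocks, 0.5 ms RTT: "
      f"{tftp_min_transfer_seconds(image_size):.1f} s")
print(f"10 Gb/s line rate: {line_rate_seconds(image_size, 10e9):.2f} s")
```

Under these assumptions the per-client TFTP time is dominated by round trips rather than bandwidth, which is consistent with the suggestion to keep only the small bootloader and configs on TFTP.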

BTW, is it possible to monitor how many tftpd-hpa processes were spawned and figure out where the bottleneck is? It might be a lack of CPU time due to the inefficiency of the tftpd server implementation.
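
One possible way to do such monitoring is to count processes by name via /proc on the master node. The sketch below is an assumption-laden example, not an existing Fuel tool: it assumes a Linux /proc layout and that the tftpd-hpa daemon runs under the process name `in.tftpd`, which should be checked on the actual system.

```python
import os


def count_processes(name: str, proc_root: str = "/proc") -> int:
    """Count processes whose 'Name:' field in <proc_root>/<pid>/status equals name."""
    try:
        entries = os.listdir(proc_root)
    except FileNotFoundError:
        return 0  # no procfs here (for example, not a Linux host)

    count = 0
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "status")) as status:
                first_line = status.readline()
        except OSError:
            continue  # the process exited while we were scanning
        fields = first_line.split(None, 1)
        if len(fields) == 2 and fields[0] == "Name:" and fields[1].strip() == name:
            count += 1
    return count


if __name__ == "__main__":
    # "in.tftpd" is an assumed process name for tftpd-hpa.
    print(count_processes("in.tftpd"))
```

Sampling this count periodically during a mass boot, alongside CPU usage, would show whether the number of spawned server processes tracks the number of booting nodes.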

Alexei Sheplyakov (asheplyakov) wrote :

Cannot reproduce the bug any more; marking as Incomplete.

Changed in fuel:
status: In Progress → Incomplete
assignee: Alexei Sheplyakov (asheplyakov) → MOS Scale (mos-scale)
Dina Belova (dbelova) wrote :

Marking as Invalid until it is reproduced again.

Changed in fuel:
status: Incomplete → Invalid
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Alexei Sheplyakov on branch: master
Review: https://review.openstack.org/209486
Reason: The patch does not really solve the referenced bug.
Also, the Ubuntu-based bootstrap won't be shipped with MOS 7.0.
