Deployment of 50 nodes failed on provisioning
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Invalid | Medium | Alexander Evseev |
6.1.x | Won't Fix | Medium | Fuel Library (Deprecated) |
7.0.x | Invalid | Medium | Fuel Library (Deprecated) |
8.0.x | Invalid | Medium | Alexander Evseev |
Bug Description
Provisioning almost failed (we had to reboot two nodes manually) because two nodes were unable to obtain an IP address via PXE (this is our guess) and booted straight into the previously installed OS. As far as I know, Fuel previously erased the first sectors of the HDDs when the "deploy" command was received, and the node was then forcibly rebooted by a command from the master node. It looks like this code did not work this time: the node did not get an IP address during PXE boot, moved on to the next NIC, failed there as well (we get two attempts because both NICs are configured the same way), and then booted from the HDD (SSD).
There are two issues that this bug reveals:
1. The code responsible for wiping the HDD before provisioning does not work.
2. The Fuel/MOS master node is not able to provide nodes with an IP address in time.
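To make the first issue concrete, here is a minimal Python sketch of what "erasing the first sectors" of a disk could look like. It is not Fuel's actual code: the function name `wipe_first_sectors`, the default of 2048 sectors, and the 512-byte sector size are assumptions chosen for illustration.

```python
import os


def wipe_first_sectors(path: str, sectors: int = 2048, sector_size: int = 512) -> int:
    """Overwrite the first `sectors` sectors of `path` with zeros.

    Returns the number of bytes written. The defaults are illustrative
    assumptions, not values taken from Fuel.
    """
    total = sectors * sector_size
    zeros = bytes(total)  # a buffer of `total` zero bytes

    # Open without O_TRUNC so everything past the wiped region is untouched.
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < total:
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, zeros[written:])
        # Flush to the device so the wipe survives an immediate reboot.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

With the boot record and partition table zeroed, a node whose PXE attempts fail can no longer fall back to the old OS on disk. Running this against a real block device requires root privileges and destroys data, so try it on a scratch image file first.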
Suggested solutions (it would be nice to have all of them implemented in some way):
1. Wipe the disk. In our setup the boot order loops, so the node will eventually boot from the NIC.
2. Optimize TFTP/PXE/DHCP traffic priority on the master node. Probably do traffic shaping/
3. Enable the gPXE loader's ability to fetch files over HTTP/TCP instead of TFTP/UDP.
4. Implement https:/
5. (Depends on 4) Increase the delay between the cobbler commands that restart nodes, to avoid the "all nodes boot at the same time" situation. Suggested delay: 3-5 seconds between nodes. Increasing it much further would inflate the overall time, so let's keep provisioning time reasonable. Best approach: boot the controllers with a 10-second delay between them, and once they have started provisioning, reboot the computes one by one, since computes will not be deployed/installed until the controllers are ready. Without item 4 this would not work at all.
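The staggered restart from item 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: `reboot_schedule` and `run_schedule` are hypothetical names, the `reboot` callback stands in for whatever actually issues the restart command, and the sketch simplifies "once the controllers have started provisioning" to a fixed gap after the last controller.

```python
import time
from typing import Callable, List, Sequence, Tuple


def reboot_schedule(
    controllers: Sequence[str],
    computes: Sequence[str],
    controller_gap: float = 10.0,  # seconds between controllers (from item 5)
    compute_gap: float = 4.0,      # within the suggested 3-5 second range
) -> List[Tuple[str, float]]:
    """Return (node, offset_in_seconds) pairs, controllers first."""
    schedule: List[Tuple[str, float]] = []
    offset = 0.0
    for name in controllers:
        schedule.append((name, offset))
        offset += controller_gap
    # Simplification: computes start one controller_gap after the last
    # controller instead of waiting for a real "provisioning started" signal.
    for name in computes:
        schedule.append((name, offset))
        offset += compute_gap
    return schedule


def run_schedule(
    schedule: Sequence[Tuple[str, float]],
    reboot: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `reboot(node)` for each entry, sleeping between entries."""
    elapsed = 0.0
    for name, offset in schedule:
        sleep(offset - elapsed)
        elapsed = offset
        reboot(name)
```

For example, `reboot_schedule(["c1", "c2"], ["n1", "n2"])` yields offsets of 0, 10, 20, and 24 seconds. With these gaps, 3 controllers and 47 computes would take about three and a half minutes to restart, which keeps the added delay small relative to a full provisioning run.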
P.S. 47 nodes can be provisioned without any problems; with 50, problems appear. This applies to Ubuntu and provisioning, though; CentOS looks better. The bottleneck appears to be somewhere on the master node, because Ubuntu and provisioning use fairly large initrds while CentOS does not.
Changed in fuel: | |
milestone: | none → 6.0 |
importance: | Undecided → High |
assignee: | nobody → Fuel Library Team (fuel-library) |
description: | updated |
description: | updated |
description: | updated |
Changed in fuel: | |
status: | New → Triaged |
Changed in fuel: | |
status: | Triaged → In Progress |
assignee: | Fuel Library Team (fuel-library) → Łukasz Oleś (loles) |
Changed in fuel: | |
assignee: | Łukasz Oleś (loles) → nobody |
Changed in fuel: | |
assignee: | nobody → Fuel Library Team (fuel-library) |
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → Tomasz 'Zen' Napierala (tzn) |
Changed in fuel: | |
assignee: | Tomasz 'Zen' Napierala (tzn) → Fuel Library Team (fuel-library) |
status: | Triaged → Won't Fix |
Changed in fuel: | |
milestone: | 6.1 → 7.0 |
Moving to critical, as it results in unreliable discovery and provisioning at 100-node scale.