Deployment of 50 nodes failed on provisioning

Bug #1386861 reported by Aleksandr Shaposhnikov
This bug affects 1 person
Affects              Status      Importance   Assigned to                 Milestone
Fuel for OpenStack   Invalid     Medium       Alexander Evseev
6.1.x                Won't Fix   Medium       Fuel Library (Deprecated)
7.0.x                Invalid     Medium       Fuel Library (Deprecated)
8.0.x                Invalid     Medium       Alexander Evseev

Bug Description

Provisioning almost failed (we had to manually reboot two nodes again), apparently because two nodes weren't able to obtain an IP via PXE (this is a guess, based on the fact that they booted straight into the previous OS). As far as I know, Fuel previously erased the first sectors of the HDDs when the "deploy" command was received and the node was forcibly rebooted by a command from the masternode. It looks like this code didn't work this time: the node didn't get an IP address during PXE, moved on to the next card and still couldn't get one (there are two attempts because we have two NICs tuned the same way), and after that it booted from HDD (SSD).
There are two issues that this bug reveals:
1. The code responsible for wiping the HDD before provisioning doesn't work.
2. The Fuel/MOS masternode is not able to provide nodes with an IP address in time.

Suggested solutions (it would be nice to have them all implemented in some way):

1. Wipe the disk. In our setup the boot order is looped, so the node will eventually boot from the NIC.
2. Optimize TFTP/PXE/DHCP traffic priority on the masternode, probably with traffic shaping/prioritization: DHCP is high priority; DNS and TFTP are medium priority; HTTP and SYSLOG are low priority.
3. Enable the gPXE loader's ability to fetch files using HTTP/TCP instead of TFTP/UDP.
4. Implement https://blueprints.launchpad.net/fuel/+spec/continue-deployment to be able to continue deployment even if some compute nodes fail.
5. (Depends on 4) Increase the delay between the cobbler commands that restart nodes, to avoid all the nodes booting at the same time. Suggested delay: 3-5 seconds between nodes. Increasing it a lot will increase the overall time, so let's try to keep provisioning time reasonable. Best approach: boot the controllers with a delay of 10 seconds and, once they have started to provision, reboot the computes one by one, because they won't be deployed/installed until the controllers are ready. Without suggestion 4 this won't work at all.
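
To illustrate suggestion 5, here is a minimal sketch of a staggered reboot schedule. It is not Fuel or Cobbler code: the function name, node names and default delays are hypothetical. It also assumes that computes may start one controller delay after the last controller, which is only a stand-in for "the controllers have started to provision".

```python
def reboot_schedule(controllers, computes,
                    controller_delay=10.0, compute_delay=4.0):
    """Return (node, offset_in_seconds) pairs for a staggered reboot.

    Controllers are spaced controller_delay seconds apart; computes follow,
    spaced compute_delay seconds apart (the suggested 3-5 second range).
    """
    schedule = []
    offset = 0.0
    for node in controllers:
        schedule.append((node, offset))
        offset += controller_delay
    for node in computes:
        schedule.append((node, offset))
        offset += compute_delay
    return schedule


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for node, at in reboot_schedule(["ctrl-1", "ctrl-2"], ["cmp-1", "cmp-2"]):
        print(f"{at:6.1f}s  reboot {node}")
```

A real implementation would wait for a signal that the controllers have actually started provisioning rather than rely on a fixed offset.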

P.S. 47 nodes could go without any problems. With 50, problems happen. But that is for Ubuntu and provisioning; CentOS looks better. It looks like the bottleneck is somewhere on the masternode, because Ubuntu and provisioning have pretty big initrds but CentOS doesn't.

Tags: scale
Łukasz Oleś (loles)
Changed in fuel:
milestone: none → 6.0
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
Roman Alekseenkov (ralekseenkov) wrote :

Moving to critical, as it results in unreliable discovery and provisioning on 100-node scale

Changed in fuel:
importance: High → Critical
description: updated
description: updated
description: updated
Changed in fuel:
status: New → Triaged
Łukasz Oleś (loles)
Changed in fuel:
status: Triaged → In Progress
assignee: Fuel Library Team (fuel-library) → Łukasz Oleś (loles)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/132077

Matthew Mosesohn (raytrac3r) wrote :

Lukasz, if you can finish for 6.0, please indicate. Otherwise, move to 6.1.

Łukasz Oleś (loles) wrote :

I will run some tests during the weekend

Roman Alekseenkov (ralekseenkov) wrote :

hey guys - why is it still open? I thought the issue with deployment on 50 nodes was already resolved.

Łukasz Oleś (loles) wrote :

Actually we only tweaked some options in dnsmasq, nginx, syslog and now it works for 100 nodes.

Some of these points are still worth implementing for a larger number of nodes and for better stability. We may change the bug title and move it to 6.1, or close it and open a new bug in 6.1, or create a blueprint for it.

Tomasz 'Zen' Napierala (tzn) wrote :

This bug is complex and partially resolved. I will keep this as a reminder for 6.1, and we will discuss further improvements.

Changed in fuel:
milestone: 6.0 → 6.1
importance: Critical → Medium
Łukasz Oleś (loles)
Changed in fuel:
assignee: Łukasz Oleś (loles) → nobody
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Stanislaw Bogatkin (sbogatkin) wrote :

Moved to Triaged, since nothing can be In Progress while it is assigned to a whole team.

Changed in fuel:
status: In Progress → Triaged
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (master)

Change abandoned by Lukasz Oles (<email address hidden>) on branch: master
Review: https://review.openstack.org/132077
Reason: Change split into two parts: the first is https://review.openstack.org/#/c/159764/, the second is in progress

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Tomasz 'Zen' Napierala (tzn)
Vladimir Kuklin (vkuklin) wrote :

This should be covered by a blueprint

Changed in fuel:
assignee: Tomasz 'Zen' Napierala (tzn) → Fuel Library Team (fuel-library)
status: Triaged → Won't Fix
Nastya Urlapova (aurlapova) wrote :

Tomasz, could you transform it into a BP?

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/178119

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Alexander Evseev (aevseev-h)
status: Won't Fix → In Progress
Changed in fuel:
milestone: 6.1 → 7.0
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Igor Shishkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/178119
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Vladimir Kuklin (vkuklin) wrote :

I am not sure we have this issue anymore, as we switched to IBP and have successfully deployed >50-node environments hundreds of times. Marking as Incomplete until there is a reproducer.

Alexander Evseev (aevseev) wrote :

It is related not to provisioning but to bootstrapping via PXE.

And it will not be reproduced, at least by our scale team, because they use a 7-second pause between node power-on/reboot. So just powering on a 200-node cluster takes 200*7=1400 seconds, or about 23 minutes.
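
The arithmetic above can be checked with a short sketch. The function name is hypothetical, and it follows the same simplification as the comment of counting one pause per node:

```python
def power_on_duration(node_count, pause_seconds):
    """Total time spent on pauses when powering on nodes one at a time."""
    return node_count * pause_seconds


if __name__ == "__main__":
    total = power_on_duration(200, 7)
    print(f"{total} seconds, about {total // 60} minutes")
```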

Dina Belova (dbelova) wrote :

Let's mark as invalid, as the original description does not fit current deployment issues we sometimes face (and that are tracked in other bugs)

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/178119
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.
