Fuel for OpenStack

Deployment of 50 nodes failed on provisioning

Bug #1386861 reported by Aleksandr Shaposhnikov on 2014-10-28

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Invalid	Medium	Alexander Evseev	Fuel for OpenStack 7.0
6.1.x	Won't Fix	Medium	Fuel Library (Deprecated)	Fuel for OpenStack 6.1
7.0.x	Invalid	Medium	Fuel Library (Deprecated)	Fuel for OpenStack 7.0
8.0.x	Invalid	Medium	Alexander Evseev	Fuel for OpenStack 7.0

Bug Description

Provisioning almost failed (we had to reboot two nodes manually) because two nodes apparently could not obtain an IP address via PXE (this is our guess) and booted straight into the previously installed OS. As far as I know, Fuel used to erase the first sectors of the HDDs when the "deploy" command was received, and the node was then forcibly rebooted by a command from the master node. It looks like this code did not work this time: the node did not get an IP address during PXE boot, fell through to the next NIC, failed there too (we get two attempts because we have two NICs configured the same way), and then booted from the HDD (SSD).
This bug reveals two issues:
1. The code responsible for wiping the HDD before provisioning does not work.
2. The Fuel/MOS master node is not able to provide nodes with an IP address in time.

Suggested solutions (it would be nice to have all of them implemented in some way):

1. Wipe the disks. In our setup the boot order loops, so the node will eventually boot from a NIC.
2. Optimize TFTP/PXE/DHCP traffic priority on the master node, probably with traffic shaping/prioritization: DHCP is high priority; DNS and TFTP are medium priority; HTTP and syslog are low priority.
3. Enable the gPXE loader to fetch files over HTTP/TCP instead of TFTP/UDP.
4. Implement https://blueprints.launchpad.net/fuel/+spec/continue-deployment to be able to continue deployment even if some compute nodes failed.
5. (Depends on 4) Increase the interval between the cobbler commands that restart nodes, to avoid all nodes booting at the same time. Suggested delay: 3-5 seconds between nodes. A much larger delay would increase the overall time, so let's keep provisioning time reasonable. The best approach: boot the controllers with a delay of 10 seconds, and once they have started provisioning, reboot the computes one by one, since computes are not deployed/installed until the controllers are ready. Without item 4 this would not work at all.
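
The staggered reboot in item 5 could be sketched roughly as follows. This is only an illustration of the timing idea, not Fuel or cobbler code: the function name `reboot_schedule`, its parameters, and the node names are all hypothetical, and the sketch simply starts the computes one controller delay after the last controller instead of actually waiting for the controllers to begin provisioning.

```python
from typing import List, Tuple


def reboot_schedule(
    controllers: List[str],
    computes: List[str],
    controller_delay: float = 10.0,  # seconds between controller reboots (from item 5)
    compute_delay: float = 4.0,      # assumed midpoint of the suggested 3-5 seconds
) -> List[Tuple[float, str]]:
    """Return (offset_seconds, node) pairs: controllers first, then computes."""
    schedule: List[Tuple[float, str]] = []
    offset = 0.0
    for node in controllers:
        schedule.append((offset, node))
        offset += controller_delay
    for node in computes:
        schedule.append((offset, node))
        offset += compute_delay
    return schedule


if __name__ == "__main__":
    # Hypothetical node names, used only to show the resulting plan.
    plan = reboot_schedule(["ctrl-1", "ctrl-2", "ctrl-3"], ["cmp-1", "cmp-2"])
    for offset, node in plan:
        print(f"t+{offset:5.1f}s  reboot {node}")
```

A real implementation would issue the actual reboot command at each offset and would need item 4 so that a failed compute does not abort the whole deployment.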

P.S. 47 nodes deploy without any problems; with 50, problems appear. This applies to Ubuntu and to provisioning; CentOS looks better. The bottleneck seems to be somewhere on the master node, because Ubuntu and provisioning use fairly large initrds while CentOS does not.

Łukasz Oleś (loles) on 2014-10-28

Changed in fuel:
milestone:	none → 6.0
importance:	Undecided → High
assignee:	nobody → Fuel Library Team (fuel-library)

Roman Alekseenkov (ralekseenkov) wrote on 2014-10-28:

Moving to critical, as it results in unreliable discovery and provisioning at 100-node scale.

Changed in fuel:
importance:	High → Critical

Aleksandr Shaposhnikov (alashai8) on 2014-10-28

description:	updated

Aleksandr Shaposhnikov (alashai8) on 2014-10-28

description:	updated

Aleksandr Shaposhnikov (alashai8) on 2014-10-28

description:	updated

Bogdan Dobrelya (bogdando) on 2014-10-29

Changed in fuel:
status:	New → Triaged

Łukasz Oleś (loles) on 2014-10-30

Changed in fuel:
status:	Triaged → In Progress
assignee:	Fuel Library Team (fuel-library) → Łukasz Oleś (loles)

OpenStack Infra (hudson-openstack) wrote on 2014-10-30: Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/132077

Matthew Mosesohn (raytrac3r) wrote on 2014-11-14:

Lukasz, if you can finish for 6.0, please indicate. Otherwise, move to 6.1.

Łukasz Oleś (loles) wrote on 2014-11-14:

I will run some tests during the weekend

Roman Alekseenkov (ralekseenkov) wrote on 2014-11-19:

Hey guys, why is it still open? I thought the issue with deployment on 50 nodes was already resolved.

Łukasz Oleś (loles) wrote on 2014-11-23:

Actually we only tweaked some options in dnsmasq, nginx, syslog and now it works for 100 nodes.

Some of these points are still worth implementing for larger numbers of nodes and for better stability. We may change the bug title and move it to 6.1, close it and open a new bug in 6.1, or create a blueprint for it.

Tomasz 'Zen' Napierala (tzn) wrote on 2014-11-27:

This bug is complex and partially resolved. I will keep it as a reminder for 6.1, and we will discuss further improvements.

Changed in fuel:
milestone:	6.0 → 6.1
importance:	Critical → Medium

Łukasz Oleś (loles) on 2015-01-20

Changed in fuel:
assignee:	Łukasz Oleś (loles) → nobody

Vladimir Kuklin (vkuklin) on 2015-01-21

Changed in fuel:
assignee:	nobody → Fuel Library Team (fuel-library)

Stanislaw Bogatkin (sbogatkin) wrote on 2015-02-06:

Moved to Triaged, since a bug cannot be In Progress while it is assigned to a whole team.

Changed in fuel:
status:	In Progress → Triaged

OpenStack Infra (hudson-openstack) wrote on 2015-03-03: Change abandoned on fuel-astute (master)

Change abandoned by Lukasz Oles (<email address hidden>) on branch: master
Review: https://review.openstack.org/132077
Reason: Change split into two parts: first https://review.openstack.org/#/c/159764/ (second in progress)

Vladimir Kuklin (vkuklin) on 2015-03-31

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Tomasz 'Zen' Napierala (tzn)

Vladimir Kuklin (vkuklin) wrote on 2015-04-01:

#10

This should be covered by a blueprint

Tomasz 'Zen' Napierala (tzn) on 2015-04-14

Changed in fuel:
assignee:	Tomasz 'Zen' Napierala (tzn) → Fuel Library Team (fuel-library)
status:	Triaged → Won't Fix

Nastya Urlapova (aurlapova) wrote on 2015-04-19:

#11

Tomasz, could you turn this into a blueprint (BP)?

OpenStack Infra (hudson-openstack) wrote on 2015-04-28: Fix proposed to fuel-library (master)

#12

Fix proposed to branch: master
Review: https://review.openstack.org/178119

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Alexander Evseev (aevseev-h)
status:	Won't Fix → In Progress

Fuel Devops McRobotson (fuel-devops-robot) on 2015-04-29

Changed in fuel:
milestone:	6.1 → 7.0

OpenStack Infra (hudson-openstack) wrote on 2015-07-28: Change abandoned on fuel-library (master)

#13

Change abandoned by Igor Shishkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/178119
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Vladimir Kuklin (vkuklin) wrote on 2015-08-03:

#14

I am not sure we have this issue anymore, as we switched to IBP and have successfully deployed environments of more than 50 nodes hundreds of times. Marking as Incomplete until there is a reproducer.

Alexander Evseev (aevseev) wrote on 2015-08-03:

#15

It is related not to provisioning but to bootstrapping via PXE.

And it will not be reproduced, at least by our scale team, because they use a 7-second pause between node power-on/reboot operations. So just powering on a 200-node cluster takes 200*7=1400 seconds, or about 23 minutes.
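
The estimate above can be checked, and repeated for other cluster sizes or pauses, with a small sketch like this one. The helper names `power_on_seconds` and `as_minutes_seconds` are made up for illustration, and the calculation follows the same simplification as the comment (one full pause per node).

```python
from typing import Tuple


def power_on_seconds(node_count: int, pause_seconds: int) -> int:
    """Total time spent pausing, counting one pause per node as in the comment."""
    return node_count * pause_seconds


def as_minutes_seconds(total_seconds: int) -> Tuple[int, int]:
    """Split a duration in seconds into whole minutes and leftover seconds."""
    return divmod(total_seconds, 60)


if __name__ == "__main__":
    total = power_on_seconds(200, 7)
    minutes, seconds = as_minutes_seconds(total)
    print(f"200 nodes x 7 s = {total} s = {minutes} min {seconds} s")
```

With the 3-5 second delay suggested in the bug description, the same 200-node cluster would need between 600 and 1000 seconds.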

Dina Belova (dbelova) wrote on 2015-08-18:

#16

Let's mark as invalid, as the original description does not fit current deployment issues we sometimes face (and that are tracked in other bugs)

OpenStack Infra (hudson-openstack) wrote on 2015-09-29: Change abandoned on fuel-library (master)

#17

Change abandoned by Fuel DevOps Robot (<email address hidden>) on branch: master
Review: https://review.openstack.org/178119
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.
