[VirtualBox]: Slave node cannot boot: Fatal: No bootable medium found. System halted

Bug #1401851 reported by Anastasia Palkina
This bug affects 1 person
Affects             Status        Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Released  High        Serhii Ovsianikov
6.0.x               Fix Released  High        Serhii Ovsianikov
6.1.x               Fix Released  High        Serhii Ovsianikov

Bug Description

On VirtualBox, the master node installed successfully.

But often 1 of the 3 slave nodes cannot boot (see screenshot).
If you restart this node, it boots successfully.

Tags: virtualbox
Changed in fuel:
milestone: none → 6.1
Revision history for this message
Stanislav Makar (smakar) wrote :

Please provide the VirtualBox scripts and the version of Fuel.

tags: added: virtualbox
summary: - Slave node cannot boot: Fatal: No bootable medium found. System halted
+ [VirtualBox]: Slave node cannot boot: Fatal: No bootable medium found.
+ System halted
Stanislav Makar (smakar)
Changed in fuel:
status: New → Incomplete
milestone: 6.1 → 6.0
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Stas,
you can use the latest scripts from fuel-main.
I've seen this issue too. You can try 5.1.1 and 6.0.

I'm pretty sure it happens on loaded systems, such as laptops, and that the expect script in our vbox scripts does not use the right source of information about when the Fuel master node installation has finished.
I believe that, at the time the slave nodes are started, either the cobbler container is not yet fully initialized or the networking stack (forwarding, etc.) on the Fuel master node is not yet ready to pass PXE traffic.

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Moving to critical, as it significantly affects user experience. Installation of Fuel and OpenStack has to be very smooth.

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

@Stanislav: due to HCF it cannot be targeted to 6.1

Changed in fuel:
milestone: 6.0 → 6.1
Mike Scherbakov (mihgen)
Changed in fuel:
importance: High → Critical
Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

g/6.1/6.0/

Revision history for this message
Roman Prykhodchenko (romcheg) wrote :

Now, since it was set to Critical, it can be merged to 6.0.*
I'll let someone else re-target it properly to avoid another race condition.

Revision history for this message
Anastasia Palkina (apalkina) wrote :
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Probable root cause: there are not enough CPU cores on the master node.

On all my machines with VirtualBox, the number of VMs I can PXE-boot at the same time matches almost exactly the number of physical CPU cores I have.
The PXE boot process fully loads 2 cores during the image transfer over TFTP: one on the server and one on the client.
If there are too many client nodes, the extra clients fail to boot on the first attempt and need a restart.

Since 6.0 uses image-based provisioning, the master node may be even more CPU-bound.

So I usually set the number of CPUs for the master node to the number of physically available cores.
Even if the CPU supports hyper-threading, I set the number of CPUs to the number of physical cores, not logical ones.

I propose the following modification to all versions of the VirtualBox scripts, not only for 6.0:
Set the default number of CPUs for the master node to the number of physical CPU cores.

Later we may implement sequential, one-by-one node boot,
say, by introducing a configurable delay parameter before each new VM is created.

BTW, this issue appears on bare-metal installations as well, but only with a several times greater number of simultaneously booting machines. On bare metal, clients have their own CPUs, and network interfaces do some offloading.
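
A minimal Python sketch of the proposal above to size the master node by physical cores; it is not the actual change made to the VirtualBox scripts. It assumes two hardware threads per core (the real value depends on the host), and the VM name `fuel-master` is also an assumption. The `VBoxManage modifyvm ... --cpus` command is only built and printed here, not executed.

```python
import os


def master_cpu_count(logical_cpus, threads_per_core=2):
    """Estimate the number of physical cores from the logical CPU count.

    threads_per_core=2 is an assumption (typical hyper-threading);
    pass 1 for hosts without hyper-threading.
    """
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("CPU counts must be positive")
    # Never go below one CPU, even on a single-thread host.
    return max(1, logical_cpus // threads_per_core)


def modifyvm_cpus_command(vm_name, cpus):
    """Build the VBoxManage command that sets the VM's CPU count."""
    return ["VBoxManage", "modifyvm", vm_name, "--cpus", str(cpus)]


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    cpus = master_cpu_count(logical, threads_per_core=2)
    # "fuel-master" is a hypothetical VM name used for illustration.
    print(" ".join(modifyvm_cpus_command("fuel-master", cpus)))
```

The command must be applied while the VM is powered off, since VirtualBox does not change the CPU count of a running VM through `modifyvm`.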

Changed in fuel:
status: Confirmed → Triaged
assignee: Miroslav Anashkin (manashkin) → Serhiy Ovsianikov (sovsianikov)
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Miroslav,
> Since 6.0 uses image-based provisioning - master node may be even more CPU bound.
How is that related? What is consuming CPU on the master node?

I just experienced this issue again. 1 VM booted fine, and 2 didn't. Previously, it had been running fine. I believe the problem is how we check that the master node has finished provisioning.

This issue can be very easily mitigated by provisioning the slaves with a slight delay. Also, it would be better to do it one by one, with a slight delay between runs (let's say 1 sec).
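
The one-by-one start with a delay suggested above could look roughly like this in Python. This is an illustrative sketch, not the merged fix: the function name, the default delay, and the slave VM names in the usage note are assumptions. `VBoxManage startvm <name> --type headless` is the standard VirtualBox CLI call; `run` and `sleep` are injectable so the logic can be exercised without VirtualBox installed.

```python
import subprocess
import time


def start_staggered(vm_names, delay_seconds=1.0,
                    run=subprocess.check_call, sleep=time.sleep):
    """Start VMs one at a time, pausing between consecutive starts.

    The pause gives each slave time to begin fetching its boot image
    before the next one starts competing for the master node.
    """
    started = []
    for index, name in enumerate(vm_names):
        if index > 0:
            # No delay before the first VM, only between starts.
            sleep(delay_seconds)
        run(["VBoxManage", "startvm", name, "--type", "headless"])
        started.append(name)
    return started
```

For example, `start_staggered(["fuel-slave-1", "fuel-slave-2", "fuel-slave-3"], delay_seconds=1.0)` would issue three `startvm` calls with a one-second gap between them; the VM names here are hypothetical.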

Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

From my debugging, it all depends on how we check master node readiness. We only search for a line in the logs, and that line does not mean that all containers are ready.
I had a discussion with Mike: we can simply introduce another ugly "wait" and fix it in 6.1.
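
As a rough illustration of a readiness check that goes beyond matching a log line, the sketch below polls a set of caller-supplied checks until all of them pass or a timeout expires. It is an assumption about how such a check might be structured, not what the Fuel scripts do. Note that a TCP probe cannot confirm TFTP itself, which runs over UDP, so `tcp_port_open` is only one example of a check.

```python
import socket
import time


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(checks, timeout=600.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every check returns True, or give up after timeout.

    checks is a list of zero-argument callables. Returns True when all
    of them pass in the same polling round, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if all(check() for check in checks):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller might use `wait_until_ready([lambda: tcp_port_open(master_ip, web_port)])` before starting the slaves, where `master_ip` and `web_port` are placeholders for whatever service on the master node is considered a reliable readiness signal.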

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We should not consider any kind of issue with VirtualBox as critical. VirtualBox is never used for production environment deployments, so it cannot be critical.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/142142

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/142143

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/142161

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/142142
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=65a9bcdbd228dc8478091999689c7c81bcd7d762
Submitter: Jenkins
Branch: master

commit 65a9bcdbd228dc8478091999689c7c81bcd7d762
Author: Serhiy Ovsianikov <email address hidden>
Date: Tue Dec 16 18:10:57 2014 +0200

    The delay required for downloading tftp boot image

    The delay allows to increase time for obtaining a boot image by
    fuel-slave nodes and decreasing CPU load on fuel-master node.

    Change-Id: I1da48a2374c4d4f0783b12bb9f736e54098fcb41
    Closes-Bug: #1401851

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/6.0)

Reviewed: https://review.openstack.org/142143
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=93af8119dcc5758ead2a17ab48f729a26a15d7fd
Submitter: Jenkins
Branch: stable/6.0

commit 93af8119dcc5758ead2a17ab48f729a26a15d7fd
Author: Serhiy Ovsianikov <email address hidden>
Date: Tue Dec 16 18:10:57 2014 +0200

    The delay required for downloading tftp boot image

    The delay allows to increase time for obtaining a boot image by
    fuel-slave nodes and decreasing CPU load on fuel-master node.

    Change-Id: I1da48a2374c4d4f0783b12bb9f736e54098fcb41
    Closes-Bug: #1401851
    (cherry picked from commit 65a9bcdbd228dc8478091999689c7c81bcd7d762 )

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-main (stable/5.1)

Change abandoned by Serhiy Ovsianikov (<email address hidden>) on branch: stable/5.1
Review: https://review.openstack.org/142161
Reason: necessary to clarify

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Verified on ISO #56

"build_id": "2014-12-18_01-32-01", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "56", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "45caacadb878abfbd9d60e134d72229698b469c9", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-18_01-32-01", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "56", "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "45caacadb878abfbd9d60e134d72229698b469c9", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "73332192a257ea02c40a39885c502ad1ebdf3eda"}}}, "fuellib_sha": "73332192a257ea02c40a39885c502ad1ebdf3eda"

Revision history for this message
Anastasia Palkina (apalkina) wrote :

Reproduced on ISOs #58, 61 for 6.1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/149575

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (stable/5.1)

Reviewed: https://review.openstack.org/142161
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=fb31eaaecd59fc0e681878e77f3b1b543fe12021
Submitter: Jenkins
Branch: stable/5.1

commit fb31eaaecd59fc0e681878e77f3b1b543fe12021
Author: Serhiy Ovsianikov <email address hidden>
Date: Tue Dec 16 18:10:57 2014 +0200

    The delay required for downloading tftp boot image

    The delay allows to increase time for obtaining a boot image by
    fuel-slave nodes and decreasing CPU load on fuel-master node.

    Change-Id: I1da48a2374c4d4f0783b12bb9f736e54098fcb41
    Closes-Bug: #1401851
    (cherry picked from commit 65a9bcdbd228dc8478091999689c7c81bcd7d762 )

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/149575
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=0a96b4a59bf8ae28e1f88db7485b5ea425b0a6b5
Submitter: Jenkins
Branch: master

commit 0a96b4a59bf8ae28e1f88db7485b5ea425b0a6b5
Author: Serhiy Ovsianikov <email address hidden>
Date: Fri Jan 23 12:26:27 2015 +0200

    Added the number of processor cores to fuel master node

    Adding the number of processor cores increases the productivity of
    fuel master node

    Change-Id: Idd7933579b5309fc038488d1a808024d3132e829
    Closes-Bug: #1401851

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Cannot reproduce on the latest ISOs.

Revision history for this message
Ali Jabbar (jabbar-ali) wrote :

Yes, if you reset or reboot your slave nodes one by one, with some delay between them, this problem will not occur.
