[astute] OS install failure blocks environment

Bug #1374376 reported by Damia Pastor
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Vladimir Sharshov
Milestone: 6.1

Bug Description

Scenario:

While deploying multiple nodes with Ubuntu, one of the nodes has a hardware failure and restarts. While Fuel tries to reinstall it, neither stop nor reset works, leaving the process in a loop.

Steps to reproduce:

- Deploy multiple nodes
- During OS installation (tested with Ubuntu), hard reset the node to simulate a hardware failure
- Stop the process, either via the GUI or the CLI.

Expected behaviour:

- Deployment stops and Fuel returns the environment to its pre-deployment state

Actual behaviour:

- The deployment gets stuck, repeating the OS installation in a loop, and there is no way to cancel it.

Notes:

- We also tried a "reset" from the Fuel CLI and deleting the environment.

Tags: astute
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → High
milestone: none → 6.0
Revision history for this message
Damia Pastor (magradallegir) wrote :

Hi Aleksey,

I think this was caused by an "idle user" syndrome: the user gets nervous because the task takes too long, so we start trying things:

- Cancel, Reset, Delete

I will retest to see if I can still reproduce it; if not, I will open a new ticket to block the Reset and Delete actions while a Cancel is in progress.

Thanks and sorry for the inconvenience.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

I can't reproduce that.

Tried CentOS HA with 3 controllers and force-rebooted one of them, then stopped the deployment - everything stopped OK.
Also tried Ubuntu simple mode with 5 nodes and force-rebooted two of them, then stopped the deployment - everything stopped OK.

Changed in fuel:
status: New → Confirmed
assignee: Fuel Library Team (fuel-library) → Fuel Astute Team (fuel-astute)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

This behaviour is really strange, because in the case of Stop Deployment we kill the main process (to prevent conflicting commands), then try to reach all of the cluster's nodes via SSH and erase them. By default this step can take up to 5 minutes (a 60-second timeout with 5 retries). After that we remove the nodes from Cobbler and then send them a reboot command over SSH.

Given the above, I think the idea from Damià Pastor is closest to the truth.

Unfortunately, without logs I cannot say anything more specific.
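
For readers unfamiliar with Astute internals, the stop sequence described above can be summarised roughly as in the sketch below. This is a minimal illustrative sketch in Python (Astute itself is written in Ruby); the helper names kill_main_process, erase_node_via_ssh, remove_from_cobbler and reboot_via_ssh are hypothetical placeholders, and only the 60-second timeout and 5 retries come from the comment above.

# Illustrative sketch of the Stop Deployment sequence described above.
# All helpers passed in are hypothetical placeholders, not real Fuel/Astute APIs.
import time

SSH_TIMEOUT = 60  # seconds per SSH attempt (per the comment above)
SSH_RETRIES = 5   # retries per node, so up to ~5 minutes in total

def stop_deployment(nodes, kill_main_process, erase_node_via_ssh,
                    remove_from_cobbler, reboot_via_ssh):
    # 1. Kill the main deployment process to prevent conflicting commands.
    kill_main_process()

    # 2. Try to reach every node over SSH and erase it, retrying on failure.
    for node in nodes:
        for attempt in range(SSH_RETRIES):
            if erase_node_via_ssh(node, timeout=SSH_TIMEOUT):
                break
            time.sleep(1)  # short pause before the next attempt
        # A node that never answers (e.g. the hard-reset node in this bug)
        # simply exhausts its retries here.

    # 3. Remove the nodes from Cobbler.
    for node in nodes:
        remove_from_cobbler(node)

    # 4. Finally, send a reboot command to the nodes over SSH.
    for node in nodes:
        reboot_via_ssh(node)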

Changed in fuel:
status: Confirmed → Incomplete
assignee: Fuel Astute Team (fuel-astute) → Vladimir Sharshov (vsharshov)
tags: added: astute
summary: - OS install failure blocks environment
+ [astute]OS install failure blocks environment
summary: - [astute]OS install failure blocks environment
+ [astute] OS install failure blocks environment
Revision history for this message
Damia Pastor (magradallegir) wrote :

No problem, Vladimir.

I will try to build a better case and provide logs. I think there are two points of interest:

a) What happens if Fuel cannot reach one of the nodes scheduled for deployment?

b) Should we block any interaction with the cluster (delete, reset) once a stop is in progress?

I will work further on these two scenarios and create a new bug if necessary. :)

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Is there an update on this bug?

Revision history for this message
Damia Pastor (magradallegir) wrote :

Hi Matthew,

I need to differentiate the two cases and re-test them:

a) The nodes do not come back and the user only issues a stop. In 1 out of 3 attempts the operation got stuck (reset after 24h).
b) The nodes do not come back, the user stops and then tries to delete the environment. The operation got stuck in 3 out of 3 cases (reset after 24h).

I will check whether it is possible to kill the task through the CLI as a feasible workaround, based on support tickets.
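
For illustration only, one way to attempt the workaround mentioned above would be to delete the stuck task through Nailgun's REST-style API rather than the GUI. The master address, port, endpoint path, force parameter and task ID in the sketch below are all assumptions, not details confirmed by this report.

# Hypothetical sketch: force-delete a stuck deployment task via a
# Nailgun-style REST API. Address, port, path and parameters are assumptions.
import requests

NAILGUN = "http://10.20.0.2:8000"   # assumed Fuel master address and port
TASK_ID = 42                        # assumed ID of the stuck task

resp = requests.delete(
    "%s/api/tasks/%d/" % (NAILGUN, TASK_ID),
    params={"force": 1},             # assumed flag to override a running task
    headers={"X-Auth-Token": "..."}  # auth token intentionally omitted here
)
resp.raise_for_status()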

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Vladimir,
any update on this bug?

Changed in fuel:
milestone: 6.0 → 6.1
Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

This bug has been incomplete for more than 4 weeks. We cannot investigate it further, so we are setting the status to Invalid. If you think this is not correct, please feel free to provide the requested information and reopen the bug, and we will look into it further.
