[SAL/PROACTIVE] Cluster deployment fails when worker node is of type c1.xlarge

Bug #2067514 reported by Robert Sanfeliu
This bug affects 1 person
Affects: NebulOuS
Status: Triaged
Importance: High
Assigned to: Ankica Barisic
Milestone: none listed

Bug Description

I tried to deploy a cluster on AWS using Postman. The master is a c5.xlarge instance and the worker is a c1.xlarge instance. The master is created successfully. The worker node is instantiated (as seen on the AWS panel), but the deployment task is stuck at 14% and the NodeSource in ProActive shows the error "Description: Timeout occurred after 300000 ms."

If both master and worker are c5.xlarge, it works OK.

Robert Sanfeliu (rsprat)
description: updated
Ankica Barisic (akki55) wrote :

Hi Robert,

Could you please report the following:
- the ProActive job ID which failed
- the NebulOuS environment which you used
- the SAL scripts which were used

Thank you

Changed in nebulous:
assignee: Ankica Barisic (akki55) → Robert Sanfeliu (rsprat)
Robert Sanfeliu (rsprat) wrote :

- the ProActive job ID which failed: 1664
- Nebulous environment which you used: nebulous-cd
- SAL scripts which were used: ONM version

Robert Sanfeliu (rsprat)
Changed in nebulous:
assignee: Robert Sanfeliu (rsprat) → Ankica Barisic (akki55)
Ankica Barisic (akki55) wrote :

We were redeploying the nodes on your AWS account, so please check that no running nodes are left.

We found that you were deploying g2.2xlarge, and we do not support this node type. However, findNodeCandidates should not return it to you if we do not support it.

We will need to investigate why this happens.

When we changed the hardware to c1.xlarge, the deployment went as expected.

Changed in nebulous:
assignee: Ankica Barisic (akki55) → Robert Sanfeliu (rsprat)
Robert Sanfeliu (rsprat) wrote :

I'm still not able to deploy a cluster with a c5.xlarge as master and a c1.xlarge as worker. The worker gets stuck at 14%.

- the ProActive job ID which failed: 1792
- Nebulous environment which you used: nebulous-cd
- SAL scripts which were used: ONM version

Changed in nebulous:
assignee: Robert Sanfeliu (rsprat) → Ankica Barisic (akki55)
Changed in nebulous:
importance: Undecided → Critical
Changed in nebulous:
importance: Critical → High
Ankica Barisic (akki55) wrote :

We will need to decide how to work around this problem: we could run into more instances like this, and, as we discussed, a quick fix is to consider having a blacklist.

For now, SAL supports blacklisting certain regions.
We can provide support for blacklisting images; however, they will then need to be propagated somehow when adding the cloud.

Another solution is for the Optimizer to do this when requesting node candidates.

A design decision needs to be made on how to solve this.

Changed in nebulous:
assignee: Ankica Barisic (akki55) → Robert Sanfeliu (rsprat)
Robert Sanfeliu (rsprat) wrote :

So, the question is: does SAL/ProActive support c1.xlarge? If not, we can certainly blacklist it in the optimiser controller.
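
For illustration only, a blacklist on the optimiser side could be as simple as the sketch below. The candidate layout and the `instance_type` key are assumptions made for this example, not the actual findNodeCandidates response format, and `filter_candidates` is a hypothetical helper. The blacklist contains only g2.2xlarge, the one type reported as unsupported in this thread; whether c1.xlarge belongs there is still the open question.

```python
# Hypothetical sketch: the candidate dict layout and the "instance_type" key
# are assumptions, not the real findNodeCandidates response schema.
BLACKLISTED_INSTANCE_TYPES = {"g2.2xlarge"}  # reported as unsupported in this thread


def filter_candidates(candidates, blacklist=BLACKLISTED_INSTANCE_TYPES):
    """Return only the candidates whose instance type is not blacklisted."""
    return [c for c in candidates if c.get("instance_type") not in blacklist]


# Made-up candidates, used only to exercise the filter.
candidates = [
    {"id": "a", "instance_type": "c5.xlarge"},
    {"id": "b", "instance_type": "g2.2xlarge"},
    {"id": "c", "instance_type": "c1.xlarge"},
]

allowed = filter_candidates(candidates)
print([c["instance_type"] for c in allowed])
```

Passing a different set as `blacklist` (for example one that also contains c1.xlarge) would exclude those types as well, without touching the SAL side.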

Changed in nebulous:
assignee: Robert Sanfeliu (rsprat) → Ankica Barisic (akki55)
Joanna Chmielewska (joannach) wrote :
Changed in nebulous:
status: New → Triaged