NebulOuS

[SAL/PROACTIVE] Cluster deployment fails when worker node is of type c1.xlarge

Bug #2067514 reported by Robert Sanfeliu on 2024-05-29

This bug affects 1 person

Affects    Status    Importance  Assigned to     Milestone
NebulOuS   Triaged   High        Ankica Barisic

Bug Description

I tried to deploy a cluster on AWS using Postman. The master is a c5.xlarge instance and the worker is a c1.xlarge instance. The master is created successfully. The worker node is instantiated (as seen on the AWS panel), but the deployment task is stuck at 14% and the NodeSource in ProActive shows the error "Description: Timeout occurred after 300000 ms.".

If both master and worker are c5.xlarge, it works OK.

Robert Sanfeliu (rsprat) on 2024-05-29

description:	updated

Ankica Barisic (akki55) wrote on 2024-05-30:

Hi Robert,

can you please report on
- the ProActive job ID which failed
- Nebulous environment which you used
- SAL scripts which were used

Thank you

Changed in nebulous:
assignee:	Ankica Barisic (akki55) → Robert Sanfeliu (rsprat)

Robert Sanfeliu (rsprat) wrote on 2024-05-30:

- the ProActive job ID which failed: 1664
- Nebulous environment which you used: nebulous-cd
- SAL scripts which were used: ONM version

Robert Sanfeliu (rsprat) on 2024-05-30

Changed in nebulous:
assignee:	Robert Sanfeliu (rsprat) → Ankica Barisic (akki55)

Ankica Barisic (akki55) wrote on 2024-05-30:

Screenshot 2024-05-30 190943.png (78.3 KiB, image/png)

We were redeploying the nodes on your AWS account, so please check that there are no running nodes left.

What we found was that you were deploying g2.2xlarge, and we do not support this type of node. However, findNodeCandidates should not return it to you if we do not support it.

We will need to investigate why this happens.

When we changed the hardware to c1.xlarge, the deployment proceeded as expected.

Changed in nebulous:
assignee:	Ankica Barisic (akki55) → Robert Sanfeliu (rsprat)

Robert Sanfeliu (rsprat) wrote on 2024-05-30:

I am still not able to deploy a cluster with a c5.xlarge as master and a c1.xlarge as worker. The worker gets stuck at 14%.

- the ProActive job ID which failed: 1792
- Nebulous environment which you used: nebulous-cd
- SAL scripts which were used: ONM version

Changed in nebulous:
assignee:	Robert Sanfeliu (rsprat) → Ankica Barisic (akki55)

Joanna Chmielewska (joannach) on 2024-05-31

Changed in nebulous:
importance:	Undecided → Critical

Joanna Chmielewska (joannach) on 2024-05-31

Changed in nebulous:
importance:	Critical → High

Ankica Barisic (akki55) wrote on 2024-06-05:

We will need to decide how to work around this problem. We could run into more instance types like this one, and, as we discussed, a quick fix is to consider having a blacklist.

For now, SAL supports blacklisting certain regions.
We can provide support for blacklisting images; however, the blacklist will then need to be propagated somehow when adding the cloud.

Another solution is for the Optimizer to do this when requesting node candidates.

A design decision needs to be made on how to solve this.
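
To make the blacklist option concrete, here is a minimal sketch of filtering node candidates by instance type before they reach the deployment step. It is only an illustration: the function name, the blacklist constant, and the candidate layout (a dict with an "instanceType" key) are assumptions, not the actual SAL or Optimizer schema. The only entry taken from this thread is g2.2xlarge.

```python
# Hypothetical blacklist. g2.2xlarge is the unsupported type named in this
# thread; any further entries would have to be agreed on separately.
UNSUPPORTED_INSTANCE_TYPES = {"g2.2xlarge"}


def filter_node_candidates(candidates, blacklist=UNSUPPORTED_INSTANCE_TYPES):
    """Return only the candidates whose instance type is not blacklisted.

    Each candidate is assumed to be a dict with an "instanceType" key.
    The real node-candidate structure returned by SAL may differ.
    """
    blocked = {name.lower() for name in blacklist}
    return [
        candidate
        for candidate in candidates
        if str(candidate.get("instanceType", "")).lower() not in blocked
    ]


# Made-up sample candidates, for illustration only.
candidates = [
    {"id": "nc-1", "instanceType": "c5.xlarge"},
    {"id": "nc-2", "instanceType": "g2.2xlarge"},
    {"id": "nc-3", "instanceType": "c1.xlarge"},
]

allowed = filter_node_candidates(candidates)
print([candidate["id"] for candidate in allowed])  # prints ['nc-1', 'nc-3']
```

The same filter could run either inside SAL (before findNodeCandidates returns) or in the Optimizer (after it receives the candidates); where it belongs is the open design decision.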