[all] too many NODE_FAILURES on periodic CI jobs

Bug #1917418 reported by Bhagyashri Shewale
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

https://review.rdoproject.org/zuul/builds?result=NODE_FAILURE

We are seeing the following errors, and nodes are failing to build:

1. Detailed node error: Build of instance <instance-id> aborted: Failed to allocate the network(s), not rescheduling / "Failed to allocate the network(s), not rescheduling"
2. nodepool.exceptions.LaunchNetworkException: Unable to find public IP of server
3. No valid host found
4. Quota exceeded for ram

The first one has been resolved.
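
To see which of these errors dominates, the nodepool log lines can be tallied by message. This is only a minimal sketch: it assumes the logs contain the messages exactly as quoted in the list above, and the category names and the `classify`/`tally` helpers are hypothetical, not part of nodepool or Zuul. The patterns may need adjusting to match real log output.

```python
import re
from collections import Counter

# Hypothetical category names; each pattern is copied from the error list above.
PATTERNS = [
    ("network_allocation", re.compile(r"Failed to allocate the network\(s\)")),
    ("no_public_ip", re.compile(r"Unable to find public IP of server")),
    ("no_valid_host", re.compile(r"No valid host found")),
    ("ram_quota", re.compile(r"Quota exceeded for ram")),
]


def classify(line):
    """Return the category of the first pattern found in the line, or 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def tally(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)


# Sample input built from the messages in this report plus one unrelated line.
sample = [
    "Detailed node error: Build of instance <instance-id> aborted: "
    "Failed to allocate the network(s), not rescheduling",
    "nodepool.exceptions.LaunchNetworkException: Unable to find public IP of server",
    "No valid host found",
    "Quota exceeded for ram",
    "some unrelated line",
]
counts = tally(sample)
for name, count in counts.most_common():
    print(name, count)
```

Grouping by message this way makes it easier to tell provider-side failures (network allocation, public IP) from tenant resource exhaustion (no valid host, RAM quota).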

yatin (yatinkarel)
Changed in tripleo:
importance: High → Critical
description: updated
Javier Peña (jpena-c) wrote :

Looking at the nodepool logs, everything points to an error on the cloud provider side. I have opened a ticket with Vexxhost to track this.

Javier Peña (jpena-c) wrote :

Vexxhost resolved the ticket, citing issues in their routing infrastructure.

In addition, we had a separate issue with OVB jobs, where stacks were not cleaned up at the end of the job. This was eating up all the resources in the tenant, causing more errors. https://review.rdoproject.org/r/32147 should have fixed that.

Rabi Mishra (rabi) wrote :

Looks like we're still seeing those. I just noticed them in https://review.opendev.org/c/openstack/tripleo-heat-templates/+/777294.

Not sure if it's the same set of issues or something new.

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released