stable/liberty CI: all jobs failing due to nodes stuck in wait call-back

Bug #1550772 reported by James Slagle
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: James Slagle

Bug Description

It seems all stable/liberty CI jobs are currently failing. I took a look at a few of the failures, and they all seem stuck during the node deployment of the Overcloud. The nodes are started by Ironic, but the disk deployment is never started and all the nodes are stuck in the "wait call-back" state.

Example failure:
http://logs.openstack.org/09/285509/1/check-tripleo/gate-tripleo-ci-f22-nonha/9f8621d/

I updated my local environment to the latest stable/liberty repos, and I was able to reproduce the same issue. I suspect a regression in either ironic-python-agent, ironic, or diskimage-builder.

Tags: alert
James Slagle (james-slagle) wrote :

I can't see any way to debug what might be preventing the nodes from reaching back to Ironic to start the disk deployment. There's no way to see what is on the VM console, you apparently can't log the console to a file at this stage, and there are no logs saved anywhere else AFAICT.

I'm trying a few earlier delorean repos to see if I can pinpoint a regression. It seems our last successful job on stable/liberty was around 8:00 on 2/25.

This repo was broken for me:
https://trunk.rdoproject.org/centos7-liberty/56/ef/56effa1f8d8bb2545669019dbb159703c3e54bde_5e110e28

Trying some earlier repos and will report back.
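
Walking back through earlier repos by hand can be shortened with a binary search. The following is only a minimal sketch, not something used in this report: `first_broken` and the `is_broken` callable are hypothetical names, and it assumes the repos are ordered oldest to newest and that every repo after the first broken one is also broken.

```python
def first_broken(repos, is_broken):
    """Return the index of the first broken repo, or None if all of them work.

    Assumptions (hypothetical, for illustration only):
      * repos is ordered oldest to newest
      * once a repo is broken, every later repo is broken too
      * is_broken(repo) runs whatever check you trust, e.g. a test deployment
    """
    lo, hi = 0, len(repos)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(repos[mid]):
            hi = mid          # first broken repo is at mid or earlier
        else:
            lo = mid + 1      # first broken repo is after mid
    return lo if lo < len(repos) else None


if __name__ == "__main__":
    # Placeholder repo labels, not real delorean repo hashes.
    repos = ["repo-a", "repo-b", "repo-c", "repo-d", "repo-e"]
    known_bad = {"repo-d", "repo-e"}
    print(first_broken(repos, lambda r: r in known_bad))
```

Each call to `is_broken` stands in for a full deployment attempt, so halving the search range saves the most time when the list of candidate repos is long.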

description: updated
Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → James Slagle (james-slagle)
tags: added: alert
James Slagle (james-slagle) wrote :

This is the first repo that appears to be broken:
https://trunk.rdoproject.org/centos7-liberty/2e/1b/2e1b6cd0ed444b9f2accbfeae52f5240a0798912_56e2361d

That repo was built because of a commit in ironic-python-agent (ipa).

That commit is actually from the master branch of ipa. The previous repo, which works, had a build of ipa from the stable/liberty branch. So the new build in the broken repo picked up all the changes between stable/liberty and master of ipa.

It's as if delorean's rdo.yaml for liberty switched to building ipa from master instead of stable/liberty.

James Slagle (james-slagle) wrote :

Indeed, ipa was switched to build from the master branch for liberty:
https://github.com/redhat-openstack/rdoinfo/commit/4513e16781c0532a4b313b6573e6b98dbcfcf589

The master build no longer works with tripleo liberty (for an as-yet unknown reason).

Not sure why the switch was made, as ipa does have a stable/liberty branch.

James Slagle (james-slagle) wrote :

Emailed apevec. Maybe we can switch back to the stable/liberty ipa until we can figure out what is wrong with the master version.

Alan Pevec (apevec) wrote :

Dmitry identified commit https://github.com/openstack/ironic-python-agent/commit/df701c979cf2bc6faa10c0a87ed0fc19d60fe905 as the cause: it requires at least oslo.service 0.12.0, while liberty has 0.9.1.
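
To illustrate the mismatch, here is a minimal sketch that compares the two version numbers quoted above. The helper names `parse_version` and `satisfies_minimum` are hypothetical and not part of ipa or oslo.service, and the parsing assumes plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '0.12.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Versions quoted in the comment above.
REQUIRED_BY_IPA_MASTER = "0.12.0"
SHIPPED_IN_LIBERTY = "0.9.1"

if __name__ == "__main__":
    ok = satisfies_minimum(SHIPPED_IN_LIBERTY, REQUIRED_BY_IPA_MASTER)
    print("liberty oslo.service new enough for ipa master:", ok)
```

The versions are compared as integer tuples rather than strings because a plain string comparison would rank "0.9.1" above "0.12.0".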

Alan Pevec (apevec) wrote :
Changed in tripleo:
status: In Progress → Fix Released