Jobs failing in oooq when running "openstack baremetal configure boot"

Bug #1675384 reported by Alfredo Moralejo
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Gael Chamoulaud

Bug Description

In oooq jobs, command "openstack baremetal configure boot" is executed as part of the "prepare image phase" task.

Both RDO-CI and upstream oooq periodic jobs are failing at this step:

https://ci.centos.org/job/tripleo-quickstart-promote-master-delorean-minimal/1009/consoleFull
http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-nonha-tempest-oooq/3c375a2/console.html

What I see in the logs is:

- Apparently it started failing after zaqar was moved behind wsgi in https://review.openstack.org/#/c/442482/
- In the mistral executor logs, I see errors when it tries to post messages to a zaqar queue; however, mistral retries and the messages are posted successfully.
- For non-oooq jobs, "openstack baremetal configure boot" is only executed for mitaka [1]. When other workflows run in these jobs, I find similar errors/retries when mistral tries to push messages into zaqar queues [2], but the jobs are not failing.

Is there something different in tripleo.baremetal.v1.configure workflow as compared to tripleo.baremetal.v1.register_or_update?

Could we change oooq to use "openstack baremetal configure boot" only for mitaka?

[1] https://github.com/openstack-infra/tripleo-ci/blob/master/scripts/tripleo.sh#L667-L670
[2] http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/910c9b9/logs/undercloud/var/log/mistral/executor.txt.gz#_2017-03-21_07_06_38_803
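
The mitaka-only question above amounts to a release gate. A minimal sketch of such a gate follows; the helper name and the release set are hypothetical and only illustrate the idea, they are not taken from oooq or tripleo.sh:

```python
# Hypothetical gate: run "openstack baremetal configure boot" only for
# releases that still need it. The set below is an assumption based on
# the question in this report, not on the actual oooq code.
RELEASES_NEEDING_CONFIGURE_BOOT = {"mitaka"}


def needs_configure_boot(release):
    """Return True only for releases listed in the gate set."""
    return release.strip().lower() in RELEASES_NEEDING_CONFIGURE_BOOT


if __name__ == "__main__":
    for release in ("mitaka", "newton", "master"):
        print(release, needs_configure_boot(release))
```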

Tags: ci
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart-extras (master)

Fix proposed to branch: master
Review: https://review.openstack.org/449160

Changed in tripleo:
assignee: nobody → Alfredo Moralejo (amoralej)
status: New → In Progress
Gael Chamoulaud (gael-chamoulaud) wrote :

Hi Alfredo, there is already a fix for that in review https://review.openstack.org/#/c/440518/

no longer affects: tripleo-quickstart
Changed in tripleo:
importance: Undecided → High
milestone: none → pike-1
Alfredo Moralejo (amoralej) wrote :

After some additional investigation, I've realized that when we hit this, the task fails after one hour, so it looks like some kind of timeout.

I've been able to reproduce this locally. What I find is that when I run the command:

# openstack --debug baremetal configure boot

it never returns after printing these messages:

Started Mistral Workflow tripleo.baremetal.v1.configure. Execution ID: 5ef72f29-dbb7-45c5-b9ad-16973395ee6b
Started Mistral Workflow tripleo.baremetal.v1.configure. Execution ID: 4a1f7b60-fe60-4f6a-938c-8dce242c1006
Waiting for messages on queue '4250bd94-88e0-472e-afff-de64ba86f59a' with no timeout.
Waiting for messages on queue '4250bd94-88e0-472e-afff-de64ba86f59a' with no timeout.

However, if I check the workflow executions, they are marked as successful:

| 5ef72f29-dbb7-45c5-b9ad-16973395ee6b | 5e81b4d8-72f5-442f-9ff0-3800a8da16f7 | tripleo.baremetal.v1.configure | | <none> | SUCCESS | None | 2017-03-24 17:27:44 | 2017-03-24 17:28:02 |
| 4a1f7b60-fe60-4f6a-938c-8dce242c1006 | 5e81b4d8-72f5-442f-9ff0-3800a8da16f7 | tripleo.baremetal.v1.configure | | <none> | SUCCESS | None | 2017-03-24 17:27:55 | 2017-03-24 17:28:05 |

and if I check the queue where the command is supposedly listening, the messages are there:

$ openstack queue stats 4250bd94-88e0-472e-afff-de64ba86f59a
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Stats | {u'messages': {u'claimed': 0, u'oldest': {u'age': 2330.519677877426, u'href': u'/v2/queues/4250bd94-88e0-472e-afff-de64ba86f59a/messages/3ab2ec7a-10b7-11e7-911a-009e82dc8afb', |
| | u'created': u'2017-03-24T17:28:01Z'}, u'total': 2, u'newest': {u'age': 2328.3804879188538, u'href': u'/v2/queues/4250bd94-88e0-472e-afff- |
| | de64ba86f59a/messages/3c465eb4-10b7-11e7-b442-009e82dc8afb', u'created': u'2017-03-24T17:28:03Z'}, u'free': 2}} |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
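
The stats value above can be read programmatically to make the numbers easier to see. The following is a rough sketch: the string is the "Stats" field copied from the table with the line wraps removed, and the `summarize` helper is a hypothetical name used only for illustration:

```python
import ast

# "Stats" field from the table above, with the CLI line wraps joined.
STATS_REPR = (
    "{u'messages': {u'claimed': 0, "
    "u'oldest': {u'age': 2330.519677877426, "
    "u'href': u'/v2/queues/4250bd94-88e0-472e-afff-de64ba86f59a"
    "/messages/3ab2ec7a-10b7-11e7-911a-009e82dc8afb', "
    "u'created': u'2017-03-24T17:28:01Z'}, "
    "u'total': 2, "
    "u'newest': {u'age': 2328.3804879188538, "
    "u'href': u'/v2/queues/4250bd94-88e0-472e-afff-de64ba86f59a"
    "/messages/3c465eb4-10b7-11e7-b442-009e82dc8afb', "
    "u'created': u'2017-03-24T17:28:03Z'}, "
    "u'free': 2}}"
)


def summarize(stats_repr):
    """Extract the message counters and the oldest message age in minutes."""
    # literal_eval safely parses the Python-literal dict printed by the CLI.
    messages = ast.literal_eval(stats_repr)["messages"]
    return {
        "total": messages["total"],
        "claimed": messages["claimed"],
        "free": messages["free"],
        "oldest_age_minutes": round(messages["oldest"]["age"] / 60, 1),
    }


if __name__ == "__main__":
    print(summarize(STATS_REPR))
```

Read this way, the queue holds two messages, none of them claimed, and the oldest had been sitting there for roughly 39 minutes when the stats were taken.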

So it seems that, somehow, the openstack command is not getting the messages from the queue.
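
Since the command waits on the queue "with no timeout", a hang like this only surfaces when the job itself times out. A bounded wait would fail sooner and more visibly. The sketch below shows the general pattern only; `fetch` is a hypothetical callable standing in for whatever client call reads the queue, and none of this is the actual tripleoclient code:

```python
import time


def wait_for_messages(fetch, timeout, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until it returns messages or `timeout` seconds pass.

    `fetch` is a hypothetical zero-argument callable that returns a
    (possibly empty) list of messages.
    """
    deadline = clock() + timeout
    while True:
        messages = fetch()
        if messages:
            return messages
        if clock() >= deadline:
            raise TimeoutError("no messages within %s seconds" % timeout)
        sleep(interval)


if __name__ == "__main__":
    # Simulated queue that delivers a message on the third poll.
    polls = iter([[], [], [{"status": "SUCCESS"}]])
    print(wait_for_messages(lambda: next(polls), timeout=5, interval=0.1))
```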

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Alfredo Moralejo (<email address hidden>) on branch: master
Review: https://review.openstack.org/449160
Reason: In favor of https://review.openstack.org/#/c/440518

Alan Pevec (apevec) wrote :

Emilien, this should be re-assigned to Gael, but he cannot be found in Launchpad?

Changed in tripleo:
status: In Progress → Fix Committed
assignee: Alfredo Moralejo (amoralej) → nobody
Changed in tripleo:
assignee: nobody → Gael Chamoulaud (gael-chamoulaud)
Changed in tripleo:
status: Fix Committed → Fix Released