zaqar returns 503 causing overcloud deployment failure

Bug #1754061 reported by Alex Schultz
This bug affects 2 people
Affects   Status         Importance   Assigned to    Milestone
tripleo   Fix Released   High         Thomas Herve
zaqar     Fix Released   High         Thomas Herve

Bug Description

We've seen zaqar return 503 during the deployment which causes the deployment to fail.

http://logs.openstack.org/22/539522/4/check/tripleo-ci-centos-7-containers-multinode/37005e6/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-03-07_00_09_36

2018-03-07 00:09:36 | u'message': u"Failed to run action [action_ex_id=a6d33603-a829-4118-b53d-cc78b7a0d591, action_cls='<class 'mistral.actions.action_factory.ZaqarAction'>', attributes='{u'client_method_name': u'claim_messages'}', params='{u'queue_name': u'tripleo-ui-logging', u'grace': 60, u'ttl': 60}']\n ZaqarAction.claim_messages failed: Error response from Zaqar. Code: 503. Title: Service temporarily unavailable. Description: Claim could not be created. Please try again in a few seconds..",
2018-03-07 00:09:36 | u'status': u'FAILED'}

Looking at the Apache logs, we see a 503:
http://logs.openstack.org/22/539522/4/check/tripleo-ci-centos-7-containers-multinode/37005e6/logs/undercloud/var/log/httpd/zaqar_wsgi_access.log.txt.gz

192.168.24.1 - - [07/Mar/2018:00:09:30 +0000] "POST /v2/queues/tripleo-ui-logging/claims HTTP/1.1" 503 135 "-" "python-requests/2.14.2"

This seems to be caused by the tripleo-ui-logging queue not existing:
http://logs.openstack.org/22/539522/4/check/tripleo-ci-centos-7-containers-multinode/37005e6/logs/undercloud/var/log/zaqar/zaqar.log.txt.gz#_2018-03-07_00_09_30_679

2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims [req-ea2b494e-1f00-4303-b146-3c9e0580331c 38621b8d28a14d88857c2abff18c9e07 3468247bbd3a479d9c77b2442e8d628f - default default] Queue tripleo-ui-logging does not exist for project 3468247bbd3a479d9c77b2442e8d628f: QueueDoesNotExist: Queue tripleo-ui-logging does not exist for project 3468247bbd3a479d9c77b2442e8d628f
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims Traceback (most recent call last):
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims File "/usr/lib/python2.7/site-packages/zaqar/transport/wsgi/v2_0/claims.py", line 85, in on_post
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims **claim_options)
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims File "/usr/lib/python2.7/site-packages/zaqar/common/pipeline.py", line 97, in consumer
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims tmp = target(*args, **kwargs)
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims File "/usr/lib/python2.7/site-packages/zaqar/storage/swift/claims.py", line 111, in create
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims include_delayed=include_delayed)
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims File "/usr/lib/python2.7/site-packages/zaqar/storage/swift/messages.py", line 109, in _list
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims raise errors.QueueDoesNotExist(queue, project)
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims QueueDoesNotExist: Queue tripleo-ui-logging does not exist for project 3468247bbd3a479d9c77b2442e8d628f
2018-03-07 00:09:30.679 1912 ERROR zaqar.transport.wsgi.v2_0.claims

Giulio Fidente (gfidente) wrote :

Honza, it looks like the tripleo-ui-logging queue is missing in zaqar [1]; do you have any idea what is happening?

1. http://logs.openstack.org/22/539522/4/check/tripleo-ci-centos-7-containers-multinode/37005e6/logs/undercloud/var/log/zaqar/zaqar.log.txt.gz#_2018-03-07_00_09_30_679

Honza Pokorny (hpokorny)
Changed in tripleo:
assignee: nobody → Honza Pokorny (hpokorny)
Honza Pokorny (hpokorny) wrote :

Interesting that this hasn't come up until now. The new logging workflow has to be changed either to stop processing when the queue can't be found, or to create the queue at the beginning. I'm looking into it more.

Honza Pokorny (hpokorny) wrote :

Should we disable mistral cron triggers in CI? It seems like the trigger is just randomly executing in the middle of the deployment. It's not really related to the deployment itself.

Giulio Fidente (gfidente) wrote :

I got this on a fresh deployment

Thomas Herve (therve) wrote :

The error from Zaqar is totally expected. You can't claim messages from a queue that doesn't exist. We should either check the existence of the queue or create it unconditionally in the workflow.

That said, there is no reason for this to fail the deployment. The 503 happens (as expected) on working CI runs. Either this is noise, and the error is just shown for some reason, or something is wrongly checking that workflow for success (randomly?).
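
For illustration, the two workflow-side options mentioned in this comment (check that the queue exists, or create it up front) could look roughly like the Python sketch below. It uses a hypothetical in-memory `FakeQueueStore`; the class, its methods, and both helper functions are assumptions made for this example and are not the Zaqar client API or the actual Mistral workflow.

```python
class QueueDoesNotExist(Exception):
    """Raised when an operation targets a queue that was never created."""


class FakeQueueStore:
    """Hypothetical in-memory stand-in for a message service (not Zaqar)."""

    def __init__(self):
        self._queues = {}

    def exists(self, name):
        return name in self._queues

    def create(self, name):
        # Idempotent: creating an existing queue keeps its messages.
        self._queues.setdefault(name, [])

    def post(self, name, message):
        if name not in self._queues:
            raise QueueDoesNotExist(name)
        self._queues[name].append(message)

    def claim(self, name, limit=10):
        # Mirrors the reported behaviour: claiming on a missing queue fails.
        if name not in self._queues:
            raise QueueDoesNotExist(name)
        claimed = self._queues[name][:limit]
        del self._queues[name][:limit]
        return claimed


def claim_if_exists(store, name):
    """Option 1: stop early (nothing to process) when the queue is missing."""
    if not store.exists(name):
        return []
    return store.claim(name)


def claim_after_create(store, name):
    """Option 2: create the queue unconditionally, then claim from it."""
    store.create(name)
    return store.claim(name)
```

Option 1 leaves the service untouched when there is nothing to read, while option 2 guarantees that later claims on the same queue cannot hit the missing-queue error.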

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/551920

Changed in tripleo:
assignee: Honza Pokorny (hpokorny) → Thomas Herve (therve)
status: Triaged → In Progress
Thomas Herve (therve) wrote :
Changed in zaqar:
importance: Undecided → High
assignee: nobody → Thomas Herve (therve)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/551920
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=3efc2f9f1b9b3360ac22c4f76c6a839e23db641c
Submitter: Zuul
Branch: master

commit 3efc2f9f1b9b3360ac22c4f76c6a839e23db641c
Author: Thomas Herve <email address hidden>
Date: Mon Mar 12 09:57:36 2018 +0100

    Don't notify zaqar in publish_ui_logs_to_swift

    We shouldn't notify the main event queue when publishing logs: those are
    run in a cron, so we're not waiting for messages from it. When it fails,
    it can interrupt regular workflows.

    Change-Id: Ifb17352c4a89e2a1ea7c014cf64a586b4e5a2859
    Closes-Bug: #1754061

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/552452

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/552452
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=1624b9be5e700f1dd0ac4ecc226050ba156cbb3c
Submitter: Zuul
Branch: stable/queens

commit 1624b9be5e700f1dd0ac4ecc226050ba156cbb3c
Author: Thomas Herve <email address hidden>
Date: Mon Mar 12 09:57:36 2018 +0100

    Don't notify zaqar in publish_ui_logs_to_swift

    We shouldn't notify the main event queue when publishing logs: those are
    run in a cron, so we're not waiting for messages from it. When it fails,
    it can interrupt regular workflows.

    Change-Id: Ifb17352c4a89e2a1ea7c014cf64a586b4e5a2859
    Closes-Bug: #1754061
    (cherry picked from commit 3efc2f9f1b9b3360ac22c4f76c6a839e23db641c)

tags: added: in-stable-queens
kobig (kobi.ginon) wrote :

Hi @Zuul,
We are also seeing this issue in our deployment: random failures of the overcloud deployment.

Is it sufficient to update this file on the undercloud prior to the overcloud deployment?
/usr/share/openstack-tripleo-common/workbooks/plan_management.yaml

Alex Schultz (alex-schultz) wrote :

@kobig, no. You'd have to update the workbooks in mistral after updating the file itself.

OpenStack Infra (hudson-openstack) wrote : Fix merged to zaqar (master)

Reviewed: https://review.openstack.org/551929
Committed: https://git.openstack.org/cgit/openstack/zaqar/commit/?id=df24b8b0238b019f0c2437a1074b69e19e3d0342
Submitter: Zuul
Branch: master

commit df24b8b0238b019f0c2437a1074b69e19e3d0342
Author: Thomas Herve <email address hidden>
Date: Mon Mar 12 10:30:24 2018 +0100

    Fix claims on non-existing queue on swift

    This returns an empty list instead of an error if we try to claim
    messages on a queue that doesn't exist yet.

    Change-Id: Ia92774ef1c55a371e37fc845511a5dceb8f92c00
    Depends-On: I7e2128f3a5608ed9a41d1e18bd72d771a2a4ddb3
    Closes-Bug: #1754061
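
As a rough illustration of the behaviour change this commit message describes (an empty result instead of an error when claiming on a queue that does not exist yet), consider the sketch below. It is not the Zaqar swift driver code: the dict-based `store` and the functions `list_messages`, `create_claim_strict`, and `create_claim_lenient` are hypothetical names assumed for this example; only the exception name and its message follow the traceback in the bug description.

```python
class QueueDoesNotExist(Exception):
    """Error with the same message shape as the one in the traceback."""

    def __init__(self, queue, project):
        super().__init__(
            "Queue %s does not exist for project %s" % (queue, project))
        self.queue = queue
        self.project = project


def list_messages(store, queue, project):
    """List messages; `store` is a plain dict keyed by (project, queue)."""
    try:
        return list(store[(project, queue)])
    except KeyError:
        raise QueueDoesNotExist(queue, project)


def create_claim_strict(store, queue, project, limit=10):
    """Old behaviour: the missing-queue error propagates to the caller."""
    return list_messages(store, queue, project)[:limit]


def create_claim_lenient(store, queue, project, limit=10):
    """New behaviour: a missing queue simply yields no claimed messages."""
    try:
        messages = list_messages(store, queue, project)
    except QueueDoesNotExist:
        return []
    return messages[:limit]
```

With the lenient variant, a periodic job that polls a queue nobody has written to yet sees an empty result rather than a failure.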

Changed in zaqar:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to zaqar (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/555724

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.6.0

This issue was fixed in the openstack/tripleo-common 8.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to zaqar (stable/queens)

Reviewed: https://review.openstack.org/555724
Committed: https://git.openstack.org/cgit/openstack/zaqar/commit/?id=6357d69fe900e35266f33e1d42227dc58f96cdc3
Submitter: Zuul
Branch: stable/queens

commit 6357d69fe900e35266f33e1d42227dc58f96cdc3
Author: Thomas Herve <email address hidden>
Date: Mon Mar 12 10:30:24 2018 +0100

    Fix claims on non-existing queue on swift

    This returns an empty list instead of an error if we try to claim
    messages on a queue that doesn't exist yet.

    Change-Id: Ia92774ef1c55a371e37fc845511a5dceb8f92c00
    Closes-Bug: #1754061
    (cherry picked from commit df24b8b0238b019f0c2437a1074b69e19e3d0342)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.0.0

This issue was fixed in the openstack/tripleo-common 9.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/zaqar 7.0.0.0b1

This issue was fixed in the openstack/zaqar 7.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/zaqar 6.0.1

This issue was fixed in the openstack/zaqar 6.0.1 release.
