NodeLocked log messages causing random test failures

Bug #1321494 reported by aeva black
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
High
Dmitry Tantsur

Bug Description

When the ironic-api service returns NodeLocked to the client, the client should merely retry the request -- and the nova.virt.ironic driver does the proper error handling for this, and everything is fine. Except that ir-api records this error in the log file, which is interpreted as a failure by tempest. It appears in the test run's output console.html file as follows.

http://logs.openstack.org/36/88336/2/check/check-tempest-dsvm-virtual-ironic/b12a587/console.html.gz

...

2014-05-16 15:43:16.719 | Log File Has Errors: ir-api
2014-05-16 15:43:16.719 | *** Not Whitelisted *** 2014-05-16 15:40:12.654 21242 ERROR wsme.api [-] Server-side error: "Node ab167bb8-8b6d-46de-a294-8078d7f09305 is locked by host devstack-precise-check-rax-iad-4098544.slave.openstack.org, please retry after the current operation is completed.

...

2014-05-16 15:43:16.970 | Logs have errors
2014-05-16 15:43:16.971 | Log: ir-api not allowed to have ERRORS or TRACES
2014-05-16 15:43:16.977 | ERROR: the main setup script run by this job failed - exit code: 1
2014-05-16 15:43:16.977 | please look at the relevant log files to determine the root cause

In fact, this condition is not an error at all -- we merely need to tell the client to retry, and nothing should be logged.

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → Critical
importance: Critical → High
milestone: none → juno-1
Dmitry Tantsur (divius)
Changed in ironic:
assignee: nobody → Dmitry "Divius" Tantsur (divius)
Revision history for this message
Dmitry Tantsur (divius) wrote :

Thinking that 503 is not the best code for the situation. 503 means server experiences troubles, here it actually does not - it's just conflicting access to a resource. Hence I suggest "409 conflict".

Revision history for this message
aeva black (tenbrae) wrote :

Regardless of the HTTP response code used, this shouldn't be treated as an error within the API service.

Over the last few months, I think we've learned that we really should not expose this retry-condition to the clients at all. I've filed a spec to make the API fully async (*), which would solve this problem, but in the mean time we need to stop logging this as an error, or whitelist it in the test runner, so that it stops causing random test failures.

(*) https://review.openstack.org/#/c/94923/

Revision history for this message
Dmitry Tantsur (divius) wrote :

My point here is that WSME expects 5xx codes to designate error on server-side. Judging by code, changing to 409 (which I think is also more correct) will make WSME log it as warning, not error.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/95573

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/95573
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=e04a574aa18359b07ca7d71898c32da028d73f14
Submitter: Jenkins
Branch: master

commit e04a574aa18359b07ca7d71898c32da028d73f14
Author: Dmitry Tantsur <email address hidden>
Date: Mon May 26 21:03:19 2014 +0200

    Change NodeLocked status code to a client-side error

    Currently, the NodeLocked exception's HTTP status code indicates a
    server-side failure, resulting in WSME logging it as such, and this
    is leading to random failures in our CI tests. While NodeLocked is
    a valid issue which should (currently) be exposed to the client,
    it is not a server-side error and does not warrant a 5xx status code.

    This patch changes the HTTP status code for this exception to
    409 CONFLICT, causing WSME to log it as a warning. This is not ideal,
    but it resolves the immediate issue (random CI test failures).

    A longer term solution is being drafted at:
    https://review.openstack.org/94923

    Change-Id: Ied9dcf9bbcb505177adc57d7d8fe06e4096ff488
    Closes-Bug: #1321494

Changed in ironic:
status: In Progress → Fix Committed
Revision history for this message
aeva black (tenbrae) wrote : Re: NodeLocked causing random test failures
Changed in ironic:
status: Fix Committed → Triaged
Revision history for this message
aeva black (tenbrae) wrote :

The previous fix correctly addressed the issue in the *tempest-dsvm-virtual-ironic* jobs, as described.

However, it is also appearing in the *tempest-dsvm-ironic tests because our tempest test client doesn't retry when it encounters a NodeLocked error.

Revision history for this message
aeva black (tenbrae) wrote :

Updating the title of this bug to clarify the scope to failures due to log messages (which was fixed) and re-closing.

https://bugs.launchpad.net/ironic/+bug/1316773 relates to the tempest client failures (which is not yet fixed)

summary: - NodeLocked causing random test failures
+ NodeLocked log messages causing random test failures
Changed in ironic:
status: Triaged → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-1 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.