Inappropriate HTTP error status from os-server-external-events

Bug #1855752 reported by Sundar Nadathur
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Eric Fried

Bug Description

The handling of os-server-external-events API [1] has a bug. It is designed to handle multiple events, with the following expected behavior:
* If all events are successfully handled, it should return HTTP 200.
* If no event is successfully handled, it should return HTTP 404.
* If some are handled successfully but not all, it should return HTTP 207, with per-event status codes.

However, when Cyborg sends a single event for a single instance, and that instance is not yet associated with a host [*], the 'else' clause in Line 137 [1] will set HTTP 207 as return code; but, since accepted_events is [] in Line 146, that will throw an exception and return 404. IOW, the expected return is 207 but the actual return is 404.

This has been discussed in IRC [2]. A patch has been proposed [3] to address this.

[*] This happens because Nova calls into Cyborg from the conductor to initiate binding of accelerator requests (ARQs), lets it proceed asynchronously, and waits for the binding notification event in the compute manager. The notification event could come before the compute manager has called self._rt.instance_claim(), which would associate the instance with a host and a node. That race condition triggers the behavior above.

[1] https://github.com/openstack/nova/blob/62f6a0a1bc6c4b24621e1c2e927177f99501bef3/nova/api/openstack/compute/server_external_events.py

[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-12-09.log.html#t2019-12-09T15:45:18

[3] https://review.opendev.org/#/c/698037/

Tags: api
Changed in nova:
assignee: nobody → Eric Fried (efried)
status: New → In Progress
Revision history for this message
Eric Fried (efried) wrote :

> designed to handle multiple events, with the following expected behavior:
> * If all events are successfully handled, it should return HTTP 200.
> * If no event is successfully handled, it should return HTTP 404.
> * If some are handled successfully but not all, it should return HTTP 207, with per-event status > codes.

To be clear, the proposed fix is based on the premise that the second bullet is (has always been) invalid. 207 [1] is specifically designed for this kind of API, and is intended to include the case where *all* components fail (individually -- as opposed to something going wrong at the API level like permissions or malformed inputs).

[1] https://tools.ietf.org/html/rfc4918#section-13

tags: added: api
Revision history for this message
Surya Seetharaman (tssurya) wrote :

Just fyi, there is a duplicate bug lying around: https://bugs.launchpad.net/nova/+bug/1839009 which was reported some time back.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/698037
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e6f742544432d6066f1fba4666580919eb7859bd
Submitter: Zuul
Branch: master

commit e6f742544432d6066f1fba4666580919eb7859bd
Author: Eric Fried <email address hidden>
Date: Mon Dec 9 09:58:53 2019 -0600

    Nix os-server-external-events 404 condition

    The POST /os-server-external-events API had the following confusing
    behavior:

    With multiple events in the payload, if *some* (but not all) were
    dropped, the HTTP response was 207, with per-event 4xx error codes in
    the payload. But if *all* of the events were dropped, the overall HTTP
    response was 404 with no payload. Thus, especially for consumers sending
    only one event at a time, it was impossible to distinguish e.g. "you
    tried to send an event for a nonexistent instance" from "the instance
    you specified hasn't landed on a host yet".

    This fix gets rid of that sweeping 404 condition, so if *any* subset of
    the events are dropped (including *all* of them), the HTTP response will
    always be 207, and the payload will always contain granular per-event
    error codes.

    This effectively means the API can no longer return 404, ever.

    Closes-Bug: #1855752
    Change-Id: Ibad1b51e2cf50d00102295039b6e82bc00bec058

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Eric Fried (efried) wrote :

Ah rats, wish I had known about that other bug.

Since this one is already marked as Fix Released, I'm going to dup that one to this, okay?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.