MarkerNotFound when limit>num(instances) and marker starts in cell0

Bug #1689692 reported by zhang wenjian
This bug affects 1 person
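+--------------------------+---------------+------------+----------------+-----------+
| Affects                  | Status        | Importance | Assigned to    | Milestone |
+--------------------------+---------------+------------+----------------+-----------+
| OpenStack Compute (nova) | Fix Released  | High       | Matt Riedemann |           |
| Newton                   | Invalid       | High       | Unassigned     |           |
| Ocata                    | Fix Committed | High       | Matt Riedemann |           |
+--------------------------+---------------+------------+----------------+-----------+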

Bug Description

In my Ocata/RDO environment (nova version 15.0.3), listing instances with a marker and a limit sometimes fails with "marker not found".

More details of my operation steps:

First, all instances are listed here without limit&marker:

[root@host015 astute(keystone_admin)]# nova list --sort created_at:desc
+--------------------------------------+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+----------+
| 0e02233c-6c73-4bbe-bede-299ba41f44c3 | 11 | ERROR | - | NOSTATE | |
| f6347ddb-e870-447f-8f14-3e3b57a610f2 | 11 | ERROR | - | NOSTATE | |
| 5de9524f-1167-4ccb-b13c-13acf5435ead | 11 | ERROR | - | NOSTATE | |
| 0548ebed-f0e4-4233-acf3-0339c4802f0d | 11 | ERROR | - | NOSTATE | |
| 2c9ee616-eab9-4a4a-af3c-79f858c571d5 | 11 | ERROR | - | NOSTATE | |
| 9aab5d6a-f5bd-459b-bb25-c04bc56efcf0 | 11 | ERROR | - | NOSTATE | |
| f9ca1f1c-01f3-41a5-a68b-63f01fd87081 | 11 | ERROR | - | NOSTATE | |
| dbb3955a-c768-4883-aae4-f3143f7b3a51 | 11 | ERROR | - | NOSTATE | |
| a587dc5c-54c8-432b-9e38-174aae5e848c | 11 | ERROR | - | NOSTATE | |
| 609ba5ca-bc49-4de5-be7a-16aab8fcb6d2 | 11 | ERROR | - | NOSTATE | |
| b42e32e5-2aaa-46ee-b0bf-b08f29867af1 | 11 | ERROR | - | NOSTATE | |
+--------------------------------------+------+--------+------------+-------------+----------+

Then I list with the first instance ID as the marker and a limit of 3, which works:

[root@host015 astute(keystone_admin)]# nova list --sort created_at:desc --limit 3 --marker 0e02233c-6c73-4bbe-bede-299ba41f44c3
+--------------------------------------+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+----------+
| f6347ddb-e870-447f-8f14-3e3b57a610f2 | 11 | ERROR | - | NOSTATE | |
| 5de9524f-1167-4ccb-b13c-13acf5435ead | 11 | ERROR | - | NOSTATE | |
| 0548ebed-f0e4-4233-acf3-0339c4802f0d | 11 | ERROR | - | NOSTATE | |
+--------------------------------------+------+--------+------------+-------------+----------+

Then I list with another instance ID as the marker, again with a limit of 3, and it fails:

[root@host015 astute(keystone_admin)]# nova list --sort created_at:desc --limit 3 --marker a587dc5c-54c8-432b-9e38-174aae5e848c
ERROR (BadRequest): marker [a587dc5c-54c8-432b-9e38-174aae5e848c] not found (HTTP 400) (Request-ID: req-308371f4-2962-4f3f-8d4c-69bf8c19664f)

Is that because there are fewer instances after the marker than the limit?
When I set the limit to 2, it works:

[root@host015 astute(keystone_admin)]# nova list --sort created_at:desc --limit 2 --marker a587dc5c-54c8-432b-9e38-174aae5e848c
+--------------------------------------+------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+----------+
| 609ba5ca-bc49-4de5-be7a-16aab8fcb6d2 | 11 | ERROR | - | NOSTATE | |
| b42e32e5-2aaa-46ee-b0bf-b08f29867af1 | 11 | ERROR | - | NOSTATE | |
+--------------------------------------+------+--------+------------+-------------+----------+

My question: why does the limit not work when there are not enough instances after the marker?

Matt Riedemann (mriedem) wrote :

Here is the problem:

All of your instances are in ERROR state so they are in cell0 and we'll pull them from cell0 here:

https://github.com/openstack/nova/blob/15.0.3/nova/compute/api.py#L2466

We got 2 back from cell0 but limit was 3, so we make limit=1 here:

https://github.com/openstack/nova/blob/15.0.3/nova/compute/api.py#L2474

Since we still have more in the limit, we check the cells:

https://github.com/openstack/nova/blob/15.0.3/nova/compute/api.py#L2481

The marker was in cell0, so we are not going to find it in the main cell (cell1). That means we do not find any instances in the other cells and eventually raise MarkerNotFound here:

https://github.com/openstack/nova/blob/15.0.3/nova/compute/api.py#L2596

We should set the marker to None when we pulled instances out of the cell0 database so we don't attempt to use a marker in the other cells.
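
The walk-through above can be illustrated with a minimal sketch. This is not the nova code: `page_cell`, `list_instances`, the `reset_marker` flag, and the local `MarkerNotFound` class are hypothetical stand-ins, and each cell database is modelled as a plain, already-sorted list of instance IDs.

```python
class MarkerNotFound(Exception):
    """Hypothetical stand-in for nova's MarkerNotFound exception."""


def page_cell(instances, limit, marker=None):
    """Return up to `limit` IDs that follow `marker` within one cell."""
    start = 0
    if marker is not None:
        if marker not in instances:
            raise MarkerNotFound(marker)
        start = instances.index(marker) + 1
    return instances[start:start + limit]


def list_instances(cell0, other_cells, limit, marker=None, reset_marker=True):
    """Page through cell0 first, then the other cells, up to `limit`.

    reset_marker=False models the reported behaviour; reset_marker=True
    models the proposed fix (drop the marker once cell0 has matched it).
    """
    try:
        results = page_cell(cell0, limit, marker)
        if reset_marker:
            # The marker was found in cell0, so no other cell contains it.
            marker = None
    except MarkerNotFound:
        # The marker is not in cell0; keep it for the other cells.
        results = []

    remaining = limit - len(results)
    if remaining <= 0:
        return results

    from_cells = []
    for cell in other_cells:
        try:
            page = page_cell(cell, remaining, marker)
        except MarkerNotFound:
            # Keep looking for the marker in later cells.
            continue
        from_cells.extend(page)
        marker = None
        remaining -= len(page)
        if remaining <= 0:
            break

    if marker is not None and not from_cells:
        # A marker was still required but no cell matched it.
        raise MarkerNotFound(marker)
    return results + from_cells
```

With eleven instances in cell0, an empty cell1, and the ninth instance as the marker, a limit of 2 succeeds in both modes, while a limit of 3 raises `MarkerNotFound` only when `reset_marker=False`, which matches the behaviour in the report.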

Changed in nova:
status: New → Triaged
summary: - marker not found
+ MarkerNotFound when limit>num(instances) and marker starts in cell0
Changed in nova:
importance: Undecided → High
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/468549

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/468559

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/468549
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fe0cf0fe047f9e8890170a90c48594d90e73bda5
Submitter: Jenkins
Branch: master

commit fe0cf0fe047f9e8890170a90c48594d90e73bda5
Author: Matt Riedemann <email address hidden>
Date: Fri May 26 17:21:30 2017 -0400

    Add recreate functional test for regression bug 1689692

    When paging through instances, if the marker is found in cell0
    and there are more instances under the limit, we continue paging
    through the cell(s) to fill the limit. However, since the marker
    was found in cell0 it's not going to be in any other cell database
    so we'll end up failing with a marker not found error.

    This change adds a functional recreate test for the bug.

    The fix will build on this to show when the bug is fixed and the
    test will be changed to assert expected normal behavior.

    Change-Id: I234e0425e7e800b32cea78f5c1d99997bc03343f
    Partial-Bug: #1689692

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/469206

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/469207

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/468559
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dbaf80d2c94db074a6651c890d532a11baec8da0
Submitter: Jenkins
Branch: master

commit dbaf80d2c94db074a6651c890d532a11baec8da0
Author: Matt Riedemann <email address hidden>
Date: Fri May 26 17:48:10 2017 -0400

    Fix MarkerNotFound when paging and marker was found in cell0

    If we're paging over cells and the marker was found in cell0,
    we need to null it out so we don't attempt to lookup by marker
    from any other cells if there is more room in the limit.

    Change-Id: I8a957bebfcecd6ac712103c346e028d80f1ecd7c
    Closes-Bug: #1689692

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) wrote :

The compute API code is wrong in stable/newton, but we don't need to fix it there because we didn't actually populate cell0 in Newton. We started creating instances in cell0 in Ocata:

https://github.com/openstack/nova/commit/bcbfee183e74f696085fcd5c18aff333fc5f1403

So in Newton you'll always get a MarkerNotFound looking in cell0 since it's always empty.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/469206
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0316f63e56d2115f79880947cea828af29af37a6
Submitter: Jenkins
Branch: stable/ocata

commit 0316f63e56d2115f79880947cea828af29af37a6
Author: Matt Riedemann <email address hidden>
Date: Fri May 26 17:21:30 2017 -0400

    Add recreate functional test for regression bug 1689692

    When paging through instances, if the marker is found in cell0
    and there are more instances under the limit, we continue paging
    through the cell(s) to fill the limit. However, since the marker
    was found in cell0 it's not going to be in any other cell database
    so we'll end up failing with a marker not found error.

    This change adds a functional recreate test for the bug.

    The fix will build on this to show when the bug is fixed and the
    test will be changed to assert expected normal behavior.

    Change-Id: I234e0425e7e800b32cea78f5c1d99997bc03343f
    Partial-Bug: #1689692
    (cherry picked from commit fe0cf0fe047f9e8890170a90c48594d90e73bda5)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/469207
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3696a895321f743a824d1b89feb51eccfd07a332
Submitter: Jenkins
Branch: stable/ocata

commit 3696a895321f743a824d1b89feb51eccfd07a332
Author: Matt Riedemann <email address hidden>
Date: Fri May 26 17:48:10 2017 -0400

    Fix MarkerNotFound when paging and marker was found in cell0

    If we're paging over cells and the marker was found in cell0,
    we need to null it out so we don't attempt to lookup by marker
    from any other cells if there is more room in the limit.

    Change-Id: I8a957bebfcecd6ac712103c346e028d80f1ecd7c
    Closes-Bug: #1689692
    (cherry picked from commit dbaf80d2c94db074a6651c890d532a11baec8da0)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0b2

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.6

This issue was fixed in the openstack/nova 15.0.6 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Dan Smith (<email address hidden>) on branch: master
Review: https://review.openstack.org/505661
Reason: This isn't really needed now that we sort by default in the instance_list routine

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/505661
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=affb25ecef86537bfaebc69eb0af3b84e9cad4de
Submitter: Zuul
Branch: master

commit affb25ecef86537bfaebc69eb0af3b84e9cad4de
Author: Dan Smith <email address hidden>
Date: Wed Sep 20 07:20:32 2017 -0700

    Fix a pagination logic bug in test_bug_1689692

    This test attempts to list all instances, then list them again with the
    first instance as the marker and ensure that the remaining instances
    are returned in the page. Now that we are doing queries to cells in
    parallel, consecutive unsorted list queries can return things in
    different orders as the cells may reply at different times. The fix
    in this patch is to ask for results to be sorted, which is the only way
    it makes sense. It worked before purely because we were always scanning
    the cells linearly and in the same order.

    Related-Bug: #1689692
    Change-Id: I3ca2a167c902d565c36a5d5dbba1bf1c214aa20b
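
The point about sorting in the commit message above can be shown with a small sketch. It is an illustration only, not the nova implementation: `merge_cell_results`, `page_after`, and the sample records are hypothetical, and each cell is assumed to return its own results already sorted by `created_at`.

```python
import heapq
from operator import itemgetter


def merge_cell_results(cell_results, sort_key="created_at"):
    """Merge per-cell lists, each already sorted by sort_key, into one list."""
    return list(heapq.merge(*cell_results, key=itemgetter(sort_key)))


def page_after(instances, marker_id, limit):
    """Return up to `limit` records that follow the record with marker_id."""
    ids = [inst["id"] for inst in instances]
    start = ids.index(marker_id) + 1
    return instances[start:start + limit]


# Hypothetical sample data: two cells, each sorted by created_at.
cell_a = [{"id": "a", "created_at": 1}, {"id": "c", "created_at": 3}]
cell_b = [{"id": "b", "created_at": 2}, {"id": "d", "created_at": 4}]

# Unsorted listing: the combined order depends on which cell replies first,
# so the page after a given marker can differ between two requests.
reply_order_1 = cell_a + cell_b
reply_order_2 = cell_b + cell_a

# Sorted listing: the combined order is the same for either reply order,
# so the page after a given marker is stable.
sorted_1 = merge_cell_results([cell_a, cell_b])
sorted_2 = merge_cell_results([cell_b, cell_a])
```

In this sketch, `page_after(reply_order_1, "a", 3)` and `page_after(reply_order_2, "a", 3)` return different pages, while the same call on `sorted_1` and `sorted_2` returns identical pages.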
