Launchpad itself

Mutated items in launchpadlib search results iterator can break iteration

Bug #541637 reported by Martin Pool on 2010-03-19

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Triaged	Low	Unassigned
	launchpadlib	Triaged	Low	Unassigned

Bug Description

We want to standardize away from using Triaged, so I have been running:

>>> for t in p.searchTasks(status="Triaged"): flip_task(t)
...

where flip_task changes the status and saves the task.

I would have thought this would update all the tasks, but for some reason it does not!

Is this a kind of unstable iterator where you're not allowed to change things while using it? That seems pretty strange for a web api.

Tags:

Revision history for this message

Martin Pool (mbp) wrote on 2010-03-19:

I guess the actual requests are batched, and the batching is not stable.

However it does seem very poor that by reading all the batches one after the other, you miss out on a lot of data. Perhaps missing a few that are eg inserted during iteration would be ok, but missing a lot that existed the whole time seems just wrong. Perhaps it would have done better if I'd specified a sort order?

Revision history for this message

Gary Poster (gary) wrote on 2010-03-19:

I'm inclined to change the title of this bug to the following.

Mutating results of iterator can produce incorrect results in iterator

I'd then mark as "Wont Fix" at worst or "Triaged: Low" at best. While not ideal, this is not surprising behavior to me, and pretty typical of Python iterators.

I can think of two arguments that would change my mind. Maybe there are more. :-)

1) list(iterator) gives the wrong answer.

2) you can't know if list(iterator) will be too big or problematic in some other way.

This also should have Leonard's weigh-in, which won't happen till next week.

affects:	malone → launchpad-foundations
Changed in launchpad-foundations:
status:	New → Incomplete
Changed in launchpadlib:
status:	New → Incomplete

Revision history for this message

Martin Pool (mbp) wrote on 2010-03-22: Re: [Bug 541637] Re: searchTasks doesn't return all matching tasks?!

On 20 March 2010 00:22, Gary Poster <email address hidden> wrote:
> I'm inclined to change the title of this bug to the following.
>
> Mutating results of iterator can produce incorrect results in
> iterator
>
> I'd then mark as "Wont Fix" at worst or "Triaged: Low" at best. While
> not ideal, this is not surprising behavior to me, and pretty typical of
> Python iterators.

That's true. It's only surprising to me because my model (which may
well be wrong) is that each rpc returned a new batch of objects which
were essentially just buffered on the client. In other words, it's
more like objects from a file in sequence and then mutating them than
it is like mutating a dict while you're iterating it.

> I can think of two arguments that would change my mind. Maybe there are
> more. :-)
>
> 1) list(iterator) gives the wrong answer.
>
> 2) you can't know if list(iterator) will be too big or problematic in
> some other way.
>
> This also should have Leonard's weigh-in, which won't happen till next
> week.

Is the problem here that I'm mutating the list, or could it also
happen if they were being changed by someone else on Launchpad? If
the latter there is no way any particular client could avoid it.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Gary Poster (gary) wrote on 2010-03-22: Re: searchTasks doesn't return all matching tasks?!

> Is the problem here that I'm mutating the list, or could it also
> happen if they were being changed by someone else on Launchpad? If
> the latter there is no way any particular client could avoid it.

I suspect that it is generically fragile, yes. That's a compelling argument. ``list(iterable)`` would reduce the risk but it would just make the time of the vulnerability smaller, not make it disappear.

Leonard, could you confirm that Martin's scenario (anyone could break the iteration) is correct?

If that's the case, are there pre-existing ideas on how to make this better, or do you have something that comes to mind quickly?

For now, I'm marking this Triaged: Low, and changing the title as I mentioned, but also adding this to the list of bugs to consider for the API clean-up after the performance work.

Gary

Gary Poster (gary) on 2010-03-22

Changed in launchpad-foundations:
status:	Incomplete → Triaged
Changed in launchpadlib:
status:	Incomplete → Triaged
Changed in launchpad-foundations:
importance:	Undecided → Low
Changed in launchpadlib:
importance:	Undecided → Low
summary:	- searchTasks doesn't return all matching tasks?! + Mutated items in launchpadlib search results iterator can break + iteration

Revision history for this message

Leonard Richardson (leonardr) wrote on 2010-03-22:

Yes, Martin has the right idea. When you iterate over a lazr.restfulclient collection it makes a series of HTTP requests grabbing different chunks of the data. So you're effectively making the same SELECT statement with LIMIT=50, LIMIT=51,100, etc. (I don't remember the exact SQL syntax, but you get the idea.) If the database changes between requests you will silently miss or repeat entries. On the server side I could see a database connection keeping open a view of the database as it was when the transaction started, and iterating over that view. But here every SELECT statement runs in a totally different context.

I think we can *detect* this problem fairly easily by adding ETags to collections and asking for different pages using If-Match=[collection ETag]. But the 412 you get in that situation will be the same as when you call lp_save() and discovered someone else just saved the same object. You won't have any option other than to restart your operation from the beginning. And when you're talking about hundreds of bugs, that sucks.

I can think of ways to avoid the iterator (eg. create a server-side 'set of bugs' object representing the result of your query, and then modify that set with a single request, allowing the iterator to run on the server side), but they're significantly more complex.

As a stupid workaround you can set a huge page size and get all the Triaged bugs at once. Since you're stopping the bugs from being Triaged, you could also repeatedly get the first page of Triaged bugs and set that page.

Sorry that I don't have more to offer.

Revision history for this message

Martin Pool (mbp) wrote on 2010-03-22: Re: [Bug 541637] Re: Mutated items in launchpadlib search results iterator can break iteration

I think I would like the client to say "get me the first 50 bugs in
order by bug number" then "get me the next 50 with id >
previous_result[-1].id" and so on up. That should be stable: I might
miss new things inserted into ranges I've already seen but that's
entirely reasonable.

I believe at the moment there is an orderBy parameter and I can't see
how it would work with this. Also I suppose the current api exposes a
"start at row" not "start at id."

> As a stupid workaround you can set a huge page size and get all the
> Triaged bugs at once.

How would I do that?

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Stuart Bishop (stub) wrote on 2010-03-24: Re: [Bug 541637] Re: Mutated items in launchpadlib search results iterator can break iteration

On Tue, Mar 23, 2010 at 4:10 AM, Martin Pool <email address hidden> wrote:
> I think I would like the client to say "get me the first 50 bugs in
> order by bug number" then "get me the next 50 with id >
> previous_result[-1].id" and so on up. That should be stable: I might
> miss new things inserted into ranges I've already seen but that's
> entirely reasonable.

This is a good approach when results are ordered by a unique, unchangeable key. It is also more efficient at the database end (as 'WHERE id > x ORDER BY id LIMIT y' is faster than 'ORDER BY id OFFSET z LIMIT y' when you get to higher offsets).

I wonder if we could special case things when results are ordered by id?

Materializing the resultset on the server and reusing it for multiple client requests can be problematic, for example materializing several million rows of results but only ever retrieving the first 50 is very inefficient.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message

Martin Pool (mbp) wrote on 2010-03-24: Re: [Bug 541637] Re: Mutated items in launchpadlib search results iterator can break iteration

On 24 March 2010 22:21, Stuart Bishop <email address hidden> wrote:
> On Tue, Mar 23, 2010 at 4:10 AM, Martin Pool <email address hidden> wrote:
>> I think I would like the client to say "get me the first 50 bugs in
>> order by bug number" then "get me the next 50 with id >
>> previous_result[-1].id" and so on up. That should be stable: I might
>> miss new things inserted into ranges I've already seen but that's
>> entirely reasonable.
>
> This is a good approach when results are ordered by a unique,
> unchangeable key. It is also more efficient at the database end (as
> 'WHERE id > x ORDER BY id LIMIT y' is faster than 'ORDER BY id OFFSET z
> LIMIT y' when you get to higher offsets).
>
> I wonder if we could special case things when results are ordered by id?

I almost think there is no point supporting other orders if in the
very plausible case of someone else editing bugs while you're
iterating you may lose many results.

However, it's also plausible that someone would want to know about the
50 hottest bugs in Ubuntu and we shouldn't make them retrieve all the
bugs and do a client-side sort.

Perhaps it should be special cased when ordered by id, id should be
made the default (if it's not already) and there should be a caution
about this in the api user documentation.

> Materializing the resultset on the server and reusing it for multiple
> client requests can be problematic, for example materializing several
> million rows of results but only ever retrieving the first 50 is very
> inefficient.

Right, letting users allocate arbitrary server-side storage seems poor.

Perhaps we should just document that people can avoid this by asking
for a larger batch. It doesn't seem unreasonable to me that you could
get a few hundred bugs in a single download; it may be much faster
too.

--
Martin <http://launchpad.net/~mbp/>

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.