OpenStack Compute (nova)

maximum recursion possible while setting aggregates in placement

Bug #1804453 reported by Chris Dent on 2018-11-21

This bug affects 1 person

Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Medium      Chris Dent

Bug Description

Under unusual circumstances, the _ensure_aggregate code in objects/resource_provider.py can hit a maximum recursion depth error, because it calls itself whenever it gets a DBDuplicateEntry error.

http://logs.openstack.org/84/602484/30/check/placement-perfload/8a8642e/controller/logs/screen-placement-api.txt.gz#_Nov_21_13_05_03_661629

http://logs.openstack.org/84/602484/30/check/placement-perfload/8a8642e/controller/logs/screen-placement-api.txt.gz#_Nov_21_13_05_03_654874

" ERROR placement.fault_wrap [None req-5fc62d1e-a1bd-47e3-a61e-45e01281fed3 None None] Placement API unexpected error: maximum recursion depth exceeded while getting the str of an object: RuntimeError: maximum recursion depth exceeded while getting the str of an object"

The "getting the str" part appears to be a coincidence based on reaching a bad stack depth at that particular moment.

This happened while the placeload script was doing its thing of adding aggregates to 1000 resource providers using asyncio, so concurrency is high and weird. See https://review.openstack.org/#/c/602484/ for the code that caused this.

It is unlikely that this is going to happen in the real world, but it is the sort of thing it would be nice to be more robust about, perhaps by counting attempts and bailing out?
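The "count attempts and bail out" idea could look roughly like the sketch below. It is not the code from the eventual fix. The names ensure_aggregate, lookup, create, DuplicateEntry and TooManyAttempts are all hypothetical; DuplicateEntry stands in for DBDuplicateEntry, and lookup and create stand in for whatever database calls the real method makes.

```python
class DuplicateEntry(Exception):
    """Hypothetical stand-in for the DBDuplicateEntry error."""


class TooManyAttempts(Exception):
    """Raised when the aggregate could not be found or created in time."""


def ensure_aggregate(lookup, create, agg_uuid, max_attempts=10):
    """Return the id for agg_uuid, creating the row if it is missing.

    Uses a bounded loop instead of recursion, so a long run of
    duplicate-entry races ends in a clear error rather than a
    maximum recursion depth error.
    """
    for _ in range(max_attempts):
        agg_id = lookup(agg_uuid)
        if agg_id is not None:
            return agg_id
        try:
            return create(agg_uuid)
        except DuplicateEntry:
            # Someone else created the row between our lookup and our
            # create. Go around again and look it up.
            continue
    raise TooManyAttempts(
        "gave up on aggregate %s after %d attempts" % (agg_uuid, max_attempts)
    )
```

The caller decides what to do with TooManyAttempts, for example turning it into an error response that the client can retry.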

Chris Dent (cdent) wrote on 2018-11-21:

This proved to be a significant issue while working on https://review.openstack.org/#/c/619248/ , a performance measuring job that uses placeload. That uses aiohttp to make concurrent connections to placement. At high concurrency _ensure_aggregate loops a great deal and causes the server to block enough that the client starts experiencing errors because it cannot make a good connection.

I fixed it by preheating the aggregates so that _ensure_aggregate almost always returns after getting the aggregate id, rather than looping to try to create it. With that in place things are very smooth.
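One way such preheating could be done on the client side is sketched below. This is an assumption, not what placeload actually does: preheat_then_load is a made-up name, and set_aggregates is a hypothetical coroutine that associates a list of aggregate uuids with one resource provider. The first request runs on its own, so the aggregates get created without contention, and only then do the remaining requests run concurrently.

```python
import asyncio


async def preheat_then_load(set_aggregates, provider_uuids, aggregate_uuids):
    """Warm up the aggregates with one request, then go concurrent.

    set_aggregates is a hypothetical coroutine taking
    (provider_uuid, aggregate_uuids).
    """
    if not provider_uuids:
        return
    first, rest = provider_uuids[0], provider_uuids[1:]
    # A single request, awaited alone, so each aggregate is created
    # once with nobody racing against it.
    await set_aggregates(first, aggregate_uuids)
    # The remaining requests run concurrently and should now find the
    # aggregates already in place.
    await asyncio.gather(
        *(set_aggregates(rp, aggregate_uuids) for rp in rest)
    )
```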

That experience suggests we should fix this, because it seems likely that operators might like to do mass aggregate management and use asyncio-based tools to do it, or maybe something written in a language like Go where similar behaviour might happen.

Chris Dent (cdent) wrote on 2018-12-10:

Resolved in https://review.openstack.org/#/c/624144/

Not sure if the updates from gerrit are slow or I'm impatient, but

Changed in nova:
importance:	Undecided → Medium
status:	New → In Progress
assignee:	nobody → Chris Dent (cdent)

Chris Dent (cdent) wrote on 2019-03-04:

There's a new bug related to this, and the fix described here was committed, so going to mark it as such to avoid confusion.

https://bugs.launchpad.net/nova/+bug/1818498

Changed in nova:
status:	In Progress → Fix Committed

Balazs Gibizer (balazs-gibizer) on 2020-11-06

Changed in nova:
status:	Fix Committed → Fix Released
