maximum recursion possible while setting aggregates in placement

Bug #1804453 reported by Chris Dent on 2018-11-21
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Chris Dent

Bug Description

Under unusual circumstances, the _ensure_aggregate code in objects/resource_provider.py can hit a maximum recursion depth error, because it calls itself whenever it gets a DBDuplicateEntry error.

http://logs.openstack.org/84/602484/30/check/placement-perfload/8a8642e/controller/logs/screen-placement-api.txt.gz#_Nov_21_13_05_03_661629

http://logs.openstack.org/84/602484/30/check/placement-perfload/8a8642e/controller/logs/screen-placement-api.txt.gz#_Nov_21_13_05_03_654874

" ERROR placement.fault_wrap [None req-5fc62d1e-a1bd-47e3-a61e-45e01281fed3 None None] Placement API unexpected error: maximum recursion depth exceeded while getting the str of an object: RuntimeError: maximum recursion depth exceeded while getting the str of an object"

The "getting the str" part appears to be a coincidence based on reaching a bad stack depth at that particular moment.

This happened while the placeload script was doing its thing of adding aggregates to 1000 resource providers using asyncio, so concurrency is high and weird. See https://review.openstack.org/#/c/602484/ for the code that caused this.

It is unlikely that this is going to happen in the real world, but it is the sort of thing it would be nice to be more robust about, perhaps by counting attempts and bailing out?
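A minimal sketch of the "count attempts and bail out" idea, not the actual placement code: the store, the exception classes and the max_attempts parameter are all hypothetical stand-ins (DuplicateEntry plays the role of DBDuplicateEntry). It replaces the self-call with a bounded loop, so a lost race costs one more iteration rather than one more stack frame.

```python
class DuplicateEntry(Exception):
    """Hypothetical stand-in for the DBDuplicateEntry error."""


class MaxRetriesExceeded(Exception):
    """Raised when the aggregate could not be ensured in max_attempts tries."""


class InMemoryStore:
    """Toy store (an assumption for illustration): the first `races`
    create calls fail, as if another writer had inserted the row first."""

    def __init__(self, races=0, visible_after_race=True):
        self._ids = {}
        self._races = races
        self._visible = visible_after_race

    def get_id(self, agg_uuid):
        return self._ids.get(agg_uuid)

    def create(self, agg_uuid):
        if self._races > 0:
            self._races -= 1
            if self._visible:
                # The competing writer's row becomes readable.
                self._ids.setdefault(agg_uuid, len(self._ids) + 1)
            raise DuplicateEntry(agg_uuid)
        self._ids[agg_uuid] = len(self._ids) + 1
        return self._ids[agg_uuid]


def ensure_aggregate(store, agg_uuid, max_attempts=10):
    """Return the internal id for agg_uuid, creating it if needed.

    A bounded loop instead of recursion: each lost race triggers another
    read, and after max_attempts tries the caller gets a clear error.
    """
    for _ in range(max_attempts):
        agg_id = store.get_id(agg_uuid)
        if agg_id is not None:
            return agg_id
        try:
            return store.create(agg_uuid)
        except DuplicateEntry:
            # Someone else created it between our read and our write;
            # go around again and read the id they inserted.
            continue
    raise MaxRetriesExceeded(
        "gave up on %s after %d attempts" % (agg_uuid, max_attempts))
```

With a limit like this, the pathological case surfaces as an explicit error that the API layer could turn into a retryable response, instead of a RuntimeError from deep in the stack.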

Chris Dent (cdent) wrote :

This proved to be a significant issue while working on https://review.openstack.org/#/c/619248/ , a performance measuring job that uses placeload, which in turn uses aiohttp to make concurrent connections to placement. At high concurrency _ensure_aggregate loops a great deal and blocks the server enough that the client starts experiencing errors because it cannot make a good connection.

I fixed it by preheating the aggregates so that _ensure_aggregate almost always returns after getting the aggregate id, rather than looping to try to create it. With that in place things are very smooth.
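To illustrate why preheating helps, here is a toy asyncio model. It is an assumption-laden sketch, not placeload or placement code: ToyPlacement, set_aggregates and the create_attempts counter are hypothetical names. It counts how many callers fall into the "try to create" path with and without a single warm-up request made before the concurrent burst.

```python
import asyncio


class ToyPlacement:
    """Toy server model: counts how often a caller tries to create an
    aggregate because it did not see one on its initial read."""

    def __init__(self):
        self.aggregates = {}
        self.create_attempts = 0

    async def set_aggregates(self, provider, agg_uuids):
        for agg in agg_uuids:
            if agg not in self.aggregates:
                self.create_attempts += 1
                # Yield between the read and the write so that
                # concurrent callers can race, as they would on a DB.
                await asyncio.sleep(0)
                self.aggregates.setdefault(agg, len(self.aggregates) + 1)


async def run(preheat, provider_count=100):
    server = ToyPlacement()
    aggs = ["agg-1", "agg-2"]
    providers = ["rp-%d" % i for i in range(provider_count)]
    if preheat:
        # One sequential request creates the aggregates up front.
        await server.set_aggregates(providers[0], aggs)
        providers = providers[1:]
    await asyncio.gather(
        *(server.set_aggregates(p, aggs) for p in providers))
    return server.create_attempts


cold = asyncio.run(run(preheat=False))
warm = asyncio.run(run(preheat=True))
print("create attempts without preheating:", cold)
print("create attempts with preheating:", warm)
```

In the warm run only the initial sequential request ever reaches the create path; every concurrent caller finds the aggregate on its first read, which matches the "almost always returns after getting the aggregate id" behaviour described above.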

That experience suggests we should fix this, because it seems likely that operators might like to do mass aggregate management and use asyncio-based tools to do it, or maybe something from languages like go where similar behaviour might happen.

Chris Dent (cdent) wrote :

Resolved in https://review.openstack.org/#/c/624144/

Not sure if the updates from gerrit are slow or I'm impatient, but

Changed in nova:
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Chris Dent (cdent)