HostManager host_state_map should have synchronized access
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Invalid
|
Undecided
|
Unassigned |
Bug Description
Reporting this as a bug to get some discussion rolling on how best to fix it (or whether to fix it). I kind of assume this is a well known problem, but since the code isn't covered with NOTEs about it, I thought perhaps best to throw it up here.
In nova's master branch as of Dec 2015 the HostState objects and host_state_map are not robust in the face of concurrent access. That concurrent access doesn't happen frequently under test situations is primarily the fault of the test situations and luck, not the correctness of the code. Unless I'm completely misunderstanding the way in which eventlet is being used, each of the following methods has the potential to yield on I/O (rpc calls, log messages) or sleeps (not used, but handy in debugging situations). This is not an exhaustive list, just what came up in my digging in nova/scheduler/
* HostManager.
* HostState.
* HostState.
If these yield at exactly the wrong time, an ongoing set of select_destinations calls will have some which have incorrect host information. As we know, this is often true anyway, but the suggested fix (below) is pretty lightweight, so perhaps a small improvement is worth it?
How I figured this out:
I was reading the code and realized that access to the HostState is not synchronized so tried to see if I could come up with a way to get incorrect information to show up at the scheduler when running just one scheduler, just one compute node and doing multiple "concurrent" server creates (calling three creates over the API in a backgrounding loop). It was pretty easy to report some incorrect RAM usage. To make things fail more reliably I introduced small sleeps into consume_
Adding a utils.synchronized decorator to consume_
I'm not clear on the implications of these changes: they make sense to me (we don't want to concurrent access to a shared data structure) but I wonder if perhaps there are other things to consider that I'm not aware of such as "well, actually, this stuff was supposed to be written so it never yielded, that they can or may is the bug" (in which case the methods need some phat warnings on them for future maintainers).
I'm also not clear on the semaphores that ought to be used (if synchronization is the way to go). In my POC solution get_all_host_states had its own semaphore while the other two methods shared one.
Beyond that, though, it might make sense for the host_state_map and the entire HostState object to be "thread" safe, to get all places where this could be a problem rather than piecemeal dealing with methods as curious people notice them.
Changed in nova: | |
status: | New → Invalid |
ndipanov points out that this should be addressed by: https:/ /review. openstack. org/#/q/ topic:bp/ host-state- level-locking, n,z