Activity log for bug #1861067

Date Who What changed Old value New value Message
2020-01-28 07:17:49 Yang Youseok bug added bug
2020-01-28 07:18:17 Yang Youseok description
Old value: identical to the new value below, except that the opening phrase reads "For ocata/stein" instead of "For stable/ocata".
New value:
For stable/ocata, we hit a serious scheduler problem that forced us to upgrade to a later release. I could not find any existing report for it, so I am leaving this here for whoever hits the issue later. The problem we encountered is as follows:
- The conductor tries to schedule two instances onto the same compute node.
- At that point nova-compute still has enough resources in compute_nodes, so the scheduler chooses that host.
- The resource tracker in nova-compute claims the resources against placement.
- Placement answers one of the requests with 409 Conflict, because several requests arrived concurrently.
- [BUG here] The resource tracker in nova-compute ignores the return code from placement, so the allocation is increased by only one instance's share.
- After that, compute_nodes on the scheduler side is full, but the allocation in placement still shows free capacity.
- [User meets the weirdness here] Because placement still shows free capacity, further instances can be scheduled onto a compute node that is actually full. The result is an over-provisioned compute node.
- OOM occurs. (We were tight on memory; under a different resource policy an admin would see a different side effect.)
I found that this is already fixed in Pike and later, where the scheduler makes the allocation first and nova-compute only checks compute_nodes. But it was very hard for me to find the root cause, and it took a lot of digging through scheduler history, so I hope this is helpful to anyone who meets this problem. I am not sure it should be fixed, since Ocata is quite old, but it could be fixed by changing _allocate_for_instance() in nova/scheduler/client/report.py to catch the 409 conflict, similar to the later-added put_allocations() (see the sketch after this log). Thanks.
2020-04-16 06:55:51 Balazs Gibizer nova: status New Confirmed
2020-04-16 06:56:15 Balazs Gibizer nominated for series nova/ocata
2020-04-16 06:56:15 Balazs Gibizer bug task added nova/ocata
2020-04-16 06:56:23 Balazs Gibizer nova/ocata: status New Confirmed
2020-04-16 06:56:28 Balazs Gibizer nova: status Confirmed Invalid
2020-04-16 06:56:39 Balazs Gibizer nova/ocata: importance Undecided Low
2020-04-16 06:56:50 Balazs Gibizer tags placement scheduler
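
The fix suggested in the description can be sketched as follows. This is a minimal illustration in the spirit of the later-added put_allocations(): retry the placement PUT on 409 Conflict instead of silently dropping it. The client object, function, and exception names here are illustrative assumptions, not nova's actual API.

    # A minimal sketch of the suggested fix: treat 409 Conflict from
    # placement as a retryable failure instead of ignoring it. The
    # `client` argument and helper names are illustrative assumptions.

    import time

    MAX_RETRIES = 3


    class AllocationConflict(Exception):
        """Placement kept returning 409 Conflict after all retries."""


    def claim_allocation(client, consumer_uuid, allocation_payload):
        """PUT /allocations/{consumer_uuid}, retrying on 409 Conflict.

        A 409 usually means a concurrent request changed the resource
        provider's generation, so backing off and retrying is safe.
        Assumes `client.put()` returns a requests-style response object.
        """
        for attempt in range(MAX_RETRIES):
            resp = client.put('/allocations/%s' % consumer_uuid,
                              json=allocation_payload)
            if resp.status_code == 204:
                return  # allocation recorded; placement stays in sync
            if resp.status_code != 409:
                resp.raise_for_status()  # a real error, surface it
            # Conflict from a concurrent writer: back off briefly, retry.
            time.sleep(0.1 * (attempt + 1))
        # Without raising here the caller would silently under-count the
        # allocation, which is exactly the over-provisioning described.
        raise AllocationConflict(consumer_uuid)

Raising on exhausted retries is the important part: the bug described above exists precisely because the Ocata-era resource tracker returned as if the claim had succeeded, leaving placement's allocations lower than the node's real usage.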