Activity log for bug #1861067

Date Who What changed Old value New value Message
2020-01-28 07:17:49 Yang Youseok bug added bug
2020-01-28 07:18:17 Yang Youseok description
Old value: identical to the new value below, except that the opening phrase reads "For ocata/stein" instead of "For stable/ocata".
New value:
For stable/ocata, we hit a serious scheduler problem that forced us to upgrade to a later release. I could not find any existing report for it, so I am leaving this here for whoever hits the issue later. The problem we encountered is as follows:
- The conductor tries to schedule two instances onto the same compute node.
- At that point nova-compute still has enough resources in compute_nodes, so the scheduler chooses that host.
- The resource tracker in nova-compute claims the resources against placement.
- Placement answers one of the requests with 409 Conflict, because several requests arrived concurrently.
- [BUG here] The resource tracker in nova-compute ignores the return code from placement, so the allocation is increased by only one instance's share.
- After that, compute_nodes on the scheduler side is full, but the allocation in placement still shows free capacity.
- [User meets the weirdness here] Because placement still shows free capacity, further instances can be scheduled onto a compute node that is actually full. The result is an over-provisioned compute node.
- OOM occurs. (We were tight on memory; under a different resource policy an admin would see a different side effect.)
I found that this is already fixed in Pike and later, where the scheduler makes the allocation first and nova-compute only checks compute_nodes. But it was very hard for me to find the root cause, and it took a lot of digging through scheduler history, so I hope this is helpful to anyone who meets this problem. I am not sure it should be fixed, since Ocata is quite old, but it could be fixed by changing _allocate_for_instance() in nova/scheduler/client/report.py to catch the 409 conflict, similar to the later-added put_allocations() (see the sketch after this log). Thanks.
2020-04-16 06:55:51 Balazs Gibizer nova: status New Confirmed
2020-04-16 06:56:15 Balazs Gibizer nominated for series nova/ocata
2020-04-16 06:56:15 Balazs Gibizer bug task added nova/ocata
2020-04-16 06:56:23 Balazs Gibizer nova/ocata: status New Confirmed
2020-04-16 06:56:28 Balazs Gibizer nova: status Confirmed Invalid
2020-04-16 06:56:39 Balazs Gibizer nova/ocata: importance Undecided Low
2020-04-16 06:56:50 Balazs Gibizer tags placement scheduler
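
The fix suggested in the description can be sketched as follows. This is a minimal illustration in the spirit of the later-added put_allocations(): retry the placement PUT on 409 Conflict instead of silently dropping it. The client object, function, and exception names here are illustrative assumptions, not nova's actual API.

    # A minimal sketch of the suggested fix: treat 409 Conflict from
    # placement as a retryable failure instead of ignoring it. The
    # `client` argument and helper names are illustrative assumptions.

    import time

    MAX_RETRIES = 3


    class AllocationConflict(Exception):
        """Placement kept returning 409 Conflict after all retries."""


    def claim_allocation(client, consumer_uuid, allocation_payload):
        """PUT /allocations/{consumer_uuid}, retrying on 409 Conflict.

        A 409 usually means a concurrent request changed the resource
        provider's generation, so backing off and retrying is safe.
        Assumes `client.put()` returns a requests-style response object.
        """
        for attempt in range(MAX_RETRIES):
            resp = client.put('/allocations/%s' % consumer_uuid,
                              json=allocation_payload)
            if resp.status_code == 204:
                return  # allocation recorded; placement stays in sync
            if resp.status_code != 409:
                resp.raise_for_status()  # a real error, surface it
            # Conflict from a concurrent writer: back off briefly, retry.
            time.sleep(0.1 * (attempt + 1))
        # Without raising here the caller would silently under-count the
        # allocation, which is exactly the over-provisioning described.
        raise AllocationConflict(consumer_uuid)

Raising on exhausted retries is the important part: the bug described above exists precisely because the Ocata-era resource tracker returned as if the claim had succeeded, leaving placement's allocations lower than the node's real usage.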