I spent some time investigating that issue today; here is what I found so far.
In the nova-compute logs of the two failed cases I checked, the sequence of events was:
Nov 23 19:22:05.162249 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelving
Nov 23 19:22:09.112566 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelve offloading
Nov 23 19:22:12.795123 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-7861da7e-4733-4cda-9a8d-aa569ef301f6 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Unshelving
After the "Unshelving" log in nova-compute, the neutron server logs show that the port was switched to ACTIVE, but no notification was sent to nova afterwards (I don't know why yet):
Nov 23 19:22:42.686159 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.plugins.ml2.rpc [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Device 46ceed9e-1262-47c2-b7dc-335a31f78b71 up at agent ovs-agent-ubuntu-focal-ovh-bhs1-0027471441 {{(pid=95486) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}
...
Nov 23 19:22:42.811808 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.db.provisioning_blocks [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Provisioning complete for port 46ceed9e-1262-47c2-b7dc-335a31f78b71 triggered by entity L2. {{(pid=95486) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:139}}
Nov 23 19:22:42.812114 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron_lib.callbacks.manager [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Publish callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned-3745000'] for port, provisioning_complete {{(pid=95486) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:175}}
I compared this with a case where the test passed; there, the nova-compute logs show:
Dec 22 09:08:21.469791 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelving
Dec 22 09:08:24.621978 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelve offloading
But unshelving happened on a different node:
Dec 22 09:08:28.417211 ubuntu-focal-iweb-mtl01-0027794526 nova-compute[91278]: INFO nova.compute.manager [None req-dfd80af1-fdae-412b-bf4c-ae17882c70d8 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Unshelving
In that case the notification was sent properly because the port binding changed.
So I think the problem (possibly a race condition) occurs when the instance is unshelved on the same node where it was shelved.
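To illustrate the suspected mechanism, here is a hypothetical, simplified model of when neutron would emit the "network-vif-plugged" external event to nova. This is not the actual neutron notifier code; the function name and statuses are illustrative only. The idea is that an event fires only on an observable port status transition, so if the port is re-plugged on the same host before its status is ever recorded as DOWN, no transition is seen and nova waits for an event that never comes:

```python
def should_notify_nova(prev_status: str, new_status: str) -> bool:
    """Hypothetical check: notify nova only on a visible transition to ACTIVE."""
    return prev_status != new_status and new_status == "ACTIVE"

# Unshelve on a different host: the old binding is dropped, the port is
# first seen as DOWN, so the DOWN -> ACTIVE transition triggers an event.
assert should_notify_nova("DOWN", "ACTIVE") is True

# Suspected same-host case: the port record may still read ACTIVE when
# provisioning completes, so no transition is observed and no event is sent.
assert should_notify_nova("ACTIVE", "ACTIVE") is False
```

If this model matches the real behavior, it would explain why only the same-host unshelve path loses the notification.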
I will continue investigating this.