Resize and shelve server fail intermittently in the multinode CI jobs

Bug #1953478 reported by Slawek Kaplonski
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

I noticed failures of two tests in the Neutron CI multinode jobs. Both failures look similar to me at first glance, but if they turn out to be different issues, feel free to open a separate bug for one of them.
Failed tests:

tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server

and

tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert

Failure examples:

https://4d2664479333d0a7d727-a4adcb41e06ec456a00383225f090da6.ssl.cf5.rackcdn.com/786478/25/check/neutron-ovs-tempest-dvr-ha-multinode-full/bbf40b6/testr_results.html
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_061/804846/18/gate/neutron-ovs-tempest-multinode-full/061df45/testr_results.html
https://be2e92e10ead782aa651-35e07a4cf42cfaed2fcffa4bf0b16f1b.ssl.cf1.rackcdn.com/819253/1/check/neutron-ovs-tempest-dvr-ha-multinode-full/94dc22a/testr_results.html

Stacktrace:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_servers_negative.py", line 50, in tearDown
    self.server_check_teardown()
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 222, in server_check_teardown
    waiters.wait_for_server_status(cls.servers_client,
  File "/opt/stack/tempest/tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServersNegativeTestJSON:tearDown) Server 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: spawning.

and:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 419, in test_resize_server_revert
    waiters.wait_for_server_status(self.client, self.server_id, 'ACTIVE')
  File "/opt/stack/tempest/tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServerActionsTestJSON:test_resize_server_revert) Server f996ef6b-4417-4eb3-aeb7-9f66d8c4d2c5 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: REVERT_RESIZE. Current task state: resize_reverting.
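Both tracebacks come out of tempest's status waiter, which is essentially a poll loop with a deadline. A minimal sketch of that pattern (the `get_status` callable and its `(status, task_state)` return shape are illustrative assumptions; the real `waiters.wait_for_server_status` also inspects fault details and handles resize states):

```python
import time


class TimeoutException(Exception):
    pass


def wait_for_server_status(get_status, wanted_status, timeout=196, interval=2):
    """Poll a server until it reaches `wanted_status` with task state None.

    `get_status` is a hypothetical callable returning (status, task_state);
    raises TimeoutException with the current state, like the tempest errors
    quoted above, if the deadline passes first.
    """
    start = time.time()
    status, task_state = get_status()
    while status != wanted_status or task_state is not None:
        if time.time() - start >= timeout:
            raise TimeoutException(
                "Server failed to reach %s status and task state None "
                "within the required time (%s s). Current status: %s. "
                "Current task state: %s."
                % (wanted_status, timeout, status, task_state))
        time.sleep(interval)
        status, task_state = get_status()
```

So a server stuck in SHELVED_OFFLOADED/spawning or REVERT_RESIZE/resize_reverting never satisfies the loop condition and the waiter can only time out.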

Changed in nova:
importance: Undecided → High
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So I looked at the first reported failed job[1], where unshelving timed out. Here is the grepped output with the relevant logs [2].

Sequence of events from the nova-compute perspective:
1) unshelving starts
   (I don't see when the binding of the port happened, but it should be done by the compute at [3])
2) vif-plugged event received but treated as unexpected and ignored
3) nova starts waiting for the vif-plugged event
4) nova plugs the vif
5) nova times out waiting for the plugged event

Does neutron send the vif-plugged event at bind time instead of plug time in this case?!

[1] https://zuul.opendev.org/t/openstack/build/bbf40b69b30d42a194af50f60915f9cd/logs

[2] https://paste.opendev.org/show/811535/
[3] https://github.com/openstack/nova/blob/9f296d775d8f58fcbd03393c81a023268c7071cb/nova/compute/manager.py#L6675
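The race described above can be modeled very simply: an external event that arrives before the compute node starts waiting is dropped as "unexpected", so the subsequent wait can only time out. A toy sketch of that behavior (all class and method names are hypothetical, not nova's actual event machinery):

```python
class VifPluggedWaiter:
    """Toy model of nova-compute's handling of network-vif-plugged events.

    Events delivered before prepare_for_event() are logged as unexpected
    and discarded; only events delivered afterwards satisfy the wait.
    """

    def __init__(self):
        self.waiting = False
        self.matched = []   # events that arrived while we were waiting
        self.ignored = []   # events dropped as "unexpected"

    def on_event(self, event):
        if self.waiting:
            self.matched.append(event)
        else:
            # nova-compute logs these as "Received unexpected event ..."
            self.ignored.append(event)

    def prepare_for_event(self):
        self.waiting = True

    def wait(self):
        # True only if the event arrived after waiting started;
        # the real code blocks with a timeout instead of returning.
        return bool(self.matched)
```

With a bind-time notification, `on_event` fires before `prepare_for_event`, the event lands in `ignored`, and `wait()` never succeeds, matching the timeout in the logs.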

affects: nova → neutron
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I also looked at the 3rd failure reported in this bug[1]. Here are the relevant logs[2].
It seems that here neutron also sends the vif-plugged event at bind time and not at plug time, which causes nova to ignore the bind-time event and time out waiting for the plug-time event.

[1] https://zuul.opendev.org/t/openstack/build/94dc22a686014a8a8845b0d86fa7ba3e/logs
[2] https://paste.opendev.org/show/811539/

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I spent some time investigating this issue today, and here is what I found so far.

In the nova-compute logs of the two failed cases I checked, I noticed that the sequence of events was:

Nov 23 19:22:05.162249 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelving

Nov 23 19:22:09.112566 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelve offloading

Nov 23 19:22:12.795123 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-7861da7e-4733-4cda-9a8d-aa569ef301f6 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Unshelving

And after the "Unshelving" log in nova-compute, I saw in the neutron-server logs that the port was switched to active, but no notification was sent to nova then (I don't know why):

Nov 23 19:22:42.686159 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.plugins.ml2.rpc [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Device 46ceed9e-1262-47c2-b7dc-335a31f78b71 up at agent ovs-agent-ubuntu-focal-ovh-bhs1-0027471441 {{(pid=95486) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}
...
Nov 23 19:22:42.811808 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.db.provisioning_blocks [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Provisioning complete for port 46ceed9e-1262-47c2-b7dc-335a31f78b71 triggered by entity L2. {{(pid=95486) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:139}}
Nov 23 19:22:42.812114 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron_lib.callbacks.manager [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Publish callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned-3745000'] for port, provisioning_complete {{(pid=95486) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:175}}

I compared it with a case where the test passed; there, in the nova-compute logs, I can see:

Dec 22 09:08:21.469791 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelving

Dec 22 09:08:24.621978 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelve offloading

But unshelving happened on the other node:
Dec 22 09:08:28.417211 ubuntu-focal-iweb-mtl01-0027794...


Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I checked some more similar cases where that test passed, and sometimes unshelving happens on the same node as the shelve, so that is definitely not the cause of the issue.

So far I think that the issue is on the Neutron side, as it didn't send the proper notification to nova when the port came up after unshelving.
I will try to reproduce the issue locally, or at least add some more debug logs to neutron, to better understand what happens there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/822817

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/822818

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/822817
Committed: https://opendev.org/openstack/neutron-lib/commit/76b9a095918c02fa90e1cc6f1a481b3a40938b3b
Submitter: "Zuul (22348)"
Branch: master

commit 76b9a095918c02fa90e1cc6f1a481b3a40938b3b
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:55:40 2021 +0100

    Log resource id in the callbacks publish method

    It may help in debugging if we will know for what resource exactly
    callbacks are published.

    Related-Bug: #1953478
    Change-Id: Ia741d4d87d3a284add2f2c48f780d5332540fafe

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/822818
Committed: https://opendev.org/openstack/neutron/commit/34e560fad0b773d4b81032f876994d5340195600
Submitter: "Zuul (22348)"
Branch: master

commit 34e560fad0b773d4b81032f876994d5340195600
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
tags: added: neutron-proactive-backport-potential
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I dug into that issue again and I think I more or less know what is going on (though I'm not 100% sure why it's like that).
So in the failing case, the problem is that after the instance is shelved, the port is reported to be in DOWN state:

Jan 14 14:31:19.984862 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.rpc [None req-58bb688c-d96a-49ab-9dc8-a4845ca7d6fa None None] Device 63c13ed9-d3ad-409b-8412-134ae41c404b no longer exists at agent ovs-agent-ubuntu-focal-rax-ord-0028007551 {{(pid=70556) update_device_down /opt/stack/neutron/neutron/plugins/ml2/rpc.py:259}}
Jan 14 14:31:20.132185 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.plugin [None req-58bb688c-d96a-49ab-9dc8-a4845ca7d6fa None None] Current status of the port 63c13ed9-d3ad-409b-8412-134ae41c404b is: ACTIVE; New status is: DOWN {{(pid=70556) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2213}}

but just after that, and before nova actually does the unshelve of the VM, there is a notification from the DHCP agent that the port's provisioning is completed:

Jan 14 14:31:24.275380 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.db.provisioning_blocks [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Provisioning complete for port 63c13ed9-d3ad-409b-8412-134ae41c404b triggered by entity DHCP. {{(pid=70556) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:139}}
Jan 14 14:31:24.275680 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron_lib.callbacks.manager [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Publish callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned-1912498'] for port (63c13ed9-d3ad-409b-8412-134ae41c404b), provisioning_complete {{(pid=70556) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:176}}
Jan 14 14:31:24.804444 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.plugin [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Current status of the port 63c13ed9-d3ad-409b-8412-134ae41c404b is: DOWN; New status is: ACTIVE {{(pid=70556) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2213}}
Jan 14 14:31:24.804773 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron_lib.callbacks.manager [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Publish callbacks ['neutron.plugins.ml2.plugin.SecurityGroupDbMixin._ensure_default_security_group_handler-1790959'] for port (63c13ed9-d3ad-409b-8412-134ae41c404b), before_update {{(pid=70556) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:176}}
Jan 14 14:31:24.877831 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.notifiers.nova [-] Sending events: [{'server_uuid': 'a18584c2-4138-4c50-b426-f68b4c69b490', 'name': 'network-vif-plugged', 'status': 'completed', 'tag': '63c13ed9-d3ad-409b-8412-134ae41c404b'}] {{(pid=70556) send_events /opt/stack/neutron/neutron/notifiers/nova.py:262}}

And that triggers the sending of the notification to nova. But nova don't expe...

Changed in neutron:
status: New → Confirmed
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824981

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824982

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/824982
Committed: https://opendev.org/openstack/neutron/commit/1562f9141ba9c90c69c1c851dd9a50c8108ab15f
Submitter: "Zuul (22348)"
Branch: master

commit 1562f9141ba9c90c69c1c851dd9a50c8108ab15f
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jan 17 17:02:07 2022 +0100

    Call enable DHCP only if there are subnets with enabled DHCP in network

    In the configure_dhcp_for_network() method in the DHCP agent, we should
    call "enable" from the DHCP driver, and put ports as dhcp_ready_ports
    only if there are subnets with enabled DHCP service in that network.
    Before this patch it was done for network if there were ANY subnets
    there. That in some rare condition could result in calling enable()
    method from the DHCP driver, and putting ports into dhcp_ready_ports
    set even if there were only subnets with disabled dhcp there.

    That could cause the problem with reporting provisioning block completed
    by the DHCP entity and lead to unexpected transition of the port from
    DOWN to UP state on the server side.

    Related-Bug: 1953478
    Change-Id: I8b2cc04a824cd7996b9ba486f8ef614c92b242d5
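The fix the commit message describes boils down to one guard in the DHCP agent: only enable the driver and mark ports DHCP-ready when the network actually has a subnet with DHCP enabled. A sketch of that behavior (plain dicts stand in for the agent-side network/port models; names follow the commit message but the signature is an assumption, not the real agent API):

```python
def configure_dhcp_for_network(network, driver, dhcp_ready_ports):
    """Sketch of the fixed behaviour from change I8b2cc04a.

    Before the fix, ANY subnet on the network was enough to call
    driver.enable() and add the ports to dhcp_ready_ports, so a network
    with only dhcp-disabled subnets could still clear the DHCP
    provisioning block and flip a port DOWN -> UP on the server side
    at an unexpected time.
    """
    if not any(subnet.get("enable_dhcp") for subnet in network["subnets"]):
        return  # no dhcp-enabled subnet: nothing to provision
    driver.enable(network)
    dhcp_ready_ports.update(port["id"] for port in network["ports"])
```

With the guard in place, a network whose subnets all have `enable_dhcp=False` never reports DHCP provisioning complete, so the spurious DOWN-to-ACTIVE transition (and the mistimed network-vif-plugged notification) cannot happen.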

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Patch https://review.opendev.org/c/openstack/neutron/+/824982 should be enough to fix that issue so I will close it now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824981
Reason: Patch https://review.opendev.org/c/openstack/neutron/+/824982 should be enough to fix original issue with SHELVE/UNSHELVE instances and this change in the provisioning blocks is causing too many different issues as not always provisioning block for ports seems to be set, even if later port needs to be switched to ACTIVE.

Changed in neutron:
status: In Progress → Fix Released
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/839733

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/839734

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/839733
Committed: https://opendev.org/openstack/neutron/commit/d8f6fdbd22f5002160ed7115e40f2782c7c20ac8
Submitter: "Zuul (22348)"
Branch: stable/xena

commit d8f6fdbd22f5002160ed7115e40f2782c7c20ac8
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc
    (cherry picked from commit 34e560fad0b773d4b81032f876994d5340195600)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/839734
Committed: https://opendev.org/openstack/neutron/commit/81ae3eb98de5b4038ad3ca0268b55e87bad15ca3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 81ae3eb98de5b4038ad3ca0268b55e87bad15ca3
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc
    (cherry picked from commit 34e560fad0b773d4b81032f876994d5340195600)

tags: added: in-stable-wallaby