Resize and shelve server fail intermittently in the multinode CI jobs

Bug #1953478 reported by Slawek Kaplonski
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

I noticed failures of two tests in the Neutron CI multinode jobs. Both failures look similar to me at first glance, but if they turn out to be different issues, feel free to open a separate bug for one of them.
Failed tests:

tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_shelve_shelved_server

and

tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert

Failure examples:

https://4d2664479333d0a7d727-a4adcb41e06ec456a00383225f090da6.ssl.cf5.rackcdn.com/786478/25/check/neutron-ovs-tempest-dvr-ha-multinode-full/bbf40b6/testr_results.html
https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_061/804846/18/gate/neutron-ovs-tempest-multinode-full/061df45/testr_results.html
https://be2e92e10ead782aa651-35e07a4cf42cfaed2fcffa4bf0b16f1b.ssl.cf1.rackcdn.com/819253/1/check/neutron-ovs-tempest-dvr-ha-multinode-full/94dc22a/testr_results.html

Stacktrace:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_servers_negative.py", line 50, in tearDown
    self.server_check_teardown()
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 222, in server_check_teardown
    waiters.wait_for_server_status(cls.servers_client,
  File "/opt/stack/tempest/tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServersNegativeTestJSON:tearDown) Server 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: spawning.

and:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 419, in test_resize_server_revert
    waiters.wait_for_server_status(self.client, self.server_id, 'ACTIVE')
  File "/opt/stack/tempest/tempest/common/waiters.py", line 96, in wait_for_server_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServerActionsTestJSON:test_resize_server_revert) Server f996ef6b-4417-4eb3-aeb7-9f66d8c4d2c5 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: REVERT_RESIZE. Current task state: resize_reverting.
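Both tracebacks come out of tempest's status waiter, which is essentially a poll loop with a deadline. A minimal sketch of that pattern (the `get_status` callable and its `(status, task_state)` return shape are illustrative assumptions; the real `waiters.wait_for_server_status` also inspects fault details and handles resize states):

```python
import time


class TimeoutException(Exception):
    pass


def wait_for_server_status(get_status, wanted_status, timeout=196, interval=2):
    """Poll a server until it reaches `wanted_status` with task state None.

    `get_status` is a hypothetical callable returning (status, task_state);
    raises TimeoutException with the current state, like the tempest errors
    quoted above, if the deadline passes first.
    """
    start = time.time()
    status, task_state = get_status()
    while status != wanted_status or task_state is not None:
        if time.time() - start >= timeout:
            raise TimeoutException(
                "Server failed to reach %s status and task state None "
                "within the required time (%s s). Current status: %s. "
                "Current task state: %s."
                % (wanted_status, timeout, status, task_state))
        time.sleep(interval)
        status, task_state = get_status()
```

So a server stuck in SHELVED_OFFLOADED/spawning or REVERT_RESIZE/resize_reverting never satisfies the loop condition and the waiter can only time out.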

Changed in nova:
importance: Undecided → High
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So I looked at the first reported failed job[1], where unshelving timed out. Here is the grepped output with the relevant logs [2].

Sequence of events from the nova-compute perspective:
1) unshelving starts
   (I don't see when the binding of the port happened, but it should be done by the compute at [3])
2) vif-plugged event received but treated as unexpected and ignored
3) nova starts waiting for the vif-plugged event
4) nova plugs the vif
5) nova times out waiting for the plugged event

Does neutron send the vif-plugged event at bind time instead of plug time in this case?!

[1] https://zuul.opendev.org/t/openstack/build/bbf40b69b30d42a194af50f60915f9cd/logs

[2] https://paste.opendev.org/show/811535/
[3] https://github.com/openstack/nova/blob/9f296d775d8f58fcbd03393c81a023268c7071cb/nova/compute/manager.py#L6675
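The race described above can be modeled very simply: an external event that arrives before the compute node starts waiting is dropped as "unexpected", so the subsequent wait can only time out. A toy sketch of that behavior (all class and method names are hypothetical, not nova's actual event machinery):

```python
class VifPluggedWaiter:
    """Toy model of nova-compute's handling of network-vif-plugged events.

    Events delivered before prepare_for_event() are logged as unexpected
    and discarded; only events delivered afterwards satisfy the wait.
    """

    def __init__(self):
        self.waiting = False
        self.matched = []   # events that arrived while we were waiting
        self.ignored = []   # events dropped as "unexpected"

    def on_event(self, event):
        if self.waiting:
            self.matched.append(event)
        else:
            # nova-compute logs these as "Received unexpected event ..."
            self.ignored.append(event)

    def prepare_for_event(self):
        self.waiting = True

    def wait(self):
        # True only if the event arrived after waiting started;
        # the real code blocks with a timeout instead of returning.
        return bool(self.matched)
```

With a bind-time notification, `on_event` fires before `prepare_for_event`, the event lands in `ignored`, and `wait()` never succeeds, matching the timeout in the logs.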

affects: nova → neutron
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I also looked at the 3rd failure reported in this bug[1]. Here are the relevant logs[2].
It seems that here neutron also sends the vif-plugged event at bind time and not at plug time, which causes nova to ignore the bind-time event and time out waiting for the plug-time event.

[1] https://zuul.opendev.org/t/openstack/build/94dc22a686014a8a8845b0d86fa7ba3e/logs
[2] https://paste.opendev.org/show/811539/

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I spent some time investigating this issue today, and here is what I found so far.

In the nova-compute logs of the two failed cases I checked, I noticed that the sequence of events was:

Nov 23 19:22:05.162249 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelving

Nov 23 19:22:09.112566 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-aad28555-9fa1-441f-9239-2a1299e35298 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Shelve offloading

Nov 23 19:22:12.795123 ubuntu-focal-ovh-bhs1-0027471441 nova-compute[58433]: INFO nova.compute.manager [None req-7861da7e-4733-4cda-9a8d-aa569ef301f6 tempest-ServersNegativeTestJSON-724065228 tempest-ServersNegativeTestJSON-724065228-project] [instance: 1a207544-1228-4ec1-ad99-4b6f7b6a5ea1] Unshelving

And after the "Unshelving" log in nova-compute, I saw in the neutron-server logs that the port was switched to active, but no notification was sent to nova then (I don't know why):

Nov 23 19:22:42.686159 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.plugins.ml2.rpc [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Device 46ceed9e-1262-47c2-b7dc-335a31f78b71 up at agent ovs-agent-ubuntu-focal-ovh-bhs1-0027471441 {{(pid=95486) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}
...
Nov 23 19:22:42.811808 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron.db.provisioning_blocks [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Provisioning complete for port 46ceed9e-1262-47c2-b7dc-335a31f78b71 triggered by entity L2. {{(pid=95486) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:139}}
Nov 23 19:22:42.812114 ubuntu-focal-ovh-bhs1-0027471439 neutron-server[95486]: DEBUG neutron_lib.callbacks.manager [None req-9ab9f406-90d2-4bf0-9f6f-37f98968d32d None None] Publish callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned-3745000'] for port, provisioning_complete {{(pid=95486) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:175}}

I compared it with a case where the test passed; there, in the nova-compute logs, I can see:

Dec 22 09:08:21.469791 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelving

Dec 22 09:08:24.621978 ubuntu-focal-iweb-mtl01-0027794527 nova-compute[57704]: INFO nova.compute.manager [None req-a6373604-981c-4cca-b6a8-85abdc256ad0 tempest-ServersNegativeTestJSON-764009652 tempest-ServersNegativeTestJSON-764009652-project] [instance: 8b1d44ec-c81c-476f-95b0-624a8b62805c] Shelve offloading

But unshelving happened on the other node:
Dec 22 09:08:28.417211 ubuntu-focal-iweb-mtl01-0027794...


Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I checked some more similar cases where that test passed, and sometimes unshelving happens on the same node as the shelve, so that is definitely not the cause of the issue.

So far I think that the issue is on the Neutron side, as it didn't send the proper notification to nova when the port came up after unshelving.
I will try to reproduce the issue locally, or at least add some more debug logs to neutron, to better understand what happens there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/822817

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/822818

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/822817
Committed: https://opendev.org/openstack/neutron-lib/commit/76b9a095918c02fa90e1cc6f1a481b3a40938b3b
Submitter: "Zuul (22348)"
Branch: master

commit 76b9a095918c02fa90e1cc6f1a481b3a40938b3b
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:55:40 2021 +0100

    Log resource id in the callbacks publish method

    It may help in debugging if we will know for what resource exactly
    callbacks are published.

    Related-Bug: #1953478
    Change-Id: Ia741d4d87d3a284add2f2c48f780d5332540fafe

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/822818
Committed: https://opendev.org/openstack/neutron/commit/34e560fad0b773d4b81032f876994d5340195600
Submitter: "Zuul (22348)"
Branch: master

commit 34e560fad0b773d4b81032f876994d5340195600
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc

Revision history for this message
Slawek Kaplonski (slaweq) wrote :
tags: added: neutron-proactive-backport-potential
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I dug into that issue again and I think I more or less know what is going on (though I'm not 100% sure why it's like that).
So in the failing case, the problem is that after the instance is shelved, the port is reported to be in DOWN state:

Jan 14 14:31:19.984862 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.rpc [None req-58bb688c-d96a-49ab-9dc8-a4845ca7d6fa None None] Device 63c13ed9-d3ad-409b-8412-134ae41c404b no longer exists at agent ovs-agent-ubuntu-focal-rax-ord-0028007551 {{(pid=70556) update_device_down /opt/stack/neutron/neutron/plugins/ml2/rpc.py:259}}
Jan 14 14:31:20.132185 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.plugin [None req-58bb688c-d96a-49ab-9dc8-a4845ca7d6fa None None] Current status of the port 63c13ed9-d3ad-409b-8412-134ae41c404b is: ACTIVE; New status is: DOWN {{(pid=70556) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2213}}

but just after that, and before nova actually does the unshelve of the VM, there is a notification from the DHCP agent that the port's provisioning is completed:

Jan 14 14:31:24.275380 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.db.provisioning_blocks [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Provisioning complete for port 63c13ed9-d3ad-409b-8412-134ae41c404b triggered by entity DHCP. {{(pid=70556) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:139}}
Jan 14 14:31:24.275680 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron_lib.callbacks.manager [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Publish callbacks ['neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned-1912498'] for port (63c13ed9-d3ad-409b-8412-134ae41c404b), provisioning_complete {{(pid=70556) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:176}}
Jan 14 14:31:24.804444 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.plugins.ml2.plugin [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Current status of the port 63c13ed9-d3ad-409b-8412-134ae41c404b is: DOWN; New status is: ACTIVE {{(pid=70556) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2213}}
Jan 14 14:31:24.804773 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron_lib.callbacks.manager [None req-d4f3e3de-4409-406b-adf2-34be949f3734 None None] Publish callbacks ['neutron.plugins.ml2.plugin.SecurityGroupDbMixin._ensure_default_security_group_handler-1790959'] for port (63c13ed9-d3ad-409b-8412-134ae41c404b), before_update {{(pid=70556) _notify_loop /usr/local/lib/python3.8/dist-packages/neutron_lib/callbacks/manager.py:176}}
Jan 14 14:31:24.877831 ubuntu-focal-rax-ord-0028007551 neutron-server[70556]: DEBUG neutron.notifiers.nova [-] Sending events: [{'server_uuid': 'a18584c2-4138-4c50-b426-f68b4c69b490', 'name': 'network-vif-plugged', 'status': 'completed', 'tag': '63c13ed9-d3ad-409b-8412-134ae41c404b'}] {{(pid=70556) send_events /opt/stack/neutron/neutron/notifiers/nova.py:262}}

And that triggers the sending of the notification to nova. But nova don't expe...

Changed in neutron:
status: New → Confirmed
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824981

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824982

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/824982
Committed: https://opendev.org/openstack/neutron/commit/1562f9141ba9c90c69c1c851dd9a50c8108ab15f
Submitter: "Zuul (22348)"
Branch: master

commit 1562f9141ba9c90c69c1c851dd9a50c8108ab15f
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jan 17 17:02:07 2022 +0100

    Call enable DHCP only if there are subnets with enabled DHCP in network

    In the configure_dhcp_for_network() method in the DHCP agent, we should
    call "enable" from the DHCP driver, and put ports as dhcp_ready_ports
    only if there are subnets with enabled DHCP service in that network.
    Before this patch it was done for network if there were ANY subnets
    there. That in some rare condition could result in calling enable()
    method from the DHCP driver, and putting ports into dhcp_ready_ports
    set even if there were only subnets with disabled dhcp there.

    That could cause the problem with reporting provisioning block completed
    by the DHCP entity and lead to unexpected transition of the port from
    DOWN to UP state on the server side.

    Related-Bug: 1953478
    Change-Id: I8b2cc04a824cd7996b9ba486f8ef614c92b242d5
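The fix the commit message describes boils down to one guard in the DHCP agent: only enable the driver and mark ports DHCP-ready when the network actually has a subnet with DHCP enabled. A sketch of that behavior (plain dicts stand in for the agent-side network/port models; names follow the commit message but the signature is an assumption, not the real agent API):

```python
def configure_dhcp_for_network(network, driver, dhcp_ready_ports):
    """Sketch of the fixed behaviour from change I8b2cc04a.

    Before the fix, ANY subnet on the network was enough to call
    driver.enable() and add the ports to dhcp_ready_ports, so a network
    with only dhcp-disabled subnets could still clear the DHCP
    provisioning block and flip a port DOWN -> UP on the server side
    at an unexpected time.
    """
    if not any(subnet.get("enable_dhcp") for subnet in network["subnets"]):
        return  # no dhcp-enabled subnet: nothing to provision
    driver.enable(network)
    dhcp_ready_ports.update(port["id"] for port in network["ports"])
```

With the guard in place, a network whose subnets all have `enable_dhcp=False` never reports DHCP provisioning complete, so the spurious DOWN-to-ACTIVE transition (and the mistimed network-vif-plugged notification) cannot happen.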

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Patch https://review.opendev.org/c/openstack/neutron/+/824982 should be enough to fix that issue so I will close it now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/824981
Reason: Patch https://review.opendev.org/c/openstack/neutron/+/824982 should be enough to fix original issue with SHELVE/UNSHELVE instances and this change in the provisioning blocks is causing too many different issues as not always provisioning block for ports seems to be set, even if later port needs to be switched to ACTIVE.

Changed in neutron:
status: In Progress → Fix Released
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/839733

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/839734

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/839733
Committed: https://opendev.org/openstack/neutron/commit/d8f6fdbd22f5002160ed7115e40f2782c7c20ac8
Submitter: "Zuul (22348)"
Branch: stable/xena

commit d8f6fdbd22f5002160ed7115e40f2782c7c20ac8
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc
    (cherry picked from commit 34e560fad0b773d4b81032f876994d5340195600)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/839734
Committed: https://opendev.org/openstack/neutron/commit/81ae3eb98de5b4038ad3ca0268b55e87bad15ca3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 81ae3eb98de5b4038ad3ca0268b55e87bad15ca3
Author: Slawek Kaplonski <email address hidden>
Date: Thu Dec 23 11:57:47 2021 +0100

    Add extra debug log with current and new port statuses in ML2 plugin

    This patch only adds one extra debug log to log current and new
    port statuses in the method which updates that status in the DB.
    This may be useful e.g. to understand why notifications about port
    change aren't send to Nova in some cases (like unshelve instance, see
    related bug).

    Related-Bug: #1953478
    Change-Id: I4c6fd5b0b33bf764c0b182f169173453ea7a4efc
    (cherry picked from commit 34e560fad0b773d4b81032f876994d5340195600)

tags: added: in-stable-wallaby