[SRIOV] Agent is reporting "resources_synced=True" when restarted and placement is down

Bug #1856319 reported by Rodolfo Alonso
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Lajos Katona

Bug Description

In fullstack test "test_configurations_are_synced_towards_placement", the Placement process is stopped, a new network is added and the L2 agent is restarted (SRIOV in this case).

The agent should sync the resources using the placement API, but because the Placement process is down, the server should report "resources_synced=False" until the placement report is received.

In [1] can be seen that the first reply from the server, informing about the agent with the new configuration (new network), the flag is True ("resources_synced": true):
2019-12-13 09:59:09.251 5513 DEBUG neutronclient.client [-] RESP: 200 {'Content-Type': 'application/json', 'Content-Length': '860', 'X-Openstack-Request-Id': 'req-363fc0e2-c3ae-4995-853b-cfbba9f2beb2', 'Date': 'Fri, 13 Dec 2019 09:59:09 GMT', 'Connection': 'keep-alive'} {"agents": [{"id": "158dea82-07e4-47d8-a65a-849e6de5e981", "agent_type": "NIC Switch agent", "binary": "neutron-sriov-nic-agent", "topic": "N/A", "host": "host-a677ed52", "admin_state_up": true, "created_at": "2019-12-13 09:58:39", "started_at": "2019-12-13 09:58:58", "heartbeat_timestamp": "2019-12-13 09:58:58", "description": null, "resources_synced": true, "availability_zone": null, "alive": true, "configurations": {"device_mappings": {"physnet1": ["ens590a255c847e", "ens61730c512d2e"], "physnetnew": ["ens74c23d91936f"]}, "extensions": [], "resource_provider_bandwidths": {"ens590a255c847e": {"egress": 1000, "ingress": 1000}, "ens61730c512d2e": {"egress": 1000, "ingress": 1000}, "ens74c23d91936f": {"egress": 1000, "ingress": 1000}}, "resource_provider_inventory_defaults": {"allocation_ratio": 1.0, "min_unit": 1, "step_size": 1, "reserved": 0}}}]} http_log_resp /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/common/utils.py:204

Logs: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1de/697251/7/check/neutron-fullstack/1deb02b/testr_results.html.gz

[1] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1de/697251/7/check/neutron-fullstack/1deb02b/controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_.txt.gz

tags: added: sriov-pci-pt
Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

the test decided to fail and started killing the neutron server at:

2019-12-13 09:59:32.058 5513 DEBUG neutron.agent.common.async_process [-] Halting async process [/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/bin/neutron-server --log-dir /opt/stack/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_ --log-file neutron-server--2019-12-13--09-58-26-449940.log --config-file /tmp/tmptz3q23or/tmpodugwxcl/neutron.conf --config-file /tmp/tmptz3q23or/tmpodugwxcl/ml2_conf.ini]. stop /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:135

while the server noticed resources_synced=False at:

2019-12-13 09:59:34.268 14875 WARNING neutron.services.placement_report.plugin [req-f5af49ea-fb5c-496a-bcfc-4d37004763f0 - - - - -] Synchronization of resources of agent type NIC Switch agent at host host-a677ed52 to placement failed.: keystoneauth1.exceptions.http.HttpError: HTTP Error (HTTP 0)

maybe for some reasons the placement plugin is taking too longer to fail than expected?

tags: removed: sriov-pci-pt
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Bence Romsics (bence-romsics) wrote :

This sounds like a bug for me, I put it on my todo list.

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Revision history for this message
Bence Romsics (bence-romsics) wrote :
Download full text (6.1 KiB)

I had a bit of time to look at this. I don't get it fully yet, but anyway I found something strange I wanted record and share here. I don't think the agent was actually restarted.

Have a look at this before and after the 'Process stopped/started' lines:

$ wget -q -O- https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1de/697251/7/check/neutron-fullstack/1deb02b/controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_.txt.gz | zcat | egrep 'resources_synced|Process .*: neutron-sriov-nic-agent' | head

2019-12-13 09:58:34.221 5513 DEBUG neutron.tests.fullstack.resources.process [-] Process started: neutron-sriov-nic-agent start /home/zuul/src/opendev.org/openstack/neutron/neutron/tests/fullstack/resources/process.py:74
2019-12-13 09:58:44.785 5513 DEBUG neutronclient.client [-] RESP: 200 {'Content-Type': 'application/json', 'Content-Length': '771', 'X-Openstack-Request-Id': 'req-004406ad-2459-4b88-beb8-ecce40d80135', 'Date': 'Fri, 13 Dec 2019 09:58:44 GMT', 'Connection': 'keep-alive'} {"agents": [{"id": "158dea82-07e4-47d8-a65a-849e6de5e981", "agent_type": "NIC Switch agent", "binary": "neutron-sriov-nic-agent", "topic": "N/A", "host": "host-a677ed52", "admin_state_up": true, "created_at": "2019-12-13 09:58:39", "started_at": "2019-12-13 09:58:39", "heartbeat_timestamp": "2019-12-13 09:58:39", "description": null, "resources_synced": true, "availability_zone": null, "alive": true, "configurations": {"device_mappings": {"physnet1": ["ens590a255c847e", "ens61730c512d2e"]}, "extensions": [], "resource_provider_bandwidths": {"ens590a255c847e": {"egress": 1000, "ingress": 1000}, "ens61730c512d2e": {"egress": 1000, "ingress": 1000}}, "resource_provider_inventory_defaults": {"allocation_ratio": 1.0, "min_unit": 1, "step_size": 1, "reserved": 0}}}]} http_log_resp /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/common/utils.py:204
2019-12-13 09:58:44.802 5513 DEBUG neutronclient.client [-] RESP: 200 {'Content-Type': 'application/json', 'Content-Length': '771', 'X-Openstack-Request-Id': 'req-0c7fef64-e9d7-4e6e-8a35-99421381b62c', 'Date': 'Fri, 13 Dec 2019 09:58:44 GMT', 'Connection': 'keep-alive'} {"agents": [{"id": "158dea82-07e4-47d8-a65a-849e6de5e981", "agent_type": "NIC Switch agent", "binary": "neutron-sriov-nic-agent", "topic": "N/A", "host": "host-a677ed52", "admin_state_up": true, "created_at": "2019-12-13 09:58:39", "started_at": "2019-12-13 09:58:39", "heartbeat_timestamp": "2019-12-13 09:58:39", "description": null, "resources_synced": true, "availability_zone": null, "alive": true, "configurations": {"device_mappings": {"physnet1": ["ens590a255c847e", "ens61730c512d2e"]}, "extensions": [], "resource_provider_bandwidths": {"ens590a255c847e": {"egress": 1000, "ingress": 1000}, "ens61730c512d2e": {"egress": 1000, "ingress": 1000}}, "resource_provider_inventory_defaults": {"allocation_ratio": 1.0, "min_unit": 1, "step_size": 1, "reserved": 0}}}]} http_log_resp /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/common...

Read more...

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Thanks Bence for your time debugging this problem.

Changed in neutron:
assignee: Bence Romsics (bence-romsics) → Lajos Katona (lajos-katona)
Revision history for this message
Lajos Katona (lajos-katona) wrote :
Download full text (3.1 KiB)

After reexamining this failure with rubasov, we collected the following things.
1.) A logstash query as the old links are not working now:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22test_configurations_are_synced_towards_placement%5C%22%20AND%20build_name%3A%5C%22neutron-fullstack%5C%22%20AND%20build_status%3A%5C%22FAILURE%5C%22

2.) I the case of a successful run, the started_at field in the agent report changes after a few iterations to the time after the restart. In this case under the log folder (i.e.: controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_/) there are 2 logs for the agent, one related to the agent before the restart and for after the restart.
In the failing case the started_at remains the same, and there is only 1 agent log file in the log folder.

3.) In both cases the start of the new agent process seems to be successful, see a failing case:

2020-01-15 10:09:18.985 5633 DEBUG neutron.agent.common.async_process [-] Launching async process [/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/bin/neutron-sriov-nic-agent --log-dir /opt/stack/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_ --log-file neutron-sriov-nic-agent--2020-01-15--10-09-18-984586.log --config-file /tmp/tmp8kcs8kod/tmpfl9b37bi/neutron.conf --config-file /tmp/tmp8kcs8kod/tmpfl9b37bi/sriov_agent.ini]. start /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/common/async_process.py:112
2020-01-15 10:09:18.985 5633 DEBUG neutron.agent.linux.utils [-] Running command: ['/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/bin/neutron-sriov-nic-agent', '--log-dir', '/opt/stack/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_', '--log-file', 'neutron-sriov-nic-agent--2020-01-15--10-09-18-984586.log', '--config-file', '/tmp/tmp8kcs8kod/tmpfl9b37bi/neutron.conf', '--config-file', '/tmp/tmp8kcs8kod/tmpfl9b37bi/sriov_agent.ini'] create_process /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/utils.py:87
2020-01-15 10:09:19.115 5633 DEBUG neutron.agent.linux.utils [-] Found cmdline ['/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/bin/python3', '/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/bin/neutron-sriov-nic-agent', '--log-dir', '/opt/stack/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_', '--log-file', 'neutron-sriov-nic-agent--2020-01-15--10-09-18-984586.log', '--config-file', '/tmp/tmp8kcs8kod/tmpfl9b37bi/neutron.conf', '--config-file', '/tmp/tmp8kcs8kod/tmpfl9b37bi/sriov_agent.ini'] for process with PID 9205. get_cmdline_from_pid /home/zuul/src/opendev.org/openstack/neutron/neutron/agent/linux/utils.py:343
2020-01-15 10:09:19.118 5633 DEBUG neutron.tests.fullstack.resources.process [-] Process started: neutron-sriov-nic-agent start /home/zuul/src/opendev.org/openstack/neutron/neutron/tests/fullstack/resources/process.py:74

So even PID is there...

Read more...

Revision history for this message
Lajos Katona (lajos-katona) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/870956

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/870956
Committed: https://opendev.org/openstack/neutron/commit/0ae432e71712be671ec93dded3245513b6810cd0
Submitter: "Zuul (22348)"
Branch: master

commit 0ae432e71712be671ec93dded3245513b6810cd0
Author: elajkat <email address hidden>
Date: Wed Jan 18 14:30:44 2023 +0100

    Fullstack: Wait placement process fixtrue to really stop

    It seems that in test test_configurations_are_synced_towards_placement
    when the fake placement process is stopped it can take longer and
    the test restarts the agent before it is really dead.
    In this case neutron-server can report the status and the agent's
    'resources_synced' field is True when the tests checks it.

    A simple workaround for this can be to check if the AsyncProcess
    class' _is_running property is False.

    Closes-Bug: #1856319
    Change-Id: Ia4c25110ee5a7b7539407d3531f559555fec50cc

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/872878

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872878
Committed: https://opendev.org/openstack/neutron/commit/4d983d2fea7605859609f00220a6cf1d3ffc3950
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 4d983d2fea7605859609f00220a6cf1d3ffc3950
Author: elajkat <email address hidden>
Date: Wed Jan 18 14:30:44 2023 +0100

    Fullstack: Wait placement process fixtrue to really stop

    It seems that in test test_configurations_are_synced_towards_placement
    when the fake placement process is stopped it can take longer and
    the test restarts the agent before it is really dead.
    In this case neutron-server can report the status and the agent's
    'resources_synced' field is True when the tests checks it.

    A simple workaround for this can be to check if the AsyncProcess
    class' _is_running property is False.

    Closes-Bug: #1856319
    Change-Id: Ia4c25110ee5a7b7539407d3531f559555fec50cc
    (cherry picked from commit 0ae432e71712be671ec93dded3245513b6810cd0)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/873279

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/873962

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/873962
Committed: https://opendev.org/openstack/neutron/commit/a26f20fe950c88e5e5540a095de365b56691803e
Submitter: "Zuul (22348)"
Branch: stable/xena

commit a26f20fe950c88e5e5540a095de365b56691803e
Author: elajkat <email address hidden>
Date: Wed Jan 18 14:30:44 2023 +0100

    Fullstack: Wait placement process fixtrue to really stop

    It seems that in test test_configurations_are_synced_towards_placement
    when the fake placement process is stopped it can take longer and
    the test restarts the agent before it is really dead.
    In this case neutron-server can report the status and the agent's
    'resources_synced' field is True when the tests checks it.

    A simple workaround for this can be to check if the AsyncProcess
    class' _is_running property is False.

    Closes-Bug: #1856319
    Change-Id: Ia4c25110ee5a7b7539407d3531f559555fec50cc
    (cherry picked from commit 0ae432e71712be671ec93dded3245513b6810cd0)
    (cherry picked from commit 4d983d2fea7605859609f00220a6cf1d3ffc3950)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/873279
Committed: https://opendev.org/openstack/neutron/commit/54501e7a24145aff318a6b5b1fa8b762a0e6b664
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 54501e7a24145aff318a6b5b1fa8b762a0e6b664
Author: elajkat <email address hidden>
Date: Wed Jan 18 14:30:44 2023 +0100

    Fullstack: Wait placement process fixtrue to really stop

    It seems that in test test_configurations_are_synced_towards_placement
    when the fake placement process is stopped it can take longer and
    the test restarts the agent before it is really dead.
    In this case neutron-server can report the status and the agent's
    'resources_synced' field is True when the tests checks it.

    A simple workaround for this can be to check if the AsyncProcess
    class' _is_running property is False.

    Closes-Bug: #1856319
    Change-Id: Ia4c25110ee5a7b7539407d3531f559555fec50cc
    (cherry picked from commit 0ae432e71712be671ec93dded3245513b6810cd0)
    (cherry picked from commit 4d983d2fea7605859609f00220a6cf1d3ffc3950)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/874499

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1

This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/874499
Committed: https://opendev.org/openstack/neutron/commit/0f0cd189d5e771c1d9687872242031c6b0441b93
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 0f0cd189d5e771c1d9687872242031c6b0441b93
Author: elajkat <email address hidden>
Date: Wed Jan 18 14:30:44 2023 +0100

    Fullstack: Wait placement process fixtrue to really stop

    It seems that in test test_configurations_are_synced_towards_placement
    when the fake placement process is stopped it can take longer and
    the test restarts the agent before it is really dead.
    In this case neutron-server can report the status and the agent's
    'resources_synced' field is True when the tests checks it.

    A simple workaround for this can be to check if the AsyncProcess
    class' _is_running property is False.

    Closes-Bug: #1856319
    Change-Id: Ia4c25110ee5a7b7539407d3531f559555fec50cc
    (cherry picked from commit 0ae432e71712be671ec93dded3245513b6810cd0)
    (cherry picked from commit 4d983d2fea7605859609f00220a6cf1d3ffc3950)
    (cherry picked from commit a26f20fe950c88e5e5540a095de365b56691803e)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.6.0

This issue was fixed in the openstack/neutron 19.6.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.3.0

This issue was fixed in the openstack/neutron 20.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.1.0

This issue was fixed in the openstack/neutron 21.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron wallaby-eom

This issue was fixed in the openstack/neutron wallaby-eom release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.