Fullstack tests failing due to problem with connection to the fake placement service

Bug #1862177 reported by Slawek Kaplonski on 2020-02-06
This bug affects 1 person
Affects: neutron
Importance: High
Assigned to: Lajos Katona

Bug Description

Tests like neutron.tests.fullstack.test_agent_bandwidth_report.TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement(NIC Switch agent) fail from time to time due to a problem with the connection from neutron-server to the fake placement service.

An example of such an error: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f66/703143/3/check/neutron-fullstack/f667c93/testr_results.html

Error in neutron-server:

2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client [req-29a0f974-e549-4784-924e-0ee82cc8c910 - - - - -] Connection Error appeared: requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8081): Max retries exceeded with url: /placement/resource_providers?name=ubuntu-bionic-ovh-bhs1-0014319701 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc0d34cb5c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client Traceback (most recent call last):
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client (self._dns_host, self.port), self.timeout, **extra_kw
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/util/connection.py", line 84, in create_connection
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client raise err
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client sock.connect(sa)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/eventlet/greenio/base.py", line 267, in connect
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client socket_checkerr(fd)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/eventlet/greenio/base.py", line 51, in socket_checkerr
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client raise socket.error(err, errno.errorcode[err])
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client ConnectionRefusedError: [Errno 111] ECONNREFUSED
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client During handling of the above exception, another exception occurred:
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client Traceback (most recent call last):
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client chunked=chunked,
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client conn.request(method, url, **httplib_request_kw)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/usr/lib/python3.6/http/client.py", line 1254, in request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self._send_request(method, url, body, headers, encode_chunked)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/usr/lib/python3.6/http/client.py", line 1300, in _send_request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self.endheaders(body, encode_chunked=encode_chunked)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self._send_output(message_body, encode_chunked=encode_chunked)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self.send(msg)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/usr/lib/python3.6/http/client.py", line 974, in send
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self.connect()
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connection.py", line 184, in connect
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client conn = self._new_conn()
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client self, "Failed to establish a new connection: %s" % e
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc0d34cb5c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client During handling of the above exception, another exception occurred:
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client Traceback (most recent call last):
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client timeout=timeout
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client raise MaxRetryError(_pool, url, error or ResponseError(cause))
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8081): Max retries exceeded with url: /placement/resource_providers?name=ubuntu-bionic-ovh-bhs1-0014319701 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc0d34cb5c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client During handling of the above exception, another exception occurred:
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client Traceback (most recent call last):
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutron_lib/placement/client.py", line 114, in request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client **kwargs)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/requests/api.py", line 60, in request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client return session.request(method=method, url=url, **kwargs)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client resp = self.send(prep, **send_kwargs)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client r = adapter.send(request, **kwargs)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client File "/home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client raise ConnectionError(e, request=request)
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8081): Max retries exceeded with url: /placement/resource_providers?name=ubuntu-bionic-ovh-bhs1-0014319701 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc0d34cb5c0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2020-02-04 12:09:41.164 29846 ERROR neutron_lib.placement.client
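For context, the failing call is a plain `requests` GET against the fake placement service on 127.0.0.1:8081, and a connect to a stopped service fails immediately with ECONNREFUSED. One way a test harness can tolerate a service that is being restarted is to poll the TCP port until it accepts connections again. This is purely illustrative (`wait_for_port` is a hypothetical helper, not part of neutron or its fullstack framework):

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Poll until a TCP connect to (host, port) succeeds or timeout expires.

    Returns True once the port accepts a connection, False on timeout.
    A connect attempt against a stopped service fails with ECONNREFUSED,
    which is exactly the error seen in the traceback above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # ECONNREFUSED (or a timeout) while the fake placement
            # service is restarting is expected; keep polling.
            time.sleep(interval)
    return False
```

A caller would invoke `wait_for_port("127.0.0.1", 8081)` before issuing requests, instead of letting the first GET after a restart blow up with MaxRetryError.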

tags: added: fullstack gate-failure
removed: fulls
Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Lajos Katona (lajos-katona) wrote :

The good news is that the extra logs are now visible.

I checked the fake placement client logs:
http://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f66/703143/3/check/neutron-fullstack/f667c93/controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_/placement--2020-02-04--12-09-21-528383_log.txt
...
2020-02-04 12:09:37.629 29984 DEBUG __main__ [-] Response from put_resource_providers_inventories: {"resource_provider_generation": 1, "inventories": {"NET_BW_EGR_KILOBIT_PER_SEC": {"allocation_ratio": 1.0, "min_unit": 1, "step_size": 1, "reserved": 0, "total": 1000}, "NET_BW_IGR_KILOBIT_PER_SEC": {"allocation_ratio": 1.0, "min_unit": 1, "step_size": 1, "reserved": 0, "total": 1000}}} wsgi_app /home/zuul/src/opendev.org/openstack/neutron/neutron/tests/fullstack/servers/placement.py:130

and the restart came here.

The new service logs:

2020-02-04 12:10:04.582 30833 INFO neutron.common.config [-] Logging enabled!
2020-02-04 12:10:04.582 30833 INFO neutron.common.config [-] /home/zuul/src/opendev.org/openstack/neutron/neutron/tests/fullstack/servers/placement.py version 15.1.0.dev432
2020-02-04 12:10:04.583 30833 DEBUG neutron.common.config [-] command line: /home/zuul/src/opendev.org/openstack/neutron/neutron/tests/fullstack/servers/placement.py --log-dir /opt/stack/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_ --log-file placement--2020-02-04--12-10-01-611315.log --config-file /tmp/tmpokrmznnl/tmpjh5gv_ss/placement.ini setup_logging /home/zuul/src/opendev.org/openstack/neutron/neutron/common/config.py:111
2020-02-04 12:10:04.586 30833 INFO __main__ [-] Placement fixture started on port: 8081

And that's all. Seemingly it started and is waiting for requests.
I will check whether there is anything weird going on.

Bence Romsics (bence-romsics) wrote :

I do not understand the root cause yet, but anyway here are my findings so far:

The ECONNREFUSED in the logs is not a problem in itself. The test intentionally stops the placement service temporarily, so some ECONNREFUSED errors are expected.

I was comparing a failed run:

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f66/703143/3/check/neutron-fullstack/f667c93/controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_/neutron-server--2020-02-04--12-09-12-759753_log.txt

with a successful run of the same test:

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_6d5/701571/10/check/neutron-fullstack/6d566a7/controller/logs/dsvm-fullstack-logs/TestPlacementBandwidthReport.test_configurations_are_synced_towards_placement_NIC-Switch-agent_/neutron-server--2020-02-05--22-39-20-199852_log.txt

The difference I found is below. In the code, we are at the point of waiting for the agent to go to the SYNC_FAILED state:

https://opendev.org/openstack/neutron/src/commit/d5b33ffc775ffff500271c0f15f055ae3132a8d0/neutron/tests/fullstack/test_agent_bandwidth_report.py#L239
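That wait is done with neutron's `wait_until_true` helper, which polls a predicate until it becomes true or a timeout fires (producing the "Timed out after N seconds" failures seen in these runs). A simplified, eventlet-free stand-in, sketched here only to make the test's control flow concrete (the real `neutron.common.utils.wait_until_true` uses eventlet timeouts), looks like this:

```python
import time


class WaitTimeout(Exception):
    """Raised when the predicate never became true within the timeout."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll ``predicate`` until it returns a truthy value.

    Simplified stand-in for neutron.common.utils.wait_until_true:
    re-evaluate the predicate every ``sleep`` seconds and raise
    WaitTimeout once ``timeout`` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)
```

In the test, the predicate checks the agent's reported state via `GET /v2.0/agents?agent_type=NIC+Switch+agent`, which is why those requests repeat in the server log until the wait completes or times out.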

The successful run's logs:

2020-02-05 22:40:15.578 6911 INFO neutron.wsgi [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] 127.0.0.1 "GET /v2.0/agents HTTP/1.1" status: 200 len: 1261 time: 0.0185254
2020-02-05 22:40:16.586 6912 DEBUG neutron.wsgi [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] (6912) accepted ('127.0.0.1', 35783) server /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/eventlet/wsgi.py:985
2020-02-05 22:40:16.595 6912 INFO neutron.wsgi [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] 127.0.0.1 "GET /v2.0/agents HTTP/1.1" status: 200 len: 1261 time: 0.0080390
2020-02-05 22:40:17.602 6911 DEBUG neutron.wsgi [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] (6911) accepted ('127.0.0.1', 35797) server /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/eventlet/wsgi.py:985
2020-02-05 22:40:17.612 6911 INFO neutron.wsgi [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] 127.0.0.1 "GET /v2.0/agents HTTP/1.1" status: 200 len: 1261 time: 0.0096648
2020-02-05 22:40:18.176 6915 DEBUG neutron.db.agents_db [req-12d85268-7050-454b-a40e-4b92877ab3a9 - - - - -] Agent healthcheck: found 1 active agents agent_health_check /home/zuul/src/opendev.org/openstack/neutron/neutron/db/agents_db.py:327
2020-02-05 22:40:18.599 6914 DEBUG neutron_lib.callbacks.manager [req-e4714672-6da0-4664-9882-0d32fcc1b061 - - - - -] Notify callbacks ['neutron.services.segments.db._update_segment_host_mapping_for_agent-8765997064085', 'neutron.plugins.ml2.plugin.Ml2Plugin._retry_binding_revived_agents-1431747', 'neutron.services.placement_report.plugin.PlacementReportPlugin.handle_placement_config--9223372036854771549'] for agent, after_update _notify_loop /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py:193
2020-02-05 22:40:18.607 6914 DEBUG neutron.services.placement_report.plugin [req-e4...


Lajos Katona (lajos-katona) wrote :

Hmmmm....
I can't see how multiple agents can sometimes be visible to the server, but let's try it.

Fix proposed to branch: master
Review: https://review.opendev.org/706500

Changed in neutron:
assignee: Bence Romsics (bence-romsics) → Lajos Katona (lajos-katona)
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/706500
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c9a3d49513f863be144e0df31530861938b95979
Submitter: Zuul
Branch: master

commit c9a3d49513f863be144e0df31530861938b95979
Author: elajkat <email address hidden>
Date: Fri Feb 7 13:41:28 2020 +0100

    Enhance agent's resources_synced check

    The fullstack test TestPlacementBandwidthReport.
    test_configurations_are_synced_towards_placement sporadically failing
    and one possible reason is that multiple agents are visible to the
    server after restart of the agent (perhaps due to parallel execution).

    Partial-Bug: #1862177
    Change-Id: I79aa2cb9bcd69fc0cb35e0fd351a373e00feb062
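The idea behind the fix, as I read the commit message, is to make the sync check robust against a briefly visible duplicate agent. A rough sketch of such a tightened predicate (illustrative only; `list_agents` and the dict shape are assumptions, not the actual patch in review 706500):

```python
def agent_sync_failed(client, agent_type="NIC Switch agent"):
    """Report SYNC_FAILED only when exactly one agent of the type is visible.

    If a restarted agent briefly shows up as a duplicate row, the
    predicate stays False instead of reading state from the wrong row.
    ``client.list_agents()`` is assumed (hypothetically) to return
    dicts carrying 'agent_type' and 'resources_synced' keys.
    """
    agents = [a for a in client.list_agents()
              if a.get("agent_type") == agent_type]
    if len(agents) != 1:
        return False
    return agents[0].get("resources_synced") is False
```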

Bence Romsics (bence-romsics) wrote :

So far, the logs of the 3 occurrences we looked at in depth did not look exactly the same. Unfortunately there are variations, so we may or may not be facing multiple issues.

But anyway, here is some analysis of the newest occurrence linked by Lajos in comment #6.

I went through the log files, collecting the exact time of events as cross-referenced (see line numbers below) with the test code here:

https://opendev.org/openstack/neutron/src/commit/6b9765c991da8731fe39f7e7eed1ed6e2bca231a/neutron/tests/fullstack/test_agent_bandwidth_report.py

         2020-02-26-09-26-26-086450 q-svc log file created
         2020-02-26 09:26:28.012 q-svc first log line
         2020-02-26-09-26-30-776059 placement #1 log file created
         2020-02-26-09-26-30-865884 sriov-agent #1 log file created
         2020-02-26 09:26:30.890 q-svc GET /v2.0/agents
         2020-02-26 09:26:32.587 placement #1 first log line
         2020-02-26 09:26:33.256 sriov-agent #1 first log line
         2020-02-26 09:26:33.436 placement #1 last log line
         2020-02-26 09:26:39.396 sriov-agent #1 last log line
line 229 2020-02-26 09:26:40.905 q-svc GET /v2.0/agents
line 234 2020-02-26 09:26:40.918 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent first
!!! 2020-02-26-09-26-40-986104 sriov-agent #2 log file created
line 234 2020-02-26 09:26:42.467 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent last
         2020-02-26 09:26:43.254 sriov-agent #2 first log line
line 244 2020-02-26 09:26:43.479 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent first
line 244 2020-02-26 09:26:47.536 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent last
line 251 2020-02-26 09:26:48.551 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent first
         2020-02-26-09-26-53-627265 placement #2 log file created
         2020-02-26 09:26:55.128 placement #2 first log line
         2020-02-26 09:26:55.130 placement #2 last log line
         2020-02-26 09:27:39.490 sriov-agent #2 last log line
line 251 2020-02-26 09:27:40.632 q-svc GET /v2.0/agents?agent_type=NIC+Switch+agent last
line 254 3rd wait_until_true: Timed out after 47 seconds

The most troubling piece is the line marked by '!!!'. As you can see in the code, the 2nd start of the sriov-agent should only happen at line 240, clearly after the first wait_until_true() in line 234 has completed. But according to the timestamp in the sriov-agent log file name, the 2nd start of the sriov-agent happened during the first wait_until_true().

At the moment I don't have an explanation for this, I just wanted to record the observation here.
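For the record, the timeline above was assembled by cross-referencing the timestamps embedded in the log file names (e.g. `placement--2020-02-04--12-09-21-528383_log.txt`) with the timestamps inside the logs. A small helper to extract the file-name stamps, sketched here as a convenience (the regex only covers the two formats that appear in this report):

```python
import re
from datetime import datetime

# Matches both '2020-02-04--12-09-21-528383' (double dash, as in log
# file names) and '2020-02-26-09-26-26-086450' (single dash, as in the
# timeline above).
STAMP = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})--?(\d{2})-(\d{2})-(\d{2})-(\d{6})")


def logfile_time(name):
    """Return the datetime embedded in a fullstack log file name."""
    m = STAMP.search(name)
    if m is None:
        raise ValueError("no timestamp found in %r" % name)
    return datetime(*map(int, m.groups()))
```

Sorting `(logfile_time(f), f)` tuples for all files of a test run reproduces the ordering used in the table above.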

tags: added: neutron-proactive-backport-potential
Slawek Kaplonski (slaweq) wrote :

I haven't seen such failures in a long time. Let's close this bug now.

Changed in neutron:
status: In Progress → Incomplete
tags: removed: neutron-proactive-backport-potential