Error overcloud network provision failed

Bug #1980333 reported by Amol Kahat
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

Description:
Error overcloud network provision failed

Error:
2022-06-28 19:07:46.849555 | fa163e5d-2450-90c1-aa70-00000000000e | FATAL | Create/Update composable networks | localhost | item={'name': 'External', 'name_lower': 'external', 'mtu': 1350, 'vip': True, 'subnets': {'external_subnet': {'ip_subnet': '10.0.0.0/24', 'gateway_ip': '10.0.0.1', 'allocation_pools': [{'start': '10.0.0.10', 'end': '10.0.0.50'}]}}} | error={"ansible_index_var": "idx", "ansible_loop_var": "item", "changed": true, "error": "HttpException: 504: Server Error for url: https://192.168.24.2:13696/v2.0/segments, The server didn't respond in time.: 504 Gateway Time-out", "idx": 0, "item": {"mtu": 1350, "name": "External", "name_lower": "external", "subnets": {"external_subnet": {"allocation_pools": [{"end": "10.0.0.50", "start": "10.0.0.10"}], "gateway_ip": "10.0.0.1", "ip_subnet": "10.0.0.0/24"}}, "vip": true}, "msg": "Error overcloud network provision failed!", "success": false}
2022-06-28 19:07:46.851704 | fa163e5d-2450-90c1-aa70-00000000000e | TIMING | Create/Update composable networks | localhost | 0:02:05.283186 | 124.77s
2022-06-28 19:09:52.752976 | fa163e5d-2450-90c1-aa70-00000000000e | FATAL | Create/Update composable networks | localhost | item={'name': 'InternalApi', 'name_lower': 'internal_api', 'mtu': 1350, 'vip': True, 'subnets': {'internal_api_subnet': {'ip_subnet': '172.17.0.0/24', 'allocation_pools': [{'start': '172.17.0.10', 'end': '172.17.0.250'}]}}} | error={"ansible_index_var": "idx", "ansible_loop_var": "item", "changed": true, "error": "HttpException: 504: Server Error for url: https://192.168.24.2:13696/v2.0/subnets, 504 Gateway Time-out: The server didn't respond in time.", "idx": 1, "item": {"mtu": 1350, "name": "InternalApi", "name_lower": "internal_api", "subnets": {"internal_api_subnet": {"allocation_pools": [{"end": "172.17.0.250", "start": "172.17.0.10"}], "ip_subnet": "172.17.0.0/24"}}, "vip": true}, "msg": "Error overcloud network provision failed!", "success": false}
2022-06-28 19:09:52.754270 | fa163e5d-2450-90c1-aa70-00000000000e | TIMING | Create/Update composable networks | localhost | 0:04:11.185760 | 250.68s
2022-06-28 19:11:58.275322 | fa163e5d-2450-90c1-aa70-00000000000e | FATAL | Create/Update composable networks | localhost | item={'name': 'Storage', 'name_lower': 'storage', 'mtu': 1350, 'vip': True, 'subnets': {'storage_subnet': {'ip_subnet': '172.18.0.0/24', 'allocation_pools': [{'start': '172.18.0.10', 'end': '172.18.0.250'}]}}} | error={"ansible_index_var": "idx", "ansible_loop_var": "item", "changed": true, "error": "HttpException: 504: Server Error for url: https://192.168.24.2:13696/v2.0/subnets, 504 Gateway Time-out: The server didn't respond in time.", "idx": 2, "item": {"mtu": 1350, "name": "Storage", "name_lower": "storage", "subnets": {"storage_subnet": {"allocation_pools": [{"end": "172.18.0.250", "start": "172.18.0.10"}], "ip_subnet": "172.18.0.0/24"}}, "vip": true}, "msg": "Error overcloud network provision failed!", "success": false}
2022-06-28 19:11:58.276451 | fa163e5d-2450-90c1-aa70-00000000000e | TIMING | Create/Update composable networks | localhost | 0:06:16.707946 | 376.20s
2022-06-28 19:14:03.674507 | fa163e5d-2450-90c1-aa70-00000000000e | FATAL | Create/Update composable networks | localhost | item={'name': 'StorageMgmt', 'name_lower': 'storage_mgmt', 'mtu': 1350, 'vip': True, 'subnets': {'storage_mgmt_subnet': {'ip_subnet': '172.19.0.0/24', 'allocation_pools': [{'start': '172.19.0.10', 'end': '172.19.0.250'}]}}} | error={"ansible_index_var": "idx", "ansible_loop_var": "item", "changed": true, "error": "HttpException: 504: Server Error for url: https://192.168.24.2:13696/v2.0/subnets, 504 Gateway Time-out: The server didn't respond in time.", "idx": 3, "item": {"mtu": 1350, "name": "StorageMgmt", "name_lower": "storage_mgmt", "subnets": {"storage_mgmt_subnet": {"allocation_pools": [{"end": "172.19.0.250", "start": "172.19.0.10"}], "ip_subnet": "172.19.0.0/24"}}, "vip": true}, "msg": "Error overcloud network provision failed!", "success": false}
2022-06-28 19:14:03.675681 | fa163e5d-2450-90c1-aa70-00000000000e | TIMING | Create/Update composable networks | localhost | 0:08:22.107174 | 501.60s
2022-06-28 19:14:09.807277 | fa163e5d-2450-90c1-aa70-00000000000e | CHANGED | Create/Update composable networks | localhost | item={'name': 'Tenant', 'name_lower': 'tenant', 'mtu': 1350, 'subnets': {'tenant_subnet': {'ip_subnet': '172.16.0.0/24', 'allocation_pools': [{'start': '172.16.0.10', 'end': '172.16.0.250'}]}}}
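
Each failing item above is one POST to the neutron endpoint at https://192.168.24.2:13696 that comes back as a 504 after roughly 120 seconds. Below is a minimal diagnostic sketch (not from the original report) that replays the kind of network/subnet creates the task makes directly with openstacksdk and times them, so a slow neutron-server can be spotted without a full Ansible run; the 'undercloud' cloud name and the 'external-repro' network name are assumptions, the CIDR/MTU/pool values are copied from the failing External item:

# Diagnostic sketch (assumption: an "undercloud" entry exists in clouds.yaml on the
# undercloud node). It issues the same style of network/subnet create calls as the
# "Create/Update composable networks" task and reports how long neutron takes, so a
# slow response can be compared against the ~120 s haproxy timeout seen above.
import time
import openstack

conn = openstack.connect(cloud='undercloud')

start = time.monotonic()
net = conn.network.create_network(name='external-repro', mtu=1350)
print(f"create_network took {time.monotonic() - start:.1f}s")

start = time.monotonic()
subnet = conn.network.create_subnet(
    network_id=net.id,
    ip_version=4,
    cidr='10.0.0.0/24',
    gateway_ip='10.0.0.1',
    allocation_pools=[{'start': '10.0.0.10', 'end': '10.0.0.50'}],
)
print(f"create_subnet took {time.monotonic() - start:.1f}s")

# Clean up the throwaway resources.
conn.network.delete_subnet(subnet, ignore_missing=True)
conn.network.delete_network(net, ignore_missing=True)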

Additional Info:
2022-06-29 10:36:55.618 106980 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/37f83309-ea94-431a-8b0b-9ae1aabdf74d/pid; Error: [Errno 2] No such file or directory: '/var/lib/neutron/dhcp/37f83309-ea94-431a-8b0b-9ae1aabdf74d/pid' get_value_from_file /usr/lib/python3.9/site-packages/neutron/agent/linux/utils.py:252

2022-06-29 10:37:03.611 106980 DEBUG neutron.agent.linux.utils [None req-6c8ba203-c6c2-4bf6-87fe-ade60ad19bb3 - - - - - -] Unable to access /var/lib/neutron/external/pids/37f83309-ea94-431a-8b0b-9ae1aabdf74d.pid.haproxy; Error: [Errno 2] No such file or directory: '/var/lib/neutron/external/pids/37f83309-ea94-431a-8b0b-9ae1aabdf74d.pid.haproxy' get_value_from_file /usr/lib/python3.9/site-packages/neutron/agent/linux/utils.py:252

Logs:
- https://logserver.rdoproject.org/54/36254/128/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master/27b7b63/logs/undercloud/home/zuul/overcloud_network_provision.log.txt.gz
- https://logserver.rdoproject.org/37/41437/14/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master/2a0b814/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

Revision history for this message
Jakob Meng (jm1337) wrote :

Job periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master passed once [1] and then failed in the testproject [2], although both the commit_hash e9a479519f0b7eb9b6aa86811ad992f9d5626111 and the distro_hash 21f87ec3c18ca01bd0681ad8c14578a6ff52f012 match?!?

[1] https://logserver.rdoproject.org/openstack-component-network/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master/c2a9501/job-output.txt

[2] https://logserver.rdoproject.org/37/41437/15/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master/053d715/job-output.txt

Revision history for this message
Rabi Mishra (rabi) wrote :

Yeah, probably some transient "504 Gateway Time-out" error. I could not see anything wrong in the neutron logs.

Ronelle Landy (rlandy)
Changed in tripleo:
status: Confirmed → Triaged
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :
Revision history for this message
Rabi Mishra (rabi) wrote :

I see those 504 errors in the haproxy logs and then a bunch of connection reset errors in the neutron logs, which I'm not sure are related or just spurious. Someone from the neutron team would have a better idea, though.

Jul 1 03:14:50 undercloud haproxy[7]: 192.168.24.2:43668 [01/Jul/2022:03:12:50.058] neutron~ neutron_be/undercloud.ctlplane.localdomain 0/0/0/-1/120002 504 198 - - sH-- 25/1/0/0/0 0/0 "POST /v2.0/segments HTTP/1.1"
Jul 1 03:18:59 undercloud haproxy[7]: 192.168.24.2:39230 [01/Jul/2022:03:16:59.652] neutron~ neutron_be/undercloud.ctlplane.localdomain 0/0/0/-1/120002 504 198 - - sH-- 28/1/0/0/0 0/0 "POST /v2.0/subnets HTTP/1.1"
Jul 1 03:21:04 undercloud haproxy[7]: 192.168.24.2:50680 [01/Jul/2022:03:19:04.889] neutron~ neutron_be/undercloud.ctlplane.localdomain 0/0/0/-1/120001 504 198 - - sH-- 27/1/0/0/0 0/0 "POST /v2.0/subnets HTTP/1.1"
Jul 1 03:23:10 undercloud haproxy[7]: 192.168.24.2:54174 [01/Jul/2022:03:21:10.279] neutron~ neutron_be/undercloud.ctlplane.localdomain 0/0/0/-1/120001 504 198 - - sH-- 27/1/0/0/0 0/0 "POST /v2.0/subnets HTTP/1.1"

2022-07-01 03:14:54.733 15 INFO neutron.wsgi [None req-47641095-9001-4688-a680-995651dc1320 - fake_project_id - - - -] Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/eventlet/wsgi.py", line 605, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python3.9/site-packages/eventlet/wsgi.py", line 533, in write
    wfile.flush()
  File "/usr/lib64/python3.9/socket.py", line 722, in write
    return self._sock.send(b)
  File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 401, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python3.9/site-packages/eventlet/greenio/base.py", line 388, in _send_loop
    return send_method(data, *args)
ConnectionResetError: [Errno 104] Connection reset by peer
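
Reading the two excerpts together: haproxy gives up server-side after 120000 ms (termination state sH--), returns the 504 and drops the connection, and when the neutron wsgi worker eventually finishes and writes its response the peer is already gone, which is exactly the ConnectionResetError above. A minimal, self-contained sketch of that race follows, using plain sockets rather than neutron or haproxy code; the 1 s "proxy timeout" and the 2 s "slow backend" delay are invented for illustration:

# Illustration only (not neutron/haproxy code) of why a proxy-side timeout shows up
# as "ConnectionResetError: [Errno 104]" in the backend's log: the client (standing
# in for haproxy) gives up after its timeout and resets the connection, and the slow
# backend only notices when it finally tries to write its response.
import socket
import struct
import threading
import time

def slow_backend(server_sock):
    conn, _ = server_sock.accept()
    conn.recv(1024)                 # read the "request"
    time.sleep(2)                   # pretend the request takes 2 s (e.g. DB retries)
    try:
        conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n')
    except OSError as exc:
        print(f"backend write failed: {exc!r}")   # typically errno 104
    finally:
        conn.close()

server = socket.socket()
server.bind(('127.0.0.1', 0))
server.listen(1)
threading.Thread(target=slow_backend, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
client.sendall(b'POST /v2.0/segments HTTP/1.1\r\n\r\n')
time.sleep(1)                       # the proxy's (much shorter, illustrative) timeout
# Reset the connection on close, as a proxy abandoning a stuck backend would.
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
client.close()

time.sleep(2)                       # give the backend time to hit the failed write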

Revision history for this message
yatin (yatinkarel) wrote :

This issue is triggered by neutron-lib-21.0.0 (updated in RDO on 14th June; the issue has been seen since then), specifically https://review.opendev.org/c/openstack/neutron-lib/+/828738. I think https://review.opendev.org/c/openstack/neutron/+/848396 will fix it.

Also pushed https://review.rdoproject.org/r/c/testproject/+/43803 to test the potential fix.

The actual errors are [1]:
2022-06-29 15:14:02.464 15 DEBUG neutron_lib.callbacks.manager [None req-1a997528-ef5d-4ebe-87a1-8bdb29a2b4f8 - fake_project_id - - - -] Callback neutron.plugins.ml2.plugin.Ml2Plugin._handle_segment_change-956301 raised (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE standardattributes SET revision_number=%(revision_number)s, updated_at=%(updated_at)s WHERE standardattributes.id = %(standardattributes_id)s AND standardattributes.revision_number = %(standardattributes_revision_number)s]
[parameters: {'revision_number': 4, 'updated_at': datetime.datetime(2022, 6, 29, 15, 13, 11), 'standardattributes_id': 10, 'standardattributes_revision_number': 3}]
(Background on this error at: https://sqlalche.me/e/14/e3q8) _notify_loop /usr/lib/python3.9/site-packages/neutron_lib/callbacks/manager.py:193
2022-06-29 15:14:02.466 15 DEBUG neutron_lib.db.api [None req-1a997528-ef5d-4ebe-87a1-8bdb29a2b4f8 - fake_project_id - - - -] Retry wrapper got retriable exception: Callback neutron.plugins.ml2.plugin.Ml2Plugin._handle_segment_change-956301 failed with "(pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: UPDATE standardattributes SET revision_number=%(revision_number)s, updated_at=%(updated_at)s WHERE standardattributes.id = %(standardattributes_id)s AND standardattributes.revision_number = %(standardattributes_revision_number)s]
[parameters: {'revision_number': 4, 'updated_at': datetime.datetime(2022, 6, 29, 15, 13, 11), 'standardattributes_id': 10, 'standardattributes_revision_number': 3}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)" wrapped /usr/lib/python3.9/site-packages/neutron_lib/db/api.py:188
2022-06-29 15:14:02.466 15 DEBUG oslo_db.api [None req-1a997528-ef5d-4ebe-87a1-8bdb29a2b4f8 - fake_project_id - - - -] Performing DB retry for function neutron.api.v2.base.Controller._create wrapper /usr/lib/python3.9/site-packages/oslo_db/api.py:155
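
The UPDATE in that trace is neutron's optimistic bump of standardattributes.revision_number: while another transaction holds the row lock it waits until MySQL's innodb_lock_wait_timeout (50 s by default) fires with error 1205, and the retry wrapper then re-runs the operation. A couple of such lock waits plus retries easily exceed haproxy's 120 s server timeout, which is how the client ends up with the 504s. A rough sketch of that compare-and-swap-plus-retry pattern in plain Python follows (illustration only, not the actual neutron_lib/oslo.db code; bump_revision, with_db_retry and the constants are made-up names):

# Rough illustration (hypothetical names) of the "Retry wrapper got retriable
# exception" behaviour from the log: a compare-and-swap UPDATE on the row's
# revision_number that, under contention, hits MySQL error 1205 and is retried.
import time
import pymysql

MAX_RETRIES = 10           # bounded retries, oslo.db-style
LOCK_WAIT_TIMEOUT = 1205   # "Lock wait timeout exceeded; try restarting transaction"

def bump_revision(conn, attr_id, expected_revision):
    """Optimistically bump revision_number; returns True if our write won."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE standardattributes "
            "SET revision_number = %s, updated_at = NOW() "
            "WHERE id = %s AND revision_number = %s",
            (expected_revision + 1, attr_id, expected_revision),
        )
        return cur.rowcount == 1

def with_db_retry(func, *args):
    """Re-run the whole operation on retriable DB errors, as the retry wrapper does."""
    for attempt in range(MAX_RETRIES):
        try:
            return func(*args)
        except pymysql.err.OperationalError as exc:
            if exc.args[0] != LOCK_WAIT_TIMEOUT:
                raise
            # Each failed attempt has already waited ~innodb_lock_wait_timeout
            # (50 s by default), so a few retries blow past haproxy's 120 s budget.
            time.sleep(0.5 * (attempt + 1))
    raise RuntimeError("gave up after repeated lock wait timeouts")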

Regarding https://bugs.launchpad.net/tripleo/+bug/1980333/comments/1 (the same hash passing and then failing in different job runs): the neutron-server container did not get updated (it still had neutron-lib-20.0.0) in the passing job, so the issue was not hit there. That, however, is an issue with the container update tooling.

[1] https://logserver.rdoproject.org/37/41437/15/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-network-master/053d715/logs/undercloud/var/log/containers/neutron/server.log.txt.gz

Revision history for this message
Rabi Mishra (rabi) wrote :

>The actual errors are [1]:
>2022-06-29 15:14:02.464 15 DEBUG neutron_lib.callbacks.manager [None

So neutron-lib traces errors at DEBUG level ;) I saw those and assumed they were tolerated.
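
For what it's worth, that is the usual callback-manager pattern: subscriber exceptions are typically caught, reported at DEBUG and left for higher-level machinery (here the DB retry wrapper) to act on, so nothing surfaces at ERROR level unless the retries are finally exhausted. A toy illustration of the pattern (not the real neutron_lib code; notify_loop is a made-up name):

# Toy illustration (hypothetical, not the neutron_lib implementation) of a notify
# loop that swallows subscriber exceptions and only logs them at DEBUG, which is
# why the underlying DB errors never appear at ERROR level in the server log.
import logging

LOG = logging.getLogger(__name__)

def notify_loop(callbacks, resource, event, payload):
    errors = []
    for callback in callbacks:
        try:
            callback(resource, event, payload)
        except Exception as exc:                      # broad on purpose
            LOG.debug("Callback %s raised %s", callback.__name__, exc)
            errors.append(exc)
    return errors  # the caller (e.g. a retry wrapper) decides what to do with these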

Revision history for this message
yatin (yatinkarel) wrote :
Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

Thank you Yatin and Rabi for your help on this bug.

Closing this bug as the issue is resolved.

Changed in tripleo:
status: Triaged → Fix Released