Create router failed with The server didn't respond in time.: 504 Gateway Time-out

Bug #1937035 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

In FS02, the pre-tempest resource creation tasks are failing at the "Create router" task [1]:
```
2021-07-21 00:26:30.827471 | primary | TASK [os_tempest : Create router] **********************************************
2021-07-21 00:26:30.827706 | primary | Wednesday 21 July 2021 00:26:30 +0000 (0:02:13.700) 1:59:31.652 ********
2021-07-21 00:28:33.108393 | primary | FAILED - RETRYING: Create router (5 retries left).
2021-07-21 00:30:45.819402 | primary | FAILED - RETRYING: Create router (4 retries left).
2021-07-21 00:32:58.284105 | primary | FAILED - RETRYING: Create router (3 retries left).
2021-07-21 00:35:10.658736 | primary | FAILED - RETRYING: Create router (2 retries left).
2021-07-21 00:37:23.397917 | primary | FAILED - RETRYING: Create router (1 retries left).
2021-07-21 00:39:35.447171 | primary | fatal: [undercloud -> undercloud]: FAILED! => {
2021-07-21 00:39:35.447471 | primary | "attempts": 5,
2021-07-21 00:39:35.447492 | primary | "changed": false,
2021-07-21 00:39:35.447502 | primary | "extra_data": {
2021-07-21 00:39:35.447512 | primary | "data": null,
2021-07-21 00:39:35.447521 | primary | "details": "The server didn't respond in time.: 504 Gateway Time-out",
2021-07-21 00:39:35.447531 | primary | "response": "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"
2021-07-21 00:39:35.447541 | primary | }
2021-07-21 00:39:35.447551 | primary | }
2021-07-21 00:39:35.447598 | primary |
2021-07-21 00:39:35.447607 | primary | MSG:
2021-07-21 00:39:35.447616 | primary |
2021-07-21 00:39:35.452745 | primary | Error updating router 422d9b28-9209-4541-b6b5-03afdfd95b21: Server Error for url: https://10.0.0.5:13696/v2.0/routers/422d9b28-9209-4541-b6b5-03afdfd95b21, The server didn't respond in time.: 504 Gateway Time-out
```
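For anyone trying to reproduce this outside of the os_tempest role, the task roughly boils down to a router create followed by an update against the Neutron API (the 504 above is raised on the update call). A minimal sketch with openstacksdk, assuming admin credentials in a clouds.yaml entry; the cloud name "overcloud" and the "public" network are placeholders:
```
# Hypothetical manual reproduction of the failing router create/update.
import openstack

conn = openstack.connect(cloud="overcloud")  # placeholder clouds.yaml entry

# The error above shows the 504 on a router *update*, which typically happens
# when the external gateway is attached after creation (assumption).
router = conn.network.create_router(name="tempest-router")
external = conn.network.find_network("public")  # placeholder network name
if external:
    router = conn.network.update_router(
        router, external_gateway_info={"network_id": external.id}
    )
print(router.id, router.status)
```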

While looking at the neutron server logs [2], we see two errors:
```
ERROR oslo.messaging._drivers.impl_rabbit [req-7343d331-7f68-447c-9b18-debe793a254a 4a212f6aa4c344ccb059b042b4d27310 a5f52d5a21eb4b2e9fc2cbc993ff7a9d - default default] Connection failed: [Errno 111] ECONNREFUSED (retrying in 32.0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
```

and

```
2021-07-21 00:16:27.712 30 DEBUG futurist.periodics [req-144966c1-a3ff-4472-a8ad-10bbd14dd235 - - - - -] Submitting periodic callback 'neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance.HashRingHealthCheckPeriodics.touch_hash_ring_nodes' _process_scheduled /usr/lib/python3.6/site-packages/futurist/periodics.py:642
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines [req-144966c1-a3ff-4472-a8ad-10bbd14dd235 - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
[SQL: SELECT 1]
(Background on this error at: http://sqlalche.me/e/e3q8)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines cursor, statement, parameters, context
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 163, in execute
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/cursors.py", line 321, in _query
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 505, in query
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 724, in _read_query_result
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines result.read()
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 1069, in read
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 646, in _read_packet
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3.6/site-packages/pymysql/connections.py", line 699, in _read_bytes
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2021-07-21 00:16:27.718 30 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
```
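The "Database connection was found disconnected; reconnecting" message comes from oslo.db's connection liveness check (the `SELECT 1` visible in the traceback): it pings a pooled connection and transparently reconnects when the ping fails. The same pattern at the plain SQLAlchemy level looks roughly like this; this is not the actual neutron code path, and the connection URL is a placeholder:
```
# Minimal sketch of the ping-and-reconnect pattern oslo.db applies.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://neutron:secret@10.0.0.5/ovs_neutron",  # placeholder URL
    pool_pre_ping=True,   # run a cheap "SELECT 1"-style ping before reuse
    pool_recycle=3600,    # recycle connections before the server drops them
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # same probe as in the traceback above
```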

It might be related to RabbitMQ. Filing this bug to track it further.

Links:
[1]. https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/0a1e869/job-output.txt

[2]. https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/0a1e869/logs/overcloud-controller-0/var/log/containers/neutron/server.log.txt.gz

Revision history for this message
Michele Baldessari (michele) wrote :

Just the usual podman timing out on all operations:
$ grep -ir -e 'exceeded timeout' -e 'timed out' journal.txt | wc -l
90

Jul 21 00:12:14 overcloud-controller-0 pacemaker-controld[26220]: notice: High CPU load detected: 8.280000
Jul 21 00:12:44 overcloud-controller-0 pacemaker-controld[26220]: notice: High CPU load detected: 8.270000
Jul 21 00:13:14 overcloud-controller-0 pacemaker-controld[26220]: notice: High CPU load detected: 7.850000
Jul 21 00:13:48 overcloud-controller-0 pacemaker-execd[26217]: warning: rabbitmq-bundle-podman-0_monitor_60000 process (PID 173756) timed out
Jul 21 00:13:48 overcloud-controller-0 pacemaker-execd[26217]: warning: rabbitmq-bundle-podman-0_monitor_60000[173756] timed out after 120000ms
Jul 21 00:13:48 overcloud-controller-0 pacemaker-controld[26220]: error: Result of monitor operation for rabbitmq-bundle-podman-0 on overcloud-controller-0: Timed Out
Jul 21 00:13:48 overcloud-controller-0 pacemaker-controld[26220]: notice: Transition 56 action 10 (rabbitmq-bundle-podman-0_monitor_60000 on overcloud-controller-0): expected 'ok' but got 'error'
Jul 21 00:13:48 overcloud-controller-0 pacemaker-controld[26220]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jul 21 00:13:48 overcloud-controller-0 pacemaker-attrd[26218]: notice: Setting fail-count-rabbitmq-bundle-podman-0#monitor_60000[overcloud-controller-0]: (unset) -> 1
Jul 21 00:13:48 overcloud-controller-0 pacemaker-attrd[26218]: notice: Setting last-failure-rabbitmq-bundle-podman-0#monitor_60000[overcloud-controller-0]: (unset) -> 1626826428

Jul 21 00:13:49 overcloud-controller-0 pacemaker-attrd[26218]: notice: Removing all rabbitmq-bundle-0 attributes for peer overcloud-controller-0
Jul 21 00:13:49 overcloud-controller-0 pacemaker-controld[26220]: notice: Result of stop operation for rabbitmq-bundle-0 on overcloud-controller-0: ok
Jul 21 00:13:49 overcloud-controller-0 pacemaker-fenced[26216]: notice: Node rabbitmq-bundle-0 state is now lost
Jul 21 00:13:49 overcloud-controller-0 pacemaker-controld[26220]: notice: Initiating stop operation rabbitmq-bundle-podman-0_stop_0 locally on overcloud-controller-0
Jul 21 00:13:49 overcloud-controller-0 pacemaker-controld[26220]: notice: Requesting local execution of stop operation for rabbitmq-bundle-podman-0 on overcloud-controller-0
Jul 21 00:13:50 overcloud-controller-0 pacemaker-execd[26217]: warning: redis-bundle-podman-0_monitor_60000 process (PID 173952) timed out
Jul 21 00:13:50 overcloud-controller-0 pacemaker-execd[26217]: warning: redis-bundle-podman-0_monitor_60000[173952] timed out after 120000ms
Jul 21 00:13:50 overcloud-controller-0 pacemaker-controld[26220]: error: Result of monitor operation for redis-bundle-podman-0 on overcloud-controller-0: Timed Out
Jul 21 00:13:50 overcloud-controller-0 pacemaker-controld[26220]: notice: Transition 75 aborted by operation redis-bundle-podman-0_monitor_60000 'create' on overcloud-controller-0: Change in recurring result
Jul 21 00:13:50 overcloud-controller-0 pacemaker-controld[26220]: notice: Transition 56 action 12 (redis-bundle-podman-0_monitor_60000 on overcloud-controller-0): expected 'ok' but got 'error'
Jul ...


Changed in tripleo:
milestone: xena-2 → xena-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)
Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think the issue is caused by the fact that dhcp_agent_notification in Neutron is set to True. Because of that, the DHCP notification methods are subscribed in https://opendev.org/openstack/neutron/src/branch/master/neutron/api/rpc/agentnotifiers/dhcp_rpc_agent_api.py#L104, and later, when a port is created, Neutron tries to send those notifications over RPC. But in an OVN deployment RPC is not available, so it takes a long time with retries and finally hits that timeout.

In devstack-based jobs this config option is disabled when OVN is used: https://github.com/openstack/devstack/blob/a5ed116814fa3a435f15231aa7b18d389f917844/lib/neutron_plugins/ovn_agent#L491

So I proposed a patch for tripleo-heat-templates to do something similar: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/801761
Let's see if that will solve the problem.
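
For reference, the net effect of that devstack setting (and of the proposed tripleo-heat-templates change) on the deployed controllers should be roughly the following in neutron.conf; the exact file rendered by TripleO is an assumption here:
```
[DEFAULT]
# With the OVN backend there are no DHCP agents listening on RPC, so sending
# these notifications only piles up retries until the API request times out.
dhcp_agent_notification = false
```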

Revision history for this message
Marios Andreou (marios-b) wrote :

Picking up as ruck this cycle and catching up.

I can only see this happening *once* in [1] (2021-07-20 22:11:26) - which is the example referenced in the description above [2].

After that we have 2 successful runs. Unless there is more than one example of this I don't know if we need this bug or any fixes here.

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby
[2] https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset002-wallaby/0a1e869/

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

We've hit an issue in one of the RDO gate jobs which seems similar. In case it's useful as a new data point:

https://logserver.rdoproject.org/46/34646/2/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/81841c9/logs/undercloud/var/log/extra/

It's a standalone job with OVN.

Revision history for this message
Marios Andreou (marios-b) wrote :

@Alfredo thanks, but can you please clarify which log or trace in particular you noticed? I can't find anything similar in [1].
I saw the MySQL error in [2], but that is during the deploy. The bug reported here is happening during the tempest tests.

I am not sure it is the same issue.

[1] https://logserver.rdoproject.org/46/34646/2/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/81841c9/job-output.txt
[2] https://logserver.rdoproject.org/46/34646/2/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/81841c9/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Sure. The overcloud deployment fails with:

oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.168.24.3' ([Errno 113] No route to host)")

Actually, there are other similar messages in other services which lost access to the DB:

2021-07-23 06:22:18.939 36 ERROR oslo_db.sqlalchemy.engines [-] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")

https://logserver.rdoproject.org/46/34646/2/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/81841c9/logs/undercloud/var/log/containers/aodh/aodh-evaluator.log.txt.gz

Note that the IP is not reachable. That IP is a pacemaker resource.

Looking at pacemaker messages in:

https://logserver.rdoproject.org/46/34646/2/check/rdoinfo-tripleo-master-testing-centos-8-scenario001-standalone/81841c9/logs/undercloud/var/log/extra/journal.txt.gz

I see a bunch of pacemaker timeouts and errors (that's the part I found similar):

Jul 23 06:20:59 standalone.localdomain pacemaker-controld[492447]: error: Result of start operation for openstack-cinder-backup-podman-0 on standalone: Timed Out
Jul 23 06:20:59 standalone.localdomain pacemaker-controld[492447]: notice: Transition 45 action 149 (openstack-cinder-backup-podman-0_start_0 on standalone): expected 'ok' but got 'error'

Jul 23 06:21:22 standalone.localdomain pacemaker-controld[492447]: notice: Transition 9 action 5 (haproxy-bundle-podman-0_monitor_60000 on standalone): expected 'ok' but got 'error'

Jul 23 06:22:10 standalone.localdomain pacemaker-controld[492447]: error: Result of stop operation for rabbitmq-bundle-podman-0 on standalone: Timed Out

Jul 23 06:21:22 standalone.localdomain pacemaker-controld[492447]: error: Result of monitor operation for haproxy-bundle-podman-0 on standalone: Timed Out

And resources were finally stopped.

Note this job is gating a minor update of OVS, which I guess is unrelated. That said, the issue may be related to pacemaker, but a different one from the one reported in this LP.

Jul 23 06:22:10 standalone.localdomain pacemaker-execd[492444]: warning: rabbitmq-bundle-podman-0_stop_0 process (PID 633224) timed out
Jul 23 06:22:10 standalone.localdomain pacemaker-execd[492444]: warning: rabbitmq-bundle-podman-0_stop_0[633224] timed out after 20000ms
Jul 23 06:22:10 standalone.localdomain pacemaker-controld[492447]: error: Result of stop operation for rabbitmq-bundle-podman-0 on standalone: Timed Out

Revision history for this message
Marios Andreou (marios-b) wrote :

The job for which this bug was filed, i.e. FS2 wallaby, is very green in the latest runs - see the attachment.

Per my earlier comment at https://bugs.launchpad.net/tripleo/+bug/1937035/comments/4 above, I think this was only seen once. I am going to move the bug to Incomplete for now; please move it back if you disagree. Thank you.

Changed in tripleo:
status: In Progress → Incomplete
status: Incomplete → Opinion
status: Opinion → Incomplete
Revision history for this message
yatin (yatinkarel) wrote :
Changed in tripleo:
status: Incomplete → New
Revision history for this message
Marios Andreou (marios-b) wrote :

@yatin, what was seen twice, and how can you be sure this is the same issue? The bug here was filed for the failing os_tempest router creation, and that was only ever seen once. I don't know that we can say they are the same just 'because pcs resources are in bad shape'.

I think the issue you are hitting may be the same as the one that Alfredo described in comment #7.

I think this bug is already confusing enough; can we please file a new one for your issue?

I am moving it back to Incomplete.

Changed in tripleo:
status: New → Incomplete
Revision history for this message
yatin (yatinkarel) wrote :

Reported a separate bug, https://bugs.launchpad.net/tripleo/+bug/1938283. The symptoms are the same, but in wallaby it failed even with the 120s timeout, while in master it's failing more frequently due to the lower timeout setting.

Changed in tripleo:
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/801761
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/79e24494bfcabd261e206aaa06db18758992fc65
Submitter: "Zuul (22348)"
Branch: master

commit 79e24494bfcabd261e206aaa06db18758992fc65
Author: Slawek Kaplonski <email address hidden>
Date: Thu Jul 22 12:32:37 2021 +0200

    Set DhcpAgentNotification to be "false" by default

    This option is set to true by default in Neutron but it should be
    disabled in case when OVN backend is used.
    It should be enabled only when "traditional" dhcp agents are used in
    Neutron.
    As OVN is default Neutron backend in the Tripleo, this patch sets
    DhcpAgentNotification option to be False by default and enables it only
    when neutron-dhcp-agent is going to be deployed.

    Closes-Bug: #1937035
    Change-Id: I5f8ad32a9fa888dec6b26c0185f8b506521bd6e3

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 15.1.0

This issue was fixed in the openstack/tripleo-heat-templates 15.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/869328

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/869328
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/8cc6d0bf9d139ce12beda9498324c59a436d9326
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 8cc6d0bf9d139ce12beda9498324c59a436d9326
Author: Slawek Kaplonski <email address hidden>
Date: Thu Jul 22 12:32:37 2021 +0200

    Set DhcpAgentNotification to be "false" by default

    This option is set to true by default in Neutron but it should be
    disabled in case when OVN backend is used.
    It should be enabled only when "traditional" dhcp agents are used in
    Neutron.
    As OVN is default Neutron backend in the Tripleo, this patch sets
    DhcpAgentNotification option to be False by default and enables it only
    when neutron-dhcp-agent is going to be deployed.

    Closes-Bug: #1937035
    Change-Id: I5f8ad32a9fa888dec6b26c0185f8b506521bd6e3
    (cherry picked from commit 79e24494bfcabd261e206aaa06db18758992fc65)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/869730
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/4bac4ec9531c43b83b73360d994daae53ae60cf8
Submitter: "Zuul (22348)"
Branch: stable/train

commit 4bac4ec9531c43b83b73360d994daae53ae60cf8
Author: Slawek Kaplonski <email address hidden>
Date: Thu Jul 22 12:32:37 2021 +0200

    Set DhcpAgentNotification to be "false" by default

    This option is set to true by default in Neutron but it should be
    disabled in case when OVN backend is used.
    It should be enabled only when "traditional" dhcp agents are used in
    Neutron.
    As OVN is default Neutron backend in the Tripleo, this patch sets
    DhcpAgentNotification option to be False by default and enables it only
    when neutron-dhcp-agent is going to be deployed.

    Closes-Bug: #1937035
    Change-Id: I5f8ad32a9fa888dec6b26c0185f8b506521bd6e3
    (cherry picked from commit 79e24494bfcabd261e206aaa06db18758992fc65)
    (cherry picked from commit 8cc6d0bf9d139ce12beda9498324c59a436d9326)
    Conflicts:
            environments/undercloud.yaml

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates train-eol

This issue was fixed in the openstack/tripleo-heat-templates train-eol release.
