CI: ovb-ha promotion job fails with 504 gateway timeout, neutron-server create-subnet timing out

Bug #1696094 reported by Sagi (Sergey) Shnaidman
Affects Status Importance Assigned to Milestone
neutron
Invalid
Critical
Unassigned
tripleo
Fix Released
Critical
Unassigned

Bug Description

http://logs.openstack.org/15/359215/106/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha/2ea94ab/console.html#_2017-06-05_23_52_38_539282

2017-06-05 23:50:34.148537 | +---------------------------+--------------------------------------+
2017-06-05 23:50:35.545475 | neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
2017-06-05 23:52:38.539282 | <html><body><h1>504 Gateway Time-out</h1>
2017-06-05 23:52:38.539408 | The server didn't respond in time.
2017-06-05 23:52:38.539437 | </body></html>

It happens where the subnet creation should occur.
I see an ovs-vsctl failure in the logs, but I'm not sure it isn't a red herring.

http://logs.openstack.org/15/359215/106/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha/2ea94ab/logs/controller-1-tripleo-ci-b-bar/var/log/messages

Jun 5 23:48:22 localhost ovs-vsctl: ovs|00001|vsctl|INFO|Called as /bin/ovs-vsctl --timeout=5 --id=@manager -- create Manager "target=\"ptcp:6640:127.0.0.1\"" -- add Open_vSwitch . manager_options @manager
Jun 5 23:48:22 localhost ovs-vsctl: ovs|00002|db_ctl_base|ERR|transaction error: {"details":"Transaction causes multiple rows in \"Manager\" table to have identical values (\"ptcp:6640:127.0.0.1\") for index on column \"target\". First row, with UUID 7e2b866a-40d5-4f9c-9e08-0be3bb34b199, existed in the database before this transaction and was not modified by the transaction. Second row, with UUID 49488cff-271a-457a-b1e7-e6ca3da6f069, was inserted by this transaction.","error":"constraint violation"}

Tags: ci
Revision history for this message
Alfredo Moralejo (amoralej) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Emilien Macchi (emilienm) wrote :

https://review.openstack.org/#/c/469650/ was backported to ocata and it seems to work fine in tripleo ci, so this is probably a different issue.

Revision history for this message
Emilien Macchi (emilienm) wrote :

The version of eventlet in TripleO may be too high: python2-eventlet-0.20.1-1.el7.noarch

Versus what is suggested by Neutron:
eventlet!=0.18.3,<0.21.0,>=0.18.2 # MIT
https://github.com/openstack/neutron/blob/master/requirements.txt#L10

So are we being provided a too-recent version of eventlet? Probably just another theory.

Revision history for this message
Emilien Macchi (emilienm) wrote :

Nevermind my last comment. The version is good.
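The check above can be done mechanically. A stdlib-only sketch of the comparison (the `satisfies` helper is hypothetical and only handles simple dotted versions; real specifier matching should use the `packaging` library, which pip uses):

```python
# Check whether the shipped eventlet version satisfies neutron's
# requirement "eventlet!=0.18.3,<0.21.0,>=0.18.2" using plain tuple
# comparison. Illustrative sketch only; does not handle pre-releases.

def vtuple(v):
    """Turn '0.20.1' into (0, 20, 1) for tuple comparison."""
    return tuple(int(p) for p in v.split("."))

def satisfies(version, excluded, lower, upper):
    v = vtuple(version)
    return v != vtuple(excluded) and vtuple(lower) <= v < vtuple(upper)

# python2-eventlet-0.20.1-1.el7 ships eventlet 0.20.1:
print(satisfies("0.20.1", excluded="0.18.3",
                lower="0.18.2", upper="0.21.0"))
# -> True: 0.20.1 is inside the allowed range, so the version is good
```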

Revision history for this message
Emilien Macchi (emilienm) wrote :
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

From a look at the logs, it seems the real problem is that the request takes 2+ minutes, so haproxy gives up on neutron.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

It seems to me that Neutron is either stuck or slow. The client times out and disconnects, and by the time the server has the reply ready, it can't send it back because the connection has already been terminated on the client side.

http://logs.openstack.org/39/471139/2/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha/4852702/console.html#_2017-06-06_12_00_32_937296

http://logs.openstack.org/39/471139/2/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha/4852702/logs/controller-0-tripleo-ci-a-foo/var/log/neutron/server.txt.gz#_2017-06-06_12_00_34_330

The client times out at 12:00:32.937296 and the server replies at 12:00:34.330.
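The race can be confirmed directly from the two timestamps quoted above; a small stdlib sketch of the arithmetic:

```python
from datetime import datetime

# The client (via haproxy) gives up before neutron-server finishes the
# request. Timestamps are copied from the linked console and server logs.
FMT = "%Y-%m-%d %H:%M:%S.%f"
client_timeout = datetime.strptime("2017-06-06 12:00:32.937296", FMT)
server_reply = datetime.strptime("2017-06-06 12:00:34.330", FMT)

late_by = (server_reply - client_timeout).total_seconds()
print("server reply came %.3f s after the client disconnected" % late_by)
# the reply lands ~1.39 s after the client already gave up
```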

Revision history for this message
Jakub Libosvar (libosvar) wrote :

The "event-dispatch" lock is held for over a minute. There is nothing in the logs in between; it's unclear to me what happened between 11:59:30 and 12:00:30 on the server side.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I checked several logs from different runs; all of them fail on the subnet request, and all of them hold the event-dispatch lock for ~60 seconds. Why exactly 60?

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

As per @amoralej, the diff between good and bad neutron versions is 3f7815a..ded39ea

Assaf Muller (amuller)
summary: - CI: ovb-ha promotion job fails with 504 gateway timeout
+ CI: ovb-ha promotion job fails with 504 gateway timeout, neutron-server
+ create-subnet timing out
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
tags: added: gate-failure
tags: added: api
tags: removed: api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/471446

Revision history for this message
Brian Haley (brian-haley) wrote :

Just wanted to throw a rough timeline out there, taken from http://logs.openstack.org/15/359215/106/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha/2ea94ab/logs/controller-0-tripleo-ci-a-foo/var/log/neutron/server.txt.gz

2017-06-05 23:50:38.513 170146 DEBUG neutron.api.v2.base [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "subnet create"

2017-06-05 23:50:38.571 170146 DEBUG neutron.pecan_wsgi.hooks.quota_enforcement [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "quota enforcement - before"

60-second delay, possibly DB transaction?

2017-06-05 23:51:38.842 170146 DEBUG neutron.db.ipam_backend_mixin [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "IP allocation"

2017-06-05 23:51:39.296 170146 DEBUG neutron.scheduler.dhcp_agent_scheduler [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "DHCP scheduling"

60-second delay, event dispatcher? But RPC is using cast()

2017-06-05 23:52:39.253 170146 DEBUG oslo_concurrency.lockutils [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "event-dispatch lock released, held 60.235s"

2017-06-05 23:52:39.389 170146 INFO neutron.wsgi [req-a6e39b95-98b4-455e-a21a-f051a39ce1e2 "connection reset"

I'll start looking at the code diffs, but this might be related to a change in a package.
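The stalls in the timeline above can also be found mechanically by scanning for large gaps between consecutive log timestamps. A sketch, using abbreviated copies of the log lines from this comment (the 30-second threshold is an arbitrary choice for illustration):

```python
import re
from datetime import datetime

# Abbreviated oslo-style log lines for req-a6e39b95, as quoted above.
LOG = """\
2017-06-05 23:50:38.513 subnet create
2017-06-05 23:50:38.571 quota enforcement - before
2017-06-05 23:51:38.842 IP allocation
2017-06-05 23:51:39.296 DHCP scheduling
2017-06-05 23:52:39.253 event-dispatch lock released, held 60.235s
"""

TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (.*)")

def gaps(log, threshold=30.0):
    """Return (before-msg, after-msg, seconds) for each large gap."""
    prev, out = None, []
    for line in log.splitlines():
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev and (ts - prev[0]).total_seconds() > threshold:
            out.append((prev[1], m.group(2),
                        (ts - prev[0]).total_seconds()))
        prev = (ts, m.group(2))
    return out

for before, after, delta in gaps(LOG):
    print("%.1f s gap between %r and %r" % (delta, before, after))
# flags the two ~60 s stalls: quota enforcement -> IP allocation,
# and DHCP scheduling -> event-dispatch lock release
```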

Revision history for this message
Alan Pevec (apevec) wrote :

Trying to test related fix from comment 12 in https://review.openstack.org/471485

Revision history for this message
Alan Pevec (apevec) wrote :

> might be related to change in a package.

Which package do you think?

Revision history for this message
Alan Pevec (apevec) wrote :

Looks like the related fix worked; the oooq ovb-ha job passed. Doing a few more rechecks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/471345

Revision history for this message
Alfredo Moralejo (amoralej) wrote :
Revision history for this message
Alan Pevec (apevec) wrote :

collecting more debugging info/ideas in https://trello.com/b/WXJTwsuU/tripleo-and-rdo-ci-status

Revision history for this message
Brian Haley (brian-haley) wrote :

Alan - sorry, I didn't notice your comment. I was just looking at anything that might have changed in the package area, and there had been a requirements.txt change on May 26:

-oslo.context>=2.12.0 # Apache-2.0
-oslo.db>=4.19.0 # Apache-2.0
-oslo.i18n>=2.1.0 # Apache-2.0
+oslo.context>=2.14.0 # Apache-2.0
+oslo.db>=4.21.1 # Apache-2.0
+oslo.i18n!=3.15.2,>=2.1.0 # Apache-2.0
 oslo.log>=3.22.0 # Apache-2.0
 oslo.messaging>=5.19.0 # Apache-2.0
-oslo.middleware>=3.10.0 # Apache-2.0
+oslo.middleware>=3.27.0 # Apache-2.0

Let me look at the logs mentioned in comment #18.

Changed in tripleo:
milestone: pike-2 → pike-3
Le Hou (legeshuang)
Changed in neutron:
status: Confirmed → Incomplete
status: Incomplete → Confirmed
information type: Public → Public Security
Revision history for this message
Jeremy Stanley (fungi) wrote :

I've switched this report from Public Security back to Public on the assumption it was an accident. If you really consider this report to represent a suspected security vulnerability for Neutron, please include a rationale for that when setting it to Public Security again and the VMT will evaluate it for a possible advisory.

information type: Public Security → Public
Revision history for this message
Alfredo Moralejo (amoralej) wrote :

The root cause is that eventlet tries to resolve host names in DNS first and only then in /etc/hosts. This means that when the controllers can't reach the DNS servers for some reason, the first time a process sends a message it has to wait for the DNS call to time out before falling back to /etc/hosts. In an HA environment that adds up to 60 seconds, as it has to resolve the three rabbitmq controllers.

This is reported in https://github.com/eventlet/eventlet/issues/416 and appears to be fixed by https://github.com/eventlet/eventlet/commit/503aed1310b5e4a2fad9f5a0909663636a4e8752

I've reproduced this with a glance image-creation operation. These are the times it takes with the different versions of eventlet:

- eventlet 0.18.4: ~ 2 seconds
- eventlet 0.19.0: ~ 2 seconds
- eventlet 0.20.0: ~ 32 seconds
- eventlet 0.20.1: ~ 62 seconds
- eventlet 0.21.0: ~ 32 seconds

Reverting to eventlet 0.20.0 would improve the situation, although it doesn't fully fix the issue.
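The arithmetic behind the ~60 s stall can be sketched as a toy model. The 20-second per-lookup timeout and the controller host names below are assumptions, chosen only so that three failed lookups match the observed total; real timeouts depend on resolv.conf and dnspython settings:

```python
# Toy model of the resolver-order bug: with unreachable DNS servers,
# every lookup pays the full DNS timeout before the /etc/hosts fallback
# answers instantly. Hosts-first order (the eventlet fix) skips the wait.
DNS_TIMEOUT = 20.0     # assumed per-lookup timeout when DNS is down
HOSTS_FILE_COST = 0.0  # /etc/hosts lookup is effectively instant

def lookup_cost(dns_reachable, dns_first=True):
    """Seconds to resolve one name, given resolver order and DNS health."""
    if dns_reachable:
        return 0.1  # fast DNS answer (illustrative)
    if dns_first:
        return DNS_TIMEOUT + HOSTS_FILE_COST  # wait out DNS, then fall back
    return HOSTS_FILE_COST  # /etc/hosts consulted first

# Hypothetical names for the three rabbitmq controllers:
controllers = ["overcloud-controller-%d" % i for i in range(3)]

broken = sum(lookup_cost(dns_reachable=False, dns_first=True)
             for _ in controllers)
fixed = sum(lookup_cost(dns_reachable=False, dns_first=False)
            for _ in controllers)
print("DNS-first with dead DNS: ~%.0f s; hosts-first: ~%.0f s"
      % (broken, fixed))
# -> ~60 s vs ~0 s, matching the stall seen before the eventlet fix
```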

Revision history for this message
Alan Pevec (apevec) wrote :

python-eventlet-0.20.1-2.el7 with the backport is being pushed to RDO Pike repo

Revision history for this message
Alan Pevec (apevec) wrote :

The eventlet backport is not enough; the tripleo-ci change https://review.openstack.org/472975 is also required.

tags: removed: alert gate-failure promotion-blocker
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
wes hayutin (weshayutin) wrote : Re: [Bug 1696094] Re: CI: ovb-ha promotion job fails with 504 gateway timeout, neutron-server create-subnet timing out

Sorry, I missed the configuration file. The pacemaker.yml review is in place now; it just needs a merge.


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/471446
Reason: Turned out it's not relevant, and there is no clear reason to complicate the patching like that if it doesn't help the original issue. Abandoning.

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

It was not a neutron bug but an eventlet/DNS issue, so I'm marking the bug as Invalid for neutron.

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/requirements 1.1.0

This issue was fixed in the openstack/requirements 1.1.0 release.
