CI: No connected gearman servers

Bug #1594732 reported by Sagi (Sergey) Shnaidman
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

CI jobs fail because no Gearman servers are found:

2016-06-21 08:23:19.452592 + ./testenv-client -b 192.168.1.1:4730 -t 10200 -- ./toci_instack.sh
2016-06-21 08:23:19.453958 + sleep 1200
2016-06-21 08:23:49.526973 2016-06-21 08:23:49,526 - gear.Client.unknown - ERROR - Connection <gear.Connection 0x2852990 host: 192.168.1.1 port: 4730> timed out waiting for a response to a submit job request: <gear.Job 0x2852950 handle: None name: lockenv unique: None>
2016-06-21 08:23:49.527528 Traceback (most recent call last):
2016-06-21 08:23:49.527583 File "./testenv-client", line 183, in <module>
2016-06-21 08:23:49.527610 exit(main())
2016-06-21 08:23:49.527833 File "./testenv-client", line 160, in main
2016-06-21 08:23:49.527873 client.submitJob(job)
2016-06-21 08:23:49.527922 File "/usr/lib/python2.7/site-packages/gear/__init__.py", line 1427, in submitJob
2016-06-21 08:23:49.528186 conn = self.getConnection()
2016-06-21 08:23:49.528295 File "/usr/lib/python2.7/site-packages/gear/__init__.py", line 1226, in getConnection
2016-06-21 08:23:49.528843 raise NoConnectedServersError("No connected Gearman servers")
2016-06-21 08:23:49.528960 gear.NoConnectedServersError: No connected Gearman servers
2016-06-21 08:43:19.458345 + '[' '!' -e /tmp/toci.started ']'
2016-06-21 08:43:19.458427 + sudo kill -9 17124
2016-06-21 08:43:19.484112 bash: line 1: 17124 Killed bash -xe /opt/stack/new/tripleo-ci/toci_gate_test.sh

http://logs.openstack.org/97/331997/1/check-tripleo/gate-tripleo-ci-centos-7-nonha-mitaka/9d5c14d/console.html#_2016-06-21_08_26_34_564945

Changed in tripleo:
assignee: nobody → Sagi (Sergey) Shnaidman (sshnaidm)
status: New → Confirmed
Changed in tripleo:
importance: Undecided → High
Changed in tripleo:
importance: High → Critical
Changed in tripleo:
milestone: none → newton-2
tags: removed: alert
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Patch to install gear from pip: https://review.openstack.org/#/c/332123/
On hold, to check if it helped.

Derek Higgins (derekh) wrote :

The problem is that geard on 192.168.1.1 keeps running out of file handles. Something is causing it to keep old TCP connections open.

This only started in the last week or so. Pretty much every day this week I've had to restart geard on that server.

It's possible the problem is in some way correlated with the full switchover to Zuul (we're no longer using Jenkins), but if this has been the trigger I'm not sure exactly why.
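
For anyone wanting to confirm the file-handle exhaustion described above, here is a minimal sketch that counts a process's open file descriptors and reads its soft limit. It assumes a Linux host with /proc mounted; the helper name `fd_usage` is hypothetical and is not part of gear or tripleo-ci.

```python
import os


def fd_usage(pid=None):
    """Return (open_fds, soft_limit) for a process, using Linux /proc.

    soft_limit is None when the limit is reported as "unlimited".
    """
    pid = os.getpid() if pid is None else pid

    # Each entry in /proc/<pid>/fd is one open file descriptor
    # (sockets included, which is what matters for geard).
    open_fds = len(os.listdir("/proc/%d/fd" % pid))

    soft_limit = None
    with open("/proc/%d/limits" % pid) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Line layout: "Max open files  <soft>  <hard>  files"
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit
```

Calling `fd_usage(pid)` with the PID of the geard process (reading another user's process may need root) and watching the first value climb toward the second would match the symptom reported here.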

Emilien Macchi (emilienm) wrote :
tags: added: alert
tags: removed: alert
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

I patched geard to enable TCP keepalives on its sockets, so that connections from dead peers get removed. Let's see if it helps.
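
As an illustration of the keepalive idea in the comment above, here is a minimal sketch of enabling TCP keepalives on a Python socket. This is not the actual gear patch; the function name `enable_keepalive` and the timing values are assumptions chosen for the example, and the `TCP_KEEP*` options are only set where the platform exposes them (they are Linux-specific).

```python
import socket


def enable_keepalive(sock, idle=7200, interval=75, count=9):
    """Turn on TCP keepalive probes for an existing socket.

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the kernel drops the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The fine-grained timers are not available on every platform,
    # so only set them when the constants exist (assumption: Linux).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

With keepalives on, the kernel closes a connection whose peer stops answering probes, which lets the server notice the dead socket and release its file descriptor instead of holding it open indefinitely.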

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

It seems to help. I submitted a patch to gear to support it: https://review.openstack.org/#/c/334452/

Changed in tripleo:
status: Confirmed → Fix Released
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

Emilien, actually it's not released yet. The patch to gear is still in review, and when (and if) it is merged we also need to reconfigure systemd to use these parameters.

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

gear was released with keepalive support in version 0.7.

Changed in tripleo:
status: Fix Released → In Progress
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The final solution for the issue:

https://review.openstack.org/#/c/352566/

Changed in tripleo:
status: In Progress → Fix Released