periodic HA master job pingtest times out

Bug #1637961 reported by Gabriele Cerami
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

The periodic HA job fails during the ping test with the error:

2016-10-31 07:57:58.084956 | Timing out after 300 seconds:

as seen in

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/0ae3179/console.html#_2016-10-31_07_57_58_084956

Logs at

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/0ae3179/logs/overcloud-controller-0/var/log/heat/heat-engine.txt.gz

show that the tenant stack is created successfully 1 minute after the timeout.

Investigating the delay; the timeout may need to be increased.

Tags: ci
Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Gabriele Cerami (gcerami) wrote :

Looks like gnocchi-metricd is eating all the CPU again.

The logs are full of this error:

2016-10-31 07:45:29.205 28623 ERROR cotyledon ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

As seen in
http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/0ae3179/logs/overcloud-controller-0/var/log/gnocchi/metricd.txt.gz

Gabriele Cerami (gcerami) wrote :

metricd is continuously trying to contact the redis server, but redis is down. Redis fails at startup with this error:

22419:M 31 Oct 07:16:30.624 # Opening Unix socket: bind: Permission denied

as seen in

http://logs.openstack.org/periodic/periodic-tripleo-ci-centos-7-ovb-ha/0ae3179/logs/overcloud-controller-0/var/log/redis/redis.txt.gz

Gabriele Cerami (gcerami) wrote :

Redis starts correctly with the configuration when launched manually. With systemd, it fails.

Gabriele Cerami (gcerami) wrote :

SELinux policy problem.

I see this in audit/audit.log:

type=AVC msg=audit(1477956868.065:23265): avc: denied { write } for pid=11970 comm="redis-server" name="redis" dev="tmpfs" ino=130656 scontext=system_u:system_r:redis_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=dir

Setting SELinux to permissive mode lets redis start.
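
To make the denial easier to read, here is a minimal Python sketch that pulls the relevant fields out of the AVC record quoted above. The helper names `parse_avc` and `selinux_type` are made up for this illustration and are not part of any audit tooling. It only shows that the redis-server process (type `redis_t`) was denied write access on a directory still carrying the generic `var_run_t` label.

```python
import re

# Sample denial copied from the audit log quoted above.
AVC_LINE = (
    'type=AVC msg=audit(1477956868.065:23265): avc: denied { write } for '
    'pid=11970 comm="redis-server" name="redis" dev="tmpfs" ino=130656 '
    'scontext=system_u:system_r:redis_t:s0 '
    'tcontext=system_u:object_r:var_run_t:s0 tclass=dir'
)


def parse_avc(line):
    """Extract the denied permissions and key=value fields from an AVC record."""
    perms = re.search(r"denied\s+\{([^}]*)\}", line)
    # Values are either double-quoted strings or runs of non-space characters.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    fields = {key: value.strip('"') for key, value in fields.items()}
    fields["permissions"] = perms.group(1).split() if perms else []
    return fields


def selinux_type(context):
    """Return the type field (third component) of user:role:type:level."""
    return context.split(":")[2]


record = parse_avc(AVC_LINE)
source_type = selinux_type(record["scontext"])
target_type = selinux_type(record["tcontext"])
print(f'{record["comm"]} ({source_type}) denied {record["permissions"]} '
      f'on {record["tclass"]} "{record["name"]}" labeled {target_type}')
```

The mismatch between the process type and the generic label on the directory is consistent with redis starting once enforcement is switched to permissive.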

Ryan Hallisey (rthall14) wrote :

`restorecon -Rv /var/run/redis`

This will fix the problem. I think this command will need to be run as part of the deployment, maybe by puppet?

Changed in tripleo:
milestone: none → ocata-1
Ryan Hallisey (rthall14) wrote :

The restorecon needs to be done on the fly because, after openstack-selinux is installed, /var/run/redis may not exist yet. The restorecon needs to be run after that directory is created. Puppet should be able to handle this.

Alfredo Moralejo (amoralej) wrote :

/var/run is a tmpfs, so its content is lost on every reboot and fixing it with puppet will not help. When running redis from pacemaker, the directory is created by the redis resource-agent:

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/redis#L325

so I think restorecon should be done in the resource agent.
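
The ordering constraint raised in the comments above (the directory has to exist before restorecon can relabel it, and on a tmpfs that has to happen on every start) can be sketched roughly as follows. This is a Python illustration only: the real resource agent is a shell script and the actual fix may look different. The function name `prepare_runtime_dir` and its `run` parameter are hypothetical, and the demo uses a scratch directory with a recording stub instead of touching /var/run/redis.

```python
import os
import shutil
import subprocess
import tempfile


def prepare_runtime_dir(path, run=subprocess.run):
    """Create `path` if missing, then reset its SELinux label.

    Returns the restorecon command that was issued, or None when
    restorecon is not installed on this host.
    """
    # Step 1: the directory must exist before it can be relabeled.
    os.makedirs(path, exist_ok=True)

    # Step 2: relabel only after creation; skip if the tool is absent.
    if shutil.which("restorecon") is None:
        return None
    cmd = ["restorecon", "-Rv", path]
    run(cmd, check=True)
    return cmd


# Dry run against a scratch directory, recording the command instead of
# executing it, so nothing on the host is relabeled.
calls = []
demo_dir = os.path.join(tempfile.mkdtemp(), "redis")
result = prepare_runtime_dir(demo_dir, run=lambda cmd, check: calls.append(cmd))
print(os.path.isdir(demo_dir), result)
```

Doing both steps in the same place that creates the directory avoids the window in which the service starts against a freshly created, generically labeled directory.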

Gabriele Cerami (gcerami) wrote :

Created a pull request for resource-agents at https://github.com/ClusterLabs/resource-agents/pull/872

Gabriele Cerami (gcerami) wrote :

Pull request has been merged. We only have to wait for it to be packaged now.

Gabriele Cerami (gcerami) wrote :

The package is in base RHEL; the downstream Bugzilla that tracks the packaging is https://bugzilla.redhat.com/show_bug.cgi?id=1390974. This will take some days.
In the meantime we are working around the issue with https://review.openstack.org/392521

Gabriele Cerami (gcerami) wrote :

Added a workaround that makes SELinux permissive until the resource-agents package is out: https://review.openstack.org/392703

Changed in tripleo:
importance: High → Critical
status: Confirmed → In Progress
tags: added: alert ci
Gabriele Cerami (gcerami) wrote :

New root cause:

https://bugzilla.redhat.com/show_bug.cgi?id=1374728

Automatic creation of /var/run/redis was removed in redis package 3.2.4.

A new package with the revert is now in testing at

http://cbs.centos.org/koji/buildinfo?buildID=13831

tags: removed: alert
description: updated
Changed in tripleo:
status: In Progress → Fix Released