periodic HA master job pingtest times out

Bug #1637961 reported by Gabriele Cerami on 2016-10-31

Bug Description

The periodic HA job fails during the ping test with the error:

2016-10-31 07:57:58.084956 | Timing out after 300 seconds:

as seen in

Logs at

show that the tenant stack is created successfully 1 minute after the timeout.

Investigating the delay; if needed, the timeout will be increased.
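For illustration, a bounded wait of the kind that produces a message like the one above can be sketched as follows. This is a minimal sketch and not the actual CI script: the function name `wait_for` and the `stack_is_complete` check in the usage comment are hypothetical.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for() {
    wf_timeout=$1
    wf_interval=$2
    shift 2
    wf_start=$(date +%s)
    while :; do
        # Success: stop waiting as soon as the command returns 0.
        if "$@"; then
            return 0
        fi
        wf_now=$(date +%s)
        # Give up once the elapsed time reaches the timeout.
        if [ $((wf_now - wf_start)) -ge "$wf_timeout" ]; then
            echo "Timing out after ${wf_timeout} seconds" >&2
            return 1
        fi
        sleep "$wf_interval"
    done
}

# Example (stack_is_complete is an assumed check, not part of this report):
#   wait_for 600 10 stack_is_complete
```

With a wait like this, raising the first argument only hides a slow deployment; it does not explain why the stack needs more than 300 seconds.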

Tags: ci
Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Gabriele Cerami (gcerami) wrote :

Looks like gnocchi-metricd is eating all the CPU again.

The logs are full of this error:

2016-10-31 07:45:29.205 28623 ERROR cotyledon ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

As seen in

Gabriele Cerami (gcerami) wrote :

metricd is continuously trying to contact the redis server, but redis is down. It fails at start with this error:

22419:M 31 Oct 07:16:30.624 # Opening Unix socket: bind: Permission denied

as seen in

Gabriele Cerami (gcerami) wrote :

Redis starts correctly with the configuration when launched manually. With systemd, it fails.

Gabriele Cerami (gcerami) wrote :

SELinux policy problem.

I see this in audit/audit.log

type=AVC msg=audit(1477956868.065:23265): avc: denied { write } for pid=11970 comm="redis-server" name="redis" dev="tmpfs" ino=130656 scontext=system_u:system_r:redis_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=dir

Setting SELinux to permissive lets redis start.
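To read the denial above: a process in the `redis_t` domain was refused write access to a directory named `redis` that carries the `var_run_t` label. A minimal sketch for pulling those fields out of an AVC record is shown below; the helper names `avc_field` and `avc_type` are hypothetical, and the sample line is the one quoted in this report.

```shell
# Hypothetical helper: print the value of one key=value field
# from an AVC record read on stdin.
avc_field() {
    sed -n "s/.*[[:space:]]$1=\([^[:space:]]*\).*/\1/p"
}

# Hypothetical helper: print only the SELinux type (third colon-separated
# part) of a context field such as scontext or tcontext.
avc_type() {
    avc_field "$1" | cut -d: -f3
}

avc_line='type=AVC msg=audit(1477956868.065:23265): avc: denied { write } for pid=11970 comm="redis-server" name="redis" dev="tmpfs" ino=130656 scontext=system_u:system_r:redis_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=dir'

printf '%s\n' "$avc_line" | avc_type scontext   # prints: redis_t
printf '%s\n' "$avc_line" | avc_type tcontext   # prints: var_run_t
```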

Ryan Hallisey (rthall14) wrote :

`restorecon -Rv /var/run/redis`

This will fix the problem. I think this command will need to be run as part of deployment, maybe by puppet?

Changed in tripleo:
milestone: none → ocata-1
Ryan Hallisey (rthall14) wrote :

The restorecon needs to be done on the fly because after openstack-selinux is installed /var/run/redis may not exist. The restorecon needs to be run after that directory is created. Puppet should be able to handle this.

Alfredo Moralejo (amoralej) wrote :

/var/run is tmpfs, so its content is lost after every reboot and fixing it with puppet will not help. When running redis from pacemaker, the directory is created by the redis resource-agent:

So I think restorecon should be done in the resource agent.
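A minimal sketch of that idea: create the runtime directory and relabel it in the same step, so the label is restored every time the directory is recreated on tmpfs. This is not the code from the resource agent or from the pull request mentioned later; the function name `ensure_run_dir` and its arguments are hypothetical.

```shell
# Hypothetical helper: create a runtime directory if it is missing and
# restore its default SELinux label.
# Usage: ensure_run_dir DIRECTORY [OWNER]
ensure_run_dir() {
    erd_dir=$1
    erd_owner=${2:-}
    [ -d "$erd_dir" ] || mkdir -p "$erd_dir" || return 1
    # Optionally hand the directory to the service user.
    if [ -n "$erd_owner" ]; then
        chown "$erd_owner" "$erd_dir" || return 1
    fi
    # Relabel only when the SELinux tooling is installed; a relabel failure
    # is reported but does not abort the caller.
    if command -v restorecon >/dev/null 2>&1; then
        restorecon -Rv "$erd_dir" || echo "restorecon failed for $erd_dir" >&2
    fi
    return 0
}

# Example (the owner value is an assumption):
#   ensure_run_dir /var/run/redis redis
```

Doing the relabel where the directory is created avoids depending on deployment-time tooling that runs only once.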

Gabriele Cerami (gcerami) wrote :

Created a pull request for resource-agents in

Gabriele Cerami (gcerami) wrote :

Pull request has been merged. We only have to wait for it to be packaged now.

Gabriele Cerami (gcerami) wrote :

The package is in base RHEL; this is the downstream bugzilla that requires packaging. This will take some days.
We are working around the issue with this

Gabriele Cerami (gcerami) wrote :

Added a workaround to make SELinux permissive until the resource-agents package is out.

Changed in tripleo:
importance: High → Critical
status: Confirmed → In Progress
tags: added: alert ci
Gabriele Cerami (gcerami) wrote :

New root cause:

Automatic creation of /var/run/redis was removed in redis package 3.2.4.

New package with revert is now in testing at

tags: removed: alert
description: updated
Changed in tripleo:
status: In Progress → Fix Released