Comment 5 for bug 1585275

Tim Rozet (trozet) wrote :

It turns out the failures from comment #3 are several distinct problems, all in the overcloud, and they are made worse by running ceph on the control and compute nodes. The failures are:

1) SQL calls to create the databases (db schema upgrades per service, in step2) fail to connect to SQL. This seems to be a timing issue around when the mariadb cluster is really ready to handle requests. A hack fix is to add a sleep between when clustercheck passes and when the db syncs start.

I have noticed that clustercheck appears to pass even when only 1 node is in the cluster. My theory is that not all members of the cluster have joined yet, and they may join around the same time as the db schema upgrades happen, causing some type of deadlock.
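
To illustrate the theory above, here is a minimal sketch of a stricter readiness check that waits until the expected number of nodes has joined rather than trusting a single-node pass. It is not what clustercheck or THT does. It assumes the mysql command-line client can connect without extra credentials and reads the Galera status variable wsrep_cluster_size; the function names, the timeout and the polling interval are hypothetical.

```python
import subprocess
import time


def parse_cluster_size(status_output):
    """Extract N from 'wsrep_cluster_size<TAB>N' output; return 0 if absent."""
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "wsrep_cluster_size":
            return int(parts[1])
    return 0


def query_cluster_size():
    """Ask the local node how many members it currently sees.

    Assumption: the mysql client picks up credentials on its own
    (for example from a client config file).
    """
    out = subprocess.check_output(
        ["mysql", "-N", "-B", "-e", "SHOW STATUS LIKE 'wsrep_cluster_size'"],
        text=True,
    )
    return parse_cluster_size(out)


def wait_for_cluster(expected, timeout=300, interval=5,
                     get_size=query_cluster_size, sleep=time.sleep):
    """Poll until at least `expected` members have joined.

    Returns True once the cluster is large enough, False after roughly
    `timeout` seconds of waiting.
    """
    waited = 0
    while True:
        try:
            size = get_size()
        except (subprocess.CalledProcessError, OSError):
            size = 0  # node not reachable yet; treat as "not ready"
        if size >= expected:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

With a 3 node controller cluster, a caller would only start the db syncs once wait_for_cluster(3) returns True.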

The interesting part of this failure is that it occurs in the puppet mysql provider, but for some reason the resource that called the provider does not fail. Not sure how that happens...

2) Commands to access the ceph mon and osd time out. The problem looks to be some type of resource contention with other services coming up around the same time. Moving Ceph configuration to "step1" and making it happen first (before tripleO loadbalancer, mongodb, etc) fixes the problem.

We have fixed this in our OPNFV fork of THT with this patch:
https://github.com/trozet/opnfv-tht/pull/18/files

This resolved #2 completely. We still see some issues with #1, but now it seems to be an issue of the db schema upgrades themselves happening too quickly for each service. We are working on a patch to serialize the db schema upgrades and add a 10 second sleep timer between each one to see if it fixes that issue.
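
The serialization idea can be sketched roughly as follows. This is only an illustration in Python, not the patch itself (which would live in the puppet/THT code). The list of db sync commands is an assumed example and the function name is hypothetical; only the 10 second pause comes from the description above.

```python
import subprocess
import time

# Assumed example set of per-service schema upgrade commands.
DB_SYNC_COMMANDS = [
    ["keystone-manage", "db_sync"],
    ["glance-manage", "db_sync"],
    ["nova-manage", "db", "sync"],
]


def run_db_syncs(commands, delay=10, run=subprocess.check_call, sleep=time.sleep):
    """Run each db sync one at a time, pausing `delay` seconds between them.

    check_call raises on a non-zero exit, so a failed sync stops the
    sequence instead of letting later services run against a bad state.
    """
    for index, command in enumerate(commands):
        if index > 0:
            sleep(delay)  # pause only between syncs, not before the first
        run(command)
```

Calling run_db_syncs(DB_SYNC_COMMANDS) would run the three syncs strictly in order with a 10 second gap between each.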