intermittent HA failures in CI gates due to deployment race conditions

Bug #1393334 reported by Bogdan Dobrelya
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Critical
Assigned to: Fuel Library (Deprecated)

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/6.0/job/6.0.staging.ubuntu.bvt_2/73/

Here is what can be seen from the logs:
1) The deployment failed due to:
2014-11-16T09:40:02.587686 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-11-16T09:40:02.587686+00:00 err: (/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.1.128-10.108.1.254]) Could not evaluate: Oops - not sure what happened: 751: unexpected token at '<html><body><h1>504 Gateway Time-out</h1>

The Galera cluster reported that it was ready for connections almost 4 minutes *later*:
2014-11-16T09:43:59.887053 node-1 ./node-1.test.domain.local/mysqld.log:2014-11-16T09:43:59.887053+00:00 err: 2014-11-16 09:43:59 452 [Note] WSREP: Synchronized with group, ready for connections

There are also signal=13 (SIGPIPE) errors in xinetd.log for galeracheck,
from 2014-11-16T09:32:38 to 2014-11-16T09:58:00:
 START: galeracheck ... from=10.108.2.2, EXIT: galeracheck signal=13
(10.108.2.2 is the management VIP)
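
To make the ordering of the two events in item 1 explicit, here is a minimal sketch that measures the gap between the puppet error and the Galera "ready for connections" message. It assumes each log line starts with an ISO 8601 timestamp as in the excerpts above; the helper name `leading_timestamp` is hypothetical, and the log lines are shortened to their prefixes.

```python
from datetime import datetime


def leading_timestamp(line: str) -> datetime:
    """Parse the ISO 8601 timestamp that starts a collected log line."""
    return datetime.fromisoformat(line.split(maxsplit=1)[0])


# Prefixes of the two log lines quoted in this report (remainder omitted).
puppet_error = "2014-11-16T09:40:02.587686 node-1 ./node-1.test.domain.local/puppet-apply.log"
galera_ready = "2014-11-16T09:43:59.887053 node-1 ./node-1.test.domain.local/mysqld.log"

# A positive gap means puppet failed before Galera was ready.
gap = leading_timestamp(galera_ready) - leading_timestamp(puppet_error)
print(gap)  # 0:03:57.299367
```

The positive gap shows that puppet evaluated Nova_floating_range before the database reported itself synchronized.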

2) At the moment the log snapshot was taken:
rabbitmqctl report (Nov 16, 09:59) shows 'rabbit@node-1': nodedown

pcs status (Nov 16 09:58:04 2014) shows that all resources are stopped.
However, the puppet logs contain no errors about this, and the debug output shows the resources as started less than a minute before:
Sun Nov 16 09:57:55 +0000 2014 Puppet (debug):
-> Simple primitive 'vip__public' global status: start
   node-1: start
-> Cloned primitive 'clone_ping_vip__public' global status: start
   node-1: start
-> Cloned primitive 'clone_p_heat-engine' global status: start
   node-1: start
-> Multistate primitive 'master_p_rabbitmq-server' global status: master
   node-1: master
-> Simple primitive 'vip__management' global status: start
   node-1: start
-> Cloned primitive 'clone_p_haproxy' global status: start
   node-1: start
-> Cloned primitive 'clone_p_mysql' global status: start
   node-1: start
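
The debug output and pcs status above disagree within seconds of each other, which suggests that a single point-in-time status check can be misleading. As a generic illustration only (not the fix that was merged for this bug), the sketch below treats a resource as up only after several consecutive successful probes. The function `wait_until_stable` and all of its parameters are hypothetical; `probe` stands for any caller-supplied check and is not a real Fuel or Pacemaker API.

```python
import time
from typing import Callable


def wait_until_stable(
    probe: Callable[[], bool],
    required_successes: int = 3,
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once probe() succeeds required_successes times in a row.

    Returns False if the deadline passes first. sleep and clock are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    streak = 0
    while True:
        # Any failed probe resets the streak, so a brief "started" blip
        # followed by "stopped" is not reported as stable.
        streak = streak + 1 if probe() else 0
        if streak >= required_successes:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass a probe that wraps whatever status check is appropriate and proceed only when the function returns True.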

Changed in fuel:
importance: Undecided → Critical
milestone: none → 6.0
status: New → Confirmed
assignee: nobody → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

Related, but does not look like a duplicate: https://bugs.launchpad.net/fuel/+bug/1391180

Bogdan Dobrelya (bogdando) wrote :

I was wrong: https://bugs.launchpad.net/fuel/+bug/1391180 is a duplicate of this one after all. No more intermittent failures due to Galera should occur now that https://review.openstack.org/134920 has been merged.
