Fuel for OpenStack

intermittent HA failures in CI gates due to deployment race conditions

Bug #1393334 reported by Bogdan Dobrelya on 2014-11-17

This bug report is a duplicate of: Bug #1391180: Deployment of Ha nova-flat cluster failed with (/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.78.128-10.108.78.254]) Could not evaluate: Oops - not sure what happened: 757: unexpected token at '<html><body><h1>504 Gateway Time-out</h1>.

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Confirmed	Critical	Fuel Library (Deprecated)	Fuel for OpenStack 6.0

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/6.0/job/6.0.staging.ubuntu.bvt_2/73/

Here is what could be seen from logs:
1) The deployment failed due to:
2014-11-16T09:40:02.587686 node-1 ./node-1.test.domain.local/puppet-apply.log:2014-11-16T09:40:02.587686+00:00 err: (/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.1.128-10.108.1.254]) Could not evaluate: Oops - not sure what happened: 751: unexpected token at '<html><body><h1>504 Gateway Time-out</h1>

The Galera cluster reported that it was ready for connections almost 4 minutes *later*:
2014-11-16T09:43:59.887053 node-1 ./node-1.test.domain.local/mysqld.log:2014-11-16T09:43:59.887053+00:00 err: 2014-11-16 09:43:59 452 [Note] WSREP: Synchronized with group, ready for connections
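The ordering of the two events above can be checked directly from the log timestamps. The snippet below is only a small sketch: the timestamps are copied from the two log lines quoted above, and the helper name `parse_ts` is made up for illustration.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse the ISO-like timestamp prefix used in the quoted log lines."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")


# Timestamps copied from the puppet-apply.log and mysqld.log lines above.
puppet_error = parse_ts("2014-11-16T09:40:02.587686")
galera_ready = parse_ts("2014-11-16T09:43:59.887053")

# A positive gap means Puppet hit the 504 before Galera was ready.
gap = galera_ready - puppet_error
gap_seconds = gap.total_seconds()  # about 237.3 seconds

print(f"Galera became ready {gap_seconds:.1f}s after the Puppet error")
```

A positive gap supports the race-condition reading: the Nova_floating_range resource was evaluated while the database backend was still not synchronized.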

Also, xinetd.log shows signal=13 (SIGPIPE) errors for galeracheck
from 2014-11-16T09:32:38 to 2014-11-16T09:58:00:
START: galeracheck ... from=10.108.2.2, EXIT: galeracheck signal=13
(10.108.2.2 is the management VIP)

2) At the moment the logs snapshot was taken:
rabbitmqctl report (Nov 16, 09:59) shows 'rabbit@node-1': nodedown

pcs status (Nov 16 09:58:04 2014) shows that all resources are stopped.
But there are no errors about this in the puppet logs, and the debug output shows the resources as started shortly before:
Sun Nov 16 09:57:55 +0000 2014 Puppet (debug):
-> Simple primitive 'vip__public' global status: start
   node-1: start
-> Cloned primitive 'clone_ping_vip__public' global status: start
   node-1: start
-> Cloned primitive 'clone_p_heat-engine' global status: start
   node-1: start
-> Multistate primitive 'master_p_rabbitmq-server' global status: master
   node-1: master
-> Simple primitive 'vip__management' global status: start
   node-1: start
-> Cloned primitive 'clone_p_haproxy' global status: start
   node-1: start
-> Cloned primitive 'clone_p_mysql' global status: start
   node-1: start
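To compare a dump like the one above against pcs status, it can help to turn it into a structure. The following is a rough sketch that assumes the exact line layout shown in the debug output above; the function names `parse_status` and `all_started` are hypothetical and not part of Puppet or Pacemaker.

```python
import re

# Matches lines such as: -> Cloned primitive 'clone_p_mysql' global status: start
PRIMITIVE_RE = re.compile(
    r"^-> (?P<kind>\w+) primitive '(?P<name>[^']+)' global status: (?P<status>\w+)$"
)
# Matches indented per-node lines such as:    node-1: start
NODE_RE = re.compile(r"^\s+(?P<node>[\w.-]+): (?P<status>\w+)$")


def parse_status(text):
    """Return {primitive: {"kind", "global", "nodes": {node: status}}}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        m = PRIMITIVE_RE.match(line)
        if m:
            current = {"kind": m["kind"], "global": m["status"], "nodes": {}}
            result[m["name"]] = current
            continue
        m = NODE_RE.match(line)
        if m and current is not None:
            current["nodes"][m["node"]] = m["status"]
    return result


def all_started(parsed):
    """True if every primitive reports 'start' or 'master' globally."""
    return all(p["global"] in ("start", "master") for p in parsed.values())


sample = """\
-> Simple primitive 'vip__public' global status: start
   node-1: start
-> Multistate primitive 'master_p_rabbitmq-server' global status: master
   node-1: master
-> Cloned primitive 'clone_p_mysql' global status: start
   node-1: start
"""
parsed = parse_status(sample)
print(all_started(parsed))
```

If a dump parsed this way reports everything as started while pcs status taken seconds later shows everything stopped, the discrepancy itself is worth investigating.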

Bogdan Dobrelya (bogdando) on 2014-11-17

Changed in fuel:
importance:	Undecided → Critical
milestone:	none → 6.0
status:	New → Confirmed
assignee:	nobody → Fuel Library Team (fuel-library)


Bogdan Dobrelya (bogdando) wrote on 2014-11-17:

Related, but does not look like a dup: https://bugs.launchpad.net/fuel/+bug/1391180


Bogdan Dobrelya (bogdando) wrote on 2014-11-18:

I was wrong: https://bugs.launchpad.net/fuel/+bug/1391180 is a dup of this one as well. No more intermittent failures due to Galera should occur after https://review.openstack.org/134920 was merged.
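The contents of the referenced change are not shown in this report, so the following is not a description of it. It is only a generic sketch of how this class of race is usually guarded against: poll a readiness check until it passes or a timeout expires, and only then run the dependent step. The name `wait_until` and its parameters are hypothetical; `is_ready` stands in for whatever backend check applies (for example, a Galera sync check).

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout: float = 600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or the timeout expires.

    Returns True if the check passed, False if the deadline was reached.
    The check always runs at least once, even with timeout=0.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a stand-in check that succeeds on the third attempt.
attempts = []


def fake_backend_ready() -> bool:
    attempts.append(1)
    return len(attempts) >= 3


ok = wait_until(fake_backend_ready, timeout=10, interval=0, sleep=lambda s: None)
print(ok, len(attempts))
```

Injecting `sleep` and `clock` keeps the helper testable without real delays; in a deployment the defaults would be used.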
