[system tests] deploy gre ha - deployment fails on compute nodes

Bug #1352982 reported by Tatyanka
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_4/125/testReport/junit/%28root%29/deploy_neutron_gre_ha/deploy_neutron_gre_ha/

Deployment failed with "Call cib_replace failed (-62): Timer expired" on a compute node (see node-4).
http://paste.openstack.org/show/90546/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1"
  api: "1.0"
  build_number: "389"
  build_id: "2014-08-03_02-01-14"
  astute_sha: "ce86172e77661026c91fdf1ff8066d7df1f7d89d"
  fuellib_sha: "4e3fdd75f8dabde8e5d07067545d8043a70a176b"
  ostf_sha: "a3fa823ea0e4e03beb637ae07a91adea82c33182"
  nailgun_sha: "bd0127be0061029f9f910547db5e633c82244942"
  fuelmain_sha: "e99879292cf6e96b8991300d947df76b69134bb1"

on Ubuntu

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

2014-08-03T20:46:23.455168 node-1 ./node-1.test.domain.local/crmd.log:2014-08-03T20:46:23.455168+01:00 warning: warning: crmd_ha_msg_filter: Another DC detected: node-2 (op=noop)
2014-08-03T20:46:23.491805 node-2 ./node-2.test.domain.local/crmd.log:2014-08-03T20:46:23.491805+01:00 warning: warning: crmd_ha_msg_filter: Another DC detected: node-1 (op=noop)

RCA: split brain in the corosync cluster

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the flow of events: http://pastebin.com/eTgY4Hg7. It looks like quorum was lost right after it had been acquired, because two nodes went offline (at least from corosync's point of view), ending up with a split brain once they rejoined the cluster.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

2014-08-03T20:36:06.927246 the cluster loses node-2; only node-1 and node-4 remain in the cluster, keeping quorum
2014-08-03T20:46:18.609025 node-2 thinks it has lost node-1 and node-4 from the cluster and loses quorum.
We now have two partitions that cannot see each other.
2014-08-03T20:46:23.450193 cluster1 (node-1, node-4) rejoins node-2 back
2014-08-03T20:46:23.454188 cluster2 acquires quorum (with only node-2 in it!)
2014-08-03T20:46:23.454399 cluster2 rejoins the "lost" node-1 and node-4 back
2014-08-03T20:46:23.455168 cluster1 detects another DC elected in cluster2 while partitioned and does nothing: op=noop (because we have no_quorum_policy=ignore?)
2014-08-03T20:46:23.491805 cluster2 detects the other DC kept in cluster1 while partitioned and does nothing: op=noop (because we have no_quorum_policy=ignore?)
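
For reference, the cluster property mentioned above can be checked on any controller with the stock Pacemaker CLI tools. A minimal sketch, not output from the affected environment (the grep patterns only narrow down crm_mon's standard status output):

  # One-shot cluster status: current DC and node membership as this node sees it
  crm_mon -1 | grep -E 'Current DC|Online'
  # Query the no-quorum-policy cluster property referenced in the timeline
  crm_attribute --type crm_config --name no-quorum-policy --query

With no-quorum-policy=ignore, a partition that has lost quorum keeps managing resources, which is why each partition above could elect its own DC and continue with op=noop instead of stopping.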

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe we cannot fix this without having fencing configured.
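
For context, configuring fencing in Pacemaker means defining a STONITH resource and enabling the stonith-enabled property. A minimal crmsh sketch, assuming IPMI-capable nodes and the fence_ipmilan agent; the resource name, address, and credentials are illustrative, not taken from this environment:

  # Example fence device for node-1 (illustrative parameters)
  crm configure primitive fence-node-1 stonith:fence_ipmilan \
      params pcmk_host_list="node-1" ipaddr="10.20.0.101" login="admin" passwd="secret" \
      op monitor interval="60s"
  # Keep the fence resource off the node it is meant to fence
  crm configure location fence-node-1-not-on-node-1 fence-node-1 -inf: node-1
  # Turn fencing on cluster-wide
  crm configure property stonith-enabled="true"

With fencing in place, the partition that keeps quorum can power off the nodes in the minority partition instead of letting two DCs run side by side.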

Changed in fuel:
status: New → Triaged
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I am pretty sure that this bug is invalid as pacemaker and corosync are not running on the compute nodes.
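
A quick way to confirm that on a compute node (a generic sketch, assuming the standard corosync and pacemaker service names on Ubuntu of that era):

  # Neither service should be installed or running on a compute node
  service corosync status; service pacemaker status
  ps aux | egrep '[c]orosync|[p]acemaker'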

Changed in fuel:
status: Triaged → Incomplete
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: Incomplete → Invalid