[system tests] deploy gre ha - deployment fails on compute nodes

Bug #1352982 reported by Tatyanka
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Library (Deprecated)

Bug Description

http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_4/125/testReport/junit/%28root%29/deploy_neutron_gre_ha/deploy_neutron_gre_ha/

Deployment failed with "Call cib_replace failed (-62): Timer expired" on a compute node (see node-4).
http://paste.openstack.org/show/90546/

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "5.1"
  api: "1.0"
  build_number: "389"
  build_id: "2014-08-03_02-01-14"
  astute_sha: "ce86172e77661026c91fdf1ff8066d7df1f7d89d"
  fuellib_sha: "4e3fdd75f8dabde8e5d07067545d8043a70a176b"
  ostf_sha: "a3fa823ea0e4e03beb637ae07a91adea82c33182"
  nailgun_sha: "bd0127be0061029f9f910547db5e633c82244942"
  fuelmain_sha: "e99879292cf6e96b8991300d947df76b69134bb1"

on Ubuntu

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

2014-08-03T20:46:23.455168 node-1 ./node-1.test.domain.local/crmd.log:2014-08-03T20:46:23.455168+01:00 warning: warning: crmd_ha_msg_filter: Another DC detected: node-2 (op=noop)
2014-08-03T20:46:23.491805 node-2 ./node-2.test.domain.local/crmd.log:2014-08-03T20:46:23.491805+01:00 warning: warning: crmd_ha_msg_filter: Another DC detected: node-1 (op=noop)

RCA: split brain in the corosync cluster

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the flow of events: http://pastebin.com/eTgY4Hg7. It looks like quorum was lost right after it had been acquired, because two nodes went offline (at least from corosync's point of view), ending up with a split brain once they rejoined the cluster.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

2014-08-03T20:36:06.927246 the cluster loses node-2; only node-1 and node-4 remain in the cluster, keeping quorum
2014-08-03T20:46:18.609025 node-2 thinks it has lost node-1 and node-4 from the cluster and loses quorum.
We now have two partitions that cannot see each other.
2014-08-03T20:46:23.450193 cluster1 (node-1, node-4) rejoins node-2 back
2014-08-03T20:46:23.454188 cluster2 acquires quorum (with only node-2 in it!)
2014-08-03T20:46:23.454399 cluster2 rejoins the "lost" node-1 and node-4 back
2014-08-03T20:46:23.455168 cluster1 detects another DC elected in cluster2 while partitioned and does nothing: op=noop (because we have no_quorum_policy=ignore?)
2014-08-03T20:46:23.491805 cluster2 detects the other DC kept in cluster1 while partitioned and does nothing: op=noop (because we have no_quorum_policy=ignore?)
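
For reference, the cluster property mentioned above can be checked on any controller with the stock Pacemaker CLI tools. A minimal sketch, not output from the affected environment (the grep patterns only narrow down crm_mon's standard status output):

  # One-shot cluster status: current DC and node membership as this node sees it
  crm_mon -1 | grep -E 'Current DC|Online'
  # Query the no-quorum-policy cluster property referenced in the timeline
  crm_attribute --type crm_config --name no-quorum-policy --query

With no-quorum-policy=ignore, a partition that has lost quorum keeps managing resources, which is why each partition above could elect its own DC and continue with op=noop instead of stopping.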

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I believe we cannot fix this without having fencing configured.
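
For context, configuring fencing in Pacemaker means defining a STONITH resource and enabling the stonith-enabled property. A minimal crmsh sketch, assuming IPMI-capable nodes and the fence_ipmilan agent; the resource name, address, and credentials are illustrative, not taken from this environment:

  # Example fence device for node-1 (illustrative parameters)
  crm configure primitive fence-node-1 stonith:fence_ipmilan \
      params pcmk_host_list="node-1" ipaddr="10.20.0.101" login="admin" passwd="secret" \
      op monitor interval="60s"
  # Keep the fence resource off the node it is meant to fence
  crm configure location fence-node-1-not-on-node-1 fence-node-1 -inf: node-1
  # Turn fencing on cluster-wide
  crm configure property stonith-enabled="true"

With fencing in place, the partition that keeps quorum can power off the nodes in the minority partition instead of letting two DCs run side by side.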

Changed in fuel:
status: New → Triaged
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

I am pretty sure that this bug is invalid as pacemaker and corosync are not running on the compute nodes.
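
A quick way to confirm that on a compute node (a generic sketch, assuming the standard corosync and pacemaker service names on Ubuntu of that era):

  # Neither service should be installed or running on a compute node
  service corosync status; service pacemaker status
  ps aux | egrep '[c]orosync|[p]acemaker'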

Changed in fuel:
status: Triaged → Incomplete
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: Incomplete → Invalid