Network verification and HA tests failed after network outage

Bug #1526339 reported by Vladimir
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Invalid | High | Kyrylo Galanov |

Bug Description

Steps to reproduce:

1. Create and deploy a new cluster: Neutron VXLAN, Ceph for all, Ceph replication factor 3, 3 controllers, 2 computes, 3 Ceph nodes
2. Create 2 volumes and 2 instances with attached volumes
3. Fill cinder storage up to 30%
4. Simulate network outage (virsh net-destroy of all networks except 'admin')
5. Wait 5 minutes
6. Fix network connection (virsh net-start)
7. Wait until OSTF 'HA' suite passes
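
Steps 4 to 6 can be sketched as a small helper that only builds the virsh commands instead of running them. This is a minimal sketch and not the code used in the original test run: the network names in the usage line are hypothetical, and it assumes the network to keep is named exactly 'admin'.

```python
def outage_commands(networks, keep="admin"):
    """Return (destroy, start) virsh command lists for every network except `keep`."""
    targets = [name for name in networks if name != keep]
    destroy = [["virsh", "net-destroy", name] for name in targets]
    start = [["virsh", "net-start", name] for name in targets]
    return destroy, start


# Hypothetical libvirt network names, for illustration only.
destroy, start = outage_commands(["admin", "public", "management", "storage"])
for cmd in destroy + start:
    print(" ".join(cmd))
```

The command lists could be passed to subprocess.run one at a time, with the 5 minute wait from step 5 placed between the destroy and start phases.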

Three hours after the network connection was restored, network verification fails:
Verification failed.
These nodes: "6", "7", "8", "3", "1", "2" failed to connect to some of these repositories: "http://archive.ubuntu.com/ubuntu/", "http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/"

HA tests failed:
Can not set proxy for Health Check.Make sure that network configuration for controllers is correct

Diagnostic snapshot link:
https://drive.google.com/file/d/0BzGc8pMVuherUGFaRkpyNVFGb0k/view?usp=sharing

Vladimir (vushakov)
description: updated
Ilya Kutukov (ikutukov)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 8.0
importance: Undecided → High
status: New → Confirmed
tags: added: area-library ha
Bogdan Dobrelya (bogdando) wrote :

Please give us logs

Changed in fuel:
status: Confirmed → Incomplete
Vladimir (vushakov)
description: updated
Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Hm, there are no logs from the nodes in the snapshot, for some strange reason.

Changed in fuel:
status: Confirmed → Incomplete
Bogdan Dobrelya (bogdando) wrote :

Still, the status commands from the controllers at least show that the corosync & pacemaker cluster ended up split into 3 pieces, with each controller seeing only itself online. That state is not compatible with a working cloud control plane ;( I cannot tell what happened without logs.

 [10.109.0.4] out: Online: [ node-1.test.domain.local ]
 [10.109.0.4] out: OFFLINE: [ node-2.test.domain.local node-3.test.domain.local ]

 [10.109.0.7] out: Online: [ node-2.test.domain.local ]
 [10.109.0.7] out: OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]

 [10.109.0.3] out: Online: [ node-3.test.domain.local ]
 [10.109.0.3] out: OFFLINE: [ node-1.test.domain.local node-2.test.domain.local ]
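
The split can be detected mechanically from output like the above. The following is a minimal sketch, assuming the status lines keep exactly the "[host] out: Online: [ ... ]" shape shown here; the function names are hypothetical and not part of any Fuel or pacemaker tool.

```python
import re

# Matches lines such as: [10.109.0.4] out: Online: [ node-1.test.domain.local ]
ONLINE_RE = re.compile(r"\[(?P<host>[^\]]+)\] out: Online: \[(?P<nodes>[^\]]*)\]")


def parse_views(text):
    """Map each reporting host to the set of nodes it sees as online."""
    views = {}
    for line in text.splitlines():
        match = ONLINE_RE.search(line)
        if match:
            views[match.group("host")] = frozenset(match.group("nodes").split())
    return views


def partitions(views):
    """Return the distinct online sets; more than one means the cluster is split."""
    return set(views.values())


STATUS = """\
 [10.109.0.4] out: Online: [ node-1.test.domain.local ]
 [10.109.0.4] out: OFFLINE: [ node-2.test.domain.local node-3.test.domain.local ]
 [10.109.0.7] out: Online: [ node-2.test.domain.local ]
 [10.109.0.7] out: OFFLINE: [ node-1.test.domain.local node-3.test.domain.local ]
 [10.109.0.3] out: Online: [ node-3.test.domain.local ]
 [10.109.0.3] out: OFFLINE: [ node-1.test.domain.local node-2.test.domain.local ]
"""

print(len(partitions(parse_views(STATUS))))  # prints 3
```

A healthy three-controller cluster would yield a single partition containing all three nodes.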

tags: added: corosync pacemaker
Vladimir (vushakov) wrote :

Logs from Fuel master node.

tags: added: team-bugfix
Bogdan Dobrelya (bogdando) wrote :

The logs don't contain any of the required info, only puppet logs.

Changed in fuel:
status: Incomplete → Confirmed
status: Confirmed → Incomplete
ElenaRossokhina (esolomina) wrote :

I faced this bug too, iso#361
My steps:
1. Create and deploy the following cluster: Neutron VLAN, cinder/swift, 3 controllers, 2 computes, 1 cinder node
2. Create 2 volumes and 2 instances with attached volumes
3. Fill cinder storage up to 30%
4. Simulate network outage (virsh net-destroy of all networks except 'admin')
5. Wait 5 minutes.
6. Fix network connection (virsh net-start)
7. Wait until OSTF 'HA' suite passes (FAIL)

See all gathered logs at https://drive.google.com/a/mirantis.com/file/d/0B2ag_Bf-ShtTQTlIRWFPVXlOMVE/view?usp=sharing

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"

Changed in fuel:
status: Incomplete → Confirmed
ElenaRossokhina (esolomina) wrote :

And also, I've reproduced the initial scenario of this bug report.
Logs are unreachable due to https://bugs.launchpad.net/fuel/+bug/1530324

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Kyrylo Galanov (kgalanov)
status: Confirmed → In Progress
Kyrylo Galanov (kgalanov) wrote :

The simplest way to reproduce:
1. Set up the master node
2. Add nodes
3. Verify network connectivity
4. $ virsh net-destroy xxx_public
5. $ virsh net-start xxx_public
6. Verify network connectivity
...
7. Environment is broken

--
Verification failed.
Repo availability verification using public network failed on following nodes Untitled (73:2a), Untitled (3f:77), Untitled (0b:60).
Following repos are not available - http://archive.ubuntu.com/ubuntu/, http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/
. Check your public network settings and availability of the repositories from public network. Please examine nailgun and astute logs for additional details.
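
To confirm by hand which repositories a node can reach, a check along these lines could be run on the node. This is a minimal sketch and not how Fuel's own verification is implemented: the function name, the injectable fetch parameter, and the 5 second timeout are all assumptions made for illustration.

```python
import urllib.request


def unreachable_repos(urls, fetch=None, timeout=5):
    """Return the URLs that could not be fetched."""
    if fetch is None:
        def fetch(url):
            # Open and immediately close the connection; only reachability matters.
            with urllib.request.urlopen(url, timeout=timeout):
                pass
    failed = []
    for url in urls:
        try:
            fetch(url)
        except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
            failed.append(url)
    return failed
```

Calling unreachable_repos with the two repository URLs from the message above returns the subset that failed from that node. An empty list means both were reachable.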

Kyrylo Galanov (kgalanov) wrote :

libvirt disassembles the network bridge on destroy: http://paste.openstack.org/show/483047/
Networking will not work until the bridge is assembled again.
It's not a bug.
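
One way to see the state of a bridge on the libvirt host is to list its attached interfaces from sysfs. This is a minimal sketch, assuming a Linux host where bridge ports appear under /sys/class/net/<bridge>/brif; the function name is hypothetical and the bridge name must be supplied by the reader.

```python
import os


def bridge_ports(bridge, sysfs_root="/sys/class/net"):
    """Return the interfaces attached to `bridge`, or None if the bridge does not exist."""
    brif = os.path.join(sysfs_root, bridge, "brif")
    if not os.path.isdir(brif):
        return None
    return sorted(os.listdir(brif))
```

A result of None means the bridge is gone. An empty or shortened list after net-start would be consistent with the explanation above, since the VM interfaces would no longer be attached to the bridge.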

Changed in fuel:
status: In Progress → Invalid