Re-deployment fail in case if user remove non-primary controller

Bug #1470947 reported by Tatyanka
Affects              Status    Importance  Assigned to         Milestone
Fuel for OpenStack   Invalid   High        Oleksiy Molchanov
6.1.x                Invalid   Critical    Bogdan Dobrelya
7.0.x                Invalid   High        Oleksiy Molchanov

Bug Description

Steps to reproduce:
1. Deploy an HA cluster on CentOS with neutron-vlan
2. Add the following nodes:
 - 3 controllers
 - 1 cinder
 - 1 compute

3. Configure interfaces on the nodes:
eth0 - admin(PXE)
eth1 - public
eth2 - management
eth3 - private
eth4 - storage

4. On the Networks tab:
disable 'use vlan tag' for Public, Storage and Management

5. When the cluster is ready, run OSTF
6. Delete 2 non-primary controllers
7. Run re-deployment

Expected Result:
Cluster is ready after re-deployment; OSTF tests pass

Actual:
Re-deployment fails with an error on the controller:
 (/Stage[main]/Main/Package[cloud-init]/ensure) http://mirror.fuel-infra.org/mos/centos-6/mos6.1/security/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'mirror.fuel-infra.org'"

At the same time, I have internet connectivity on the controller:
[root@node-1 ~]# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=58 time=3.87 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=58 time=3.87 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=58 time=3.77 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2565ms
rtt min/avg/max/mdev = 3.774/3.842/3.878/0.069 ms

So the problem is that name resolution still goes through the vr-mgmt IP, which is not actually up yet.

resolv.conf
[root@node-1 ~]# cat /etc/resolv.conf
search test.domain.local
nameserver 10.109.4.3
[root@node-1 ~]#
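The failure mode above can be checked directly. A minimal sketch (assuming the `host` utility is available on the node) that extracts each nameserver from a resolv.conf-style file and queries it explicitly; a nameserver that is a dead VIP times out instead of answering. Demonstrated here on a copy of the file shown above, so it can be run anywhere:

```shell
# Write a copy of the resolv.conf contents from this report.
printf 'search test.domain.local\nnameserver 10.109.4.3\n' > /tmp/resolv.demo

# Extract each nameserver and query it directly for the mirror host
# from the yum error. A dead VIP will time out rather than answer.
awk '$1 == "nameserver" { print $2 }' /tmp/resolv.demo | while read -r ns; do
    echo "probing $ns"
    host -W 2 mirror.fuel-infra.org "$ns" || echo "$ns: no answer"
done
```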

ip netns before deletion of the controllers:
http://paste.openstack.org/show/336059/

After deletion of the controllers and re-deploy:
[root@node-1 ~]# ip netns exec vrouter ip a
18: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
19: vr-ns: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether f2:42:45:48:90:55 brd ff:ff:ff:ff:ff:ff
    inet 240.0.0.6/30 scope global vr-ns
    inet6 fe80::f042:45ff:fe48:9055/64 scope link
       valid_lft forever preferred_lft forever
[root@node-1 ~]#

[root@node-1 ~]# ping 10.109.4.3
PING 10.109.4.3 (10.109.4.3) 56(84) bytes of data.
From 10.109.4.4 icmp_seq=1 Destination Host Unreachable
From 10.109.4.4 icmp_seq=2 Destination Host Unreachable
From 10.109.4.4 icmp_seq=3 Destination Host Unreachable
^C
--- 10.109.4.3 ping statistics ---

[root@nailgun docker-logs]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "525"
  build_id: "2015-06-19_13-02-31"
  nailgun_sha: "dbd54158812033dd8cfd7e60c3f6650f18013a37"
  python-fuelclient_sha: "4fc55db0265bbf39c369df398b9dc7d6469ba13b"
  astute_sha: "1ea8017fe8889413706d543a5b9f557f5414beae"
  fuel-library_sha: "2e7a08ad9792c700ebf08ce87f4867df36aa9fab"
  fuel-ostf_sha: "8fefcf7c4649370f00847cc309c24f0b62de718d"
  fuelmain_sha: "a3998372183468f56019c8ce21aa8bb81fee0c2f"
[root@nailgun docker-logs]#

Changed in fuel:
status: New → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Snapshot
https://drive.google.com/a/mirantis.com/file/d/0B_tSitrwrgvoeGxwQm1xV3Zsakk/view?usp=sharing

I left the env, so if you need access, please ping me in Slack.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

To get it working, replace the vr-mgmt IP in resolv.conf with 8.8.8.8 and re-run deployment.
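The workaround can be sketched as a couple of shell commands (a sketch assuming the resolv.conf contents shown earlier; demonstrated on a copy so the transform can be verified before touching /etc/resolv.conf on the node):

```shell
# Copy of the resolv.conf from this report; the vrouter VIP is the
# only nameserver.
printf 'search test.domain.local\nnameserver 10.109.4.3\n' > /tmp/resolv.conf.demo

# Replace the unreachable vr-mgmt VIP with a public resolver. On the
# node, the same sed would be applied to /etc/resolv.conf before
# re-running deployment.
sed -i 's/^nameserver 10\.109\.4\.3$/nameserver 8.8.8.8/' /tmp/resolv.conf.demo
cat /tmp/resolv.conf.demo
```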

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Also, re-deployment passes successfully if only IPs are used in the mirror settings (mos and security).

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Why is it Incomplete for 7.0?

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

We need to reproduce it on 7.0 to set the proper status (Confirmed or Invalid); that's why for now it stays as Invalid for 7.0.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

The issue is not reproduced on HA Ubuntu GRE and HA Ubuntu VLAN. CentOS is not relevant for 7.0, so moving to Invalid for 7.0.
{"build_id": "2015-07-12_15-52-44", "build_number": "31", "release_versions": {"2014.2.2-7.0": {"VERSION": {"build_id": "2015-07-12_15-52-44", "build_number": "31", "api": "1.0", "fuel-library_sha": "49c7ddeb5e4257bb52862bc5aa22600df71bb52a", "nailgun_sha": "60f9bf536e30efd896b7b4da1830e71adda19e30", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "accd6493bf034ba7c70c987ace8f1dcd960cbdf5", "astute_sha": "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8", "fuel-ostf_sha": "62785c16f8399f30526d24c52bb9ca23e1585bfb", "release": "7.0", "fuelmain_sha": "28551be12a050acb9a633933ed6a8b25e2dc411c"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "49c7ddeb5e4257bb52862bc5aa22600df71bb52a", "nailgun_sha": "60f9bf536e30efd896b7b4da1830e71adda19e30", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-7.0", "production": "docker", "python-fuelclient_sha": "accd6493bf034ba7c70c987ace8f1dcd960cbdf5", "astute_sha": "9cbb8ae5adbe6e758b24b3c1021aac1b662344e8", "fuel-ostf_sha": "62785c16f8399f30526d24c52bb9ca23e1585bfb", "release": "7.0", "fuelmain_sha": "28551be12a050acb9a633933ed6a8b25e2dc411c"}

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bug is invalid as removing 2 controllers stops all the pacemaker-controlled resources and is not supported.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Sorry Vova, it is reproduced for removal of 1 controller, but only on 6.1. I've updated the changes (you can also get confirmation from the snapshot about 1-node removal).

summary: - Re-deployment fail in case if user remove non-primary controllers
+ Re-deployment fail in case if user remove non-primary controller
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Vladimir, the removal of controllers should be supported down to the cluster size=1 case.
A removal is not a "power off". A removed node is dynamically deleted from the corosync/pacemaker cluster, lowering all of the clustering-related counts as appropriate.

Mike Scherbakov (mihgen)
tags: added: customer-found
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

I have reproduced it and found the root cause. When we have 1 controller starting to redeploy, the redeploy fails on tools.pp because DNS is down, and DNS is down because corosync is stopped.
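The causal chain described here (corosync stopped → vrouter VIP gone → DNS down → tools.pp fails) can be checked step by step; a diagnostic sketch, assuming standard process tools on the node:

```shell
# 1. Is corosync running at all? (pgrep exits non-zero when it is not)
pgrep -x corosync > /dev/null && echo "corosync: running" || echo "corosync: stopped"

# 2. With corosync stopped, pacemaker cannot hold the vrouter VIP, so
#    the management VIP disappears from the namespace. On the node:
#      ip netns exec vrouter ip a | grep 10.109.4.3
# 3. With the VIP gone, the only nameserver in resolv.conf is
#    unreachable, and yum fails to resolve mirror.fuel-infra.org.
```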

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

We should investigate this; it might be a dynamic corosync node removal bug.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Bogdan, could you please clarify the current status of the bug/fix?

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Invalid: as per comment #7, we don't support the failure of 2 controllers.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Hi Denis, sorry, but I have a question: why do we not support removing 2 of 3 controllers and re-deploying the environment? It looks like a common operation for downscaling. Also, if we do not support such a common case, where can I find this documented? And lastly, do we need to remove the auto tests covering this case (they are green for 8.0)?
