astute could skip removing of a node from cobbler if requested to remove more than 7 nodes at once.

Bug #1494446 reported by Vladimir Khlyunev on 2015-09-10
This bug affects 4 people
Affects | Importance | Assigned to
Fuel for OpenStack | Critical | Vladimir Kozhukalov
7.0.x | Critical | Vladimir Kozhukalov
8.0.x | Critical | Vladimir Kozhukalov

Bug Description

ISO 288 (RC2)

This bug is hard to reproduce, but I caught it twice in about 5 attempts (the last time on a fresh env after the first deployment).
Steps for me:
1) Deploy 2 clusters:
First: HA with detached-database node - https://mirantis.testrail.com/index.php?/tests/view/1795440
Second: Simple with detached-keystone - https://mirantis.testrail.com/index.php?/tests/view/1795439
2) Delete both clusters
3) Wait until all nodes become available

Actual result:
1 node was not bootstrapped. VNC console of the disappeared node: http://puu.sh/k6FNl/3d9f86c29b.png (UPD: after a restart the node still throws this error)

This bug is probably related to https://bugs.launchpad.net/fuel/+bug/1494246

I will keep my env - feel free to request access

Vladimir Khlyunev (vkhlyunev) wrote :
description: updated
Ivan Kliuk (ivankliuk) on 2015-09-11
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
status: New → Confirmed
importance: Undecided → High
description: updated

Also reproduced after an upgrade from 6.1 to 7.0 with the 288 tarball.
After deleting an existing cluster created in 6.1, 1 node was not bootstrapped.
Only "virsh destroy" followed by "virsh start" helps.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Aleksandr Gordeev (a-gordeev)
Alexander Gordeev (a-gordeev) wrote :

The reason is quite simple: a node was not removed from cobbler, but that does not prevent its disks from being cleaned. Its pxelinux config therefore still existed and was still configured to boot from the local disk, so the node failed to boot because all data in the boot sector had been wiped.

2015-09-10T12:57:29 info: [666] Processing RPC call 'remove_nodes'
2015-09-10T12:57:29 info: [666] Total list of nodes to remove: ["node-1",
 "node-2",
 "node-3",
 "node-4",
 "node-5",
 "node-6",
 "node-7",
 "node-8",
 "node-9"]
2015-09-10T12:57:30 info: [666] Trying to remove system from cobbler: node-1
2015-09-10T12:57:30 info: [666] Trying to remove system from cobbler: node-3
2015-09-10T12:57:30 info: [666] Trying to remove system from cobbler: node-5
2015-09-10T12:57:30 info: [666] Trying to remove system from cobbler: node-7
2015-09-10T12:57:31 info: [666] Trying to remove system from cobbler: node-9
2015-09-10T12:57:33 info: [666] Trying to remove system from cobbler: node-2
2015-09-10T12:57:33 info: [666] Trying to remove system from cobbler: node-6
2015-09-10T12:57:35 info: [666] Trying to remove system from cobbler: node-4
2015-09-10T12:57:37 err: [666] Cannot remove nodes from cobbler: ["node-8"]

[root@nailgun ~]# grep -r 'Cannot remove nodes from cobbler' /var/log/docker-logs/astute/astute.log
2015-09-10T10:12:28 err: [670] Cannot remove nodes from cobbler: ["node-8"]
2015-09-10T12:57:37 err: [666] Cannot remove nodes from cobbler: ["node-8"]
2015-09-11T07:38:15 err: [674] Cannot remove nodes from cobbler: ["node-31"]

The most interesting thing is that the logs contain no messages for node-8 such as:
https://github.com/stackforge/fuel-astute/blob/master/lib/astute/cobbler_manager.rb#L56
"Trying to remove system from cobbler: node-8"
or
"System is not in cobbler: node-8"
https://github.com/stackforge/fuel-astute/blob/master/lib/astute/cobbler_manager.rb#L60

It can't be.

tags: added: module-astute

The problem is in line https://github.com/stackforge/fuel-astute/blob/master/lib/astute/cobbler_manager.rb#L52
error_nodes in this case is a reference to the same array as nodes_to_remove, so any modification of one affects the other as well.
That means the original list is modified during iteration, and the code does not work as expected.

To fix it, line 52 needs to change to: error_nodes = nodes_to_remove.dup
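
The skipping can be reproduced outside astute. The following is a minimal Ruby sketch, not the actual cobbler_manager.rb code: simulate_removal and its copy flag are hypothetical names, the retry count of 3 is an assumption, and every removal is assumed to succeed. It only illustrates what happens when an array is modified through an alias while it is being iterated, and how .dup avoids that.

```ruby
# Hypothetical stand-in for the removal loop; NOT the actual astute code.
# Assumptions: 3 retries, and every removal succeeds on the first try.
def simulate_removal(names, retries: 3, copy: false)
  nodes_to_remove = names.dup
  # Buggy variant: both variables point at the same Array object.
  # Fixed variant: error_nodes is an independent shallow copy (.dup).
  error_nodes = copy ? nodes_to_remove.dup : nodes_to_remove
  processed = []

  retries.times do
    nodes_to_remove.each do |name|
      processed << name         # stands in for "Trying to remove system from cobbler"
      error_nodes.delete(name)  # in the buggy variant this shrinks the iterated array
    end
    break if error_nodes.empty?
  end

  { processed: processed, left: error_nodes }
end

names = (1..9).map { |i| "node-#{i}" }

buggy = simulate_removal(names)
puts "buggy order: #{buggy[:processed].join(', ')}"
puts "buggy left:  #{buggy[:left].inspect}"   # ["node-8"]

fixed = simulate_removal(names, copy: true)
puts "fixed left:  #{fixed[:left].inspect}"   # []
```

Under these assumptions the buggy variant visits node-1, 3, 5, 7, 9, then node-2, 6, then node-4, and never reaches node-8, which is the same order as in the log above. With seven names the same variant still finishes within three passes, which would be consistent with the "more than 7 nodes" wording in the summary.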

summary: - After cluster deletion node can't bootstrap - "boot sector signature not
- found, (unbootable disk/partition?)"
+ astute could skip removing of a node from cobbler if requested to remove
+ more than 7 nodes at once.

Fix proposed to branch: master
Review: https://review.openstack.org/222590

Changed in fuel:
assignee: Aleksandr Gordeev (a-gordeev) → Vladimir Kozhukalov (kozhukalov)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/222590
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=f95549e1ba6fa738ed88797d42ead113ef0518ef
Submitter: Jenkins
Branch: master

commit f95549e1ba6fa738ed88797d42ead113ef0518ef
Author: Vladimir Kozhukalov <email address hidden>
Date: Fri Sep 11 15:42:13 2015 +0300

    Fixed reference error in cobbler_manager.rb

    Due to reference error some nodes were skipped when removing.

    Change-Id: I3c03b161f0643eeca65a15a5fa5cd468a0b17e43
    Closes-Bug: #1494446
    Co-Authored-By: Alexander Gordeev <email address hidden>
    Co-Authored-By: Bulat Gaifullin <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/222860
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=6c5b73f93e24cc781c809db9159927655ced5012
Submitter: Jenkins
Branch: stable/7.0

commit 6c5b73f93e24cc781c809db9159927655ced5012
Author: Vladimir Kozhukalov <email address hidden>
Date: Fri Sep 11 15:42:13 2015 +0300

    Fixed reference error in cobbler_manager.rb

    Due to reference error some nodes were skipped when removing.

    Change-Id: I3c03b161f0643eeca65a15a5fa5cd468a0b17e43
    Closes-Bug: #1494446
    Co-Authored-By: Alexander Gordeev <email address hidden>
    Co-Authored-By: Bulat Gaifullin <email address hidden>

Moving bug for 7.0 status to 'Fix committed' due to merge of https://review.openstack.org/#/c/222860/

Tatyanka (tatyana-leontovich) wrote :

Moving back to 'Fix Committed' status for 7.0, since the fix is not verified yet.

Ihor Kalnytskyi (ikalnytskyi) wrote :

Raising the bug to Critical for 7.0, since it has nothing to do with detached components: a basic workflow (removing a number of nodes) is broken. That's why the backport to stable/7.0 was merged. It's strange to see that when you remove 5 nodes from a cluster, two of them become unresponsive.

tags: added: on verification

I deleted and reset clusters about 10 times in a row and the issue didn't show up.

"build_id": "298",
"build_number": "298",
"release_versions": {
    "2015.1.0-7.0": {
        "VERSION": {
            "build_id": "298",
            "build_number": "298",
            "api": "1.0",
            "fuel-library_sha": "0623b4daad438ceeb5dc41b10cdd3011795fff7e",
            "nailgun_sha": "d590b26dbb09785b8a8b3651b0ef69746fcf9991",
            "feature_groups": [
                "mirantis"
            ],
            "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd",
            "openstack_version": "2015.1.0-7.0",
            "fuel-agent_sha": "082a47bf014002e515001be05f99040437281a2d",
            "production": "docker",
            "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3",
            "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012",
            "fuel-ostf_sha": "1f08e6e71021179b9881a824d9c999957fcc7045",
            "release": "7.0",
            "fuelmain_sha": "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"
        }
    }
}

tags: removed: on verification
tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verified on MOS 8.0 Kilo. Not reproduced.

ISO info:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "128"
  build_id: "128"
  fuel-nailgun_sha: "70d8b7e80573728e04ac5478c112850afcfa9802"
  python-fuelclient_sha: "56fbd6bad7f60f0944b3845c2db14d0b8cabd4d3"
  fuel-agent_sha: "e881f0dabd09af4be4f3e22768b02fe76278e20e"
  fuel-nailgun-agent_sha: "d66f188a1832a9c23b04884a14ef00fc5605ec6d"
  astute_sha: "0f753467a3f16e4d46e7e9f1979905fb178e4d5b"
  fuel-library_sha: "e3d2905b9dd2cc7b4d46201ca9816dd320868917"
  fuel-ostf_sha: "41aa5059243cbb25d7a80b97f8e1060a502b99dd"
  fuelmain_sha: "51614465980e5f62a5796779d3f6c3305c1d5739"

Dmitry Pyzhov (dpyzhov) on 2015-10-21
tags: added: area-python