[Image Based] cloud-init resets all node files after a cluster is deleted and another one is deployed on the same nodes

Bug #1394599 reported by Andrey Sledzinskiy
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Alexander Gordeev

Bug Description

{

    "build_id": "2014-11-17_17-53-34",
    "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068",
    "build_number": "504",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "8d23d1b1bcd9213a70a40c38c3c1486d215d40b5",
    "production": "docker",
    "fuelmain_sha": "8d4943d5ead7a894d4af5e10172510fa60eeed84",
    "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-11-17_17-53-34",
                "ostf_sha": "82465a94eed4eff1fc8d8e1f2fb7e9993c22f068",
                "build_number": "504",
                "api": "1.0",
                "nailgun_sha": "8d23d1b1bcd9213a70a40c38c3c1486d215d40b5",
                "production": "docker",
                "fuelmain_sha": "8d4943d5ead7a894d4af5e10172510fa60eeed84",
                "astute_sha": "65eb911c38afc0e23d187772f9a05f703c685896",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "8a0ceff90777af75a3f9363a57185e608f3ee10d"
            }
        }
    },
    "fuellib_sha": "8a0ceff90777af75a3f9363a57185e608f3ee10d"

}

Steps:
1. Create and deploy the following cluster: Ubuntu, HA, Neutron GRE, image-based provisioning, 3 controllers, 2 computes, 1 cinder node
2. After deployment, delete the cluster
3. Create a new cluster: CentOS, HA, Neutron VLAN, image-based provisioning, 3 controllers, 2 compute nodes
4. Provision the cluster

Expected: all nodes are provisioned successfully
Actual: about 1 run out of 4, one of the nodes is provisioned, but after the node restarts and cloud-init starts up, all of the node's files are destroyed

Logs are attached

Andrey Sledzinskiy (asledzinskiy) wrote :
tags: added: experimental
Changed in fuel:
status: New → Confirmed
tags: added: cloud-init
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 6.0 → 6.1
Alexander Gordeev (a-gordeev) wrote :

The root cause is still unknown. We do not even have a stable, repeatable way to reproduce it.

At first I thought it was a failure inside the boothook script. It was not: the boothook script worked fine every time I tried it. Patch for the boothook scripts: https://review.openstack.org/#/c/138384/

It might be an issue with cloud-init's semaphores. For some unknown reason, the cloud-init log was full of messages saying that all config_* modules had already been run. The executor simply checks for the semaphore and skips the module if it exists.

The semaphores are stored in /var/lib/cloud/instance/sem/config_*
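
For illustration, a minimal Python sketch of the "check the semaphore, skip if it exists" behaviour described above. This is not cloud-init's actual code: the function names are hypothetical, and only the directory path is taken from this comment. It could be used on an affected node to see which config_* semaphores are already present before cloud-init runs.

```python
import os
import tempfile

# Path taken from the comment above; everything else here is a hypothetical sketch.
DEFAULT_SEM_DIR = "/var/lib/cloud/instance/sem"


def list_config_semaphores(sem_dir=DEFAULT_SEM_DIR):
    """Return the sorted names of config_* semaphore files found in sem_dir."""
    if not os.path.isdir(sem_dir):
        return []
    return sorted(n for n in os.listdir(sem_dir) if n.startswith("config_"))


def should_run(sem_name, sem_dir=DEFAULT_SEM_DIR):
    """Mimic the skip logic: a module runs only if its semaphore file is absent."""
    return not os.path.exists(os.path.join(sem_dir, sem_name))


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of a real node.
    with tempfile.TemporaryDirectory() as demo_dir:
        open(os.path.join(demo_dir, "config_example"), "w").close()
        print(list_config_semaphores(demo_dir))
        print(should_run("config_example", demo_dir))
        print(should_run("config_other", demo_dir))
```

If stale semaphores survive re-provisioning, a listing like this taken right after first boot would show config_* entries before any module has actually run.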

I have only one strategy to follow. We need to disable the automatic cloud-init start on boot (just removing the links from /etc/rc.d/* should help), then start cloud-init by hand under `strace` or some other heavy-duty debugging tool and watch what happens.
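
A rough sketch of what the manual run could look like, written as a small Python helper that only builds the command line and does not execute anything. The helper name, the log path, and the choice of the `modules --mode config` stage are assumptions made for illustration; `-f` (follow child processes) and `-o` (write the trace to a file) are standard strace options.

```python
import shlex


def strace_command(target_argv, log_path="/tmp/cloud-init.strace"):
    """Build an argv list that runs target_argv under strace.

    -f follows child processes, -o writes the trace to log_path.
    The default log path is an arbitrary choice for this sketch.
    """
    return ["strace", "-f", "-o", log_path] + list(target_argv)


if __name__ == "__main__":
    # Assumption: the config stage is the interesting one, since the log
    # complained about config_* modules having already been run.
    cmd = strace_command(["cloud-init", "modules", "--mode", "config"])
    # Print a copy-pasteable shell line rather than running it here.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a root shell on the node once the init links have been removed.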

This sounds like a very time-consuming task.

Andrey Sledzinskiy (asledzinskiy) wrote :

Also, lately this issue has been reproduced sporadically, mainly on CentOS clusters.
On our CI it fails about 1 time out of 4.

Anastasia Palkina (apalkina) wrote :

Reproduced on ISO #49 for 6.0

{
    "build_id": "2014-12-09_22-41-06",
    "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
    "build_number": "49",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1",
    "production": "docker",
    "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee",
    "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-12-09_22-41-06",
                "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
                "build_number": "49",
                "api": "1.0",
                "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1",
                "production": "docker",
                "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee",
                "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"
            }
        }
    },
    "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"
}

1. Create new environment (Ubuntu, HA mode)
2. Choose nova-network, flat
3. Choose both Ceph
4. Add 3 controllers, 2 computes, 2 ceph
5. Choose Image Based provisioning
6. Start deployment
7. One of the nodes hangs during provisioning; the other nodes are provisioned successfully

Anastasia Palkina (apalkina) wrote :
tags: added: release-notes
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/146776

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/146776
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=2e9d2733a2ddd1ea3ff583d0a9f81792e2569dba
Submitter: Jenkins
Branch: master

commit 2e9d2733a2ddd1ea3ff583d0a9f81792e2569dba
Author: Alexei Sheplyakov <email address hidden>
Date: Tue Jan 13 08:55:28 2015 +0300

    Fix rebooting of the bootstrap nodes

    Skip the hard reboot for the image based provisioning since the reboot
    command might hit a node which has booted into the provisioned OS (which
    causes the filesystem corruption and interrupts the deployment).
    Fix the condition which selects the bootstrap nodes, that is, use
    SshHardReboot instead of SshRebootNotProvisioning (the latter reboots
    the locally booted nodes instead the bootstrap ones due to the inverted
    condition).

    Related-bug: #1394599
    Related-bug: #1407634
    Change-Id: Ie4af6904a8297d9acbc4e96425903e9e57450286
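
To make the commit message above easier to follow, here is a small Python sketch of the two behaviours it describes: skipping the hard reboot for image-based provisioning, and selecting bootstrap nodes rather than locally booted ones. This is not the fuel-astute code (which is Ruby); the node dictionaries, field names, and the "image" method label are all hypothetical.

```python
def nodes_to_hard_reboot(nodes, provision_method):
    """Return the names of nodes that are safe to hard-reboot.

    nodes: list of dicts with hypothetical keys "name" and "booted_into"
           ("bootstrap" or "local").
    provision_method: hypothetical label, "image" for image-based provisioning.
    """
    # Image-based provisioning: skip the hard reboot entirely, because the
    # command could hit a node that has already booted the provisioned OS.
    if provision_method == "image":
        return []
    # Correct condition: only nodes still running the bootstrap image.
    return [n["name"] for n in nodes if n["booted_into"] == "bootstrap"]


def nodes_to_hard_reboot_inverted(nodes):
    """The kind of inverted condition the commit describes: it picks the
    locally booted nodes instead of the bootstrap ones."""
    return [n["name"] for n in nodes if n["booted_into"] != "bootstrap"]


if __name__ == "__main__":
    nodes = [
        {"name": "node-1", "booted_into": "bootstrap"},
        {"name": "node-2", "booted_into": "local"},
    ]
    print(nodes_to_hard_reboot(nodes, "classic"))
    print(nodes_to_hard_reboot(nodes, "image"))
    print(nodes_to_hard_reboot_inverted(nodes))
```

With the inverted check, node-2 (already running its provisioned OS) is the one that gets rebooted, which matches the filesystem corruption described in this bug.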

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (stable/6.0)

Related fix proposed to branch: stable/6.0
Review: https://review.openstack.org/147223

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (stable/6.0)

Reviewed: https://review.openstack.org/147223
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
Submitter: Jenkins
Branch: stable/6.0

commit f7cda2171b0b677dfaeb59693d980a2d3ee4c3e0
Author: Alexei Sheplyakov <email address hidden>
Date: Tue Jan 13 08:55:28 2015 +0300

    Fix rebooting of the bootstrap nodes

    Skip the hard reboot for the image based provisioning since the reboot
    command might hit a node which has booted into the provisioned OS (which
    causes the filesystem corruption and interrupts the deployment).
    Fix the condition which selects the bootstrap nodes, that is, use
    SshHardReboot instead of SshRebootNotProvisioning (the latter reboots
    the locally booted nodes instead the bootstrap ones due to the inverted
    condition).

    Related-bug: #1394599
    Related-bug: #1407634
    Change-Id: Ie4af6904a8297d9acbc4e96425903e9e57450286
