File system corruption after environment deletion

Bug #1268641 reported by Igor Shishkin
This bug affects 3 people
Affects             Status        Importance  Assigned to                Milestone
Fuel for OpenStack  Fix Released  High        Sergii Golovatiuk
4.1.x               Won't Fix     High        Fuel Library (Deprecated)
5.0.x               Fix Released  High        Fuel Library (Deprecated)

Bug Description

ISO: 4.1-30
CentOS + HA + Ceph + GRE

Steps to reproduce:

- Bootstrap 3 nodes
- Add 3 controllers with Ceph and Cinder and 3 computes with Ceph and Cinder
- Deploy them
- After the cluster deploys successfully, run network verification and OSTF
- Then delete the environment

In my case, one of the nodes was found showing the messages in the attached screenshot.

Igor Shishkin (teran) wrote :
Changed in fuel:
importance: Undecided → Low
tags: added: nailgun
description: updated
Ryan Moe (rmoe) wrote :

I ran into this problem on a larger scale (15+ deleted nodes ended up in this state). It led to some difficult-to-debug issues. When a node is in this state, Fuel re-uses its IP address even though the stuck node is still running. Once Fuel re-assigns the IP, there is a duplicate IP on the network, and the resulting ARP conflicts cause deployments to fail at random.
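One way to confirm the duplicate-IP symptom described above is to look for an IP address that answers ARP from more than one MAC. The following is a minimal sketch, assuming the (IP, MAC) pairs have already been collected from the admin network by some other means; the function name and the sample addresses are hypothetical and are not part of Fuel.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) pairs, for example parsed
    from ARP replies captured on the network.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


# Hypothetical sample: a stuck node and a newly provisioned node both
# answer for 10.20.0.5.
observations = [
    ("10.20.0.5", "52:54:00:aa:bb:01"),
    ("10.20.0.5", "52:54:00:AA:BB:02"),
    ("10.20.0.6", "52:54:00:aa:bb:03"),
]
print(find_duplicate_ips(observations))
```

Any IP reported here with two MACs is a candidate for a deleted node that never rebooted and is still holding its old address.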

Changed in fuel:
importance: Low → High
status: New → Confirmed
Ryan Moe (rmoe) wrote :

To clarify further, we experienced this issue when deleting individual nodes, not when deleting the entire environment.

Evgeniy L (rustyrobot)
Changed in fuel:
milestone: none → 4.1
Igor Shishkin (teran) wrote :

More often I get a kernel panic; it looks like the same problem.

Vladimir Kuklin (vkuklin) wrote :

Did the node reboot into bootstrap or not?

Igor Shishkin (teran) wrote :

I just removed the environment from the Fuel WebUI. The node showed this instead of rebooting.

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
status: Confirmed → Triaged
Ryan Moe (rmoe)
tags: added: astute
tags: added: customer-found
Andrew Woodward (xarses)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Andrew Woodward (xarses)
assignee: Andrew Woodward (xarses) → nobody
assignee: nobody → Ryan Moe (rmoe)
Ryan Moe (rmoe)
Changed in fuel:
status: Triaged → In Progress
Changed in fuel:
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/71892

Changed in fuel:
assignee: Ryan Moe (rmoe) → Vladimir Kuklin (vkuklin)
Vladimir Kuklin (vkuklin) wrote :

Guys,

I've added a kernel panic timeout that should fix the problem. Could you please test it and post the result in Gerrit?

Vladimir Kuklin (vkuklin) wrote :

This bug does not affect the functionality of a deployed cluster. Only cluster deletion is affected, in which case nodes may hang. The user can reset them and continue working, so there is no need to mark this bug as Critical.

Changed in fuel:
importance: Critical → High
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Igor Shishkin (teran)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/71892
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=7eed50fc30cec675fff7787c37fcf6da6dd518ee
Submitter: Jenkins
Branch: master

commit 7eed50fc30cec675fff7787c37fcf6da6dd518ee
Author: Vladimir Kuklin <email address hidden>
Date: Fri Feb 7 14:23:30 2014 +0100

    Fix env deletion sequence

    add kernel panic timeout for env deletion
    do not do emergency remount

    Change-Id: Ie17cd74218cb4bd8e5ad64c1fe6e60e1efe5edee
    Closes-bug: #1268641
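
The "kernel panic timeout" in the commit above most likely refers to the Linux `kernel.panic` sysctl, exposed as `/proc/sys/kernel/panic`: a value of 0 leaves a panicked machine halted, while a positive value makes it reboot after that many seconds. The following is a minimal sketch of setting it, not the code from the commit; the helper names and the 10-second value in the usage note are assumptions made for illustration.

```python
from pathlib import Path

# Standard location of the kernel.panic sysctl on Linux.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def set_panic_timeout(seconds, sysctl_path=PANIC_SYSCTL):
    """Write the panic timeout in seconds (needs root for the real path).

    With a positive value the kernel reboots that many seconds after a
    panic instead of staying halted on the console.
    """
    Path(sysctl_path).write_text(f"{int(seconds)}\n")


def get_panic_timeout(sysctl_path=PANIC_SYSCTL):
    """Read back the currently configured panic timeout."""
    return int(Path(sysctl_path).read_text().strip())
```

Calling `set_panic_timeout(10)` as root before erasing a node would mean that a panic during the erase ends in a reboot rather than a hung machine. The `sysctl_path` parameter exists only so the helpers can be exercised against an ordinary file.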

Changed in fuel:
status: In Progress → Fix Committed
Nastya Urlapova (aurlapova) wrote :

{
build_id: "2014-02-20_12-38-56",
mirantis: "no",
build_number: "169",
nailgun_sha: "1cafb7c9a81946a056dcaa6554d48bf396c90e9e",
ostf_sha: "380d376b8f16d1cf040b7cabbe9133fd0dcbeadd",
fuelmain_sha: "15637d29a59f299ee8ffe6560245a6884e954cbe",
astute_sha: "3d43abeefb60677ce6cae83d31ebbba1ff3cdbe2",
release: "4.1",
fuellib_sha: "35299a0aa5c9f75ee20c5b05003403a3d51af11c"
}

Changed in fuel:
status: Fix Committed → Fix Released
Ryan Moe (rmoe) wrote :

I'm seeing this same issue again in 5.0. See attached screenshot. 2/5 nodes failed to reboot after I deleted the environment.

{"build_id": "2014-05-05_00-15-43", "mirantis": "yes", "build_number": "180", "ostf_sha": "134765fcb5a07dce0cd1bb399b2290c988c3c63b", "nailgun_sha": "2de1dcf9fa3fc1521999bff6377eaa6f01d825aa", "production": "docker", "api": "1.0", "fuelmain_sha": "95c35c199c2efc03fb105d090c5a42525430b7b3", "astute_sha": "3cffebde1e5452f5dbf8f744c6525fc36c7afbf3", "release": "5.0", "fuellib_sha": "2348fae80b21c3ec9e5f520395eea2943a510f3d"}

Changed in fuel:
milestone: 4.1 → 5.0
status: Fix Released → Confirmed
Ryan Moe (rmoe) wrote :
Igor Shishkin (teran) wrote :

Is it in bootstrap or on a deployed node? If on a deployed node, which OS is it running?
You just deleted the environment, right?

Igor Shishkin (teran)
Changed in fuel:
milestone: 5.0 → 5.1
importance: High → Medium
Aleksey Kasatkin (alekseyk-ru) wrote :
Mike Scherbakov (mihgen) wrote :

This bug affects many users, and once a node is stuck you need a remote console to reset it, so I am increasing the priority to High.

Changed in fuel:
importance: Medium → High
assignee: Igor Shishkin (teran) → Fuel Library Team (fuel-library)
Vladimir Kuklin (vkuklin) wrote :
Changed in fuel:
status: Confirmed → Triaged
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)
status: Triaged → In Progress
Dmitry Borodaenko (angdraug) wrote :

If this bug affects 4.1 and 5.1, it has to be present in 5.0 as well.

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/98460
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=694b5a55695e01e1c42185bfac9cc7a641a9bd48
Submitter: Jenkins
Branch: master

commit 694b5a55695e01e1c42185bfac9cc7a641a9bd48
Author: Aleksandr Didenko <email address hidden>
Date: Fri Jun 6 19:28:54 2014 +0300

    Correct node erase sequence

    This sequence is more accurate as it sends SIGTERM to all processes.

    - Send SIGTERM to all processes
    - Trap SIGTERM
    - Sync and set RO flag on all partitions
    - Erase bootable partitions with dd
    - Send "b" to sysrq-trigger

    Closes-Bug: 1268641
    Closes-Bug: 1279720

    Change-Id: I4eaa6ed9b2872f55efaff0a874ac280bdba02226
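
The erase sequence listed in the commit message above can be sketched as an ordered plan. This is a dry-run illustration only: it builds the command strings and never executes them. The exact commands the Astute agent runs are not shown in this thread, so `blockdev --setro`, the `dd` parameters, the `wipe_mib` default, and the function name are all stand-in assumptions. The sketch also installs the SIGTERM trap before broadcasting the signal, so that the erasing shell survives its own broadcast; the commit message lists those two steps in the opposite order.

```python
def build_erase_plan(partitions, boot_devices, wipe_mib=10):
    """Return the erase sequence as ordered (step, command) pairs.

    Nothing is executed here; the caller decides what to do with the plan.
    """
    plan = [
        # Ignore SIGTERM in the erasing shell itself.
        ("trap", "trap '' TERM"),
        # Ask every other process to exit (pid -1 addresses all of them).
        ("sigterm", "kill -s TERM -- -1"),
        # Flush dirty pages to disk.
        ("sync", "sync"),
    ]
    # Flag each partition read-only (stand-in command, see lead-in).
    plan += [("set-ro", f"blockdev --setro {part}") for part in partitions]
    # Overwrite the start of each bootable device so it cannot boot again.
    plan += [
        ("wipe", f"dd if=/dev/zero of={dev} bs=1M count={wipe_mib}")
        for dev in boot_devices
    ]
    # SysRq "b" reboots immediately, without syncing or unmounting.
    plan.append(("reboot", "echo b > /proc/sysrq-trigger"))
    return plan


# Hypothetical device names, printed for review only.
for step, command in build_erase_plan(["/dev/sda1", "/dev/sda2"], ["/dev/sda"]):
    print(f"{step:8} {command}")
```

Keeping the plan as data makes the ordering easy to review and test without touching a real disk.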

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/5.0)

Fix proposed to branch: stable/5.0
Review: https://review.openstack.org/100971

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/100972

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/5.0)

Reviewed: https://review.openstack.org/100971
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=a6f99f830ad356def866ded2e44f0f899e80ca24
Submitter: Jenkins
Branch: stable/5.0

commit a6f99f830ad356def866ded2e44f0f899e80ca24
Author: Aleksandr Didenko <email address hidden>
Date: Fri Jun 6 19:28:54 2014 +0300

    Correct node erase sequence

    This sequence is more accurate as it sends SIGTERM to all processes.

    - Send SIGTERM to all processes
    - Trap SIGTERM
    - Sync and set RO flag on all partitions
    - Erase bootable partitions with dd
    - Send "b" to sysrq-trigger

    Closes-Bug: 1268641
    Closes-Bug: 1279720

    Change-Id: I4eaa6ed9b2872f55efaff0a874ac280bdba02226
    (cherry picked from commit 694b5a55695e01e1c42185bfac9cc7a641a9bd48)

Nastya Urlapova (aurlapova) wrote :

{
build_id: "2014-07-17_11-18-10",
mirantis: "yes",
build_number: "135",
ostf_sha: "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
nailgun_sha: "1d08d6f80b6514085dd8c0af4d437ef5d37e2802",
production: "docker",
api: "1.0",
fuelmain_sha: "c8e13df4c7de3ce3504c2bcb6d51a165b9aae0b6",
astute_sha: "9a74b788be9a7c5682f1c52a892df36e4766ce3f",
release: "5.0.1",
fuellib_sha: "e8c2bb726be6b78c3a34f75c84337a3a5662bb35"
}

Anastasia Palkina (apalkina) wrote :

Verified on ISO #381
"build_id": "2014-08-01_02-01-14",
"ostf_sha": "15f3be5fbafb7a8c7075b5077a5074a50e679c25",
"build_number": "381",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "51f32395eebe2514e78eb7e0a85e694826be40d6",
"production": "docker",
"fuelmain_sha": "7990f5bfa7fea5b74ebf0402b1918109b9bc505b",
"astute_sha": "f655ee86ebf0359b014f00cff63d0aaf15c65308",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "5571b86a667e28d4c9770fcce4d32163dee5a710"

Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-astute (stable/4.1)

Change abandoned by Dmitry Pyzhov (<email address hidden>) on branch: stable/4.1
Review: https://review.openstack.org/100972
Reason: No activity for this review for a month

Dmitry Borodaenko (angdraug) wrote :

The backport to stable/4.1 got stuck in CI and was abandoned, so the bug status is reset back to Triaged.

Denis Meltsaykin (dmeltsaykin) wrote :

4.x is already out of support. Closing the bug as Won't Fix.
