[upgrade] Ceph node deletion from simple cluster after upgrade failed with "Ceph data still exists on: node-1. You must manually remove the OSDs from the cluster and allow Ceph to rebalance before deleting these nodes"

Bug #1446089 reported by Andrey Sledzinskiy
Affects              Status         Importance   Assigned to   Milestone
Fuel for OpenStack   Fix Released   High         Ryan Moe
6.1.x                Fix Released   High         Ryan Moe

Bug Description

{"build_id": "2015-04-17_15-24-00",
"ostf_sha": "4bab9b975ace8d9a305d6e0f112b734de587f847",
"build_number": "321",
"release_versions": {
"2014.2-6.0": {
"VERSION": {
"build_id": "2014-12-26_14-25-46",
"ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
"build_number": "58",
"api": "1.0",
"nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
"production": "docker",
"fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
"astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
"feature_groups":
["mirantis"],
"release": "6.0",
"fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}},
"2014.2-6.1": {
"VERSION": {
"build_id": "2015-04-17_15-24-00",
"ostf_sha": "4bab9b975ace8d9a305d6e0f112b734de587f847",
"build_number": "321",
"api": "1.0",
"nailgun_sha": "939e5780cd0f7b1af3afd2926eda30f81bfc3e3f",
"openstack_version": "2014.2-6.1",
"production": "docker",
"python-fuelclient_sha": "0698062e9b044becf07bf9918fa16613aa3d93ad",
"astute_sha": "bf1751a4fe0d912325e3b4af629126a59c1b2b51",
"feature_groups":
["mirantis"],
"release": "6.1",
"fuelmain_sha": "5981d230e9484c196022a027c5c1600e36b17a72",
"fuellib_sha": "65617981bef34ea96b85d2d389cc037c304516e5"}}},
"auth_required": true,
"api": "1.0",
"nailgun_sha": "939e5780cd0f7b1af3afd2926eda30f81bfc3e3f",
"openstack_version": "2014.2-6.1",
"production": "docker",
"python-fuelclient_sha": "0698062e9b044becf07bf9918fa16613aa3d93ad",
"astute_sha": "bf1751a4fe0d912325e3b4af629126a59c1b2b51",
"feature_groups":
["mirantis"],
"release": "6.1",
"fuelmain_sha": "5981d230e9484c196022a027c5c1600e36b17a72",
"fuellib_sha": "65617981bef34ea96b85d2d389cc037c304516e5"}

Steps:
1. Create and deploy the following 6.0-58 cluster: CentOS, simple, Neutron VLAN, Ceph for volumes and images, 1 controller+ceph, 2 compute+ceph
2. Upgrade fuel to 6.1
3. Delete 1 compute+ceph node from 6.0 cluster and start re-deployment

Actual result - node deletion failed with
Ceph data still exists on: node-1. You must manually remove the OSDs from the cluster and allow Ceph to rebalance before deleting these nodes.

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Dmitry Ilyin (idv1985)
Changed in fuel:
status: New → Confirmed
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Ryan Moe (rmoe)
Ryan Moe (rmoe) wrote :

I don't think this is a bug. By preventing the deletion of OSDs with active data we prevent the user from inadvertently losing data. This is a good thing.
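
The guard described here can be pictured with a minimal Python sketch. It is not Astute's actual check (Astute is written in Ruby); the function names and the two input mappings (OSD id to host name, OSD id to placement-group count) are hypothetical and only illustrate the idea of refusing deletion while an OSD on the node still holds data.

```python
def nodes_with_ceph_data(nodes_to_delete, osd_hosts, pg_counts):
    """Return the sorted names of nodes whose OSDs still hold placement groups.

    nodes_to_delete: iterable of host names scheduled for deletion
    osd_hosts: dict mapping OSD id -> host name (hypothetical input)
    pg_counts: dict mapping OSD id -> number of placement groups on that OSD
    """
    targets = set(nodes_to_delete)
    blocked = {
        host
        for osd_id, host in osd_hosts.items()
        if host in targets and pg_counts.get(osd_id, 0) > 0
    }
    return sorted(blocked)


def check_deletion(nodes_to_delete, osd_hosts, pg_counts):
    """Raise instead of deleting when any target node still stores data."""
    blocked = nodes_with_ceph_data(nodes_to_delete, osd_hosts, pg_counts)
    if blocked:
        raise RuntimeError(
            "Ceph data still exists on: " + ", ".join(blocked) + ". "
            "You must manually remove the OSDs from the cluster and allow "
            "Ceph to rebalance before deleting these nodes."
        )
```

Once every OSD on the node reports zero placement groups, `check_deletion` returns without raising and the deletion could proceed.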

Changed in fuel:
status: Confirmed → Invalid
Andrey Sledzinskiy (asledzinskiy) wrote :

So does this mean that I can't automatically delete a Ceph node from the cluster?
I don't think it's right that I have to do manual work in order to delete it, especially since this case worked fine before.
Please ping me when you have time to look at this env and decide what to do with it.

Changed in fuel:
status: Invalid → Confirmed
Dmitry Pyzhov (dpyzhov)
tags: added: feature-upgrade
removed: upgrade
Ryan Moe (rmoe) wrote :

I'll add documentation to the ops guide about how to safely remove an OSD node. The problem is that it didn't work fine before. Customers have encountered data loss because Fuel allowed them to delete OSDs with no regard for data integrity.
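
As a rough illustration of the "allow Ceph to rebalance" part of a safe removal, the sketch below polls a health callable until it reports HEALTH_OK. The helper name and its parameters are assumptions, not part of Fuel or the ops guide, and treating HEALTH_OK as "rebalancing finished" is a conservative simplification.

```python
import time


def wait_for_health_ok(get_health, timeout=3600.0, interval=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_health() until it returns a status starting with HEALTH_OK.

    get_health: callable returning the cluster health as a string
                (hypothetical hook; it could wrap the `ceph health` command)
    Returns True once the cluster is healthy, False if the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        status = get_health()
        if status.startswith("HEALTH_OK"):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

On a node with the Ceph client installed, `get_health` could be `lambda: subprocess.check_output(["ceph", "health"], text=True)`; passing the callable in keeps the loop testable without a live cluster.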

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-docs (master)

Fix proposed to branch: master
Review: https://review.openstack.org/177462

Changed in fuel:
status: Confirmed → In Progress
Andrey Sledzinskiy (asledzinskiy) wrote :

Ryan, please assign it to the QA team once it's done so that we can change our test.

tags: added: docs non-release release-notes
no longer affects: fuel/7.0.x
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/177462
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=522dd74a1d59e765724e38ec5de16b42d500bd5c
Submitter: Jenkins
Branch: master

commit 522dd74a1d59e765724e38ec5de16b42d500bd5c
Author: Ryan Moe <email address hidden>
Date: Fri Apr 24 14:14:04 2015 -0700

    Add instructions for removing OSDs to ops guide

    OSDs must be manually removed from the Ceph cluster
    before Fuel can delete them. The operations guide now
    explains how to do that.

    Change-Id: I7dc1e7d1f144a85bdf840ffbd655af929d1b72d4
    Closes-bug: #1446089
    Related-bug: #1428355

Changed in fuel:
status: In Progress → Fix Committed
Dennis Dmitriev (ddmitriev) wrote :

Related bug about changing the system test is added: https://bugs.launchpad.net/fuel/+bug/1453416

Maksym Strukov (unbelll)
tags: added: on-verification
Maksym Strukov (unbelll) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/182425

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/182425
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=8b9240cfd06c2856dd77f0bc80515ed4d1a050ec
Submitter: Jenkins
Branch: master

commit 8b9240cfd06c2856dd77f0bc80515ed4d1a050ec
Author: Ryan Moe <email address hidden>
Date: Tue May 12 12:27:43 2015 -0700

    Fix typo in ops guide for deleting OSDs

    Change-Id: I55ddcbd0db39d7d70d98d6581bdcf1c4dadfcd10
    Related-bug: #1446089

Maksym Strukov (unbelll) wrote :

Env:

{
  "build_id": "2015-05-12_08-34-41",
  "build_number": "406",
  "release_versions": {
    "2014.2-6.0": {
      "VERSION": {
        "build_id": "2014-12-26_14-25-46",
        "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
        "build_number": "58",
        "api": "1.0",
        "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
        "production": "docker",
        "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
        "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
        "feature_groups": [
          "mirantis"
        ],
        "release": "6.0",
        "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"
      }
    },
    "2014.2.2-6.1": {
      "VERSION": {
        "build_id": "2015-05-12_08-34-41",
        "build_number": "406",
        "api": "1.0",
        "fuel-library_sha": "156fb11bbf3e12e7c73a9a3ac785c9d33d4ac343",
        "nailgun_sha": "0d077692e813720410c52bc720a8462725446e0d",
        "feature_groups": [
          "mirantis"
        ],
        "openstack_version": "2014.2.2-6.1",
        "production": "docker",
        "python-fuelclient_sha": "af6c9c3799b9ec107bcdc6dbf035cafc034526ce",
        "astute_sha": "e319b19158fc416d911edf0c06667e810c457b02",
        "fuel-ostf_sha": "51b41cba7572aefa4a98e40fdecdbc05efb2e1ea",
        "release": "6.1",
        "fuelmain_sha": "51b86bb24b27742a22b23e2ae3dfc850c47e5fbf"
      }
    }
  },
  "auth_required": true,
  "api": "1.0",
  "fuel-library_sha": "156fb11bbf3e12e7c73a9a3ac785c9d33d4ac343",
  "nailgun_sha": "0d077692e813720410c52bc720a8462725446e0d",
  "feature_groups": [
    "mirantis"
  ],
  "openstack_version": "2014.2.2-6.1",
  "production": "docker",
  "python-fuelclient_sha": "af6c9c3799b9ec107bcdc6dbf035cafc034526ce",
  "astute_sha": "e319b19158fc416d911edf0c06667e810c457b02",
  "fuel-ostf_sha": "51b41cba7572aefa4a98e40fdecdbc05efb2e1ea",
  "release": "6.1",
  "fuelmain_sha": "51b86bb24b27742a22b23e2ae3dfc850c47e5fbf"
}

Steps:
1. Create and deploy the following 6.0-58 cluster: CentOS, simple, Neutron VLAN, Ceph for volumes and images, 1 controller+ceph, 2 compute+ceph
2. Upgrade fuel to 6.1
3. Do: http://paste.mirantis.net/show/381/
4. Delete 1 compute+ceph node from 6.0 cluster and start re-deployment

Actual:

2015-05-12T18:54:58 err: [651] Error running RPC method remove_nodes: undefined method `[]' for nil:NilClass, trace:
["/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/pre_delete.rb:74:in `remove_ceph_mons'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/orchestrator.rb:202:in `remove_ceph_mons'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/orchestrator.rb:210:in `perform_pre_deletion_tasks'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/orchestrator.rb:103:in `remove_nodes'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/server/dispatcher.rb:168:in `remove_nodes'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/server/server.rb:142:in `dispatch_message'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/server/server.rb:103:in `block in dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-6.1.0/lib/astute/server/task_queue.rb:64:in `cal...


Ryan Moe (rmoe) wrote :

Maxim, looks like you might have run into this bug: https://bugs.launchpad.net/fuel/+bug/1453408

Maksym Strukov (unbelll) wrote :

{"build_id": "2015-05-12_17-03-01", "build_number": "408", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}}, "2014.2.2-6.1": {"VERSION": {"build_id": "2015-05-12_17-03-01", "build_number": "408", "api": "1.0", "fuel-library_sha": "156fb11bbf3e12e7c73a9a3ac785c9d33d4ac343", "nailgun_sha": "042ed77ffdff22a5242ac7f8a6980836d0a5ef1a", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "af6c9c3799b9ec107bcdc6dbf035cafc034526ce", "astute_sha": "d9d488c4c675e6dd33eb68b33a79abe591b4c26f", "fuel-ostf_sha": "21afa436e725be1debadf1c207018753b537c7b3", "release": "6.1", "fuelmain_sha": "51b86bb24b27742a22b23e2ae3dfc850c47e5fbf"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "156fb11bbf3e12e7c73a9a3ac785c9d33d4ac343", "nailgun_sha": "042ed77ffdff22a5242ac7f8a6980836d0a5ef1a", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "af6c9c3799b9ec107bcdc6dbf035cafc034526ce", "astute_sha": "d9d488c4c675e6dd33eb68b33a79abe591b4c26f", "fuel-ostf_sha": "21afa436e725be1debadf1c207018753b537c7b3", "release": "6.1", "fuelmain_sha": "51b86bb24b27742a22b23e2ae3dfc850c47e5fbf"}

#1453408 is fixed now.

So I removed node-3 from the env. But on the other nodes:

[root@node-1 ~]# ceph osd tree
# id weight type name up/down reweight
-1 0.2 root default
-2 0.09998 host node-1
0 0.04999 osd.0 up 1
1 0.04999 osd.1 up 1
-3 0 host node-3
-4 0.09998 host node-2
4 0.04999 osd.4 up 1
5 0.04999 osd.5 up 1
2 0 osd.2 down 0
3 0 osd.3 down 0

So I removed OSDs 2 and 3:

[root@node-1 ~]# ceph osd rm 2
removed osd.2
[root@node-1 ~]# ceph osd rm 3
removed osd.3

[root@node-1 ~]# ceph osd tree
# id weight type name up/down reweight
-1 0.2 root default
-2 0.09998 host node-1
0 0.04999 osd.0 up 1
1 0.04999 osd.1 up 1
-3 0 host node-3
-4 0.09998 host node-2
4 0.04999 osd.4 up 1
5 0.04999 osd.5 up 1

1. Maybe we should add these steps to the manual?
2. How do we remove "-3 0 host node-3"? Should we do this?
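
The cleanup discussed here can be sketched in Python as a small planner that reads `ceph osd tree` output in the flattened form pasted above and prints the commands an operator might then run by hand. The function name is hypothetical and the parsing assumes exactly the column layout shown in this comment; the emitted commands (`ceph osd crush remove`, `ceph auth del`, `ceph osd rm`) are standard Ceph CLI commands, but the plan is only a suggestion to review, not the procedure from the ops guide.

```python
def plan_cleanup(osd_tree_text):
    """Suggest ceph commands for down OSDs and for host buckets with no OSDs."""
    hosts = {}       # host name -> OSD names listed directly after it
    down_osds = []   # numeric ids of OSDs reported as "down"
    current = None   # host bucket currently being read, if any

    for line in osd_tree_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "#":
            continue  # blank line or column header
        kind = fields[2]
        if kind == "host" and len(fields) >= 4:
            current = fields[3]
            hosts[current] = []
        elif kind.startswith("osd."):
            if current is not None:
                hosts[current].append(kind)
            if len(fields) > 3 and fields[3] == "down":
                down_osds.append(int(fields[0]))
        else:
            current = None  # root or another bucket type

    commands = []
    for osd_id in sorted(down_osds):
        commands += [
            f"ceph osd crush remove osd.{osd_id}",  # may report the OSD is not in the CRUSH map
            f"ceph auth del osd.{osd_id}",
            f"ceph osd rm {osd_id}",
        ]
    for host, osds in hosts.items():
        if not osds:
            commands.append(f"ceph osd crush remove {host}")  # drop the empty host bucket
    return commands
```

For the first tree above, the plan lists the three commands for each of osd.2 and osd.3 followed by `ceph osd crush remove node-3`; for the second tree only the host removal remains.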

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-docs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/184347

Maksym Strukov (unbelll) wrote :

// Sorry for my previous comment, wrong window (comment already hidden) //

The current manual (with the latest proposed patch) seems fine to me.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/184347
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=f902ae99dd7af93c91e10cb74da8684dcaa18320
Submitter: Jenkins
Branch: master

commit f902ae99dd7af93c91e10cb74da8684dcaa18320
Author: Ryan Moe <email address hidden>
Date: Tue May 19 14:08:28 2015 -0700

    Update commands for removing OSDs

    Added OS-specific commands for stopping OSD processes.
    Also added commands to remove the OSDs and host from
    the CRUSH map.

    Change-Id: Ic4a58135c546fc8867a838aab9e366be41886095
    Related-bug: #1446089

Maksym Strukov (unbelll) wrote :

I assume that the work is completed.

tags: removed: on-verification