Redeployment fails on removal of CephOSD from cluster

Bug #1549839 reported by Kyrylo Romanenko
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Fuel Sustaining
Milestone: 10.0

Bug Description

1. Prepare a cluster of 4 nodes:

Controller+CephOSD+Ironic
Controller+CephOSD+Ironic
Controller+CephOSD+Ironic
Compute

Compute: QEMU
Network: Neutron with VLAN segmentation
Storage backends:
    Ceph RBD for volumes (Cinder)
    Ceph RadosGW for objects (Swift API)
    Ceph RBD for ephemeral volumes (Nova)
    Ceph RBD for images (Glance)
Ceph replication factor: 2

2. Deploy it and wait until deployment completes.
3. Check OSD tree:
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.14996 root default
-2 0.04999 host node-1
 0 0.04999 osd.0 up 1.00000 1.00000
-3 0.04999 host node-3
 1 0.04999 osd.1 up 1.00000 1.00000
-4 0.04999 host node-2
 2 0.04999 osd.2 up 1.00000 1.00000

4. Then prepare node-2 (and osd.2) for deletion:
ceph osd out 2
ceph pg stat (repeat until all PGs are "active+clean"; see the wait sketch after these steps)
stop ceph-all
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2

Check that osd.2 has been successfully removed from the Ceph tree:
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.09998 root default
-2 0.04999 host node-1
 0 0.04999 osd.0 up 1.00000 1.00000
-3 0.04999 host node-3
 1 0.04999 osd.1 up 1.00000 1.00000
-4 0 host node-2

5. Delete node-2 from nodes tab in Fuel.
6. Redeploy changes.
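
Note on the "active+clean" check in step 4: a minimal wait loop might look like this (a sketch only; it assumes `ceph health` returns HEALTH_OK once rebalancing finishes and that there are no unrelated warnings):

until ceph health | grep -q HEALTH_OK; do sleep 10; done
ceph pg stat    (should now report all PGs as "active+clean")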

Expected: node-2 is deleted from the cluster.
Actual result: the redeployment quickly fails with an error:
 [500] Error running RPC method granular_deploy: 83e35741-5ea2-4381-8e76-8cb51055aedb: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
ID: 1 - Reason: Lock file and PID file exist; puppet is running.

http://paste.openstack.org/show/488197/
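
To check whether puppet is really running on the failed node, something like the following can be run there (a sketch; the exact lock file location depends on the puppet version and packaging):

puppet config print agent_catalog_run_lockfile    (path of the lock file the error refers to)
ps -ef | grep [p]uppet                            (is a puppet agent run actually in progress?)

If no puppet process is found, the lock file is stale and removing it before retrying the deployment should clear this particular error.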

Astute log error:

Puppet error on primary controller: http://paste.openstack.org/show/488195/

Version:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "570"
  build_id: "570"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Changed in mos:
importance: Undecided → Medium
tags: added: ceph
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

> [500] Error running RPC method granular_deploy: 83e35741-5ea2-4381-8e76-8cb51055aedb: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
> ID: 1 - Reason: Lock file and PID file exist; puppet is running.

The error has nothing to do with Ceph; it looks like an astute (or nailgun) bug.

> Controller+CephOSD+Ironic
> Controller+CephOSD+Ironic
> Controller+CephOSD+Ironic

> ceph osd out 2

In such a small cluster it is better to reweight the OSD being removed to zero before marking it 'out':

ceph osd crush reweight osd.2 0

See the official documentation [1] for more details.

[1] http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster
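
A consolidated removal sequence along those lines might look like this (a sketch only; it assumes the OSD id is 2 and reuses the `stop ceph-all` command from step 4 on the node hosting the OSD):

ceph osd crush reweight osd.2 0    (drain the OSD gradually instead of marking it out first)
ceph pg stat                       (repeat until all PGs are "active+clean")
ceph osd out 2
stop ceph-all                      (on node-2, the node hosting osd.2)
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2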

tags: added: nailgun
removed: ceph
Changed in mos:
assignee: MOS Ceph (mos-ceph) → nobody
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We are no longer fixing Medium bugs in 8.0, so closing this as Won't Fix.

MOS Ceph team, could you please take another look at this?

> [500] Error running RPC method granular_deploy: 83e35741-5ea2-4381-8e76-8cb51055aedb: MCollective call failed in agent 'puppetd', method 'runonce', failed nodes:
> ID: 1 - Reason: Lock file and PID file exist; puppet is running.

We also need to check the puppet logs to see why this failed ^
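
For example, on the failed node (a sketch; the log path is an assumption and depends on how puppet logging is configured on the node):

grep -iE 'err|fail' /var/log/puppet*.log    (adjust the path to the node's puppet log location)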

tags: added: area-ceph
Changed in mos:
assignee: nobody → MOS Ceph (mos-ceph)
status: New → Won't Fix
Revision history for this message
Alexei Sheplyakov (asheplyakov) wrote :

Roman,

There's nothing to look at. The OSD has been successfully removed from the cluster. Thus node-2 is an ordinary controller node.

> MCollective call failed in agent 'puppetd', method 'runonce', failed nodes: ID: 1 - Reason: Lock file and PID file exist; puppet is running.

The error has nothing to do with Ceph. Please ask the nailgun and astute experts to debug it.

Revision history for this message
Kyrylo Romanenko (kromanenko) wrote :

I still keep the live environment with this failure in case you need it for investigation.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Fuel Python team, could you please take a look at this?

Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

We have passed SCF for 9.0. Moving this medium-priority bug to the 10.0 release.

no longer affects: mos/9.0.x
Changed in mos:
status: Won't Fix → Confirmed
assignee: MOS Ceph (mos-ceph) → Fuel Python Team (fuel-python)
milestone: 8.0 → 10.0
no longer affects: mos/9.0.x
no longer affects: mos/10.0.x
Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 10.0
no longer affects: mos/8.0.x
no longer affects: mos
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python (Deprecated) (fuel-python) → Fuel Sustaining (fuel-sustaining-team)