[doc] Ceph OSD disks are lost at node reboot

Bug #1416855 reported by Dmitriy Novakovskiy
Affects              Status        Importance  Assigned to              Milestone
Fuel for OpenStack   Fix Released  High        Fuel Documentation Team
5.0.x                Won't Fix     High        Fuel Documentation Team
5.1.x                Won't Fix     High        Fuel Documentation Team
6.0.x                Won't Fix     High        Fuel Documentation Team
6.1.x                Fix Released  High        Fuel Documentation Team
7.0.x                Fix Released  Undecided   Fuel Documentation Team
8.0.x                Fix Released  Undecided   Fuel Documentation Team
Future               Invalid      Undecided   Fuel Documentation Team
Mitaka               Fix Released  High        Fuel Documentation Team

Bug Description

In one of our customer deployments we faced a situation where the host OS kernel initialized multiple backplanes of disks in random order at boot time. This causes OSD disks that Fuel deployed with the ceph-deploy utility to be lost from the OSD set after a node reboot, since they were mounted via /dev/sdXXX device names and those names come up different at every boot.

The solution is to mount OSD disks by UUID instead of sdXXX names. It needs to be checked whether the ceph-deploy utility can do this (so Fuel could just pass an additional parameter) or whether a more sophisticated approach is needed to solve this problem.
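
To illustrate the by-UUID approach proposed above, here is a minimal sketch (not Fuel code; the device path and mount point are hypothetical examples) that resolves a partition's filesystem UUID with blkid and prints an fstab-style entry that is not affected by device reordering:

    # Minimal sketch: mount an OSD data partition by filesystem UUID instead of
    # a /dev/sdX name that can change between boots.
    import subprocess

    def fs_uuid(device):
        # Return the filesystem UUID reported by blkid for the given partition.
        out = subprocess.check_output(["blkid", "-s", "UUID", "-o", "value", device])
        return out.decode().strip()

    if __name__ == "__main__":
        dev = "/dev/sdd1"  # hypothetical OSD data partition
        # An fstab entry keyed by UUID survives /dev/sdX reordering at boot:
        print("UUID={} /var/lib/ceph/osd/ceph-0 xfs defaults,noatime 0 0".format(fs_uuid(dev)))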

The following document describes the manual OSD management steps for the deployment where we first found this: http://goo.gl/SPZGFC. Also, Miroslav Anashkin (<email address hidden>) has detailed context.

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 6.1
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Mike Scherbakov (mihgen)
tags: added: customer-found
Changed in fuel:
status: Confirmed → Triaged
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Oleksiy Molchanov (omolchanov)
Changed in fuel:
status: Triaged → In Progress
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

This issue doesn't affect 6.1

Changed in fuel:
milestone: 6.1 → 5.0.3
Revision history for this message
Dmitriy Novakovskiy (dnovakovskiy) wrote :

>>This issue doesn't affect 6.1

Why? What was changed in how Ceph is deployed in 6.1?

Revision history for this message
Ryan Moe (rmoe) wrote :

When Fuel deploys Ceph, we set the GPT partition typecode (using sgdisk [0]) for OSD and journal partitions. Ceph installs udev rules [1] that find all partitions carrying these GUIDs and activate them as needed. If you're going to deploy new OSDs manually, you'll probably want to set the partition GUIDs.

[0] https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cobbler/templates/scripts/pmanager.py#L885
[1] https://github.com/ceph/ceph/blob/master/udev/95-ceph-osd.rules
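
For illustration, a minimal sketch of the typecode mechanism Ryan describes, loosely modeled on what pmanager.py [0] does during provisioning (the exact Fuel call differs; the disk and partition number below are hypothetical, while the GUIDs are the standard Ceph partition type codes matched by the udev rules [1]):

    # Tag a GPT partition with the Ceph OSD data type GUID so 95-ceph-osd.rules
    # can find and activate it regardless of the /dev/sdX name it gets at boot.
    import subprocess

    CEPH_OSD_TYPECODE = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"      # OSD data
    CEPH_JOURNAL_TYPECODE = "45b0969e-9b03-4f30-b4c6-b4b80ceff106"  # OSD journal

    def tag_partition(disk, partnum, typecode=CEPH_OSD_TYPECODE):
        # Set the GPT partition type GUID with sgdisk.
        subprocess.check_call(
            ["sgdisk", "--typecode={}:{}".format(partnum, typecode), disk])

    tag_partition("/dev/sdd", 1)  # hypothetical: first partition of /dev/sdd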

Revision history for this message
Dmitriy Novakovskiy (dnovakovskiy) wrote :

Ryan, do I understand correctly that you're describing how Ceph deployment is done in 6.1?

Miroslav, please verify that the approach Ryan describes will help.

Revision history for this message
Ryan Moe (rmoe) wrote :

This is how Fuel has deployed Ceph since the feature was added.

Revision history for this message
Dmitriy Novakovskiy (dnovakovskiy) wrote : Re: [Bug 1416855] Re: Ceph OSD disks are lost at node reboot

Then this needs to be reviewed by Miroslav. He first observed this problem at a
customer installation.

On Monday, February 9, 2015, Ryan Moe <email address hidden> wrote:

> This is how Fuel has deployed Ceph since the feature was added.

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Miroslav Anashkin (manashkin)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote : Re: Ceph OSD disks are lost at node reboot

What was the version of Fuel for the reported issue?

Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :

The issue with OSDs happens only when disks are added manually after the deployment phase, during operation.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Ryan and Sergii are right; this problem can only affect disks that were added manually after deployment. I've checked on a 5.0.1 environment I have: the udev rules are in place and the OSD partitions have the GUID set, so I'm marking this invalid for 5.0.
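
For reference, a minimal sketch of the kind of check described above (the disk and partition number are hypothetical): read the GPT partition type GUID with sgdisk -i and compare it to the Ceph OSD data type code that the udev rules match on.

    # Confirm that a partition already carries the Ceph OSD data type GUID.
    import subprocess

    CEPH_OSD_TYPECODE = "4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D"

    def has_ceph_osd_guid(disk, partnum):
        # sgdisk -i prints a "Partition GUID code:" line for the given partition.
        out = subprocess.check_output(["sgdisk", "-i", str(partnum), disk]).decode()
        return CEPH_OSD_TYPECODE in out.upper()

    print(has_ceph_osd_guid("/dev/sdd", 1))  # hypothetical OSD data partition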

Revision history for this message
Ryan Moe (rmoe) wrote :

This is invalid for 5.1 and 6.0 for the same reasons it's invalid for 6.1 and 5.0. We set the partition GUID during provisioning and the udev rules are present in both of those releases.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Please document the correct procedure for adding disks to deployed OSD nodes in the Operations Guide. See the document linked from the bug description for reference.

tags: added: docs
Revision history for this message
Denis Klepikov (dklepikov) wrote :

Draft "How to add OSD with mount by UDEV on reboot, with notes."

https://docs.google.com/a/mirantis.com/document/d/18gPSkw4Cg3cV5mHF-O3OxATMqPSatiIBwUw-l-uD1-k/edit?usp=sharing

Comments are welcome.
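
As a rough companion to the draft, a minimal sketch of one way to add an OSD so it comes back after reboot (this is not necessarily the draft's exact procedure; the device paths are hypothetical): letting ceph-disk prepare the device sets the Ceph GPT type GUIDs itself, so the udev rules activate the OSD on every boot without an fstab entry keyed to /dev/sdX.

    # Prepare a raw disk as a Ceph OSD; ceph-disk tags the data and journal
    # partitions with the Ceph type GUIDs so udev activation works across reboots.
    import subprocess

    def prepare_new_osd(data_disk, journal_disk=None):
        cmd = ["ceph-disk", "prepare", data_disk]
        if journal_disk:
            cmd.append(journal_disk)
        subprocess.check_call(cmd)

    prepare_new_osd("/dev/sde", "/dev/sdb")  # hypothetical data and journal devices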

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Published bulletin:

https://online.mirantis.com/hubfs/Mirantis-Technical-Bulletin-5-Removing-Ceph-OSD-node.pdf?t=1427907150102

Revision history for this message
Dmitriy Novakovskiy (dnovakovskiy) wrote : Re: [Bug 1416855] Re: Ceph OSD disks are lost at node reboot

It seems to be dealing with another issue: removing OSD nodes from the Fuel
UI, not OSDs getting lost at reboot.

On Wed, Apr 1, 2015 at 8:31 PM, Miroslav Anashkin <email address hidden>
wrote:

> Published bulletin:
>
> https://online.mirantis.com/hubfs/Mirantis-Technical-Bulletin-5-Removing-Ceph-OSD-node.pdf?t=1427907150102

Revision history for this message
Andrey Grebennikov (agrebennikov) wrote : Re: Ceph OSD disks are lost at node reboot

This still doesn't help. We are experiencing this issue right now on 6.0. We have 2 disks for journals, one for the OS, and 10 for OSDs. When the node was bootstrapped, sda and sdb were assigned as journals, sdc became the OS disk, and the remaining disks became OSDs. In the puppet log, however, I see that /dev/sdl and /dev/sdk are the journal disks. If I reboot the node now, these last two disks will become /dev/sda and /dev/sdb, and no OSD will be able to start anymore, since they will not be able to find their journals.
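
To make this failure mode concrete, a minimal sketch of a check for it (the OSD ids and paths are hypothetical examples for one node): an OSD whose journal symlink points at a bare /dev/sdX node breaks when disks are reordered, while one pointing at a persistent /dev/disk/by-* path does not.

    # Report whether each OSD's journal symlink uses a persistent device path.
    import os

    def journal_is_stable(osd_id):
        target = os.readlink("/var/lib/ceph/osd/ceph-{}/journal".format(osd_id))
        return target.startswith("/dev/disk/by-")

    for osd_id in (0, 1, 2):  # hypothetical OSD ids on this node
        print(osd_id, "stable" if journal_is_stable(osd_id) else "breaks if disks reorder")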

summary: - Ceph OSD disks are lost at node reboot
+ [doc] Ceph OSD disks are lost at node reboot
Revision history for this message
Igor Shishkin (teran) wrote :

Moving to 6.1-updates since 6.1 is already released.

Changed in fuel:
milestone: 6.1 → 6.1-updates
Changed in fuel:
milestone: 6.1-updates → 9.0
status: Triaged → New
Dmitry Pyzhov (dpyzhov)
tags: added: area-docs
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Ceph-deploy 1.5.20 uses disk IDs to link journals.
So the issue is fixed in 6.1 and higher versions.
I marked these versions as Fix Released, since we ship a Ceph version containing the fix for them.

Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

MOS 5.0, MOS 5.1, and MOS 6.0 are no longer supported. Moving to Won't Fix.
