[ibp] Provisioning gets stuck on Ubuntu (bare metal): 'mdadm: Cannot get exclusive access to /dev/md127: Perhaps a running process, mounted filesystem or active volume group?'

Bug #1456276 reported by Artem Panchenko
This bug affects 2 people
Affects (Status / Importance / Assigned to):
- Fuel for OpenStack: Fix Released / High / Alexander Adamov
- 6.1.x: Won't Fix / High / Fuel Python (Deprecated)
- 7.0.x: Fix Released / High / Alexander Adamov
- 8.0.x: New / Undecided / Fuel Documentation Team

Bug Description

Fuel version info (6.1 build #432): http://paste.openstack.org/show/227193/

Environment deployment hung because one slave node became unavailable after provisioning; here is part of the fuel-agent log:

http://paste.openstack.org/show/227132/

Bug was reproduced on bare metal lab. I was able to connect to failed slave via IPMI, but couldn't log in using default (root/r00tme) credentials (see screenshot).

Steps to reproduce:

1. Create new environment on Ubuntu.
2. Add some nodes.
3. Deploy changes

Expected result:

- cluster is deployed and works fine

Actual:

- deployment hangs on provisioning step

The issue seems to be intermittent and reproduces only on bare metal servers (I've never seen similar failures on CI or in KVM/VirtualBox deployments). A diagnostic snapshot is attached.

Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Aleksandr Gordeev (a-gordeev)
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

The question is: was this md device meant to be removed? It's obviously not an md device that Fuel created at some point in the past: it is RAID 0 and its metadata version is 1.2, so it's a user-defined md device. Currently, Fuel Agent has only rudimentary decommissioning support, which is supposed to erase everything (all LVM volumes, md arrays, and most plain partitions) but sometimes fails to do so.

I'd say this issue is medium priority. We could probably add some additional logic (e.g. fuser) to figure out which processes are using this md device and try to kill them. But, frankly, we need to develop our decommissioning feature to make it mature and data-driven.
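The fuser idea could be sketched as a small diagnostic helper (hypothetical, not part of Fuel Agent). It also checks sysfs holders first, since in this report the array was most likely held open by device-mapper rather than by a user-space process:

```shell
# Hypothetical diagnostic: report what keeps an md device busy.
# Checks kernel-level holders, mounts, then user-space processes.
md_busy_by() {
    dev=${1#/dev/}
    if [ ! -d "/sys/block/$dev" ]; then
        echo "no such block device: $dev"
        return 0
    fi
    # Kernel-level holders, e.g. dm-0 when the array backs an LVM VG
    holders=$(ls "/sys/block/$dev/holders" 2>/dev/null)
    [ -n "$holders" ] && echo "kernel holders: $holders"
    # Mounted filesystems on the device
    grep "^/dev/$dev " /proc/mounts
    # User-space processes (fuser is in psmisc and may be absent in bootstrap)
    command -v fuser >/dev/null 2>&1 && fuser -vm "/dev/$dev" 2>&1
    return 0
}
```

On the failed node one would call, for example, `md_busy_by /dev/md127`.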

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Medium → High
status: Confirmed → Incomplete
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184283

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Vladimir Kozhukalov (kozhukalov) wrote :

The thing is, there could be many possible reasons why this happened, including this bug https://bugzilla.redhat.com/show_bug.cgi?id=956053 (from the logs it is only clear that the md device was occupied by something, likely by the kernel), and currently Fuel Agent has only rudimentary decommissioning support. If there had been a complicated case where, for example, the md device is part of an LVM volume group or something similar, we definitely could not have dealt with it. So, what we actually need to address this issue and others like it is a comprehensive, data-driven decommissioning feature. I believe that is a matter for a separate story. Let's move this bug to 7.0.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 6.1 → 7.0
status: In Progress → Confirmed
assignee: Aleksandr Gordeev (a-gordeev) → Fuel Python Team (fuel-python)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

We need to add a guide for users on what to do if they hit this issue.

tags: added: release-notes
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Aleksandr Gordeev (a-gordeev)
status: Confirmed → In Progress
Revision history for this message
Alexander Gordeev (a-gordeev) wrote :

Forwarding to fuel-doc team.

Currently, fuel-agent has only very rudimentary decommissioning capabilities.

If provisioning failed with the error message 'mdadm: Cannot get exclusive access to /dev/mdXYZ: Perhaps a running process, mounted filesystem or active volume group?', the user may want to remove that md device by hand:

1. Reboot the failed node into bootstrap again.
2. Check that /dev/mdXYZ is still present.
3. Check whether /dev/mdXYZ is mounted; if so, unmount it.
4. Check whether /dev/mdXYZ belongs to any active volume group; if so, remove it from the volume group (the exact steps can be found at https://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/PV_remove.html ).
5. Proceed with the removal of /dev/mdXYZ (the exact steps can be found at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Deployment_Guide/s2-raid-manage-removing.html ).
6. Re-deploy the node.
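Steps 3 through 5 above could be sketched as a single helper (a minimal, hypothetical sketch assuming lvm2 and mdadm are available in the bootstrap image; it is destructive and should only be run on the failed node):

```shell
# Hypothetical cleanup following steps 3-5 of the manual procedure above.
# Destructive: wipes the array and any volume group sitting on it.
cleanup_md() {
    md=$1
    if [ ! -e "$md" ]; then
        echo "$md not present; nothing to clean up"
        return 0
    fi
    # Step 3: unmount if mounted
    grep -q "^$md " /proc/mounts && umount "$md"
    # Step 4: if the array is an LVM physical volume, tear the VG down first,
    # otherwise device-mapper keeps the array open and mdadm --stop fails
    vg=$(pvs --noheadings -o vg_name "$md" 2>/dev/null | tr -d ' ')
    if [ -n "$vg" ]; then
        vgchange -an "$vg"
        vgremove -ff "$vg"
        pvremove -ff "$md"
    fi
    # Step 5: stop the array and wipe member superblocks so it cannot
    # be auto-assembled again on the next boot
    members=$(mdadm --detail "$md" 2>/dev/null \
        | awk '$NF ~ /^\/dev\// && $NF !~ /:/ {print $NF}')
    mdadm --stop "$md"
    for m in $members; do
        mdadm --zero-superblock "$m"
    done
}
# usage: cleanup_md /dev/md127
```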

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Igor Shishkin (<email address hidden>) on branch: master
Review: https://review.openstack.org/184283
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

tags: added: release-notes-done rn7.0
removed: release-notes
Changed in fuel:
assignee: Fuel Documentation Team (fuel-docs) → Alexander Adamov (aadamov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/223218
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=2538aaab1f25c955fe06f0b9b36e5e168ff7fb6f
Submitter: Jenkins
Branch: master

commit 2538aaab1f25c955fe06f0b9b36e5e168ff7fb6f
Author: Alexander Adamov <email address hidden>
Date: Mon Sep 14 19:42:16 2015 +0300

    [RN 7.0]Fuel install&deploy issues-part2

    Adds resolved and known issues:
    LP1491583, LP1456276, LP1427378

    Related-Bug: #1491583
    Closes-Bug: #1456276
    Closes-Bug: #1427378

    Change-Id: Ic0a86fd8b0f081e92296c0a65fddd14969d133ba

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Vasyl Saienko (vsaienko) wrote :

Reproduced on a Supermicro server; LVM is configured on top of the software RAID:

root@debian:~# pvs
File descriptor 3 (/usr/share/bash-completion/completions) leaked on pvs invocation. Parent PID 9206: -bash
  PV VG Fmt Attr PSize PFree
  /dev/md127 vg00 lvm2 a-- 929.61g 0
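The pvs output above shows /dev/md127 serving as the physical volume of vg00, which is exactly the 'active volume group' case from the error message: device-mapper holds the array open, so mdadm cannot get exclusive access. A hedged sketch of the teardown order for this case (vg00 and /dev/md127 are taken from the output above; destructive):

```shell
# Hypothetical teardown order for the LVM-on-RAID case shown above;
# deactivating the VG first releases the holder on the array.
teardown_lvm_on_md() {
    vg=$1
    md=$2
    vgchange -an "$vg" &&   # close the dm devices holding $md open
    vgremove -ff "$vg" &&
    pvremove -ff "$md" &&
    mdadm --stop "$md"      # now succeeds: no holders left
}
# usage: teardown_lvm_on_md vg00 /dev/md127
```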

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

As the issue has been "fixed" by a documentation change, forwarding this to the Fuel Docs team to repeat the "fix" for 8.0 docs.

tags: added: area-docs