AIO-SX: Adding OSDs with node in 'available' hangs when stx-openstack is running

Bug #1829855 reported by Ovidiu Poncea
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Ovidiu Poncea

Bug Description

Brief Description
-----------------
Adding OSDs with node in 'available' hangs when stx-openstack is running.

Puppet hangs at:
2019-05-16T08:52:19.816 Debug: 2019-05-16 08:52:19 +0000 Executing: '/usr/sbin/ceph-disk list | grep -v 'unknown cluster' | grep " *$(readlink -f /dev/disk/by-path/pci-0000:00:0d.0-ata-2.0).*ceph data" | grep -v unprepared | grep 'osd uuid dc4cbb83-3dfa-4e1c-a5e4-98cdee42c336'

The problem is caused by ceph-disk trying to access mapped /dev/rbdX devices. These devices did not exist in the past and should be skipped by ceph-disk.
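
To illustrate the "skip mapped RBD devices" idea, here is a minimal Python sketch of filtering such devices out of a list of block device paths before querying them. It is not ceph-disk's code and not the fix that was eventually merged; the function names are hypothetical, and the naming pattern (/dev/rbd<N> for devices, /dev/rbd<N>p<M> for partitions) is an assumption about how kernel-mapped RBD devices appear.

```python
import re

# Assumption: kernel-mapped RBD devices appear as /dev/rbd<N>, with
# partitions as /dev/rbd<N>p<M>. This pattern is illustrative only.
_RBD_DEVICE = re.compile(r"^/dev/rbd\d+(p\d+)?$")


def is_rbd_device(path):
    """Return True if the path looks like a mapped RBD block device."""
    return _RBD_DEVICE.match(path) is not None


def skip_rbd_devices(paths):
    """Return the device paths with RBD-mapped devices removed, order kept."""
    return [p for p in paths if not is_rbd_device(p)]


if __name__ == "__main__":
    # Hypothetical device list: two local disks/partitions, two RBD mappings.
    devices = ["/dev/sda", "/dev/sdb1", "/dev/rbd0", "/dev/rbd1p1"]
    print(skip_rbd_devices(devices))
```

Filtering by path name keeps the check from opening the device at all, which matters here because reading from a /dev/rbdX device can block while the OSDs backing it are stopped.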

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
1. Fully deploy an AIO-SX system.
2. Add an OSD while the stx-openstack application is running.

Expected Behavior
------------------
Applications that access Ceph should pause for a couple of minutes while the OSD is added, then resume working.

Actual Behavior
----------------
Ceph is down

Reproducibility
---------------
100%

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
controller-0:/home/wrsroot# cat /etc/build.info
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="oponcea"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-cgts1-lx"
BUILD_DATE="2019-05-14 09:01:49 -0400"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/oponcea/starlingx-0/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/oponcea/starlingx-0/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Last Pass
---------
never

Test Activity
-------------
Developer Testing

Changed in starlingx:
assignee: nobody → Ovidiu Poncea (ovidiu.poncea)
Ghada Khalil (gkhalil) wrote :

Marking as release gating; the issue prevents adding a disk without a lock/unlock.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0 stx.storage
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/667903

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/667903
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4f1dde2c9d14692100727b3e6114da2dd53cad95
Submitter: Zuul
Branch: master

commit 4f1dde2c9d14692100727b3e6114da2dd53cad95
Author: Ovidiu Poncea <email address hidden>
Date: Thu Jun 27 14:12:00 2019 +0300

    AIO-SX: Fix adding OSD at runtime hanging when /dev/rbd* are mounted

    This mostly happens when the stx-openstack application is running,
    as there are a couple of pods that use RBD PVCs.

    The reason for the hang is that, on AIO-SX, all OSDs are stopped
    while new ones are configured. This makes data on the /dev/rbd*
    devices inaccessible when the runtime puppet manifests execute
    'ceph-disk list' to query the state of all storage devices in
    the system.

    This commit removes the code that stops existing OSDs on all
    deployments. A side effect is a decrease in the time needed to
    add OSDs on a node with many OSDs already configured.
    For example, adding an OSD on a system with 7 OSDs can
    take more than 2 minutes, while with this change it drops to
    around 30 seconds.

    Change-Id: I41a65d42a45a90b00f65509f4e536eb1c345a91b
    Closes-Bug: 1829855
    Signed-off-by: Ovidiu Poncea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released