Ceph failing to start osd

Bug #1999826 reported by Felipe Sanches Zanoni
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Felipe Sanches Zanoni

Bug Description

Brief Description
-----------------
During upgrade orchestration, after host-unlock, the subcloud was in a degraded state with an alarm reporting "controller-0 experienced a service-affecting failure. Auto-recovery in progress".
Ceph had failed to start osd.0.
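On the subcloud, the corresponding alarm can be inspected with the StarlingX fault management CLI (a generic check, not output captured from this incident):

    # List active alarms; the service-affecting failure on controller-0
    # appears here while auto-recovery is in progress.
    fm alarm-list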

Severity
--------
Major

Steps to Reproduce
------------------
Upgrade a simplex subcloud from stx.7.0 to stx.8.0.
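As a rough sketch, the orchestrated upgrade is driven from the system controller via dcmanager (subcommand names as in the StarlingX distributed cloud documentation; options and strategy parameters omitted, as they vary by release):

    # Create and apply an upgrade strategy for the subcloud, then
    # monitor the steps; the failure occurs after host-unlock.
    dcmanager upgrade-strategy create
    dcmanager upgrade-strategy apply
    dcmanager strategy-step list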

Expected Behavior
------------------
Subcloud unlocked with no errors.

Actual Behavior
----------------
Subcloud unlocked with errors; Ceph failed to start osd.0.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC Subcloud simplex

Branch/Pull Time/Commit
-----------------------
N/A

Last Pass
---------
N/A

Timestamp/Logs
--------------
root@controller-0:/var/home/sysadmin# cat /var/log/ceph/ceph-init.log
Tue Dec 13 20:13:11 UTC 2022
Starting ceph services...
=== mds.controller-0 ===
Starting Ceph mds.controller-0 on controller-0...already running
=== mon.controller-0 ===
Starting Ceph mon.controller-0 on controller-0...already running
=== osd.0 ===
Mounting xfs on controller-0:/var/lib/ceph/osd/ceph-0
umount: /var/lib/ceph/osd/ceph-0: target is busy.
mount: /var/lib/ceph/osd/ceph-0: /dev/nvme2n1p1 already mounted on /var/lib/ceph/osd/ceph-0.
failed: 'modprobe xfs ; egrep -q '^[^ ]+ /var/lib/ceph/osd/ceph-0 ' /proc/mounts && umount /var/lib/ceph/osd/ceph-0 ; mount -t xfs -o rw,noatime,inode64,logbufs=8,logbsize=256k /dev/disk/by-path/pci-0000:11:00.0-nvme-1-part1 /var/lib/ceph/osd/ceph-0'
Tue Dec 13 20:13:11 UTC 2022
RC was: 1
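
The log shows the failure mode: the script's unconditional umount fails because the target is busy, and the follow-up mount then fails because the partition is already mounted at the correct path. That state can be confirmed manually (paths taken from the log above):

    # Show what is mounted at the OSD data path.
    findmnt /var/lib/ceph/osd/ceph-0

    # Show which processes hold the mount, i.e. why umount reports
    # 'target is busy'.
    fuser -vm /var/lib/ceph/osd/ceph-0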

Test Activity
-------------
Regression Testing

Workaround
----------
N/A

Changed in starlingx:
assignee: nobody → Felipe Sanches Zanoni (fsanches)
status: New → In Progress

Felipe Sanches Zanoni (fsanches):
Changed in starlingx:
status: In Progress → Fix Committed

Ghada Khalil (gkhalil):
Changed in starlingx:
importance: Undecided → Medium
status: Fix Committed → Fix Released
tags: added: stx.8.0 stx.storage

OpenStack Infra (hudson-openstack) wrote: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/868124

OpenStack Infra (hudson-openstack) wrote: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/868124
Committed: https://opendev.org/starlingx/integ/commit/08a571dc8663027565234ef948ad92d7acacc4fc
Submitter: "Zuul (22348)"
Branch: master

commit 08a571dc8663027565234ef948ad92d7acacc4fc
Author: Felipe Sanches Zanoni <email address hidden>
Date: Mon Dec 19 14:35:23 2022 -0500

    Enable ceph init script to use already mounted osd filesystem

    Ceph initialization script /etc/init.d/ceph was failing to start osd
    when osd disk is already mounted and the umount fails because disk is
    in use.

    The script line has an umount command that fails if the partition is
    in use. Then, the next mount command will fail returning 32.
    If the error is that the partition is already mounted, look for
    'already mounted on ${fs_path}' text in the output and then ignore
    the mount error returning success and continuing the start script.

    An example of error text output:
     === osd.0 ===
     Mounting xfs on controller-0:/var/lib/ceph/osd/ceph-0
     umount: /var/lib/ceph/osd/ceph-0: target is busy.
     mount: /var/lib/ceph/osd/ceph-0: /dev/nvme2n1p1 already mounted
       on /var/lib/ceph/osd/ceph-0.
     failed: 'modprobe xfs ; egrep -q '^[^ ]+ /var/lib/ceph/osd/ceph-0 '
       /proc/mounts && umount /var/lib/ceph/osd/ceph-0 ;
       mount -t xfs -o rw,noatime,inode64,logbufs=8,logbsize=256k
       /dev/disk/by-path/pci-0000:11:00.0-nvme-1-part1
       /var/lib/ceph/osd/ceph-0'

    Test-Plan:
      PASS: Validate the new script with partition already mounted
       on right location in AIO-SX and AIO-DX.
      PASS: Validate the new script with partition already mounted
       but on a different location in AIO-SX and AIO-DX.
      PASS: Validate the new script with partition not mounted in
       AIO-SX and AIO-DX.

    Closes-bug: 1999826

    Signed-off-by: Felipe Sanches Zanoni <email address hidden>
    Change-Id: I6f0c1a3c2742de62040a690dd3d65785bdc1de73
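
A minimal sketch of the check described in the commit message, assuming variables such as $fs_dev and $fs_path for the device and mount point seen in the log; the surrounding structure is illustrative, not the exact /etc/init.d/ceph code:

    # Attempt the mount, capturing stderr so the error text can be inspected.
    out=$(mount -t xfs -o rw,noatime,inode64,logbufs=8,logbsize=256k \
          "$fs_dev" "$fs_path" 2>&1)
    rc=$?
    if [ $rc -ne 0 ]; then
        # mount returns 32 on failure; if it failed only because the
        # partition is already mounted at the expected path, treat it
        # as success and let the start script continue.
        if echo "$out" | grep -q "already mounted on ${fs_path}"; then
            rc=0
        fi
    fi
    [ $rc -eq 0 ] || echo "failed: mount returned ${rc}: ${out}"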
