After upgrade ceph cluster stays unhealthy

Bug #1991414 reported by Luiz Felipe Kina
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Luiz Felipe Kina

Bug Description

Brief Description
-----------------
During a 22.06 CentOS to 22.12 Debian upgrade on an AIO-SX deployment, the ceph cluster stays unhealthy: the MDS stays in the creating state and the OSD stays down.
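
A quick way to confirm both symptoms (a sketch using standard ceph CLI commands; the expected outputs are inferred from the logs attached below, not copied from them):

# check the MDS and OSD states directly
ceph mds stat    # shows kube-cephfs:1 {0=controller-0=up:creating} while stuck
ceph osd stat    # shows "1 osds: 0 up, 0 in" while the OSD is down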

Severity
--------
Critical - ceph cluster unhealthy

Steps to Reproduce
------------------
1. Start upgrade on CentOS 22.06
2. Reinstall with the Debian 22.12 ISO
3. Run the upgrade playbook against the CentOS backup tgz with "-e wipe_ceph_osds=false" (mandatory to reproduce; with "true", ceph does not fail); see the example invocation after this list
4. Unlock controller-0 (before unlocking, ceph is healthy)
5. Run upgrade-activate (at this point ceph is unhealthy)
6. Run upgrade-complete
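
Example invocation for step 3 (illustrative only; the playbook path and the backup-related extra-vars are assumptions, only wipe_ceph_osds=false is taken from this report):

# run the AIO-SX upgrade playbook against the CentOS backup, keeping the OSD data
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade_platform.yml \
    -e "initial_backup_dir=/home/sysadmin backup_filename=<centos_backup>.tgz" \
    -e "wipe_ceph_osds=false"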

Expected Behavior
------------------
- controller-0 in unlocked / enabled / available state
- Healthy ceph cluster
- No ceph alarms

Actual Behavior
----------------
- controller-0 in unlocked / enabled / available state
- Ceph cluster unhealthy
- Multiple ceph alarms active

Reproducibility
---------------
Reproducible 100%

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
WRCP_Dev / master

Last Pass
---------
N/A

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: f932764f-af86-4078-a00c-dd05540a56dc
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 192 pgs inactive
            227 slow ops, oldest one blocked for 69797 sec, mon.controller-0 has slow ops

  services:
    mon: 1 daemons, quorum controller-0 (age 19h)
    mgr: controller-0(active, since 19h)
    mds: kube-cephfs:1 {0=controller-0=up:creating}
    osd: 1 osds: 0 up, 0 in

  data:
    pools: 3 pools, 192 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             192 unknown

ceph osd tree:

root@controller-0:/var/log/ceph# ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       7.27449 root storage-tier
-2       7.27449     chassis group-0
-3       7.27449         host controller-0
 0   hdd 7.27449             osd.0        down        0 1.00000

ceph health detail:

root@controller-0:/var/log/ceph# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 192 pgs inactive; 232 slow ops, oldest one blocked for 71132 sec, mon.controller-0 has slow ops
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
    mds.controller-0(mds.0): 31 slow metadata IOs are blocked > 30 secs, oldest blocked for 71135 secs
PG_AVAILABILITY Reduced data availability: 192 pgs inactive
    pg 1.29 is stuck inactive for 71179.441580, current state unknown, last acting []
    pg 1.2a is stuck inactive for 71179.441580, current state unknown, last acting []

ceph-process-states:

2022-09-27 14:52:08.459 /etc/init.d/ceph osd.0 ERROR: Process failed to go up in 300s after start, reporting it as HANGED!
2022-09-27 14:52:08.498 /etc/init.d/ceph-init-wrapper osd.0 INFO: Dealing with hung process (pid:4182793)
2022-09-27 14:52:08.499 /etc/init.d/ceph-init-wrapper osd.0 INFO: Increasing log level
2022-09-27 14:52:08.501 /etc/init.d/ceph-init-wrapper osd.0 WARN: Error executing: timeout ceph daemon osd.0 config set debug_osd 20/20 errorcode: 125 output: timeout: invalid time interval ‘ceph’
Try 'timeout --help' for more information.
2022-09-27 14:52:08.502 /etc/init.d/ceph-init-wrapper osd.0 INFO: Dumping stack trace to: /var/log/ceph/hang_trace_osd.0_4182793_2022-09-27_14-52-08.log
2022-09-27 14:52:13.511 /etc/init.d/ceph-init-wrapper osd.0 INFO: Dumping stack trace to: /var/log/ceph/hang_trace_osd.0_4182793_2022-09-27_14-52-13.log
2022-09-27 14:52:18.520 /etc/init.d/ceph-init-wrapper osd.0 INFO: Dumping stack trace to: /var/log/ceph/hang_trace_osd.0_4182793_2022-09-27_14-52-18.log
2022-09-27 14:52:23.527 /etc/init.d/ceph-init-wrapper osd.0 INFO: Trigger core dump
2022-09-27 14:52:26.384 /etc/init.d/ceph-init-wrapper - INFO: Ceph START command received
2022-09-27 14:52:26.385 /etc/init.d/ceph-init-wrapper - INFO: Grab service locks
2022-09-27 14:52:26.387 /etc/init.d/ceph-init-wrapper - INFO: Lock service status
2022-09-27 14:52:26.389 /etc/init.d/ceph-init-wrapper - INFO: Run service action: /etc/init.d/ceph
2022-09-27 14:52:26.408 /etc/init.d/ceph osd.0 WARN: /var/lib/ceph/osd/ceph-0/sysvinit file is missing
2022-09-27 14:52:26.412 /etc/init.d/ceph mgr.controller-0 WARN: /var/lib/ceph/mgr/ceph-controller-0/sysvinit file is missing
2022-09-27 14:52:27.075 /etc/init.d/ceph osd.0 INFO: Process STARTED successfully, waiting for it to become OPERATIONAL
2022-09-27 14:52:27.702 /etc/init.d/ceph-init-wrapper - INFO: Ceph START command finished.
2022-09-27 14:52:59.092 /etc/init.d/ceph mds.controller-0 WARN: Unknown process type: mds
2022-09-27 14:53:00.333 /etc/init.d/ceph osd.0 INFO: osd.0 has the following state: booting
2022-09-27 14:53:32.604 /etc/init.d/ceph mds.controller-0 WARN: Unknown process type: mds

Collect Logs: /folk/cgts_logs/CGTS-39150

Test Activity
-------------
Feature Testing (AIO-SX upgrades)

Workaround
----------
# prevent puppet from wiping ceph-mon-lv
sudo touch /etc/platform/.ceph-mon-lv
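
The flag must be in place before unlocking controller-0 so the puppet manifest leaves ceph-mon-lv intact (the timing is inferred from the reproduction steps above; only the touch command itself comes from this report). A verification sketch after the unlock:

ls -l /etc/platform/.ceph-mon-lv   # confirm the flag survived the unlock
ceph -s                            # mon, mds and osd should now come up healthy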

OpenStack Infra (hudson-openstack) wrote: Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress

OpenStack Infra (hudson-openstack) wrote: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/860008
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/81c8aeca3e27cf35a996a30aa35804f20a3b2286
Submitter: "Zuul (22348)"
Branch: master

commit 81c8aeca3e27cf35a996a30aa35804f20a3b2286
Author: Luiz Felipe Kina <email address hidden>
Date: Fri Sep 30 13:31:10 2022 -0400

    Adding ceph-mon-lv flag during upgrade

    The flag .ceph_mon_lv was being removed during the upgrade from
    version 22.06 CentOS to version 22.12 Debian on AIO-SX, causing
    ceph to become unhealthy. Without the flag, the ceph mon from the
    backup was removed and a new one was created while the osd remained
    the same, resulting in a communication failure.

    TEST PLAN:
    PASS Run upgrade playbook on Debian using "wipe_ceph_osds=false",
    finish the upgrade successfully and verify that ceph is healthy
    PASS Run B&R

    Closes-bug: 1991414

    Signed-off-by: Luiz Felipe Kina <email address hidden>
    Change-Id: Ibd6dd325b6115ed881c099df0979fce17c7bf69a
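
One way to inspect the mon/osd mismatch described in the commit message above (a diagnostic sketch, not taken from the attached logs; the paths assume the default OSD data directory layout and the exact failure mode is an assumption):

ceph fsid                                    # fsid of the recreated mon/cluster
sudo cat /var/lib/ceph/osd/ceph-0/ceph_fsid  # fsid the OSD was created under
ceph auth get osd.0                          # key the new mon holds for osd.0, if any
sudo cat /var/lib/ceph/osd/ceph-0/keyring    # key the OSD actually presents
# a mismatch in either would explain the communication failure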

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Luiz Felipe Kina (leiskeki)
tags: added: stx.8.0 stx.storage stx.update