After upgrade ceph cluster stays unhealthy
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Luiz Felipe Kina |
Bug Description
Brief Description
-----------------
During a 22.06 CentOS to 22.12 Debian upgrade on an AIO-SX deployment, the Ceph cluster stays unhealthy: specifically, the MDS server stays in the creating state and the OSD stays in the down state.
Severity
--------
Critical - ceph cluster unhealthy
Steps to Reproduce
------------------
1. Start upgrade on CentOS 22.06
2. Reinstall Debian 22.12 iso
3. Run upgrade playbook with CentOS backup tgz using "-e wipe_ceph_
4. Unlock controller-0 (before unlocking, ceph is healthy)
5. Run upgrade-activate (at this point ceph is unhealthy)
6. Run upgrade-complete
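The steps above can be sketched as a shell sequence. The playbook path and extra-var names here are assumptions based on typical StarlingX AIO-SX upgrades, and the report truncates the `wipe_ceph_` extra-var, so it is left as a stub; treat this as an illustration, not the verified procedure.

```shell
# Hedged sketch of the reproduction sequence on controller-0, after the
# Debian 22.12 ISO reinstall. Playbook path and variable names are
# assumptions; the wipe_ceph_ extra-var is truncated in the report.
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/upgrade_platform.yml \
    -e "backup_filename=<centos-backup>.tgz" \
    -e "wipe_ceph_..."            # exact variable name truncated in the report

source /etc/platform/openrc
system host-unlock controller-0   # Ceph is still healthy before this point
system upgrade-activate           # Ceph becomes unhealthy at this step
system upgrade-complete
```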
Expected Behavior
------------------
- controller-0 in unlocked / enabled / available state
- Healthy ceph cluster
- No ceph alarms
Actual Behavior
----------------
- controller-0 in unlocked / enabled / available state
- Ceph cluster unhealthy
- Multiple ceph alarms active
Reproducibility
---------------
Reproducible 100%
System Configuration
--------------------
AIO-SX
Branch/Pull Time/Commit
-----------------------
WRCP_Dev / master
Last Pass
---------
N/A
Timestamp/Logs
--------------
ceph -s:
[sysadmin@
cluster:
id: f932764f-
health: HEALTH_WARN
1 MDSs report slow metadata IOs
Reduced data availability: 192 pgs inactive
227 slow ops, oldest one blocked for 69797 sec, mon.controller-0 has slow ops
services:
mon: 1 daemons, quorum controller-0 (age 19h)
mgr: controller-
mds: kube-cephfs:1 {0=controller-
osd: 1 osds: 0 up, 0 in
data:
pools: 3 pools, 192 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
192 unknown
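A quick way to script against this state is to pull the health token out of `ceph -s`. The sketch below runs awk over a captured sample mirroring this report rather than a live cluster; on a real node you would pipe `ceph -s` straight into awk.

```shell
# Extract the health token from `ceph -s` output (sketch).
# A captured sample standing in for the live command; the awk pattern
# prints the second field of the first "health:" line it sees.
sample='  cluster:
    health: HEALTH_WARN'
health=$(printf '%s\n' "$sample" | awk '/health:/ {print $2; exit}')
echo "$health"   # HEALTH_WARN
```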
ceph osd tree:
root@controller
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 7.27449 root storage-tier
-2 7.27449 chassis group-0
-3 7.27449 host controller-0
0 hdd 7.27449 osd.0 down 0 1.00000
ceph health detail:
root@controller
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 192 pgs inactive; 232 slow ops, oldest one blocked for 71132 sec, mon.controller-0 has slow ops
MDS_SLOW_
mds.
PG_AVAILABILITY Reduced data availability: 192 pgs inactive
pg 1.29 is stuck inactive for 71179.441580, current state unknown, last acting []
pg 1.2a is stuck inactive for 71179.441580, current state unknown, last acting []
ceph-process-
2022-09-27 14:52:08.459 /etc/init.d/ceph osd.0 ERROR: Process failed to go up in 300s after start, reporting it as HANGED!
2022-09-27 14:52:08.498 /etc/init.
2022-09-27 14:52:08.499 /etc/init.
2022-09-27 14:52:08.501 /etc/init.
Try 'timeout --help' for more information.
2022-09-27 14:52:08.502 /etc/init.
2022-09-27 14:52:13.511 /etc/init.
2022-09-27 14:52:18.520 /etc/init.
2022-09-27 14:52:23.527 /etc/init.
2022-09-27 14:52:26.384 /etc/init.
2022-09-27 14:52:26.385 /etc/init.
2022-09-27 14:52:26.387 /etc/init.
2022-09-27 14:52:26.389 /etc/init.
2022-09-27 14:52:26.408 /etc/init.d/ceph osd.0 WARN: /var/lib/
2022-09-27 14:52:26.412 /etc/init.d/ceph mgr.controller-0 WARN: /var/lib/
2022-09-27 14:52:27.075 /etc/init.d/ceph osd.0 INFO: Process STARTED successfully, waiting for it to become OPERATIONAL
2022-09-27 14:52:27.702 /etc/init.
2022-09-27 14:52:59.092 /etc/init.d/ceph mds.controller-0 WARN: Unknown process type: mds
2022-09-27 14:53:00.333 /etc/init.d/ceph osd.0 INFO: osd.0 has the following state: booting
2022-09-27 14:53:32.604 /etc/init.d/ceph mds.controller-0 WARN: Unknown process type: mds
Collect Logs: /folk/cgts_
Test Activity
-------------
Feature Testing (AIO-SX upgrades)
Workaround
----------
# don't allow puppet to wipe ceph-mon-lv
sudo touch /etc/platform/
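The flag file tells puppet to skip wiping ceph-mon-lv. The exact filename under /etc/platform/ is truncated in the report, so the sketch below uses a clearly hypothetical stand-in path under /tmp purely to illustrate the flag-file pattern.

```shell
# Hypothetical illustration of the flag-file workaround. FLAG_FILE is a
# stand-in; the real filename under /etc/platform/ is truncated in the
# report. On the real host this would be a sudo touch under /etc/platform/.
FLAG_FILE="${TMPDIR:-/tmp}/etc-platform-demo/keep-ceph-mon-lv"
mkdir -p "$(dirname "$FLAG_FILE")"
touch "$FLAG_FILE"
test -f "$FLAG_FILE" && echo "flag present"
```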
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Luiz Felipe Kina (leiskeki)
tags: added: stx.8.0 stx.storage stx.update
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/860008