Alarm 800.010 Potential data loss. No available OSDs in storage replication group group-0

Bug #1942480 reported by Nicolae Jascanu
This bug affects 2 people
Affects:     StarlingX
Status:      Fix Released
Importance:  Critical
Assigned to: Jia Hu

Bug Description

Brief Description
-----------------
After provisioning Virtual STANDARD and Virtual STANDARD-EXTERNAL systems, alarm 800.010 is raised with CRITICAL severity.

Severity
--------
Critical
Steps to Reproduce
------------------
Provision a Virtual STANDARD or Virtual STANDARD-EXTERNAL system

Expected Behavior
------------------
No alarms

Actual Behavior
----------------
============= VIRTUAL STANDARD
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=2b7074f0-8eda-4072-ab09-7479a77dd42a | warning | 2021-09-02T05:49:33.848985
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=2b7074f0-8eda-4072-ab09-7479a77dd42a.peergroup=group-0 | critical | 2021-09-02T05:44:31.033302
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 2b7074f0-8eda-4072-ab09-7479a77dd42a
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 192 pgs inactive

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-0(active), standbys: controller-1
    mds: kube-cephfs-1/1/1 up {0=controller-1=up:creating}, 2 up:standby
    osd: 0 osds: 0 up, 0 in

  data:
    pools: 3 pools, 192 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             192 unknown

================= VIRTUAL STANDARD-EXTERNAL
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=720df95d-9587-4693-8ed3-09bae55efc59 | warning | 2021-09-02T12:26:32.188827
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=720df95d-9587-4693-8ed3-09bae55efc59.peergroup=group-0 | critical | 2021-09-02T12:20:28.349292
200.001 | storage-1 was administratively locked to take it out-of-service. | host=storage-1 | warning | 2021-09-02T11:57:12.267591
200.001 | storage-0 was administratively locked to take it out-of-service. | host=storage-0 | warning | 2021-09-02T11:57:06.516257
250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2021-09-02T11:56:47.769518
250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2021-09-02T11:56:47.065148
200.001 | compute-1 was administratively locked to take it out-of-service. | host=compute-1 | warning | 2021-09-02T11:46:23.538292
200.001 | compute-0 was administratively locked to take it out-of-service. | host=compute-0 | warning | 2021-09-02T11:46:17.311249
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 720df95d-9587-4693-8ed3-09bae55efc59
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 192 pgs inactive

  services:
    mon: 2 daemons, quorum controller-0,controller-1
    mgr: controller-0(active), standbys: controller-1
    mds: kube-cephfs-1/1/1 up {0=controller-1=up:creating}, 1 up:standby
    osd: 0 osds: 0 up, 0 in

  data:
    pools: 3 pools, 192 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             192 unknown

Reproducibility
---------------
Reproducible

System Configuration
--------------------
OS="centos"
SW_VERSION="21.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20210902T013654Z"

JOB="STX_build_layer_flock_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="600"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-09-02 01:36:54 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_build_layer_flock_master_master"
FLOCK_BUILD_BY="<email address hidden>"
FLOCK_BUILD_NUMBER="600"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2021-09-02 01:36:54 +0000"

DISTRO_OS="centos"
DISTRO_JOB="STX_build_layer_distro_master_master"
DISTRO_BUILD_BY="<email address hidden>"
DISTRO_BUILD_NUMBER="608"
DISTRO_BUILD_HOST="starlingx_mirror"
DISTRO_BUILD_DATE="2021-08-27 01:33:10 +0000"

COMPILER_OS="centos"
COMPILER_JOB="STX_build_layer_compiler_master_master"
COMPILER_BUILD_BY="<email address hidden>"
COMPILER_BUILD_NUMBER="661"
COMPILER_BUILD_HOST="starlingx_mirror"
COMPILER_BUILD_DATE="2021-08-17 19:28:42 +0000"

Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Screening: stx.6.0 / high - results in a sanity issue

Changed in starlingx:
assignee: nobody → Mauricio Biasi do Monte Carmelo (mbiasido)
status: New → Triaged
importance: Medium → High
Revision history for this message
Mauricio Biasi do Monte Carmelo (mbiasido) wrote :

Hello Nicolae,

Are you still facing this problem? From the logs referenced in the bug description, this looks like standard behavior: both the storage hosts and the compute hosts are administratively locked, and while they are locked, alarm 800.010 is expected to appear in `fm alarm-list`.

Can you please let me know the steps that were performed when you collected these logs? Also, please try to unlock these hosts and check whether the issue persists; a sample sequence is sketched below.
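
For reference, a minimal unlock-and-recheck sequence would look something like the following (host names are taken from the alarm list above; the exact set of locked hosts may differ in your environment):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock storage-0
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock storage-1
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-0
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-1
# wait for the hosts to become unlocked/enabled/available, then re-check:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list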

Best regards,
Mauricio

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: raising the priority since this is causing red sanities for stx.6.0

Changed in starlingx:
importance: High → Critical
Revision history for this message
Flavio Luis Peres (fperes) wrote (last edit):

Hi Nicolae,

Did you have a chance to take a look at the comments from Mauricio?

Can you please let me know the steps that were performed when you collected these logs? Also, please try to unlock these hosts and check if the issue continues.

Thanks

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

The image used is: 20211201T041648Z
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=07c0ce0c-9b00-4de0-a8ee-13da4debf43c.peergroup=group-0 | critical | 2021-12-01T08:36:39.009657
800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=07c0ce0c-9b00-4de0-a8ee-13da4debf43c | warning | 2021-12-01T08:14:14.949030
250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2021-12-01T08:07:16.063014
250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2021-12-01T08:07:13.642871
200.001 | compute-1 was administratively locked to take it out-of-service. | host=compute-1 | warning | 2021-12-01T07:56:59.093516
200.001 | compute-0 was ad... (output truncated)


Revision history for this message
Mauricio Biasi do Monte Carmelo (mbiasido) wrote :

Hello Alex,

From the logs in /var/log/bash.log, we noticed that the commands to add the OSDs were never executed after the ceph backend was initialized.

Command to initialize the backend (this was executed):

2021-12-01T07:17:01.000 localhost -sh: info HISTORY: PID=159004 UID=42425 system storage-backend-add ceph --confirmed

Commands to add an OSD (these were not executed); see the example after the documentation link below.

Please refer to the section "Configure Controller-0" in the following documentation:

https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html
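
For reference, the OSD provisioning step from that guide looks roughly like the following (the target disk, /dev/sdb here, is an assumption and is environment-specific):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-disk-list controller-0
# add the chosen disk as an OSD, then verify it is configured:
[sysadmin@controller-0 ~(keystone_admin)]$ system host-disk-list controller-0 | awk '/\/dev\/sdb/{print $2}' | xargs -i system host-stor-add controller-0 {}
[sysadmin@controller-0 ~(keystone_admin)]$ system host-stor-list controller-0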

Also, if you prefer, please let us know your availability so I can schedule a meeting to discuss this.

Best regards

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

This is happening because the automation fails before adding the OSDs. In this case it fails at the "Setup partitions" step:
20211201 09:35:42.816 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
20211201 09:35:42.824 - INFO - system host-pv-add compute-0 nova-local 94644555-fec2-44d8-aa97-79f48a630687
20211201 09:35:42.824 - INFO - +------- END KW: SSHLibrary.Write (8)
20211201 09:35:42.825 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20211201 09:35:43.870 - INFO - type object 'ilvg' has no attribute 'isdigit'
[sysadmin@controller-0 ~(keystone_admin)]$
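
For context, that command comes from the documented nova-local setup for worker nodes, roughly as follows (the UUID is the physical volume reported in the log above, and the volume group must be created first):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-lvg-add compute-0 nova-local
[sysadmin@controller-0 ~(keystone_admin)]$ system host-pv-add compute-0 nova-local 94644555-fec2-44d8-aa97-79f48a630687

The local volume group argument ("nova-local" here) is resolved client-side by _find_ilvg, which is where the 'isdigit' error above is raised.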

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I think the problem is coming from this commit:
./cgcs-root/stx/config 64a1944a66a04e09e4515acba0f072accb51af06 2021-11-25 19:19:03 +0000 Gerrit Code Review <email address hidden> Merge "Cleanup pylint error: redefined-outer-name"

Revision history for this message
Thiago Paiva Brito (outbrito) wrote :

It looks like this is the same bug as https://launchpad.net/bugs/1952400

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Thanks Alex,

This is the part of the code that seems to be causing the issue. The "redefined-outer-name" pylint cleanup renamed the parameter from ilvg to ilvg_id, but the isdigit() check still references ilvg, which now resolves to the imported ilvg module rather than the parameter:

def _find_ilvg(cc, ihost, ilvg_id):
    # BUG: 'ilvg' below is the imported module, not the parameter, hence
    # "type object 'ilvg' has no attribute 'isdigit'"
    if ilvg.isdigit():
        try:
            lvg = cc.ilvg.get(ilvg_id)
        except exc.HTTPNotFound:
            raise exc.CommandError('Local volume group not found by id: %s'
                                   % ilvg_id)

./cgcs-root/stx/config 64a1944a66a04e09e4515acba0f072accb51af06 2021-11-25 19:19:03 +0000 Gerrit Code Review <email address hidden> Merge "Cleanup pylint error: redefined-outer-name"

I am assigning this LP to Jia Hu to fix it.

Changed in starlingx:
assignee: Mauricio Biasi do Monte Carmelo (mbiasido) → nobody
assignee: nobody → Flavio Luis Peres (fperes)
assignee: Flavio Luis Peres (fperes) → nobody
Revision history for this message
Ghada Khalil (gkhalil) wrote :
Changed in starlingx:
assignee: nobody → Jia Hu (jhu5)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Jia Hu <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

Revision history for this message
Al Bailey (albailey1974) wrote :

We ended up merging this one instead:
https://review.opendev.org/c/starlingx/config/+/820216

We are asking Jia to abandon her change.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

I expect the original issue reported here is unrelated to the 'ilvg' has no attribute 'isdigit' errors, as this LP predates the commit that introduced that issue. That said, the 'ilvg' issue is now addressed as per the notes above.

@Alex, please let us know if this gets us over the sanity issue with the 2021-12-03 build.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Not reported in the latest sanity. Marking as Fix Released.
http://lists.starlingx.io/pipermail/starlingx-discuss/2021-December/012482.html

Changed in starlingx:
status: In Progress → Fix Released