Alarm 800.010 Potential data loss. No available OSDs in storage replication group group-0

Bug #1942480 reported by Nicolae Jascanu
This bug affects 2 people
Affects:     StarlingX
Status:      Fix Released
Importance:  Critical
Assigned to: Jia Hu

Bug Description

Brief Description
-----------------
After provisioning Virtual STANDARD and Virtual STANDARD-EXTERNAL systems, alarm 800.010 is raised with CRITICAL severity.

Severity
--------
Critical
Steps to Reproduce
------------------
Provision a Virtual STANDARD or Virtual STANDARD-EXTERNAL system

Expected Behavior
------------------
No alarms

Actual Behavior
----------------
============= VIRTUAL STANDARD
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=2b7074f0-8eda-4072-ab09-7479a77dd42a | warning | 2021-09-02T05:49:33.848985
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=2b7074f0-8eda-4072-ab09-7479a77dd42a.peergroup=group-0 | critical | 2021-09-02T05:44:31.033302
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 2b7074f0-8eda-4072-ab09-7479a77dd42a
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 192 pgs inactive

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-0(active), standbys: controller-1
    mds: kube-cephfs-1/1/1 up {0=controller-1=up:creating}, 2 up:standby
    osd: 0 osds: 0 up, 0 in

  data:
    pools: 3 pools, 192 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             192 unknown

================= VIRTUAL STANDARD-EXTERNAL
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.001 | Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details. | cluster=720df95d-9587-4693-8ed3-09bae55efc59 | warning | 2021-09-02T12:26:32.188827
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=720df95d-9587-4693-8ed3-09bae55efc59.peergroup=group-0 | critical | 2021-09-02T12:20:28.349292
200.001 | storage-1 was administratively locked to take it out-of-service. | host=storage-1 | warning | 2021-09-02T11:57:12.267591
200.001 | storage-0 was administratively locked to take it out-of-service. | host=storage-0 | warning | 2021-09-02T11:57:06.516257
250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2021-09-02T11:56:47.769518
250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2021-09-02T11:56:47.065148
200.001 | compute-1 was administratively locked to take it out-of-service. | host=compute-1 | warning | 2021-09-02T11:46:23.538292
200.001 | compute-0 was administratively locked to take it out-of-service. | host=compute-0 | warning | 2021-09-02T11:46:17.311249
[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id: 720df95d-9587-4693-8ed3-09bae55efc59
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 192 pgs inactive

  services:
    mon: 2 daemons, quorum controller-0,controller-1
    mgr: controller-0(active), standbys: controller-1
    mds: kube-cephfs-1/1/1 up {0=controller-1=up:creating}, 1 up:standby
    osd: 0 osds: 0 up, 0 in

  data:
    pools: 3 pools, 192 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             192 unknown

Reproducibility
---------------
Reproducible

System Configuration
--------------------
OS="centos"
SW_VERSION="21.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20210902T013654Z"

JOB="STX_build_layer_flock_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="600"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2021-09-02 01:36:54 +0000"

FLOCK_OS="centos"
FLOCK_JOB="STX_build_layer_flock_master_master"
FLOCK_BUILD_BY="<email address hidden>"
FLOCK_BUILD_NUMBER="600"
FLOCK_BUILD_HOST="starlingx_mirror"
FLOCK_BUILD_DATE="2021-09-02 01:36:54 +0000"

DISTRO_OS="centos"
DISTRO_JOB="STX_build_layer_distro_master_master"
DISTRO_BUILD_BY="<email address hidden>"
DISTRO_BUILD_NUMBER="608"
DISTRO_BUILD_HOST="starlingx_mirror"
DISTRO_BUILD_DATE="2021-08-27 01:33:10 +0000"

COMPILER_OS="centos"
COMPILER_JOB="STX_build_layer_compiler_master_master"
COMPILER_BUILD_BY="<email address hidden>"
COMPILER_BUILD_NUMBER="661"
COMPILER_BUILD_HOST="starlingx_mirror"
COMPILER_BUILD_DATE="2021-08-17 19:28:42 +0000"

Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Screening: stx.6.0 / high - results in a sanity issue

Changed in starlingx:
assignee: nobody → Mauricio Biasi do Monte Carmelo (mbiasido)
status: New → Triaged
importance: Medium → High
Revision history for this message
Mauricio Biasi do Monte Carmelo (mbiasido) wrote :

Hello Nicolae,

Are you still facing this problem? From the logs referenced in the bug description, this looks like standard behavior: both the storage hosts and the compute hosts are administratively locked, and while they are locked, alarm 800.010 is expected to appear in `fm alarm-list`.

Can you please let me know the steps that were performed when you collected these logs? Also, please try to unlock these hosts and check whether the issue persists; a sample sequence is sketched below.
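
For reference, a minimal unlock-and-recheck sequence would look something like the following (host names are taken from the alarm list above; the exact set of locked hosts may differ in your environment):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock storage-0
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock storage-1
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-0
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-1
# wait for the hosts to become unlocked/enabled/available, then re-check:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list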

Best regards,
Mauricio

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: raising the priority since this is causing red sanities for stx.6.0

Changed in starlingx:
importance: High → Critical
Revision history for this message
Flavio Luis Peres (fperes) wrote (last edit):

Hi Nicolae,

Did you have a chance to take a look at the comments from Mauricio?

Can you please let me know the steps that were performed when you collected these logs? Also, please try to unlock these hosts and check if the issue continues.

Thanks

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

The image used is: 20211201T041648Z
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
Alarm ID | Reason Text | Entity ID | Severity | Time Stamp
800.010 | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=07c0ce0c-9b00-4de0-a8ee-13da4debf43c.peergroup=group-0 | critical | 2021-12-01T08:36:39.009657
800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=07c0ce0c-9b00-4de0-a8ee-13da4debf43c | warning | 2021-12-01T08:14:14.949030
250.001 | compute-1 Configuration is out-of-date. | host=compute-1 | major | 2021-12-01T08:07:16.063014
250.001 | compute-0 Configuration is out-of-date. | host=compute-0 | major | 2021-12-01T08:07:13.642871
200.001 | compute-1 was administratively locked to take it out-of-service. | host=compute-1 | warning | 2021-12-01T07:56:59.093516
200.001 | compute-0 was ad... (output truncated)


Revision history for this message
Mauricio Biasi do Monte Carmelo (mbiasido) wrote :

Hello Alex,

From the logs in /var/log/bash.log, we noticed that the commands to add the OSDs were never executed after the ceph backend was initialized.

Command to initialize the backend (this was executed):

2021-12-01T07:17:01.000 localhost -sh: info HISTORY: PID=159004 UID=42425 system storage-backend-add ceph --confirmed

Commands to add an OSD (these were not executed); see the example after the documentation link below.

Please refer to the section "Configure Controller-0" in the following documentation:

https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html
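
For reference, the OSD provisioning step from that guide looks roughly like the following (the target disk, /dev/sdb here, is an assumption and is environment-specific):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-disk-list controller-0
# add the chosen disk as an OSD, then verify it is configured:
[sysadmin@controller-0 ~(keystone_admin)]$ system host-disk-list controller-0 | awk '/\/dev\/sdb/{print $2}' | xargs -i system host-stor-add controller-0 {}
[sysadmin@controller-0 ~(keystone_admin)]$ system host-stor-list controller-0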

Also, if you prefer, please let us know your availability so I can schedule a meeting to discuss this.

Best regards

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

This is happening because the automation fails before adding the OSDs. In this case it fails at the "Setup partitions" step:
20211201 09:35:42.816 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
20211201 09:35:42.824 - INFO - system host-pv-add compute-0 nova-local 94644555-fec2-44d8-aa97-79f48a630687
20211201 09:35:42.824 - INFO - +------- END KW: SSHLibrary.Write (8)
20211201 09:35:42.825 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20211201 09:35:43.870 - INFO - type object 'ilvg' has no attribute 'isdigit'
[sysadmin@controller-0 ~(keystone_admin)]$
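
For context, that command comes from the documented nova-local setup for worker nodes, roughly as follows (the UUID is the physical volume reported in the log above, and the volume group must be created first):

[sysadmin@controller-0 ~(keystone_admin)]$ system host-lvg-add compute-0 nova-local
[sysadmin@controller-0 ~(keystone_admin)]$ system host-pv-add compute-0 nova-local 94644555-fec2-44d8-aa97-79f48a630687

The local volume group argument ("nova-local" here) is resolved client-side by _find_ilvg, which is where the 'isdigit' error above is raised.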

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I think the problem is coming from this commit:
./cgcs-root/stx/config 64a1944a66a04e09e4515acba0f072accb51af06 2021-11-25 19:19:03 +0000 Gerrit Code Review <email address hidden> Merge "Cleanup pylint error: redefined-outer-name"

Revision history for this message
Thiago Paiva Brito (outbrito) wrote :

It looks like this is the same bug as https://launchpad.net/bugs/1952400

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Thanks Alex,

This is the part of the code that seems to be causing the issue. The "redefined-outer-name" pylint cleanup renamed the parameter from ilvg to ilvg_id, but the isdigit() check still references ilvg, which now resolves to the imported ilvg module rather than the parameter:

def _find_ilvg(cc, ihost, ilvg_id):
    # BUG: 'ilvg' below is the imported module, not the parameter, hence
    # "type object 'ilvg' has no attribute 'isdigit'"
    if ilvg.isdigit():
        try:
            lvg = cc.ilvg.get(ilvg_id)
        except exc.HTTPNotFound:
            raise exc.CommandError('Local volume group not found by id: %s'
                                   % ilvg_id)

./cgcs-root/stx/config 64a1944a66a04e09e4515acba0f072accb51af06 2021-11-25 19:19:03 +0000 Gerrit Code Review <email address hidden> Merge "Cleanup pylint error: redefined-outer-name"

I am assigning this LP to Jia Hu to fix it.

Changed in starlingx:
assignee: Mauricio Biasi do Monte Carmelo (mbiasido) → nobody
assignee: nobody → Flavio Luis Peres (fperes)
assignee: Flavio Luis Peres (fperes) → nobody
Revision history for this message
Ghada Khalil (gkhalil) wrote :
Changed in starlingx:
assignee: nobody → Jia Hu (jhu5)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Jia Hu <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/820214

Revision history for this message
Al Bailey (albailey1974) wrote :

We ended up merging this one instead:
https://review.opendev.org/c/starlingx/config/+/820216

We are asking Jia to abandon her change.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

I expect the original issue reported here is unrelated to the 'ilvg' has no attribute 'isdigit' errors, as this LP predates the commit that introduced that issue. That said, the 'ilvg' issue is now addressed as per the notes above.

@Alex, please let us know if this gets us over the sanity issue with the 2021-12-03 build.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Not reported in the latest sanity. Marking as Fix Released.
http://lists.starlingx.io/pipermail/starlingx-discuss/2021-December/012482.html

Changed in starlingx:
status: In Progress → Fix Released