Alarm 800.011 Loss of replication in replication group group-0: no OSDs. Can not associate to a rootfs disk

Bug #1951242 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Delfino Gomes Curado Filho
Milestone: none

Bug Description

Brief Description
-----------------
Alarm 800.011 Loss of replication in replication group group-0: no OSDs. Can not associate to a rootfs disk. I found this issue on the Virtual Standard External configuration.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Try to install StarlingX on a Virtual Standard External configuration.

Expected Behavior
------------------
Installation should complete successfully, with no 800.011 alarm raised.

Actual Behavior
----------------
Installation fails with: "Can not associate to a rootfs disk"

Reproducibility
---------------
It happened on Standard External; I will rerun to see whether it is visible again.

System Configuration
--------------------
Virtual Standard external

Branch/Pull Time/Commit
-----------------------
20211117T032111Z

Last Pass
---------
20211116T021917Z

Timestamp/Logs
--------------
Logs will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.6.0 / high priority - sanity issue; appears related to ceph/storage

tags: added: stx.6.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Delfino Gomes Curado Filho (dcuradof)
status: New → Triaged
Revision history for this message
Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alex,

My conclusion from the logs is that storage-1 configured /dev/sdb as the rootfs disk and the OS is installed on it; because of this, it cannot be added as an OSD.

The command executed:
system host-stor-add storage-1 6af0b39f-8226-4ed5-90e9-dd6f0caa9260

In the i_idisk table we can see that this UUID refers to this disk:
/dev/sdb 2064 HDD 256000 QM00005 {"model_num": "QEMU HARDDISK", "stor_function": "rootfs"}

Meanwhile, there is another free disk, whose UUID is 5fbeff41-3205-48d0-a195-b106264c0a71:
/dev/sda 2048 HDD 256000 QM00007 {"model_num": "QEMU HARDDISK"}

Based on this, you need to check how the script selects the disk that will be added as an OSD.

Another approach is to attach the files from /var/log/anaconda on storage-1 here. That way I can get some info on how the installer chose the rootfs disk. Usually it chooses sda.
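
For illustration, a rootfs-aware selection could look roughly like the sketch below. This is hypothetical, not your actual script: the host name, the awk column positions, and the assumption that rootfs_device is reported as a plain node name like "sda" are all mine.

# Hypothetical sketch: skip whichever disk the installer claimed as rootfs,
# instead of assuming a fixed device node.
NODE=storage-1

# The host record carries the disk the installer used for the root
# filesystem (the rootfs_device field of `system host-show`); assumed
# here to print as a plain node name such as "sda".
ROOTFS_DEV=$(system host-show "${NODE}" | awk -F'|' '/rootfs_device/ {gsub(/ /, "", $3); print $3}')

# Pick the first disk whose device node is not the rootfs device;
# column 2 of the host-disk-list table is assumed to be the disk UUID.
DISK_UUID=$(system host-disk-list "${NODE}" | grep '/dev/sd' | grep -v "${ROOTFS_DEV}" | awk -F'|' '{gsub(/ /, "", $2); print $2}' | head -1)

system host-stor-add "${NODE}" "${DISK_UUID}"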

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Hi Delfino,

Checking the documentation at https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/controller_storage_install_kubernetes.html#add-ceph-osds-to-controllers,
I see that OSDs="/dev/sdb".
In our scripts the selected device is also "/dev/sdb".
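
For reference, the flow our scripts follow looks roughly like this (a reconstruction from the docs, not our exact script; the grep/awk details are assumptions about the system CLI table layout), and it is exactly where a rootfs landing on /dev/sdb breaks:

# Hypothetical reconstruction of the doc-style flow, not the actual script.
NODE=storage-1
OSDS="/dev/sdb"

for DEV in ${OSDS}; do
    # Resolve the hardcoded device node to its disk UUID; column 2 of the
    # host-disk-list table is assumed to be the UUID.
    UUID=$(system host-disk-list "${NODE}" | grep "${DEV} " | awk -F'|' '{gsub(/ /, "", $2); print $2}')
    # If the installer put the rootfs on ${DEV}, this call is rejected with
    # "Can not associate to a rootfs disk".
    system host-stor-add "${NODE}" "${UUID}"
done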

Revision history for this message
Flavio Luis Peres (fperes) wrote :

According to Alex (email from Nov 25th), he observed the issue sporadically on VMs.
He asked us to keep this open until he runs the sanity on baremetal servers and checks whether this is reproducible.

Alex, please let us know if you were able to conclude this test.

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Hi Alex,

Re: https://bugs.launchpad.net/starlingx/+bug/1951242, did you have a chance to run the tests on a baremetal server?

Thanks
Flavio

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

During provisioning, whenever something fails before the OSDs are added, we observe this kind of alarm as well. We have started to get green sanities on the virtual environment (baremetal not yet executed), and this issue is not present there. I think we can decrease the importance of this bug, and after checking on the baremetal servers we can probably even close it. I will comment here when I have more results.

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I tested on baremetal Standard and baremetal Standard Ext, and this issue is not visible there. So the conclusion is that whenever something fails before the OSDs are added, we will see this kind of alarm (the command for adding the OSDs is never executed), but we need to focus on the underlying issue and not on this alarm. Once the initial error is fixed, the OSDs are added normally and the alarm is not observed. I think we can now close this issue as "Invalid". Thanks!
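
For anyone hitting this alarm in the future, a quick check before chasing it (standard fm/system CLI commands; storage-1 is just an example host):

# If 800.011 is raised, first check whether any OSDs were ever configured;
# an empty stor list means host-stor-add never ran, so the real failure is
# earlier in provisioning.
fm alarm-list | grep 800.011
system host-stor-list storage-1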

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Thanks for your confirmation, Alex.
I am closing this LP as invalid based on your comments.
Thanks

Changed in starlingx:
status: Triaged → Invalid