Alarm 800.011 Loss of replication in replication group group-0: no OSDs. Can not associate to a rootfs disk

Bug #1951242 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Delfino Gomes Curado Filho
Milestone: none

Bug Description

Brief Description
-----------------
Alarm 800.011 Loss of replication in replication group group-0: no OSDs. Can not associate to a rootfs disk. I found this issue on the Virtual Standard External configuration.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Try to install StarlingX on a Virtual Standard External configuration.

Expected Behavior
------------------
Installation should complete successfully, with no 800.011 alarm raised.

Actual Behavior
----------------
Installation fails with: "Can not associate to a rootfs disk"

Reproducibility
---------------
It happened on Standard External; I will rerun to see whether it is visible again.

System Configuration
--------------------
Virtual Standard external

Branch/Pull Time/Commit
-----------------------
20211117T032111Z

Last Pass
---------
20211116T021917Z

Timestamp/Logs
--------------
Logs will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.6.0 / high priority - sanity issue; appears related to ceph/storage

tags: added: stx.6.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Delfino Gomes Curado Filho (dcuradof)
status: New → Triaged
Revision history for this message
Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alex,

My conclusion from the logs is that storage-1 configured /dev/sdb as the rootfs disk and the OS is installed on it; because of this, it cannot be added as an OSD.

The command executed:
system host-stor-add storage-1 6af0b39f-8226-4ed5-90e9-dd6f0caa9260

In the i_idisk table we can see that this UUID refers to this disk:
/dev/sdb 2064 HDD 256000 QM00005 {"model_num": "QEMU HARDDISK", "stor_function": "rootfs"}

Meanwhile, there is another free disk, whose UUID is 5fbeff41-3205-48d0-a195-b106264c0a71:
/dev/sda 2048 HDD 256000 QM00007 {"model_num": "QEMU HARDDISK"}

Based on this, you need to check how the script selects the disk that will be added as an OSD.

Another approach is to attach the files from /var/log/anaconda on storage-1 here. That way I can get some info on how the installer chose the rootfs disk. Usually it chooses sda.
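
For illustration, a rootfs-aware selection could look roughly like the sketch below. This is hypothetical, not your actual script: the host name, the awk column positions, and the assumption that rootfs_device is reported as a plain node name like "sda" are all mine.

# Hypothetical sketch: skip whichever disk the installer claimed as rootfs,
# instead of assuming a fixed device node.
NODE=storage-1

# The host record carries the disk the installer used for the root
# filesystem (the rootfs_device field of `system host-show`); assumed
# here to print as a plain node name such as "sda".
ROOTFS_DEV=$(system host-show "${NODE}" | awk -F'|' '/rootfs_device/ {gsub(/ /, "", $3); print $3}')

# Pick the first disk whose device node is not the rootfs device;
# column 2 of the host-disk-list table is assumed to be the disk UUID.
DISK_UUID=$(system host-disk-list "${NODE}" | grep '/dev/sd' | grep -v "${ROOTFS_DEV}" | awk -F'|' '{gsub(/ /, "", $2); print $2}' | head -1)

system host-stor-add "${NODE}" "${DISK_UUID}"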

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Hi Delfino,

Checking the documentation at https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/controller_storage_install_kubernetes.html#add-ceph-osds-to-controllers,
I see that OSDs="/dev/sdb".
In our scripts the selected device is also "/dev/sdb".
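
For reference, the flow our scripts follow looks roughly like this (a reconstruction from the docs, not our exact script; the grep/awk details are assumptions about the system CLI table layout), and it is exactly where a rootfs landing on /dev/sdb breaks:

# Hypothetical reconstruction of the doc-style flow, not the actual script.
NODE=storage-1
OSDS="/dev/sdb"

for DEV in ${OSDS}; do
    # Resolve the hardcoded device node to its disk UUID; column 2 of the
    # host-disk-list table is assumed to be the UUID.
    UUID=$(system host-disk-list "${NODE}" | grep "${DEV} " | awk -F'|' '{gsub(/ /, "", $2); print $2}')
    # If the installer put the rootfs on ${DEV}, this call is rejected with
    # "Can not associate to a rootfs disk".
    system host-stor-add "${NODE}" "${UUID}"
done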

Revision history for this message
Flavio Luis Peres (fperes) wrote :

According to Alex (email from Nov 25th), he observed the issue sporadically on VMs.
He asked us to keep this open until he runs the sanity on baremetal servers and checks whether this is reproducible.

Alex, please let us know if you were able to conclude this test.

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Hi Alex,

Re: https://bugs.launchpad.net/starlingx/+bug/1951242, did you have a chance to run the tests on a baremetal server?

Thanks
Flavio

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

During provisioning, whenever something fails before the OSDs are added, we observe this kind of alarm as well. We have started to get green sanities on the virtual environment (baremetal not yet executed), and this issue is not present there. I think we can decrease the importance of this bug, and after checking on the baremetal servers we can probably even close it. I will comment here when I have more results.

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I tested on baremetal Standard and baremetal Standard Ext, and this issue is not visible there. So the conclusion is that whenever something fails before the OSDs are added, we will see this kind of alarm (the command for adding the OSDs is never executed), but we need to focus on the underlying issue and not on this alarm. Once the initial error is fixed, the OSDs are added normally and the alarm is not observed. I think we can now close this issue as "Invalid". Thanks!
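
For anyone hitting this alarm in the future, a quick check before chasing it (standard fm/system CLI commands; storage-1 is just an example host):

# If 800.011 is raised, first check whether any OSDs were ever configured;
# an empty stor list means host-stor-add never ran, so the real failure is
# earlier in provisioning.
fm alarm-list | grep 800.011
system host-stor-list storage-1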

Revision history for this message
Flavio Luis Peres (fperes) wrote :

Thanks for your confirmation, Alex.
I am closing this LP as invalid based on your comments.
Thanks

Changed in starlingx:
status: Triaged → Invalid