platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remained in "uploaded" state

Bug #1949771 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Delfino Gomes Curado Filho

Bug Description

Brief Description
-----------------
While installing StarlingX, the platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remained in the "uploaded" state. compute-0 and compute-1 are locked and online (not available).

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
install stx image: 20211104T061458Z

Expected Behavior
------------------
Installation should complete successfully.

Actual Behavior
----------------
StarlingX installation fails with a timeout because the platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remain in the "uploaded" state.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations

Branch/Pull Time/Commit
-----------------------
20211104T061458Z

Last Pass
---------
20211027T024800Z

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Ghada Khalil (gkhalil) wrote :

The storage team will be looking at this as it appears similar to https://bugs.launchpad.net/starlingx/+bug/1949360 and may be related to the ceph up-version.

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Felipe Sanches Zanoni (fsanches)
tags: added: stx.6.0 stx.apps stx.storage
Changed in starlingx:
importance: High → Critical
Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alexandru,

The apps oidc-auth-apps and rook-ceph-apps will only be applied if "system application-apply" is used.
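
For reference, a minimal sketch of the manual apply step (application names taken from the application-list output later in this thread; whether rook-ceph-apps should actually be applied depends on the configured storage backend):

system application-apply oidc-auth-apps    # triggers the apply of an uploaded application
system application-list                    # status should move from "uploaded" to "applying"/"applied"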

Regarding platform-integ-apps, what I can see from the logs is this:

Ceph was added as storage backend:

(/var/log/bash.log) 2021-11-04T09:16:54.000 localhost -sh: info HISTORY: PID=155566 UID=42425 system storage-backend-add ceph --confirmed

This can be confirmed on the database
(/var/extra/database/sysinv.db.sql.txt)
2021-11-04 09:16:55.573516 \N \N 1 b1927830-fc47-4498-9d6d-11ca0b72a623 ceph configured provision-storage 1 \N {"min_replication": "1", "replication": "2"} ceph-store

After that I can't find any disk being added with the command "system host-stor-add" (https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/controller_storage_install_kubernetes.html#add-ceph-osds-to-controllers)
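
For context, roughly the documented sequence from the linked guide (a sketch; <disk_uuid> is a placeholder for the UUID of the non-rootfs disk):

# list the disks on the controller and pick the spare disk (e.g. /dev/sdb)
system host-disk-list controller-0
# add that disk as an OSD
system host-stor-add controller-0 <disk_uuid>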

This can be confirmed on the database by the empty i_istor table

(/var/extra/database/sysinv.db.sql.txt)
-- Data for Name: i_istor; Type: TABLE DATA; Schema: public; Owner: admin-sysinv
--

COPY i_istor (created_at, updated_at, deleted_at, id, uuid, osdid, idisk_uuid, state, function, capabilities, forihostid, fortierid) FROM stdin;
\.

--
-- Name: i_istor_id_seq; Type: SEQUENCE SET; Schema: public; Owner: admin-sysinv
--

SELECT pg_catalog.setval('i_istor_id_seq', 1, false);
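
On a live system the same check can presumably be made directly against the sysinv database; the psql invocation below is an assumption about how the database is reached, while the table and column names come from the dump above:

# expect zero rows while no OSDs have been added
sudo -u postgres psql -d sysinv -c "SELECT osdid, idisk_uuid, state FROM i_istor;"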

This can also be seen on /var/log/ceph/ceph.log
2021-11-04 09:53:24.267525 mon.controller-0 (mon.0) 50 : cluster [WRN] Health check failed: OSD count 0 < osd_pool_default_size 2 (TOO_FEW_OSDS)

I understand that this is something that was working before, so can you double check if the command to add /dev/sdb as an OSD is being executed?
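
Two quick ways to double-check whether the OSD was actually configured (a sketch; the expected outputs are assumptions):

system host-stor-list controller-0   # should list an OSD backed by /dev/sdb if host-stor-add ran
ceph osd tree                        # should show the OSD as up/in once it is configured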

Changed in starlingx:
assignee: Felipe Sanches Zanoni (fsanches) → Delfino Gomes Curado Filho (dcuradof)
Alexandru Dimofte (adimofte) wrote :

Hi Delfino,

The "Ceph OSDs add to controllers" logic is implemented on our scripts. It was not executed and from experience I remember this happened in the past because something else failed before that step.. For the moment I can add that I also see the alarm: "Alarm 800.010 Potential data loss. No available OSDs in storage replication group group-0" check bug: https://bugs.launchpad.net/starlingx/+bug/1942480 .
Trying to add the OSDs manually I see:
[sysadmin@controller-0 ~(keystone_admin)]$ echo "$DISKS" | grep "$OSD"
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | rpm | serial_id          | device_path                                |
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
| e00b997e-ce7a-43af-a580-e59bd5361959 | /dev/sda    | 2048       | SSD         | 476.939  | 476.937       | N/A | BTLA809407ME512K   | /dev/disk/by-path/pci-0000:00:11.5-ata-2.0 |
| b7f1d649-ddfd-40b0-8969-1c32128e21b1 | /dev/sdb    | 2064       | SSD         | 894.252  | 0.0           | N/A | PHYF004501SK960CGN | /dev/disk/by-path/pci-0000:00:17.0-ata-1.0 |
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ echo "$DISKS" | grep "/dev/sdb" | awk '{print $2}'
b7f1d649-ddfd-40b0-8969-1c32128e21b1
[sysadmin@controller-0 ~(keystone_admin)]$ TIERS=$(system storage-tier-list ceph_cluster)
[sysadmin@controller-0 ~(keystone_admin)]$ system host-stor-add controller-0 $(echo "$DISKS" | grep "/dev/sdb" | awk '{print $2}') --tier-uuid $(echo "$TIERS" | grep storage | awk '{print $2}')
Can not associate to a rootfs disk

Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alexandru,

I'm assuming these commands were executed on a different installation as I can't find these disks and their respective UUIDs in the collected logs.

With this in mind, the only thing I can add is that this error message appears when host-stor-add tries to add the disk that the system is installed on as an OSD.
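
A quick way to confirm which disk is the installation disk before running host-stor-add (a sketch; the rootfs_device/boot_device field names are assumptions about the host-show output):

# the rootfs/boot device cannot be used as an OSD; pick a different disk from host-disk-list
system host-show controller-0 | grep -E 'rootfs_device|boot_device'
system host-disk-list controller-0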

For a more detailed analysis, please collect the logs for this new installation and attach them to this launchpad.

Alexandru Dimofte (adimofte) wrote :

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
| cert-manager | 1.0-25 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.1-17 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-59 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-42 | platform-integration-manifest | manifest.yaml | uploaded | completed |
| rook-ceph-apps | 1.0-11 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list

+----------+---------------------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                                           | Entity ID                           | Severity | Time Stamp                 |
+----------+---------------------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.               | host=controller-1                   | major    | 2021-11-14T14:12:24.181896 |
| 100.114  | NTP address 192.168.100.1 is not a valid or a reachable NTP server.                  | host=controller-1=192.168.100.1     | minor    | 2021-11-14T13:02:24.178856 |
| 800.010  | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=96d61c01-a7a4-4340-9dad-... | critical | 2021-11-14T12:56:...       |
| ...


Alexandru Dimofte (adimofte) wrote :

I attached two logs from two different builds: debug_20211114T023708Z.log (the failing build) and debug_20211027T024800Z.log (the working build). You can observe that, before running "system host-stor-add ...", the compute nodes are not in the available state, and the command is therefore not executed.
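
For reference, a simple way to see the availability state the script is waiting on (a sketch):

system host-list    # shows the administrative / operational / availability state of each host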

Delfino Gomes Curado Filho (dcuradof) wrote :

Analyzing the debug logs that you sent, I see that after the configuration of compute-0 the script waits for platform-integ-apps to be applied.

What happens is that Ceph 14 now has a health check called TOO_FEW_OSDS. This warning is raised while the cluster does not have at least the default number of OSDs configured, leaving the cluster in a HEALTH_WARN state.

Because of this warning, Sysinv does not execute the auto-apply of platform-integ-apps, keeping it in the "uploaded" state.
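
A hedged sketch of how to observe this condition on an affected system (the warning text is taken from the ceph.log excerpt earlier in this thread):

ceph health detail                            # reports HEALTH_WARN with TOO_FEW_OSDS while OSD count < osd_pool_default_size
system application-show platform-integ-apps   # remains "uploaded" until Sysinv triggers the auto-apply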

I don't see a reason to wait for this apply at this step. From my understanding, this wait can be removed, as there is already another wait for platform-integ-apps before the stx-openstack upload/apply and after the OSDs are added to the cluster.

Because of that, I suggest changing the script that configures StarlingX.

Alexandru Dimofte (adimofte) wrote :

I created and tested a commit reverting b5b362c759cd40997eab5dfed45b8b34a38b3b5e: https://review.opendev.org/c/starlingx/test/+/818131
The issue was NOT observed again during yesterday's test (I don't remember whether the older issue was sporadic or not).

Ghada Khalil (gkhalil) wrote (last edit):

@Alex, based on your comment, can we mark this LP as Invalid, given this is not a software issue?

Changed in starlingx:
status: Triaged → Invalid