platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remained in "uploaded" state

Bug #1949771 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Delfino Gomes Curado Filho

Bug Description

Brief Description
-----------------
While installing StarlingX, the platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remained in the "uploaded" state. compute-0 and compute-1 are locked and online (not available).

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
install stx image: 20211104T061458Z

Expected Behavior
------------------
Installation should complete successfully.

Actual Behavior
----------------
StarlingX installation fails with a timeout because the platform-integ-apps, oidc-auth-apps and rook-ceph-apps applications remain in the "uploaded" state.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations

Branch/Pull Time/Commit
-----------------------
20211104T061458Z

Last Pass
---------
20211027T024800Z

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Ghada Khalil (gkhalil) wrote :

The storage team will be looking at this as it appears similar to https://bugs.launchpad.net/starlingx/+bug/1949360 and may be related to the ceph up-version.

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Felipe Sanches Zanoni (fsanches)
tags: added: stx.6.0 stx.apps stx.storage
Changed in starlingx:
importance: High → Critical
Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alexandru,

The apps oidc-auth-apps and rook-ceph-apps will only be applied if "system application-apply" is used.
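
For reference, a minimal sketch of the manual apply step (application names taken from the application-list output later in this thread; whether rook-ceph-apps should actually be applied depends on the configured storage backend):

system application-apply oidc-auth-apps    # triggers the apply of an uploaded application
system application-list                    # status should move from "uploaded" to "applying"/"applied"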

Regarding platform-integ-apps, what I can see from the logs is this:

Ceph was added as storage backend:

(/var/log/bash.log) 2021-11-04T09:16:54.000 localhost -sh: info HISTORY: PID=155566 UID=42425 system storage-backend-add ceph --confirmed

This can be confirmed on the database
(/var/extra/database/sysinv.db.sql.txt)
2021-11-04 09:16:55.573516 \N \N 1 b1927830-fc47-4498-9d6d-11ca0b72a623 ceph configured provision-storage 1 \N {"min_replication": "1", "replication": "2"} ceph-store

After that I can't find any disk being added with the command "system host-stor-add" (https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/controller_storage_install_kubernetes.html#add-ceph-osds-to-controllers)
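
For context, roughly the documented sequence from the linked guide (a sketch; <disk_uuid> is a placeholder for the UUID of the non-rootfs disk):

# list the disks on the controller and pick the spare disk (e.g. /dev/sdb)
system host-disk-list controller-0
# add that disk as an OSD
system host-stor-add controller-0 <disk_uuid>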

This can be confirmed on the database by the empty i_istor table

(/var/extra/database/sysinv.db.sql.txt)
-- Data for Name: i_istor; Type: TABLE DATA; Schema: public; Owner: admin-sysinv
--

COPY i_istor (created_at, updated_at, deleted_at, id, uuid, osdid, idisk_uuid, state, function, capabilities, forihostid, fortierid) FROM stdin;
\.

--
-- Name: i_istor_id_seq; Type: SEQUENCE SET; Schema: public; Owner: admin-sysinv
--

SELECT pg_catalog.setval('i_istor_id_seq', 1, false);
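
On a live system the same check can presumably be made directly against the sysinv database; the psql invocation below is an assumption about how the database is reached, while the table and column names come from the dump above:

# expect zero rows while no OSDs have been added
sudo -u postgres psql -d sysinv -c "SELECT osdid, idisk_uuid, state FROM i_istor;"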

This can also be seen on /var/log/ceph/ceph.log
2021-11-04 09:53:24.267525 mon.controller-0 (mon.0) 50 : cluster [WRN] Health check failed: OSD count 0 < osd_pool_default_size 2 (TOO_FEW_OSDS)

I understand that this is something that was working before, so can you double check if the command to add /dev/sdb as an OSD is being executed?
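
Two quick ways to double-check whether the OSD was actually configured (a sketch; the expected outputs are assumptions):

system host-stor-list controller-0   # should list an OSD backed by /dev/sdb if host-stor-add ran
ceph osd tree                        # should show the OSD as up/in once it is configured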

Changed in starlingx:
assignee: Felipe Sanches Zanoni (fsanches) → Delfino Gomes Curado Filho (dcuradof)
Alexandru Dimofte (adimofte) wrote :

Hi Delfino,

The "Ceph OSDs add to controllers" logic is implemented on our scripts. It was not executed and from experience I remember this happened in the past because something else failed before that step.. For the moment I can add that I also see the alarm: "Alarm 800.010 Potential data loss. No available OSDs in storage replication group group-0" check bug: https://bugs.launchpad.net/starlingx/+bug/1942480 .
Trying to add the OSDs manually I see:
[sysadmin@controller-0 ~(keystone_admin)]$ echo "$DISKS" | grep "$OSD"
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | rpm | serial_id          | device_path                                |
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
| e00b997e-ce7a-43af-a580-e59bd5361959 | /dev/sda    | 2048       | SSD         | 476.939  | 476.937       | N/A | BTLA809407ME512K   | /dev/disk/by-path/pci-0000:00:11.5-ata-2.0 |
| b7f1d649-ddfd-40b0-8969-1c32128e21b1 | /dev/sdb    | 2064       | SSD         | 894.252  | 0.0           | N/A | PHYF004501SK960CGN | /dev/disk/by-path/pci-0000:00:17.0-ata-1.0 |
+--------------------------------------+-------------+------------+-------------+----------+---------------+-----+--------------------+--------------------------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ echo "$DISKS" | grep "/dev/sdb" | awk '{print $2}'
b7f1d649-ddfd-40b0-8969-1c32128e21b1
[sysadmin@controller-0 ~(keystone_admin)]$ TIERS=$(system storage-tier-list ceph_cluster)
[sysadmin@controller-0 ~(keystone_admin)]$ system host-stor-add controller-0 $(echo "$DISKS" | grep "/dev/sdb" | awk '{print $2}') --tier-uuid $(echo "$TIERS" | grep storage | awk '{print $2}')
Can not associate to a rootfs disk

Delfino Gomes Curado Filho (dcuradof) wrote :

Hi Alexandru,

I'm assuming these commands were executed on a different installation as I can't find these disks and their respective UUIDs in the collected logs.

With this in mind, the only thing I can add is that this error message appears when host-stor-add tries to add the disk that the system is installed on as an OSD.
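
A quick way to confirm which disk is the installation disk before running host-stor-add (a sketch; the rootfs_device/boot_device field names are assumptions about the host-show output):

# the rootfs/boot device cannot be used as an OSD; pick a different disk from host-disk-list
system host-show controller-0 | grep -E 'rootfs_device|boot_device'
system host-disk-list controller-0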

For a more detailed analysis, please collect the logs for this new installation and attach them to this launchpad.

Alexandru Dimofte (adimofte) wrote :

[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
| cert-manager | 1.0-25 | cert-manager-manifest | certmanager-manifest.yaml | applied | completed |
| nginx-ingress-controller | 1.1-17 | nginx-ingress-controller-manifest | nginx_ingress_controller_manifest.yaml | applied | completed |
| oidc-auth-apps | 1.0-59 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-42 | platform-integration-manifest | manifest.yaml | uploaded | completed |
| rook-ceph-apps | 1.0-11 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
+--------------------------+---------+-----------------------------------+----------------------------------------+----------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list

+----------+---------------------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                                           | Entity ID                           | Severity | Time Stamp                 |
+----------+---------------------------------------------------------------------------------------+-------------------------------------+----------+----------------------------+
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.               | host=controller-1                   | major    | 2021-11-14T14:12:24.181896 |
| 100.114  | NTP address 192.168.100.1 is not a valid or a reachable NTP server.                  | host=controller-1=192.168.100.1     | minor    | 2021-11-14T13:02:24.178856 |
| 800.010  | Potential data loss. No available OSDs in storage replication group group-0: no OSDs | cluster=96d61c01-a7a4-4340-9dad-... | critical | 2021-11-14T12:56:...       |
| ...


Alexandru Dimofte (adimofte) wrote :

I attached two logs from two different builds: debug_20211114T023708Z.log (the failing build) and debug_20211027T024800Z.log (the working build). You can observe that, before running "system host-stor-add ...", the compute nodes are not in the available state, and the command is therefore not executed.
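
For reference, a simple way to see the availability state the script is waiting on (a sketch):

system host-list    # shows the administrative / operational / availability state of each host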

Delfino Gomes Curado Filho (dcuradof) wrote :

Analyzing the debug logs that you sent, I see that after the configuration of compute-0 the script waits for platform-integ-apps to be applied.

What happens is that Ceph 14 now has a health check called TOO_FEW_OSDS. This warning is raised while the cluster does not have at least the default number of OSDs configured, leaving the cluster in a HEALTH_WARN state.

Because of this warning, Sysinv does not execute the auto-apply of platform-integ-apps, keeping it in the "uploaded" state.
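
A hedged sketch of how to observe this condition on an affected system (the warning text is taken from the ceph.log excerpt earlier in this thread):

ceph health detail                            # reports HEALTH_WARN with TOO_FEW_OSDS while OSD count < osd_pool_default_size
system application-show platform-integ-apps   # remains "uploaded" until Sysinv triggers the auto-apply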

I don't see a reason to wait for this apply at this step. From my understanding, this wait can be removed, as there is already another wait for platform-integ-apps before the stx-openstack upload/apply and after the OSDs are added to the cluster.

Because of that, I suggest changing the script that configures StarlingX.

Alexandru Dimofte (adimofte) wrote :

I created and tested a commit reverting b5b362c759cd40997eab5dfed45b8b34a38b3b5e: https://review.opendev.org/c/starlingx/test/+/818131
The issue was NOT observed again during yesterday's test (I don't remember whether the older issue was sporadic or not).

Ghada Khalil (gkhalil) wrote (last edit):

@Alex, based on your comment, can we mark this LP as Invalid, given this is not a software issue?

Changed in starlingx:
status: Triaged → Invalid