Provision fails in stx3.0.1 for Standard configuration

Bug #1897896 reported by Alexandru Dimofte
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Paul-Ionut Vaduva
Milestone: none

Bug Description

Brief Description
-----------------
Installing stx3.0.1 (http://mirror.starlingx.cengn.ca/mirror/starlingx/rc/3.0/centos/20200903T190312Z/outputs/iso/) with the old (from 2019) helm chart:
http://mirror.starlingx.cengn.ca/mirror/starlingx/rc/3.0/centos/20191213T023000Z/outputs/helm-charts/helm-charts-stx-openstack-centos-stable-versioned.tgz (md5:687820b8dba02d11dd4cd8fb18a33eba)
always fails for the Standard configuration.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Install stx3.0.1 with the helm chart above on a Standard (multi-node) configuration.

Expected Behavior
------------------
It should install successfully.

Actual Behavior
----------------
Fails at Provision.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Multi-node system: Standard configuration.

Branch/Pull Time/Commit
-----------------------
Stx 3.0.1

Timestamp/Logs
--------------
Will be attached.

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Bruce Jones (brucej) wrote :

This is likely a blocking issue for the 3.0.1 release

Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.3.0
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Added more logs (debug, output and report), archived at: https://files.starlingx.kube.cengn.ca/download_file/360

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Nick, Zhipeng will help investigate this issue on the test setup.

Changed in starlingx:
assignee: nobody → zhipeng liu (zhipengs)
Revision history for this message
zhipeng liu (zhipengs) wrote :

Hi Ghada,

From the log, it seems that the platform-integ-apps apply fails.
storage-init-rbd-provisioner-dgw6k 0/1 Error 3 69s

[sysadmin@controller-0 armada(keystone_admin)]$ kubectl -n kube-system logs storage-init-rbd-provisioner-dgw6k
ceph-admin kubernetes.io/rbd 1 51s
====================================
  cluster:
    id: a01b9d34-2f53-4761-b8cf-39c54fd9fef7
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum controller-0,controller-1
    mgr: controller-0(active), standbys: controller-1
    osd: 0 osds: 0 up, 0 in

  data:
    pools: 1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage: 0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
             64 unknown

+ ceph osd pool stats kube-rbd
pool kube-rbd id 1
  nothing is going on

+ ceph osd pool application enable kube-rbd rbd
enabled application 'rbd' on pool 'kube-rbd'
+ ceph osd pool set kube-rbd size 2
set pool 1 size to 2
+ ceph osd pool set kube-rbd crush_rule storage_tier_ruleset
Error ENOENT: crush rule storage_tier_ruleset does not exist

Please forward to the right guy to further check this error.

Thanks!
Zhipeng
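
For reference, whether the ruleset exists can be checked directly on the controller; a minimal sketch, assuming the standard Ceph CLI available in the StarlingX shell:

# List the crush rules known to the cluster; on a correctly provisioned
# Standard system, storage_tier_ruleset should appear here.
ceph osd crush rule ls
# Dump the full rule definitions for closer inspection.
ceph osd crush rule dump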

zhipeng liu (zhipengs)
Changed in starlingx:
status: New → Confirmed
Ghada Khalil (gkhalil)
tags: added: stx.storage
Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

The sysinv log has a series of error entries, some of which are related to ceph.
sysinv 2020-09-29 10:36:50.347 69839 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed: IOError: Socket closed

/var/log/sysinv.log
sysinv 2020-09-29 10:50:04.542 93466 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0']
sysinv 2020-09-29 10:50:04.543 93466 INFO sysinv.common.ceph [-] Active ceph monitors in ceph cluster = []
sysinv 2020-09-29 10:50:04.545 93466 INFO sysinv.common.ceph [-] Active ceph monitors = []
sysinv 2020-09-29 10:50:04.545 93466 INFO sysinv.common.ceph [-] Not enough monitors yet available to fix crushmap.
At this time ceph-mon seems to be down, judging by the lack of entries in the ceph mon log.
Between 10:45 and 11:16 there are no log entries in:
/var/log/ceph/ceph-mon.controller-0.log
2020-09-29 10:45:35.696 7f1b25c99140 0 mon.controller-0@-1(probing) e0 my rank is now 0 (was -1)
2020-09-29 11:16:14.600 7f1b0c560700 0 -- 192.168.204.11:6789/0 >> 192.168.204.11:0/1237726851 conn(0x56385d509000 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing

/var/log/sysinv.log
sysinv 2020-09-29 11:31:55.621 93466 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0']
sysinv 2020-09-29 11:31:55.622 93466 INFO sysinv.common.ceph [-] Active ceph monitors in ceph cluster = []
sysinv 2020-09-29 11:31:55.625 93466 INFO sysinv.common.ceph [-] Active ceph monitors = []
sysinv 2020-09-29 11:31:55.626 93466 INFO sysinv.common.ceph [-] Not enough monitors yet available to fix crushmap.
Around this time, however, there are log entries in the ceph mon log with a 30-second gap.
/var/log/ceph/ceph-mon.controller-0.log
2020-09-29 11:31:27.483 7f1b0bd5f700 1 mon.controller-0@0(leader).log v29 check_sub sending message to mgr.4140 192.168.204.11:0/279806 with 0 entries (version 29)
2020-09-29 11:31:59.392 7f1b0fd67700 0 mon.controller-0@0(leader) e1 handle_command mon_command({"prefix": "health", "detail": "detail"} v 0) v1

It seems so far that the crushmap is to blame and probably does not contain the rule storage_tier_ruleset.
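
To verify this, the live crushmap can be extracted and decompiled; a sketch using standard Ceph tooling (output paths are illustrative):

# Fetch the binary crushmap from the monitors and decompile it to text;
# the rules section should contain storage_tier_ruleset if it exists.
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
grep -A 5 '^rule ' /tmp/crushmap.txt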

Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

From var/extra/database/sysinv.db.sql.txt
*******
COPY i_istor (created_at, updated_at, deleted_at, id, uuid, osdid, idisk_uuid, state, function, capabilities, forihostid, fortierid) FROM stdin;
\.

--
-- Name: i_istor_id_seq; Type: SEQUENCE SET; Schema: public; Owner: admin-sysinv
--
******
All the logs seem to indicate no OSDs have been created.
If that's the case, we can reject this issue.
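
A quick confirmation, assuming the standard sysinv and Ceph CLIs:

# sysinv view: storage functions configured on the host (empty if no OSDs were added)
system host-stor-list controller-0
# Ceph view: the OSD tree should list configured OSDs; here it would be empty.
ceph osd tree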

Revision history for this message
Frank Miller (sensfan22) wrote :

The comment from Zhipeng indicates the OSDs did not exist. Please check the script used to provision the OSDs and confirm whether it did indeed try to provision them. If it did, check whether an error was reported.

Changed in starlingx:
status: Confirmed → Invalid
assignee: zhipeng liu (zhipengs) → Alexandru Dimofte (adimofte)
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I attached a diff file showing some different command outputs. On the left side is a working Standard configuration, and on the right side is our Standard configuration.

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

We retested all configurations, bare metal and virtual, and only on Standard does the provision fail.
The install is blocked because the platform-integ-apps status is UPLOADED and NOT APPLIED.
Please find attached the sysinv.log captured while the status is UPLOADED.
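
For anyone reproducing this, the application state can be watched from the controller; a sketch assuming the standard StarlingX system CLI:

# platform-integ-apps should progress uploaded -> applying -> applied;
# in this bug it stays at uploaded.
system application-list
system application-show platform-integ-apps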

Changed in starlingx:
status: Invalid → Incomplete
status: Incomplete → Invalid
Revision history for this message
Frank Miller (sensfan22) wrote :

Re-assigning to Paul to investigate again.

Changed in starlingx:
status: Invalid → Confirmed
assignee: Alexandru Dimofte (adimofte) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

I've uploaded the /var/log/bash.log file. At 2020-11-08T04:16:10.000, the install script starts to wait for platform-integ-apps to reach the applied status, but this never happens.

Revision history for this message
Nicolae Jascanu (njascanu-intel) wrote :

The collect archive when platform-integ-apps is in UPLOADED status is uploaded at: https://files.starlingx.kube.cengn.ca/download_file/371

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Looking over bash.log, we see that OSDs are not present; please add them BEFORE checking that platform-integ-apps is applied. Once OSDs are added, platform-integ-apps should apply automatically.
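
A minimal sketch of adding an OSD, assuming the usual StarlingX provisioning flow (the host name and <disk_uuid> are placeholders to be taken from your setup):

# Find a free disk on the host that should carry the OSD.
system host-disk-list controller-0
# Add it as an OSD; <disk_uuid> comes from the listing above.
system host-stor-add controller-0 osd <disk_uuid>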

Revision history for this message
Paul-Ionut Vaduva (pvaduva) wrote :

platform-integ-apps is not applied because no OSDs have been defined:
sysinv 2020-11-08 06:55:14.471 96976 INFO sysinv.conductor.manager [-] Platform managed application platform-integ-apps: Prerequisites not met.
Among the prerequisites checked in
https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/conductor/manager.py
the method _met_app_apply_prerequisites checks for what is called a crushmap. A crushmap is generated after at least one OSD is declared; if no OSD is defined, no crushmap is generated, the prerequisites are not met, and platform-integ-apps is prevented from being applied.
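
Once at least one OSD is defined, the effect should be visible in the sysinv log; a hedged check (the log strings match the excerpts above):

# Before OSDs exist, the conductor logs 'Prerequisites not met.'; after an
# OSD is added it should proceed to apply platform-integ-apps.
grep 'platform-integ-apps' /var/log/sysinv.log | tail -n 20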

Changed in starlingx:
status: Confirmed → Invalid