Compute-0 install failure: Configuration failure, threshold reached, Lock/Unlock to retry

Bug #1926029 reported by Alexandru Dimofte
Affects: StarlingX | Status: Incomplete | Importance: Critical | Assigned to: Unassigned

Bug Description

Brief Description
-----------------
StarlingX installation failure during Provisioning. Compute-0 installation failed: Configuration failure, threshold reached, Lock/Unlock to retry
Affected configurations: Standard and Standard External on master branch

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Try to install master iso 20210424T043141Z on Standard or Standard External system.
It will fail during provisioning.

Expected Behavior
------------------
StarlingX installation should succeed.

Actual Behavior
----------------
StarlingX installation fails with a compute-0 installation failure: Configuration failure, threshold reached, Lock/Unlock to retry
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | disabled | failed |
| 3 | compute-1 | worker | locked | disabled | online |
| 4 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system host-show compute-0
+-----------------------+----------------------------------------------------------------+
| Property | Value |
+-----------------------+----------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | failed |
| bm_ip | None |
| bm_type | none |
| bm_username | None |
| boot_device | /dev/sda |
| capabilities | {} |
| clock_synchronization | ntp |
| config_applied | 7b3f7305-4ea8-4941-b04b-da72ef57b2bb |
| config_status | None |
| config_target | 7b3f7305-4ea8-4941-b04b-da72ef57b2bb |
| console | ttyS0,115200 |
| created_at | 2021-04-24T17:20:27.181750+00:00 |
| device_image_update | None |
| hostname | compute-0 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioning |
| location | {} |
| mgmt_ip | 10.10.56.27 |
| mgmt_mac | a4:bf:01:70:56:39 |
| operational | disabled |
| personality | worker |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/sda |
| serialid | None |
| software_load | 21.05 |
| task | Configuration failure, threshold reached, Lock/Unlock to retry |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2021-04-24T19:25:33.184340+00:00 |
| uptime | 5534 |
| uuid | 25bd4618-c49d-49c4-8af6-7e85233c9c22 |
| vim_progress_status | None |
+-----------------------+----------------------------------------------------------------+
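The task and alarm text point at the standard manual recovery path: lock and then unlock the failed host so maintenance re-runs its configuration. A sketch of that retry using the same CLI as above (host name taken from the output; whether the retry succeeds depends on the underlying configuration failure):

```
[sysadmin@controller-0 ~(keystone_admin)]$ system host-lock compute-0
[sysadmin@controller-0 ~(keystone_admin)]$ system host-show compute-0 | grep administrative
[sysadmin@controller-0 ~(keystone_admin)]$ system host-unlock compute-0
```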

| 200.004 | compute-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=compute-0 | critical | 2021-04-24T17:51:29.459654 |
| 200.011 | compute-0 experienced a configuration failure. | host=compute-0 | critical | 2021-04-24T17:51:29.417192 |

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
20210422T042749Z

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Critical
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Screening: marking as Critical given this is causing a red sanity on stx master.

tags: added: stx.config
Revision history for this message
Bin Qian (bqian20) wrote :

The attached logs don't include any logs from compute-0, probably because compute-0 was offline when the logs were collected. mtcAgent reports a configuration failure, but no detail can be found because the compute-0 logs are missing:
2021-04-24T17:51:29.387 [105819.00290] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1101) enable_handler :Error : compute-0 configuration failed or incomplete (oob:2)
2021-04-24T17:51:29.387 [105819.00291] controller-0 mtcAgent hbs nodeClass.cpp (1644) alarm_config_failure :Error : compute-0 critical config failure
2021-04-24T17:51:29.387 [105819.00292] controller-0 mtcAgent alm mtcAlarm.cpp ( 579) mtcAlarm_critical :Error : compute-0 setting critical 'Configuration' failure alarm (200.011 )
2021-04-24T17:51:29.387 fmAPI.cpp(490): Enqueue raise alarm request: UUID (3c076b01-5b36-4542-9ee4-628ea94ae6ea) alarm id (200.011) instant id (host=compute-0)
2021-04-24T17:51:29.387 [105819.00293] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-0 Task: Configuration Failed, re-enabling (seq:34)
2021-04-24T17:51:29.387 [105819.00294] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 926) send_hbs_command : Info : compute-0 stop host service sent to controller-0 hbsAgent
2021-04-24T17:51:29.387 [105819.00295] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 926) send_hbs_command : Info : compute-0 stop host service sent to controller-1 hbsAgent
2021-04-24T17:51:29.387 [105819.00296] controller-0 mtcAgent inv mtcInvApi.cpp ( 877) mtcInvApi_update_states : Info : compute-0 unlocked-disabled-failed (seq:35)
2021-04-24T17:51:29.387 [105819.00297] controller-0 mtcAgent hbs nodeClass.cpp (7659) ar_manage : Warn : compute-0 auto recovery (try 1 of 2) (0)
2021-04-24T17:51:29.388 [105819.00298] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1019) service_events : Info : compute-0 heartbeat minor clear (Mgmnt)
2021-04-24T17:51:29.388 [105819.00299] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 573) enable_handler :Error : compute-0 Main Enable FSM (from failed)
2021-04-24T17:51:29.388 [105819.00300] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 926) send_hbs_command : Info : compute-0 stop host service sent to controller-0 hbsAgent
2021-04-24T17:51:29.388 [105819.00301] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 926) send_hbs_command : Info : compute-0 stop host service sent to controller-1 hbsAgent
2021-04-24T17:51:29.388 [105819.00302] controller-0 mtcAgent hbs nodeClass.cpp (1682) alarm_enabled_failure :Error : compute-0 critical enable failure
2021-04-24T17:51:29.388 [105819.00303] controller-0 mtcAgent alm mtcAlarm.cpp ( 579) mtcAlarm_critical :Error : compute-0 setting critical 'In-Service' failure alarm (200.004 )

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil)
tags: added: stx.6.0
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

While trying to collect the logs from compute-0, I was unable to connect to it via ssh. None of the known passwords work.

Revision history for this message
Don Penney (dpenney) wrote :

If the compute puppet manifest did not apply at all, the password would be the initial installation password, the same as for the controller-0 initial install.

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Oh, please ignore the logs attached in the previous comment. I made a mistake; they were prepared for a different bug.

Revision history for this message
Austin Sun (sunausti) wrote :

2021-05-07T02:59:34.886 ^[[1;33mWarning: 2021-05-07 02:59:32 +0000 Could not find resource 'Service[kubelet]' in parameter 'before'
2021-05-07T02:59:34.891 (file & line not available)^[[0m
2021-05-07T02:59:34.897 ^[[mNotice: 2021-05-07 02:59:32 +0000 Compiled catalog for compute-0 in environment production in 24.14 seconds^[[0m
2021-05-07T02:59:34.902 ^[[0;36mDebug: 2021-05-07 02:59:32 +0000 Creating default schedules^[[0m
2021-05-07T02:59:34.907 ^[[1;31mError: 2021-05-07 02:59:32 +0000 Could not find dependent Service[kubelet] for Service[kvm_timer_advance_setup] at /usr/share/puppet/modules/platform/manifests/compute.pp:364

and the kubelet service is not OK, because:
2021-05-07T03:16:14.332 compute-0 kubelet[28444]: info F0507 03:16:14.332234 28444 server.go:199] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

So the kubeadm join command has not been issued yet; we need to check whether there is a timing issue.
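The failure chain described above can be sketched as a standalone check: kubelet aborts at startup because `kubeadm join` has not yet written `/var/lib/kubelet/config.yaml`. The function below is an illustrative reproduction of that guard, not StarlingX or kubelet code; the function name and message wording are assumptions, only the path comes from the log.

```shell
#!/bin/sh
# Sketch (hypothetical) of the startup guard that makes kubelet exit on
# compute-0: if the kubeadm-generated config file is absent, fail loudly.
check_kubelet_config() {
    # $1: path to the kubelet config; defaults to the path from the log.
    config="${1:-/var/lib/kubelet/config.yaml}"
    if [ ! -f "$config" ]; then
        # Mirrors the log line: "failed to load Kubelet config file ...
        # no such file or directory"
        echo "failed to load Kubelet config file $config: no such file or directory" >&2
        return 1
    fi
    echo "kubelet config present: $config"
}
```

On the real host the same condition is visible directly: `systemctl status kubelet` shows the service crash-looping until `kubeadm join` creates the config file, which is why the puppet run cannot find a healthy `Service[kubelet]` to order against.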

Revision history for this message
Austin Sun (sunausti) wrote :
Revision history for this message
Austin Sun (sunausti) wrote :

@Alex, please double-check that this issue is gone as of May 6th.

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

@Austin, yes, I did not observe this issue yesterday. That is why the sanity for the master branch is yellow now.

Revision history for this message
Austin Sun (sunausti) wrote :

OK. Thanks.
I am marking this as a duplicate of LP1918139.
