Cannot add a new host during provisioning

Bug #1822657 reported by Juan Carlos Alonso
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: chen haochuan
Milestone: (none)

Bug Description

Title
-----
Cannot add a new host during provisioning.

Brief Description
-----------------
During provisioning, when adding a new host:
$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}

Got the following error:
'Maintenance has returned with a status of fail, reason: no response, recommended action: retry'

This issue is intermittent.
After it failed, we tried to add the host again but got:
'error: Host already exists'

When checking the available hosts, we can see the host was installed correctly:
$ system host-list

In short: we get an error when adding a new host, then another error when retrying to add it, because the host was in fact installed correctly.
This issue sometimes breaks our test execution.
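
A minimal sketch of a check to confirm that the host really was created despite the error (a hypothetical helper, assuming the 'system' CLI is on PATH; not part of any StarlingX tooling):

    import subprocess

    def host_was_created(host_name):
        # Query the current host inventory and look for the host name;
        # after the intermittent failure the host is often already there.
        result = subprocess.run(["system", "host-list"],
                                capture_output=True, text=True)
        return host_name in result.stdout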

Severity
--------
<Critical: System/Feature is not usable after the defect>

Steps to Reproduce
------------------
$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}

Expected Behavior
------------------
Host added correctly

Actual Behavior
----------------
'Maintenance has returned with a status of fail, reason: no response, recommended action: retry'

$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}
'error: Host already exists'

Reproducibility
---------------
<Intermittent>

System Configuration
--------------------
Any configuration in virtual and bare metal

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; appears to be an intermittent issue during provisioning. Requires further investigation.

tags: added: stx.2019.05 stx.config
removed: stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Revision history for this message
Erich Cordoba (ericho) wrote :

According to the code, there are two reasons why this message can be returned: in [0] the API client returns None from mtce_api.host_add, while in [1] it returns None from mtce_api.host_modify when trying to power on the node.

As this is an intermittent issue and the error message is ambiguous, I would like to propose changing the error messages to reflect the actual failure, so that the next time we hit this issue it will be easier to identify.
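
A minimal sketch of how the two failure paths could return distinct messages (illustrative only; apart from the two mtce_api calls cited in [0] and [1], the names and structure here are hypothetical):

    def add_host(mtce_api, ihost):
        # Path [0]: register the host with maintenance.
        if mtce_api.host_add(ihost) is None:
            raise Exception("Maintenance did not respond to host-add; "
                            "recommended action: retry")
        # Path [1]: ask maintenance to power on the node.
        if mtce_api.host_modify(ihost) is None:
            raise Exception("Maintenance did not respond to power-on "
                            "(host-modify); recommended action: retry")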

I have an additional question: when this error appeared and the node was actually added, was the node functional?

Also, the next time this issue appears, could you please attach the /var/log/mtc* log files?

[0] https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/api/controllers/v1/host.py#L1481
[1] https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/api/controllers/v1/host.py#L1504

Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Please attach the log files saved under /var/log.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Logic was added to the test suite to handle this issue and try to add the host again, so in order to reproduce this issue again I think we would need to provision manually.
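
A rough Python sketch of that retry handling (hypothetical; the actual suite is Robot Framework based, as the logs below show, and the helper name here is illustrative):

    import subprocess

    def add_host_with_retry(host_name, personality, mac_address):
        # Run 'system host-add'; on the intermittent maintenance failure,
        # retry once and treat 'Host already exists' as success, since the
        # first attempt usually did register the host.
        cmd = ["system", "host-add",
               "-n", host_name, "-p", personality, "-m", mac_address]
        first = subprocess.run(cmd, capture_output=True, text=True)
        output = first.stdout + first.stderr
        if "Maintenance has returned with a status of fail" not in output:
            return first.returncode == 0
        retry = subprocess.run(cmd, capture_output=True, text=True)
        retry_output = retry.stdout + retry.stderr
        return retry.returncode == 0 or "Host already exists" in retry_output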

Revision history for this message
Erich Cordoba (ericho) wrote :

Reviewing old logs, it turns out we had another instance of this issue on May 15 for the standard and external-storage configurations. The error appears, but the nodes are added successfully.

20190515 11:30:58.022 - INFO - [wrsroot@controller-0 ~(keystone_admin)]$
20190515 11:30:58.022 - INFO - +------- END KW: SSHLibrary.Read (0)
20190515 11:30:58.022 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
20190515 11:30:58.031 - INFO - system host-add -n compute-0 -p worker ^M -m a4:bf:01:54:98:61
20190515 11:30:58.031 - INFO - +------- END KW: SSHLibrary.Write (9)
20190515 11:30:58.031 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20190515 11:31:01.335 - INFO - Maintenance has returned with a status of fail, reason: no response, recommended action: retry
[wrsroot@controller-0 ~(keystone_admin)]$
20190515 11:31:01.335 - INFO - ${output} = Maintenance has returned with a status of fail, reason: no response, recommended action: retry
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Cindy Xie (xxie1) wrote :

As it is confirmed that the node was added and only the error message appears, and the test scripts were updated with retry logic, this bug is not blocking stx.2.0; lowering the priority to Low.

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.2.0
Cindy Xie (xxie1)
tags: added: stx.2.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per discussion with Cindy, restoring the priority to Medium and putting back the stx.2.0 release tag. The config PL/TL will need to review and determine whether this should be re-gated to Low / Not Gating.

Changed in starlingx:
importance: Low → Medium
Revision history for this message
Dariush Eslimi (deslimi) wrote :

I reviewed this bug and would like to keep it as Medium. The rationale is that the error message is very misleading in the case of a successful add. I do agree that this should not gate stx.2.0, as the functionality worked.

Dariush Eslimi (deslimi)
tags: added: stx.3.0
removed: stx.2.0
Changed in starlingx:
status: Incomplete → Invalid