Cannot add a new host during provisioning

Bug #1822657 reported by Juan Carlos Alonso
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: chen haochuan
Milestone: (none)

Bug Description

Title
-----
Cannot add a new host during provisioning.

Brief Description
-----------------
During provisioning, when adding a new host:
$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}

Got the following error:
'Maintenance has returned with a status of fail, reason: no response, recommended action: retry'

This issue is intermittent.
After it failed, we tried to add the host again but got:
'error: Host already exists'

When checking the available hosts, we can see the host was installed correctly:
$ system host-list

In short: we get an error when adding a new host, then another error when retrying to add it, because the host was in fact installed correctly.
This issue sometimes breaks our test execution.
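
A minimal sketch of a check to confirm that the host really was created despite the error (a hypothetical helper, assuming the 'system' CLI is on PATH; not part of any StarlingX tooling):

    import subprocess

    def host_was_created(host_name):
        # Query the current host inventory and look for the host name;
        # after the intermittent failure the host is often already there.
        result = subprocess.run(["system", "host-list"],
                                capture_output=True, text=True)
        return host_name in result.stdout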

Severity
--------
<Critical: System/Feature is not usable after the defect>

Steps to Reproduce
------------------
$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}

Expected Behavior
------------------
Host added correctly

Actual Behavior
----------------
'Maintenance has returned with a status of fail, reason: no response, recommended action: retry'

$ system host-add -n ${host_name} -p ${personality} -m ${mac_address}
'error: Host already exists'

Reproducibility
---------------
<Intermittent>

System Configuration
--------------------
Any configuration in virtual and bare metal

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; appears to be an intermittent issue during provisioning. Requires further investigation.

tags: added: stx.2019.05 stx.config
removed: stx.containers
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Bruce Jones (brucej)
Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Revision history for this message
Erich Cordoba (ericho) wrote :

According to the code, there are two reasons why this message can be returned: in [0] the API client returns None from mtce_api.host_add, while in [1] it returns None from mtce_api.host_modify when trying to power on the node.

As this is an intermittent issue and the error message is ambiguous, I would like to propose changing the error messages to reflect the actual failure, so that the next time we hit this issue it will be easier to identify.
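
A minimal sketch of how the two failure paths could return distinct messages (illustrative only; apart from the two mtce_api calls cited in [0] and [1], the names and structure here are hypothetical):

    def add_host(mtce_api, ihost):
        # Path [0]: register the host with maintenance.
        if mtce_api.host_add(ihost) is None:
            raise Exception("Maintenance did not respond to host-add; "
                            "recommended action: retry")
        # Path [1]: ask maintenance to power on the node.
        if mtce_api.host_modify(ihost) is None:
            raise Exception("Maintenance did not respond to power-on "
                            "(host-modify); recommended action: retry")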

I have an additional question: when this error appeared and the node was actually added, was the node functional?

Also, the next time this issue appears, could you please attach the /var/log/mtc* log files?

[0] https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/api/controllers/v1/host.py#L1481
[1] https://opendev.org/starlingx/config/src/branch/master/sysinv/sysinv/sysinv/sysinv/api/controllers/v1/host.py#L1504

Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Please attach the log files saved under /var/log.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Juan Carlos Alonso (juancarlosa) wrote :

Logic was added to the test suite to handle this issue and try to add the host again, so in order to reproduce this issue again I think we would need to provision manually.
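
A rough Python sketch of that retry handling (hypothetical; the actual suite is Robot Framework based, as the logs below show, and the helper name here is illustrative):

    import subprocess

    def add_host_with_retry(host_name, personality, mac_address):
        # Run 'system host-add'; on the intermittent maintenance failure,
        # retry once and treat 'Host already exists' as success, since the
        # first attempt usually did register the host.
        cmd = ["system", "host-add",
               "-n", host_name, "-p", personality, "-m", mac_address]
        first = subprocess.run(cmd, capture_output=True, text=True)
        output = first.stdout + first.stderr
        if "Maintenance has returned with a status of fail" not in output:
            return first.returncode == 0
        retry = subprocess.run(cmd, capture_output=True, text=True)
        retry_output = retry.stdout + retry.stderr
        return retry.returncode == 0 or "Host already exists" in retry_output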

Revision history for this message
Erich Cordoba (ericho) wrote :

Reviewing old logs, it turns out we had another instance of this issue on May 15 for the standard and external-storage configurations. The error appears, but the nodes are added successfully.

20190515 11:30:58.022 - INFO - [wrsroot@controller-0 ~(keystone_admin)]$
20190515 11:30:58.022 - INFO - +------- END KW: SSHLibrary.Read (0)
20190515 11:30:58.022 - INFO - +------- START KW: SSHLibrary.Write [ ${cmd} ]
20190515 11:30:58.031 - INFO - system host-add -n compute-0 -p worker ^M -m a4:bf:01:54:98:61
20190515 11:30:58.031 - INFO - +------- END KW: SSHLibrary.Write (9)
20190515 11:30:58.031 - INFO - +------- START KW: SSHLibrary.Read Until Prompt [ ]
20190515 11:31:01.335 - INFO - Maintenance has returned with a status of fail, reason: no response, recommended action: retry
[wrsroot@controller-0 ~(keystone_admin)]$
20190515 11:31:01.335 - INFO - ${output} = Maintenance has returned with a status of fail, reason: no response, recommended action: retry
[wrsroot@controller-0 ~(keystone_admin)]$

Revision history for this message
Cindy Xie (xxie1) wrote :

As it is confirmed that the node was added and only the error message appears, and the test scripts were updated with retry logic, this bug is not blocking stx.2.0; lowering the priority to Low.

Changed in starlingx:
importance: Medium → Low
tags: removed: stx.2.0
Cindy Xie (xxie1)
tags: added: stx.2.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per discussion with Cindy, restoring the priority to Medium and putting back the stx.2.0 release tag. The config PL/TL will need to review and determine whether this should be re-gated to Low / Not Gating.

Changed in starlingx:
importance: Low → Medium
Revision history for this message
Dariush Eslimi (deslimi) wrote :

I reviewed this bug and would like to keep it as Medium. The rationale is that the error message is very misleading in the case of a successful add. I do agree that this should not gate stx.2.0, as the functionality worked.

Dariush Eslimi (deslimi)
tags: added: stx.3.0
removed: stx.2.0
Changed in starlingx:
status: Incomplete → Invalid