IPv6 Distributed Cloud: DC orchestration serial apply patch subcloud failed

Bug #1847828 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: John Kung

Bug Description

Brief Description
-----------------
Upload and apply a non-reboot-required patch on the system controller. Create a serial patch orchestration strategy and apply it to the system. The system controller step completes, but the serial patch apply fails on the subclouds.

Severity
--------
Major

Steps to Reproduce
------------------
As described in the brief description above.

TC-name: DC regression

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
DC system
IPv6

Lab-name: DC

Branch/Pull Time/Commit
-----------------------
"2019-10-06_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager patch-strategy create --subcloud-apply-type serial
+------------------------+----------------------------+
| Field                  | Value                      |
+------------------------+----------------------------+
| subcloud apply type    | serial                     |
| max parallel subclouds | 20                         |
| stop on failure        | False                      |
| state                  | initial                    |
| created_at             | 2019-10-11T23:34:44.515203 |
| updated_at             | None                       |
+------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager patch-strategy apply
+------------------------+----------------------------+
| Field                  | Value                      |
+------------------------+----------------------------+
| subcloud apply type    | serial                     |
| max parallel subclouds | 20                         |
| stop on failure        | False                      |
| state                  | applying                   |
| created_at             | 2019-10-11T23:34:44.515203 |
| updated_at             | 2019-10-11T23:35:15.921717 |
+------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
| cloud            | stage | state    | details                                                        | started_at                 | finished_at                |
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
| SystemController | 1     | complete |                                                                | 2019-10-11 23:35:21.447412 | 2019-10-11 23:39:01.645823 |
| subcloud4        | 2     | failed   | Strategy apply failed for subcloud4 - unexpected state aborted | 2019-10-11 23:39:11.653890 | 2019-10-11 23:41:04.059838 |
| subcloud1        | 3     | failed   | Strategy apply failed for subcloud1 - unexpected state aborted | 2019-10-11 23:41:11.764378 | 2019-10-11 23:45:04.548805 |
+------------------+-------+----------+----------------------------------------------------------------+----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step show subcloud4
+-------------+----------------------------------------------------------------+
| Field       | Value                                                          |
+-------------+----------------------------------------------------------------+
| cloud       | subcloud4                                                      |
| stage       | 2                                                              |
| state       | failed                                                         |
| details     | Strategy apply failed for subcloud4 - unexpected state aborted |
| started_at  | 2019-10-11 23:39:11.653890                                     |
| finished_at | 2019-10-11 23:41:04.059838                                     |
| created_at  | 2019-10-11 23:34:44.520215                                     |
| updated_at  | 2019-10-11 23:41:04.064529                                     |
+-------------+----------------------------------------------------------------+
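
When many subclouds are involved, the failed steps can be pulled out of the step list with a small filter. The following is a minimal sketch, not part of dcmanager: the failed_steps function is a hypothetical helper, and it assumes the table layout shown above (cloud in the first column, state in the third, details in the fourth).

```shell
#!/bin/sh
# Hypothetical helper: print "<cloud>: <details>" for every row of a
# strategy-step table whose state column is "failed".
failed_steps() {
    awk -F'|' '
        NF > 3 {
            cloud = $2; state = $4; details = $5
            # Strip the padding around each cell.
            gsub(/^ +| +$/, "", cloud)
            gsub(/^ +| +$/, "", state)
            gsub(/^ +| +$/, "", details)
            if (state == "failed") print cloud ": " details
        }'
}

# Sample rows copied from the output above. On a live system the input
# would instead be piped from: dcmanager strategy-step list
sample='| cloud | stage | state | details | started_at | finished_at |
| SystemController | 1 | complete | | 2019-10-11 23:35:21.447412 | 2019-10-11 23:39:01.645823 |
| subcloud4 | 2 | failed | Strategy apply failed for subcloud4 - unexpected state aborted | 2019-10-11 23:39:11.653890 | 2019-10-11 23:41:04.059838 |
| subcloud1 | 3 | failed | Strategy apply failed for subcloud1 - unexpected state aborted | 2019-10-11 23:41:11.764378 | 2019-10-11 23:45:04.548805 |'

failures=$(printf '%s\n' "$sample" | failed_steps)
printf '%s\n' "$failures"
```

Border lines contain no "|" separator and the header row has "state" in the state column, so both are skipped without special handling.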

Test Activity
-------------
Regression Testing

Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Assigning to Bart to triage before deciding on gate

Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
tags: added: stx.distcloud
Bart Wensley (bartwensley) wrote :

The patch application on subcloud4 failed due to an alarm:

2019-10-11T23:41:03.904 controller-1 VIM_Thread[133212] INFO _strategy.py.382 Apply Complete Callback, result=failed, reason=alarms from platform are present.

This is a "config out of date" alarm:

2019-10-11T23:40:28.000 controller-1 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date.", "entity_instance_id" : "region=subcloud4.system=dc-subcloud4.host=controller-1", "severity" : "major", "state" : "set", "timestamp" : "2019-10-11 23:40:28.570324" }

The alarm was raised by the sysinv-conductor at the point when the patch was being applied to controller-1:

2019-10-11 23:40:28.408 130812 INFO sysinv.conductor.manager [req-a9899633-ae56-4bfb-8619-0c9de70b6e1b None None] Updating platform data for host: f0996358-69f1-413d-b92e-f16c798ae52a with: {u'config_applied': u'577513b7-2f2c-47f7-8c3b-d5fecde9d484', u'first_report': False, u'availability': u'available', u'iscsi_initiator_name': u'iqn.1994-05.com.redhat:a497a94283a0'}
2019-10-11 23:40:28.569 130812 WARNING sysinv.conductor.manager [req-a9899633-ae56-4bfb-8619-0c9de70b6e1b None None] controller-1: iconfig out of date: target 6b7d1b14-cdfa-4736-8218-1506c3c9e761, applied 577513b7-2f2c-47f7-8c3b-d5fecde9d484
2019-10-11 23:40:28.570 130812 WARNING sysinv.conductor.manager [req-a9899633-ae56-4bfb-8619-0c9de70b6e1b None None] SYS_I Raise system config alarm: host controller-1 config applied: 577513b7-2f2c-47f7-8c3b-d5fecde9d484 vs. target: 6b7d1b14-cdfa-4736-8218-1506c3c9e761.
2019-10-11 23:40:28.587 411409 INFO sysinv.agent.manager [-] Sysinv Agent platform update by host: {'config_applied': '577513b7-2f2c-47f7-8c3b-d5fecde9d484', 'first_report': False, 'availability': 'available', 'iscsi_initiator_name': 'iqn.1994-05.com.redhat:a497a94283a0'}

Investigation is required by someone in sysinv to understand why the config went out of date.
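
The mismatch reported in the conductor warning can be extracted and compared mechanically. The sketch below is illustrative only: it parses the WARNING line quoted above with sed, and find_target_apply is a hypothetical helper whose log file argument is an assumption, not a documented path.

```shell
#!/bin/sh
# WARNING line copied from the sysinv conductor log quoted above.
line='2019-10-11 23:40:28.569 130812 WARNING sysinv.conductor.manager [req-a9899633-ae56-4bfb-8619-0c9de70b6e1b None None] controller-1: iconfig out of date: target 6b7d1b14-cdfa-4736-8218-1506c3c9e761, applied 577513b7-2f2c-47f7-8c3b-d5fecde9d484'

# Pull the target and applied config UUIDs out of the message.
target=$(printf '%s\n' "$line" | sed -n 's/.*iconfig out of date: target \([0-9a-f-]*\), applied .*/\1/p')
applied=$(printf '%s\n' "$line" | sed -n 's/.*, applied \([0-9a-f-]*\).*/\1/p')

# The host is out of date whenever the two UUIDs differ.
if [ "$target" != "$applied" ]; then
    echo "config out of date: target=$target applied=$applied"
fi

# Hypothetical helper: list the lines of a log file ($2) that mention a
# given config UUID ($1), e.g. the request that set the target config.
find_target_apply() {
    grep -F -- "$1" "$2"
}
```

Searching the sysinv log for the target UUID with find_target_apply is one way to locate the runtime manifest apply that was meant to bring the host to that config.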

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → John Kung (john-kung)
Yosief Gebremariam (ygebrema) wrote :

DC Patch orchestration with subcloud-apply-type=serial passed with no issue.

Ghada Khalil (gkhalil) wrote :

Appears to be an intermittent issue given a re-test by Yosief was successful.
Marking as stx.3.0 / medium priority for further investigation.

tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.config
John Kung (john-kung) wrote :

The config out-of-date alarm was raised because of the following puppet manifest apply failure on the subcloud after a host-swact:

sysinv.log
2019-10-11 14:25:28.939 13566 INFO sysinv.agent.manager [req-0a94efbc-9ed7-40a0-ac91-ee9121214a7b admin None] config_apply_runtime_manifest: 6b7d1b14-cdfa-4736-8218-1506c3c9e761 {u'classes': [u'openstack::keystone::endpoint::runtime', u'platform::firewall::runtime', u'platform::sysinv::runtime'], u'force': False, u'personalities': [u'controller'], u'host_uuids': [u'f0996358-69f1-413d-b92e-f16c798ae52a']} controller

puppet.log
2019-10-11T14:26:44.387 Debug: 2019-10-11 14:26:44 +0000 Executing: '/bin/sh -c source /etc/platform/openrc && openstack endpoint list --region RegionOne --service keystone --interface admin -f value -c ID | xargs openstack endpoint delete'
2019-10-11T14:26:48.044 Notice: 2019-10-11 14:26:48 +0000 /Stage[main]/Openstack::Keystone::Endpoint::Runtime/Delete_endpoints[Delete keystone endpoints]/Exec[Delete RegionOne keystone admin endpoint]/returns: usage: openstack endpoint delete [-h] <endpoint-id> [<endpoint-id> ...]
2019-10-11T14:26:48.047 Notice: 2019-10-11 14:26:48 +0000 /Stage[main]/Openstack::Keystone::Endpoint::Runtime/Delete_endpoints[Delete keystone endpoints]/Exec[Delete RegionOne keystone admin endpoint]/returns: openstack endpoint delete: error: too few arguments
2019-10-11T14:26:48.062 Error: 2019-10-11 14:26:48 +0000 source /etc/platform/openrc && openstack endpoint list --region RegionOne --service keystone --interface admin -f value -c ID | xargs openstack endpoint delete returned 123 instead of one of [0]

The keystone endpoint runtime manifest (openstack::keystone::endpoint::runtime) was made more robust by the following commit, which prevents the 'error: too few arguments' failure:
https://opendev.org/starlingx/stx-puppet/commit/37ca0899b6b684e7058cb3f53d835d1679bed694
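
The puppet log suggests that the endpoint list came back empty, so xargs ran `openstack endpoint delete` once with no arguments; GNU xargs then reports exit status 123 when the command it ran fails. The sketch below reproduces that behaviour and shows one common guard, the GNU xargs -r (--no-run-if-empty) option. It is not necessarily what the commit above does. The fake_delete script is a hypothetical stand-in for `openstack endpoint delete`.

```shell
#!/bin/sh
# Hypothetical stand-in for `openstack endpoint delete`: it fails when
# called without any endpoint IDs, like the real command in the log.
fake_delete='[ "$#" -gt 0 ] || { echo "error: too few arguments" >&2; exit 2; }; echo "deleted: $*"'

# Unguarded: with empty input, GNU xargs still runs the command once,
# the command fails, and xargs returns a non-zero status.
unguarded_rc=0
printf '' | xargs sh -c "$fake_delete" sh 2>/dev/null || unguarded_rc=$?

# Guarded: -r skips the command entirely when there is no input.
guarded_rc=0
printf '' | xargs -r sh -c "$fake_delete" sh || guarded_rc=$?

# With IDs present, the guarded pipeline behaves as before.
deleted=$(printf 'id1\nid2\n' | xargs -r sh -c "$fake_delete" sh)

echo "unguarded=$unguarded_rc guarded=$guarded_rc"
echo "$deleted"
```

The guard makes the delete step a no-op when there is nothing to delete, which keeps a repeated manifest apply from failing on an already-clean system.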

John Kung (john-kung) wrote :

Updated status to 'Fix Released' (reason as per note above):

https://opendev.org/starlingx/stx-puppet/commit/37ca0899b6b684e7058cb3f53d835d1679bed694

Changed in starlingx:
status: Triaged → Fix Released
Peng Peng (ppeng) wrote :

Verified on 2019-12-01_20-00-00.

tags: removed: stx.retestneeded