Distributed Cloud: failure during add subcloud prevents subcloud from being added again

Bug #1862774 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jessica Castelino

Bug Description

Brief Description
-----------------
If the "dcmanager subcloud add" command fails at certain points, the data for the subcloud is not cleaned up properly, leaving the system in a state where any future attempt to add the same subcloud will fail.

Severity
--------
Major: failure is severe and will affect users

Steps to Reproduce
------------------
Attempt to add a subcloud and cause a failure during one of the steps, for example during the add_subcloud RPC sent from dcmanager to dcorch, or during the creation of the keystone endpoints.
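
One way to inject such a failure (mirroring the test cases later used to verify the fix) is to raise an explicit exception inside the dcorch subcloud create path. A minimal sketch of that kind of fault injection; the class and method body below are hypothetical stand-ins, not the real dcorch code:

class Subcloud(object):   # hypothetical stand-in for dcorch/objects/subcloud.py
    def create(self):
        # Raising here makes the add_subcloud RPC from dcmanager to dcorch fail
        # after dcmanager has already created some of its own records.
        raise RuntimeError("fault injection: simulated dcorch create failure")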

Expected Behavior
------------------
All data for the subcloud should be cleaned up so a future attempt to add the subcloud can succeed.

Actual Behavior
----------------
Some data is not cleaned up (e.g. the dcorch data for the subcloud), which results in failures when the subcloud is added again. For example:

2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/dcmanager/manager/service.py", line 53, in wrapped
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server return func(self, ctx, *args, **kwargs)
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/dcmanager/manager/service.py", line 142, in add_subcloud
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server return self.subcloud_manager.add_subcloud(context, payload)
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/dcmanager/manager/subcloud_manager.py", line 304, in add_subcloud
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server raise e
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server RemoteError: Remote error: DBDuplicateEntry (psycopg2.IntegrityError) duplicate key value violates unique constraint "subcloud_alarms_region_name_key"
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server DETAIL: Key (region_name)=(subcloud2) already exists.
2020-02-11 13:38:13.491 860107 ERROR oslo_messaging.rpc.server [SQL: 'INSERT INTO subcloud_alarms (created_at, updated_at, deleted_at, deleted, uuid, region_name, critical_alarms, major_alarms, minor_alarms, warnings, cloud_status, capabilities) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(region_name)s, %(critical_alarms)s, %(major_alarms)s, %(minor_alarms)s, %(warnings)s, %(cloud_status)s, %(capabilities)s) RETURNING subcloud_alarms.id'] [parameters: {'cloud_status': 'disabled', 'critical_alarms': -1, 'uuid': '5c925ea7-8906-49bb-8353-3a0afa69f154', 'warnings': -1, 'deleted': 0, 'created_at': datetime.datetime(2020, 2, 11, 13, 38, 5, 405809), 'updated_at': None, 'capabilities': None, 'minor_alarms': -1, 'deleted_at': None, 'major_alarms': -1, 'region_name': u'subcloud2'}]
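
The mechanism behind the traceback can be reproduced outside the product with a toy schema. The sketch below uses sqlite3 and an invented table of the same shape: the row written by the failed first attempt is never removed, so the retry violates the unique constraint on region_name, just like the IntegrityError above.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE subcloud_alarms "
           "(id INTEGER PRIMARY KEY, region_name TEXT UNIQUE)")

def add_subcloud(region_name, fail_later=False):
    db.execute("INSERT INTO subcloud_alarms (region_name) VALUES (?)",
               (region_name,))
    if fail_later:
        # A later step (e.g. endpoint creation) fails and nothing is cleaned up.
        raise RuntimeError("simulated failure later in add_subcloud")

try:
    add_subcloud("subcloud2", fail_later=True)
except RuntimeError:
    pass   # the partial subcloud2 row is left behind

add_subcloud("subcloud2")   # raises sqlite3.IntegrityError: UNIQUE constraint failed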

Reproducibility
---------------
Reproducible (only when the initial subcloud add fails)

System Configuration
--------------------
Distributed cloud

Branch/Pull Time/Commit
-----------------------
Designer load built from a pull on February 4, 2020.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
See above

Test Activity
-------------
Developer Testing

Workaround
----------
Use a different name for the subcloud when attempting to re-add it. However, this leaves the data associated with the previous name (e.g. dcorch database records and keystone endpoints) in place.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue with recovering from previous failure condition

tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Dariush Eslimi (deslimi)
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Jessica Castelino (jcasteli)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/712574

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/712574
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=3a1bf60caddfa2e807d4f5996ff94fea7dde5477
Submitter: Zuul
Branch: master

commit 3a1bf60caddfa2e807d4f5996ff94fea7dde5477
Author: Jessica Castelino <email address hidden>
Date: Wed Mar 11 16:23:21 2020 -0400

    Cleanup subcloud details when subcloud add fails

    Failure during add subcloud prevents subcloud from being added again
    with the same name as the subcloud details are not cleaned up
    properly. Fixes have been added for proper cleanup of dcorch database
    tables, ansible subcloud inventory files, keystone endpoints, keystone
    region, and addn_hosts_dc file when failure is encountered.

    Test cases:
    1. Add subcloud
    2. Add subcloud with "--deploy-playbook"
    3. Delete subcloud
    4. Raise explicit exception in dcorch/objects/subcloud.py
    5. Raise explicit exception in dcmanager/manager/subcloud_manager.py

    Change-Id: Iedf172c3e9c3c4bdb9b9482dc5d46f072b3ccf61
    Closes-Bug: 1862774
    Signed-off-by: Jessica Castelino <email address hidden>
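
A minimal, self-contained sketch of the cleanup-on-failure pattern this commit describes; the function names below are illustrative stand-ins for the real dcmanager/dcorch operations (database inserts, keystone endpoints, inventory files), not the actual API:

created = set()   # stands in for all per-subcloud state (DB rows, files, ...)

def create_dcmanager_record(name):
    created.add(("dcmanager_db", name))

def create_keystone_endpoints(name):
    created.add(("keystone_endpoints", name))

def create_dcorch_records(name):
    raise RuntimeError("simulated dcorch failure during add_subcloud")

def add_subcloud(name):
    try:
        create_dcmanager_record(name)
        create_keystone_endpoints(name)
        create_dcorch_records(name)
    except Exception:
        # The fix: on any failure, remove everything that was already created
        # so the same subcloud name can be added again later.
        created.discard(("keystone_endpoints", name))
        created.discard(("dcmanager_db", name))
        raise

try:
    add_subcloud("subcloud2")
except RuntimeError:
    pass

assert not created   # no stale state left; re-adding "subcloud2" can succeed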

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716140

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/716140
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=04b49dd093ab850f4520cdb85638221120dd7568
Submitter: Zuul
Branch: f/centos8

commit 25c9d6ed3861f2d783404fcf84b186441ab9cd4d
Author: albailey <email address hidden>
Date: Wed Mar 25 15:43:32 2020 -0500

    Removing ddt from unit tests

    This cleanup should assist in transitioning to
    stestr and fixtures, as well as py3 support.

    The ddt data is primarily unused, only subcloud, route
    and endpoints were being loaded.

    The information in the data files was out of date,
    and not necessarily matching the current product model.

    Story: 2004515
    Task: 39160
    Change-Id: Iddd7ed4664b0d59dbc58aae5c3fedd74c9a138c0
    Signed-off-by: albailey <email address hidden>
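
For illustration only, this is the kind of rewrite that dropping ddt implies: a ddt-parameterized test (shown in the comment) becomes a plain unittest loop with subTest, which needs no extra dependency and works with stestr, fixtures, and Python 3. The test itself is invented, not taken from the StarlingX tree.

import unittest

# Before (ddt-style):
#   @ddt.ddt
#   class TestRegionName(unittest.TestCase):
#       @ddt.data("subcloud1", "subcloud2")
#       def test_valid_name(self, name):
#           self.assertTrue(name.startswith("subcloud"))

class TestRegionName(unittest.TestCase):
    def test_valid_name(self):
        for name in ("subcloud1", "subcloud2"):
            with self.subTest(name=name):
                self.assertTrue(name.startswith("subcloud"))

if __name__ == "__main__":
    unittest.main()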

commit 7f3827f24d2fb3cb546d3caf71d505d23187b0dc
Author: Tao Liu <email address hidden>
Date: Thu Mar 12 09:46:29 2020 -0400

    Keystone token and resource caching

    Add the following misc. changes to dcorch and dcmanager components:
    - Cache the master resource in dcorch audit
    - Consolidate the openstack drivers to common module, combine the
      dcmanager and dcorch sysinv client. (Note: the sdk driver that
      used by nova, neutron and cinder will be cleaned as part of
      story 2006588).
    - Update the common sdk driver:
      . in order to avoid creating new keystone client multiple times
      . to add a option for caching region clients, in addition to the
        keystone client
      . finally, to randomize the token early renewal duration
    - Change subcloud audit manager, patch audit manager,
      and sw update manager to:
      utilize the sdk driver which caches the keystone client and token

    Test cases:
    1. Manage/unmanage subclouds
    2. Platform resources sync and audit
    3. Verify the keystone token is cached until the token is
       expired
    4. Add/delete subclouds
    5. Managed subcloud goes offline/online (power off/on)
    6. Managed subcloud goes offline/online (delete/add a static route)
    7. Apply a patch to all subclouds via patch Orchestration

    Story: 2007267
    Task: 38865

    Change-Id: I75e0cf66a797a65faf75e7c64dafb07f54c2df06
    Signed-off-by: Tao Liu <email address hidden>
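
As a rough illustration of the caching described above (not the actual StarlingX sdk driver), the sketch below keeps one token per client and renews it a randomized interval before expiry so that many callers do not all renew at the same instant; the fetch_token callable and the renewal window are assumptions:

import random
import time

class CachedToken(object):
    """Cache a keystone-style token and renew it early by a random margin."""

    def __init__(self, fetch_token, lifetime_secs=3600,
                 early_renewal_range=(300, 600)):
        self._fetch_token = fetch_token          # callable returning a new token
        self._lifetime = lifetime_secs
        self._early_range = early_renewal_range  # renew 5-10 minutes early
        self._token = None
        self._renew_at = 0

    def get(self):
        now = time.time()
        if self._token is None or now >= self._renew_at:
            self._token = self._fetch_token()
            # Randomized early renewal spreads renewals across processes, so
            # the dcmanager/dcorch audits do not all hit keystone at once.
            early = random.uniform(*self._early_range)
            self._renew_at = now + self._lifetime - early
        return self._token

# Usage with a stub fetcher standing in for a real keystone auth request:
token_cache = CachedToken(fetch_token=lambda: "example-token")
assert token_cache.get() == token_cache.get()   # second call reuses the cache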

commit 3a1bf60caddfa2e807d4f5996ff94fea7dde5477
Author: Jessica Castelino <email address hidden>
Date: Wed Mar 11 16:23:21 2020 -0400

    Cleanup subcloud details when subcloud add fails

    Failure during add subcloud prevents subcloud from being added again
    with the same name as the subcloud details are not cleaned up
    properly. Fixes have been added for proper cleanup of dcorch database
    tables, ansible subcloud inventory files, keystone endpoints, keystone
    region, and addn_hosts_dc file when failure is encountered.

    Test cases:
    1. Add subcloud
    2. Add subcloud with "--deploy-playbook"
    3. Delete subcloud
    4. Raise explicit exception in dcorch/objects/subcloud.py
    5. Raise explicit exception in dcmanager/manager/subcloud_manager.py

    Change-Id: Iedf172c3e9c3c4bdb9b9482dc5d46f072b3ccf61
    ...

tags: added: in-f-centos8