Distributed Cloud: 5 subclouds fail deployment

Bug #1895605 reported by Gerry Kopec
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Tee Ngo

Bug Description

Brief Description
-----------------
Tested parallel deployment of subclouds. Of the group, 3 subclouds failed at bootstrap and 2 failed at deploy prep.

Severity
--------
Major

Steps to Reproduce
------------------
Set up a Distributed Cloud system with many subclouds and deploy the subclouds in parallel.

Expected Behavior
------------------
Subclouds should deploy successfully.

Actual Behavior
----------------
5 subclouds failed to deploy.

Reproducibility
---------------
Occurred on 1 out of 2 attempts.

System Configuration
--------------------
Distributed Cloud - system controller

Branch/Pull Time/Commit
-----------------------
2020-06-27_18-35-20

Last Pass
---------
none

Timestamp/Logs
--------------
controller-0 was active at the time of test:

    2 deploy prep failed (subcloud 81 & 99)

2020-08-11 22:43:14.859 13677 ERROR dcmanager.manager.subcloud_manager [req-dbd26abd-7f39-442a-ab7d-42cf9958bb2b 0784a55568994c69bd3a991c101b5007 - - default default] Failed to create subcloud subcloud81: ConnectTimeout: Request to https://[fd01:6::2]:5001/v3/endpoints timed out
2020-08-11 22:43:16.331 13677 ERROR dcmanager.manager.subcloud_manager [req-6fe5a6b9-ca1e-44c4-a3c3-16370eb94b66 0784a55568994c69bd3a991c101b5007 - - default default] Failed to create subcloud subcloud99: ConnectTimeout: Request to https://[fd01:6::2]:5001/v3/services? timed out

    3 bootstrap failed

subcloud98:
2020-08-11 22:43:15.042 13677 ERROR dcmanager.manager.subcloud_manager [-] Failed to run the subcloud bootstrap playbook for subcloud subcloud98, check individual log at /var/log/dcmanager/subcloud98_bootstrap_2020-08-11-22-42-12.log for detailed output.: CalledProcessError: Command 'ansible-playbook' returned non-zero exit status 2

[sysadmin@controller-0 dcmanager(keystone_admin)]$ tail subcloud98_bootstrap_2020-08-11-22-42-12.log
changed: [subcloud98] => (item=DOCKER_REGISTRY_ADDITIONAL_OVERRIDES=undef)
changed: [subcloud98] => (item=ELASTIC_REGISTRY_ADDITIONAL_OVERRIDES=undef)
changed: [subcloud98] => (item=USE_DEFAULT_REGISTRIES=False)
changed: [subcloud98] => (item=RECONFIGURE_ENDPOINTS=False)
changed: [subcloud98] => (item=INITIAL_DB_POPULATED=False)
fatal: [subcloud98]: FAILED! => {"msg": "Timeout (12s) waiting for privilege escalation prompt: "}

PLAY RECAP *********************************************************************
subcloud98 : ok=115 changed=23 unreachable=0 failed=1

subcloud80:
2020-08-11 22:43:15.092 13677 ERROR dcmanager.manager.subcloud_manager [-] Failed to run the subcloud bootstrap playbook for subcloud subcloud80, check individual log at /var/log/dcmanager/subcloud80_bootstrap_2020-08-11-22-42-37.log for detailed output.: CalledProcessError: Command 'ansible-playbook' returned non-zero exit status 2

[sysadmin@controller-0 dcmanager(keystone_admin)]$ tail subcloud80_bootstrap_2020-08-11-22-42-37.log
TASK [bootstrap/validate-config : Check OpenID Connect parameters] *************

TASK [bootstrap/validate-config : Check for Ceph data wipe flag] ***************

TASK [bootstrap/validate-config : Wipe ceph osds] ******************************
fatal: [subcloud80]: FAILED! => {"msg": "Timeout (12s) waiting for privilege escalation prompt: "}

PLAY RECAP *********************************************************************
subcloud80 : ok=112 changed=21 unreachable=0 failed=1

subcloud97:
2020-08-11 22:48:00.198 13677 ERROR dcmanager.manager.subcloud_manager [-] Failed to run the subcloud bootstrap playbook for subcloud subcloud97, check individual log at /var/log/dcmanager/subcloud97_bootstrap_2020-08-11-22-43-56.log for detailed output.: CalledProcessError: Command 'ansible-playbook' returned non-zero exit status 2

[sysadmin@controller-0 dcmanager(keystone_admin)]$ tail subcloud97_bootstrap_2020-08-11-22-43-56.log
changed: [subcloud97] => (item=grubby --update-kernel=/boot/vmlinuz-3.10.0-1127.el7.2.tis.x86_64 --args='nopti nospectre_v2 nospectre_v1')
changed: [subcloud97] => (item=grubby --efi --update-kernel=/boot/vmlinuz-3.10.0-1127.el7.2.tis.x86_64 --args='nopti nospectre_v2 nospectre_v1')

TASK [bootstrap/persist-config : Resize filesystems (default)] *****************
changed: [subcloud97] => (item=lvextend -L1G /dev/cgts-vg/pgsql-lv)
fatal: [subcloud97]: FAILED! => {"msg": "Timeout (12s) waiting for privilege escalation prompt: "}

PLAY RECAP *********************************************************************
subcloud97 : ok=152 changed=47 unreachable=0 failed=1

Performance data was being collected at the time, and these errors correspond to periods of high CPU usage on the system controller caused by the kswapd, postgres, and ansible-playbook processes.
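
For triage across many subclouds, the failure lines above can be sorted mechanically. The following is a minimal Python sketch, assuming the log formats match the excerpts quoted in this report. The regular expressions and the function names (classify_dcmanager_line, find_escalation_timeouts) are hypothetical helpers written for illustration and are not part of dcmanager. The mapping of "Failed to create subcloud" to deploy prep follows the grouping used in this report.

```python
import re
from typing import List, Optional, Tuple

# Patterns derived only from the log excerpts in this report (assumption:
# other releases may word these messages differently).
DCMANAGER_FAILURE = re.compile(
    r"ERROR dcmanager\.manager\.subcloud_manager .*?Failed to "
    r"(?P<action>create subcloud|run the subcloud bootstrap playbook for subcloud) "
    r"(?P<name>subcloud\d+)[:,]"
)
ESCALATION_TIMEOUT = re.compile(
    r"fatal: \[(?P<host>[^\]]+)\]: FAILED! => .*"
    r"Timeout \((?P<seconds>\d+)s\) waiting for privilege escalation prompt"
)


def classify_dcmanager_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (subcloud name, failed stage) for a dcmanager error line, else None."""
    match = DCMANAGER_FAILURE.search(line)
    if match is None:
        return None
    # This report lists the "Failed to create subcloud" errors under deploy prep.
    stage = "deploy-prep" if match.group("action") == "create subcloud" else "bootstrap"
    return match.group("name"), stage


def find_escalation_timeouts(log_text: str) -> List[Tuple[str, int]]:
    """Return (host, timeout in seconds) for each privilege escalation timeout."""
    return [
        (m.group("host"), int(m.group("seconds")))
        for m in ESCALATION_TIMEOUT.finditer(log_text)
    ]


# Sample input copied from the logs quoted in this report.
SAMPLE_DCMANAGER_LINES = [
    "2020-08-11 22:43:14.859 13677 ERROR dcmanager.manager.subcloud_manager "
    "[req-dbd26abd-7f39-442a-ab7d-42cf9958bb2b 0784a55568994c69bd3a991c101b5007 "
    "- - default default] Failed to create subcloud subcloud81: ConnectTimeout: "
    "Request to https://[fd01:6::2]:5001/v3/endpoints timed out",
    "2020-08-11 22:43:15.042 13677 ERROR dcmanager.manager.subcloud_manager [-] "
    "Failed to run the subcloud bootstrap playbook for subcloud subcloud98, "
    "check individual log at "
    "/var/log/dcmanager/subcloud98_bootstrap_2020-08-11-22-42-12.log "
    "for detailed output.: CalledProcessError: Command 'ansible-playbook' "
    "returned non-zero exit status 2",
]
SAMPLE_BOOTSTRAP_TAIL = (
    "changed: [subcloud98] => (item=INITIAL_DB_POPULATED=False)\n"
    'fatal: [subcloud98]: FAILED! => {"msg": "Timeout (12s) waiting for '
    'privilege escalation prompt: "}\n'
)

if __name__ == "__main__":
    for sample in SAMPLE_DCMANAGER_LINES:
        print(classify_dcmanager_line(sample))
    print(find_escalation_timeouts(SAMPLE_BOOTSTRAP_TAIL))
```

Running the classifier over dcmanager's log and the timeout finder over each per-subcloud bootstrap log would separate the two failure modes seen here (Keystone connect timeouts versus Ansible privilege escalation timeouts).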

Test Activity
-------------
Distributed Cloud system testing

Workaround
----------
Subclouds deployed successfully when deployed individually on a subsequent attempt.

Changed in starlingx:
assignee: nobody → Gerry Kopec (gerry-kopec)
summary: - Distributed Cloud: some subclouds fail deployment
+ Distributed Cloud: 5 subclouds fail deployment
tags: added: stx.distcloud
Tee Ngo (teewrs)
Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Tee Ngo (teewrs)
Ghada Khalil (gkhalil) wrote :

stx.5.0 - scalability issue with distributed cloud

tags: added: stx.5.0
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754433

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/754433
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=da913270fb04dec483a8bbc7a1a527fa8b7b5cef
Submitter: Zuul
Branch: master

commit da913270fb04dec483a8bbc7a1a527fa8b7b5cef
Author: Tee Ngo <email address hidden>
Date: Fri Sep 25 11:13:25 2020 -0400

    Turn off Ansible pipelining

    Restore the default Ansible pipelining setting (False). Pipelining,
    when enabled, boosts performance by reducing network traffic but
    it can also lead to random bootstrap failures in batch subcloud
    deployment due to ssh timeout (currently default to 10s).

    This setting will be revisited when a test environment, that
    enables a reliable determination of ansible ssh timeout to
    support batch subcloud deployment in a large Distributed Cloud, is
    available.

    Partial-Bug: 1895605
    Change-Id: I423933f34f8dc76aa67db75f689f64dba3ecb164
    Signed-off-by: Tee Ngo <email address hidden>
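
For readers unfamiliar with the settings named in this commit message, the sketch below shows how they typically appear in an ansible.cfg file and how to check them with Python's standard configparser. The sample file content is an assumption for illustration: this report does not show the actual StarlingX ansible.cfg, and read_ssh_settings is a hypothetical helper. The section and key names (pipelining under [ssh_connection], timeout under [defaults]) are standard Ansible options.

```python
import configparser

# Illustrative ansible.cfg content (assumption: not copied from StarlingX).
# pipelining = False matches the default restored by the commit above, and
# timeout = 10 matches the 10s default mentioned in the commit message.
SAMPLE_ANSIBLE_CFG = """
[defaults]
timeout = 10

[ssh_connection]
pipelining = False
"""


def read_ssh_settings(cfg_text: str) -> dict:
    """Parse ansible.cfg text and return the pipelining and timeout settings.

    Missing keys fall back to the defaults named in the commit message
    (pipelining off, 10 second timeout).
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "pipelining": parser.getboolean("ssh_connection", "pipelining", fallback=False),
        "timeout": parser.getint("defaults", "timeout", fallback=10),
    }


if __name__ == "__main__":
    print(read_ssh_settings(SAMPLE_ANSIBLE_CFG))
```

A check like this can confirm which values a given system controller is actually using before a batch deployment is attempted.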

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754812

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/754812
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=9c25d435a3c2c01aaf055df816ed795ad4b44816
Submitter: Zuul
Branch: master

commit 9c25d435a3c2c01aaf055df816ed795ad4b44816
Author: Tee Ngo <email address hidden>
Date: Mon Sep 28 12:07:14 2020 -0400

    Increase max_pool_size for dc audits

    Increase max_pool_size for dcorch and dcmanager audits to avoid
    database thrashing with connect/disconnect requests resulting in
    sharp CPU spike caused by postgres on every dcorch/dcmanager audit
    cycle. The CPU spike is magnified when both dcorch and dcmanager
    audits happen to run at the same time which can impact resources
    intensive operations such as batch subcloud deployment. Low
    max_pool_size setting makes sense for on-demand services such as
    fm, not for services that perform regular audits.

    These settings will be re-assessed and adjusted when all DC
    scalability related features are complete.

    Closes-Bug: 1895605
    Change-Id: I138faa640933bd255d7ae90d3388733f35431e4d
    Signed-off-by: Tee Ngo <email address hidden>
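
The connect/disconnect thrashing described in this commit message can be illustrated with a toy model. The sketch below is not the dcorch or dcmanager code and does not use a real database. It assumes a simplified pool that keeps at most pool_size connections open between audit cycles and opens (then closes) any extra connections a burst needs. The function name and the burst and cycle numbers are hypothetical, and the actual max_pool_size values chosen by the fix are not shown in this report.

```python
def count_new_connections(pool_size: int, burst_size: int, cycles: int) -> int:
    """Count connections opened over a number of audit cycles.

    Toy model (assumption): each cycle needs `burst_size` connections at once.
    Connections already idle in the pool are reused; the rest are newly opened.
    After the cycle, at most `pool_size` connections stay open and the
    remainder are closed, so they must be reopened on the next cycle.
    """
    opened = 0
    idle = 0  # connections kept open in the pool between cycles
    for _ in range(cycles):
        opened += max(0, burst_size - idle)
        open_now = max(idle, burst_size)
        idle = min(pool_size, open_now)
    return opened


if __name__ == "__main__":
    # Small pool: almost every connection is reopened on every cycle.
    print(count_new_connections(pool_size=1, burst_size=10, cycles=5))   # 46
    # Pool sized for the burst: connections are opened once and reused.
    print(count_new_connections(pool_size=10, burst_size=10, cycles=5))  # 10
```

In this model a pool that is too small for the audit burst pays the connection setup cost on every cycle, which is consistent with the periodic postgres CPU spikes the commit message describes. Real pools also have overflow limits and timeouts that this sketch ignores.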

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
