Distributed Cloud: Many clients received http 500 errors in batch deployment

Bug #1865573 reported by Tee Ngo
This bug affects 1 person
Affects     Status         Importance   Assigned to         Milestone
StarlingX   Fix Released   Medium       Jessica Castelino

Bug Description

Brief Description
-----------------
The add subcloud REST API returns HTTP 500 errors when many add subcloud requests are submitted simultaneously.

Severity
--------
Major - the add subcloud requests are still processed by the backend, but the errors give users the impression that their requests have failed. This also breaks automated batch subcloud deployment.

Steps to Reproduce
------------------
Send 50 add subcloud requests via CLI or REST API simultaneously

Expected Behavior
-----------------
REST API request - HTTP 200 response
CLI request - a confirmed response with subcloud UUID

Actual Behavior
---------------
REST API - HTTP 500 response
CLI - ERROR Unable to add subcloud

A sample timeout traceback from dcmanager.log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/dcmanager/api/controllers/v1/subclouds.py", line 447, in post
    return self.rpc_client.add_subcloud(context, payload)
  File "/usr/lib/python2.7/site-packages/dcmanager/rpc/client.py", line 68, in add_subcloud
    payload=payload))
  File "/usr/lib/python2.7/site-packages/dcmanager/rpc/client.py", line 56, in call
    return client.call(ctxt, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 465, in call
    return self.prepare().call(ctxt, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
    retry=self.retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
    retry=retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 335, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID c654c39f48624cffbb56e3a866b92bea
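
For readers unfamiliar with this failure mode, here is a minimal stdlib-only sketch of what the traceback suggests: the caller waits for a reply with a timeout, gives up and reports an error, while the worker keeps going and finishes anyway. It does not use oslo.messaging or any dcmanager code; `slow_add_subcloud`, `blocking_call`, the timings, and the status tuples are all hypothetical names and values chosen for illustration.

```python
import queue
import threading
import time

completed = []  # names of subclouds whose (simulated) add finished


def slow_add_subcloud(name, work_seconds=0.3):
    """Hypothetical handler: stands in for the long-running backend work."""
    time.sleep(work_seconds)
    completed.append(name)
    return name


def blocking_call(handler, name, timeout):
    """Wait up to `timeout` seconds for the handler's reply, like a synchronous RPC."""
    replies = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(handler(name)))
    worker.start()
    try:
        return 200, replies.get(timeout=timeout), worker
    except queue.Empty:
        # The caller gives up here, but the worker thread keeps running.
        return 500, "Unable to add subcloud", worker


if __name__ == "__main__":
    status, body, worker = blocking_call(slow_add_subcloud, "subcloud1", timeout=0.05)
    print(status, body)   # the caller sees a failure
    worker.join()
    print(completed)      # yet the simulated add still completed
```

With a timeout shorter than the work, the caller gets the error status even though the subcloud name still ends up in `completed`, which mirrors the mismatch described under Severity.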

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
Feb. 22 master

Last Pass
---------
Not certain when this test case was verified

Timestamp/Logs
--------------
dcmanager logs attached

Test Activity
-------------
Evaluation

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - the issue appears to be tied to the number of subclouds being added simultaneously. It should be investigated and addressed.

tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Tao Liu (tliu88)
Bart Wensley (bartwensley) wrote :

Adding some more info based on investigation and discussions...

The handling of the add_subcloud by the dcmanager needs to be asynchronous in order to support parallel add requests. This will require a bit of restructuring (the addition of the subcloud to the dcmanager DB would have to happen in the API) and we may need to look at adding a new state for this “adding” phase.
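
A rough sketch of the restructuring described above, using only the Python standard library: the request handler records the subcloud in an "adding" state, hands the slow work to a background thread, and returns right away. This is not the dcmanager implementation; `subcloud_db`, `add_subcloud_async`, `_deploy`, and the "complete" state name are hypothetical. In oslo.messaging terms the hand-off would roughly correspond to a `cast` (no reply expected) instead of a `call` (blocks for a reply), but whether the merged change does exactly that is not shown in this report.

```python
import threading
import time
import uuid

subcloud_db = {}            # hypothetical in-memory stand-in for the dcmanager DB
db_lock = threading.Lock()  # guards subcloud_db across threads


def _deploy(subcloud_id, work_seconds):
    """Simulated long-running backend work, run outside the request path."""
    time.sleep(work_seconds)
    with db_lock:
        subcloud_db[subcloud_id]["state"] = "complete"  # assumed state name


def add_subcloud_async(name, work_seconds=0.2):
    """Create the DB record in the API layer, then return without waiting."""
    subcloud_id = str(uuid.uuid4())
    with db_lock:
        subcloud_db[subcloud_id] = {"name": name, "state": "adding"}
    worker = threading.Thread(target=_deploy, args=(subcloud_id, work_seconds))
    worker.start()
    # The caller gets the UUID immediately; progress is read from the DB state.
    return 200, subcloud_id, worker


if __name__ == "__main__":
    results = [add_subcloud_async("subcloud%d" % i) for i in range(50)]
    print(all(status == 200 for status, _, _ in results))
    for _, _, worker in results:
        worker.join()
    print({rec["state"] for rec in subcloud_db.values()})
```

Because the response no longer depends on how long the deployment takes, 50 parallel requests each get an immediate answer with a UUID, and clients can poll the record's state to learn the outcome.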

Changed in starlingx:
assignee: Tao Liu (tliu88) → Jessica Castelino (jcasteli)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724926

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/724926
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=ae73f0dd585c53c0242d6da7f7ef717e24ec9ddd
Submitter: Zuul
Branch: master

commit ae73f0dd585c53c0242d6da7f7ef717e24ec9ddd
Author: Jessica Castelino <email address hidden>
Date: Thu Apr 30 10:15:35 2020 -0400

    Make the add_subcloud RPC call asynchronous

    The add subcloud REST API returns HTTP 500 errors although the
    requests are being processed in the backend. This is because the
    work done while handling the add_subcloud RPC can take too long
    and cause the RPC to time out, which causes the POST to fail,
    even though dcmanager-manager continues to add the subcloud. Thus,
    a fix is added to make the add_subcloud RPC call asynchronous.

    Change-Id: I89d9ce8367ef124c77869dff309f6bb3a621df34
    Closes-Bug: 1865573
    Signed-off-by: Jessica Castelino <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729815

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/729815
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=a39415ff84d71a54580eb8cbe885647e69279306
Submitter: Zuul
Branch: f/centos8

commit 5c8377047ba679ce88a0360315df494a74208bbf
Author: Tao Liu <email address hidden>
Date: Tue May 5 08:59:59 2020 -0500

    Move subcloud audit to separate process

    Remove subcloud audit from dcmanager-manager process.
    Create dcmanager-audit process & associated files.
    Add new RPC calls for dcmanager-audit to notify dcmanager
    subcloud availability and sync endpoint type changes.
    Update dcmanager to handle availability and sync endpoint
    type updates from dcmanager-audit.
    Subcloud audit interval will be reduced to 20 seconds.
    Create/update unit tests, to verify the implementation
    changes.

    Story: 2007267
    Task: 39637

    Change-Id: Iff408166753f22ce3616d34e267ca1155ac43042
    Signed-off-by: Tao Liu <email address hidden>

commit d46516c46d24f8fea6ea71fdbe2f3fa2d296eb4d
Author: albailey <email address hidden>
Date: Wed May 13 14:00:12 2020 -0500

    Enable python3 unit tests as part of zuul

    The existing py27 unit tests were not all passing in py36,
    however now they are and so the zuul check and gate for py36
    have been added.

    Change-Id: Ie293ec69a04e6fd657f960aa9a135c428138b4b4
    Story: 2004515
    Task: 39768
    Signed-off-by: albailey <email address hidden>

commit dbf603b4c45e865e56a21d3b14d94bcc8d5f455c
Author: albailey <email address hidden>
Date: Mon May 11 10:29:45 2020 -0500

    Reduce the number of suppressed pylint warnings

    All pylint warnings were being suppressed by a wildcard.
    This commit only suppresses the warnings that are failing and
    prevents checks that would pass from being broken in later commits.

    The warnings being suppressed can be resolved individually
    by later submissions based on priority where appropriate.

    This commit also specifies python3 for pylint which has
    more recent checks.

    Change-Id: Ie29aeb0ea3e9dcb671af67f38e9a3f919ea7111e
    Story: 2004515
    Task: 39734
    Signed-off-by: albailey <email address hidden>

commit 15fb58f45c0f552eccd9c27ba023dbea560f27f2
Author: albailey <email address hidden>
Date: Fri May 8 13:20:23 2020 -0500

    Enhance Upgrade strategy to use endpoint audit status

    The distributed cloud audit was updated to include 'load'
    endpoint status, so the upgrade strategy is now able to
    make use of that information when constructing a strategy.

    Change-Id: I69eb4d98b9abf38b329e13fb116fc098db2bd736
    Story: 2007403
    Task: 39736
    Signed-off-by: albailey <email address hidden>

commit acc710093bb1c9581670d20b62c68ad669b3a3ad
Author: MCamp859 <email address hidden>
Date: Mon May 11 14:28:28 2020 -0400

    Minor edits to test docs promote issue

    Change-Id: I08ffc5e57b5b04c59102a6491f2fcdc256f16e0f
    Signed-off-by: MCamp859 <email address hidden>

commit 6d4fa855462cc6faf2e962f9d825b832f2885aa3
Author: Tee Ngo <email address hidden>
Date: Mon May 4 23:53:20 2020 -0400

    Extend subcloud audit...

tags: added: in-f-centos8