StarlingX dashboard failed to retrieve system data on Cloud Overview page

Bug #1856740 reported by Tee Ngo
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Tyler Smith

Bug Description

Brief Description
-----------------
In a large distributed cloud, the system command to the SystemController region errors out when the response header grows too large because of too many endpoints. The issue appears to be limited to sysinv: requests to the dcmanager endpoint for subcloud data and to the fm endpoint for subcloud alarm statuses both work.

Severity
--------
Major

Steps to Reproduce
------------------
Set up a SystemController with a large number of subclouds in the database.
Either run the CLI command "system --os-region-name SystemController show" or navigate to the Cloud Overview page in the SystemController region on Horizon.

Expected Behavior
------------------
System data is returned and displayed

Actual Behavior
----------------
HTTP 400 Header Line Too Long

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Dec. 12th master load

Last Pass
---------
I don't think this scenario has ever been verified.

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system --debug --os-region-name SystemController show
DEBUG (base:187) Making authentication request to http://192.168.204.2:5000/v3/auth/tokens
DEBUG (connectionpool:207) Starting new HTTP connection (1): 192.168.204.2
DEBUG (connectionpool:395) http://192.168.204.2:5000 "POST /v3/auth/tokens HTTP/1.1" 201 825545
DEBUG (base:192) {"token": {"is_domain": false, "methods": ["password"], "roles": [{"id": "4e5f757616154010b739e3ddc4537bc9", "name": "admin"}, {"id": "0fba96199d834d94a659088f4e252101", "name": "reader"}, {"id": "1ac90aaaefe841539c0bdc336502ee93", "name": "member"}], "expires_at": "2019-12-16T20:16:26.000000Z", "project": {"domain": {"id": "default", "name": "Default"}, "id": "62b6390379cd4cbd8d68ff88bac5acbf", "name": "admin"}, "catalog": [{"endpoints": [{"url": "http://192.168.204.2:18002", "interface": "admin", "region": "RegionOne", "region_id": "RegionOne", "id": "117168607e904df291af71731ceabd55"}, {"url": "http://192.168.204.2:18002",
.................................
.................................. "type": "dcorch", "id": "4054335725014ae9b446f8912d25c979", "name": "dcorch"}], "user": {"domain": {"id": "default", "name": "Default"}, "password_expires_at": null, "name": "admin", "id": "3f6145ad8a04426693fbed6f184addff"}, "audit_ids": ["3Q93PAVDQEW2EnzZ2d3PTA"], "issued_at": "2019-12-16T19:16:26.000000Z"}}
DEBUG (utils:195) REQ: curl -i http://192.168.204.2:26385/v1/isystems -X GET -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: {SHA256}01beac0d4c3899268f8d65a401e1aa12dc39121925ebb94991c0f3c3f9d432ef"
connect: (192.168.204.2, 26385) ************
send: 'GET /v1/isystems HTTP/1.1\r\nHost: 192.168.204.2:26385\r\nuser-agent: Python-httplib2/0.9.2 (gzip)\r\ncontent-type: application/json\r\naccept-encoding: gzip, deflate\r\naccept: application/json\r\nx-auth-token: gAAAAABd99gKQ7PNYudk0ysFmlQe2Ug-kq7Bq9nVl-830aDOOnvpNetxuA2pma4SI2iM-p5AkhzupmrWGg8oHzM_fe1fMyiW9HvIIYvvNa7icW5D-Cdz_PIKnX0xDRpCAjDOwkXtISX2R3zcvp-Z6_qXiRNC__9AnnEAtFbjJE8V6ZOVoXnpAVM\r\n\r\n'
reply: 'HTTP/1.1 400 Header Line Too Long\r\n'
header: Content-Length: 0
header: Date: Mon, 16 Dec 2019 19:16:28 GMT
DEBUG (http:157) RESP:

WARNING (http:215) Request returned failure status.
expected string or buffer

Test Activity
-------------
Developer Testing

Tyler Smith (tyler.smith) wrote :

I took a brief look at this, and the solution seems to be to override the max_header_line setting in the sysinv wsgi setup with a higher value.
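The effect of such an override can be sketched as follows. This is an illustration only, not sysinv or WSGI-server code: the 8192-byte default cap, the header name X-Example-Large, and both helper functions are assumptions made for the example. It shows the same request being rejected under a small per-line cap and accepted under a larger one.

```python
# Illustrative only: this is not sysinv or WSGI-server code.
ASSUMED_DEFAULT_MAX_HEADER_LINE = 8192  # assumed cap, in bytes per header line


def status_for_request(raw_request: bytes,
                       max_header_line: int = ASSUMED_DEFAULT_MAX_HEADER_LINE) -> str:
    """Return the status a server with a per-line header cap would send."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    header_lines = head.split(b"\r\n")[1:]  # drop the request line
    for line in header_lines:
        if len(line) > max_header_line:
            return "400 Header Line Too Long"
    return "200 OK"


def build_request(headers: dict) -> bytes:
    """Serialize a GET /v1/isystems request with the given headers."""
    lines = ["GET /v1/isystems HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")


# X-Example-Large is a hypothetical header standing in for whatever grows.
request = build_request({
    "Host": "192.168.204.2:26385",
    "X-Example-Large": "x" * 20000,
})

print(status_for_request(request))                         # assumed default cap
print(status_for_request(request, max_header_line=65536))  # raised cap
```

A raised cap only moves the threshold: if the oversized header grows with the number of subclouds, a large enough deployment would exceed any fixed value.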

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - seems to be related to DC scalability, so it is not a must for stx.3.0 for now.

tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Tyler Smith (tyler.smith)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/701828

Changed in starlingx:
status: Triaged → In Progress
Matt Peters (mpeters-wrs) wrote :

It isn't clear why the header line length depends on the number of subclouds. Which header field is being exhausted when the request is made to query the subclouds?

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/701828
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=130919c096d3699022f9994957253e11b861b834
Submitter: Zuul
Branch: master

commit 130919c096d3699022f9994957253e11b861b834
Author: Tyler Smith <email address hidden>
Date: Thu Jan 9 15:12:22 2020 -0500

    Removing service catalog insertion from dcorch proxy

    requests going through the dcorch proxy were having the entire service
    catalog tacked on during the authtoken filter stage, this was resulting
    in the header size growing too large for sysinv to handle the forwarded
    requests.

    This commit sets keystone_authtoken/include_service_catalog to False in
    the dcorch settings to prevent this.

    Tested by installing a subcloud, bringing online and managing, as well
    as doing sysinv queries to SystemController. I've tested with 200
    subclouds in dcmanager without issue.

    Change-Id: Ic47c062bd8b5376084d27a9378c131650d9ec2da
    Closes-Bug: 1856740
    Signed-off-by: Tyler Smith <email address hidden>
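A rough sketch of why the forwarded header scales with the number of subclouds, and what turning off include_service_catalog changes. It assumes the authtoken filter passes the catalog along as a JSON string in an X-Service-Catalog header (as keystonemiddleware does when the option is enabled), that each subcloud contributes its own region to the catalog, and that the receiving server caps header lines at 8192 bytes. The catalog layout, the cap, and all function names are made up for illustration.

```python
import json

# Assumed cap on a single header line; the real value depends on the server setup.
MAX_HEADER_LINE = 8192


def fake_catalog(num_subclouds: int) -> list:
    """Build a made-up catalog: one service, three endpoints per region."""
    regions = ["RegionOne", "SystemController"]
    regions += [f"subcloud{i}" for i in range(num_subclouds)]
    endpoints = [
        {
            "url": "http://192.168.204.2:18002",
            "interface": interface,
            "region": region,
            "region_id": region,
            "id": "0" * 32,  # placeholder endpoint id
        }
        for region in regions
        for interface in ("admin", "internal", "public")
    ]
    return [{"type": "dcorch", "name": "dcorch", "endpoints": endpoints}]


def forwarded_headers(token: str, catalog: list,
                      include_service_catalog: bool) -> dict:
    """Headers a proxy might forward after the authtoken stage."""
    headers = {"X-Auth-Token": token}
    if include_service_catalog:
        headers["X-Service-Catalog"] = json.dumps(catalog)
    return headers


def longest_header_line(headers: dict) -> int:
    """Length in characters of the longest 'Name: value' line."""
    return max(len(f"{name}: {value}") for name, value in headers.items())


token = "g" * 200  # stand-in for a Fernet token
for subclouds in (0, 200):
    catalog = fake_catalog(subclouds)
    for include in (True, False):
        size = longest_header_line(forwarded_headers(token, catalog, include))
        verdict = "too long" if size > MAX_HEADER_LINE else "ok"
        print(f"subclouds={subclouds} include={include} longest={size} {verdict}")
```

With the catalog excluded, the longest header line no longer depends on the subcloud count, which is consistent with the 200-subcloud test described in the commit message.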

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705852

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/705852
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=e1f095eb112f76a133734a17f01afeb9828ebaf2
Submitter: Zuul
Branch: f/centos8

commit fc7b9b3d8d811fd50427b584dae5b7488947bb03
Author: Angie Wang <email address hidden>
Date: Tue Jan 28 13:57:52 2020 -0500

    Fix the image download failure on IPv6 system

    "crictl pull" failed to pull images on IPv6 system with
    proxy setting since Containerd doesn't work with the
    NO_PROXY environment variable that has IPv6 addresses
    with square brackets. This commit updates to strip out
    the square brackets from NO_PROXY environment variable.

    Change-Id: I6bb5ad0379f576f66d77a90dfdca94f5e0f28f0c
    Closes-Bug: 1859835
    Signed-off-by: Angie Wang <email address hidden>

commit 950670ac1f0bfaa43e29eeb3ffda71a94de66520
Author: Jim Somerville <email address hidden>
Date: Mon Jan 27 17:09:52 2020 -0500

    Security: Add nospectre_v1 to the security params

    Most of the v1 mitigation is baked into the kernel and not
    optional. The swapgs barriers are, however, optional.
    They have a negative performance impact so we disable them
    by using the nospectre_v1 kernel bootarg.

    Partial-Bug: 1860193
    Depends-On: https://review.opendev.org/#/c/704406
    Change-Id: Iaa11ba3f430fc064ebda679cf290474d3be413da
    Signed-off-by: Jim Somerville <email address hidden>

commit 83775d38804fb665af518127051b37a1daf31e36
Author: David Sullivan <email address hidden>
Date: Wed Jan 15 23:50:23 2020 -0500

    Install secondary controller nodes with kubeadm join

    Kubeadm init is no longer supported for installing secondary nodes in an
    HA kubernetes cluster. kubeadm join with the --controller-plane option
    should be used.

    Change-Id: I21a30b9e871d05c59a19e33a9d278f0217682da6
    Closes-Bug: 1846829
    Depends-On: https://review.opendev.org/702797
    Signed-off-by: David Sullivan <email address hidden>

commit c94fa4a0174b96e0716d39bbea7e6fbbbee415a9
Author: Shuicheng Lin <email address hidden>
Date: Thu Jan 23 02:45:31 2020 +0800

    Fix duplex system controller-1 fail to boot after unlock

    It is due to controller-1 doesn't have /opt/platform/config folder.
    And cause puppet failure due to using non-exist file as source.
    Restrict the code for worker node only, since controller node
    already has ca cert in the ssl folder.

    Test:
    Pass simplex/duplex/multi node deployment with vm created.

    Closes-Bug: 1860529
    Change-Id: I808ee15e5c78ebead114219d0ec428fb45cc9128
    Signed-off-by: Shuicheng Lin <email address hidden>

commit 27f167eb14a04bc67ecca59af3b617c115522101
Author: Angie Wang <email address hidden>
Date: Wed Jan 15 16:15:26 2020 -0500

    Remove puppet-manifests code made obsolete by ansible

    As a result of switch to Ansible, remove the obsolete erb
    templates and remove the dependency of is_initial_config_primary
    facter.

    Change-Id: I4ca6525f01a37da971dc66a11ee99ea4e115e3ad
    Partial-Bug: 1834218
    Depends-On: https://review.opendev.org/#/c/703517/
 ...


tags: added: in-f-centos8
Kristine Bujold (kbujold) wrote :

The service catalog is still being sent. It's visible with the --debug option.

system --debug --os-region-name SystemController show
DEBUG (base:187) Making authentication request to http://192.168.204.2:5000/v3/auth/tokens
DEBUG (connectionpool:207) Starting new HTTP connection (1): 192.168.204.2
DEBUG (connectionpool:395) http://192.168.204.2:5000 "POST /v3/auth/tokens HTTP/1.1" 201 498714
DEBUG (base:192) {"token": {"is_domain": false, "methods": ["password"], "roles": [{"id": "ad95536cec0043d59af7d20c71ac4555", "name": "admin"}, {"id": "88c034f321914a9f9026365e9270c0f3", "name": "reader"}, {"id": "67d423940ab745a2b6b0fc91d419b874", "name": "member"}], "expires_at": "2020-02-26T17:49:49.000000Z", "project": {"domain": {"id": "default", "name": "Default"}, "id": "089df81258334085b5dc33fc5ec76caf", "name": "admin"}, "catalog": [{"endpoints": [{"url": "http://192.168.204.2:18002", "interface": "admin", "region": "RegionOne", "region_id": "RegionOne", "id": "d82a567bf6034c74a347ad2d21628057"}, {"url": "http://192.168.204.2:18002", "interface": "internal", "region": "RegionOne", "region_id": "RegionOne", "id": "96aec0bcb4f04cf8831ef8331ec1f15c"},
<snip>
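The JSON on the DEBUG (base:192) line follows the POST to /v3/auth/tokens, so it appears to be the Keystone token response body rather than a header on the forwarded sysinv request; that reading is an interpretation and is not confirmed in this report. A minimal sketch for quantifying the catalog in such a body, using a tiny made-up sample token and a hypothetical helper name:

```python
import json


def summarize_catalog(token_body: str) -> dict:
    """Summarize the service catalog embedded in a Keystone v3 token body."""
    token = json.loads(token_body)["token"]
    catalog = token.get("catalog", [])
    endpoints = [ep for service in catalog for ep in service.get("endpoints", [])]
    return {
        "services": len(catalog),
        "endpoints": len(endpoints),
        "regions": sorted({ep["region"] for ep in endpoints}),
        "catalog_bytes": len(json.dumps(catalog)),
    }


# Made-up sample shaped like the debug output above; ids are placeholders.
sample_body = json.dumps({
    "token": {
        "methods": ["password"],
        "catalog": [{
            "type": "dcorch",
            "name": "dcorch",
            "id": "placeholder-service-id",
            "endpoints": [
                {"url": "http://192.168.204.2:18002", "interface": "admin",
                 "region": "RegionOne", "region_id": "RegionOne",
                 "id": "placeholder-endpoint-1"},
                {"url": "http://192.168.204.2:18002", "interface": "internal",
                 "region": "RegionOne", "region_id": "RegionOne",
                 "id": "placeholder-endpoint-2"},
            ],
        }],
    }
})

summary = summarize_catalog(sample_body)
print(summary)
```

Running the same summary on the full debug output, without the snipped portion, would give the endpoint and region counts behind the 498714-byte response.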
