Distributed Cloud logs rotating out in less than a day with 200 subclouds

Bug #1928335 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Al Bailey
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The distributed cloud logs (/var/log/dcmanager/dcmanager.log and /var/log/dcmanager/audit.log) are rotating out in less than one day when 200 subclouds are configured. This will make debugging customer issues extremely difficult as the logs may no longer exist when they are collected.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Install a distributed cloud system with 200 subclouds.

Expected Behavior
-----------------
In normal conditions, the system should retain two weeks of logs.

Actual Behavior
---------------
Less than a day of distributed cloud logs are being retained.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
SW_VERSION="21.05"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2021-05-05_20-00-10"
SRC_BUILD_ID="20"
BUILD_DATE="2021-05-05 20:02:01 -0400"

Last Pass
---------
Unknown

Timestamp/Logs
--------------
[root@controller-0 dcmanager(keystone_admin)]# ll
total 34252
drwxr-xr-x 2 root root 24576 May 8 12:31 ansible
rw-r---- 1 root root 1244263 May 10 18:07 audit.log
rw-r---- 1 root root 707391 May 10 11:52 audit.log.10.gz
rw-r---- 1 root root 704436 May 10 11:11 audit.log.11.gz
rw-r---- 1 root root 720434 May 10 10:30 audit.log.12.gz
rw-r---- 1 root root 704373 May 10 09:48 audit.log.13.gz
rw-r---- 1 root root 721350 May 10 09:07 audit.log.14.gz
rw-r---- 1 root root 699986 May 10 08:25 audit.log.15.gz
rw-r---- 1 root root 702828 May 10 07:44 audit.log.16.gz
rw-r---- 1 root root 699267 May 10 07:02 audit.log.17.gz
rw-r---- 1 root root 702918 May 10 06:21 audit.log.18.gz
rw-r---- 1 root root 701166 May 10 05:40 audit.log.19.gz
rw-r---- 1 root root 705920 May 10 18:02 audit.log.1.gz
rw-r---- 1 root root 704129 May 10 04:59 audit.log.20.gz
rw-r---- 1 root root 702416 May 10 17:21 audit.log.2.gz
rw-r---- 1 root root 708640 May 10 16:40 audit.log.3.gz
rw-r---- 1 root root 720314 May 10 15:59 audit.log.4.gz
rw-r---- 1 root root 700632 May 10 15:17 audit.log.5.gz
rw-r---- 1 root root 704501 May 10 14:36 audit.log.6.gz
rw-r---- 1 root root 697963 May 10 13:55 audit.log.7.gz
rw-r---- 1 root root 705210 May 10 13:14 audit.log.8.gz
rw-r---- 1 root root 700696 May 10 12:33 audit.log.9.gz
rw-r---- 1 root root 10000278 May 10 18:07 dcmanager.log
rw-r---- 1 root root 469352 May 10 07:54 dcmanager.log.10.gz
rw-r---- 1 root root 453641 May 10 06:52 dcmanager.log.11.gz
rw-r---- 1 root root 455146 May 10 05:50 dcmanager.log.12.gz
rw-r---- 1 root root 478895 May 10 04:48 dcmanager.log.13.gz
rw-r---- 1 root root 450988 May 10 03:46 dcmanager.log.14.gz
rw-r---- 1 root root 448862 May 10 02:44 dcmanager.log.15.gz
rw-r---- 1 root root 462944 May 10 01:43 dcmanager.log.16.gz
rw-r---- 1 root root 451840 May 10 00:42 dcmanager.log.17.gz
rw-r---- 1 root root 464016 May 9 23:41 dcmanager.log.18.gz
rw-r---- 1 root root 445211 May 9 22:39 dcmanager.log.19.gz
rw-r---- 1 root root 445706 May 10 17:10 dcmanager.log.1.gz
rw-r---- 1 root root 520472 May 9 21:37 dcmanager.log.20.gz
rw-r---- 1 root root 453725 May 10 16:09 dcmanager.log.2.gz
rw-r---- 1 root root 458233 May 10 15:07 dcmanager.log.3.gz
rw-r---- 1 root root 474471 May 10 14:05 dcmanager.log.4.gz
rw-r---- 1 root root 451344 May 10 13:04 dcmanager.log.5.gz
rw-r---- 1 root root 452953 May 10 12:02 dcmanager.log.6.gz
rw-r---- 1 root root 464730 May 10 11:00 dcmanager.log.7.gz
rw-r---- 1 root root 465153 May 10 09:58 dcmanager.log.8.gz
rw-r---- 1 root root 466608 May 10 08:56 dcmanager.log.9.gz
rw-r---- 1 root root 357444 May 8 03:04 orchestrator.log
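
The rotation interval and the retained window can be estimated from the listing above. The following is a rough sketch and not part of the original report. The function name `retention_stats` is hypothetical, the year (2021) is assumed from BUILD_DATE because `ll` does not show it, and the timestamps are the modification times of the oldest (.20.gz) and newest (.1.gz) rotated files.

```python
from datetime import datetime

# (oldest rotated file mtime, newest rotated file mtime, number of rotated files)
# Values copied from the directory listing; the year 2021 is an assumption.
ROTATIONS = {
    "audit.log": ("2021-05-10 04:59", "2021-05-10 18:02", 20),
    "dcmanager.log": ("2021-05-09 21:37", "2021-05-10 17:10", 20),
}


def retention_stats(oldest, newest, count):
    """Return (span in hours, average minutes between rotations).

    The span is the time between the oldest and newest rotation, so it
    slightly underestimates the retained window (it ignores the live file
    and the period covered by the oldest archive).
    """
    fmt = "%Y-%m-%d %H:%M"
    span = datetime.strptime(newest, fmt) - datetime.strptime(oldest, fmt)
    span_minutes = span.total_seconds() / 60
    # 'count' files mean 'count - 1' intervals between rotations.
    return span_minutes / 60, span_minutes / (count - 1)


if __name__ == "__main__":
    for name, (oldest, newest, count) in ROTATIONS.items():
        hours, interval = retention_stats(oldest, newest, count)
        print(f"{name}: ~{hours:.1f} h retained, rotating every ~{interval:.0f} min")
```

With these inputs the 20 rotated files span roughly 13 hours for audit.log and roughly 19.5 hours for dcmanager.log, which is consistent with the "less than a day" observation.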

Test Activity
-------------
Developer Testing

Workaround
----------
None

tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)
Al Bailey (albailey1974) wrote :

The fix submitted for https://bugs.launchpad.net/starlingx/+bug/1928333
(review: https://review.opendev.org/c/starlingx/distcloud/+/791244)
prevents the audit from auditing too much every 30 seconds, which should help.

I am going to look at which other logs can be revised.

Ghada Khalil (gkhalil) wrote :

screening: stx.6.0 / medium - fix will help w/ debugging of issues as logs can be retained for a longer period of time

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.6.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/791547

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/791547
Committed: https://opendev.org/starlingx/distcloud/commit/077ab926e388d477a788e3cf306e2b7443c3927e
Submitter: "Zuul (22348)"
Branch: master

commit 077ab926e388d477a788e3cf306e2b7443c3927e
Author: albailey <email address hidden>
Date: Fri May 14 13:48:23 2021 -0500

    Updating some of the dcmanager and audit logs

    On startup of any process, the entire configuration was
    being logged to the file (almost 200 lines per process).
    - Lowering that log to DEBUG.
    - Added an extra 'Starting...' log entry to help indicate
    when a process start/restart occurs.

    The update_subcloud_endpoint_status log was not including
    enough information, as it is called several times in a row
    for each endpoint.

    Two audit logs that are created every 30 seconds have been reduced
    from info to debug.

    Closes-Bug: 1928335
    Change-Id: I7ada5abf87c2f28f5826c02345f8dd3197eae665
    Signed-off-by: albailey <email address hidden>
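
The startup logging change described in the commit message above can be illustrated as follows. This is a minimal sketch using only the Python standard library `logging` module, not the actual dcmanager code: the logger name, `log_startup`, `count_emitted`, and the sample configuration are all hypothetical.

```python
import logging

LOG = logging.getLogger("dcmanager.sketch")  # hypothetical logger name


def log_startup(service_name, config):
    """Emit one INFO marker per start, and the full config dump at DEBUG."""
    LOG.info("Starting %s...", service_name)
    for key in sorted(config):
        # Previously this kind of line would have been logged at INFO.
        LOG.debug("%s = %s", key, config[key])


def count_emitted(level, service_name, config):
    """Return how many records are emitted when the logger runs at 'level'."""
    records = []

    class Collect(logging.Handler):
        def emit(self, record):
            records.append(record)

    handler = Collect()
    LOG.addHandler(handler)
    LOG.setLevel(level)
    try:
        log_startup(service_name, config)
    finally:
        LOG.removeHandler(handler)
    return len(records)


if __name__ == "__main__":
    # About 200 options per process, as mentioned in the commit message.
    sample = {f"option_{i}": i for i in range(200)}
    print("INFO :", count_emitted(logging.INFO, "dcmanager-audit", sample))
    print("DEBUG:", count_emitted(logging.DEBUG, "dcmanager-audit", sample))
```

At the INFO level only the single "Starting..." entry is written per process start, while the full configuration remains available when DEBUG is enabled.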

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud (RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8