ovb-containers job logs excessive amounts of data

Bug #1698172 reported by Ben Nemec on 2017-06-15
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

Per http://paste.openstack.org/show/612724/, this job is by far the largest consumer of log storage space. On a per-job basis it saves ~15x the data of the next largest consumer (a job which is run far more often). This is putting significant pressure on the infra log storage system and needs to stop.

Tags: ci
Changed in tripleo:
assignee: nobody → Steve Baker (steve-stevebaker)
Steve Baker (steve-stevebaker) wrote :

I'm doing some analysis now

Steve Baker (steve-stevebaker) wrote :

I've not seen a smoking gun regarding gate-tripleo-ci-centos-7-ovb-containers-oooq-nv yet, but these changes may help with the general size of a few jobs:

https://review.openstack.org/#/q/status:open+project:openstack-infra/tripleo-ci+branch:master+topic:bug/1698172

How was the paste http://paste.openstack.org/show/612724/ derived? I'm trying to understand whether gzipped or raw sizes are involved here. For example, here[1] is a typical postci.txt.gz which is 171KB gzipped, but 22MB in its natural form (!!)

[1] http://logs.openstack.org/37/471537/8/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/f721694/logs/postci.txt.gz
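
The gzipped-versus-raw question can be checked on any single downloaded log. A minimal sketch, assuming a local copy of a .gz file; the function name gzip_sizes is hypothetical and this is not how the numbers in the paste were produced:

```python
import gzip
import os

def gzip_sizes(path):
    """Return (compressed_bytes, raw_bytes) for a gzip file at `path`."""
    compressed = os.path.getsize(path)
    raw = 0
    with gzip.open(path, "rb") as fh:
        # Read in 1 MiB chunks so a large log is never held in memory whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            raw += len(chunk)
    return compressed, raw
```

Calling gzip_sizes("postci.txt.gz") on a downloaded copy would give both figures side by side, which makes it clear which of the two a storage report is counting.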

wes hayutin (weshayutin) wrote :

Hey Ben,
How and where did you get the information listed in http://paste.openstack.org/show/612724/ ?
It would be really helpful to know how to do that so we can keep track.

My guess about the containers ovb job: when that job originally made it upstream there were only 5 or so containers on the compute node, and none on the controller. I suspect that as the number of containers has grown, so too have the logs.

Reviewing what can be removed from the logs is fairly critical. The configuration can be found here:
https://github.com/openstack-infra/tripleo-ci/blob/master/toci-quickstart/config/collect-logs.yml

Ben Nemec (bnemec) wrote :

The numbers were collected by infra. This is what I know about them:

<clarkb> EmilienM: so fungi did a random but representative sampling of disk usage on the logs filesystem. And collected usage by job name. http://paste.openstack.org/show/612724/ is the result of that
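
How infra actually gathered these numbers is not described beyond the quote above. As a rough sketch of the general idea (summing file sizes per job name), the following assumes a local log tree laid out like the logs.openstack.org paths in this report, where one path component is the job name and starts with "gate-". The function name usage_by_job and the prefix rule are assumptions for illustration:

```python
import os
from collections import Counter

def usage_by_job(root, prefix="gate-"):
    """Sum apparent file sizes under `root`, grouped by job name.

    Assumption: the job name is the first path component (relative to
    `root`) that starts with `prefix`. Files outside any such directory
    are ignored.
    """
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        job = next((p for p in parts if p.startswith(prefix)), None)
        if job is None:
            continue
        for name in filenames:
            # getsize reports apparent size, not blocks used on disk.
            totals[job] += os.path.getsize(os.path.join(dirpath, name))
    return totals
```

Sorting the result with totals.most_common() gives a ranking comparable in shape to the paste, though the absolute numbers would depend on the sample and on whether compressed or raw sizes are on disk.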

Reviewed: https://review.openstack.org/474891
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=74595a73210bb9cf5e0d688c57e9fa5423422603
Submitter: Jenkins
Branch: master

commit 74595a73210bb9cf5e0d688c57e9fa5423422603
Author: Martin André <email address hidden>
Date: Fri Jun 16 10:10:54 2017 +0200

    Make a copy of files touched by puppet in container

    This should help determine what exactly needs to be bind mounted in the
    container and should also help limit the size of collected logs in CI,
    as collecting the entire /etc directory from each container can grow
    pretty quickly in size and is not that useful.

    Related-Bug: #1698172
    Change-Id: Ie2bded39cdb82a72f0c28f1c552403cd11b5af45

Reviewed: https://review.openstack.org/474896
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=2cb712c95f9b9b75246c5b0051729989aecb44a0
Submitter: Jenkins
Branch: master

commit 2cb712c95f9b9b75246c5b0051729989aecb44a0
Author: Martin André <email address hidden>
Date: Fri Jun 16 10:24:07 2017 +0200

    Limit collection of config-data to puppet-generated files

    This should give us all the information we need to debug CI failures
    while keeping the size of collected data relatively small.

    Change-Id: If811682c3312c86f0c407e880be24ad71d6ea72b
    Related-Bug: #1698172
    Depends-On: Ie2bded39cdb82a72f0c28f1c552403cd11b5af45

Emilien Macchi (emilienm) wrote :

Removing the alert because we had excellent progress on the problem.

Clark from OpenStack Infra said the improvement was great but not finished yet. Indeed, we reduced the size of the logs by a factor of 5, but there is still room for improvement:

* consider using xz on journald
* stop grabbing everything out of /etc
* use xz on non human readable files
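
Whether xz is worth it over gzip for a given file can be measured before changing the collection config. A minimal sketch using Python's standard gzip and lzma modules (lzma.compress writes the .xz container by default); the function name compare_compression is hypothetical, and real savings depend on the file contents:

```python
import gzip
import lzma

def compare_compression(data):
    """Return the raw, gzip, and xz sizes in bytes for `data`."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "xz": len(lzma.compress(data)),  # default format is FORMAT_XZ
    }
```

Feeding it the bytes of a collected journal export or a binary file from a job would show whether the extra CPU time for xz buys a meaningful size reduction.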

tags: removed: alert
Changed in tripleo:
status: Triaged → In Progress
importance: Critical → High

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/477881

Changed in tripleo:
importance: High → Critical

Reviewed: https://review.openstack.org/474269
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c011a34f5e84ffadb55e0a0583897ab59c5bd4da
Submitter: Jenkins
Branch: master

commit c011a34f5e84ffadb55e0a0583897ab59c5bd4da
Author: Bogdan Dobrelya <email address hidden>
Date: Fri Jun 9 18:03:50 2017 +0200

    Improve logs from ansible, puppet, docker-puppet.py

    * Debug ansible 'puppet apply' stderr joined stdout, split
      by lines.
    * Do 'puppet apply' w/o colors, logdest syslog, and given a wanted
      modulepath instead of the module puppet, that can't support those
      options.
    * Bind-mount syslog socket for docker-puppet.py to pass puppet logs
      to host OS syslog.
    * Fix logging handlers for multiprocess workers in docker-puppet.py.

    Related-bug: #1698172
    Closes-bug: #1700086

    Change-Id: I84112a836e968aa5c3596a6544e0392980529963
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in tripleo:
milestone: pike-3 → pike-rc1

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/477569

Changed in tripleo:
assignee: Steve Baker (steve-stevebaker) → nobody
status: In Progress → Fix Released