snapshot dump timeout with /var/log/ - 16G

Bug #1546023 reported by Sergey Galkin
This bug affects 4 people
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Georgy Kibardin
  Mitaka            Won't Fix      High        Georgy Kibardin

Bug Description

Steps to reproduce
1. Create big env (~ 190 nodes)
2. Wait until /var/log/ grows to 16G
3. Try to create snapshot

Snapshot creation failed with the error 'Dump is timed out'

As far as I can see, dump creation dies during tar cJvf /var/dump/fuel-snapshot-2016-02-16_09-32-22.tar.xz -C /var/dump fuel-snapshot-2016-02-16_09-32-22

In parallel I tried to create a gz archive, and after it finished I see this (xz is still working)
[root@fuel dump]# du -sh *
15G fuel-snapshot-2016-02-16_09-32-22
7.6G fuel-snapshot-2016-02-16_09-32-22.tar.gz
1.1G fuel-snapshot-2016-02-16_09-32-22.tar.xz
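
A comparison like the one above can be repeated with a small timing helper. This is only a sketch: the function name bench_tar and its parameters are introduced here for illustration, and the output path in the usage line is an arbitrary choice.

```shell
#!/bin/bash
# Time one tar invocation and report elapsed seconds and archive size.
# $1: directory to pack
# $2: tar compression flag (z = gzip, j = bzip2, J = xz)
# $3: path of the archive to create
bench_tar() {
    local src=$1 flag=$2 out=$3 start end
    start=$(date +%s)
    # Pack the directory relative to its parent, as the snapshot tar call does.
    tar "c${flag}f" "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    end=$(date +%s)
    echo "$out: $((end - start)) s, $(du -sh "$out" | cut -f1)"
}

# Usage (not run here):
#   bench_tar /var/dump/fuel-snapshot-2016-02-16_09-32-22 z /tmp/snapshot.tar.gz
#   bench_tar /var/dump/fuel-snapshot-2016-02-16_09-32-22 J /tmp/snapshot.tar.xz
```

The helper measures wall-clock time only, so results depend on whatever else is running on the host at the same time.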

Fuel is a KVM host with 16G RAM and 6 cores of Intel Xeon E312xx (Sandy Bridge)
Snapshot - http://mos-scale-share.mirantis.com/fuel-snapshot-2016-02-16_09-32-22.tar.gz

VERSION:
  feature_groups:
    - experimental
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "1438"
  build_id: "1438"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Changed in fuel:
status: New → Confirmed
tags: added: area-build
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Python Team (fuel-python)
importance: Undecided → Medium
Roman Vyalov (r0mikiam)
tags: added: area-python
removed: area-build
Nastya Urlapova (aurlapova) wrote :

Python team, such bad UX issues should have High priority.

Changed in fuel:
importance: Medium → High
Sergey Galkin (sgalkin) wrote :

Workaround

while true; do if [ -z "$(ps aux | grep -v grep | grep xz)" ] ; then echo "Waiting..."; sleep 10; else tar czvf /root/$(ls /var/log/dump/ | grep -v xz).tar.gz /var/log/dump/$(ls /var/log/dump/ | grep -v xz) ; exit; fi ; done
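
For readability, the one-liner above can be unrolled roughly as follows. This is a sketch of the same logic, not a tested replacement: it waits until an xz process appears and then packs the uncompressed snapshot directory with gzip. The function names and the two parameters (dump directory and output directory) are introduced here for illustration. Unlike the one-liner, it uses -C so that the archive stores the snapshot directory name without the leading path.

```shell
#!/bin/bash
# Print the name of the snapshot directory inside $1, skipping any
# entry whose name contains "xz" (the same filter as the one-liner).
snapshot_name() {
    ls "$1" | grep -v xz
}

# Succeed when some process matching "xz" is running.
xz_running() {
    ps aux | grep -v grep | grep -q xz
}

# Wait for xz to start, then create <out_dir>/<snapshot>.tar.gz.
# $1: dump directory, $2: directory for the gzip archive
gzip_snapshot() {
    local dump_dir=$1 out_dir=$2 name
    until xz_running; do
        echo "Waiting..."
        sleep 10
    done
    name=$(snapshot_name "$dump_dir")
    tar czf "$out_dir/$name.tar.gz" -C "$dump_dir" "$name"
}

# Usage with the paths from the workaround (not run here):
#   gzip_snapshot /var/log/dump /root
```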

Dmitry Pyzhov (dpyzhov)
tags: added: module-shotgun
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Leontiy Istomin (listomin) wrote :

Reproduced with 9.0-188
3 controllers, 20 computes+ceph, 179 computes
[root@fuel ~]# du -hs /var/dump/*
44G /var/dump/fuel-snapshot-2016-04-12_08-07-14

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Georgy Kibardin (gkibardin)
Changed in fuel:
milestone: 9.0 → 10.0
Changed in fuel:
status: Confirmed → In Progress
Georgy Kibardin (gkibardin) wrote :

The first and simplest solution would be just to increase the timeout in Astute.
The second solution is more complex. We could offload the compression task to the nodes. This would save some time on transfer and a lot of space on the master node. Part of this job could be offloaded, for instance, to logrotate.

The second solution could be implemented within the scope of https://blueprints.launchpad.net/fuel/+spec/manage-logs-with-free-space-consideration

Georgy Kibardin (gkibardin) wrote :

A little test
Uncompressed data size is 15G
1. Creating tar.xz. Resulting archive: 5.4Gb, time consumed 1h 20min.
2. Creating tar.gz. Resulting archive: 7.6Gb, time consumed 8min!!!!
3. Creating tar.bz2. Resulting archive: 7Gb, time consumed 42min

This benchmark http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO seems to confirm my results. It also shows that xz is a memory hog compared to gz, which is very modest in this respect.

I think that switching back to gz would be a reasonable choice for the simple solution.
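
The trade-off in the test above can be put into numbers with a little arithmetic. The sketch below uses only the figures quoted in the test (15G of input; 5.4G in 80 minutes for xz, 7.6G in 8 minutes for gz); the helper names size_pct and speedup are made up for this example.

```shell
#!/bin/bash
# Compressed size as a percentage of the original size.
# $1: archive size, $2: uncompressed size (same unit)
size_pct() {
    awk -v out="$1" -v orig="$2" 'BEGIN { printf "%.1f\n", 100 * out / orig }'
}

# How many times shorter the second duration is than the first.
# $1: slower time, $2: faster time (same unit)
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

echo "xz archive:  $(size_pct 5.4 15)% of the original size"
echo "gz archive:  $(size_pct 7.6 15)% of the original size"
echo "gz speedup:  $(speedup 80 8)x compared to xz"
```

On these figures gz produces an archive about 2.2G larger than xz but finishes roughly ten times sooner, which is what matters when the limiting factor is a timeout.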

OpenStack Infra (hudson-openstack) wrote : Fix proposed to shotgun (master)

Fix proposed to branch: master
Review: https://review.openstack.org/310446

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (master)

Fix proposed to branch: master
Review: https://review.openstack.org/320345

Georgy Kibardin (gkibardin) wrote :

We are going to deprecate Shotgun and we are not going to fix the problem completely. The fix is just a mitigation: it improves compression speed so that we hit timeouts less often.

Changed in fuel:
status: In Progress → Fix Committed
Andrew Kalach (akndex) wrote :

The issue still reproduces at large scale

Georgy Kibardin (gkibardin) wrote :

There will always be some data size that exceeds any timeout we set.
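
A rough way to see this is to scale the measured compression times to a larger dump. The sketch below assumes that compression time grows linearly with data size, which is a simplification, and uses only figures from this report (15G taking 8 minutes with gz and 80 minutes with xz; a 44G snapshot directory). The helper name estimate_minutes is hypothetical.

```shell
#!/bin/bash
# Estimated minutes to compress $1 gigabytes, scaled linearly from a
# reference measurement of $2 gigabytes taking $3 minutes.
# Linear scaling is an assumption, not a measured property.
estimate_minutes() {
    awk -v gb="$1" -v ref_gb="$2" -v ref_min="$3" \
        'BEGIN { printf "%.0f\n", gb * ref_min / ref_gb }'
}

echo "44G with gz: about $(estimate_minutes 44 15 8) min"
echo "44G with xz: about $(estimate_minutes 44 15 80) min"
```

Whatever the timeout is, the estimate keeps growing with the size of the environment, so a faster compressor only moves the point at which the dump times out.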

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (master)

Reviewed: https://review.openstack.org/320345
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=18308a743c71e5b24efaf162bac5bbd2f7290e78
Submitter: Jenkins
Branch: master

commit 18308a743c71e5b24efaf162bac5bbd2f7290e78
Author: Georgy Kibardin <email address hidden>
Date: Tue May 24 13:30:42 2016 +0300

    Use gzip for snapshots

    Shotgun has been changed to use gzip for snapshot.
    On the 15G snapshot data gz is 10 times faster than xz, while the
    compressed size is 36% of the original with xz against 50% with gzip.
    For the pattern "create once - download once" this looks like a
    reasonable solution.

    Change-Id: I133ae854c619655169f6b42003087dd9cc21b8e0
    Closes-Bug: #1546023

Matthew Mosesohn (raytrac3r) wrote :

https://review.openstack.org/#/c/320345/ should be backported to 9.0

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/328199

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/mitaka)

Reviewed: https://review.openstack.org/328199
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=9f02eb988bf732ece0f4bad19142a4a6834f7c5b
Submitter: Jenkins
Branch: stable/mitaka

commit 9f02eb988bf732ece0f4bad19142a4a6834f7c5b
Author: Georgy Kibardin <email address hidden>
Date: Tue May 24 13:30:42 2016 +0300

    Use gzip for snapshots

    Shotgun has been changed to use gzip for snapshot.
    On the 15G snapshot data gz is 10 times faster than xz, while the
    compressed size is 36% of the original with xz against 50% with gzip.
    For the pattern "create once - download once" this looks like a
    reasonable solution.

    Change-Id: I133ae854c619655169f6b42003087dd9cc21b8e0
    Closes-Bug: #1546023
    (cherry picked from commit 18308a743c71e5b24efaf162bac5bbd2f7290e78)

Andrew Kalach (akndex) wrote :

I put this back in the Confirmed state because the issue still reproduces on 200 nodes.

Georgy Kibardin (gkibardin) wrote :

The bug is not fixed because a complete fix is impossible: there will always be some number of nodes that exceeds any timeout we set. The issue has only been mitigated, not completely fixed.
