snapshot dump timeout with /var/log/ - 16G

Bug #1546023 reported by Sergey Galkin on 2016-02-16
This bug affects 4 people
Affects              Importance  Assigned to
Fuel for OpenStack   High        Georgy Kibardin
Mitaka               High        Georgy Kibardin

Bug Description

Steps to reproduce
1. Create a large environment (~190 nodes)
2. Wait until /var/log/ grows to 16G
3. Try to create a snapshot

Snapshot creation fails with the error 'Dump is timed out'.

As far as I can see, dump creation dies during tar cJvf /var/dump/fuel-snapshot-2016-02-16_09-32-22.tar.xz -C /var/dump fuel-snapshot-2016-02-16_09-32-22

In parallel I created a gz archive, and after it finished I saw this (xz was still working):
[root@fuel dump]# du -sh *
15G fuel-snapshot-2016-02-16_09-32-22
7.6G fuel-snapshot-2016-02-16_09-32-22.tar.gz
1.1G fuel-snapshot-2016-02-16_09-32-22.tar.xz

Fuel is a KVM host with 16G of RAM and 6 cores of Intel Xeon E312xx (Sandy Bridge)
Snapshot - http://mos-scale-share.mirantis.com/fuel-snapshot-2016-02-16_09-32-22.tar.gz

VERSION:
  feature_groups:
    - experimental
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "1438"
  build_id: "1438"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "33634ec27be77ecfb0b56b7e07497ad86d1fdcd3"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

Changed in fuel:
status: New → Confirmed
tags: added: area-build
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Python Team (fuel-python)
importance: Undecided → Medium
Roman Vyalov (r0mikiam) on 2016-02-16
tags: added: area-python
removed: area-build
Nastya Urlapova (aurlapova) wrote :

Python team, UX issues this bad should have High priority.

Changed in fuel:
importance: Medium → High
Sergey Galkin (sgalkin) wrote :

Workaround

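# Wait until xz starts, then build a gzip archive of the same snapshot directory in parallel.
while true; do
  if [ -z "$(ps aux | grep -v grep | grep xz)" ]; then
    echo "Waiting..."; sleep 10
  else
    # The snapshot directory is the only entry without "xz" in its name.
    snapshot=$(ls /var/log/dump/ | grep -v xz)
    tar czvf "/root/${snapshot}.tar.gz" "/var/log/dump/${snapshot}"
    exit
  fi
done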

Dmitry Pyzhov (dpyzhov) on 2016-03-02
tags: added: module-shotgun

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Leontiy Istomin (listomin) wrote :

Reproduced with 9.0-188
3 controllers, 20 computes+ceph, 179 computes
[root@fuel ~]# du -hs /var/dump/*
44G /var/dump/fuel-snapshot-2016-04-12_08-07-14

Dmitry Pyzhov (dpyzhov) on 2016-04-14
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Georgy Kibardin (gkibardin)
Changed in fuel:
milestone: 9.0 → 10.0
Changed in fuel:
status: Confirmed → In Progress
Georgy Kibardin (gkibardin) wrote :

The first and simplest solution would be just to increase the timeout in Astute.
The second solution is more complex. We could offload the compression task to the nodes. This would save some transfer time and a lot of space on the master node. Part of this job could be offloaded to, for instance, logrotate.

The second solution can be implemented within the scope of https://blueprints.launchpad.net/fuel/+spec/manage-logs-with-free-space-consideration
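
A rough illustration of what node-side compression could look like, assuming rotated logs on a node are named like app.log.1. The function name, the naming pattern and the age threshold are hypothetical and are not part of Fuel, Shotgun or the blueprint:

```shell
#!/usr/bin/env bash
# compress_rotated_logs LOG_DIR [MIN_AGE_MINUTES]
# Gzip rotated log files (hypothetical pattern: *.log.N) that have not been
# modified for MIN_AGE_MINUTES, so that less data has to be transferred to
# and stored on the master node. Live *.log files are left untouched.
compress_rotated_logs() {
    local log_dir="$1" min_age="${2:-60}"
    find "$log_dir" -type f -name '*.log.[0-9]*' ! -name '*.gz' \
        -mmin +"$min_age" -exec gzip {} +
}

# Example (not executed here):
# compress_rotated_logs /var/log 60
```

In practice logrotate's own compress option would do the same job; the sketch only shows where the work moves.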

Georgy Kibardin (gkibardin) wrote :

A little test
Uncompressed data size is 15G
1. Creating tar.xz. Resulting archive: 5.4G, time consumed 1h 20min.
2. Creating tar.gz. Resulting archive: 7.6G, time consumed 8min!
3. Creating tar.bz2. Resulting archive: 7G, time consumed 42min.
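
To put the three results on one scale, the figures above can be turned into archive size as a percentage of the input and an input throughput. A small sketch (the summarize helper is made up for illustration, and G is treated as GiB):

```shell
#!/usr/bin/env bash
# summarize NAME INPUT_G OUTPUT_G MINUTES
# Prints the archive size as a percentage of the input and the input
# throughput in MiB/s, using only the numbers reported in the test.
summarize() {
    LC_ALL=C awk -v name="$1" -v in_g="$2" -v out_g="$3" -v min="$4" 'BEGIN {
        printf "%s %.1f%% %.1f MiB/s\n",
               name, 100 * out_g / in_g, in_g * 1024 / (min * 60)
    }'
}

summarize tar.xz  15 5.4 80   # tar.xz 36.0% 3.2 MiB/s
summarize tar.gz  15 7.6 8    # tar.gz 50.7% 32.0 MiB/s
summarize tar.bz2 15 7.0 42   # tar.bz2 46.7% 6.1 MiB/s
```

By these numbers gz processes the input ten times faster than xz (8 min vs 80 min) at the cost of an archive about 2.2G larger.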

This benchmark http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO seems to confirm my results. It also shows that xz is a memory hog compared to gz, which is very modest in this respect.

I think that switching back to gz would be a reasonable choice for the simple solution.
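
For anyone who wants to repeat the comparison on their own snapshot directory, a minimal sketch follows. The time_tar helper is hypothetical, not the script used for the test above; it relies only on the standard tar flags z (gzip), j (bzip2) and J (xz), and it measures wall-clock time in whole seconds:

```shell
#!/usr/bin/env bash
# time_tar FLAG EXT SRC_DIR OUT_DIR
# Archive SRC_DIR into OUT_DIR/<name>.tar.EXT with the given tar compression
# flag and print the elapsed seconds and the resulting archive size.
time_tar() {
    local flag="$1" ext="$2" src="$3" out_dir="$4"
    local name archive start end
    name=$(basename "$src")
    archive="$out_dir/$name.tar.$ext"
    start=$(date +%s)
    tar -c"$flag"f "$archive" -C "$(dirname "$src")" "$name"
    end=$(date +%s)
    echo "$ext: $((end - start))s, $(du -h "$archive" | cut -f1)"
}

# Example (not executed here; this takes a long time on 15G of data):
# time_tar z gz  /var/dump/fuel-snapshot-2016-02-16_09-32-22 /var/dump
# time_tar j bz2 /var/dump/fuel-snapshot-2016-02-16_09-32-22 /var/dump
# time_tar J xz  /var/dump/fuel-snapshot-2016-02-16_09-32-22 /var/dump
```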

Georgy Kibardin (gkibardin) wrote :

We are going to deprecate Shotgun, so we are not going to fix the problem completely. The fix is just a mitigation: it improves compression speed so that we hit timeouts somewhat less often.

Changed in fuel:
status: In Progress → Fix Committed
Andrew Kalach (akndex) wrote :

The issue still reproduces at large scale.

Georgy Kibardin (gkibardin) wrote :

There will always be some amount of data that exceeds any timeout we set.
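
The arithmetic behind this can be sketched as follows. The rates are derived from the test earlier in this thread (15G in 8 min is roughly 32 MiB/s for gz, 15G in 1h 20min is roughly 3.2 MiB/s for xz). The 3600 s timeout is a made-up value for illustration, not the actual Astute setting, and only compression time is counted, not log collection or transfer:

```shell
#!/usr/bin/env bash
# required_timeout_s SIZE_GIB RATE_MIBS
# Seconds needed to compress SIZE_GIB of data at RATE_MIBS MiB/s.
required_timeout_s() {
    LC_ALL=C awk -v size="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", size * 1024 / rate }'
}

# max_data_gib RATE_MIBS TIMEOUT_S
# Largest input (GiB) that can be compressed within TIMEOUT_S seconds.
max_data_gib() {
    LC_ALL=C awk -v rate="$1" -v timeout="$2" \
        'BEGIN { printf "%.1f\n", rate * timeout / 1024 }'
}

required_timeout_s 44 32    # 44G snapshot at the gz rate  -> 1408
required_timeout_s 44 3.2   # same snapshot at the xz rate -> 14080
max_data_gib 32 3600        # hypothetical 1 h timeout, gz -> 112.5
```

Whatever timeout is chosen, the second function gives a finite ceiling, and a large enough environment will produce more data than that.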

Reviewed: https://review.openstack.org/320345
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=18308a743c71e5b24efaf162bac5bbd2f7290e78
Submitter: Jenkins
Branch: master

commit 18308a743c71e5b24efaf162bac5bbd2f7290e78
Author: Georgy Kibardin <email address hidden>
Date: Tue May 24 13:30:42 2016 +0300

    Use gzip for snapshots

    Shotgun has been changed to use gzip for snapshot.
    On the 15G snapshot data gz is 10 times faster than xz while compression
    rate is just 36% against 50% with gzip.
    For the pattern "create once - download once" this looks like a
    reasonable solution.

    Change-Id: I133ae854c619655169f6b42003087dd9cc21b8e0
    Closes-Bug: #1546023

Matthew Mosesohn (raytrac3r) wrote :

https://review.openstack.org/#/c/320345/ should be backported to 9.0

Reviewed: https://review.openstack.org/328199
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=9f02eb988bf732ece0f4bad19142a4a6834f7c5b
Submitter: Jenkins
Branch: stable/mitaka

commit 9f02eb988bf732ece0f4bad19142a4a6834f7c5b
Author: Georgy Kibardin <email address hidden>
Date: Tue May 24 13:30:42 2016 +0300

    Use gzip for snapshots

    Shotgun has been changed to use gzip for snapshot.
    On the 15G snapshot data gz is 10 times faster than xz while compression
    rate is just 36% against 50% with gzip.
    For the pattern "create once - download once" this looks like a
    reasonable solution.

    Change-Id: I133ae854c619655169f6b42003087dd9cc21b8e0
    Closes-Bug: #1546023
    (cherry picked from commit 18308a743c71e5b24efaf162bac5bbd2f7290e78)

Andrew Kalach (akndex) wrote :

I moved this back to the Confirmed state because the issue still reproduces on 200 nodes.

Georgy Kibardin (gkibardin) wrote :

The bug is not fixed because a complete fix is impossible: there will always be some number of nodes that exceeds any timeout we set. The issue has been mitigated, not completely fixed.
