tripleo

get_state_from_hosts often fails to unpack logs

Bug #1414098 reported by James Polley on 2015-01-23

This bug affects 3 people

Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

http://logs.openstack.org/87/147287/3/check-tripleo/check-tripleo-ironic-overcloud-precise-nonha/b6f979e/logs/get_state_from_host.txt.gz

+ mkdir /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs
+ tar xJvf /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs.tar.xz -C /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs var/log/host_info.txt --strip-components=2
xz: (stdin): File format not recognized
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Derek Higgins (derekh) wrote on 2015-01-27:

Looks to me like get_state_from_hosts is failing to connect to the seed at all

From http://logs.openstack.org/87/147287/3/check-tripleo/check-tripleo-ironic-overcloud-precise-nonha/b6f979e/console.html.gz#_2015-01-23_11_58_51_067

2015-01-23 11:58:57.030 | ERROR (ConnectionError): ('Connection aborted.', error(113, 'No route to host'))

so we could just be ending up with an empty tarball
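
A guard along these lines might surface the empty-tarball case before tar is ever invoked. This is only a sketch: the function name `archive_looks_valid` is hypothetical and not part of the tripleo scripts, and it assumes `xz` is on the PATH so that its `-t` (test integrity) option can be used.

```shell
# Hypothetical helper (not from the tripleo scripts).
# Returns 0 only if the archive exists, is non-empty, and passes
# xz's integrity test; returns 1 for missing/empty, 2 for bad data.
archive_looks_valid() {
    local archive="$1"
    if [ ! -s "$archive" ]; then
        echo "archive missing or empty: $archive" >&2
        return 1
    fi
    if ! xz -t "$archive" 2>/dev/null; then
        echo "archive is not valid xz data: $archive" >&2
        return 2
    fi
    return 0
}
```

Calling it before the `tar xJvf` step would turn "File format not recognized" into a message that says whether the file was empty or merely corrupt.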

James Polley (tchaypo) wrote on 2015-01-28:

It's probably true that in some cases the seed just isn't contactable - but because of the way the command is all bundled together and stdout/stderr are thrown away, we can't distinguish that case from other cases.

https://review.openstack.org/#/c/149819/ is a partial fix for this. Just looking at the test failures on that particular review, we've already discovered one other source of error: we provide a list of --exclude filenames to tar; tar tries to stat those files, and errors if any of them don't exist. So now we know that we need to be more selective about what we --exclude, or possibly that we need to use wildcards instead of literal filenames to make tar less fussy about the files.
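
One way to be more selective about the excludes could look like the sketch below. It is an assumption on my part, not what the review above does: the helper name `existing_excludes` is hypothetical, and it simply filters a candidate list down to the paths that actually exist so the result can be handed to GNU tar's `-X` / `--exclude-from` option.

```shell
# Hypothetical helper (not from the tripleo scripts).
# Usage: existing_excludes ROOT PATH...
# Prints, one per line, only those PATHs that exist under ROOT.
existing_excludes() {
    local root="$1"
    shift
    local path
    for path in "$@"; do
        if [ -e "$root/$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

The output can be written to a file and passed to tar with `-X`, so the exclude list never names a file that is absent on the host being archived.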

Gregory Haynes (greghaynes) on 2015-01-28

Changed in tripleo:
status:	New → Triaged
importance:	Undecided → Critical

Giulio Fidente (gfidente) wrote on 2015-02-02:

Derek, I am running a recheck against this bug for https://review.openstack.org/#/c/147287/4 but, as you said, this again seems to be a case where the host can't be contacted. I am not sure why that would be, though.

Derek Higgins (derekh) wrote on 2015-02-02:

I believe I found the underlying issue here. This is a df of the root filesystem on each of our test env hosts:

/dev/sda1       845G  808G  586M 100% /
/dev/sda1       845G  808G  556M 100% /
/dev/sda1       845G  805G  4.3G 100% /
/dev/sda1       845G  809G   55M 100% /
/dev/sda1       845G  809G   21M 100% /
/dev/sda1       845G  809G  264M 100% /
/dev/sda1       845G  805G  4.0G 100% /
/dev/sda1       845G  627G  183G  78% /
/dev/sda1       845G  549G  261G  68% /
/dev/sda1       845G  653G  157G  81% /
/dev/sda1       845G  784G   26G  97% /
/dev/sda1       845G  600G  209G  75% /
/dev/sda1       845G  696G  113G  87% /
/dev/sda1       845G  559G  251G  70% /

I'm going to do a little prep and then reprovision them all, with disk size less oversubscribed

For reference, the nodes that are full have been running 6 testenvs each for the past 4 months, and most of the space is being used up in /var/lib/libvirt/images

[heat-admin@testenv-testenv2-24jhfjm5ohp3 images]$ sudo du -sh /var/lib/libvirt/images/
806G /var/lib/libvirt/images/

[heat-admin@testenv-testenv2-24jhfjm5ohp3 images]$ sudo ls -lSr /var/lib/libvirt/images/
total 840080760
-rw-r--r--. 1 root root 1800011776 Feb 2 11:54 seed_4.qcow2
-rw-------. 1 root root 5409996800 Dec 16 00:58 baremetalbrbm1_6.qcow2
-rw-------. 1 root root 5525667840 Dec 10 10:10 baremetalbrbm6_10.qcow2
-rw-------. 1 root root 5742329856 Dec 16 00:52 baremetalbrbm6_4.qcow2
...
-rw-------. 1 root root 12849381376 Dec 16 17:23 baremetalbrbm1_3.qcow2
-rw-------. 1 root root 13022134272 Jan 15 00:19 baremetalbrbm5_13.qcow2
-rw-r--r--. 1 qemu qemu 13036879872 Feb 2 11:54 seed_6.qcow2
-rw-------. 1 qemu qemu 13138984960 Feb 2 11:54 baremetalbrbm6_3.qcow2
-rw-------. 1 root root 13518569472 Jan 7 22:04 baremetalbrbm6_0.qcow2
-rw-------. 1 root root 13648461824 Feb 2 08:36 baremetalbrbm2_3.qcow2
-rw-------. 1 root root 13943177216 Dec 17 21:25 baremetalbrbm3_5.qcow2
-rw-------. 1 root root 14780203008 Dec 16 03:52 baremetalbrbm6_7.qcow2
-rw-------. 1 root root 15622406144 Jan 14 22:59 baremetalbrbm3_2.qcow2
-rw-------. 1 root root 15805775872 Dec 17 21:25 baremetalbrbm3_14.qcow2
-rw-------. 1 root root 16497836032 Dec 16 12:39 baremetalbrbm1_0.qcow2
-rw-r--r--. 1 root root 17180000256 Feb 2 10:05 seed_1.qcow2
-rw-------. 1 root root 17854627840 Feb 2 09:55 baremetalbrbm4_0.qcow2
-rw-------. 1 qemu qemu 18130403328 Feb 2 11:54 baremetalbrbm6_12.qcow2
-rw-------. 1 root root 18411356160 Jan 15 00:19 baremetalbrbm5_0.qcow2
-rw-------. 1 root root 26098335744 Feb 1 19:31 baremetalbrbm2_0.qcow2
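
A periodic check along these lines might catch a filling disk before libvirt images run out of room. It is a minimal sketch, not something that exists in the CI tooling: the function name `full_filesystems` and the threshold are hypothetical, and it assumes input in the six-column layout of the df listing above.

```shell
# Hypothetical helper (not from the tripleo scripts).
# Reads df-style lines on stdin (filesystem size used avail use% mount)
# and prints the lines whose use% is at or above the given threshold.
# Lines whose fifth column is not a percentage (e.g. the df header)
# are skipped.
full_filesystems() {
    local threshold="$1"
    awk -v limit="$threshold" '
        {
            pct = $5
            sub(/%$/, "", pct)
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print
        }
    '
}
```

For example, `df -hP / | full_filesystems 95` would print the root filesystem line only when it is at least 95% full.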

OpenStack Infra (hudson-openstack) wrote on 2015-02-02: Related fix proposed to tripleo-image-elements (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/152192

Derek Higgins (derekh) wrote on 2015-02-03:

Redeploying the test env hosts looks to have solved the problem, so I'm closing this.

Also, I should mention that the reason seeds were becoming inaccessible is that libvirt was pausing them when the disk filled up. I'm guessing it was only happening to F20 because it is the bigger of the 2 images we build.

Changed in tripleo:
status:	Triaged → Fix Released

James Polley (tchaypo) wrote on 2015-02-04:

To be clear, the redeploy seems to have solved the problem that was often causing us to be unable to connect to the seed. The extra debugging made possible by https://review.openstack.org/#/c/149819/ also turned up at least one other issue that was causing failures creating the tar file. Prior to https://review.openstack.org/#/c/149819/ we couldn't distinguish between "couldn't read source files", "couldn't contact seed", "couldn't write tar file onto local machine" and possibly other errors.
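
To illustrate the idea of keeping those failure modes apart, here is a rough sketch; it is not the code from the review. The wrapper name `run_stage` and the stage labels are hypothetical. Each step runs on its own, its stderr is appended to a log file instead of being discarded, and a failure is reported under the stage's name.

```shell
# Hypothetical wrapper (not from the tripleo scripts).
# Usage: run_stage STAGE_NAME LOG_FILE COMMAND [ARGS...]
# Runs COMMAND, appends its stderr to LOG_FILE, and on failure
# reports which stage failed while preserving the exit status.
run_stage() {
    local stage="$1" log="$2"
    shift 2
    if "$@" 2>>"$log"; then
        return 0
    else
        local rc=$?
        echo "stage '$stage' failed with exit status $rc (see $log)" >&2
        return "$rc"
    fi
}
```

Wrapping the connection check, the remote tar, and the local write in separate `run_stage` calls would make it clear from the job output which of the three went wrong.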

OpenStack Infra (hudson-openstack) wrote on 2015-02-04: Related fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/152192
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=515d62a870a3c931ed020f2bc6706830724777f0
Submitter: Jenkins
Branch: master

commit 515d62a870a3c931ed020f2bc6706830724777f0
Author: Derek Higgins <email address hidden>
Date: Mon Feb 2 17:11:44 2015 +0000

Adjust Test env requirements

    Oversubscribe on memory and disk space less aggressivly. We have been
    seeing problems on CI as a result of hitting limits in both of these
    cases.

Change-Id: I83a8e18fa81365596fc96e58b9a07111623054d8
Related-Bug: #1414098

OpenStack Infra (hudson-openstack) wrote on 2015-03-31: Change abandoned on tripleo-incubator (master)

Change abandoned by Derek Higgins (<email address hidden>) on branch: master
Review: https://review.openstack.org/152150
Reason: Changing in toci instead

Duplicates of this bug

Bug #1413106
