get_state_from_hosts often fails to unpack logs

Bug #1414098 reported by James Polley on 2015-01-23
This bug affects 3 people
Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

http://logs.openstack.org/87/147287/3/check-tripleo/check-tripleo-ironic-overcloud-precise-nonha/b6f979e/logs/get_state_from_host.txt.gz

+ mkdir /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs
+ tar xJvf /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs.tar.xz -C /home/jenkins/workspace/check-tripleo-ironic-overcloud-precise-nonha/logs/seed_logs var/log/host_info.txt --strip-components=2
xz: (stdin): File format not recognized
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Derek Higgins (derekh) wrote :

Looks to me like get_state_from_hosts is failing to connect to the seed at all

From http://logs.openstack.org/87/147287/3/check-tripleo/check-tripleo-ironic-overcloud-precise-nonha/b6f979e/console.html.gz#_2015-01-23_11_58_51_067

2015-01-23 11:58:57.030 | ERROR (ConnectionError): ('Connection aborted.', error(113, 'No route to host'))

so we could just be ending up with an empty tarball
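
If the tarball really is empty, that could be detected before tar is invoked at all. The following is only a sketch of such a check, not what get_state_from_hosts does: it assumes the archive is expected to be xz-compressed (as the xJvf flags suggest), and the function name looks_like_xz and the demo file names are made up for illustration.

```python
import lzma
import os
import tempfile

# Six-byte header that starts every .xz stream.
XZ_MAGIC = b"\xfd7zXZ\x00"


def looks_like_xz(path):
    """Return True if the file at `path` begins with the xz magic bytes.

    An empty or truncated download fails this check, which separates
    "nothing usable came back from the seed" from a real unpack error.
    """
    with open(path, "rb") as fh:
        return fh.read(len(XZ_MAGIC)) == XZ_MAGIC


if __name__ == "__main__":
    # Hypothetical demo files, not the real seed_logs.tar.xz.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.tar.xz")
        empty = os.path.join(tmp, "empty.tar.xz")
        with open(good, "wb") as fh:
            fh.write(lzma.compress(b"placeholder payload"))
        open(empty, "wb").close()
        print("good:", looks_like_xz(good))
        print("empty:", looks_like_xz(empty))
```

A check like this only says the file has an xz header; it does not prove the archive is complete or contains var/log/host_info.txt.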

James Polley (tchaypo) wrote :

It's probably true that in some cases the seed just isn't contactable - but because of the way the command is all bundled together and stdout/stderr are thrown away, we can't distinguish that case from other cases.
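
To illustrate the point about bundling, here is a rough sketch of running the steps one at a time and keeping stderr for each, so that the failing step can be named. It is not the actual script: run_stages, StageResult, and the stage names are hypothetical, and the demo commands are stand-ins rather than the real ssh/tar invocations.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StageResult:
    name: str
    returncode: int
    stderr: str


def run_stages(stages):
    """Run (name, argv) pairs in order and stop at the first failure.

    Returns a StageResult for every stage attempted, so the last entry
    identifies which step failed and what it wrote to stderr.
    """
    results = []
    for name, argv in stages:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append(StageResult(name, proc.returncode, proc.stderr.strip()))
        if proc.returncode != 0:
            break
    return results


if __name__ == "__main__":
    # Stand-in commands; a real run would use the ssh and tar calls instead.
    ok = [sys.executable, "-c", "pass"]
    fail = [sys.executable, "-c",
            "import sys; sys.stderr.write('No route to host'); sys.exit(1)"]
    for result in run_stages([("contact seed", fail), ("create tar", ok)]):
        print(result)
```

The design choice here is simply that each stage keeps its own exit status and stderr instead of sharing one discarded stream.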

https://review.openstack.org/#/c/149819/ is a partial fix for this. Just looking at the test failures on that particular review, we've already discovered one other source of error: we provide a list of --exclude filenames to tar; tar tries to stat those files, and errors if any of them don't exist. So now we know that we need to be more selective about what we --exclude, or possibly that we need to use wildcards instead of literal filenames to make tar less fussy about the files.
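
As an illustration of the "wildcards instead of literal filenames" idea, the sketch below uses Python's tarfile module rather than the tar command line, so it does not reproduce tar's --exclude behaviour. make_tarball, the patterns, and the "logs" archive prefix are all hypothetical. Patterns are matched against member names, so a pattern that matches nothing is simply ignored.

```python
import fnmatch
import os
import tarfile
import tempfile


def make_tarball(src_dir, out_path, exclude_patterns):
    """Archive src_dir as an xz tarball, skipping members that match a pattern."""

    def skip_excluded(info):
        # Returning None drops the member (and, for a directory, its contents).
        if any(fnmatch.fnmatch(info.name, pat) for pat in exclude_patterns):
            return None
        return info

    with tarfile.open(out_path, "w:xz") as tar:
        tar.add(src_dir, arcname="logs", filter=skip_excluded)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        for name in ("keep.txt", "big.journal"):
            open(os.path.join(src, name), "w").close()
        out = os.path.join(tmp, "out.tar.xz")
        # The second pattern matches no file and causes no error.
        make_tarball(src, out, ["*.journal", "*.does-not-exist"])
        with tarfile.open(out) as tar:
            print(sorted(tar.getnames()))
```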

Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
Giulio Fidente (gfidente) wrote :

Derek, I am running a recheck against this bug for https://review.openstack.org/#/c/147287/4, but as you said, this again looks like a case where the host can't be contacted. I am not sure why that would be, though.

Derek Higgins (derekh) wrote :

I believe I found the underlying issue here. This is a df of the root filesystem on each of our test env hosts:

/dev/sda1 845G 808G 586M 100% /
/dev/sda1 845G 808G 556M 100% /
/dev/sda1 845G 805G 4.3G 100% /
/dev/sda1 845G 809G 55M 100% /
/dev/sda1 845G 809G 21M 100% /
/dev/sda1 845G 809G 264M 100% /
/dev/sda1 845G 805G 4.0G 100% /
/dev/sda1 845G 627G 183G 78% /
/dev/sda1 845G 549G 261G 68% /
/dev/sda1 845G 653G 157G 81% /
/dev/sda1 845G 784G 26G 97% /
/dev/sda1 845G 600G 209G 75% /
/dev/sda1 845G 696G 113G 87% /
/dev/sda1 845G 559G 251G 70% /
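
A check over df rows like the ones above could be automated along these lines. This is a sketch only: it assumes the six-column layout shown (device, size, used, available, use%, mount point), the function names are made up, and the sample is four rows copied from the listing.

```python
def parse_df_line(line):
    """Split one df data row into (device, use_percent, mount_point)."""
    fields = line.split()
    device, use_pct, mount = fields[0], fields[4], fields[5]
    return device, int(use_pct.rstrip("%")), mount


def count_full(lines, threshold=100):
    """Count rows whose Use% is at or above `threshold`."""
    return sum(1 for line in lines if parse_df_line(line)[1] >= threshold)


# Four rows taken from the df output in this report.
SAMPLE = [
    "/dev/sda1 845G 808G 586M 100% /",
    "/dev/sda1 845G 809G 55M 100% /",
    "/dev/sda1 845G 627G 183G 78% /",
    "/dev/sda1 845G 784G 26G 97% /",
]

if __name__ == "__main__":
    print("at 100%:", count_full(SAMPLE))
    print("at or above 95%:", count_full(SAMPLE, threshold=95))
```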

I'm going to do a little prep and then reprovision them all, with disk a little less oversubscribed

For reference, the nodes that are full have each been running 6 testenvs for the past 4 months, and most of the space is being used up in /var/lib/libvirt/images

[heat-admin@testenv-testenv2-24jhfjm5ohp3 images]$ sudo du -sh /var/lib/libvirt/images/
806G /var/lib/libvirt/images/

[heat-admin@testenv-testenv2-24jhfjm5ohp3 images]$ sudo ls -lSr /var/lib/libvirt/images/
total 840080760
-rw-r--r--. 1 root root 1800011776 Feb 2 11:54 seed_4.qcow2
-rw-------. 1 root root 5409996800 Dec 16 00:58 baremetalbrbm1_6.qcow2
-rw-------. 1 root root 5525667840 Dec 10 10:10 baremetalbrbm6_10.qcow2
-rw-------. 1 root root 5742329856 Dec 16 00:52 baremetalbrbm6_4.qcow2
...
-rw-------. 1 root root 12849381376 Dec 16 17:23 baremetalbrbm1_3.qcow2
-rw-------. 1 root root 13022134272 Jan 15 00:19 baremetalbrbm5_13.qcow2
-rw-r--r--. 1 qemu qemu 13036879872 Feb 2 11:54 seed_6.qcow2
-rw-------. 1 qemu qemu 13138984960 Feb 2 11:54 baremetalbrbm6_3.qcow2
-rw-------. 1 root root 13518569472 Jan 7 22:04 baremetalbrbm6_0.qcow2
-rw-------. 1 root root 13648461824 Feb 2 08:36 baremetalbrbm2_3.qcow2
-rw-------. 1 root root 13943177216 Dec 17 21:25 baremetalbrbm3_5.qcow2
-rw-------. 1 root root 14780203008 Dec 16 03:52 baremetalbrbm6_7.qcow2
-rw-------. 1 root root 15622406144 Jan 14 22:59 baremetalbrbm3_2.qcow2
-rw-------. 1 root root 15805775872 Dec 17 21:25 baremetalbrbm3_14.qcow2
-rw-------. 1 root root 16497836032 Dec 16 12:39 baremetalbrbm1_0.qcow2
-rw-r--r--. 1 root root 17180000256 Feb 2 10:05 seed_1.qcow2
-rw-------. 1 root root 17854627840 Feb 2 09:55 baremetalbrbm4_0.qcow2
-rw-------. 1 qemu qemu 18130403328 Feb 2 11:54 baremetalbrbm6_12.qcow2
-rw-------. 1 root root 18411356160 Jan 15 00:19 baremetalbrbm5_0.qcow2
-rw-------. 1 root root 26098335744 Feb 1 19:31 baremetalbrbm2_0.qcow2

Derek Higgins (derekh) wrote :

Redeploy of test env hosts looks to have solved the problem, closing this

Also, I should mention that the reason seeds were becoming inaccessible is that libvirt was pausing them when the disk filled up. I'm guessing it was only happening to F20 because it is the bigger of the 2 images we build.
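
Since a full disk is what led to the paused seeds, a free-space check before starting a test env might catch this earlier. The sketch below is an assumption about how such a check could look, not something that exists in the CI scripts: has_headroom and the 20 GiB figure are hypothetical, and the demo checks the current directory rather than /var/lib/libvirt/images.

```python
import shutil


def has_headroom(path, min_free_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    # Hypothetical threshold; on a test env host the path would be the
    # libvirt images directory.
    required = 20 * 1024 ** 3
    print("enough space:", has_headroom(".", required))
```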

Changed in tripleo:
status: Triaged → Fix Released
James Polley (tchaypo) wrote :

To be clear, the redeploy seems to have solved the problem that was often causing us to be unable to connect to the seed. The extra debugging made possible by https://review.openstack.org/#/c/149819/ also turned up at least one other issue that was causing failures creating the tar file. Prior to https://review.openstack.org/#/c/149819/ we couldn't distinguish between "couldn't read source files", "couldn't contact seed", "couldn't write tar file onto local machine" and possibly other errors.

Reviewed: https://review.openstack.org/152192
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=515d62a870a3c931ed020f2bc6706830724777f0
Submitter: Jenkins
Branch: master

commit 515d62a870a3c931ed020f2bc6706830724777f0
Author: Derek Higgins <email address hidden>
Date: Mon Feb 2 17:11:44 2015 +0000

    Adjust Test env requirements

    Oversubscribe on memory and disk space less aggressivly. We have been
    seeing problems on CI as a result of hitting limits in both of these
    cases.

    Change-Id: I83a8e18fa81365596fc96e58b9a07111623054d8
    Related-Bug: #1414098

Change abandoned by Derek Higgins (<email address hidden>) on branch: master
Review: https://review.openstack.org/152150
Reason: Changing in toci instead
