Overcloud deployment fails in minimal configuration with ('Connection aborted.', BadStatusLine("''",))

Bug #1638908 reported by Alfredo Moralejo on 2016-11-03
This bug affects 2 people
Affects             Status        Importance  Assigned to       Milestone
tripleo                           Critical    Unassigned
tripleo-quickstart  Fix Released  Undecided   Alfredo Moralejo

Bug Description

We are getting frequent issues in RDO-CI when using tripleo-quickstart with the minimal profile: while deploying the overcloud, API calls to services on the undercloud fail with ('Connection aborted.', BadStatusLine("''",)).

Examples:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-72/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.0.2.2:13004/v1/71e3a94ea39841d8902095ea052c1982/stacks/overcloud/64b3278e-52b1-46b7-b7c3-bc32f361ab77: ('Connection aborted.', BadStatusLine("''",))

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-91/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.168.24.2:13004/v1/f0aba169cfbd4985a1d42d55e3d92c88/stacks/overcloud/202f02eb-bcf3-4149-a5d6-6fb167877d28: ('Connection aborted.', BadStatusLine("''",))

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-92/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.168.24.2:13004/v1/cbd83ef4dfe34d759a065de6d5580c50/stacks/overcloud/f5299b12-639c-414e-9e7f-ee89095e0261: ('Connection aborted.', BadStatusLine("''",))

At that point the "openstack overcloud deploy" command exits while heat is still working to create the stack (as seen in https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-72/undercloud/var/log/heat/heat-engine.log.gz), and the job fails in overcloud-deploy-post.sh with the error:

/home/stack/overcloud-deploy-post.sh: line 38: /home/stack/overcloudrc: No such file or directory

Other example:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-651/undercloud/home/stack/overcloud_deploy.log.gz

In this case the error was received by heat, so the heat stack went to CREATE_FAILED, but the root cause seems similar.

We don't see similar errors in the minimal_pacemaker jobs, which don't include haproxy running on the undercloud, so I wonder if this could be related to it.

description: updated
wes hayutin (weshayutin) wrote :

Some more details are available at https://etherpad.openstack.org/p/delorean_newton_current_issues (issue #5)

Steven Hardy (shardy) wrote :

I don't think this is specific to quickstart - I hit this locally yesterday on my instack-virt-setup/tripleo.sh environment.

AFAICS the error comes when trying to store the plan data in swift, because I saw it when attempting to upload a deploy artifacts tarball, not on overcloud deploy.

Unfortunately we don't seem to configure logging for swift on the undercloud, so I couldn't find any clues as to why it was failing. Further investigation is needed, but eventually I rebooted the undercloud, which "fixed" the issue.

Changed in tripleo:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → ocata-1
Steven Hardy (shardy) wrote :

Note however that the BadStatusLine("''",) error is a pretty generic python error, so it's possible we've got different issues manifesting with the same error...
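For context, here is a minimal, self-contained sketch (not from the bug, and unrelated to TripleO itself) of how this generic error can be produced: a frontend such as haproxy that gives up on a request closes the TCP connection without ever writing an HTTP status line, and the Python 2.7 client stack (requests/httplib, as used on these underclouds) surfaces that as ('Connection aborted.', BadStatusLine("''",)); on Python 3 the same condition is reported as RemoteDisconnected instead.

    # Hypothetical reproduction: a "server" that accepts a connection and
    # closes it without responding, queried with python-requests.
    import socket
    import threading

    import requests

    def silent_server(listener):
        conn, _ = listener.accept()
        conn.close()  # drop the connection without sending any HTTP response

    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=silent_server, args=(listener,)).start()

    try:
        requests.get("http://127.0.0.1:%d/" % listener.getsockname()[1])
    except requests.exceptions.ConnectionError as exc:
        # On Python 2.7 this prints ('Connection aborted.', BadStatusLine("''",))
        print(exc)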

Alfredo Moralejo (amoralej) wrote :

There was an attempt to fix it for the swift upload case by increasing haproxy timeouts in:

https://review.openstack.org/#/c/389737/

That apparently fixed the issue, but maybe not completely.

In the examples posted in the description, the error occurred when:

- tripleoclient tried to connect to the heat service over https through haproxy.
- heat tried to connect to nova over http through haproxy.

Digging a bit into the error timestamps in the logs at https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-651/undercloud/var/log/heat/heat-engine.log.gz:

2016-11-03 04:48:46.279 20747 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:215

...

2016-11-03 04:49:57.695 20746 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Compute-iqruwqc7bwfk" [8e3bedf5-503a-4718-bab7-c21a173bc0af] sleeping _sleep /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:156
2016-11-03 04:49:46.283 20747 INFO heat.engine.resource [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] CREATE: ServerUpdateAllowed "Controller" [da8fcc62-7c62-4e31-96ce-5896d038dec7] Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e]
2016-11-03 04:49:46.283 20747 ERROR heat.engine.resource ConnectFailure: Unable to establish connection to http://192.168.24.3:8774/v2.1/servers/da8fcc62-7c62-4e31-96ce-5896d038dec7: ('Connection aborted.', BadStatusLine("''",))
2016-11-03 04:49:58.161 20747 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 2e9d442ef54649fca87418dd0ba0af52 reply to reply_5fe11ae54cb2455f87b25fa4e66fcb3f __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:194

We could infer that it's hitting a 10s timeout in haproxy. Taking a look at https://github.com/openstack/puppet-tripleo/blob/master/manifests/haproxy.pp#L38-L40, both the connect and http-request timeouts are 10s by default. I'd suggest increasing them to something like 20 seconds and testing, wdyt?
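For reference, this is roughly what those defaults render to in the undercloud's /etc/haproxy/haproxy.cfg; the 10s values are the ones linked above, and the 1m values for queue/client/server are the defaults mentioned in the fix further down. Treat the exact layout as illustrative rather than a verbatim copy of the generated file:

    defaults
        timeout  http-request 10s
        timeout  queue 1m
        timeout  connect 10s
        timeout  client 1m
        timeout  server 1m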

Alfredo Moralejo (amoralej) wrote :

Re-reading the logs, I'd say we are hitting a 60s timeout, as it's exactly one minute between the last log line mentioning overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn before the error and the error message itself:

2016-11-03 04:48:46.279 20747 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:215

2016-11-03 04:49:46.283 20747 INFO heat.engine.resource [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] CREATE: ServerUpdateAllowed "Controller" [da8fcc62-7c62-4e31-96ce-5896d038dec7] Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e]
2016-11-03 04:49:46.283 20747 ERROR heat.engine.resource ConnectFailure: Unable to establish connection to http://192.168.24.3:8774/v2.1/servers/da8fcc62-7c62-4e31-96ce-5896d038dec7: ('Connection aborted.', BadStatusLine("''",))
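As a quick sanity check on that reading (an editorial aside, not part of the original comment), the gap between the two heat-engine timestamps quoted above can be computed directly:

    # Difference between the "running step" line and the ConnectFailure line.
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S.%f"
    start = datetime.strptime("2016-11-03 04:48:46.279", fmt)
    error = datetime.strptime("2016-11-03 04:49:46.283", fmt)
    print((error - start).total_seconds())  # 60.004 -> consistent with a 1m timeout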

Fix proposed to branch: master
Review: https://review.openstack.org/393876

Changed in tripleo-quickstart:
assignee: nobody → Alfredo Moralejo (amoralej)
status: New → In Progress
Matt Young (halcyondude) wrote :

AFAIK we might have just hit this on the internal virt-HA RDO newton job. I've seen it in 1 of 4 jobs since we picked up newton: 8ebb715a52afef8c5eea6fa343a915d97910907c_13bba89f

https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-11/undercloud/home/stack/overcloud_deploy.log.gz#_2016-11-04_18_30_55

https://review.rdoproject.org/etherpad/p/rdo-internal-issues #59

shardy: Could you please point me to how to configure logging for swift on the UC? I would like to make whatever changes are necessary to make sure we collect these for each job.

Change abandoned by Alfredo Moralejo (<email address hidden>) on branch: master
Review: https://review.openstack.org/393876
Reason: Abandoned in favor of https://review.openstack.org/394378

Reviewed: https://review.openstack.org/394378
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7
Submitter: Jenkins
Branch: master

commit 1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7
Author: Steven Hardy <email address hidden>
Date: Mon Nov 7 11:35:05 2016 +0000

    Increase haproxy timeouts

    It's been proposed this may help with the
    ('Connection aborted.', BadStatusLine("''",)) errors.

    This patch increases queue, server and client timeouts to 2m (default is 1m)
    Related-Bug: #1638908

    Change-Id: Ie4f059f3fad2271bb472697e85ede296eee91f5d

Steven Hardy (shardy) on 2016-11-16
Changed in tripleo:
milestone: ocata-1 → ocata-2

Reviewed: https://review.openstack.org/397822
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=4460bd189228b8eb87403567df4c1365c2158e61
Submitter: Jenkins
Branch: stable/newton

commit 4460bd189228b8eb87403567df4c1365c2158e61
Author: Steven Hardy <email address hidden>
Date: Mon Nov 7 11:35:05 2016 +0000

    Increase haproxy timeouts

    It's been proposed this may help with the
    ('Connection aborted.', BadStatusLine("''",)) errors.

    This patch increases queue, server and client timeouts to 2m (default is 1m)
    Related-Bug: #1638908

    Change-Id: Ie4f059f3fad2271bb472697e85ede296eee91f5d
    (cherry picked from commit 1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7)

tags: added: in-stable-newton
Steven Hardy (shardy) wrote :

https://review.openstack.org/#/c/402081/ was posted which aims to reduce the event polling load on heat, which may help when heat is overloaded on resource-constrained underclouds.

Changed in tripleo:
status: Confirmed → Fix Released
status: Fix Released → In Progress

Reviewed: https://review.openstack.org/399619
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=34ec2644284d59c0e07f2c4729cd354a37928aae
Submitter: Jenkins
Branch: master

commit 34ec2644284d59c0e07f2c4729cd354a37928aae
Author: John Trowbridge <email address hidden>
Date: Fri Nov 18 09:22:57 2016 -0500

    Increase the default number of workers for heat engine

    We switched to a saner default for service workers on the undercloud
    with this patch[1]. However, reducing the number of workers for heat
    engine is not so sane. The massive nested stack we are deploying grinds
    to a halt with only two engine workers, which is now our saner default
    for any system with 8 or fewer CPU cores.

    This patch changes to using the heat default for the number of engine
    workers which is max(#CPUs,4).

    [1] https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=ee5f25a726910961caf72de2c6c55de06c922b74

    Change-Id: I95df0f39f37316cc56eafb351c823197f252d7b7
    Related-Bug: 1638908
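For illustration (an editorial note, not part of the commit), the heat default that the patch falls back to works out as follows; this is a sketch of the max(#CPUs, 4) rule described above, not heat's actual implementation:

    # Assumed behaviour per the commit message: with num_engine_workers unset,
    # heat uses at least 4 engine workers, or one per CPU on larger machines.
    import multiprocessing

    num_engine_workers = max(multiprocessing.cpu_count(), 4)
    print(num_engine_workers)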

Changed in tripleo:
status: In Progress → Fix Released
John Trowbridge (trown) on 2017-01-27
Changed in tripleo-quickstart:
status: In Progress → Fix Released