Overcloud deployment fails in minimal configuration with ('Connection aborted.', BadStatusLine("''",))

Bug #1638908 reported by Alfredo Moralejo
This bug affects 2 people
Affects             Status        Importance  Assigned to       Milestone
tripleo             Fix Released  Critical    Unassigned
tripleo-quickstart  Fix Released  Undecided   Alfredo Moralejo

Bug Description

We are seeing frequent failures in RDO-CI when using tripleo-quickstart with the minimal profile: API calls to services on the undercloud fail with ('Connection aborted.', BadStatusLine("''",)) while deploying the overcloud.

Examples:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-72/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.0.2.2:13004/v1/71e3a94ea39841d8902095ea052c1982/stacks/overcloud/64b3278e-52b1-46b7-b7c3-bc32f361ab77: ('Connection aborted.', BadStatusLine("''",))

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-91/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.168.24.2:13004/v1/f0aba169cfbd4985a1d42d55e3d92c88/stacks/overcloud/202f02eb-bcf3-4149-a5d6-6fb167877d28: ('Connection aborted.', BadStatusLine("''",))

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-92/undercloud/home/stack/overcloud_deploy.log.gz

Unable to establish connection to https://192.168.24.2:13004/v1/cbd83ef4dfe34d759a065de6d5580c50/stacks/overcloud/f5299b12-639c-414e-9e7f-ee89095e0261: ('Connection aborted.', BadStatusLine("''",))

At that point the "openstack overcloud deploy" command exits while heat is still working to create the stack (as seen in https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-newton-delorean-minimal-72/undercloud/var/log/heat/heat-engine.log.gz), and the job fails in overcloud-deploy-post.sh with the error:

/home/stack/overcloud-deploy-post.sh: line 38: /home/stack/overcloudrc: No such file or directory

Another example:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-651/undercloud/home/stack/overcloud_deploy.log.gz

In this case the error was received by heat, so the heat stack went to CREATE_FAILED, but the root cause seems similar.

We don't see similar errors in minimal_pacemaker jobs, which don't include haproxy running on the undercloud, so I wonder if this could be related to it.

description: updated
Revision history for this message
wes hayutin (weshayutin) wrote :

Some more details are available at https://etherpad.openstack.org/p/delorean_newton_current_issues, issue #5.

Revision history for this message
Steven Hardy (shardy) wrote :

I don't think this is specific to quickstart - I hit this locally yesterday on my instack-virt-setup/tripleo.sh environment.

AFAICS the error comes when trying to store the plan data in swift, because I saw it when attempting to upload a deploy artifacts tarball, not on overcloud deploy.

Unfortunately we don't seem to configure logging for swift on the undercloud, so I couldn't find any clues as to why it was failing. Further investigation is needed, but eventually I rebooted the undercloud, which "fixed" the issue.

Changed in tripleo:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → ocata-1
Revision history for this message
Steven Hardy (shardy) wrote :

Note however that the BadStatusLine("''",) error is a pretty generic python error, so it's possible we've got different issues manifesting with the same error...
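
To illustrate why this error is so generic, here is a minimal sketch that reproduces it with nothing but a local socket. It assumes Python 3, where http.client raises RemoteDisconnected (a subclass of BadStatusLine) in the situation that Python 2.7's httplib reports as BadStatusLine("''"). The helper names are hypothetical, and the "server" merely stands in for any peer, such as a proxy, that closes the connection before sending a status line; it is not a claim about what haproxy did here.

```python
import http.client
import socket
import threading


def serve_once_without_reply(listener):
    """Accept one connection, read the request, then close without answering."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(65536)  # consume the request so the close is a clean shutdown


def fetch_from_silent_peer():
    """Return the exception raised when the peer closes before any status line."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once_without_reply, args=(listener,))
    worker.start()
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()  # the status line is empty, so this raises
    except http.client.BadStatusLine as exc:
        return exc
    finally:
        client.close()
        worker.join()
        listener.close()
    return None


if __name__ == "__main__":
    print(repr(fetch_from_silent_peer()))
```

Any component that drops the connection after the request is sent produces the same exception, which is why the message alone does not identify the failing service.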

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

There was an attempt to fix it for the swift upload case by increasing haproxy timeouts in:

https://review.openstack.org/#/c/389737/

That, apparently, fixed the issue, but maybe not completely.

In the examples posted in the description, the error occurred when:

- tripleoclient tried to connect to the heat service using https on haproxy.
- heat tried to connect to nova using http on haproxy.

Digging a bit into the error timestamps in the logs at https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-651/undercloud/var/log/heat/heat-engine.log.gz:

2016-11-03 04:48:46.279 20747 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:215

...

2016-11-03 04:49:57.695 20746 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Compute-iqruwqc7bwfk" [8e3bedf5-503a-4718-bab7-c21a173bc0af] sleeping _sleep /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:156
2016-11-03 04:49:46.283 20747 INFO heat.engine.resource [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] CREATE: ServerUpdateAllowed "Controller" [da8fcc62-7c62-4e31-96ce-5896d038dec7] Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e]
2016-11-03 04:49:46.283 20747 ERROR heat.engine.resource ConnectFailure: Unable to establish connection to http://192.168.24.3:8774/v2.1/servers/da8fcc62-7c62-4e31-96ce-5896d038dec7: ('Connection aborted.', BadStatusLine("''",))
2016-11-03 04:49:58.161 20747 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 2e9d442ef54649fca87418dd0ba0af52 reply to reply_5fe11ae54cb2455f87b25fa4e66fcb3f __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:194

We could infer that it's hitting a 10s timeout in haproxy. Looking at https://github.com/openstack/puppet-tripleo/blob/master/manifests/haproxy.pp#L38-L40, both connect and http-request are 10s by default. I'd suggest increasing them to something like 20 seconds and testing, wdyt?

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

Re-reading the logs, I'd say we are hitting a 60s timeout, as it's exactly 1 minute between the last log line mentioning overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn before the error and the error message itself:

2016-11-03 04:48:46.279 20747 DEBUG heat.engine.scheduler [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] Task stack_task from Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:215

2016-11-03 04:49:46.283 20747 INFO heat.engine.resource [req-05836d4e-16ec-4fa8-904c-ffde74d58b61 a0374c3772a64c76a96872a6190f03e6 2316dad07f6f48049d1cb84ab7a453ee - - -] CREATE: ServerUpdateAllowed "Controller" [da8fcc62-7c62-4e31-96ce-5896d038dec7] Stack "overcloud-Controller-3sktk7mhr5kq-0-7mi5p4zfscxn" [75257dba-11c9-4571-90ea-bb51dbd2ac6e]
2016-11-03 04:49:46.283 20747 ERROR heat.engine.resource ConnectFailure: Unable to establish connection to http://192.168.24.3:8774/v2.1/servers/da8fcc62-7c62-4e31-96ce-5896d038dec7: ('Connection aborted.', BadStatusLine("''",))
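
The gap can be checked directly from the two timestamps quoted above. This is only a sketch of the arithmetic: the function name and format constant are made up for illustration, and the timestamps are copied from the heat-engine.log lines.

```python
from datetime import datetime

# Timestamp layout used by the heat-engine.log lines quoted in this comment.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two log timestamps."""
    delta = (datetime.strptime(end, LOG_TIME_FORMAT)
             - datetime.strptime(start, LOG_TIME_FORMAT))
    return delta.total_seconds()


last_step = "2016-11-03 04:48:46.279"  # last "running step" line for the stack
error_at = "2016-11-03 04:49:46.283"   # ConnectFailure line
gap = seconds_between(last_step, error_at)
print("gap: %.3f s" % gap)
```

The difference comes out at about 60.004 seconds, which is what points to a 1m timeout rather than the 10s connect and http-request values.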

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.openstack.org/393876

Changed in tripleo-quickstart:
assignee: nobody → Alfredo Moralejo (amoralej)
status: New → In Progress
Revision history for this message
Matt Young (halcyondude) wrote :

AFAIK we might have just hit this on the internal virt-HA RDO newton job. I've seen this in 1 of 4 jobs so far since we picked up newton: 8ebb715a52afef8c5eea6fa343a915d97910907c_13bba89f

https://thirdparty-logs.rdoproject.org/jenkins-tripleo-quickstart-periodic-newton-delorean-ha_192gb-11/undercloud/home/stack/overcloud_deploy.log.gz#_2016-11-04_18_30_55

https://review.rdoproject.org/etherpad/p/rdo-internal-issues #59

shardy: Could you please point me to how to configure logging for swift on the UC? I would like to make whatever changes are necessary to make sure we collect these for each job.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/394378

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by Alfredo Moralejo (<email address hidden>) on branch: master
Review: https://review.openstack.org/393876
Reason: Abandoned in favor of https://review.openstack.org/394378

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/394378
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7
Submitter: Jenkins
Branch: master

commit 1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7
Author: Steven Hardy <email address hidden>
Date: Mon Nov 7 11:35:05 2016 +0000

    Increase haproxy timeouts

    It's been proposed this may help with the
    ('Connection aborted.', BadStatusLine("''",)) errors.

    This patch increase queue, server and client timeouts to 2m (default is 1m)
    Related-Bug: #1638908

    Change-Id: Ie4f059f3fad2271bb472697e85ede296eee91f5d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/397822

Steven Hardy (shardy)
Changed in tripleo:
milestone: ocata-1 → ocata-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/newton)

Reviewed: https://review.openstack.org/397822
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=4460bd189228b8eb87403567df4c1365c2158e61
Submitter: Jenkins
Branch: stable/newton

commit 4460bd189228b8eb87403567df4c1365c2158e61
Author: Steven Hardy <email address hidden>
Date: Mon Nov 7 11:35:05 2016 +0000

    Increase haproxy timeouts

    It's been proposed this may help with the
    ('Connection aborted.', BadStatusLine("''",)) errors.

    This patch increase queue, server and client timeouts to 2m (default is 1m)
    Related-Bug: #1638908

    Change-Id: Ie4f059f3fad2271bb472697e85ede296eee91f5d
    (cherry picked from commit 1bee7bc8fa0ca5ace330e54bc3e64d7f6692d5a7)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to instack-undercloud (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/399619

Revision history for this message
Steven Hardy (shardy) wrote :

https://review.openstack.org/#/c/402081/ was posted which aims to reduce the event polling load on heat, which may help when heat is overloaded on resource-constrained underclouds.

Changed in tripleo:
status: Confirmed → Fix Released
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to instack-undercloud (master)

Reviewed: https://review.openstack.org/399619
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=34ec2644284d59c0e07f2c4729cd354a37928aae
Submitter: Jenkins
Branch: master

commit 34ec2644284d59c0e07f2c4729cd354a37928aae
Author: John Trowbridge <email address hidden>
Date: Fri Nov 18 09:22:57 2016 -0500

    Increase the default number of workers for heat engine

    We switched to a saner default for service workers on the undercloud
    with this patch[1]. However, reducing the number of workers for heat
    engine is not so sane. The massive nested stack we are deploying grinds
    to a halt with only two engine workers, which is now our saner default
    for any system with 8 or fewer CPU cores.

    This patch changes to using the heat default for the number of engine
    workers which is max(#CPUs,4).

    [1] https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=ee5f25a726910961caf72de2c6c55de06c922b74

    Change-Id: I95df0f39f37316cc56eafb351c823197f252d7b7
    Related-Bug: 1638908
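
As a rough illustration of the default described in the commit message above, the formula max(#CPUs,4) can be written out as follows. The function name is hypothetical and this is not heat's actual code; it only mirrors the expression quoted in the commit.

```python
def default_heat_engine_workers(cpu_count):
    """Worker count per the formula quoted in the commit: max(#CPUs, 4)."""
    return max(cpu_count, 4)


# Small underclouds still get four engine workers; larger ones scale with cores.
for cpus in (2, 4, 8, 16):
    print(cpus, "CPUs ->", default_heat_engine_workers(cpus), "workers")
```

Compared with the two workers that the commit says the earlier default gave any system with 8 or fewer cores, this floor of four is the difference the patch makes on small CI underclouds.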

Changed in tripleo:
status: In Progress → Fix Released
John Trowbridge (trown)
Changed in tripleo-quickstart:
status: In Progress → Fix Released