Overcloud deploy failing due to BadStatusLine exception when uploading tarball to swift

Bug #1635269 reported by Lars Kellogg-Stedman
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   High         Unassigned

Bug Description

I have been experiencing a transient failure in my overcloud deploys in which the deploy terminates without any errors, like this:

  Uploading filename /tmp/tmp_M0CQ6 to Swift container overcloud
  ''

Running with '--debug' shows that this is the result of a BadStatusLine exception when communicating with swift:

    File "/usr/lib/python2.7/site-packages/tripleo_common/utils/tarball.py", line 39, in tarball_extract_to_swift_container
      headers={'X-Detect-Content-Type': 'true'}
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1796, in put_object
      response_dict=response_dict)
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1647, in _retry
      service_token=self.service_token, **kwargs)
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 1278, in put_object
      conn.putrequest(path, headers=headers, data=data)
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 447, in putrequest
      return self.request('PUT', full_path, data, headers, files)
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 437, in request
      files=files, **self.requests_args)
    File "/usr/lib/python2.7/site-packages/swiftclient/client.py", line 420, in _request
      return self.request_session.request(*arg, **kwarg)
    File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
      resp = self.send(prep, **send_kwargs)
    File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 585, in send
      r = adapter.send(request, **kwargs)
    File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 434, in send
      r = low_conn.getresponse(buffering=True)
    File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
      response.begin()
    File "/usr/lib64/python2.7/httplib.py", line 444, in begin
      version, status, reason = self._read_status()
    File "/usr/lib64/python2.7/httplib.py", line 408, in _read_status
      raise BadStatusLine(line)
  BadStatusLine: ''
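
For context, the last frame of the traceback raises BadStatusLine with an empty line, which is what httplib does when the peer closes the connection before sending any status line. The following is a minimal sketch of that condition using only the standard library on Python 3, where the module is `http.client` and the exception raised is `RemoteDisconnected`, a subclass of `BadStatusLine`. The local server, the function names, and the request path are made up for illustration, and whether the proxy in front of swift closed the connection in exactly this way is an assumption.

```python
import http.client
import socket
import threading


def hang_up_after_request(server_sock):
    """Accept one connection, read the request headers, close without replying."""
    conn, _ = server_sock.accept()
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:
            break
        data += chunk
    conn.close()


def provoke_empty_status_line():
    """Return the exception http.client raises when no status line arrives."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)           # do not wait forever in accept()
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=hang_up_after_request, args=(server,))
    worker.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()
    except http.client.BadStatusLine as exc:
        return exc
    finally:
        client.close()
        worker.join()
        server.close()
    return None


if __name__ == "__main__":
    print(repr(provoke_empty_status_line()))
```

The client sees a normal connection close rather than an HTTP error response, so the exception carries no status code or reason to report.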

There are two problems here:

(a) the overcloud deploy command should communicate this error to the operator, rather than just printing '' and exiting.

(b) whatever is causing the BadStatusLine behavior needs fixing.

Revision history for this message
John Trowbridge (trown) wrote :

I am able to consistently reproduce this on a slower[1] virtual environment, while it does not reproduce on my personal dev environment.

I agree that the error reporting needs to be fixed along with the root cause.

[1] I am not actually sure what makes it slower, but undercloud install takes 3x longer than my personal dev environment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/389737

Changed in tripleo:
assignee: nobody → John Trowbridge (trown)
status: New → In Progress
John Trowbridge (trown)
Changed in tripleo:
importance: Undecided → High
milestone: none → ocata-1
tags: added: newton-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/389737
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=4887e187a6b83ff74feaec65fc839e5ac65f813b
Submitter: Jenkins
Branch: master

commit 4887e187a6b83ff74feaec65fc839e5ac65f813b
Author: John Trowbridge <email address hidden>
Date: Fri Oct 21 10:47:17 2016 -0400

    Increase haproxy client/server timeout for swift-proxy

    The upload and extraction for the plan tarball to swift can take
    longer than the default one minute in slower environments. Doubling
    the timeout to two minutes has proven to help.

    This is only a partial fix, because the error reporting for this
    issue also needs to be improved.

    Change-Id: I06592d38fdfefacc8bdf76289a0bfa20eb33a89b
    Partial-Bug: 1635269

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/390557

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/newton)

Reviewed: https://review.openstack.org/390557
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9540715c4e087df880283192356e7feecc7e8727
Submitter: Jenkins
Branch: stable/newton

commit 9540715c4e087df880283192356e7feecc7e8727
Author: John Trowbridge <email address hidden>
Date: Fri Oct 21 10:47:17 2016 -0400

    Increase haproxy client/server timeout for swift-proxy

    The upload and extraction for the plan tarball to swift can take
    longer than the default one minute in slower environments. Doubling
    the timeout to two minutes has proven to help.

    This is only a partial fix, because the error reporting for this
    issue also needs to be improved.

    Change-Id: I06592d38fdfefacc8bdf76289a0bfa20eb33a89b
    Partial-Bug: 1635269
    (cherry picked from commit 4887e187a6b83ff74feaec65fc839e5ac65f813b)

tags: added: in-stable-newton
Revision history for this message
Steven Hardy (shardy) wrote :

There is https://bugs.launchpad.net/tripleo-quickstart/+bug/1638908 which appears to be a partial duplicate of this?

What work remains before we can close this - a patch to surface a better error?

Changed in tripleo:
milestone: ocata-1 → ocata-2
Revision history for this message
John Trowbridge (trown) wrote :

Ya I think the haproxy timeouts have fixed the symptoms here, but the error handling from the deploy command needs improvement.

In general, it does not surface exceptions at all unless in debug mode. If there is a fatal exception during deploy, we need to surface that in all cases, not just debug.
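
As a rough illustration of that idea, here is a minimal sketch of a command wrapper that always reports a fatal exception and adds the traceback only in debug mode. `run_command` and `failing_upload` are hypothetical names, not part of the tripleo client, and the sketch uses Python 3's `http.client` rather than the Python 2.7 `httplib` from the report. It also names the exception type in the message, because the string form of a BadStatusLine built from an empty line is just `''`.

```python
import sys
import traceback
from http.client import BadStatusLine  # httplib.BadStatusLine on Python 2.7


def run_command(action, debug=False, stderr=sys.stderr):
    """Run `action`; always report a fatal exception, traceback only in debug.

    Hypothetical helper for illustration, not part of the tripleo client.
    Returns a process-style exit code.
    """
    try:
        action()
    except Exception as exc:
        if debug:
            traceback.print_exc(file=stderr)
        # Name the exception type: str(BadStatusLine("")) alone is just "''".
        stderr.write("ERROR: {}: {}\n".format(type(exc).__name__, exc))
        return 1
    return 0


def failing_upload():
    # Stand-in for the swift upload failing as in the traceback in the description.
    raise BadStatusLine("")


if __name__ == "__main__":
    sys.exit(run_command(failing_upload))
```

Without debug mode this writes `ERROR: BadStatusLine: ''` to stderr and returns a non-zero exit code, instead of printing a bare `''`.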

Changed in tripleo:
milestone: ocata-2 → ocata-3
Changed in tripleo:
milestone: ocata-3 → ocata-rc1
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
milestone: ocata-rc2 → pike-1
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Emilien Macchi (emilienm) wrote :

There are no currently open reviews on this bug, changing the status back to the previous state and unassigning. If there are active reviews related to this bug, please include links in comments.

Changed in tripleo:
status: In Progress → New
assignee: John Trowbridge (trown) → nobody
status: New → Triaged
Changed in tripleo:
milestone: pike-3 → pike-rc1
Revision history for this message
Ben Nemec (bnemec) wrote :

There are actually two things being tracked in this bug: the underlying problem with Swift, and the lack of communication to the user about that problem. Since the problem referenced in the title of this bug has been fixed, I'm closing it, but I opened https://bugs.launchpad.net/tripleo/+bug/1709706 to track the UX improvement part so we don't lose track of that either.

Changed in tripleo:
status: Triaged → Fix Released