Update OS API charm default haproxy timeout values

Bug #1736171 reported by Jason Hobbs
This bug affects 2 people
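+---------------------------------------+--------------+------------+-------------+-----------+
| Affects                               | Status       | Importance | Assigned to | Milestone |
+---------------------------------------+--------------+------------+-------------+-----------+
| Ceph RADOS Gateway Charm              | Fix Released | Medium     | David Ames  |           |
| OpenStack AODH Charm                  | Fix Released | Medium     | David Ames  |           |
| OpenStack Barbican Charm              | Fix Released | Medium     | David Ames  |           |
| OpenStack Ceilometer Charm            | Fix Released | Medium     | David Ames  |           |
| OpenStack Cinder Charm                | Fix Released | Medium     | David Ames  |           |
| OpenStack Dashboard Charm             | Fix Released | Medium     | David Ames  |           |
| OpenStack Designate Charm             | Fix Released | Medium     | David Ames  |           |
| OpenStack Glance Charm                | Fix Released | Medium     | David Ames  |           |
| OpenStack Heat Charm                  | Fix Released | Medium     | Unassigned  |           |
| OpenStack Keystone Charm              | Fix Released | Medium     | David Ames  |           |
| OpenStack Manila Charm                | Fix Released | Medium     | David Ames  |           |
| OpenStack Neutron API Charm           | Fix Released | Medium     | David Ames  |           |
| OpenStack Neutron Gateway Charm       | Invalid      | Undecided  | Unassigned  |           |
| OpenStack Nova Cloud Controller Charm | Fix Released | Medium     | David Ames  |           |
| OpenStack Swift Proxy Charm           | Fix Released | Medium     | David Ames  |           |
| neutron                               | Invalid      | Undecided  | Unassigned  |           |
+---------------------------------------+--------------+------------+-------------+-----------+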

Bug Description

Change the default haproxy timeout values in the OpenStack API charms to:

  haproxy-server-timeout: 90000
  haproxy-client-timeout: 90000
  haproxy-connect-timeout: 9000
  haproxy-queue-timeout: 9000

Until this lands, the workaround is to set these values in the charm config:

juju config neutron-api haproxy-server-timeout=90000 haproxy-client-timeout=90000 haproxy-queue-timeout=9000 haproxy-connect-timeout=9000

------- Original Bug ---------
NeutronNetworks.create_and_delete_subnets is failing when run with concurrency greater than 1.

Here's a snippet of a failure: http://paste.ubuntu.com/25927074/

Here is my rally yaml: http://paste.ubuntu.com/26112719/

This is happening with Pike on Xenial, from the Ubuntu Cloud Archive. The deployment is distributed across 9 nodes, with HA services.

For now we have adjusted our test scenario to be more realistic. When we spread the test over 30 tenants instead of 3, and simulate 2 users per tenant instead of 3, we do not hit the issue.

Jason Hobbs (jason-hobbs) wrote :

This bug has entered the 'Field High' SLA process.

Ryan Beisner (1chb1n) wrote :

Please add juju-crashdump logs.

tags: added: uosci
Changed in neutron:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :
Changed in neutron:
status: Incomplete → New
Ryan Beisner (1chb1n)
Changed in neutron:
status: New → Invalid
Changed in charm-neutron-gateway:
assignee: nobody → David Ames (thedac)
David Ames (thedac) wrote :

Re-ran rally and saw this go by:

2017-12-07 00:02:25.399 5255 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 74 START
2017-12-07 00:02:25.915 5252 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 67 END: OK
2017-12-07 00:02:25.927 5252 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 75 START
2017-12-07 00:02:26.202 5254 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 41 END: Error ConnectFailure: Unable to establish connection to http://10.245.208.97:9696/v2.0/subnets/d6fe1572-83ca-4f64-a30e-41522471e2f9: ('Connection aborted.', BadStatusLine("''",))
2017-12-07 00:02:26.217 5254 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 76 START
2017-12-07 00:02:26.601 5255 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 73 END: OK
2017-12-07 00:02:26.626 5255 INFO rally.task.runner [-] Task a5

BadStatusLine("''",) is the smoking gun. It is almost always haproxy dropping the connection due to one of its timeouts.
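To pull these failures out of a longer rally log, something like the following sketch may help. The regular expression is inferred only from the log excerpt above, and the helper names (failed_iterations, looks_like_proxy_drop) are hypothetical, not part of rally.

```python
import re

# Matches the "ITER: <n> END: <result>" part of the rally.task.runner lines
# quoted above. The layout is an assumption based on that excerpt alone.
ITER_END = re.compile(r"ITER: (?P<iteration>\d+) END: (?P<result>.*)$")


def failed_iterations(log_text):
    """Return {iteration number: error text} for iterations that did not end OK."""
    failures = {}
    for line in log_text.splitlines():
        match = ITER_END.search(line)
        if match is None:
            continue  # START lines and unrelated output
        result = match.group("result").strip()
        if result != "OK":
            failures[int(match.group("iteration"))] = result
    return failures


def looks_like_proxy_drop(error_text):
    """Heuristic only: an empty BadStatusLine suggests the connection was closed early."""
    return "BadStatusLine" in error_text


sample = """
2017-12-07 00:02:25.915 5252 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 67 END: OK
2017-12-07 00:02:26.202 5254 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 41 END: Error ConnectFailure: Unable to establish connection to http://10.245.208.97:9696/v2.0/subnets/d6fe1572-83ca-4f64-a30e-41522471e2f9: ('Connection aborted.', BadStatusLine("''",))
2017-12-07 00:02:26.217 5254 INFO rally.task.runner [-] Task a502ee62-c31b-4333-8169-f6a3d07d592e | ITER: 76 START
"""

failures = failed_iterations(sample)
for iteration, error in sorted(failures.items()):
    print(iteration, looks_like_proxy_drop(error))
```

A match on BadStatusLine is only a hint that a proxy timeout was involved; the haproxy logs on the API units would be needed to confirm it.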

I highly recommend adding the following configuration for all the OpenStack API charms:

juju config neutron-api haproxy-server-timeout=90000 haproxy-client-timeout=90000 haproxy-queue-timeout=9000 haproxy-connect-timeout=9000

The defaults are good for non-busy clouds. But once we are stress testing we need to bump up the timeouts so that haproxy does not drop connections. This is what we have running in serverstack.
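Since the recommendation applies to every API charm, the per-application commands can be generated rather than typed. This is a minimal sketch that only prints the commands; the application names in API_APPLICATIONS are assumptions and need to match the names used in the actual Juju model.

```python
import shlex

# Option names and values as recommended in this comment (milliseconds).
TIMEOUTS = {
    "haproxy-server-timeout": 90000,
    "haproxy-client-timeout": 90000,
    "haproxy-queue-timeout": 9000,
    "haproxy-connect-timeout": 9000,
}

# Assumption: applications deployed under their default charm names.
API_APPLICATIONS = [
    "keystone",
    "glance",
    "cinder",
    "neutron-api",
    "nova-cloud-controller",
]


def juju_config_command(application, options):
    """Build one `juju config <application> key=value ...` command line."""
    args = ["juju", "config", application]
    args += ["{}={}".format(key, value) for key, value in options.items()]
    return " ".join(shlex.quote(arg) for arg in args)


for application in API_APPLICATIONS:
    print(juju_config_command(application, TIMEOUTS))
```

Printing the commands first makes it easy to review them before running anything against a live model.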

Changed in charm-neutron-gateway:
status: New → Invalid
David Ames (thedac) wrote :

Rally timing info:

+-----------------------------------------------------------------------------------------------------------------------------------------+
| Response Times (sec)                                                                                                                    |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                               | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_network               | 0.47      | 0.754        | 1.005        | 1.026        | 1.128     | 0.768     | 100.0%  | 30    |
| neutron.create_subnet (x2)           | 1.114     | 1.779        | 2.843        | 2.89         | 3.13      | 1.988     | 100.0%  | 30    |
| neutron.create_router (x2)           | 5.103     | 7.149        | 10.792       | 11.076       | 11.922    | 7.775     | 100.0%  | 30    |
| neutron.add_interface_router (x2)    | 4.421     | 6.012        | 9.526        | 9.773        | 10.016    | 6.599     | 100.0%  | 30    |
| neutron.remove_interface_router (x2) | 3.381     | 4.805        | 7.59         | 8.362        | 9.048     | 5.276     | 100.0%  | 30    |
| neutron.delete_router (x2)           | 3.192     | 4.442        | 7.734        | 7.774        | 7.862     | 5.001     | 100.0%  | 30    |
| total                                | 18.011    | 24.171       | 38.944       | 40.172       | 41.232    | 27.407    | 100.0%  | 30    |
| -> duration                          | 18.011    | 24.171       | 38.944       | 40.172       | 41.232    | 27.407    | 100.0%  | 30    |
| -> idle_duration                     | 0.0       | 0.0          | 0.0          | 0.0          | 0.0       | 0.0       | 100.0%  | 30    |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

We need the timeout values to be greater than these.
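As a rough way to compare the table against a timeout, the sketch below converts a millisecond timeout to seconds and lists the rows whose maximum reaches it. The figures are copied from the "Max (sec)" column; the function name is hypothetical. Note that the "total" row spans several API requests, so treating it like a single request is a conservative assumption, not a statement about how haproxy applies its timeouts.

```python
# Maximum response times in seconds, copied from the "Max (sec)" column above.
MAX_SECONDS = {
    "neutron.create_network": 1.128,
    "neutron.create_subnet (x2)": 3.13,
    "neutron.create_router (x2)": 11.922,
    "neutron.add_interface_router (x2)": 10.016,
    "neutron.remove_interface_router (x2)": 9.048,
    "neutron.delete_router (x2)": 7.862,
    "total": 41.232,
}


def actions_exceeding(timeout_ms, max_seconds=MAX_SECONDS):
    """Return the actions whose observed maximum is at or above the timeout."""
    limit_seconds = timeout_ms / 1000.0
    return sorted(
        action for action, seconds in max_seconds.items()
        if seconds >= limit_seconds
    )


for timeout_ms in (30000, 90000):
    print(timeout_ms, actions_exceeding(timeout_ms))
```

These numbers come from a single run of 30 iterations, so a busier cloud may well show higher maxima.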

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

@thedac Do you think this is still a valid bug, or is it no longer valid?

David Ames (thedac) wrote :

Discussed with the team. For 18.02 we will change the OpenStack API charm timeout values from their current default:

  haproxy-server-timeout: 30000
  haproxy-client-timeout: 30000
  haproxy-connect-timeout: 5000
  haproxy-queue-timeout: 5000

To more forgiving values:

  haproxy-server-timeout: 90000
  haproxy-client-timeout: 90000
  haproxy-connect-timeout: 9000
  haproxy-queue-timeout: 9000
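For reference, here is a sketch of how these option values might end up as haproxy timeout directives. The mapping from charm option to directive is an assumption for illustration and is not copied from the charm's haproxy template; the directive names themselves (timeout queue, timeout connect, timeout client, timeout server) are standard haproxy settings, and a bare number is read as milliseconds.

```python
OLD_DEFAULTS = {
    "haproxy-server-timeout": 30000,
    "haproxy-client-timeout": 30000,
    "haproxy-connect-timeout": 5000,
    "haproxy-queue-timeout": 5000,
}

NEW_DEFAULTS = {
    "haproxy-server-timeout": 90000,
    "haproxy-client-timeout": 90000,
    "haproxy-connect-timeout": 9000,
    "haproxy-queue-timeout": 9000,
}

# Assumed mapping from charm config option to haproxy directive.
DIRECTIVES = {
    "haproxy-queue-timeout": "timeout queue",
    "haproxy-connect-timeout": "timeout connect",
    "haproxy-client-timeout": "timeout client",
    "haproxy-server-timeout": "timeout server",
}


def render_timeouts(options):
    """Render a haproxy `defaults` fragment containing only the timeouts."""
    lines = ["defaults"]
    for option, directive in DIRECTIVES.items():
        lines.append("    {} {}".format(directive, options[option]))
    return "\n".join(lines)


print(render_timeouts(NEW_DEFAULTS))
```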

summary: - create_and_delete_subnets rally test failures
+ Update OS API charm default haproxy timeout values
David Ames (thedac) wrote :

@Swaminathan This is not a valid neutron bug. It is charm configuration related.

description: updated
Changed in charm-cinder:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-glance:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-ceph-radosgw:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-heat:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-keystone:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-neutron-api:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-nova-cloud-controller:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-openstack-dashboard:
importance: Undecided → Medium
milestone: none → 18.02
status: New → Triaged
Changed in charm-neutron-gateway:
assignee: David Ames (thedac) → nobody
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.openstack.org/527221
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=e1ac46f34264c11b56c571412bc40c42018370bb
Submitter: Zuul
Branch: master

commit e1ac46f34264c11b56c571412bc40c42018370bb
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:36:56 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I973962a5c1538b0d9afbebea8cebf50d938ecfb5
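
The statement that configured values will not be overridden can be illustrated with a small sketch. This is not the charm-helpers implementation; effective_timeouts is a hypothetical helper, and it assumes an unset option shows up as missing or None in the charm config.

```python
NEW_DEFAULTS = {
    "haproxy-server-timeout": 90000,
    "haproxy-client-timeout": 90000,
    "haproxy-connect-timeout": 9000,
    "haproxy-queue-timeout": 9000,
}


def effective_timeouts(charm_config, defaults=NEW_DEFAULTS):
    """Prefer an operator-configured value; fall back to the default otherwise."""
    effective = {}
    for option, default in defaults.items():
        configured = charm_config.get(option)
        effective[option] = default if configured is None else configured
    return effective


# An operator who already set a server timeout keeps it; the rest use defaults.
print(effective_timeouts({"haproxy-server-timeout": 120000}))
```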

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-glance (master)

Reviewed: https://review.openstack.org/527220
Committed: https://git.openstack.org/cgit/openstack/charm-glance/commit/?id=c5048c78171d705d81680fb9902fc78baff73f72
Submitter: Zuul
Branch: master

commit c5048c78171d705d81680fb9902fc78baff73f72
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:36:47 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I4d15d8ef0f2bfb9966a45ca1850721c5de4d3b08

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-cinder (master)

Reviewed: https://review.openstack.org/527219
Committed: https://git.openstack.org/cgit/openstack/charm-cinder/commit/?id=cf6cd15b24ad35faa287333c58e0661475a84708
Submitter: Zuul
Branch: master

commit cf6cd15b24ad35faa287333c58e0661475a84708
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:36:37 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I342c06066b26ffa8240f076e0c9f461cae21b9c4

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-api (master)

Reviewed: https://review.openstack.org/527222
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-api/commit/?id=00b52d10b1e1f085fea38ba84303f9f07cc7ad5d
Submitter: Zuul
Branch: master

commit 00b52d10b1e1f085fea38ba84303f9f07cc7ad5d
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:37:06 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I6651ecdb89af11e94c59f928c1eb4a89940f4679

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.openstack.org/527223
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=373158b5cfb827359a1d8c821c30c1f2a934ebb5
Submitter: Zuul
Branch: master

commit 373158b5cfb827359a1d8c821c30c1f2a934ebb5
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:37:14 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I0a3a8f0dd2dedcc8e02dd6af2f5486501698833e

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (master)

Reviewed: https://review.openstack.org/527218
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=edad8b605412910aece9b9f9ae6806f70bd31be5
Submitter: Zuul
Branch: master

commit edad8b605412910aece9b9f9ae6806f70bd31be5
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:36:27 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I312dd56ecf55ad67485305e57f2807a5ea6975cd

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-openstack-dashboard (master)

Reviewed: https://review.openstack.org/527224
Committed: https://git.openstack.org/cgit/openstack/charm-openstack-dashboard/commit/?id=cad0fa0dcd42ac3e014cf192ca42e579676a3e6f
Submitter: Zuul
Branch: master

commit cad0fa0dcd42ac3e014cf192ca42e579676a3e6f
Author: David Ames <email address hidden>
Date: Mon Dec 11 11:37:24 2017 -0800

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: Ida7949113594b9b859ab7b4ba8b2bb440bab6e7d

Ryan Beisner (1chb1n)
Changed in charm-barbican:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Changed in charm-keystone:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-glance:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-cinder:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-neutron-api:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-nova-cloud-controller:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-ceilometer:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Changed in charm-swift-proxy:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Changed in charm-ceph-radosgw:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Changed in charm-openstack-dashboard:
assignee: nobody → David Ames (thedac)
status: Triaged → Fix Committed
Ryan Beisner (1chb1n)
Changed in charm-manila:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Changed in charm-aodh:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Changed in charm-designate:
assignee: nobody → David Ames (thedac)
importance: Undecided → Medium
milestone: none → 18.02
status: New → Fix Committed
Ryan Beisner (1chb1n) wrote :

The heat charm previously lacked the haproxy timeout controls, and that was resolved with https://review.openstack.org/#/c/526674/. With that landed, the default values should now be proposed against it.

Ryan Beisner (1chb1n) wrote :

FYI, heat charm change proposed @: https://review.openstack.org/#/c/530938/

Changed in charm-heat:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-heat (master)

Reviewed: https://review.openstack.org/530938
Committed: https://git.openstack.org/cgit/openstack/charm-heat/commit/?id=1817fe73465c494754968114844c1505b9336efd
Submitter: Zuul
Branch: master

commit 1817fe73465c494754968114844c1505b9336efd
Author: Ryan Beisner <email address hidden>
Date: Wed Jan 3 09:57:18 2018 -0500

    Update HAProxy default timeout values

    The default HAProxy timeout values are fairly strict. On a busy cloud
    it is common to exceed one or more of these timeouts. The only
    indication that HAProxy has exceeded a timeout and dropped the
    connection is errors such as "BadStatusLine" or "EOF." These can be
    very difficult to diagnose when intermittent.

    This charm-helpers sync pulls in the change to update the default
    timeout values to more real world settings. These values have been
    extensively tested in ServerStack. Configured values will not be
    overridden.

    Partial Bug: #1736171

    Change-Id: I5f602a8dc1ab1060696fd486beb66033efaae862

David Ames (thedac)
Changed in charm-heat:
status: In Progress → Fix Committed
Ryan Beisner (1chb1n)
Changed in charm-neutron-api:
status: Fix Committed → Fix Released
Changed in charm-keystone:
status: Fix Committed → Fix Released
Changed in charm-nova-cloud-controller:
status: Fix Committed → Fix Released
Changed in charm-cinder:
status: Fix Committed → Fix Released
Changed in charm-glance:
status: Fix Committed → Fix Released
Changed in charm-ceph-radosgw:
status: Fix Committed → Fix Released
Changed in charm-heat:
status: Fix Committed → Fix Released
Changed in charm-openstack-dashboard:
status: Fix Committed → Fix Released
Changed in charm-barbican:
status: Fix Committed → Fix Released
Changed in charm-ceilometer:
status: Fix Committed → Fix Released
Changed in charm-swift-proxy:
status: Fix Committed → Fix Released
Changed in charm-manila:
status: Fix Committed → Fix Released
Changed in charm-aodh:
status: Fix Committed → Fix Released
Changed in charm-designate:
status: Fix Committed → Fix Released