rally wait_for_loadbalancers fails 50% of the time

Bug #1893446 reported by Jason Hobbs
Affects                  Status        Importance  Assigned to  Milestone
OpenStack Octavia Charm  Fix Released  High        Unassigned   21.01
charm-interface-ovsdb    Fix Released  High        Unassigned   21.01

Bug Description

Solutions QA runs rally's wait_for_loadbalancers test for Octavia. It fails about 50% of the time, with disconnection errors from the LBaaS service.

Example test runs can be found here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1893446

GetResourceFailure: Failed to get the resource ab09e820-e91c-4fcc-bab8-e5438f08b0de: Unable to establish connection to http://10.244.8.94:9876/v2.0/lbaas/loadbalancers/ab09e820-e91c-4fcc-bab8-e5438f08b0de: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Traceback (most recent call last):
  File "/snap/fcbtest/14/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/snap/fcbtest/14/lib/python3.6/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/snap/fcbtest/14/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/snap/fcbtest/14/usr/lib/python3.6/http/client.py", line 1356, in getresponse
    response.begin()
  File "/snap/fcbtest/14/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/snap/fcbtest/14/usr/lib/python3.6/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

+-----------------------------------------------------------------------------------------------------------------------------------------+
|                                                          Response Times (sec)                                                           |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                               | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| octavia.load_balancer_create         | 7.782     | 7.812        | 7.835        | 7.838        | 7.841     | 7.812     | 100.0%  | 2     |
| octavia.wait_for_loadbalancers       | 35.973    | 48.946       | 59.323       | 60.621       | 61.918    | 48.946    | 50.0%   | 2     |
|  -> octavia.load_balancer_show (x31) | 1.866     | 1.866        | 1.866        | 1.866        | 1.866     | 1.866     | 100.0%  | 1     |
|  -> octavia.load_balancer_show (x17) | 3.943     | 3.943        | 3.943        | 3.943        | 3.943     | 3.943     | 0.0%    | 1     |
| octavia.pool_create                  | 2.392     | 2.392        | 2.392        | 2.392        | 2.392     | 2.392     | 100.0%  | 1     |
| octavia.pool_delete                  | 0.116     | 0.116        | 0.116        | 0.116        | 0.116     | 0.116     | 100.0%  | 1     |
| total                                | 43.756    | 58.011       | 69.416       | 70.841       | 72.267    | 58.011    | 50.0%   | 2     |
|  -> duration                         | 43.756    | 58.011       | 69.416       | 70.841       | 72.267    | 58.011    | 50.0%   | 2     |
|  -> idle_duration                    | 0.0       | 0.0          | 0.0          | 0.0          | 0.0       | 0.0       | 50.0%   | 2     |
+--------------------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
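
As the breakdown above shows, octavia.wait_for_loadbalancers is implemented as a loop of octavia.load_balancer_show calls. A minimal sketch of that kind of polling loop (the client object and field names below are illustrative, not rally's actual API):

    import time

    def wait_for_loadbalancer(client, lb_id, timeout=300, interval=2):
        """Poll the load balancer until it goes ACTIVE or the timeout expires."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            # Each iteration is one GET /v2.0/lbaas/loadbalancers/<lb_id>,
            # i.e. one of the load_balancer_show sub-actions in the table above.
            lb = client.load_balancer_show(lb_id)
            status = lb["provisioning_status"]
            if status == "ACTIVE":
                return lb
            if status == "ERROR":
                raise RuntimeError("load balancer %s went to ERROR" % lb_id)
            time.sleep(interval)
        raise TimeoutError("load balancer %s not ACTIVE after %ss" % (lb_id, timeout))

In the failing iteration, one of these GETs is the request that dies with RemoteDisconnected, which is why the (x17) load_balancer_show sub-action reports 0.0% success.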

description: updated
Revision history for this message
Marian Gasparovic (marosg) wrote :

Actually, we found out that the 100% failure mentioned above (https://solutions.qa.canonical.com/openstack/testRun/1f0c6004-a5bf-4002-9d78-afd9365a1a6e) is something else, unrelated. But we have plenty of test runs failing at 50%: https://solutions.qa.canonical.com/bugs/bugs/bug/1893446

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Looking at the most recent run there, I see this on one of the Octavia units:

2020-10-20T13:24:46.755Z|00729|binding|INFO|Dropped 20 log messages in last 43 seconds (most recently, 35 seconds ago) due to excessive rate
2020-10-20T13:24:46.755Z|00730|binding|INFO|Not claiming lport 21b287ea-4595-441c-9915-1f2eef110670, chassis juju-bec91d-6-lxd-7.prodymcprodface.solutionsqa requested-chassis juju-bec91d-6-lxd-7

The same node also has a lot of failures in the octavia-worker log, most likely because it is unable to connect to any of the Amphorae.

The possible root causes for this are:
- The Octavia charm uses an incorrect hostname when creating its port in Neutron
- The Octavia unit is provided with the wrong FQDN at deploy time, making the Open vSwitch init script record an incorrect hostname in the Open_vSwitch table (bug 1896630)

Having 1 of 3 nodes not working properly would also plausibly explain the rubbish rally results.
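
For reference, the "Not claiming lport" message above shows the mismatch directly: the chassis registered as the FQDN (juju-bec91d-6-lxd-7.prodymcprodface.solutionsqa) while the port's requested-chassis is the short name (juju-bec91d-6-lxd-7), so ovn-controller refuses to claim the port. A minimal, purely local illustration of the shortname/FQDN distinction involved (it does not query OVN or Neutron):

    import socket

    short_name = socket.gethostname()  # e.g. "juju-bec91d-6-lxd-7"
    fqdn = socket.getfqdn()            # e.g. "juju-bec91d-6-lxd-7.prodymcprodface.solutionsqa"

    # ovn-controller registers its chassis under the hostname recorded in the
    # Open_vSwitch table (the FQDN in this deployment).  If the charm asks
    # Neutron for requested-chassis using the short name instead, the names
    # never match and the port is never claimed, which is the "Not claiming
    # lport" message above.
    print("short hostname:", short_name)
    print("fqdn:          ", fqdn)
    if fqdn == short_name:
        print("warning: FQDN did not resolve; only the short name is available here")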

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Since ovn-controller clearly has the correct notion of its FQDN, I think this points in the direction of the Octavia charm having the wrong information about what hostname to use.

Looking at the `ovsdb-subordinate` interface code, it sets the `available` state prior to having published anything, which may lead this part of the Octavia charm code to create a port with a shortname:
https://github.com/openstack/charm-octavia/blob/b6bdb1a12cb946ceabf73f078d5791072d513507/src/reactive/octavia_handlers.py#L142-L150

Previously the charm would self-heal / paper over this during an update-status hook, but as we have started to (rightfully) disable charm processing during the update-status hook, this no longer happens.

So I guess a viable path to fix this specific issue would be:
1) Fix the `ovsdb-subordinate` interface code so that it does not set the `available` state until the chassis name is available on the relation.
2) Change the Octavia charm behaviour when it finds its port down after initial creation. In addition to logging the event, we could put the charm into a `waiting` or `blocked` state and possibly wait for it to come up.
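
A rough sketch of what (1) amounts to, with plain functions standing in for the charms.reactive flag helpers and relation accessors (these names are illustrative, not the actual charm-interface-ovsdb code):

    # Illustrative only: gate the 'available' flag on the chassis name being
    # published, so consumers such as the Octavia charm never see 'available'
    # while the hostname information is still missing.

    def chassis_name(relation_data):
        """Return the chassis name published by the principal unit, if any."""
        return relation_data.get('chassis-name')

    def update_available_flag(relation_data, set_flag, clear_flag):
        # Before the fix the flag could be set before the chassis name was
        # published, so the Octavia charm could create its Neutron port with
        # a short hostname.  set_flag/clear_flag stand in for the reactive
        # framework's flag helpers.
        if chassis_name(relation_data):
            set_flag('ovsdb-subordinate.available')
        else:
            clear_flag('ovsdb-subordinate.available')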

The hook execution history in the log on the failed units supports this:
2020-10-20 10:37:47 INFO juju-log toggling port 21b287ea-4595-441c-9915-1f2eef110670 (admin_state_up: False status: DOWN)
2020-10-20 10:39:06 INFO juju-log Reactive main running for hook update-status
2020-10-20 10:43:34 INFO juju-log Reactive main running for hook update-status
2020-10-20 10:48:56 INFO juju-log Reactive main running for hook update-status
2020-10-20 10:53:08 INFO juju-log Reactive main running for hook update-status
2020-10-20 10:57:51 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:03:41 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:10:03 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:14:25 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:19:47 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:25:16 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:29:18 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:35:04 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:40:48 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:46:08 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:50:21 INFO juju-log Reactive main running for hook update-status
2020-10-20 11:55:49 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:00:21 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:04:39 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:09:48 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:13:57 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:19:39 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:25:37 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:31:27 INFO juju-log Reactive main running for hook update-status
2020-10-20 12:36:22 INFO juju-log Reactive...


Frode Nordahl (fnordahl)
Changed in charm-interface-ovsdb:
status: New → In Progress
importance: Undecided → High
Changed in charm-octavia:
status: New → Triaged
importance: Undecided → High
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-interface-ovsdb (master)

Reviewed: https://review.opendev.org/758967
Committed: https://git.openstack.org/cgit/x/charm-interface-ovsdb/commit/?id=93e40a6bd571030a4a561ad8928e57515a27f496
Submitter: Zuul
Branch: master

commit 93e40a6bd571030a4a561ad8928e57515a27f496
Author: Frode Nordahl <email address hidden>
Date: Wed Oct 21 10:39:08 2020 +0200

    ovsdb-subordinate: Gate available state on presence of chassis_name

    Change-Id: Id04d2c33fef71c345e1f591d9e5259a834262260
    Closes-Bug: #1893446

Changed in charm-interface-ovsdb:
status: In Progress → Fix Committed
Frode Nordahl (fnordahl)
Changed in charm-octavia:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-octavia:
milestone: none → 21.01
Changed in charm-interface-ovsdb:
milestone: none → 21.01
David Ames (thedac)
Changed in charm-octavia:
status: Fix Committed → Fix Released
Changed in charm-interface-ovsdb:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/c/openstack/charm-octavia/+/759185
Committed: https://opendev.org/openstack/charm-octavia/commit/762bfc6b29136eb94113baff8946a70aa1590706
Submitter: "Zuul (22348)"
Branch: master

commit 762bfc6b29136eb94113baff8946a70aa1590706
Author: Frode Nordahl <email address hidden>
Date: Thu Oct 22 08:17:39 2020 +0200

    Confirm operational status of mgmt port

    Put charm into a blocked state if the unit local port to mgmt.
    network does not come up.

    Add workaround in functional test that allows to re-enable
    bionic-ussuri gate. Drop proposed from bionic-ussuri
    bundles.

    Closes-Bug: #1893446
    Change-Id: I098e449723bce128e8c2efda1e9fafc9156a71e5
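
The charm-side change this commit describes is, in essence: after creating the unit's port on the management network, check that it actually comes up, and put the charm into a blocked state if it does not. A minimal sketch of that logic, where get_port and status_set are stand-ins (a Neutron port lookup returning a dict with a 'status' key, and charmhelpers-style workload status reporting), not the charm's actual helpers:

    import time

    def check_mgmt_port(get_port, status_set, port_id, timeout=300, interval=10):
        """Block the unit if the mgmt port never reaches ACTIVE."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            if get_port(port_id)['status'] == 'ACTIVE':
                return True
            time.sleep(interval)
        # Surfacing the failure as workload status (instead of only logging it)
        # makes a broken mgmt network visible to operators and to CI gates.
        status_set('blocked', 'Management port %s did not come up' % port_id)
        return False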
