Deleting a listener during failover causes it to go into ERROR (edge case)

Bug #1705764 reported by German Eichberger
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
octavia
Fix Released
High
German Eichberger

Bug Description

In one of my installations we ran a delete listener through what I assume was a failover:

2017-07-20 17:57:44.136 31862 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] request url https://172.30.241.140:9443/0.5/listeners/d59b51b3-236d-4bbe-9eff-6b6979e9b24b request /openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:259
2017-07-20 17:58:44.217 31862 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: ReadTimeout: HTTPSConnectionPool(host='172.30.241.140', port=9443): Read timed out. (read timeout=60.0)
2017-07-20 17:58:59.242 31862 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: ReadTimeout: HTTPSConnectionPool(host='172.30.241.140', port=9443): Read timed out. (read timeout=10.0)
2017-07-20 17:59:11.474 31862 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connected to amphora. Response: <Response [404]>: ReadTimeout: HTTPSConnectionPool(host='172.30.241.140', port=9443): Read timed out. (read timeout=10.0) request /openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:283
2017-07-20 17:59:11.475 31862 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: ConnectionError
2017-07-20 17:59:16.511 31862 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver [-] Connected to amphora. Response: <Response [404]>: ConnectionError request /openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:283
2017-07-20 17:59:16.528 31862 WARNING octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.amphora_driver_tasks.ListenerDelete' (ea3f4801-c323-4a63-9c31-e43b96099577) transitioned into state 'FAILURE' from state 'RUNNING'
2 predecessors (most recent first):
  Atom 'octavia.controller.worker.tasks.lifecycle_tasks.ListenerToErrorOnRevertTask' {'intention': 'EXECUTE', 'state': 'SUCCESS', 'requires': {'listener': <octavia.common.data_models.Listener object at 0x7f890af15590>}, 'provides': None}
  |__Flow 'octavia-delete-listener_flow': NotFound: Not Found
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker Traceback (most recent call last):
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker result = task.execute(**arguments)
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/controller/worker/tasks/amphora_driver_tasks.py", line 129, in execute
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker self.amphora_driver.delete(listener, loadbalancer.vip)
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 98, in delete
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker self._apply(self.client.delete_listener, listener)
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 89, in _apply
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker func(amp, listener.id, *args)
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 344, in delete_listener
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker return exc.check_exception(r)
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker File "/openstack/venvs/octavia-master/lib/python2.7/site-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 32, in check_exception
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker raise responses[status_code]()
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker NotFound: Not Found
2017-07-20 17:59:16.528 31862 ERROR octavia.controller.worker.controller_worker
2017-07-20 17:59:16.537 31862 DEBUG octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.amphora_driver_tasks.ListenerDelete' (ea3f4801-c323-4a63-9c31-e43b96099577) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /openstack/venvs/octavia-master/lib/python2.7/site-packages/taskflow/listeners/logging.py:194
2017-07-20 17:59:16.539 31862 WARNING octavia.controller.worker.tasks.amphora_driver_tasks [-] Reverting listener delete.

As seen getting a 404 while the listener is being built causes it to go into ERROR. Instead it should keep retrying...

Revision history for this message
Michael Johnson (johnsom) wrote :

This appears to be very old Octavia code. We highly recommend upgrading.

I would have expected to see this log:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/drivers/haproxy/rest_api_driver.py#L287

Which would give us more information on what was not found.

My suspicion is that this triggered:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/listener.py#L486

Likely because a listener create failed part way through. This would cause the listener to go into ERROR. If you followed that up with a delete call I could see how you could get stuck in an error loop here (under modern code).

We likely need a try block here:
https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/agent/api_server/listener.py#L268
To handle the case where the listener is not found and treat it as a success.

Or catch it here: https://github.com/openstack/octavia/blob/master/octavia/amphorae/drivers/haproxy/rest_api_driver.py#L354

Changed in octavia:
status: New → Triaged
importance: Undecided → High
Changed in octavia:
assignee: nobody → German Eichberger (german-eichberger)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to octavia (master)

Reviewed: https://review.openstack.org/487232
Committed: https://git.openstack.org/cgit/openstack/octavia/commit/?id=197bfa0c35fb88ada237e82539c0bbfe39d105d8
Submitter: Jenkins
Branch: master

commit 197bfa0c35fb88ada237e82539c0bbfe39d105d8
Author: German Eichberger <email address hidden>
Date: Tue Jul 25 15:30:57 2017 -0700

    Ignore 404 amphora error when deleting resources

    In case the same resource is deleted twice on the amphora
    ignore the 404 -- normally this should be covered by our
    API logic to only allow exectly one delete but this adds
    an additional fail safe.

    Per johnsom's suggestion also handle that on the amphora
    level.

    Closes-Bug: #1705764

    Change-Id: I2c5f2a4719c405d5a24acf76db40a90da67d8d17

Changed in octavia:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/octavia 1.0.0.0rc1

This issue was fixed in the openstack/octavia 1.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.