Delete Health Monitor makes LB go into ERROR operating status

Bug #1547609 reported by Franklin Naval
This bug affects 1 person
Affects: octavia
Status: Fix Released
Importance: Critical
Assigned to: Stephen Balukoff
Milestone: (none)

Bug Description

1. Create a LB with 2 members.
2. Attach a health monitor.
3. Delete the health monitor
4. Continue polling (GET) the LB for details

Result: LB goes into DEGRADED then ERROR operating status
Expected: LB should go into ACTIVE operating status.

To reproduce: build devstack with the following change applied: https://review.openstack.org/#/c/282130/, then run:

$ tox -e scenario
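
Step 4 can also be done against the raw v1 API directly; a minimal polling sketch (the load balancer UUID is a placeholder, and the endpoint matches the one used in the comments below):

export LB=<loadbalancer-uuid>
while true; do
  # print only the two status fields from the GET response
  curl -s http://localhost:9876/v1/loadbalancers/$LB | \
    python -c 'import json,sys; lb=json.load(sys.stdin); print(lb["provisioning_status"], lb["operating_status"])'
  sleep 2
done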

description: updated
Changed in octavia:
importance: Undecided → Critical
tags: added: target-mitaka
Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

I'll see about patching this at the end of the L7 chain. I don't think this is a problem with shared pools, as nothing in shared pools really touched health monitors... and nothing in the L7 chain does either. But since I want to avoid disturbing anyone reviewing the L7 chain (I suspect we are very close to merging it), I might as well fix this problem with the assumption that it has to work with L7, too.

Changed in octavia:
assignee: nobody → Stephen Balukoff (sbalukoff)
Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

Ok, I've been trying to reproduce this one on the command line and I'm not able to do so. Here's *exactly* what I've been running, and it appears to be working fine:

neutron port-create private
export PORT_ID=8d90f3e4-7302-449c-a806-600de641afa5
export SUBNET_ID=3ef7d8d2-74f1-4f58-9318-d6e11ff054b5

curl -X POST -H Content-type:application/json -d "{\"name\": \"test_lb\", \"vip\": {\"ip_address\": \"10.0.0.3\", \"port_id\": \"$PORT_ID\", \"subnet_id\": \"$SUBNET_ID\"}}" http://localhost:9876/v1/loadbalancers
export LB=98db71ee-7f18-4319-bc37-48ea7984824e

curl -X POST -H Content-type:application/json -d '{"name": "test_listener", "protocol": "HTTP", "protocol_port": 80}' http://localhost:9876/v1/loadbalancers/$LB/listeners
export LISTENER=ffb8d376-284d-4cdc-b99c-82e885bfed45

curl -X POST -H Content-type:application/json -d '{"name":"test_pool1", "protocol": "HTTP", "lb_algorithm": "ROUND_ROBIN"}' http://localhost:9876/v1/loadbalancers/$LB/listeners/$LISTENER/pools
export POOL1=d302ee28-84dd-4043-8e43-36560563b378

curl -X POST -H Content-type:application/json -d '{"ip_address": "10.0.0.50", "protocol_port": 81}' http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/members
curl -X POST -H Content-type:application/json -d '{"ip_address": "10.0.0.51", "protocol_port": 80}' http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/members

curl -X POST -H Content-type:application/json -d '{"type": "HTTP", "delay": 5, "timeout": 10, "fall_threshold": 4, "rise_threshold": 2}' http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/healthmonitor

curl -X DELETE http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/healthmonitor

...and the load balancer never went into an ERROR state, the controller worker never reported an error, and I was able to watch the haproxy.cfg that got generated on the amphora and see the health monitor appear and then disappear from the amphora. (Did it a second and third time just to be sure.)
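
For what it's worth, between each of those steps the load balancer status can be re-checked with a plain GET; the response includes both provisioning_status and operating_status:

curl http://localhost:9876/v1/loadbalancers/$LB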

Please note that I tried reproducing this problem by just hitting the raw API because I wanted to remove the extra complication of the neutron-lbaas system and CLI (i.e. to make sure the bug we're seeing is truly a problem with Octavia). I notice that you've directed me to try reproducing the problem using your new tempest testing code. Given that this is literally the first tempest test you've written for the code, it seems rather likely to me that the problem is somewhere therein, especially since it doesn't seem to occur when the raw API is hit directly.

Franklin-- can you show me the exact steps I would need to take to reproduce this problem on the command line using curl or something as simple as that, so that we can narrow down where the problem really lies?

Changed in octavia:
status: New → Incomplete
Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

Ok, I just tried this with the neutron-lbaas CLI as well, and it all worked fine again. Here is exactly what I did:

neutron subnet-list
neutron lbaas-loadbalancer-create --name lb1 public-subnet
# Repeat the next command until the load balancer shows being ACTIVE:
neutron lbaas-loadbalancer-list
neutron lbaas-listener-create --name listener1 --loadbalancer lb1 --protocol HTTP --protocol-port 80
neutron lbaas-pool-create --name pool1 --lb-algorithm ROUND_ROBIN --listener listener1 --protocol HTTP
neutron lbaas-member-create --subnet private-subnet --address 10.0.0.50 --protocol-port 80 pool1
neutron lbaas-member-create --subnet private-subnet --address 10.0.0.51 --protocol-port 80 pool1
neutron lbaas-healthmonitor-create --delay 5 --max-retries 4 --timeout 10 --type HTTP --pool pool1
neutron lbaas-healthmonitor-delete 9bfec150-cf70-4579-8893-807c3e711fb3

(Again, repeated the healthmonitor stuff a few times for good measure-- it never errored out.)
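
For completeness, the same thing can be checked from the neutron-lbaas side; something like:

neutron lbaas-loadbalancer-show lb1

which includes both the provisioning_status and operating_status fields.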

Revision history for this message
Bhaargavi Natarajan (bnatara) wrote :

Just tried this and could not reproduce.

Revision history for this message
Franklin Naval (franknaval) wrote :

Tested this again today and the LB issue is still happening with the Octavia API.

Here are the logs:
https://gist.githubusercontent.com/fnaval/70186ceb929a5673c419/raw/263b632aa6d62aa5f296217410686d31e56525df/gistfile1.txt

Test from: https://review.openstack.org/#/c/282130
        self._create_servers()
        self._start_servers()
        self._create_load_balancer()
        self._create_health_monitor()
        self._check_load_balancing()
        # stopping the primary server
        self._stop_server()
        # Asserting the traffic is sent only to the secondary server
        self._traffic_validation_after_stopping_server()

You may need to stop one of the members prior to deleting the health monitor, if that makes any difference.
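
A rough manual equivalent of that sequence against the raw API might look like the following (the server name and UUID variables are placeholders; nova is only used to stop the back-end instance):

# stop the back-end server that one of the pool members points at
nova stop <backend-server-name>
# then delete the health monitor and re-check the load balancer
curl -X DELETE http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/healthmonitor
curl http://localhost:9876/v1/loadbalancers/$LB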

Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

Franklin: That makes all the difference!

When the health monitor is active and you delete a member's back-end server using the Nova API (as you have here), that doesn't automatically remove the member from the pool's configuration. The amphora is therefore still going to try to connect to the deleted server's IP address with the health monitor. When this fails, that member is reported as being in an 'ERROR' operating status. If that member was the last member of the pool, the pool goes to ERROR operating status as well, and this bubbles up to the listener and then to the load balancer.

So yes, if all the members of a pool in use are down, the parent load balancer is also going to show an ERROR operating status.

Also, even with the load balancer in an ERROR operating state, so long as its provisioning status is ACTIVE, you should still be able to do things like delete the load balancer, delete the member, etc.

Have you tried deleting the member from the pool before deleting the member server using the Nova API?

Revision history for this message
Franklin Naval (franknaval) wrote :

Stephen - yes, that's what I'm seeing (see partial logs below). Deleting the health monitor with a member that is unreachable causes the load balancer to go into an ERROR operating status. Is this by design?

I spoke with Brandon and he said that there is no need to delete the member as deleting a pool does a cascading member delete.

Though, I will experiment with deleting the members prior to deleting the health monitor, or change the test logic so that an ERROR operating status is acceptable if that is by design.

LOGS:
...
2016-03-10 04:03:57,943 25121 INFO [tempest.lib.common.rest_client] Request (TestHealthMonitorBasic:_run_cleanups): 202 DELETE http://127.0.0.1:9876/v1/loadbalancers/16746cee-28ea-49f8-a5af-e0447a4677f3/pools/562c6ec7-8726-4b99-8909-9da55ade1b64/healthmonitor 0.110s
2016-03-10 04:03:57,943 25121 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Thu, 10 Mar 2016 04:03:57 GMT', 'status': '202', 'content-length': '0', 'server': 'WSGIServer/0.1 Python/2.7.6'}
        Body:
2016-03-10 04:03:57,989 25121 INFO [tempest.lib.common.rest_client] Request (TestHealthMonitorBasic:_run_cleanups): 200 GET http://127.0.0.1:9876/v1/loadbalancers/16746cee-28ea-49f8-a5af-e0447a4677f3 0.045s
2016-03-10 04:03:57,989 25121 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '200', 'content-length': '329', 'content-location': 'http://127.0.0.1:9876/v1/loadbalancers/16746cee-28ea-49f8-a5af-e0447a4677f3', 'server': 'WSGIServer/0.1 Python/2.7.6', 'date': 'Thu, 10 Mar 2016 04:03:57 GMT', 'content-type': 'application/json; charset=UTF-8'}
        Body: {"description": null, "provisioning_status": "PENDING_UPDATE", "enabled": true, "vip": {"subnet_id": "995fd9c4-3e5d-4942-b8fe-4632d2087993", "port_id": "83181505-98fb-4fce-a22d-0983dd08e575", "ip_address": "10.100.0.5"}, "project_id": null, "id": "16746cee-28ea-49f8-a5af-e0447a4677f3", "operating_status": "ERROR", "name": null}
2016-03-10 04:03:57,990 25121 INFO [octavia.tests.tempest.v1.scenario.base] provisioning_status: PENDING_UPDATE operating_status: ERROR
...

Revision history for this message
Franklin Naval (franknaval) wrote :

Attempted this again, deleting the members prior to deleting the health monitor.
On deletion of the members, a GET on the load balancer also returns:
provisioning_status: PENDING_UPDATE operating_status: ERROR

logs:
...
2016-03-10 04:48:32.429 9394 INFO tempest.lib.common.rest_client [-] Request (TestHealthMonitorBasic:_run_cleanups): 202 DELETE http://127.0.0.1:9876/v1/loadbalancers/4b915b49-0f6f-4b6f-884a-97bdbcce5687/pools/4d13551c-3418-4af9-b246-cf8dc629e37d/members/f06211e3-e4bd-4b6a-94c0-56e3bc5875f7 0.105s
2016-03-10 04:48:32.429 9394 DEBUG tempest.lib.common.rest_client [-] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Thu, 10 Mar 2016 04:48:32 GMT', 'status': '202', 'content-length': '0', 'server': 'WSGIServer/0.1 Python/2.7.6'}
        Body: _log_request_full /opt/stack/tempest/tempest/lib/common/rest_client.py:414
2016-03-10 04:48:32.475 9394 INFO tempest.lib.common.rest_client [-] Request (TestHealthMonitorBasic:_run_cleanups): 200 GET http://127.0.0.1:9876/v1/loadbalancers/4b915b49-0f6f-4b6f-884a-97bdbcce5687 0.046s
2016-03-10 04:48:32.476 9394 DEBUG tempest.lib.common.rest_client [-] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '200', 'content-length': '329', 'content-location': 'http://127.0.0.1:9876/v1/loadbalancers/4b915b49-0f6f-4b6f-884a-97bdbcce5687', 'server': 'WSGIServer/0.1 Python/2.7.6', 'date': 'Thu, 10 Mar 2016 04:48:32 GMT', 'content-type': 'application/json; charset=UTF-8'}
        Body: {"description": null, "provisioning_status": "PENDING_UPDATE", "enabled": true, "vip": {"subnet_id": "55cc15ad-5090-4d5e-9a89-e73cbd5401d6", "port_id": "1e8df5b2-c36d-45e6-a1e2-d45cdbfb98cd", "ip_address": "10.100.0.5"}, "project_id": null, "id": "4b915b49-0f6f-4b6f-884a-97bdbcce5687", "operating_status": "ERROR", "name": null} _log_request_full /opt/stack/tempest/tempest/lib/common/rest_client.py:414
2016-03-10 04:48:32.476 9394 INFO octavia.tests.tempest.v1.scenario.base [-] provisioning_status: PENDING_UPDATE operating_status: ERROR
...

Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

Franklin: There is no inherent or automatic link between a pool member and a back-end Nova server instance. This means that if you delete the Nova instances that are serving the back-end requests, the pool member objects for them will remain and will eventually show an 'ERROR' status once the health monitor's fall threshold has been passed.

If all the members from a pool are showing ERROR status, then it makes sense that the pool, and its parent listener, and its parent load balancer will show an ERROR operating status. (Note that they should still show 'ACTIVE' for the provisioning status until you try to change a load balancer, pool, listener, member, health monitor or other object.)

Maybe I'm not understanding what you're trying to report here: Are you talking about the operating status or provisioning status?

In any case, it is expected that if a load balancer has at least one listener with at least one pool with all of its members down, then all of the above should show an ERROR operating status.

If you want to avoid the ERROR operating status, you must delete the pool members before you delete the nova back-end server instances.
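
In other words, a teardown order roughly like this (UUIDs are placeholders; the member DELETE endpoint is the same one visible in the logs above):

# 1. remove the members from the pool first
#    (wait for the load balancer's provisioning_status to return to ACTIVE between calls)
curl -X DELETE http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/members/$MEMBER1
curl -X DELETE http://localhost:9876/v1/loadbalancers/$LB/pools/$POOL1/members/$MEMBER2
# 2. only then delete the back-end nova server instances
nova delete <backend-server-1>
nova delete <backend-server-2>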

Does this make sense?

Revision history for this message
Franklin Naval (franknaval) wrote :

Stephen: Yes, it does make sense and that's what I've been testing. The ERROR operating status occurs when deleting either a pool member (before deleting the nova servers) or the health monitor.

This happens after putting a nova server into "SHUTDOWN" status. Should I bring the server back to "ACTIVE" status prior to performing the delete?

Revision history for this message
Stephen Balukoff (sbalukoff) wrote :

Franklin: Consider what the haproxy instance is doing with its health monitoring: It's sending a web request to every member in the pool that has a health monitor on it to determine whether that member is responsive. So from that perspective, a back-end nova instance that has been deleted, and one that has been merely shut down look exactly the same.

Now: If the back-end server is up and running and responding to requests, and you delete the member from a pool... and THAT results in the pool / listener / load balancer going into an error operating status... that *would* be unexpected and probably the symptom of a larger bug. Is that what's going on here?

Revision history for this message
Franklin Naval (franknaval) wrote :

So, it looks like it's an issue with the state transition of the members (bubbling up all the way to the load balancer) when adding a health monitor. I've written a separate bug for that issue, in case this issue persists after the fix.

see: https://bugs.launchpad.net/octavia/+bug/1555854
offending code: https://github.com/openstack/octavia/blob/master/octavia/controller/healthmanager/update_db.py#L168

tags: removed: target-mitaka
Revision history for this message
Michael Johnson (johnsom) wrote :

Re-tested this with Pike master. I cannot reproduce this behavior and feel it has been fixed with the v2 API work.
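
For anyone re-verifying against a current build, a rough v2 equivalent of the original scenario via the OpenStack client (assuming python-octaviaclient is installed; subnet names and the health monitor ID are placeholders) would be something like:

openstack loadbalancer create --name lb1 --vip-subnet-id private-subnet
# wait for lb1 to go ACTIVE between the following calls
openstack loadbalancer listener create --name listener1 --protocol HTTP --protocol-port 80 lb1
openstack loadbalancer pool create --name pool1 --lb-algorithm ROUND_ROBIN --listener listener1 --protocol HTTP
openstack loadbalancer member create --subnet-id private-subnet --address 10.0.0.50 --protocol-port 80 pool1
openstack loadbalancer member create --subnet-id private-subnet --address 10.0.0.51 --protocol-port 80 pool1
openstack loadbalancer healthmonitor create --delay 5 --timeout 10 --max-retries 4 --type HTTP pool1
openstack loadbalancer healthmonitor delete <healthmonitor-id>
openstack loadbalancer show lb1

The load balancer should stay out of the ERROR operating status throughout, as described above.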

Changed in octavia:
status: Incomplete → Fix Released