In my testbed, I hit this bug while repeatedly deleting and re-creating a ReplicationController with 40 pods and one Service whose Endpoints point to those pods.
The logs are below.
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.k8s_client [-] Exception response, headers: {'Date': 'Wed, 10 Jan 2018 04:51:51 GMT', 'Content-Length': '200', 'Content-Type': 'application/json', 'Cache-Control': 'no-store'}, content: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \"webserver-1\" not found","reason":"NotFound","details":{"name":"webserver-1","kind":"endpoints"},"code":404}
, text: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \"webserver-1\" not found","reason":"NotFound","details":{"name":"webserver-1","kind":"endpoints"},"code":404}
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {u'object': {u'kind': u'Endpoints', u'subsets': [{u'addresses': [{u'ip': u'10.0.6.1', u'targetRef': {u'kind': u'Pod', u'resourceVersion': u'35481', u'namespace': u'default', u'name': u'webserver-1-zq5bf', u'uid': u'e44aab3b-f5c1-11e7-ae32-fa163e7d22e2'}, u'nodeName': u'es2-vm-10-0-4-11.novalocal'}], u'ports': [{u'protocol': u'TCP', u'port': 8080}]}], u'apiVersion': u'v1', u'metadata': {u'name': u'webserver-1', u'namespace': u'default', u'resourceVersion': u'35482', u'creationTimestamp': u'2018-01-10T04:51:16Z', u'annotations': {u'openstack.org/kuryr-lbaas-spec': u'{"versioned_object.data": {"ip": "10.0.5.120", "lb_ip": null, "ports": [{"versioned_object.data": {"name": null, "port": 80, "protocol": "TCP"}, "versioned_object.name": "LBaaSPortSpec", "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version": "1.0"}], "project_id": "9243b6fce8704943805121f4992b7f5e", "security_groups_ids": ["3df3c214-2d29-468b-9c39-3adb645dcb88"], "subnet_id": "edc0fa91-e5c5-4e08-9b47-5dfa1bde709d", "type": "ClusterIP"}, "versioned_object.name": "LBaaSServiceSpec", "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version": "1.0"}'}, u'selfLink': u'/api/v1/namespaces/default/endpoints/webserver-1', u'uid': u'e44d32b1-f5c1-11e7-ae32-fa163e7d22e2'}}, u'type': u'MODIFIED'}: K8sClientException: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \"webserver-1\" not found","reason":"NotFound","details":{"name":"webserver-1","kind":"endpoints"},"code":404}
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging Traceback (most recent call last):
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/logging.py", line 37, in __call__
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging self._handler(event)
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 61, in __call__
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging self._handler(event)
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 60, in __call__
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging self.on_present(obj)
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 247, in on_present
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging self._set_lbaas_state(endpoints, lbaas_state)
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 569, in _set_lbaas_state
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging resource_version=endpoints['metadata']['resourceVersion'])
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/k8s_client.py", line 148, in annotate
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging raise exc.K8sClientException(response.text)
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging K8sClientException: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"endpoints \"webserver-1\" not found","reason":"NotFound","details":{"name":"webserver-1","kind":"endpoints"},"code":404}
2018-01-10 04:51:51.665 1 ERROR kuryr_kubernetes.handlers.logging
The bug sequence is as follows:
1. A pod is created.
2. The Endpoints object is changed.
3. Neutron LBaaS is synced with the LBaaSSpec (a high-latency job)
-> The LoadBalancer, Listener, Pool, and Member are created sequentially.
Step [3] takes a long time on the Neutron server. If the Kubernetes Endpoints object is deleted before the LBaaS resources are fully created, setting the annotation fails because the Kubernetes object is already gone, as the log above shows.
This leaves the created LBaaS resources orphaned, because there is no rollback logic anywhere in the code.
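The missing rollback could look roughly like the sketch below. This is a minimal illustration with hypothetical names (`driver`, `annotate`, `sync_lbaas`, `NotFoundError`), not kuryr's actual handler or driver API: on annotation failure, every LBaaS resource created during this pass is released in reverse order before the exception is re-raised.

```python
class NotFoundError(Exception):
    """Raised when the Kubernetes object backing the annotation is gone."""


def sync_lbaas(endpoints, driver, annotate):
    """Create LBaaS resources for `endpoints`, rolling back on failure.

    `driver` and `annotate` are hypothetical stand-ins for the Neutron
    LBaaS driver and the Kubernetes annotate call.
    """
    created = []  # resources created in this pass, newest last
    try:
        # Mirrors the sequential creation in step [3] above.
        for kind in ("loadbalancer", "listener", "pool", "member"):
            created.append(driver.create(kind, endpoints))
        annotate(endpoints, created)  # fails if the Endpoints object is gone
    except NotFoundError:
        # Roll back in reverse creation order so dependent resources
        # (member, pool, listener) go before the loadbalancer itself.
        for resource in reversed(created):
            driver.release(resource)
        raise
    return created
```

The key design point is releasing in reverse creation order, since Neutron will refuse to delete a loadbalancer that still has listeners or pools attached.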
I have hit this bug many times, and I patched rollback code into the LBaaS handler (run after the annotation fails).
Has anyone else experienced this, or does anyone know a better solution?
If this is a genuine bug in kuryr, I'd like to contribute my patch.