After a kube-manager failover scenario, KM repeatedly trying to delete stale VM object

Bug #1711274 reported by Vedamurthy Joshi
This bug affects 1 person

Affects            Status         Importance  Assigned to         Milestone
Juniper Openstack (status tracked in Trunk)
  R4.0             Fix Committed  High        Yuvaraja Mariappan
  Trunk            Fix Committed  High        Yuvaraja Mariappan

Bug Description

R4.0.1.0 Continuous Build 22 Ubuntu 16.04.2 containers

Logs: http://10.204.216.50/Docs/bugs/#

On a 3-controller, 3-kube-manager setup, I stopped all containers on one of the nodes and tried to delete and add 2 pods during that time.

Pod 3f43c2cf-826f-11e7-9ea8-525400010001 was one such busybox pod that I deleted.

The KM log had this:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_amqp.py", line 61, in _vnc_subscribe_callback
    self.vnc_subscribe_actions()
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_amqp.py", line 113, in vnc_subscribe_actions
    self.handle_delete()
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_amqp.py", line 197, in handle_delete
    self.obj_class.delete(obj_key)
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/config_db.py", line 477, in delete
    super(VirtualMachineKM, cls).delete(uuid)
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/config_db.py", line 192, in delete
    del cls._ann_fq_name_to_uuid[tuple(obj.ann_fq_name)]
KeyError: (u'default', u'k8s-default', u'k8s', u'Pod', u'busybox')
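The KeyError above comes from an unconditional `del` on the annotation cache when the entry has already been removed. A minimal sketch of a defensive variant (hypothetical class and attribute names mirroring the traceback; this is an illustration, not the shipped fix):

```python
class _VM(object):
    """Minimal stand-in for the cached VirtualMachineKM object."""
    def __init__(self, ann_fq_name):
        self.ann_fq_name = ann_fq_name


class VirtualMachineCache(object):
    """Hypothetical sketch of the KM object cache; not the actual fix."""
    _uuid_to_obj = {}
    _ann_fq_name_to_uuid = {}

    @classmethod
    def delete(cls, uuid):
        # pop() with a default never raises KeyError, unlike the bare
        # `del cls._ann_fq_name_to_uuid[...]` seen in the traceback.
        obj = cls._uuid_to_obj.pop(uuid, None)
        if obj is None:
            return  # already removed by a duplicate DELETED event
        cls._ann_fq_name_to_uuid.pop(tuple(obj.ann_fq_name), None)


vm = _VM(['default', 'k8s-default', 'k8s', 'Pod', 'busybox'])
VirtualMachineCache._uuid_to_obj['3f43c2cf'] = vm
VirtualMachineCache._ann_fq_name_to_uuid[tuple(vm.ann_fq_name)] = '3f43c2cf'

VirtualMachineCache.delete('3f43c2cf')
VirtualMachineCache.delete('3f43c2cf')  # second delete is a silent no-op
```

With bare `del`, the second delete would raise exactly the KeyError shown above and force the AMQP event to be retried.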

Kube-manager repeatedly tries to delete it, as shown below, even though the object no longer exists in Cassandra:

08/16/2017 07:01:24 PM [contrail-kube-manager]: __default__ [SYS_DEBUG]: KubeManagerDebugLog: VncPod - Got DELETED Pod None:None:3f43c2cf-826f-11e7-9ea8-525400010001
08/16/2017 07:01:24 PM [contrail-kube-manager]: VncKubernetes - <class 'cfgm_common.exceptions.NoIdError'>
Python 2.7.12: /usr/bin/python
Wed Aug 16 19:01:24 2017

A problem occurred in a Python script. Here is the sequence of
function calls leading up to the error, in the order they occurred.

 /usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_kubernetes.py in vnc_process(self=<kube_manager.vnc.vnc_kubernetes.VncKubernetes object>)
  374         uid = metadata.get('uid')
  375         if kind == 'Pod':
  376             self.pod_mgr.process(event)
  377         elif kind == 'Service':
  378             self.service_mgr.process(event)
self = <kube_manager.vnc.vnc_kubernetes.VncKubernetes object>
self.pod_mgr = <kube_manager.vnc.vnc_pod.VncPod object>
self.pod_mgr.process = <bound method VncPod.process of <kube_manager.vnc.vnc_pod.VncPod object>>
event = {'object': {'kind': 'Pod', 'metadata': {'labels': None, 'uid': u'3f43c2cf-826f-11e7-9ea8-525400010001'}}, 'type': 'DELETED'}

 /usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_pod.py in process(self=<kube_manager.vnc.vnc_pod.VncPod object>, event={'object': {'kind': 'Pod', 'metadata': {'labels': None, 'uid': u'3f43c2cf-826f-11e7-9ea8-525400010001'}}, 'type': 'DELETED'})
  496                 pod_namespace, pod_node, labels, vm_vmi)
  497             if vm:
  498                 self._network_policy_mgr.update_pod_np(pod_namespace, pod_id, labels)
  499         elif event['type'] == 'DELETED':
  500             self.vnc_pod_delete(pod_id)
self = <kube_manager.vnc.vnc_pod.VncPod object>
self.vnc_pod_delete = <bound method VncPod.vnc_pod_delete of <kube_manager.vnc.vnc_pod.VncPod object>>
pod_id = u'3f43c2cf-826f-11e7-9ea8-525400010001'

 /usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_pod.py in vnc_pod_delete(self=<kube_manager.vnc.vnc_pod.VncPod object>, pod_id=u'3f43c2cf-826f-11e7-9ea8-525400010001')
  409         # So explicitly update this entry in config db.
  410         if not vm.virtual_router:
  411             vm.update()
  412
  413         self._clear_label_to_pod_cache(vm)
vm = <kube_manager.vnc.config_db.VirtualMachineKM object>
vm.update = <bound method VirtualMachineKM.update of <kube_manager.vnc.config_db.VirtualMachineKM object>>

 /usr/lib/python2.7/dist-packages/kube_manager/vnc/config_db.py in update(self=<kube_manager.vnc.config_db.VirtualMachineKM object>, obj=None)
  451     def update(self, obj=None):
  452         if obj is None:
  453             obj = self.read_obj(self.uuid)
  454         self.name = obj['fq_name'][-1]
  455         self.fq_name = obj['fq_name']
obj = None
self = <kube_manager.vnc.config_db.VirtualMachineKM object>
self.read_obj = <bound method __metaclass__.read_obj of <class 'kube_manager.vnc.config_db.VirtualMachineKM'>>
self.uuid = u'3f43c2cf-826f-11e7-9ea8-525400010001'

 /usr/lib/python2.7/dist-packages/cfgm_common/vnc_db.py in read_obj(cls=<class 'kube_manager.vnc.config_db.VirtualMachineKM'>, uuid=u'3f43c2cf-826f-11e7-9ea8-525400010001', obj_type=None, fields=None)
  305     def read_obj(cls, uuid, obj_type=None, fields=None):
  306         ok, objs = cls._object_db.object_read(obj_type or cls.obj_type, [uuid],
  307                                               field_names=fields)
  308         if not ok:
  309             cls._logger.error(
field_names undefined
fields = None
/usr/lib/python2.7/dist-packages/cfgm_common/vnc_cassandra.py in wrapper(*args=('virtual_machine', [u'3f43c2cf-826f-11e7-9ea8-525400010001']), **kwargs={'field_names': None})
  466                     self._cassandra_init_conn_pools()
  467
  468                 return func(*args, **kwargs)
  469             except (AllServersUnavailable, MaximumRetryException) as e:
  470                 if self._conn_state != ConnectionStatus.DOWN:
func = <bound method VncCassandraClient.object_read of ..._common.vnc_cassandra.VncCassandraClient object>>
args = ('virtual_machine', [u'3f43c2cf-826f-11e7-9ea8-525400010001'])
kwargs = {'field_names': None}

 /usr/lib/python2.7/dist-packages/cfgm_common/vnc_cassandra.py in object_read(self=<cfgm_common.vnc_cassandra.VncCassandraClient object>, obj_type='virtual_machine', obj_uuids=[u'3f43c2cf-826f-11e7-9ea8-525400010001'], field_names=None, ret_readonly=False)
  813         if not obj_dicts:
  814             if len(obj_uuids) == 1:
  815                 raise NoIdError(obj_uuids[0])
  816             else:
  817                 return (True, [])
global NoIdError = <class 'cfgm_common.exceptions.NoIdError'>
obj_uuids = [u'3f43c2cf-826f-11e7-9ea8-525400010001']
<class 'cfgm_common.exceptions.NoIdError'>: Unknown id: 3f43c2cf-826f-11e7-9ea8-525400010001
    __dict__ = {'_unknown_id': u'3f43c2cf-826f-11e7-9ea8-525400010001'}
    _unknown_id = u'3f43c2cf-826f-11e7-9ea8-525400010001'
    args = ()
    message = ''

The above is a description of an error in a Python program. Here is
the original traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_kubernetes.py", line 376, in vnc_process
    self.pod_mgr.process(event)
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_pod.py", line 500, in process
    self.vnc_pod_delete(pod_id)
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/vnc_pod.py", line 411, in vnc_pod_delete
    vm.update()
  File "/usr/lib/python2.7/dist-packages/kube_manager/vnc/config_db.py", line 453, in update
    obj = self.read_obj(self.uuid)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_db.py", line 307, in read_obj
    field_names=fields)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_cassandra.py", line 468, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/vnc_cassandra.py", line 815, in object_read
    raise NoIdError(obj_uuids[0])
NoIdError: Unknown id: 3f43c2cf-826f-11e7-9ea8-525400010001

summary changed:
- After a kube-manager failover scenario, KM is trying to delete stale VM object
+ After a kube-manager failover scenario, KM repeatedly trying to delete stale VM object
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/34728
Submitter: Yuvaraja Mariappan

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/34729
Submitter: Yuvaraja Mariappan

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/34728
Committed: http://github.com/Juniper/contrail-controller/commit/fb82d364229a36eae325990c859ad787525d8f53
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit fb82d364229a36eae325990c859ad787525d8f53
Author: Yuvaraja Mariappan <email address hidden>
Date: Fri Aug 18 12:00:31 2017 -0700

Fixed delete issue in k8s fail-over

During the fail-over, the active kube-manager can get
the event in the rabbit-mq before the local sync is
finished, which can cause this issue. It is made sure
that the local sync is completely done before registering
to the rabbit-mq events.

Change-Id: Ic7114715d6fc8d817b30f0e0d31e0cd9cc1ea96b
Closes-bug: #1711274
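The ordering the commit message describes can be sketched as follows (hypothetical function and variable names, not the actual contrail-controller code): the cache is fully populated from the config DB first, so a DELETED event arriving over RabbitMQ can never reference a half-built cache.

```python
def start_kube_manager(db_objects, amqp_events):
    """Schematic of the fix's ordering: finish the local DB sync
    before consuming AMQP events. All names here are hypothetical."""
    cache = dict(db_objects)              # 1. full local sync from Cassandra
    for evt in amqp_events:               # 2. only then drain RabbitMQ
        if evt['type'] == 'DELETED':
            cache.pop(evt['uid'], None)   # tolerate already-gone objects
    return cache


remaining = start_kube_manager(
    {'pod-a': 'vm-a', 'pod-b': 'vm-b'},
    [{'type': 'DELETED', 'uid': 'pod-a'},
     {'type': 'DELETED', 'uid': 'pod-gone'}])  # stale uid is harmless
```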

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/34729
Committed: http://github.com/Juniper/contrail-controller/commit/73590c9bcab380e6ad328ee7df7d6835bcd54178
Submitter: Zuul (<email address hidden>)
Branch: master

commit 73590c9bcab380e6ad328ee7df7d6835bcd54178
Author: Yuvaraja Mariappan <email address hidden>
Date: Fri Aug 18 12:00:31 2017 -0700

Fixed delete issue in k8s fail-over

During the fail-over, the active kube-manager can get
the event in the rabbit-mq before the local sync is
finished, which can cause this issue. It is made sure
that the local sync is completely done before registering
to the rabbit-mq events.

Change-Id: Ic7114715d6fc8d817b30f0e0d31e0cd9cc1ea96b
Closes-bug: #1711274
