After control node reboot on HA cluster, VMs are not reachable

Bug #1579177 reported by manishkn
This bug affects 1 person
Affects             Status          Importance  Assigned to                     Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed   Critical    Ignatious Johnson Christopher
  R2.22.x           Fix Committed   Critical    Ignatious Johnson Christopher
  Trunk             Invalid         Critical    Ignatious Johnson Christopher

Bug Description

The testbed is currently in the problem state.

Attaching the email thread

+Hari

Hi Hari,

It seems that after all three controllers were rebooted, ARP fails for the default gateway because the route for it is not in the VRF. Can you please take a look?

http://10.87.143.20:8086/Snh_Inet4UcRouteReq?x=3

Relevant ipam info
type:virtual-network-network-ipam
name:attr(default-domain:symantec.Tenant.0:tenant0.test_id1.Private_SNAT_VN0,default-domain:symantec.Tenant.0:tenant0.test_id1.ipam)

value
ipam-subnets
subnet ip-prefix:11.76.133.0 ip-prefix-len:24
default-gateway:11.76.133.1
dns-server-address:11.76.133.2
subnet-uuid:131c6228-abd8-4504-a47c-197cbec9ad24
enable-dhcp:true
addr_from_start:true
subnet-name:tenant0.test_id1.Private_SNAT_VN0_ipv4_subnet0

Adjacencies:
virtual-network default-domain:symantec.Tenant.0:tenant0.test_id1.Private_SNAT_VN0
network-ipam default-domain:symantec.Tenant.0:tenant0.test_id1.ipam

Thanks,
Senthil

_____________________________________________
From: Manish Krishnan
Sent: Thursday, May 05, 2016 10:20 PM
To: Senthilnathan Murugappan; Jeba Paulaiyan
Subject: VM's not reachable and all interfaces are down

Hi,

I was running the csol test with the SNAT and LBAS features on this HA cluster.
Now my setup is in a weird state where all the VMs are active but all the interfaces are down, and even the VM console is not reachable. This state was reached after rebooting all the control nodes.

Could you please take a look and check whether this is a real issue?

Version : 2.22.2-10
Setup : 99.1.1.4, .6 and .8 are control nodes
        99.1.1.11, .13, .22, .23, .24 and .25 are compute nodes

As these are private IPs, the nodes are reachable only via the jump host (10.87.143.20), so please connect to the jump host and then ssh to the nodes.

Thanks
Manish Krishnan

Observation from Hari
======================

type:virtual-network
name:default-domain:symantec.Tenant.0:tenant0.test_id1.Private_SNAT_VN0

Adjacencies:
instance-ip fd1dc175-55f7-4abe-84a0-9b11998659a3
virtual-machine-interface default-domain:symantec.Tenant.0:7b2e55ab-585f-4601-9a88-9f5ac85dabf9
virtual-network-network-ipam attr(default-domain:symantec.Tenant.0:tenant0.test_id1.Private_SNAT_VN0,default-domain:symantec.Tenant.0:tenant0.test_id1.ipam)

routing-instance adjacency is missing.

Regards,
Hari
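Hari's check can be reproduced as a rough sketch: take the adjacency listing of the virtual-network node and report which expected link types are absent. The listing is copied from the observation above. The `EXPECTED_TYPES` set and both function names are assumptions made for illustration and are not part of any Contrail tool.

```python
# Adjacency listing copied from the observation above (one "type name" pair per line).
ADJACENCIES = """\
instance-ip fd1dc175-55f7-4abe-84a0-9b11998659a3
virtual-machine-interface default-domain:symantec.Tenant.0:7b2e55ab-585f-4601-9a88-9f5ac85dabf9
virtual-network-network-ipam attr(default-domain:symantec.Tenant.0:tenant0.test_id1.Private_SNAT_VN0,default-domain:symantec.Tenant.0:tenant0.test_id1.ipam)
"""

# Assumption: link types a healthy virtual-network node is expected to carry.
EXPECTED_TYPES = {"routing-instance", "virtual-network-network-ipam"}


def adjacency_types(listing):
    """Return the set of node types named in an adjacency listing."""
    types = set()
    for line in listing.splitlines():
        line = line.strip()
        if line:
            # The node type is the first whitespace-separated token on the line.
            types.add(line.split(None, 1)[0])
    return types


def missing_adjacencies(listing, expected=EXPECTED_TYPES):
    """Return the expected node types that do not appear in the listing, sorted."""
    return sorted(set(expected) - adjacency_types(listing))


if __name__ == "__main__":
    print(missing_adjacencies(ADJACENCIES))
```

With the listing above, the only expected type not found is `routing-instance`, which matches the observation.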

schema traceback

<pre>Traceback (most recent call last):
  File "/usr/bin/contrail-schema", line 9, in <module>
    load_entry_point('schema-transformer==0.1dev', 'console_scripts', 'contrail-schema')()
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 3838, in server_main
    main()
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 3832, in main
    args)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/zkclient.py", line 297, in master_election
    self._election.run(self._zk_election_callback, func, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/kazoo/recipe/election.py", line 48, in run
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cfgm_common/zkclient.py", line 289, in _zk_election_callback
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 3811, in run_schema_transformer
    transformer = SchemaTransformer(args)
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 2603, in __init__
    ServiceChain.init()
  File "/usr/lib/python2.7/dist-packages/schema_transformer/to_bgp.py", line 1486, in init
    for (name, columns) in cls._cassandra.list_service_chain_uuid():
  File "/usr/lib/python2.7/dist-packages/pycassa/columnfamily.py", line 964, in get_range
    key_slices = self.pool.execute('get_range_slices', cp, sp, key_range, cl)
  File "/usr/lib/python2.7/dist-packages/pycassa/pool.py", line 577, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pycassa/pool.py", line 127, in new_f
    result = f(self, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pycassa/cassandra/Cassandra.py", line 757, in get_range_slices
    return self.recv_get_range_slices()
  File "/usr/lib/python2.7/dist-packages/pycassa/cassandra/Cassandra.py", line 783, in recv_get_range_slices
    raise result.ire
InvalidRequestException: InvalidRequestException(why='unconfigured columnfamily service_chain_uuid_table')
</pre>

manishkn (manishkn)
Changed in juniperopenstack:
milestone: none → r2.22.2
Jeba Paulaiyan (jebap)
information type: Proprietary → Public
tags: added: blocker
amit surana (asurana-t)
tags: added: soln
manishkn (manishkn)
description: updated
tags: added: config
removed: vrouter
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/20018
Submitter: Ignatious Johnson Christopher (<email address hidden>)

Ignatious Johnson Christopher (ijohnson-x) wrote :

This issue will not be seen in trunk and r3.0, so can you please remove those targets from the bug?

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/20034
Submitter: Ignatious Johnson Christopher (<email address hidden>)

Jeba Paulaiyan (jebap)
no longer affects: juniperopenstack/r3.0
no longer affects: juniperopenstack/trunk
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20018
Committed: http://github.org/Juniper/contrail-controller/commit/1e743ffd0e419281a9d303774d7348c6c84641b3
Submitter: Zuul
Branch: R2.22.x

commit 1e743ffd0e419281a9d303774d7348c6c84641b3
Author: Ignatious Johnson Christopher <email address hidden>
Date: Mon May 9 17:42:57 2016 +0000

Same type(tuple) is returned by the _get_routing_instance_from_route
method during early returns in case of failure.
Removing next_hop from si_dict if it is not in the route target list.

Closes-Bug: 1579177

Change-Id: Ifb3e127a398b02503193859b5b49a958d6d5348e
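
The two changes named in the commit message can be illustrated with a minimal sketch. This is not the code from contrail-controller: `lookup_routing_instance` is a hypothetical stand-in for `_get_routing_instance_from_route`, and the layout of `si_dict`, the dictionary keys, and the `allowed_next_hops` parameter are all assumptions. The point is only that every return path yields a tuple of the same shape, so a caller that unpacks two values does not fail on an early return.

```python
def lookup_routing_instance(next_hop, si_dict):
    """Return (service_instance_name, routing_instance_name).

    Hypothetical stand-in for _get_routing_instance_from_route; the layout
    of si_dict is an assumption made for this illustration.
    """
    entry = si_dict.get(next_hop)
    if entry is None:
        # Early return on failure keeps the same tuple shape as the success path.
        return (None, None)
    ri_name = entry.get("routing_instance")
    if ri_name is None:
        return (entry.get("service_instance"), None)
    return (entry.get("service_instance"), ri_name)


def prune_next_hops(si_dict, allowed_next_hops):
    """Drop si_dict entries whose next hop is not in allowed_next_hops."""
    for next_hop in list(si_dict):  # copy the keys so deleting inside the loop is safe
        if next_hop not in allowed_next_hops:
            del si_dict[next_hop]
    return si_dict


if __name__ == "__main__":
    si_dict = {
        "10.0.0.1": {"service_instance": "si1", "routing_instance": "ri1"},
        "10.0.0.2": {"service_instance": "si2"},
    }
    # Unpacking works on every path, including the failure cases.
    si_name, ri_name = lookup_routing_instance("10.0.0.9", si_dict)
    print(si_name, ri_name)
    print(sorted(prune_next_hops(si_dict, {"10.0.0.1"})))
```

Had the early return been a bare `None`, the two-value unpacking in the caller would raise a `TypeError`.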

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20034
Committed: http://github.org/Juniper/contrail-controller/commit/0d122d3ebdc59537aaebd45502b182dcf1123aba
Submitter: Zuul
Branch: R2.20

commit 0d122d3ebdc59537aaebd45502b182dcf1123aba
Author: Ignatious Johnson Christopher <email address hidden>
Date: Mon May 9 17:42:57 2016 +0000

Same type(tuple) is returned by the _get_routing_instance_from_route
method during early returns in case of failure.
Removing next_hop from si_dict if it is not in the route target list.

Closes-Bug: 1579177

Change-Id: Ifb3e127a398b02503193859b5b49a958d6d5348e
