Comment 3 for bug 1765590

ping (itestitest) wrote:

Email thread with Suresh.

From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 9:27 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Ping,

I will be off tomorrow and will get back to you on this on Thursday. I don’t have an explanation for this behavior at this point.

From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 5:04 PM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Eventually I saw all the nodes stabilize and all the data reappear.

In [7]: len(dict(OBJ_FQ_NAME_TABLE.get_range()).keys())
Out[7]: 22

In [8]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[8]:
['service_appliance_set',
'domain',
'virtual_router',
'global_system_config',
'network_policy',
'route_table',
'service_appliance',
'network_ipam',
'config_node',
'namespace',
'bgp_router',
'analytics_node',
'service_template',
'api_access_list',
'discovery_service_assignment',
'qos_queue',
'database_node',
'route_target',
'global_vrouter_config',
'project',
'routing_instance',
'virtual_network']

What worries me is that it took at least an hour or so to repopulate all the data.
Right after I executed the recovery steps, the DB looked like this on all 3 nodes:

In [3]: len(dict(OBJ_FQ_NAME_TABLE.get_range()).keys())
Out[3]: 3

In [4]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[4]: ['virtual_network', 'virtual_router', 'discovery_service_assignment']

In [5]: dict(OBJ_FQ_NAME_TABLE.get_range())
Out[5]:
{'discovery_service_assignment': OrderedDict([('default-discovery-service-assignment:f3fe29b7-6cf6-4995-8d6d-232d681142d8', u'null')]),
'virtual_network': OrderedDict([('default-domain:default-project:__link_local__:1272b262-c8bb-477f-8951-73606e810c0e', u'null'), ('default-domain:default-project:default-virtual-network:366169d8-2b41-47f0-866c-f711d4a9ec94', u'null'), ('default-domain:default-project:ip-fabric:cca2c1e8-be8f-4f47-af3f-4ac06708da0d', u'null')]),
'virtual_router': OrderedDict([('default-global-system-config:comp47:c62cb4a4-f43a-4635-a67e-6f9a62876303', u'null'), ('default-global-system-config:comp48:815eec89-ba28-4916-a19b-f8f84eba6e62', u'null')])}

Suresh, can you help explain:

1. Why did it take such a long time to fully recover? (One way to force re-replication instead of waiting is sketched below.)
2. Does it indicate a real issue when the table shows only 3 keys on all 3 nodes, or is it just a display issue? It is very easy to reproduce in this test.
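
For what it's worth: assuming stock Cassandra behavior, a rebuilt node only regains data gradually through read repair and hinted handoff unless a repair is run explicitly, which may explain the slow repopulation. A hedged sketch; the keyspace name config_db_uuid is an assumption about where obj_fq_name_table lives in Contrail:

# Hypothetical: run on the rebuilt node (node2) so it streams missing
# replicas from its peers instead of waiting for read repair
nodetool repair config_db_uuid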

regards
ping

From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 6:24 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Sorry, what problems do you see with the approach that is documented?

From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 3:02 PM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

My test shows there are still problems.

Steps I took (sketched as shell commands below):

Stop the Cassandra DB on node2
Remove /var/lib/cassandra/* on node2
Run nodetool removenode <NODE2> from one of the other nodes
Start the Cassandra DB on node2
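
A sketch of those steps as commands. The service name contrail-database is an assumption (it may simply be cassandra on some installs), and nodetool removenode takes the Host ID reported by nodetool status rather than a hostname:

# On node2: stop Cassandra and wipe its data directory
service contrail-database stop
rm -rf /var/lib/cassandra/*

# On one of the surviving nodes: look up node2's Host ID, then remove it from the ring
nodetool status
nodetool removenode <host-id-of-node2>

# Back on node2: start Cassandra so it bootstraps fresh from its peers
service contrail-database start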

I waited for quite a while, and now I see this on all 3 nodes:

In [4]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[4]: ['virtual_network', 'virtual_router', 'discovery_service_assignment']

In [5]: dict(OBJ_FQ_NAME_TABLE.get_range())
Out[5]:
{'discovery_service_assignment': OrderedDict([('default-discovery-service-assignment:f3fe29b7-6cf6-4995-8d6d-232d681142d8', u'null')]),
'virtual_network': OrderedDict([('default-domain:default-project:__link_local__:1272b262-c8bb-477f-8951-73606e810c0e', u'null'), ('default-domain:default-project:default-virtual-network:366169d8-2b41-47f0-866c-f711d4a9ec94', u'null'), ('default-domain:default-project:ip-fabric:cca2c1e8-be8f-4f47-af3f-4ac06708da0d', u'null')]),
'virtual_router': OrderedDict([('default-global-system-config:comp47:c62cb4a4-f43a-4635-a67e-6f9a62876303', u'null'), ('default-global-system-config:comp48:815eec89-ba28-4916-a19b-f8f84eba6e62', u'null')])}

Regarding the customer’s steps, there are not many details. We only know that their node2 had an MTU issue for a long time; after the MTU issue was corrected and node2 joined the cluster, they started to see missing data.
I filed LP 1757441 for that.
https://bugs.launchpad.net/juniperopenstack/+bug/1757441

From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 2:47 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Ping,

Yes, try out these steps and let me know if you still face the issue. We also need to understand what procedure the customer followed.

Suresh

From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 11:26 AM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Suresh:
Thanks for looking into it.

I opened this PR because I saw an issue during my attempt to replicate ATT’s issue.
I may not be fully emulating what they were testing, but I’m wondering whether this is a legal operation at all.

What I did:
Stop node2
Remove /var/lib/cassandra/*
Start node2

This is to emulate the case where the data on the hard disk gets abruptly corrupted. The whole process lasted no more than 20 minutes.
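
A quick way to confirm node2 actually rejoined and that the cluster agrees on the schema, using standard nodetool subcommands (run from any node):

# node2 should be listed as UN (Up/Normal)
nodetool status
# all nodes should report a single schema version
nodetool describecluster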

As I understand it, the recommended steps apply when one node’s DB has been down for longer than 10 days (gc_grace_seconds)?
So are you suggesting the same recovery steps for my test too?
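
For reference, Cassandra’s default gc_grace_seconds is 864000 seconds, i.e. 10 days, which is presumably where that figure comes from. A hedged way to check the configured value; the keyspace name config_db_uuid and the system.schema_columnfamilies table are assumptions that hold for the Cassandra 2.x line commonly shipped with Contrail:

# Hypothetical check; adjust the host and keyspace to your deployment
cqlsh <node-ip> -e "SELECT columnfamily_name, gc_grace_seconds FROM system.schema_columnfamilies WHERE keyspace_name='config_db_uuid';"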

I can test that out and see if there is any difference.

Regards
ping

From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 1:51 PM
To: Ping Song <email address hidden>
Subject: https://bugs.launchpad.net/juniperopenstack/+bug/1765590

Ping,

Not sure if you had a chance to look into this in the context of https://bugs.launchpad.net/juniperopenstack/+bug/1765590.
Whom/where did you receive these recovery steps from?
Could you check whether, when they removed the data, they followed the steps below?
rm -rf /var/lib/cassandra/commitlog/*
rm -f /var/log/cassandra/status-up
This was the recommended approach from the analytics team, who were handling Cassandra.
https://github.com/Juniper/contrail-controller/wiki/Recovery-procedure-when-contrail-database-is-down-for-greater-than-gc_grace_seconds
-Suresh
BTW, this is in addition to the Cassandra data cleanup.
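
Piecing the thread together, the documented recovery appears to be the Cassandra data cleanup plus the two removals above. A hedged end-to-end sketch; the service name and exact data path are assumptions, so verify against the wiki page linked above:

# On the node being recovered
service contrail-database stop
rm -rf /var/lib/cassandra/data/*        # Cassandra data cleanup
rm -rf /var/lib/cassandra/commitlog/*   # step from the analytics team
rm -f /var/log/cassandra/status-up      # Contrail status marker file
service contrail-database start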