My concern is that it took at least an hour or so to repopulate all the data.
Right after I executed the recovery steps, the DB looked like this on all 3 nodes:
In [3]: len(dict(OBJ_FQ_NAME_TABLE.get_range()).keys())
Out[3]: 3
In [4]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[4]: ['virtual_network', 'virtual_router', 'discovery_service_assignment']
1. Why did it take such a long time to fully recover?
2. Does it indicate a real issue when the table shows only 3 keys on all 3 nodes, or is it just a display issue? It is very easy to reproduce in the test; one way to check is sketched below.
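One way to rule out a display issue is to read the table from each node separately and compare the key sets. A rough sketch, assuming pycassa, the usual Contrail keyspace and column-family names (config_db_uuid / obj_fq_name_table), Thrift on port 9160, and hypothetical node addresses:

import pycassa

NODES = ['node1:9160', 'node2:9160', 'node3:9160']  # hypothetical addresses

key_sets = {}
for node in NODES:
    # Point the pool at a single node so that node coordinates the read
    # (reads still run at consistency level ONE, so this is approximate).
    pool = pycassa.ConnectionPool('config_db_uuid', server_list=[node])
    cf = pycassa.ColumnFamily(pool, 'obj_fq_name_table')
    key_sets[node] = set(key for key, _ in cf.get_range(column_count=1))
    pool.dispose()

union = set.union(*key_sets.values())
for node, keys in sorted(key_sets.items()):
    print('%s: %d keys, missing: %s' % (node, len(keys), sorted(union - keys)))

If every node reports the same key set, the 3-key listing points to a timing problem rather than per-node data loss.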
Regarding the customer’s steps, there are not many details. We only know that their node2 had an MTU issue for a long time, and after the MTU issue was corrected and node2 rejoined the cluster, they started to see missing data.
I filed LP 1757441 for that. https://bugs.launchpad.net/juniperopenstack/+bug/1757441
I opened this PR because I saw an issue during my effort to replicate ATT’s issue.
I think I may not be fully emulating what they were testing, but I’m wondering whether this is a legal operation here.
What I did:
Stop node2
Remove /var/lib/cassandra/*
Start node2
This was to emulate the case where the data on the hard disk gets corrupted abruptly. The whole process lasted no more than 20 minutes.
As I understood it, the recommended steps apply to the condition where one node’s DB has been down for longer than 10 days?
So are you suggesting the same recovery steps in my test too?
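For reference, the 10-day figure matches Cassandra’s default gc_grace_seconds (864000 seconds). The actual value can be read per column family; a rough sketch, assuming pycassa, the config_db_uuid keyspace, and a hypothetical node address:

from pycassa.system_manager import SystemManager

sysm = SystemManager('node1:9160')  # hypothetical node address
for name, cf in sysm.get_keyspace_column_families('config_db_uuid').items():
    # Tombstones older than gc_grace_seconds are purged, so a node that was
    # down longer must be wiped and rebuilt rather than simply restarted.
    print('%s: gc_grace_seconds=%d' % (name, cf.gc_grace_seconds))
sysm.close()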
+ email threads with Suresh below.
From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 9:27 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Ping,
I will be off tomorrow and will get back to you on this on Thursday. I don’t have an explanation for this behavior at this point.
From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 5:04 PM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Eventually I saw all nodes stabilize and all the data show up:
In [7]: len(dict(OBJ_FQ_NAME_TABLE.get_range()).keys())
Out[7]: 22
In [8]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[8]:
['service_appliance_set',
 'domain',
 'virtual_router',
 'global_system_config',
 'network_policy',
 'route_table',
 'service_appliance',
 'network_ipam',
 'config_node',
 'namespace',
 'bgp_router',
 'analytics_node',
 'service_template',
 'api_access_list',
 'discovery_service_assignment',
 'qos_queue',
 'database_node',
 'route_target',
 'global_vrouter_config',
 'project',
 'routing_instance',
 'virtual_network']
My concern is that it took at least an hour or so to repopulate all the data.
Right after I executed the recovery steps, the DB looked like this on all 3 nodes:
In [3]: len(dict(OBJ_FQ_NAME_TABLE.get_range()).keys())
Out[3]: 3
In [4]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[4]: ['virtual_network', 'virtual_router', 'discovery_service_assignment']
In [5]: dict(OBJ_FQ_NAME_TABLE.get_range())
Out[5]:
{'discovery_service_assignment': OrderedDict([('default-discovery-service-assignment:f3fe29b7-6cf6-4995-8d6d-232d681142d8', u'null')]),
 'virtual_network': OrderedDict([('default-domain:default-project:__link_local__:1272b262-c8bb-477f-8951-73606e810c0e', u'null'), ('default-domain:default-project:default-virtual-network:366169d8-2b41-47f0-866c-f711d4a9ec94', u'null'), ('default-domain:default-project:ip-fabric:cca2c1e8-be8f-4f47-af3f-4ac06708da0d', u'null')]),
 'virtual_router': OrderedDict([('default-global-system-config:comp47:c62cb4a4-f43a-4635-a67e-6f9a62876303', u'null'), ('default-global-system-config:comp48:815eec89-ba28-4916-a19b-f8f84eba6e62', u'null')])}
Suresh, can you help explain:
1. Why did it take such a long time to fully recover? (A way to time this is sketched after these questions.)
2. Does it indicate a real issue when the table shows only 3 keys on all 3 nodes, or is it just a display issue? It is very easy to reproduce in the test.
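To put a number on question 1, the repopulation could be timed with a polling loop. A rough sketch, assuming pycassa, the config_db_uuid keyspace and obj_fq_name_table column family, and a hypothetical node address:

import time
import pycassa

pool = pycassa.ConnectionPool('config_db_uuid', server_list=['node1:9160'])
cf = pycassa.ColumnFamily(pool, 'obj_fq_name_table')

start = time.time()
prev = -1
while True:
    count = sum(1 for _ in cf.get_range(column_count=1))
    print('%ds elapsed: %d object types' % (time.time() - start, count))
    if count == prev:
        break  # no growth since the last poll; the table looks stable
    prev = count
    time.sleep(60)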
regards
ping
From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 6:24 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Sorry, what problems do you see with the approach that is documented?
From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 3:02 PM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
My test shows there are still problems.
Steps I took:
Stop Cassandra DB on node2
Remove /var/lib/cassandra/* on node2
Run nodetool removenode <NODE2> from the other nodes
Start Cassandra DB on node2
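Scripted, the removenode step might look like the sketch below; the node2 address is a hypothetical placeholder, and the Host ID is parsed out of the usual 'nodetool status' layout:

import re
import subprocess

DEAD_NODE_IP = '10.0.0.2'  # hypothetical address of node2

# 'nodetool status' prints one line per node, including its Host ID (a UUID).
status = subprocess.check_output(['nodetool', 'status']).decode()
for line in status.splitlines():
    if DEAD_NODE_IP in line:
        host_id = re.search(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}', line).group(0)
        # Run from a healthy node to evict the dead node from the ring.
        subprocess.check_call(['nodetool', 'removenode', host_id])
        break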
I’ve waited for quite a while, and now I see this on all 3 nodes:
In [4]: dict(OBJ_FQ_NAME_TABLE.get_range()).keys()
Out[4]: ['virtual_network', 'virtual_router', 'discovery_service_assignment']
In [5]: dict(OBJ_FQ_NAME_TABLE.get_range())
Out[5]:
{'discovery_service_assignment': OrderedDict([('default-discovery-service-assignment:f3fe29b7-6cf6-4995-8d6d-232d681142d8', u'null')]),
 'virtual_network': OrderedDict([('default-domain:default-project:__link_local__:1272b262-c8bb-477f-8951-73606e810c0e', u'null'), ('default-domain:default-project:default-virtual-network:366169d8-2b41-47f0-866c-f711d4a9ec94', u'null'), ('default-domain:default-project:ip-fabric:cca2c1e8-be8f-4f47-af3f-4ac06708da0d', u'null')]),
 'virtual_router': OrderedDict([('default-global-system-config:comp47:c62cb4a4-f43a-4635-a67e-6f9a62876303', u'null'), ('default-global-system-config:comp48:815eec89-ba28-4916-a19b-f8f84eba6e62', u'null')])}
Regarding the customer’s steps, there are not many details. We only know that their node2 had an MTU issue for a long time, and after the MTU issue was corrected and node2 rejoined the cluster, they started to see missing data.
I filed LP 1757441 for that.
https://bugs.launchpad.net/juniperopenstack/+bug/1757441
From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 2:47 PM
To: Ping Song <email address hidden>
Subject: Re: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Ping,
Yes, try out these steps and let me know if you face the issue. We also need to understand what procedure the customer followed.
Suresh
From: Ping Song <email address hidden>
Date: Tuesday, April 24, 2018 at 11:26 AM
To: Suresh Kumar Vinapamula Venkata <email address hidden>
Subject: RE: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Suresh:
Thanks for looking into it.
I opened this PR because I saw an issue during my effort to replicate ATT’s issue.
I think I may not be fully emulating what they were testing, but I’m wondering whether this is a legal operation here.
What I did:
Stop node2
Remove /var/lib/cassandra/*
Start node2
This was to emulate the case where the data on the hard disk gets corrupted abruptly. The whole process lasted no more than 20 minutes.
As I understood it, the recommended steps apply to the condition where one node’s DB has been down for longer than 10 days?
So are you suggesting the same recovery steps for my test too?
I can test that out and see if there is any difference.
Regards
ping
From: Suresh Kumar Vinapamula Venkata
Sent: Tuesday, April 24, 2018 1:51 PM
To: Ping Song <email address hidden>
Subject: https://bugs.launchpad.net/juniperopenstack/+bug/1765590
Ping,
Not sure if you had a chance to look into this in the AT&T case.
Whom/where did you receive these recovery steps from?
Could you check whether, when they removed the data, they followed the steps below?
rm -rf /var/lib/cassandra/commitlog/*
rm -f /var/log/cassandra/status-up
BTW, this is in addition to the Cassandra data cleanup.
This was the recommended approach from the analytics team who were handling Cassandra:
https://github.com/Juniper/contrail-controller/wiki/Recovery-procedure-when-contrail-database-is-down-for-greater-than-gc_grace_seconds
-Suresh
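For completeness, the cleanup Suresh describes could be scripted as below. A rough sketch: the service name is a hypothetical placeholder, and the paths are taken from the mail above.

import subprocess

# Hypothetical service name; use whatever manages Cassandra on the node.
subprocess.check_call(['service', 'contrail-database', 'stop'])
# Commit-log and status-file cleanup, in addition to the data cleanup.
subprocess.check_call('rm -rf /var/lib/cassandra/commitlog/*', shell=True)
subprocess.check_call('rm -f /var/log/cassandra/status-up', shell=True)
subprocess.check_call(['service', 'contrail-database', 'start'])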