The contrail-query-engine.log contains many occurrences of the errors below:
2018-05-08 Tue 04:19:02:013.662 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139828687521536, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [27bd89fe-ff11-401a-beb3-26ab144622f4] FAILED
2018-05-08 Tue 04:19:02:017.881 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139829287302912, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:017.928 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139829287302912, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [1f4568ed-52a3-4626-b459-8f96cc766d31] FAILED
2018-05-08 Tue 04:19:02:019.111 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830897915648, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:019.139 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830897915648, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [814b3ac3-f487-4af8-bc1e-f17a8f6a7d3e] FAILED
2018-05-08 Tue 04:19:02:020.258 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830876923648, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:020.299 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830876923648, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [fdaa1479-2f22-43b8-ac2d-44391a90387d] FAILED
2018-05-08 Tue 04:19:02:028.785 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830776293120, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:028.883 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830776293120, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [db1c7608-5a01-4e99-b86d-47ad7eea2989] FAILED
The "Operation timed out" error means that either Cassandra was under very heavy load or there were networking issues.
Could you also attach contrail-collector.log and the Cassandra logs from all the nodes, covering the time of the failure?
Also, if the system is still in the failed state, could you set up a debugging session? That would make it easier to figure out why Cassandra is in a bad state.
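If it helps to gauge how widespread the timeouts are before sending the logs, the error lines quoted above could be tallied with a rough sketch like the one below. The regular expressions are only inferred from the sample lines in this comment, not from any documented log format, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Line layout assumed from the log excerpt quoted above (not an official format):
# <date> <dow> <hh:mm:ss>:<ms>.<us> UTC <host> [Thread N, Pid N]: <func>:<file>:<line>: <msg>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \w{3} \d{2}:\d{2}:\d{2}):\d{3}\.\d{3} UTC "
    r"(?P<host>\S+) \[Thread (?P<thread>\d+), Pid (?P<pid>\d+)\]: "
    r"(?P<func>\w+):(?P<src>\S+?):(?P<line>\d+): (?P<msg>.*)$"
)
# Partition key of a failed SELECT, as it appears in the Db_GetMultiRow lines.
KEY_RE = re.compile(r"Partition Key: \[([0-9a-f-]+)\] FAILED")


def summarize(lines):
    """Return (timeouts per second, list of failed partition keys)."""
    timeouts_per_second = Counter()
    failed_keys = []
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        msg = match.group("msg")
        if "Operation timed out" in msg:
            timeouts_per_second[match.group("ts")] += 1
        key = KEY_RE.search(msg)
        if key:
            failed_keys.append(key.group(1))
    return timeouts_per_second, failed_keys
```

For example, `summarize(open("contrail-query-engine.log"))` would return the per-second timeout counts and the affected partition keys. A burst confined to a few seconds would point more towards a transient load spike or network blip than a persistently unhealthy node, although that is only a heuristic.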
-Miraj