error: Error Run Query: REST Server Error: EIO

Bug #1768360 reported by ping
This bug affects 4 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Juniper Openstack | Status tracked in Trunk | | | |
| R3.0 | Incomplete | High | ping | |
| R3.0.3.x | Incomplete | High | ping | |
| R3.1 | Incomplete | High | ping | |
| R3.2 | Incomplete | High | ping | |
| R4.0 | Incomplete | High | ping | |
| R4.1 | Incomplete | High | ping | |
| R5.0 | Incomplete | High | ping | |
| Trunk | Incomplete | High | ping | |

Bug Description

After upgrading from 3033.22 to 3280.66, ATT sees these errors in the webui log:

    cs. {"start_time":1525190508120000,"end_time":1525191108120000,"select_fields":["name","fields.value"],"table":"StatTable.FieldNames.fields","where":[[{"name":"name","value":"MessageTable","op":7}]]}
    05/01/2018 04:13:17 PM - error: URL [http://172.29.6.30:9081/analytics/query] returned error ["EIO"]
    05/01/2018 04:13:17 PM - error: Error Run Query: REST Server Error: EIO
        at APIServer.retryMakeCall (/usr/src/contrail/contrail-web-core/src/serverroot/common/rest.api.js:203:13)
        at Request.<anonymous> (/usr/src/contrail/contrail-web-core/src/serverroot/common/rest.api.js:336:18)
        at Request.emit (events.js:98:17)
        at Request.mixin._fireSuccess (/usr/lib64/node_modules/restler/lib/restler.js:226:10)
        at /usr/lib64/node_modules/restler/lib/restler.js:157:20
        at IncomingMessage.parsers.auto (/usr/lib64/node_modules/restler/lib/restler.js:390:7)
        at Request.mixin._encode (/usr/lib64/node_modules/restler/lib/restler.js:194:29)
        at /usr/lib64/node_modules/restler/lib/restler.js:153:16
        at Request.mixin._decode (/usr/lib64/node_modules/restler/lib/restler.js:169:7)
        at IncomingMessage.<anonymous> (/usr/lib64/node_modules/restler/lib/restler.js:146:14)
    05/01/2018 04:13:17 PM - error: REST Server Error: EIO
        at APIServer.retryMakeCall (/usr/src/contrail/contrail-web-core/src/serverroot/common/rest.api.js:203:13)
        at Request.<anonymous> (/usr/src/contrail/contrail-web-core/src/serverroot/common/rest.api.js:336:18)
        at Request.emit (events.js:98:17)
        at Request.mixin._fireSuccess (/usr/lib64/node_modules/restler/lib/restler.js:226:10)
        at /usr/lib64/node_modules/restler/lib/restler.js:157:20
        at IncomingMessage.parsers.auto (/usr/lib64/node_modules/restler/lib/restler.js:390:7)
        at Request.mixin._encode (/usr/lib64/node_modules/restler/lib/restler.js:194:29)
        at /usr/lib64/node_modules/restler/lib/restler.js:153:16
        at Request.mixin._decode (/usr/lib64/node_modules/restler/lib/restler.js:169:7)
        at IncomingMessage.<anonymous> (/usr/lib64/node_modules/restler/lib/restler.js:146:14)

Meanwhile, in the web GUI, Monitor -> Infrastructure -> Config Node -> Console is paused with "EIO error".

![alp1b-2](https://user-images.githubusercontent.com/2038044/39495629-0b14568a-4d69-11e8-996c-8a5de1dbe533.JPG)
![alp1b_ah287m_1524744131173](https://user-images.githubusercontent.com/2038044/39495642-1320ab62-4d69-11e8-9c74-b71f4c5b3bf0.png)
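
To check whether the error comes from the analytics backend rather than from the web UI, the logged query could be replayed directly against the analytics REST endpoint. The following is a minimal sketch, not a confirmed procedure: the URL and query body are copied from the webui log above, while the helper names (`build_query`, `build_request`, `run_query`), the use of a JSON POST without authentication, and the 30-second timeout are assumptions for illustration.

```python
import json
import urllib.request

# URL taken from the webui log above; adjust for your own cluster.
ANALYTICS_QUERY_URL = "http://172.29.6.30:9081/analytics/query"


def build_query():
    """Return the query body exactly as it appears in the webui log."""
    return {
        "start_time": 1525190508120000,  # microseconds since the epoch
        "end_time": 1525191108120000,
        "select_fields": ["name", "fields.value"],
        "table": "StatTable.FieldNames.fields",
        "where": [[{"name": "name", "value": "MessageTable", "op": 7}]],
    }


def build_request(url, query):
    """Build a JSON POST request (assumption: no auth headers are needed)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(url, query, timeout=30):
    """Send the query and return (HTTP status, response text)."""
    with urllib.request.urlopen(build_request(url, query), timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling `run_query(ANALYTICS_QUERY_URL, build_query())` from a host that can reach the analytics node would show whether the same "EIO" response comes back without the web UI in the path.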

Looks similar to, but is not the same as:
https://bugs.launchpad.net/juniperopenstack/+bug/1708764

ping (itestitest) wrote :

Screenshots of the errors and the /var/log/contrail files are available here:

root@comp45:~/cases/2018-0427-0265# ls -lt
total 559980
-rw-r--r-- 1 801 20062 570624000 May 1 14:02 alp1bccnt03.tar
drwxr-xr-x 2 root root 4096 May 1 09:35 alp1bccnt03 <--/var/log/contrail
-rw-r--r-- 1 801 20062 46040 Apr 27 07:58 ALP1B-2.JPG
-rw-r--r-- 1 801 20062 43221 Apr 27 07:58 ALP1B-1.JPG
-rw-r--r-- 1 801 20062 2591808 Apr 27 07:43 ctlrs.log
-rw-r--r-- 1 801 20062 54061 Apr 27 06:50 alp1b_ah287m.png
-rw-r--r-- 1 801 20062 39064 Apr 27 06:46 alp1b_ah287m_1524744131173.png

Jeba Paulaiyan (jebap)
no longer affects: juniperopenstack/r5.0
Jim Reilly (jpreilly)
information type: Proprietary → Private
information type: Private → Public
Suresh Akula (surakula) wrote :

EIO : input/output error
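
For reference, EIO is the standard POSIX errno name for an input/output error. A small sketch of how to look up the name and message on a given host, assuming the "EIO" string returned by the analytics API corresponds to that errno:

```python
import errno
import os

# Look up the numeric code, symbolic name, and OS message for EIO.
code = errno.EIO
name = errno.errorcode[code]
message = os.strerror(code)

print(code, name, message)
```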

We validated the above query from UI --> Analytics on 3.2.8.0 (Build 66). The query parameters are good, so the issue may not be in the UI.

We need to check with the Analytics team.

Assigning it to Sunder.

mkheni (mkheni) wrote :

contrail-query-engine.log contains many errors like the following:

2018-05-08 Tue 04:19:02:013.662 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139828687521536, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [27bd89fe-ff11-401a-beb3-26ab144622f4] FAILED
2018-05-08 Tue 04:19:02:017.881 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139829287302912, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:017.928 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139829287302912, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [1f4568ed-52a3-4626-b459-8f96cc766d31] FAILED
2018-05-08 Tue 04:19:02:019.111 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830897915648, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:019.139 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830897915648, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [814b3ac3-f487-4af8-bc1e-f17a8f6a7d3e] FAILED
2018-05-08 Tue 04:19:02:020.258 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830876923648, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:020.299 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830876923648, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [fdaa1479-2f22-43b8-ac2d-44391a90387d] FAILED
2018-05-08 Tue 04:19:02:028.785 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830776293120, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.
2018-05-08 Tue 04:19:02:028.883 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139830776293120, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [db1c7608-5a01-4e99-b86d-47ad7eea2989] FAILED

The "Operation timed out" error means that either Cassandra was under very heavy load or there were networking issues.
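
To gauge how widespread the failures are, the log lines above can be summarized by table, partition key, and timeout count. This is a rough sketch: the regular expression is inferred only from the sample lines in this comment, the function name `summarize` is hypothetical, and the two sample lines are copied verbatim from the log excerpt.

```python
import re
from collections import Counter

# Pattern inferred from the Db_GetMultiRow lines shown above (an assumption).
PARTITION_RE = re.compile(
    r"SELECT FROM Table: (\S+) Partition Key: \[([0-9a-f-]+)\] FAILED"
)
TIMEOUT_MARKER = "SyncQuery: FAILED: Operation timed out"

SAMPLE_LINES = [
    "2018-05-08 Tue 04:19:02:013.662 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139828687521536, Pid 23133]: Db_GetMultiRow:controller/src/database/cassandra/cql/cql_if.cc:2019: SELECT FROM Table: MessageTable Partition Key: [27bd89fe-ff11-401a-beb3-26ab144622f4] FAILED",
    "2018-05-08 Tue 04:19:02:017.881 UTC zalp1bcnal03.alp1b.cci.att.com [Thread 139829287302912, Pid 23133]: ExecuteQuerySyncInternal:controller/src/database/cassandra/cql/cql_if.cc:899: SyncQuery: FAILED: Operation timed out - received only 0 responses.",
]


def summarize(lines):
    """Count failed SELECTs per table, collect partition keys, count timeouts."""
    failed_tables = Counter()
    partition_keys = set()
    timeouts = 0
    for line in lines:
        match = PARTITION_RE.search(line)
        if match:
            failed_tables[match.group(1)] += 1
            partition_keys.add(match.group(2))
        elif TIMEOUT_MARKER in line:
            timeouts += 1
    return failed_tables, partition_keys, timeouts


tables, keys, timeout_count = summarize(SAMPLE_LINES)
print(dict(tables), len(keys), timeout_count)
```

The same function can be fed the lines of the full contrail-query-engine.log file to see whether the failures are limited to MessageTable or spread across tables.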

Could you also attach contrail-collector.log and cassandra logs at the time of the failure from all the nodes?

Also, if the system is still in the failure state, could you set up a debugging session? That would help us figure out why Cassandra is in a bad state.

-Miraj

ping (itestitest) wrote :

Thanks, Miraj.
I just noticed this update.
I'm checking with ATT and will get back to you soon.

mkheni (mkheni) wrote :

Email chain below consists of all the things BU has requested:

On May 30, 2018, at 4:54 PM, Miraj Subhashbhai Kheni <email address hidden> wrote:

Also,

Could you also confirm that there are no connectivity issues (e.g. a slow interface) in the cluster?

Could you give the output of the following as well-

ifstat 1 30 (on both database and analytics nodes)
bandwidth between database nodes and between database and analytics nodes
curl http://localhost:8089/Snh_ShowCollectorServerReq? | python -c 'from lxml import etree; from sys import stdin; tree = etree.parse(stdin); cql_metrics = tree.xpath("//table_info/list"); print etree.tostring(cql_metrics[0], pretty_print=True)'; sleep 1m; curl http://localhost:8089/Snh_ShowCollectorServerReq? | python -c 'from lxml import etree; from sys import stdin; tree = etree.parse(stdin); cql_metrics = tree.xpath("//table_info/list"); print etree.tostring(cql_metrics[0], pretty_print=True)' (on all the analytics nodes; this is to get the number of write successes/failures)

Thanks,
Miraj
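
The collector introspect one-liner in the email above relies on Python 2 and lxml. A rough standard-library equivalent is sketched below, in case lxml is not installed on a node. The URL and the `table_info/list` path are taken from the one-liner; the function names are hypothetical, and the actual layout of the introspect XML beyond that path is not known from this thread.

```python
import urllib.request
import xml.etree.ElementTree as ET

# URL copied from the one-liner in the email above.
INTROSPECT_URL = "http://localhost:8089/Snh_ShowCollectorServerReq?"


def fetch_introspect(url=INTROSPECT_URL, timeout=10):
    """Fetch the raw introspect XML as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def first_table_info_list(xml_bytes):
    """Return the first table_info/list element as text, or None if absent."""
    root = ET.fromstring(xml_bytes)
    matches = root.findall(".//table_info/list")
    if not matches:
        return None
    return ET.tostring(matches[0], encoding="unicode")
```

Running `print(first_table_info_list(fetch_introspect()))` twice, one minute apart, on each analytics node would give the same two snapshots that the original command requests.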

On May 30, 2018, at 2:03 PM, Miraj Subhashbhai Kheni <email address hidden> wrote:

Hi Abe,
Could you also send the output of the following commands run on all the 3 database nodes-

iostat -m 2 30
nodetool cfstats; sleep 1m; nodetool cfstats

Thanks,
Miraj

On May 24, 2018, at 5:08 PM, Ping Song <email address hidden> wrote:

Abe:
I see the command output.
Can you send the whole /var/log/cassandra tarball as requested?

tar czf Cassandra01.tgz /var/log/cassandra

thanks.

From: HATTAR, ABRAHAM <email address hidden>
Sent: Thursday, May 24, 2018 5:12 PM
To: Ping Song <email address hidden>; Miraj Subhashbhai Kheni <email address hidden>
Cc: Sundaresan Rajangam <email address hidden>; support <email address hidden>; MOINUDDIN, KHAJA <email address hidden>; SICURANZO, DEBORAH A <email address hidden>; KNOST, JOHN <email address hidden>; BROOKES-DANIELS, JOHN <email address hidden>; BATES, SIMON <email address hidden>
Subject: RE: List of things to be collected for SR P2 - 2018-0427-0265

Hi Miraj and Ping,

I uploaded everything you asked for in a compressed file named ftp.zip.

Please note that a directory named GC_Files contains 3 files, one for each of the following DB VMs:

db01  gc-1519981498.log  2.2 GB
db02  gc-1519983054.log  2.5 GB
db03  gc-1525161802.log  560 MB

regards
Abe

From: Ping Song <email address hidden>
Sent: Wednesday, May 23, 2018 11:59 AM
To: Miraj Subhashbhai Kheni <email address hidden>; HATTAR, ABRAHAM <email address hidden>
Cc: Sundaresan Rajangam <email address hidden>; support <email address hidden>
Subject: RE: List of things to be collected for SR P2 - 2018-0427-0265

+ case.

From: Ping Song
Sent: Wednesday, May 23, 2018 11:58 AM
To: Miraj Subhashbhai Kheni <email address hidden>; HATTAR, ABRAHAM <email address hidden>
Cc: Sundaresan Rajangam <email address hidden>
Subject: RE: List of things to be collected for SR P2 - 2018-0427-0265

Abe:
Could you collect the information below, as requested by our developer Miraj?
Appreciate your response.

Regards
ping

From: Miraj Subhashbhai Kheni
Sent: Tuesday, May 2...


Ning Zhong (nzhong)
tags: added: 2018-0710-0464
Gleb Zimin (gzimin) wrote :

Are there any updates?
