Large query sees operating system error 201

Bug #1307796 reported by Weishiun Tsai
This bug affects 1 person
Affects: Trafodion
Status: Fix Released
Importance: High
Assigned to: Mike Hanlon
Milestone: (none listed)

Bug Description

Running the large query below on Trafodion returned operating system error 201. No core file appeared to be generated when this happened. This was seen on centos-mapr, a 6-node cluster, but the same query will probably hit similar errors in one form or another on other clusters.

The tables are all salted with 12 partitions. Reproducing the problem requires populating the QA g_tpch2x tables first. The output below shows the query plan, followed by the query execution.

>>obey mytest.sql;
>>log mytest.log clear;
>>
>>set schema g_tpch2x;

--- SQL operation complete.
>>
>>prepare xx from
+>select [first 300]
+>l_orderkey,
+>cast(sum(l_extendedprice*(1-l_discount)) as numeric(18,2)) as revenue,
+>o_orderdate,
+>o_shippriority
+>from
+>customer,
+>orders,
+>lineitem
+>where
+>c_mktsegment = 'HOUSEHOLD'
+>and c_custkey = o_custkey
+>and l_orderkey = o_orderkey
+>and o_orderdate < date '1995-03-12'
+>and l_shipdate > date '1995-03-12'
+>group by
+>l_orderkey,
+>o_orderdate,
+>o_shippriority
+>order by
+>revenue desc,
+>o_orderdate;

--- SQL command prepared.
>>
>>explain options 'f' xx;

LC RC OP OPERATOR OPT DESCRIPTION CARD
---- ---- ---- -------------------- -------- -------------------- ---------

14 . 15 root 2.33E+003
13 . 14 firstn 2.33E+003
12 . 13 esp_exchange 1:24(hash2) (m) 2.33E+003
11 . 12 sort 2.33E+003
10 . 11 hash_partial_groupby 2.33E+003
9 . 10 esp_exchange 24(hash2):24(hash2) 2.33E+003
8 . 9 hash_partial_groupby 2.33E+003
7 2 8 hybrid_hash_join 4.53E+005
6 4 7 hybrid_hash_join 4.53E+005
5 . 6 esp_exchange 24(hash2):6(range) 5.99E+004
. . 5 trafodion_scan CUSTOMER 5.99E+004
3 . 4 esp_exchange 24(hash2):6(range) 1.45E+006
. . 3 trafodion_scan ORDERS 1.45E+006
1 . 2 esp_exchange 24(rep-b):6(range) 3.29E+001
. . 1 trafodion_scan LINEITEM 3.29E+001

--- SQL operation complete.
>>
>>execute xx;

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PSP: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PSP: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JGQ: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JGQ: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JH2: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JH2: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JGW: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z040JGW: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PT2: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PT2: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PSV: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z010PSV: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R96: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R96: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R9C: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R9C: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R9I: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z050R9I: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQU: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQU: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQH: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQH: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQN: Operating system error 201 while communicating with server process $Z000GFM.

*** ERROR[2034] $Z020HQN: Operating system error 201 while communicating with server process $Z000GFM.

--- 0 row(s) selected.
>>
>>log off;
>>
>>
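
With this many repeated messages, it can help to tally which processes reported the error and which server process they were all trying to reach. The following is a minimal sketch, not part of the original report: the function name `summarize_2034` is made up for illustration, and the regular expression assumes the exact message wording shown in the log above. The sample lines are copied from that log.

```python
import re
from collections import Counter

# Assumed message layout, taken from the sqlci output above:
# *** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.
ERROR_2034 = re.compile(
    r"\*\*\* ERROR\[2034\] (?P<reporter>\$\w+): "
    r"Operating system error (?P<oserr>\d+) "
    r"while communicating with server process (?P<server>\$\w+)\."
)


def summarize_2034(log_text):
    """Count ERROR[2034] lines per reporting process and per (server, OS error)."""
    reporters = Counter()
    servers = Counter()
    for match in ERROR_2034.finditer(log_text):
        reporters[match.group("reporter")] += 1
        servers[(match.group("server"), int(match.group("oserr")))] += 1
    return reporters, servers


# Three lines copied from the log above; the real log has many more.
sample = """\
*** ERROR[2034] $Z000E68: Operating system error 201 while communicating with server process $Z000GFM.
*** ERROR[2034] $Z010PSP: Operating system error 201 while communicating with server process $Z000GFM.
*** ERROR[2034] $Z010PSP: Operating system error 201 while communicating with server process $Z000GFM.
"""

reporters, servers = summarize_2034(sample)
print(servers)    # which server process the failures point at
print(reporters)  # which processes reported them
```

Run against the full mytest.log, a tally like this makes it easy to see that every message names the same server process, $Z000GFM.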

Tags: sql-exe
summary: - Large query sees operation system error 201
+ Large query sees operating system error 201
Anoop Sharma (anoop-sharma) wrote :

Mike Hanlon, assigning this one to you. Welcome aboard.

Changed in trafodion:
assignee: nobody → Mike Hanlon (mike-hanlon)
Changed in trafodion:
status: New → In Progress
Mike Hanlon (mike-hanlon) wrote :

The testcase shows severe memory growth, and eventually an ESP is killed by the Linux kernel OOM killer. We have a simpler testcase and a separate bug for this, 1312847. I plan to retest once a fix for 1312847 is available.
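
One way to confirm an OOM kill on a node is to scan the kernel log for the "Killed process" message. The sketch below is illustrative only and is not taken from this bug: the helper name `find_oom_kills` is hypothetical, the sample log lines and the process name `example_esp` are invented, and the regular expression assumes the common `Killed process <pid> (<name>)` wording, which may vary between kernel versions.

```python
import re

# Assumed kernel log wording; differs slightly across kernel versions.
KILLED = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(kernel_log):
    """Return a list of (pid, process name) for each OOM kill found in the text."""
    return [(int(m.group("pid")), m.group("name")) for m in KILLED.finditer(kernel_log)]


# Invented sample lines for illustration; not output from the affected cluster.
sample = """\
Out of memory: Kill process 4321 (example_esp) score 900 or sacrifice child
Killed process 4321 (example_esp) total-vm:1000kB, anon-rss:800kB, file-rss:0kB
"""

for pid, name in find_oom_kills(sample):
    print(f"OOM killer terminated pid {pid} ({name})")
```

In practice the input would be the text of the node's kernel log (for example, the output of `dmesg`) rather than a hard-coded string.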

information type: Proprietary → Public
Weishiun Tsai (wei-shiun-tsai) wrote :

Verified on the v0616_0930 build; the problem no longer occurs. Presumably it was addressed by the fix for another bug report, https://bugs.launchpad.net/trafodion/+bug/1312847 'upsert crashes sqlci on rhel-cdh1 with a 11M rows table':

>>set schema g_tpch2x;

--- SQL operation complete.
>>
>>prepare xx from
+>select [first 300]
+>l_orderkey,
+>cast(sum(l_extendedprice*(1-l_discount)) as numeric(18,2)) as revenue,
+>o_orderdate,
+>o_shippriority
+>from
+>customer,
+>orders,
+>lineitem
+>where
+>c_mktsegment = 'HOUSEHOLD'
+>and c_custkey = o_custkey
+>and l_orderkey = o_orderkey
+>and o_orderdate < date '1995-03-12'
+>and l_shipdate > date '1995-03-12'
+>group by
+>l_orderkey,
+>o_orderdate,
+>o_shippriority
+>order by
+>revenue desc,
+>o_orderdate;

*** WARNING[6007] Multi-column statistics for columns (O_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY) from table TRAFODION.G_TPCH2X.ORDERS were not available. The columns were being used by GroupBy operator. As a result, the access path chosen might not be the best possible.

--- SQL command prepared.
>>
>>explain options 'f' xx;

LC RC OP OPERATOR OPT DESCRIPTION CARD
---- ---- ---- -------------------- -------- -------------------- ---------

14 . 15 root 1.15E+006
13 . 14 firstn 1.15E+006
12 . 13 esp_exchange 1:24(hash2) (m) 1.15E+006
11 . 12 sort 1.15E+006
10 . 11 hash_groupby 1.15E+006
9 . 10 esp_exchange 24(hash2):24(hash2) 1.15E+006
8 2 9 hybrid_hash_join 1.15E+006
7 . 8 esp_exchange 24(hash2):24(hash2) 4.54E+005
6 4 7 hybrid_hash_join 4.54E+005
5 . 6 esp_exchange 24(hash2):6(range) 5.99E+004
. . 5 trafodion_scan CUSTOMER 5.99E+004
3 . 4 esp_exchange 24(hash2):6(range) 1.45E+006
. . 3 trafodion_scan ORDERS 1.45E+006
1 . 2 esp_exchange 24(hash2):6(range) 6.48E+006
. . 1 trafodion_scan LINEITEM 6.48E+006

--- SQL operation complete.
>>
>>execute xx;

L_ORDERKEY REVENUE O_ORDERDATE O_SHIPPRIORITY
----------- --------------------- ----------- --------------

    8207586 419366.13 1995-03-04 0
    4163074 416152.80 1995-02-13 0
    6487431 412710.05 1995-02-06 0
    5412866 408427.30 1995-03-10 0
   10666915 397407.41 1995-02-14 0

   <lines removed to shorten the output>

    4258912 290337.59 1995-02-23 0
    7029634 290137.75 1995-03-04 0
   11945889 290129.46 1995-02-02 0
   10694341 ...


Changed in trafodion:
status: In Progress → Fix Released