Getting TM error 97 for INSERT SELECT <hive table>

Bug #1274716 reported by Guy Groulx
This bug affects 1 person
Affects: Trafodion
Status: Fix Released
Importance: High
Assigned to: Justin Du

Bug Description

With a table defined in both TRAFODION and HIVE, we get error 97 when doing INSERT into TRAFODION table (SELECT from HIVE table).

We did not do a split of the target table, so this is a separate issue.

WORKAROUND: Use INSERT USING LOAD, which does not use transactions.

Tags: sql-general
Guy Groulx (guy-groulx)
Changed in trafodion:
importance: Medium → High
Guy Groulx (guy-groulx)
tags: added: transaction
tags: added: dtm
removed: transaction
Revision history for this message
Narendra Goyal (narendra-goyal) wrote :

I will look at it - Narendra

Revision history for this message
John de Roo (john-deroo) wrote : RE: [Bug 1274716] Re: Getting TM error 97 for INSERT SELECT <hive table>

Thanks Narendra.

John.


Changed in trafodion:
assignee: nobody → Narendra Goyal (narendra-goyal)
information type: Proprietary → Public
Revision history for this message
Guy Groulx (guy-groulx) wrote :

NOTE: As of version 0.8.3, the error 97 message no longer occurs, but with large enough tables rows are not inserted even though a successful status is returned.

John de Roo (john-deroo)
Changed in trafodion:
assignee: Narendra Goyal (narendra-goyal) → Adriana Fuentes (adriana-fuentes)
Changed in trafodion:
status: New → In Progress
milestone: none → r0.9
Changed in trafodion:
milestone: r0.9 → r1.1
Revision history for this message
Guy Groulx (guy-groulx) wrote :

Just tried again on our system.

SQL>create table trafodion.mxoltp.supplier (
  S_SUPPKEY INT no default not null
, S_NAME CHAR(2048 BYTES) no default not null
, S_ADDRESS CHAR(2048 BYTES) no default not null
, S_NATIONKEY INT no default not null
, S_PHONE CHAR(2048 BYTES) no default not null
, S_ACCTBAL FLOAT(54) no default not null
, S_COMMENT VARCHAR(2048 BYTES) no default not null
)
primary key (s_suppkey);

--- SQL operation complete.

SQL>get tables;

Tables in Schema TRAFODION.MXOLTP
=================================

SB_HISTOGRAMS
SB_HISTOGRAM_INTERVALS
SUPPLIER
TBL500

--- SQL operation complete.

SQL>insert into supplier (select * from hive.hive.tpch10_supplier);

*** ERROR[2034] $Z001GK9: Operating system error 201 while communicating with server process $Z060XNE. [2015-01-09 19:57:05]
*** ERROR[2034] $Z001GK9: Operating system error 201 while communicating with server process $Z060XNE. [2015-01-09 19:57:05]

SQL>prepare cmd from insert into supplier (select * from hive.hive.tpch10_supplier);
....

--- SQL command prepared.

SQL>explain cmd;

------------------------------------------------------------------ PLAN SUMMARY
MODULE_NAME .............. DYNAMICALLY COMPILED
STATEMENT_NAME ........... CMD
PLAN_ID .................. 212287593420906040
ROWS_OUT ........... 101,988
EST_TOTAL_COST ........... 0.03
STATEMENT ................ insert into supplier (select * from
                             hive.hive.tpch10_supplier)

------------------------------------------------------------------ NODE LISTING
ROOT ====================================== SEQ_NO 6 ONLY CHILD 5
REQUESTS_IN .............. 1
ROWS_OUT ........... 101,988
EST_OPER_COST ............ 0
EST_TOTAL_COST ........... 0.03
DESCRIPTION
  max_card_est ..... 101,988
  fragment_id ............ 0
  parent_frag ............ (none)
  fragment_type .......... master
  statement_index ........ 0
  affinity_value ......... 0
  est_memory_per_cpu ..... 81968 KB
  max_max_cardinali 101,988
  total_overflow_size .... 0.00 KB
  esp_2_node_map ......... (\NSK:-1:-1:-1:-1:-1:-1:-1:-1:-1:-1)
  upd_action_on_error .... xn_rollback
  xn_access_mode ......... read_only
  xn_autoabort_interval 0
  auto_query_retry ....... enabled
  plan_version ....... 2,600
  embedded_arkcmp ........ used
  LDAP_USERNAME .......... SQUSER2
  CATALOG ................ TRAFODION
  SCHEMA ................. MXOLTP
  SHOWCONTROL_SHOW_ALL ... OFF
  ObjectUIDs ............. 65357623245348342

ESP_EXCHANGE ============================== SEQ_NO 5 ONLY CHILD 4
REQUESTS_IN .............. 1
ROWS_OUT ........... 101,988
EST_OPER_COST ............ 0.01
EST_TOTAL_COST ........... 0.03
DESCRIPTION
  max_card_est ..... 101,988
  fragment_id ............ 2
  parent_frag ............ 0
  fragment_type .......... esp
  buffer_size ....... 12,500
  record_length .......... 0
  parent_processes ....... 1
  child_processes ....... 10
  child_partitioning_func hash2 partitioned 10 ways on (randomNum)
  seamonster_query ..........


Revision history for this message
Guy Groulx (guy-groulx) wrote :

The workaround, INSERT USING LOAD INTO, did work. It took a while, but it worked.

Revision history for this message
Adriana Fuentes (adriana-fuentes) wrote :

I reproduced the problem on a workstation and was able to see an assertion failure through sqlci. More details on the problem are below:

--- SQL operation complete.
>>insert into supplier (select * from hive.hive.tpch10_supplier);

*** EXECUTOR ASSERTION FAILURE
*** Time: Thu Jan 15 22:14:07 2015
*** Process: 7247
*** File: ../executor/ex_tuple_flow.cpp
*** Line: 350
*** Message: We have a Q_SQLERROR in Tupleflow but no diags area
*** EXECUTOR ASSERTION FAILURE
*** Time: Thu Jan 15 22:14:08 2015
*** Process: 8028
*** File: ../executor/ex_tuple_flow.cpp
*** Line: 350
*** Message: We have a Q_SQLERROR in Tupleflow but no diags area

*** ERROR[2034] $Z000M14: Operating system error 201 while communicating with server process $Z0106JD.

*** ERROR[2034] $Z000M14: Operating system error 201 while communicating with server process $Z0106JD.

--- 0 row(s) inserted.

The following is more information about the query:
>>prepare xx from insert into supplier (select * from hive.hive.tpch10_supplier);
..

--- SQL command prepared.
>>explain options 'f' xx;

LC RC OP OPERATOR OPT DESCRIPTION CARD
---- ---- ---- -------------------- -------- -------------------- ---------

5 . 6 root x 1.01E+005
4 . 5 esp_exchange 1:4(hash2) 1.01E+005
2 3 4 tuple_flow 1.01E+005
. . 3 trafodion_insert SUPPLIER 1.00E+000
1 . 2 sort 1.01E+005
. . 1 hive_scan TPCH10_SUPPLIER 1.01E+005

--- SQL operation complete.

tags: added: sql-general
removed: dtm
Changed in trafodion:
assignee: Adriana Fuentes (adriana-fuentes) → Justin Du (justin-du-2)
Revision history for this message
Justin Du (justin-du-2) wrote :

Found the ESP cores on sqws107 that Adriana worked on. The core files indicated that the first up-queue entry the sort operator produced was Q_SQLERROR, but it had no accompanying diags info, hence the core.

An attempt to reproduce the problem on the same workstation hung. pstack showed that the esp and sqlci processes were waiting in pthread_cond_wait.

Revision history for this message
Justin Du (justin-du-2) wrote :

Further analysis of the existing core files on the workstation indicated that the up-queue entries with Q_SQLERROR originated from ExHdfsScanTcb. Its work() method can insert a Q_SQLERROR entry into the up queue without populating it with a diagsArea at the HANDLE_ERROR step. The diags info is expected to be associated with the workAtp_; when that is not set, we see this kind of core.

The fix for the core is to provide proper diags info for all possible error cases in ExHdfsScanTcb.

There should be another error (or errors) that caused the hdfs scan failure in the first place; that still needs to be identified.

Revision history for this message
Adriana Fuentes (adriana-fuentes) wrote :

Some extra steps to prepare a workstation to reproduce the problem:

alias hadoop=swhadoop
alias hive=swhive
cd /h/ggroulx/tpc/tpch-gen-master

. ./createTPCHhive.sh -s 10 -p 10

Revision history for this message
Guy Groulx (guy-groulx) wrote :

Tested on git150118 and it now works.

Changed in trafodion:
status: In Progress → Fix Released