Accessing hive table with ucs2 encoded field returns 0 rows.

Bug #1443482 reported by Howard Qin
Affects: Trafodion
Status: New
Importance: Medium
Assigned to: khaled Bouaziz

Bug Description

When a Hive table with a UCS2-encoded field is accessed, our implementation returns 0 rows.
This is caused by the use of strchr() in ExHdfsScanTcb::extractAndTransformAsciiSourceToSqlRow().
strchr() stops at the first '\0' it meets before it reaches the line delimiter '\n'. In UCS2 data, however, that '\0' may simply be the 0x00 byte of a two-byte character, so the line is wrongly considered invalid.
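
The failure mode can be reproduced outside Trafodion with a few lines of C++. The sketch below is not the Trafodion code or a proposed patch; the helper names and the sample buffer are hypothetical, and it assumes little-endian UCS2 bytes and a single-byte '\n' record delimiter. It only contrasts a strchr()-style scan, which gives up at the first 0x00 byte, with a length-bounded memchr() scan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sample row: "1001|" followed by "JBL" as little-endian UCS2
// (each character carries a 0x00 high byte), then a single-byte '\n'.
// The trailing '\0' only keeps strchr() inside the buffer.
static const char kSampleRow[] = {'1', '0', '0', '1', '|',
                                  'J', '\0', 'B', '\0', 'L', '\0',
                                  '\n', '\0'};
static const std::size_t kSampleRowLen = sizeof(kSampleRow) - 1;

// Hypothetical helper: strchr()-based scan. Returns the offset of the first
// '\n', or -1 when a 0x00 byte is reached first, because strchr() treats
// that byte as the end of the string.
long findDelimWithStrchr(const char *buf)
{
    assert(buf != nullptr);
    const char *p = std::strchr(buf, '\n');
    return p ? static_cast<long>(p - buf) : -1;
}

// Hypothetical helper: length-bounded scan. memchr() examines exactly len
// bytes, so embedded 0x00 bytes do not end the search.
long findDelimWithMemchr(const char *buf, std::size_t len)
{
    assert(buf != nullptr);
    const void *p = std::memchr(buf, '\n', len);
    return p ? static_cast<long>(static_cast<const char *>(p) - buf) : -1;
}
```

On the sample row, findDelimWithStrchr(kSampleRow) reports no delimiter, while findDelimWithMemchr(kSampleRow, kSampleRowLen) finds it. A byte-wise scan is still not fully safe for UCS2, since a 0x0A byte can also occur inside a two-byte character, so a real fix would probably need to be encoding-aware.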

Scripts to reproduce:

create table sck(
    userId int not null,
    name varchar(20) character set UCS2
);

insert into sck values (1001, _ucs2'JBL'), (1002, _ucs2'YS '), (1003, _ucs2'8#RTG');

unload into '/ucs2test' select * from sck;

create external table hsck
(
  id int,
  name string
) row format delimited fields terminated by '|'
location '/ucs2test';

select * from hive.hive.hsck;

Tags: hive
Roberta Marton (roberta-marton) wrote :

A separate but related issue occurs when trying to handle UTF8 and string data.
Excerpt from an e-mail conversation:

I would think we need a way to influence the default mapping of string columns in Hive. If a user knows that the Hive table contains ISO or UCS2 characters, he/she can issue a CQD to influence the string mapping. The drawback of this approach is that all the string columns in the table will be mapped to this encoding.

I believe that this should take care of the conversion issue.

Selva

From: Subbiah, Suresh
Sent: Wednesday, May 20, 2015 10:31 AM
To: Govindarajan, Selvaganes; Marton, Roberta S; Capps, Jim; Fritchman, Barry
Cc: Zeller, Hans
Subject: RE: Trying to create a warning

Hi Roberta, Selva

Roberta : The answer to both questions in your message is YES.
Chinese PoCs need UTF8. All PoCs usually start with a bulk load. Maybe bulk loads are less important in some PoCs, and maybe they will not be used in production, but bulk load is needed to get started everywhere. We always bulk load from Hive to Traf.

Selva : All Hive string columns are mapped to charset UTF8 in Traf. I don’t think there is even a cqd to change it. If the Traf table being loaded is ISO88591 then we use a TRANSLATE ItemExpr (Jim’s feature) to convert. OSS tables and usually Chinese PoC Traf tables use UTF8 charset so no TRANSLATE is needed in those cases

Thanks
Suresh

From: Govindarajan, Selvaganes
Sent: Wednesday, May 20, 2015 12:26 PM
To: Marton, Roberta S; Capps, Jim; Fritchman, Barry
Cc: Subbiah, Suresh; Zeller, Hans
Subject: RE: Trying to create a warning

Do you know how "invoke hive.hive.customer" displays character set UTF8?

Selva

From: Marton, Roberta S
Sent: Wednesday, May 20, 2015 9:59 AM
To: Marton, Roberta S; Govindarajan, Selvaganes; Capps, Jim; Fritchman, Barry
Cc: Subbiah, Suresh; Zeller, Hans
Subject: RE: Trying to create a warning

After reviewing the code and talking with Suresh it looks to be an issue with how Trafodion maps data types from Hive to Trafodion.

A description of how Trafodion translates Hive metadata into Trafodion metadata (thanks to Suresh for the explanation):

When a Hive table is accessed from Traf (either for getTables or in a select/insert query), we call a Java function exposed by Hive.
The Hive jars are included in our class path, and the call goes through our usual JNI path. The function we call returns the description of a table as one giant string.
We parse that string on the C++ side and create a struct called hive_tbl_desc. This desc is then converted to an NATable class.

When the statement “invoke hive.hive.customer;” is performed, an NATable structure is obtained and information displayed like a Trafodion table.

In ExExeUtilGet there is code that translates the column information from the Hive type to the Trafodion type. This translation does not consider UTF8 – only ISO88591.
If you look at ExExeUtilHiveMDaccessTcb::getFSTypeFromHiveColType, the data type for a Hive string is translated into a REC_BYTE_V_ASCII file type.
Then this is converted to ISO88591:

    // only iso charset
   if ((infoCol->fsDatatype == REC_BYTE_F_ASCII) || (infoCol->fsDatatype == REC_BYTE_V_ASCII))
              str_cpy(infoCol->charSe...
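
To illustrate the idea raised earlier in the thread, that the charset chosen for Hive string columns could be made configurable rather than fixed, here is a minimal C++ sketch. It is not Trafodion code: ColumnInfo, HiveStringCharset, charsetName and mapHiveColumn are hypothetical names, and the default of UTF8 is only an assumption based on the statement above that Hive string columns are mapped to UTF8 in Traf.

```cpp
#include <cassert>
#include <string>

// Hypothetical set of charsets a Hive string column could be mapped to.
enum class HiveStringCharset { ISO88591, UTF8, UCS2 };

// Hypothetical stand-in for the per-column metadata; not the real infoCol type.
struct ColumnInfo {
    std::string hiveType;  // type name as reported by Hive, e.g. "string"
    std::string charset;   // empty for non-character columns
};

// Returns the display name for a charset value.
const char *charsetName(HiveStringCharset cs)
{
    switch (cs) {
    case HiveStringCharset::ISO88591: return "ISO88591";
    case HiveStringCharset::UTF8:     return "UTF8";
    case HiveStringCharset::UCS2:     return "UCS2";
    }
    return "UTF8";  // unreachable for valid enum values
}

// Maps a Hive column type to metadata. The charset for "string" columns comes
// from a caller-supplied setting (in the thread's proposal, a CQD) instead of
// being hard-coded to a single value.
ColumnInfo mapHiveColumn(const std::string &hiveType,
                         HiveStringCharset defaultForStrings = HiveStringCharset::UTF8)
{
    ColumnInfo col{hiveType, ""};
    if (hiveType == "string")
        col.charset = charsetName(defaultForStrings);
    return col;
}
```

As noted in the thread, one setting of this kind applies to every string column in the table, so it cannot describe a table that mixes encodings.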
