A separate but related issue occurs when trying to handle UTF8 and string data. Excerpt from an e-mail converations: I would think we need a way to influence the default mapping of string column in hive. If a user knows that the hive contains ISO or UCS2 characters, he/she can issue a CQD to influence the string mapping. The drawback with the approach is that all the string columns in the table will be mapped to this encoding. I believe that this should take care of the conversion issue. Selva From: Subbiah, Suresh Sent: Wednesday, May 20, 2015 10:31 AM To: Govindarajan, Selvaganes; Marton, Roberta S; Capps, Jim; Fritchman, Barry Cc: Zeller, Hans Subject: RE: Trying to create a warning Hi Roberta, Selva Roberta : The answer to both questions in your message is YES. Chinese PoC need UTF8. All PoCs start with a bulk load usually. Maybe they are less important in some and maybe they will not be used in production. But bulkload is needed to get started at all places. We always bulk load from Hive to Traf. Selva : All Hive string columns are mapped to charset UTF8 in Traf. I don’t think there is even a cqd to change it. If the Traf table being loaded is ISO88591 then we use a TRANSLATE ItemExpr (Jim’s feature) to convert. OSS tables and usually Chinese PoC Traf tables use UTF8 charset so no TRANSLATE is needed in those cases Thanks Suresh From: Govindarajan, Selvaganes Sent: Wednesday, May 20, 2015 12:26 PM To: Marton, Roberta S; Capps, Jim; Fritchman, Barry Cc: Subbiah, Suresh; Zeller, Hans Subject: RE: Trying to create a warning Do you know how invoke hive.hive.customer displays character set UTF8? Selva From: Marton, Roberta S Sent: Wednesday, May 20, 2015 9:59 AM To: Marton, Roberta S; Govindarajan, Selvaganes; Capps, Jim; Fritchman, Barry Cc: Subbiah, Suresh; Zeller, Hans Subject: RE: Trying to create a warning After reviewing the code and talking with Suresh it looks to be an issue with how Trafodion maps data types from Hive to Trafodion. A description on how Trafodion gets translates Hive metadata to Trafodion – thanks to Suresh for the explanation: When a hive table is accessed from Traf (either for getTables or in a select/insert query) we use Java to call a java function exposed by Hive. Hive jars are included in our class path. This goes though our usual JNI path. The function we call returns the description of a table as a giant string. We parse the string in C++ side in our code and create a struct called hive_tbl_desc. This desc is then converted to an NATable class. When the statement “invoke hive.hive.customer;” is performed, an NATable structure is obtained and information displayed like a Trafodion table. In ExExeUtilGet there is code that translates the column information from the Hive type to the Trafodion type. This translation does not consider UTF8 – only ISO88591. If you look at: ExExeUtilHiveMDaccessTcb::getFSTypeFromHiveColType the data type for a Hive string is translated into a REC_BYTE_V_ASCII file type. Then this is converted to ISO88591: // only iso charset if ((infoCol->fsDatatype == REC_BYTE_F_ASCII) || (infoCol->fsDatatype == REC_BYTE_V_ASCII)) str_cpy(infoCol->charSet, "ISO88591", 40, ' '); I checked and there are a couple of LP bugs that are related but they don’t cover this case. 1443482 - Accessing hive table with ucs2 encoded field returns 0 rows. 1392459 - Internal error 2005 when querying a Hive table with an unsupported data type Do we know if any of our POC’s will need UTF8 character support and will be using Hive to load the data? Roberta From: Marton, Roberta S Sent: Tuesday, May 19, 2015 1:00 PM To: Govindarajan, Selvaganes; Capps, Jim; Fritchman, Barry Subject: RE: Trying to create a warning From hive: describe customer > ; OK col_name data_type comment c_customer_sk int c_customer_id string c_current_cdemo_sk int c_current_hdemo_sk int c_current_addr_sk int c_first_shipto_date_sk int c_first_sales_date_sk int c_salutation string c_first_name string c_last_name string c_preferred_cust_flag string c_birth_day int c_birth_month int c_birth_year int c_birth_country string c_login string c_email_address string c_last_review_date string From Trafodion: invoke hive.hive.customer; -- Definition of hive table CUSTOMER -- Definition current Tue May 19 12:59:24 2015 ( C_CUSTOMER_SK INT , C_CUSTOMER_ID VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_CURRENT_CDEMO_SK INT , C_CURRENT_HDEMO_SK INT , C_CURRENT_ADDR_SK INT , C_FIRST_SHIPTO_DATE_SK INT , C_FIRST_SALES_DATE_SK INT , C_SALUTATION VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_FIRST_NAME VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_LAST_NAME VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_PREFERRED_CUST_FLAG VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_BIRTH_DAY INT , C_BIRTH_MONTH INT , C_BIRTH_YEAR INT , C_BIRTH_COUNTRY VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_LOGIN VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_EMAIL_ADDRESS VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT , C_LAST_REVIEW_DATE VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT ) /* stored as text */ Trafodion table: showddl customer; CREATE TABLE TRAFODION.USER1.CUSTOMER ( C_CUSTOMER_SK INT DEFAULT NULL , C_CUSTOMER_ID VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_CURRENT_CDEMO_SK INT DEFAULT NULL , C_CURRENT_HDEMO_SK INT DEFAULT NULL , C_CURRENT_ADDR_SK INT DEFAULT NULL , C_FIRST_SHIPTO_DATE_SK INT DEFAULT NULL , C_FIRST_SALES_DATE_SK INT DEFAULT NULL , C_SALUTATION VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_FIRST_NAME VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_LAST_NAME VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_PREFERRED_CUST_FLAG VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_BIRTH_DAY INT DEFAULT NULL , C_BIRTH_MONTH INT DEFAULT NULL , C_BIRTH_YEAR INT DEFAULT NULL , C_BIRTH_COUNTRY VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_LOGIN VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_EMAIL_ADDRESS VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL , C_LAST_REVIEW_DATE VARCHAR(32000 BYTES) CHARACTER SET UTF8 COLLATE DEFAULT DEFAULT NULL ) ; Roberta From: Govindarajan, Selvaganes Sent: Tuesday, May 19, 2015 12:44 PM To: Capps, Jim; Marton, Roberta S; Fritchman, Barry Subject: RE: Trying to create a warning Can you please get the ddl definition of hive.hive.customer and the customer table in Trafodion. Selva From: Capps, Jim Sent: Tuesday, May 19, 2015 11:50 AM To: Marton, Roberta S; Fritchman, Barry Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning Selva may be the best person to answer this. My first guess is that the table hive.hive.customer has its data encoded as ISO88591 and some piece of code involved with CTAS was assuming that the data was encoded as UTF8. Regards, Jim From: Marton, Roberta S Sent: Tuesday, May 19, 2015 1:35 PM To: Capps, Jim; Fritchman, Barry Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning Is the encoding done through LOAD? Should I write up a case for this against the executor? Roberta From: Capps, Jim Sent: Tuesday, May 19, 2015 11:30 AM To: Fritchman, Barry; Marton, Roberta S Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning Barry, I agree completely with you. Thanks, Jim From: Fritchman, Barry Sent: Monday, May 18, 2015 9:12 PM To: Marton, Roberta S Cc: Govindarajan, Selvaganes; Capps, Jim Subject: RE: Trying to create a warning Hi Roberta, The problem isn’t Update Stats, it seems to be a problem with the way char values were encoded when the table was populated during ctas. For example, applying converttohex to c_birth_country where the value is ‘CÔTE D’IVOIRE’ gives 43D4544520442749564F495245. D4 is the ISO-8859-1 code point for capital O with a circumflex, but in UTF-8 it should be the 2-byte sequence C394. c_birth_country is declared as having utf8 as its character set, so when a string function like substr or char_length is used on this you get the same error you saw from ustat: >>select max(char_length(c_birth_country)) from cust; *** ERROR[8433] Invalid UTF8 character encountered in CHAR FUNCTION. --- 0 row(s) selected. >> I’m copying Jim in case I got any of this wrong. Barry From: Marton, Roberta S Sent: Monday, May 18, 2015 4:44 PM To: Fritchman, Barry Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning I have put the ULOG on /designs/seaquest/fritchma/roberta_ulog The first couple of lines describe how the table is created. Roberta From: Fritchman, Barry Sent: Monday, May 18, 2015 3:37 PM To: Marton, Roberta S Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning Roberta, If you capture a ULOG you can send it to me and I’ll check into the error. Barry From: Marton, Roberta S Sent: Monday, May 18, 2015 3:24 PM To: Fritchman, Barry Cc: Govindarajan, Selvaganes Subject: RE: Trying to create a warning I noticed the hive test generate this error. So I performed the following but getting an error from update statistics: Create table customer as select * from hive.hive.customer; update statistics for table customer on every column; *** ERROR[9200] UPDATE STATISTICS for table TRAFODION.USER1.CUSTOMER encountered an error (8433) from statement HSCursor::fetchRowset(). *** ERROR[8433] Invalid UTF8 character encountered in SUBSTRING FUNCTION. *** ERROR[9200] UPDATE STATISTICS for table TRAFODION.USER1.CUSTOMER encountered an error (8433) from statement FETCH_BOUNDARY_ROWSET. *** ERROR[8433] Invalid UTF8 character encountered in SUBSTRING FUNCTION. --- SQL operation failed with errors. Selva – is LOAD called for create as select requests? Should some code be checking for valid character data before insertion? Roberta From: Fritchman, Barry Sent: Friday, May 15, 2015 5:10 PM To: Marton, Roberta S Subject: RE: Trying to create a warning I created table with up to 10 million rows and couldn’t get the warning to appear. For some reason, I can only get it on one table that I already had. Here is how to generate that table if you want to try that. Run the Hive command-line tool and do the following 2 commands: CREATE TABLE ORDERS ( O_ORDERKEY INT , O_CUSTKEY INT , O_ORDERSTATUS STRING , O_TOTALPRICE DOUBLE , O_ORDERDATE TIMESTAMP , O_ORDERPRIORITY STRING , O_CLERK STRING , O_SHIPPRIORITY INT , O_COMMENT STRING ) row format delimited fields terminated by '|'; load data local inpath '/designs/seaquest/thaiju/rhel-cdh1/g_data/tpch2x/data/orders.tbl' into table orders; Then in sqlci, do CREATE TABLE ORDERS2 ( O_ORDERKEY INT NO DEFAULT NOT NULL , O_CUSTKEY INT NO DEFAULT NOT NULL , O_ORDERSTATUS CHAR(1) CHARACTER SET ISO88591 NO DEFAULT NOT NULL , O_TOTALPRICE NUMERIC(12, 2) NO DEFAULT NOT NULL , O_ORDERDATE DATE NO DEFAULT NOT NULL , O_ORDERPRIORITY CHAR(15) CHARACTER SET ISO88591 NO DEFAULT NOT NULL , O_CLERK CHAR(15) CHARACTER SET ISO88591 NO DEFAULT NOT NULL , O_SHIPPRIORITY INT --NO DEFAULT NOT NULL , O_COMMENT VARCHAR(79) CHARACTER SET ISO88591 NO DEFAULT NOT NULL , PRIMARY KEY (O_ORDERKEY ASC) ) salt using 4 partitions ; cqd COMP_BOOL_226 ‘ON’; cqd TRAF_LOAD_PREP_TMP_LOCATION '/bulkload/'; load into ORDERS select o_orderkey, o_custkey, o_orderstatus, o_totalprice, cast(o_orderdate as date) o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment from hive.hive.orders; Now you should be able to get the warning with Prepare s from select * from orders; Don’t forget the “Prepare s from”, or you’ll be watching 3 million rows scroll by. Barry From: Marton, Roberta S Sent: Friday, May 15, 2015 3:13 PM To: Fritchman, Barry Subject: RE: Trying to create a warning I created and added rows to a table. I did not update statistics so the histogram tables ddi not exist. I don’t get any warnings. Then I generated stats for one column and selected on the other and still did not get a warning: sqlci -u sql_user2 Trafodion Conversational Interface 1.2.0 (c) Copyright 2014 Hewlett-Packard Development Company, LP. Database user: SQL_USER2 >>cqd HIST_MISSING_STATS_WARNING_LEVEL '4'; --- SQL operation complete. >>select * from user1.t1 where c1 > 0; C1 C2 ----------- ----------- 1 1 2 2 3 3 4 4 5 5 --- 5 row(s) selected. >>showstats for table user1.t1 on every column; Histogram data for Table TRAFODION.USER1.T1 Table ID: 132359578113963536 Hist ID # Ints Rowcount UEC Colname(s) ========== ====== =========== =========== =========================== 153857896 5 5 5 C2 --- SQL operation complete. Roberta From: Fritchman, Barry Sent: Friday, May 15, 2015 2:58 PM To: Marton, Roberta S; Thai, Julie Subject: RE: Trying to create a warning The warning is dependent on the HIST_MISSING_STATS_WARNING_LEVEL cqd, but it should be 4 by default, which is the most inclusive level. Most likely the optimizer is not requesting that column. Try using the primary key column as the one that stats are missing for. The optimizer almost always requests stats for the primary key. Barry From: Marton, Roberta S Sent: Friday, May 15, 2015 2:46 PM To: Thai, Julie; Fritchman, Barry Subject: Trying to create a warning I am trying to generate the warning: *** WARNING[6008] Statistics for column () from table were not available. As a result, the access path chosen might not be the best possible. [2015-04-29 13:10:04] But can’t seem to do it. Do I need to insert a certain number of rows first? Roberta