'\N' and empty string are both treated as null when selecting hive table.

Bug #1443463 reported by Howard Qin on 2015-04-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
New
High
khaled Bouaziz

Bug Description

When loading/selecting from hive table, we treat both empty string and '\N' as null, this doesn't conform with hive behavior, in which, user can configure what null character is.

The issue can be explain in following steps:
1)
>select * from t1;
A B
----------- ------------------------------------------------------------------------------------------------------------------------
          4
          1 ?
          2 fsdf
          3 hhah

2)
unload with null_string '\N' into '/nulltest' select * from t1;
-bash-4.1$ swhdfs dfs -cat /nulltest/file0-20150410015053-828
4|
1|\N
2|fsdf
3|hhah

3)
hive> create external table nulltest ( A int, B string )
         > row format delimited fields terminated by '|'
         > location '/nulltest'
         > tblproperties ('serialization.null.format' = '\\N');

hive> select * from nulltest;
OK
nulltest.a nulltest.b
4  <-- empty string
1 NULL
2 fsdf
3 hhah

4)
>>select * from hive.hive.nulltest;
A B
----------- ------------------------------------------------------------------------------------------------------------------------
          4 ? <-- empty string is considered null.
          1 ?
          2 fsdf
          3 hhah

>select * from t1;
A B
----------- ------------------------------------------------------------------------------------------------------------------------
          4 <-- should consist with t1 and hive.
          1 ?
          2 fsdf
          3 hhah

Suresh Subbiah (suresh-subbiah) wrote :

Same problem was seen when loading HPDSM data from one system to another. Increasing the priority.

Changed in trafodion:
importance: Medium → High
khaled Bouaziz (khaled-bouaziz) wrote :

From Qifan

Looks like the bug is in the following method

char * ExHdfsScanTcb::extractAndTransformAsciiSourceToSqlRow(int &err,
                                                             ComDiagsArea* &diagsArea)
{
  err = 0;

  1267 if (attr) // this is a needed column. We need to convert
   1268 {
   1269 *(short*)&hdfsAsciiSourceData_[attr->getVCLenIndOffset()] = l en;
   1270 if (attr->getNullFlag())
   1271 {
   1272 if (len == 0)
   1273 *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = -1;
   1274 else if (memcmp(sourceData, "\\N", len) == 0)
   1275 *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = -1;
   1276 else
   1277 *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = 0;
   1278 }

The correct logic should be:

If (\N is the null indicator recorded in meta-data and the current string is “\N” || (\N is not the null indicator and len == 0) )
  *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = -1;

As a work-around, we can assume \N is the null indicator by default, and comment out line 1272 and 1273 and remove ‘else’ from line 1274.

Naveen S S (navi99) wrote :

Any update on this bug. Currently we are facing the same issue in hive.

Issue overview
Say we have a table named T1 stored as orc. Select * fom t1 would yeild the following result.
Key | Value |
1 | |<-- Empty

Now, we create table T2 like T1
Create TABLE T2 like T1;

Unload data from T1 to T2
Insert into T2 select * from T1;

If we query T2 i see all empty values converted to NULL's
Key | Value
1 | NULL

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers