Row mismatch between index/table cause init auth to fail

Bug #1426479 reported by Roberta Marton
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Committed
High
Justin Du

Bug Description

Initialize authorization is failing because of a problem with index maintenance on the OBJECTS metadata table.

After running full regressions, the number of rows in the OBJECTS table is less than the number of rows in the OBJECTS table's index. The number of rows should match.

How to recreate:
   Run full regressions on the work station (either debug or release) – everything works
   Re-run regressions and all catman1 tests failed because initialize authorization fails:

initialize authorization;

*** ERROR[1001] An internal error occurred in module ../sqlcomp/PrivMgrPrivileges.cpp on line 3183. DETAILS(Expected to insert 246 rows into OBJECT_PRIVILEGES table, instead 238 were found.).

The initialize authorization code first performs an insert .. select -> this returns 238 rows (it reads from the OBJECTS table)
Then the code performs a count(*) on the newly inserted table -> this returns 246 rows (it reads from the OBJECTS table index)

Note that initialize authorization creates five tables, and then performs the insert … select
When initialize authorization fails it removes the 5 tables.

If I log onto sqlci and perform similar commands:
   A select that mimics what the insert .. select performs and
  A select with count(*)

select count(*)
       from "_MD_".objects o
       where o.object_type in ('VI','BT','LB','UR','SG')
;

(EXPR)
--------------------

                 241

--- 1 row(s) selected.

select
       object_uid,
       object_owner
    from "_MD_".objects o
    where o.object_type in ('VI','BT','LB','UR','SG')
;

OBJECT_UID OBJECT_OWNER
-------------------- ------------

   61418117531923345 33333
   82247271613094528 33333
. . .
  121090312342560989 33333
  121090312342560327 33333
  121090312342560715 33333

--- 233 row(s) selected.

Note that the difference in row counts in these queries and the row counts from initialize authorization is because the 5 privilege manager table that no longer exist (246 – 5 = 241 & 238 – 5 = 233)

If I don’t include object_owner in the select list, then 241 rows are returned as expected.

Comparing the returned UID’s from both requests, the following rows were not returned in the second select:

OBJECT_TYPE SCHEMA_NAME OBJECT_NAME
----------- ------------------------------- --------------------------------------------------

BT SCH SKC
BT SCH DM2C
BT SCH T062A
BT SCH T062B
BT SCH SKC
BT SCH DM2C
BT SCH T062A
BT SCH T062B

--- 8 row(s) selected.

The returned rows have UID’s in the same range:

  83939089253157639
   83939089253158296
   83939089253164783
   83939089253164911
   86190385130448198
   86190385130448858
   86190385130453959
   86190385130454168

Now I did a select count(*) on the table versus the index (read all the rows, not just a subset):

Set parserflags 1;
select count(*) from table (index_table objects_uniq_idx);
..

(EXPR)
--------------------

                 448

--- 1 row(s) selected.
>>select count(*) from objects;

(EXPR)
--------------------

                 436

There is a difference – there are more rows in the index than in the table. This seems to be the problem encountered by initialize authorization.

Hopefully, the table names above can lead to which statements are actually failing to do index maintenance and the problem can be more easily recreated.

------> Mike

All the tables below come from compGeneral/TEST062. I ran it by itself and then checked for corruption but I didn’t find it, so this might be a little harder to debug.

ps - What’s the significance of the UIDs being in the same range?

------> Selva

This might be yet another issue that we might need to understand better. Couple of days ago, I ran executor/TEST106 in my workspace because jenkens had some failures with this test. For some reason, this left the metadata for the table TEST106 inconsistent though the test passed. I couldn’t drop T106B (I think) because it was not able to delete all the rows from SB_HISTOGRAM_INTERVALS for some reason and drop used to get stuck there. However, I didn’t find this table in hbase.

I initialized Trafodion again and ran full regressions again. This time similar problem happens in T42 table in compGeneral/TEST011.

------> Sandhya

This is troubling though – I don’t recall this issue being reported where our dev regressions left the metadata in inconsistent state so easily . We recently made some changes that changed the plan for the delete statement that gets executed during a drop table – it was to avoid an error 73 issue when concurrent ddls were performed. I wonder if there is some issue with that CQD and the plan causing index maintenance on the metadata tables to not work correctly. Or perhaps some other change causing this regression. I am just guessing here – no concrete clues. As Mike says this may be difficult to pin point.

Tags: sql-exe
tags: added: sql-exe
Changed in trafodion:
importance: Undecided → High
milestone: none → r1.1
Changed in trafodion:
assignee: nobody → Justin Du (justin-du-2)
Justin Du (justin-du-2)
Changed in trafodion:
status: New → In Progress
Revision history for this message
Justin Du (justin-du-2) wrote :

This problem may be the result of CQD HIDE_INDEXES 'ALL' in the test (compGeneral/TEST062). In fact the problem can be reproduced with the following scripts:

1 Run this on a new sqlci session:
drop table skc;
create table skc(name varchar(20) character set UCS2 not casespecific not null,
                 userId varchar(20) character set UCS2 not casespecific not null
                )
               store by (name);

drop table dm2c;
create table dm2c( name char(30) character set UCS2 not null,
                   salary float (40) not null)
               store by (name);

control query default HIDE_INDEXES 'ALL';

drop table skc;
drop table dm2c;

2. Run this script on the same sqlci session or new one:
set parserflags 1;
set schema "_MD_";
cqd query_cache '0';

control query shape join(scan(path 'OBJECTS'), scan(path 'OBJECTS_UNIQ_IDX'));
select object_uid, object_type from OBJECTS where object_uid not in
  (select "OBJECT_UID@" from table(index_table OBJECTS_UNIQ_IDX));

control query shape join(scan(path 'OBJECTS_UNIQ_IDX'), scan(path 'OBJECTS'));
select "OBJECT_UID@", object_type, object_name from table(index_table OBJECTS_UNIQ_IDX)
   where "OBJECT_UID@" not in (select object_uid from OBJECTS);

The second select statement would return one or 2 rows.

The user CQD should not affect the metadata queries, so none of the select statements should return any row.

Revision history for this message
Justin Du (justin-du-2) wrote :
Download full text (8.1 KiB)

More info about the problem analysis from email exchanges:

The problem seems with switching CmpContext. Here is the scenario:

1) We executed the drop seabase table statement, we set cqd HIDE_INDEXES ‘NONE’ at sqlcomp/CmpSeabaseDDLcommon.cpp:6620.
2) However, when we started dropping the seabase table at line 6740, we went through a binder step to get the NATable info,
3) Since we were in the embedded CmpContext when doing all this and the getNATable call somehow needs to compile statements to get the table info, this caused CmpContext switch,
4) When switching back, we called restoreAllControlsAndFlags() and that caused the external compiler to reset the cqd HIDE_INDEXES to ‘ALL’.
5) Later we compiled the delete statement, the compilation was sent to the external compiler (as we were already in the embedded compiler) thus the returned plan didn’t include the index part.

Here is the stacks info that shows 2) to 4) . I had observed that 3) didn’t happen sometimes. In that case, the delete statement would be right.

I hope we can find the solution from this call path.

Thanks,
Justin

(gdb) bt
#0 ExControlTcb::work (this=0x7fffe8ab8600) at ../executor/ex_control.cpp:191
#1 0x00007ffff485369b in ex_tcb::sWork (tcb=0x7fffe8ab8600)
    at ../executor/ex_tcb.h:99
#2 0x00007ffff49bdd67 in ExSubtask::work (this=0x7fffe8ab8b38)
    at ../executor/ExScheduler.cpp:751
#3 0x00007ffff49bd12a in ExScheduler::work (this=0x7fffe8ab8240,
    prevWaitTime=0) at ../executor/ExScheduler.cpp:328
#4 0x00007ffff48aa3a2 in ex_root_tcb::execute (this=0x7fffe8ab8bb8,
    cliGlobals=0xb84880, glob=0x7fffe8ab47a0, input_desc=0x7fffe8ae1030,
    diagsArea=@0x7ffffffe86d0, reExecute=0) at ../executor/ex_root.cpp:1055
#5 0x00007ffff5ef63db in CliStatement::execute (this=0x7fffe8ae0168,
    cliGlobals=0xb84880, input_desc=0x7fffe8ae1030, diagsArea=...,
    execute_state=CliStatement::INITIAL_STATE_, fixupOnly=0, cliflags=0)
    at ../cli/Statement.cpp:4570
#6 0x00007ffff5e7fc7f in SQLCLI_PerformTasks(CliGlobals *, ULng32, SQLSTMT_ID *, SQLDESC_ID *, SQLDESC_ID *, Lng32, Lng32, typedef __va_list_tag __va_list_tag *, SQLCLI_PTR_PAIRS *, SQLCLI_PTR_PAIRS *) (cliGlobals=0xb84880, tasks=4882,
    statement_id=0x10d5f40, input_descriptor=0x1bbddc0, output_descriptor=0x0,
    num_input_ptr_pairs=0, num_output_ptr_pairs=0, ap=0x7ffffffe8cb0,
    input_ptr_pairs=0x0, output_ptr_pairs=0x0) at ../cli/Cli.cpp:3284
#7 0x00007ffff5e805f0 in SQLCLI_Exec(CliGlobals *, SQLSTMT_ID *, SQLDESC_ID *, Lng32, typedef __va_list_tag __va_list_tag *, SQLCLI_PTR_PAIRS *) (
    cliGlobals=0xb84880, statement_id=0x10d5f40, input_descriptor=0x1bbddc0,
    num_ptr_pairs=0, ap=0x7ffffffe8cb0, ptr_pairs=0x0) at ../cli/Cli.cpp:3531
#8 0x00007ffff5f0b938 in SQL_EXEC_Exec (statement_id=0x10d5f40,
    input_descriptor=0x1bbddc0, num_ptr_pairs=0) at ../cli/CliExtern.cpp:2071
#9 0x00007ffff49232b0 in ExeCliInterface::exec (this=0x7ffffffe90e0,
    inputBuf=0x0, inputBufLen=0) at ../executor/ExExeUtilCli.cpp:616
#10 0x00007ffff4923e6a in ExeCliInterface::executeImmediateExec (
    this=0x7ffffffe90e0,
    stmtStr=0x7ffffffe8f00 "control query default hide_indexes reset;",
    outputBuf=0x...

Read more...

Revision history for this message
Justin Du (justin-du-2) wrote :

If my analysis is correct, the problem is because we called CmpSeabaseDDL::sendAllControlsAndFlags() (which issues hide_indexes ‘NONE’) twice, once at sqlcomp/CmpSeabaseDDLcommon.cpp:6620 without given prevContext, another at CmpSeabaseDDL::switchCompiler() but with prevContext (i.e. the current cmp context when it is called). The second call can be traced back to sqlcomp/CmpSeabaseDDLcommon.cpp:6740 as we needed to get the table info at the bind stage for dropping the seabase table. Also as the second call to sendAllControlsAndFlags() passed prevContext, this prevContext contains user cqds including hide_indexes ‘ALL’. This cqd was sent to the external compiler via sendAllControls() (in sqlcomp/CmpDescribe.cpp). Later in CmpSeabaseDDL::sendAllControlsAndFlags(), we did hold and set cqd hide_indexes ‘NONE’ at line1319. These caused the external compiler to hold ‘ALL’ value and set ‘NONE’.

Once done with getting table info and switching back cmp context, we called CmpSeabaseDDL::restoreAllControlsAndFlags(). This is why the external compiler process got the hide_indexes ‘ALL’ value back and used it when compiling the delete metadata table statements

Revision history for this message
Justin Du (justin-du-2) wrote :

The fix is not to send user CQDs to the metadata compiler instances. This is done by not passing current CmpContext (switching from) when calling CmpSeabaseDDL::sendAllControlsAndFlags() in CmpSeabaseDDL::switchCompiler().

Justin Du (justin-du-2)
Changed in trafodion:
status: In Progress → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.