Not all logs on all nodes are processed by event log reader UDF

Bug #1412630 reported by gaoruixian
This bug affects 1 person
Affects: Trafodion
Status: Fix Released
Importance: Critical
Assigned to: Hans Zeller
Milestone: r1.0

Bug Description

select * from udf(event_log_reader('f')) should return all records from the logs on all nodes. However, the logs on some nodes do not appear to be processed.

Tried on centos-mapr1.hpl.hp.com:37800

SQL>select * from udf(event_log_reader('f')) where cpu=0 and log_file_name='master_exec_0_7476.log';

--- 0 row(s) selected.

SQL>select distinct cpu from udf(event_log_reader('f')) where log_file_name='master_exec_0_7476.log';

CPU
-----------
          2
          5
          3
          4

--- 4 row(s) selected.

The result does not include cpu 0 or cpu 1, but we do have logs on those nodes.

Check node1 (centos-mapr2):

[trafodion@centos-mapr2 logs]$ ll master*.log
-rw-r--r-- 1 trafodion trafodion 258 Jan 18 20:35 master_exec_0_7476.log
-rw-r--r-- 1 trafodion trafodion 258 Jan 18 20:30 master_exec_0_7487.log
-rw-r--r-- 1 trafodion trafodion 136 Jan 18 19:42 master_exec_0_7851.log
-rw-r--r-- 1 trafodion trafodion 5281 Jan 18 21:09 master_exec_1_15592.log
-rw-r--r-- 1 trafodion trafodion 7216 Jan 18 20:35 master_exec_1_15605.log
-rw-r--r-- 1 trafodion trafodion 259 Jan 18 21:25 master_exec_3_2078.log
-rw-r--r-- 1 trafodion trafodion 258 Jan 18 20:27 master_exec_4_15691.log
-rw-r--r-- 1 trafodion trafodion 130 Jan 18 20:03 master_exec_5_11066.log

cat master_exec_0_7476.log:

2015-01-19 04:29:53,454, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 3309, Process Name: $Z0102PJ,,, An ESP process is launched.
2015-01-19 04:35:23,103, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 6333, Process Name: $Z01055Y,,, An ESP process is launched.
2015-01-19 05:31:34,150, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 30036, Process Name: $Z010PI6,,, An ESP process is launched.

Changed in trafodion:
assignee: nobody → Hans Zeller (hans-zeller)
importance: Undecided → High
milestone: none → r1.0
Changed in trafodion:
importance: High → Critical
Changed in trafodion:
status: New → Confirmed
Hans Zeller (hans-zeller) wrote :

The issue is that we start the tdm_udrserv processes on random CPUs instead of co-locating them with the ESPs. The query plan ensures we get one ESP per node, and that seems to work in this case, but each of these ESPs then starts its UDR server on a random CPU, so we return some data multiple times and other data not at all. Example:

[trafodion@centos-mapr3 ~]$ sqps | grep tdm_arkesp
[$Z020HEQ] 000,00031849 001 GEN ES--A-- $Z000QZZ $Z000PH2 tdm_arkesp
[$Z020HEQ] 001,00000807 001 GEN ES--A-- $Z0100N2 $Z000PH2 tdm_arkesp
[$Z020HEQ] 002,00015935 001 GEN ES--A-- $Z020D0A $Z000PH2 tdm_arkesp
[$Z020HEQ] 003,00001904 001 GEN ES--A-- $Z0301JE $Z000PH2 tdm_arkesp
[$Z020HEQ] 004,00024644 001 GEN ES--A-- $Z040K44 $Z000PH2 tdm_arkesp
[$Z020HEQ] 005,00002529 001 GEN ES--A-- $Z050229 $Z000PH2 tdm_arkesp
[trafodion@centos-mapr3 ~]$ sqps | grep tdm_udrserv
[$Z020HH1] 002,00015941 001 GEN ES--A-- $Z020D0G $Z000QZZ tdm_udrserv
[$Z020HH1] 002,00015943 001 GEN ES--A-- $Z020D0I $Z0100N2 tdm_udrserv
[$Z020HH1] 002,00015944 001 GEN ES--A-- $Z020D0J $Z050229 tdm_udrserv
[$Z020HH1] 003,00001911 001 GEN ES--A-- $Z0301JL $Z020D0A tdm_udrserv
[$Z020HH1] 004,00024650 001 GEN ES--A-- $Z040K4A $Z0301JE tdm_udrserv
[$Z020HH1] 005,00002536 001 GEN ES--A-- $Z05022G $Z040K44 tdm_udrserv
[trafodion@centos-mapr3 ~]$

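In this example, three tdm_udrserv processes run on node 2 and none on nodes 0 and 1, so the logs on node 2 are read three times and the logs on nodes 0 and 1 are never read. The fix is to co-locate each tdm_udrserv with its local ESP. That should not cause any load balancing issues, since we already made sure that the ESPs are evenly balanced. On the contrary, if the UDR servers perform fairly heavy work, this ensures that they are as evenly balanced as the ESPs.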
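The placement problem can also be checked mechanically from the sqps listing. The following is only a minimal Python sketch, not part of the fix: it assumes that the second field of each sqps line is "node,pid", that the sixth and seventh fields are the process name and its parent, and that the last field is the program name. These column meanings are inferred from the listing above, and the helper names (parse_sqps_line, udr_placement) are hypothetical.

```python
from collections import Counter

# sqps output copied from the listing in this comment.
SQPS_LINES = """\
[$Z020HEQ] 000,00031849 001 GEN ES--A-- $Z000QZZ $Z000PH2 tdm_arkesp
[$Z020HEQ] 001,00000807 001 GEN ES--A-- $Z0100N2 $Z000PH2 tdm_arkesp
[$Z020HEQ] 002,00015935 001 GEN ES--A-- $Z020D0A $Z000PH2 tdm_arkesp
[$Z020HEQ] 003,00001904 001 GEN ES--A-- $Z0301JE $Z000PH2 tdm_arkesp
[$Z020HEQ] 004,00024644 001 GEN ES--A-- $Z040K44 $Z000PH2 tdm_arkesp
[$Z020HEQ] 005,00002529 001 GEN ES--A-- $Z050229 $Z000PH2 tdm_arkesp
[$Z020HH1] 002,00015941 001 GEN ES--A-- $Z020D0G $Z000QZZ tdm_udrserv
[$Z020HH1] 002,00015943 001 GEN ES--A-- $Z020D0I $Z0100N2 tdm_udrserv
[$Z020HH1] 002,00015944 001 GEN ES--A-- $Z020D0J $Z050229 tdm_udrserv
[$Z020HH1] 003,00001911 001 GEN ES--A-- $Z0301JL $Z020D0A tdm_udrserv
[$Z020HH1] 004,00024650 001 GEN ES--A-- $Z040K4A $Z0301JE tdm_udrserv
[$Z020HH1] 005,00002536 001 GEN ES--A-- $Z05022G $Z040K44 tdm_udrserv
""".splitlines()


def parse_sqps_line(line):
    """Split one sqps line into the fields used below (assumed column layout)."""
    fields = line.split()
    node = int(fields[1].split(",")[0])  # "002,00015941" -> node 2
    return {"node": node, "name": fields[5], "parent": fields[6], "program": fields[7]}


def udr_placement(lines):
    """Return sorted (esp_node, udr_node) pairs, one per tdm_udrserv process."""
    procs = [parse_sqps_line(line) for line in lines if line.strip()]
    esp_node = {p["name"]: p["node"] for p in procs if p["program"] == "tdm_arkesp"}
    pairs = [
        (esp_node[p["parent"]], p["node"])
        for p in procs
        if p["program"] == "tdm_udrserv" and p["parent"] in esp_node
    ]
    return sorted(pairs)


placement = udr_placement(SQPS_LINES)
esp_nodes = {esp for esp, _ in placement}
udr_counts = Counter(udr for _, udr in placement)

# Nodes that have an ESP but no UDR server: their logs are never read.
unread = sorted(esp_nodes - set(udr_counts))
# Nodes with more than one UDR server: their logs are read more than once.
duplicated = sorted(node for node, count in udr_counts.items() if count > 1)

print("ESP node -> UDR server node:", placement)
print("nodes whose logs are not read:", unread)
print("nodes whose logs are read more than once:", duplicated)
```

With co-located UDR servers, every pair would have the same node on both sides and both lists would be empty.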

Changed in trafodion:
status: Confirmed → In Progress
Hans Zeller (hans-zeller) wrote :
Changed in trafodion:
status: In Progress → Fix Committed
gaoruixian (ruixian-gao) wrote :

The fix has been verified on the traf_0122 build on centos-mapr1.

Changed in trafodion:
status: Fix Committed → Fix Released