Not all logs on all nodes are processed by event log reader UDF
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Trafodion | Fix Released | Critical | Hans Zeller |
Bug Description
select * from udf(event_
Tried on centos-
SQL>select * from udf(event_
--- 0 row(s) selected.
SQL>select distinct cpu from udf(event_
CPU
--- 4 row(s) selected.
The result is missing CPU 0 and CPU 1, but we do have logs on those nodes.
Check node1 (centos-mapr2):
Cat master_
2015-01-19 04:29:53,454, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 3309, Process Name: $Z0102PJ,,, An ESP process is launched.
2015-01-19 04:35:23,103, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 6333, Process Name: $Z01055Y,,, An ESP process is launched.
2015-01-19 05:31:34,150, INFO, SQL.ESP, Node Number: 0, CPU: 1, PIN: 30036, Process Name: $Z010PI6,,, An ESP process is launched.
Changed in trafodion: | |
assignee: | nobody → Hans Zeller (hans-zeller) |
importance: | Undecided → High |
milestone: | none → r1.0 |
Changed in trafodion: | |
importance: | High → Critical |
Changed in trafodion: | |
status: | New → Confirmed |
Changed in trafodion: | |
status: | Confirmed → In Progress |
The issue is that we start the tdm_udrserv processes on random CPUs instead of co-locating them with the ESPs. The query plan ensures we get one ESP per node, and that seems to work in this case, but each of these ESPs then starts its UDR server on a random CPU, so we return some data multiple times and other data not at all. Example:
[trafodion@centos-mapr3 ~]$ sqps | grep tdm_arkesp
[$Z020HEQ] 000,00031849 001 GEN ES--A-- $Z000QZZ $Z000PH2 tdm_arkesp
[$Z020HEQ] 001,00000807 001 GEN ES--A-- $Z0100N2 $Z000PH2 tdm_arkesp
[$Z020HEQ] 002,00015935 001 GEN ES--A-- $Z020D0A $Z000PH2 tdm_arkesp
[$Z020HEQ] 003,00001904 001 GEN ES--A-- $Z0301JE $Z000PH2 tdm_arkesp
[$Z020HEQ] 004,00024644 001 GEN ES--A-- $Z040K44 $Z000PH2 tdm_arkesp
[$Z020HEQ] 005,00002529 001 GEN ES--A-- $Z050229 $Z000PH2 tdm_arkesp
[trafodion@centos-mapr3 ~]$ sqps | grep tdm_udrserv
[$Z020HH1] 002,00015941 001 GEN ES--A-- $Z020D0G $Z000QZZ tdm_udrserv
[$Z020HH1] 002,00015943 001 GEN ES--A-- $Z020D0I $Z0100N2 tdm_udrserv
[$Z020HH1] 002,00015944 001 GEN ES--A-- $Z020D0J $Z050229 tdm_udrserv
[$Z020HH1] 003,00001911 001 GEN ES--A-- $Z0301JL $Z020D0A tdm_udrserv
[$Z020HH1] 004,00024650 001 GEN ES--A-- $Z040K4A $Z0301JE tdm_udrserv
[$Z020HH1] 005,00002536 001 GEN ES--A-- $Z05022G $Z040K44 tdm_udrserv
[trafodion@centos-mapr3 ~]$
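To make the effect of this placement concrete, here is a minimal sketch in Python. It assumes that each UDR server reads only the logs of the node it runs on, and that the first number in each sqps line is the node number; the function name `log_coverage` is hypothetical and not part of Trafodion.

```python
from collections import Counter


def log_coverage(udr_nodes, num_nodes):
    """Return (duplicated, missing) node lists for a UDR server placement.

    Assumption: each UDR server reads only the logs of the node it runs on,
    so a node hosting several servers is read several times and a node
    hosting none is never read.
    """
    reads = Counter(udr_nodes)
    duplicated = sorted(n for n, count in reads.items() if count > 1)
    missing = [n for n in range(num_nodes) if reads[n] == 0]
    return duplicated, missing


# Node numbers of the six tdm_udrserv processes in the listing above.
observed = [2, 2, 2, 3, 4, 5]
# Placement if every UDR server ran on the same node as its ESP.
colocated = [0, 1, 2, 3, 4, 5]

print(log_coverage(observed, 6))   # ([2], [0, 1])
print(log_coverage(colocated, 6))  # ([], [])
```

Under these assumptions, the observed placement reads node 2 three times and never reads nodes 0 and 1, which matches the four distinct CPU values in the bug description.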
The fix is to co-locate each tdm_udrserv with its local ESP. That should not cause any load balancing issues, since we already made sure that the ESPs are evenly balanced. On the contrary, if the UDR servers perform fairly heavy work, this ensures that they are as evenly balanced as the ESPs.
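One way to check co-location from sqps output is sketched below in Python. The column layout (node and pid, priority, type, state, process name, parent name, program) is an assumption inferred from the listing in this comment, not taken from sqps documentation, and the helper names are hypothetical.

```python
import re

# Assumed layout: [monitor] nid,pid priority type state name parent program
LINE_RE = re.compile(
    r"\[\S+\]\s+(?P<nid>\d+),(?P<pid>\d+)\s+\S+\s+\S+\s+\S+\s+"
    r"(?P<name>\$\S+)\s+(?P<parent>\$\S+)\s+(?P<program>\S+)"
)

# Process lines copied from the listing in this comment.
SQPS_OUTPUT = """\
[$Z020HEQ] 000,00031849 001 GEN ES--A-- $Z000QZZ $Z000PH2 tdm_arkesp
[$Z020HEQ] 001,00000807 001 GEN ES--A-- $Z0100N2 $Z000PH2 tdm_arkesp
[$Z020HEQ] 002,00015935 001 GEN ES--A-- $Z020D0A $Z000PH2 tdm_arkesp
[$Z020HEQ] 003,00001904 001 GEN ES--A-- $Z0301JE $Z000PH2 tdm_arkesp
[$Z020HEQ] 004,00024644 001 GEN ES--A-- $Z040K44 $Z000PH2 tdm_arkesp
[$Z020HEQ] 005,00002529 001 GEN ES--A-- $Z050229 $Z000PH2 tdm_arkesp
[$Z020HH1] 002,00015941 001 GEN ES--A-- $Z020D0G $Z000QZZ tdm_udrserv
[$Z020HH1] 002,00015943 001 GEN ES--A-- $Z020D0I $Z0100N2 tdm_udrserv
[$Z020HH1] 002,00015944 001 GEN ES--A-- $Z020D0J $Z050229 tdm_udrserv
[$Z020HH1] 003,00001911 001 GEN ES--A-- $Z0301JL $Z020D0A tdm_udrserv
[$Z020HH1] 004,00024650 001 GEN ES--A-- $Z040K4A $Z0301JE tdm_udrserv
[$Z020HH1] 005,00002536 001 GEN ES--A-- $Z05022G $Z040K44 tdm_udrserv
"""


def parse_sqps(text):
    """Parse sqps lines into dicts; lines that do not match are skipped."""
    procs = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            procs.append({
                "nid": int(m["nid"]),
                "name": m["name"],
                "parent": m["parent"],
                "program": m["program"],
            })
    return procs


def misplaced_udr_servers(text):
    """List (udr_name, udr_node, esp_node) where a UDR server is not on its ESP's node."""
    procs = parse_sqps(text)
    esp_node = {p["name"]: p["nid"] for p in procs if p["program"] == "tdm_arkesp"}
    return [
        (p["name"], p["nid"], esp_node[p["parent"]])
        for p in procs
        if p["program"] == "tdm_udrserv"
        and p["parent"] in esp_node
        and esp_node[p["parent"]] != p["nid"]
    ]


for name, udr_node, esp_node in misplaced_udr_servers(SQPS_OUTPUT):
    print(f"{name}: runs on node {udr_node}, parent ESP is on node {esp_node}")
```

With the listing from before the fix, all six UDR servers are reported as misplaced. After the fix, the same check would be expected to return an empty list.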