Lots of mxosrvrs kicked off for proper number to stay up.

Bug #1410504 reported by Guy Groulx
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Confirmed
Medium
Anuradha

Bug Description

We’ve noticed this before but never make anything out of it since we did not have proof. But now that master_exec files are created in the logs directory, I believe we have proof that on dcsstart, we need to start many mxosrvrs before we get to the number of mxosrvrs we actually want. This concerns us a bit as it means that we are creating processes that die right away.

The last start time on zircon4 was around at 15:48.
/opt/hp/squser4/git150112dcs/logs> ps -C mxosrvr | wc –l
129

So this is showing that we have 128 mxosrvrs running on this node + 1 header line from the PS command.

/opt/hp/squser4/git150112dcs/logs> ls master_exec* | wc –l
598
This is telling that we have 598 master_exec* file under the directory. And yes, these are all from this start as the mod date is all 15:48.

/opt/hp/squser4/git150112dcs/logs> ls -lt master*
-rw-r----- 1 squser4 seaquest 42037 Jan 13 15:48 master_exec_0_21145.log
-rw-r----- 1 squser4 seaquest 42543 Jan 13 15:48 master_exec_0_21018.log
-rw-r----- 1 squser4 seaquest 41706 Jan 13 15:48 master_exec_0_21012.log
-rw-r----- 1 squser4 seaquest 41375 Jan 13 15:48 master_exec_0_20991.log

-rw-r----- 1 squser4 seaquest 1499 Jan 13 15:48 master_exec_0_63779.log
-rw-r----- 1 squser4 seaquest 1168 Jan 13 15:48 master_exec_0_63462.log
-rw-r----- 1 squser4 seaquest 1499 Jan 13 15:48 master_exec_0_63652.log
-rw-r----- 1 squser4 seaquest 662 Jan 13 15:48 master_exec_0_63673.log
-rw-r----- 1 squser4 seaquest 331 Jan 13 15:48 master_exec_0_63598.log
-rw-r----- 1 squser4 seaquest 837 Jan 13 15:48 master_exec_0_63667.log
-rw-r----- 1 squser4 seaquest 0 Jan 13 15:48 master_exec_0_63594.log
-rw-r----- 1 squser4 seaquest 0 Jan 13 15:47 master_exec

Looking at some of those:
/opt/hp/squser4/git150112dcs/logs> ps 63667
  PID TTY STAT TIME COMMAND
/opt/hp/squser4/git150112dcs/logs> cat master_exec_0_63667.log
2015-01-13 15:48:27,292, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error Platform: NSK, Transport: TCPIP, Api: UNKNOWN_API, Error type: LISTENER, Process: ListenToPort, Operation: INIT_PROCESS, function: ACCEPT, error: 98, error_detail: 0. has occurred.
CEE Error Text: The specified address is already in use.

2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: bailout ListenToPort[21002][-2], Operation: INIT_PROCESS, function: SOCKET, error: 0, error_detail: 0. has occurred.
CEE Error Text: .

2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error 5012 has occurred.
CEE Error Text: <5>.
The above is showing that 63667 did not survive as an mxosrvr.

/opt/hp/squser4/git150112dcs/logs> ps 65102
  PID TTY STAT TIME COMMAND
65102 ? Sl 0:07 mxosrvr -ZKHOST n014.cm.cluster:2181,n015.cm.cluster:2181,n016.cm.cluster:2181 -RZ zircon-n011.usa.hp.com
/opt/hp/squser4/git150112dcs/logs> cat master_exec_0_65102.log
2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvailable:[21002], Operation: INIT_PROCESS, function: BIND, error: 98, error_detail: 0. has occurred.
CEE Error Text: The specified address is already in use.

2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvailable:[21003], Operation: INIT_PROCESS, function: BIND, error: 98, error_detail: 0. has occurred.
CEE Error Text: The specified address is already in use.

**** The above msg is giving 13 more times. (Exact same date time).
But finally it worked as the process is running.

If I was to look at all the master_exec above I would find all of them have at least one error and (598 - 128) ~> 470 of them did not start.

Matt Brown (mattbrown-2)
Changed in trafodion:
status: New → Confirmed
Changed in trafodion:
milestone: none → r1.1
assignee: nobody → Anuradha (anuradha-hegde)
Changed in trafodion:
milestone: r1.1 → none
Changed in trafodion:
milestone: none → r2.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.