Lots of mxosrvrs kicked off for proper number to stay up.
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Trafodion |
Confirmed
|
Medium
|
Anuradha |
Bug Description
We’ve noticed this before but never make anything out of it since we did not have proof. But now that master_exec files are created in the logs directory, I believe we have proof that on dcsstart, we need to start many mxosrvrs before we get to the number of mxosrvrs we actually want. This concerns us a bit as it means that we are creating processes that die right away.
The last start time on zircon4 was around at 15:48.
/opt/hp/
129
So this is showing that we have 128 mxosrvrs running on this node + 1 header line from the PS command.
/opt/hp/
598
This is telling that we have 598 master_exec* file under the directory. And yes, these are all from this start as the mod date is all 15:48.
/opt/hp/
-rw-r----- 1 squser4 seaquest 42037 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 42543 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 41706 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 41375 Jan 13 15:48 master_
…
-rw-r----- 1 squser4 seaquest 1499 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 1168 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 1499 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 662 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 331 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 837 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 0 Jan 13 15:48 master_
-rw-r----- 1 squser4 seaquest 0 Jan 13 15:47 master_exec
Looking at some of those:
/opt/hp/
PID TTY STAT TIME COMMAND
/opt/hp/
2015-01-13 15:48:27,292, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error Platform: NSK, Transport: TCPIP, Api: UNKNOWN_API, Error type: LISTENER, Process: ListenToPort, Operation: INIT_PROCESS, function: ACCEPT, error: 98, error_detail: 0. has occurred.
CEE Error Text: The specified address is already in use.
2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: bailout ListenToPort[
CEE Error Text: .
2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error 5012 has occurred.
CEE Error Text: <5>.
The above is showing that 63667 did not survive as an mxosrvr.
/opt/hp/
PID TTY STAT TIME COMMAND
65102 ? Sl 0:07 mxosrvr -ZKHOST n014.cm.
/opt/hp/
2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvail
CEE Error Text: The specified address is already in use.
2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvail
CEE Error Text: The specified address is already in use.
**** The above msg is giving 13 more times. (Exact same date time).
But finally it worked as the process is running.
If I was to look at all the master_exec above I would find all of them have at least one error and (598 - 128) ~> 470 of them did not start.
Changed in trafodion: | |
status: | New → Confirmed |
Changed in trafodion: | |
milestone: | none → r1.1 |
assignee: | nobody → Anuradha (anuradha-hegde) |
Changed in trafodion: | |
milestone: | r1.1 → none |
Changed in trafodion: | |
milestone: | none → r2.0 |