DCS: Can not start more than 200 mxosrvrs across cluster.

Bug #1408454 reported by Guy Groulx
This bug affects 1 person
Affects: Trafodion
Status: Fix Released
Importance: Critical
Assigned to: Matt Brown

Bug Description

We installed git150107 on our system.
Our DCS servers file is configured to start 128 servers on each of our 10 nodes.

With the dcs-150107 version, we got many core dumps, for example:

echo /local/cores/1005/core.1420648319.n013.25176.mxosrvr
echo
[New Thread 25176]
[New Thread 25834]
[New Thread 25937]
[New Thread 25936]
[New Thread 25833]
[New Thread 25832]
Core was generated by `0:63:21063:10:64:21064:10:65:21065:10:66:21066:10:67:21067:10:68:21068:10:69:21'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff4d48d5f in ?? ()
#0 0x00007ffff4d48d5f in ?? ()
#1 0x00000000004dd2ed in ?? ()

In testing various scenarios, we found that we could start 100 servers on each of 2 nodes, but at 128 servers per node the ports would not connect correctly.

Changed in trafodion:
assignee: nobody → Matt Brown (mattbrown-2)
milestone: none → r1.0
Trafodion-Gerrit (neo-devtools) wrote : Fix proposed to dcs (master)

Fix proposed to branch: master
Review: https://review.trafodion.org/929

Changed in trafodion:
status: New → In Progress
Matt Brown (mattbrown-2)
Changed in trafodion:
status: In Progress → Fix Committed
Trafodion-Gerrit (neo-devtools) wrote : Fix merged to dcs (master)

Reviewed: https://review.trafodion.org/929
Committed: https://github.com/trafodion/dcs/commit/a5c766c7aef13584706810636a77f4eda712baa4
Submitter: Trafodion Jenkins
Branch: master

commit a5c766c7aef13584706810636a77f4eda712baa4
Author: matbrown <email address hidden>
Date: Fri Jan 9 19:45:13 2015 +0000

    Disable mxosrvr port map
    Closes-Bug: #1408454

    In order to improve mxosrvr start time performance 2 new properties were
    added:

    dcs.server.user.program.port.map.timeout.seconds = 60
    dcs.server.user.program.port.bind.timeout.seconds = 30

    They are passed to mxosrvr at start time. The DcsMaster would create a
    port map in the servers/registered znode so mxosrvrs could identify
    their DcsServer instance/child id and discover the TCP/IP port to use.
    During performance testing, it was found that because DCS and the
    mxosrvrs use ports in the ephemeral range, DCS couldn't guarantee
    dedicated use of a given range of ports. Setting the 2 properties to -1
    disables port mapping and returns to the original herding-style port
    identification:

    dcs.server.user.program.port.map.timeout.seconds = -1
    dcs.server.user.program.port.bind.timeout.seconds = -1

    Change-Id: I0497bfbc58878378d67a8cba9293708e88dc271e
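
The commit message attributes the problem to server ports falling inside the ephemeral range, where the operating system may hand the same ports to unrelated outgoing connections. The following is a minimal Python sketch of how one might check a planned block of server ports for such an overlap. It is not part of DCS. The function names, the base port 40000, and the fallback range 32768-60999 (a common Linux default) are assumptions made for illustration, and the `/proc` path applies to Linux only.

```python
from pathlib import Path

# Assumed Linux location of the ephemeral (local) port range setting.
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def read_ephemeral_range(path=PORT_RANGE_FILE):
    """Return (low, high) from the kernel setting, or None if unavailable."""
    try:
        low, high = path.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return None


def ports_in_ephemeral_range(base_port, count, ephemeral):
    """Return the ports in [base_port, base_port + count) that also lie in
    the inclusive ephemeral range (low, high)."""
    low, high = ephemeral
    first = max(base_port, low)
    last = min(base_port + count - 1, high)
    return range(first, last + 1) if first <= last else range(0)


if __name__ == "__main__":
    # Hypothetical values: 128 servers per node starting at port 40000.
    eph = read_ephemeral_range() or (32768, 60999)
    clash = ports_in_ephemeral_range(40000, 128, eph)
    print(f"ephemeral range {eph}: {len(clash)} of 128 ports overlap")
```

A non-empty result means the operating system could assign some of those ports to other sockets, which is the situation in which the commit message says dedicated use of a port range could not be guaranteed.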

Guy Groulx (guy-groulx) wrote :

Fixed as seen in git150114.

Changed in trafodion:
status: Fix Committed → Fix Released