Server takes a long time to populate the server browser list.

Bug #1201200 reported by Dean Bouvier
This bug affects 1 person
Affects: Ember
Status: Confirmed
Importance: Medium
Assigned to: Erik Ogenvik
Milestone: (none)

Bug Description

I've created a new server, and it takes around 5 minutes for it to be added to the server browser list.
I can direct connect immediately to it using the manual method.
The server is running the most recent version (6.2), and the problem occurs with 7.0 and 7.1 clients. It has been running and reporting in to the Metaserver for several weeks now. Ping is roughly 100ms, which is typical of or better than other servers on the list.

Currently the server is called "New Test Server" and you can manually connect to it at: alienchrysalis.net

Sean Ryan (sryan) wrote :

Just to add some clarity to this, the process is something like this:
1) ember sends a server list request to the metaserver
2) response(s) come back
3) servers are individually enumerated and queried
4) once ember has the information for the server, it is put in the server browser.

Things that affect this:
1) slow servers
2) one/more servers with some firewall issues (I think that the query is UDP, and is repeated after a time) which can take a lot of time to sort out.
3) network congestion

Technically this is not a 'bug', as it is working exactly as it is supposed to, but the user experience is definitely lacking.

The solution that has been talked about for this is threefold:

1) modify the server (cyphesis) to push all the stats that ember (the client) would want to query: things like uptime, users, server name, etc. This was originally held up because the metaserver client portion of cyphesis was a custom and minimal subset of the functionality. This was recently completed, and it would be reasonably trivial to add this to cyphesis (it's on my todo)

2) modify the metaserver to allow for stateful client based filtering. This is approximately 75% complete, and would be trivial to complete the rest (also on my todo;)

3) modify ember to populate the server browser based on information from the metaserver, not the actual server. I have not started this, but if I recall correctly this should also not be very difficult.
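To make the three pieces a bit more concrete, here is a rough sketch of the idea, not the actual metaserver protocol or code. The class name, method names and attribute keys are all hypothetical; the thread only mentions uptime, users and server name as examples of stats. The server pushes its attributes, a client registers a filter that the metaserver remembers, and the client's listing is built from metaserver data alone.

```python
from typing import Dict


class MetaserverListing:
    """Hypothetical in-memory model of a metaserver that stores pushed stats."""

    def __init__(self) -> None:
        self._servers: Dict[str, Dict[str, str]] = {}
        self._filters: Dict[str, Dict[str, str]] = {}

    def push_attributes(self, address: str, **attrs: object) -> None:
        # Step 1: a game server pushes the stats a client would otherwise query.
        stored = self._servers.setdefault(address, {})
        stored.update({key: str(value) for key, value in attrs.items()})

    def set_filter(self, client_id: str, **wanted: object) -> None:
        # Step 2: the filter is remembered per client (the "stateful" part).
        self._filters[client_id] = {k: str(v) for k, v in wanted.items()}

    def list_for(self, client_id: str) -> Dict[str, Dict[str, str]]:
        # Step 3: the client fills its browser from this result only,
        # without contacting each game server.
        wanted = self._filters.get(client_id, {})
        return {
            address: dict(attrs)
            for address, attrs in self._servers.items()
            if all(attrs.get(k) == v for k, v in wanted.items())
        }
```

A client that never sets a filter simply gets every server the metaserver knows about.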

Erik, let me know if you think that's kosher, and if so I can go ahead and complete 1 and 2 easily enough, and we can talk about how to work it from the client side.

Erik Ogenvik (erik-ogenvik) wrote :

@Sean, I think what you describe is pretty much spot on what we've thought about previously. It sounds like you can go ahead with 1 and 2 right now; once that's in place we'll alter Ember.

In the meantime I'll look into this on the Eris side, just to check that there's no stupid bug that can be fixed in the interim.

Changed in ember:
assignee: nobody → Erik Ogenvik (erik-ogenvik)
importance: Undecided → Medium
status: New → Confirmed
Sean Ryan (sryan) wrote :

@Erik, what you may want to do is look at the manner in which the servers are queried. Are they done in parallel, etc ... it could be that you spawn a bunch of tasks to handle it or something, since the comms are async for this if I recall.
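As an illustration of the "bunch of tasks" idea, here is a minimal sketch with simulated queries. It is not Eris code: query_server, query_all, the 0.2 s timeout and the limit of 32 queries in flight are all made up for the example, and a real client would send UDP packets instead of sleeping. The point is only that a dead server costs one timeout for the whole batch rather than one timeout each.

```python
import asyncio
from typing import Dict, Optional, Tuple


async def query_server(address: str, reply_delay: Optional[float],
                       timeout: float) -> Tuple[str, str]:
    """Simulated status query; reply_delay=None means the server never answers."""

    async def fake_reply() -> None:
        if reply_delay is None:
            await asyncio.Event().wait()  # blocks until cancelled by the timeout
        else:
            await asyncio.sleep(reply_delay)

    try:
        await asyncio.wait_for(fake_reply(), timeout)
        return address, "online"
    except asyncio.TimeoutError:
        return address, "unknown"


async def query_all(servers: Dict[str, Optional[float]], timeout: float = 0.2,
                    max_in_flight: int = 32) -> Dict[str, str]:
    """Query every server concurrently, with a cap on simultaneous queries."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(address: str, delay: Optional[float]) -> Tuple[str, str]:
        async with limit:
            return await query_server(address, delay, timeout)

    results = await asyncio.gather(*(bounded(a, d) for a, d in servers.items()))
    return dict(results)


if __name__ == "__main__":
    demo = {"live-1": 0.01, "dead-1": None, "dead-2": None, "live-2": 0.05}
    print(asyncio.run(query_all(demo)))
```

The semaphore is there so a very long list does not fire every query at once; whether a cap is needed at all depends on how the real client sends its packets.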

What I would also suggest on the ember side ... is that in the interim, perhaps we can populate the server browser based solely on the metaserver information (which at present is just the IP ... which should still serve as a valid method to connect to it). And have a status attribute ... something where you can say "online" or "unknown" or "down" or some such.

That way, you fire off the queries to the server, and then they are all listed, and then they can be updated/filtered, etc in the server browsers as updates come in. This can allow special actions like maybe saying 'hey i sent a packet to server x twice and got no response, maybe i should send some closer together rather than the default'.

Just a visual cue item, but seems a bummer to have a feeling of unknown to it. Even if we implement everything I discussed, this is still useful where the MS has *some* but not all of the information. Multiple levels of sorting can be applied to ensure that the servers are displayed in the right manner.
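Something along these lines is what I have in mind for the status attribute, as a rough sketch only. The class and method names are hypothetical and none of this is Ember code; the three status labels and the "twice with no response" threshold come from the text above. Every address from the metaserver is listed right away as "unknown", then updated as replies or timeouts come in, and the list is sorted on status first and name second.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Status(Enum):
    ONLINE = "online"
    UNKNOWN = "unknown"
    DOWN = "down"


@dataclass
class ServerEntry:
    address: str
    status: Status = Status.UNKNOWN
    name: str = ""
    unanswered: int = 0  # consecutive queries with no reply


class ServerBrowser:
    """Hypothetical browser model that lists servers before they are queried."""

    def __init__(self, addresses: Iterable[str], max_unanswered: int = 2) -> None:
        self.entries: Dict[str, ServerEntry] = {a: ServerEntry(a) for a in addresses}
        self.max_unanswered = max_unanswered

    def on_reply(self, address: str, name: str) -> None:
        entry = self.entries[address]
        entry.status, entry.name, entry.unanswered = Status.ONLINE, name, 0

    def on_timeout(self, address: str) -> None:
        entry = self.entries[address]
        entry.unanswered += 1
        if entry.unanswered >= self.max_unanswered:
            entry.status = Status.DOWN

    def sorted_entries(self) -> List[ServerEntry]:
        # Two sort levels: status first, then name (or address if unnamed).
        rank = {Status.ONLINE: 0, Status.UNKNOWN: 1, Status.DOWN: 2}
        return sorted(self.entries.values(),
                      key=lambda e: (rank[e.status], e.name or e.address))
```

The on_timeout hook is also where a smarter retry schedule could go, such as sending the next packet sooner than the default.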

I'll get started on #1 and #2 then and let you know in a few days.

Erik Ogenvik (erik-ogenvik) wrote :

I've looked into it, and the issue seems to be with the Metaserver.

The main issue is that it doesn't purge inactive servers. When connecting to it, it reports 1604 servers. The vast majority of these are inactive, but Ember still has to query them one after another with a timeout, which is the reason it takes five minutes for Dean's server to appear (I assume it's at the bottom of the list).

Another issue is that I once got an error from Eris saying that the query to the Metaserver timed out. This happened while fetching one portion of the list of servers (the list is so large that it comes in portions), at around 23:40 CEST.

This issue should go away with the new protocol, but we have to keep in mind that there are older clients out there that are using the old protocol, so we need to fix this nonetheless.
@Sryan, could you check how the code for purging inactive servers is behaving?
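A back-of-the-envelope check of the numbers above, as a small sketch. It assumes a strictly sequential scan with the target at the very end of the list, which is my reading of the behaviour rather than something measured; the function names and the list size of 50 are made up for illustration.

```python
def per_entry_cost(total_wait_s: float, entries_scanned: int) -> float:
    """Average seconds spent per list entry in a strictly sequential scan."""
    if entries_scanned <= 0:
        raise ValueError("need at least one entry")
    return total_wait_s / entries_scanned


def wait_for_last_entry(list_size: int, cost_per_entry_s: float) -> float:
    """Seconds until the last entry is reached at a fixed per-entry cost."""
    return list_size * cost_per_entry_s


if __name__ == "__main__":
    # Five minutes and 1604 servers are the figures reported in this thread.
    cost = per_entry_cost(5 * 60, 1604)
    print(f"implied cost per entry: {cost:.3f} s")
    # 50 is a hypothetical size for a list with the inactive entries purged.
    print(f"wait with 50 entries: {wait_for_last_entry(50, cost):.1f} s")
```

With those figures the implied cost is a little under 0.2 s per entry, so the wait is driven almost entirely by the length of the list.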

Sean Ryan (sryan) wrote :

I'll look into it for sure. You are correct, that is huge. It's very odd ... the max connection count is supposed to be 1000 and the timeout is supposed to be controlled by this config item:

server_session_expiry_seconds=300

wf@code-bear:~/metaserver-ng$ bin/testclient --server metaserver.worldforge.org | grep Server | awk '{ print $2 }' | sort | uniq | wc -l
1601
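For reference, this is roughly the behaviour that setting is supposed to give, written as a standalone sketch. It is not the metaserver-ng implementation; SessionTable and its methods are hypothetical names, and only the 300 second value comes from the config item above. A server stays listed only while its last keepalive is newer than the expiry window.

```python
import time
from typing import Callable, Dict, List

SERVER_SESSION_EXPIRY_SECONDS = 300  # value of the config item quoted above


class SessionTable:
    """Hypothetical in-memory table of server sessions keyed by address."""

    def __init__(self, expiry_seconds: float = SERVER_SESSION_EXPIRY_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry = expiry_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._last_seen: Dict[str, float] = {}

    def keepalive(self, address: str) -> None:
        """Record that a server has just reported in."""
        self._last_seen[address] = self._clock()

    def purge_expired(self) -> List[str]:
        """Drop sessions older than the expiry window and return their addresses."""
        now = self._clock()
        expired = [address for address, seen in self._last_seen.items()
                   if now - seen > self._expiry]
        for address in expired:
            del self._last_seen[address]
        return expired

    def active(self) -> List[str]:
        return sorted(self._last_seen)
```

If the purge step stops running, for whatever reason, nothing ever leaves the table and the list only grows.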

Sean Ryan (sryan) wrote :

wf@code-bear:~/metaserver-ng/bin$ ./testclient --server=metaserver.worldforge.org | grep Server: | wc -l
45

Ok ... now we're good, I have restarted things.

I see from the logs that there was an IO Exception in the async service ... the cause is unknown, but it is likely systemic (meaning something with the system, like load) that caused a hiccup. I actually catch this and attempt to restart the service internally inside the main.cpp for the metaserver. This was meant as a fail-safe to attempt recovery, but it appears to not really be possible.

The problem with this use case is that because we can't really know what the exception was (it is a general catch-all), it's entirely possible that the exception is fatal. In either case, since I've committed the watchdog script as part of the metaserver build, I may take the internal restart out and just let it fail. I'll experiment with it a bit ... see if I can reproduce it (I'm not optimistic)
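To illustrate the trade-off, here is a small sketch in Python rather than the actual main.cpp logic. The function name, the restart limit and the choice of OSError as the "recoverable" class are assumptions made for the example. Only errors known to be transient trigger an internal restart, and anything else propagates so the process exits and an external watchdog can restart it.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_restarts(service: Callable[[], T], max_restarts: int = 3,
                      recoverable: Tuple[Type[BaseException], ...] = (OSError,)) -> T:
    """Run service(), restarting it only for exception types known to be transient."""
    restarts = 0
    while True:
        try:
            return service()
        except recoverable:
            if restarts >= max_restarts:
                raise  # give up and let the external watchdog take over
            restarts += 1
        # Any other exception type is not caught here and ends the process.
```

A bare catch-all would restart on everything, including failures that leave the service in a state it cannot recover from.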

Dean Bouvier (demarii-wf) wrote :

So it turns out that part of the problem was that my firewall was only passing TCP and not UDP.
After reconfiguring it, the New Test Server is coming online immediately. So it looks like a bit
of a false alarm, but if it helped find a few unknown bugs then I guess it was partly lucky.
