Not enough bandwidth available, please try again later

Bug #1088638 reported by maksis
This bug affects 3 people
Affects: ADCH++
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

The error "Not enough bandwidth available, please try again later" in ClientManager::verifyOverflow shouldn't be sent with the "TL -1" param, as quite a few users receive that error after the hub has been restarted.

Because of that it takes a long time to get all users back in the hub as their clients won't reconnect automatically.
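
For readers unfamiliar with the parameter, here is a minimal sketch of how a client might act on a TL value. It assumes the semantics described in this report: TL is the number of seconds before a reconnect is allowed, and -1 means the client should never reconnect on its own. The function name is hypothetical and is not taken from ADCH++ or from any client.

```cpp
#include <optional>

// Hypothetical helper, for illustration only.
// Assumption: tl >= 0 is a reconnect delay in seconds, tl < 0 (i.e. -1)
// means "do not reconnect automatically".
std::optional<int> reconnectDelaySeconds(int tl) {
    if (tl < 0) {
        // The client stays offline until the user reconnects manually.
        return std::nullopt;
    }
    // The client may retry on its own after this many seconds.
    return tl;
}
```

With this reading, a disconnect carrying "TL -1" leaves the user offline until they notice and reconnect by hand, which is the behaviour reported above.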

Revision history for this message
maksis (maksis) wrote :
Revision history for this message
Janne (barajag) wrote :

Why hasn't this been fixed in the new version?
On restart, around 100 users disappear for us.
Because of that it takes a long time to get all users back in the hub as their clients won't reconnect automatically.

Janne

Revision history for this message
eMTee (realprogger) wrote :

https://answers.launchpad.net/adchpp/+question/690223 brought up this report again. Sorry if this wasn't handled at the time it was reported.
The patch itself was probably ignored back then because, in the case of this error, the obvious action is to reduce the number of users by disconnecting some of them with a parameter that tells those clients they cannot reconnect immediately. If it did otherwise, it would periodically overstrain the hub (the server's bandwidth) and the error would most probably recur periodically.
So what the code does seems legitimate, IF the problem that produces this error really is a lack of bandwidth and not something else, such as an erratic network, ISP, etc.
I can add that I've never seen this error while running an ADCH++ hub, and I have been doing so for 10+ years.
From what I see in the code, this error can happen if the socket buffer overflows for a significant number of online users. Using a value from a formula, the code deduces that too many users' socket buffers have been overflowing for a certain amount of time, so it must be an out-of-bandwidth issue, and hence it disconnects the clients of those users.
At the same time, the code uses configurable base parameters, so modifying those might make this error situation go away.
Namely, they are the OverflowTimeout and BufferSize parameters, configurable in the adchpp.xml file. I'd recommend raising those values in steps of 1.5x to 2x and observing the effect.
As the comments in the config file say, a larger BufferSize value results in more memory usage for ADCH++, but I don't think the memory available on today's computers would be a barrier to even a significant increase in buffer sizes.
A larger OverflowTimeout value would certainly make the removal of slow or improperly disconnecting users from the online state slower than before, but that could be a reasonable tradeoff in the reported situation.
Please experiment with those values, but do so carefully. This bug report can possibly be investigated further if the real cause (lack of bandwidth) is out of the question and config parameter changes cannot solve the issue.

Revision history for this message
maksis (maksis) wrote :

I'm getting a feeling that the context of this issue isn't being understood. I'm aware of it happening only after the hub has been restarted (and there are lots of users reconnecting).

As someone who used to run hubs with 10k+ users, I remember that the hub is under heavy load after the restart. I'd say that it's totally fine if the hub happens to run out of bandwidth or server resources in such cases. Blocking a login attempt because of lack of resources is also fine. What's definitely not fine for the hub owner is to lose a big chunk of his users because their clients won't reconnect automatically in case of such errors.

This issue isn't that relevant for me anymore since all the hubs I'm staying in that initially migrated to ADCH++ have switched to other ADC hubsofts a long time ago, but seems like there are still some people using it. If playing with buffer sizes/overflow timeouts is considered to be a viable solution for the hub owners that are running larger hubs (or have a large number of user commands or other data to send on login, or whatever has an impact on this) and want to get their users back after a restart within a reasonable time, I'd find it good to document that clearly.

As a side note after a quick look at the code (hopefully I understood it correctly):

The default MaxBufferSize is 16384 bytes. If the average size of an INF command is 250 bytes, each user logging in will trigger an instant overflow condition once the hub has around 65 users (that count would be significantly lower if you also need to send user commands, the MOTD or other protocol commands). If all that data is queued synchronously in a single go, the user will instantly hit an overflow condition. If the hubsoft is single-threaded and the socket implementation (Boost ASIO) sends the queued data asynchronously, I'm not sure how long it will take before that same thread has time to do the sending, since it has quite a few other tasks and connecting users to handle.
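
The arithmetic behind the 65-user figure can be checked with a short sketch. The 16384-byte buffer and the 250-byte INF size are the values quoted above (the latter is an estimate); the constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Values quoted in the comment above; the INF size is an assumed average.
constexpr std::size_t kMaxBufferSize = 16384; // bytes, default MaxBufferSize
constexpr std::size_t kAvgInfSize = 250;      // bytes per INF command (estimate)

// Number of whole INF commands that fit into the buffer before it overflows.
constexpr std::size_t infCommandsThatFit(std::size_t bufferSize, std::size_t infSize) {
    return bufferSize / infSize;
}

// 16384 / 250 = 65.536, which rounds down to 65 commands.
static_assert(infCommandsThatFit(kMaxBufferSize, kAvgInfSize) == 65,
              "65 INF commands of 250 bytes fit into 16384 bytes");
```

Any additional data sent at login (user commands, MOTD) would eat into the same buffer, so the real threshold would be lower than 65.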

I don't believe that the error is related to running out of bandwidth, even though that's not very relevant with regard to handling the reconnect time.

Revision history for this message
eMTee (realprogger) wrote :

I don't believe that the error is related to running out of bandwidth either, hence I am in contact with the reporter and we're currently trying to tweak things.
None of the reporters mentioned that this problem happens only after the hub has been restarted. That would need confirmation, as it would strengthen my suspicions about why this is happening.
What I know is that in the past someone ran an ADCH++ hub, with this code already in, with 600-800 users for a prolonged time, and IIRC this issue was not reported from there.
I have looked at the code and its structure, and it looks as if we're dealing with code that was written to, e.g., stop clients from erroneously flooding the hub, and the error message itself probably covers just one of the many issues this code is meant to prevent. As I see it, it counts the individual overflow states of each client within a given time, and if 25% of the total user count is already in an overflow condition then it disconnects those clients.
I'm not an expert on low-level socket programming by any means, and I can easily be wrong, but this tells me that if we blatantly change the behavior as the attached patch suggests, we may lose the hub's ability to prevent various overstrain situations.
So I recommended the approach of tweaking the relevant settings, and if that does not lead anywhere I plan to add a setting that allows hub owners to enable instant/timed reconnecting (using the TL parameter) for this situation at their own risk.
And yes, I agree about the lack of documentation; once this is fixed I already plan to document it, also in the form of a FAQ on the support site.
I want to add that it's possible I haven't understood the issue at all, but currently, unless someone wants to help investigate it or, better, someone with relevant knowledge wants to actively join the DC++ team, I am the only one who'll deal with the issue in the short term, as the people who made or contributed to this software are not working on the project at the moment and, as you see, have not even given their opinion on the issue.
I've been on this team for 10+ years, and what I've learned from the people who made this software is to avoid radical changes and apply fixes only cautiously, so as not to risk the original purpose.
So this is the approach I'll follow here, and hopefully we'll soon figure out the best solution, be it adding new options or shipping with more reasonable defaults.
As I said, I'm happy to make a build with the attached patch for anyone, but the reporter currently experiencing the issue has agreed to the above approach.
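
The 25% rule described in the comment above can be sketched roughly as follows. This is an illustration of the described logic only, not the actual ClientManager::verifyOverflow code; the function name and the choice of "25% or more" as the boundary are assumptions.

```cpp
#include <cstddef>

// Hypothetical sketch of the threshold check described above.
// Returns true when at least a quarter of the online users are currently
// in an overflow condition, i.e. when the hub would start disconnecting.
bool tooManyOverflowing(std::size_t overflowing, std::size_t totalUsers) {
    if (totalUsers == 0) {
        return false; // empty hub: nothing to disconnect
    }
    // Integer form of overflowing / totalUsers >= 0.25, avoiding floating point.
    return overflowing * 4 >= totalUsers;
}
```

Under this reading, a mass reconnect after a restart can reach the threshold quickly, because many freshly connected users receive a large burst of data at the same moment.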

Revision history for this message
laurent (laurent456) wrote :

I think it may happen on a hub restart or when the hub loses connectivity due to ISP line problems.
The user reporting the issue uses 2 different clients, ndcd and Apex; both have experienced the same behaviour at different times.

20:28:27 Read error: Error in the pull function.
20:28:27 Connection lost. Waiting 30s before reconnecting.
20:28:57 Connecting to adcs://...
20:28:57 Trying xx.xx.xx.xx:iiii...
20:29:13 Connected to xx.xx.xx.xx:iiii.
20:29:17 <hub> You are registered, please provide a password
20:29:17 (error-11) Not enough bandwidth available, please try again later
20:29:17 Not enough bandwidth available, please try again later
20:29:17 Disconnected.

Following eMTee's advice, we have doubled the OverflowTimeout setting in adchpp.xml. So far there has been one hub restart, and within about 2 minutes all the previous users were connected again.
Before this modification, we observed users who had been connected for days; then a disconnection happened and some of these users had still not reconnected after a few days. Anyway, more testing is needed to confirm that the overflow modification helps.

Thanks eMTee!

Revision history for this message
eMTee (realprogger) wrote :

Alright, so maksis was right that this issue wasn't being fully understood. I thought these disconnects could happen at any time. This puts it into a slightly different perspective.

Revision history for this message
maksis (maksis) wrote :

Hmm, I don't follow. How does increasing OverflowTimeout help with this issue? I can't spot any explanation from the code for that.

Revision history for this message
eMTee (realprogger) wrote :

The reporter complains about network issues "lot of line microcuts affecting all the devices connected to the router) ... with this ISP" and increasing OverflowTimeout might help avoid the initial disconnects (which, as you can see from the log above, aren't caused by the error that is the topic of this bug report), so the mass reconnection that overstrains the hub might not happen at all.

In other words, it does not help fix the actual error this bug report was created for; it might help mitigate the reporter's problem, and through this we may learn practices that we can document so others can avoid this problem as well.

Later we can e.g. go back to the default OverflowTimeout and increase BufferSize, which directly affects the disconnection error reported in this ticket, and see how much extra memory it needs and what's the result.

I'm trying to find alternatives to the simple change of never sending TL=-1 and dropping protection against clients possibly misbehaving in the IDENTIFY state and so endangering the hub stability.
Do you think your patch would definitely not pose any such danger?

Revision history for this message
maksis (maksis) wrote :

"dropping protection against clients possibly misbehaving in the IDENTIFY state and so endangering the hub stability"

What kind of misbehavior are you thinking of?

Revision history for this message
eMTee (realprogger) wrote :

Whatever else the whole function is possibly meant to prevent, beyond denying a legitimate login attempt when the socket buffers of 25% or more of the hub's total online users have overflowed.
Isn't there any other purpose to this code in your opinion?

Revision history for this message
maksis (maksis) wrote :

I don't see how "TL-1" would protect from misbehavior in the IDENTIFY state, as the error is not being sent in that state. I don't even believe that the author of that code had a clear intention to send TL-1 in that case, and I'd rather consider it to be a bug (which would be fixed by my patch). That generic disconnect function just generally seems to be used in case of non-recoverable errors that require user attention, with this being the only exception. Even the error itself says "Not enough bandwidth available, please try again later". If the goal really is to prevent user from rejoining the hub, the message should say something different ("don't you dare to try again").

I honestly don't understand why "TL-1" would be a big deal. If the hub thinks that it's too busy at the moment to accept the user, just let him reconnect after a while. I don't see how the user could know better when the correct time to hit the reconnect button is.

Revision history for this message
eMTee (realprogger) wrote :

We decided to fix this problem differently from what maksis suggests, so that the behavior doesn't change and, in an out-of-bandwidth case, users still don't automatically hammer the hub. Instead, we changed a constant to a new configurable parameter and documented the issue, along with other possible cases where scaling the hub's resource usage is needed.
See commits https://sourceforge.net/p/adchpp/code/ci/68e99c987292859ff923ac8ff49c19877262ba39/ and https://sourceforge.net/p/adchpp/code/ci/d4a4e2fe8398c19b4ddb0bd3ac3933fcd3e57255/

Changed in adchpp:
status: New → Fix Committed
Revision history for this message
maksis (maksis) wrote :

My original issue was about sending the TL -1 parameter and thus telling the clients to never reconnect in case of overflow (even though the error message shown to the user conflicts with that). If that is the way it's supposed to work, I'd rather set a different status for the issue. Documentation may help some hub owners to avoid the issue, but I'm not aware of any other hubsoft having the same problem.

Changed in adchpp:
status: Fix Committed → Won't Fix
Revision history for this message
eMTee (realprogger) wrote :

We think the message "try again later" is nowhere near telling the clients to never reconnect, and we do consider this a fix for the problem, so we listed it as such in the ADCH++ changelog.
Additionally, there was an actual case where the hub owner's problem was solved this way.
We see this as a bug report and not a behavioral change request, and we think we have provided ways to solve the problem for those who have it.

Changed in adchpp:
status: Won't Fix → Fix Committed
Revision history for this message
eMTee (realprogger) wrote :

Fixed in ADCH++ 3.0.0

Changed in adchpp:
status: Fix Committed → Fix Released