Desync on bzr8954[trunk] with territorial lord

Bug #1811583 reported by Klaus Halfmann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
widelands
Fix Released
Critical
Unassigned

Bug Description

I was playing "Territorial Lord" on my "Crossriver" Map with some "renemeiderich"
on the net, with had two desnycs during the game.

After the second desnyc I found that my server had loast all network connectivity?

Attched my files around the nd desync.

Related branches

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
kaputtnik (franku) wrote :

bzr 8957 should have fixed this.

Revision history for this message
Notabilis (notabilis27) wrote :

I am not sure whether this is fixed in r8957. That commit (should) fix desyncs with seafaring of AIs, but based on the emergency save there was neither AIs nor seafaring in this game. So this might indeed be a different issue.

Unfortunately I can't reproduce it with r8954 and the emergency save nor with the nethost files. The autosave seems to be from another game, have you uploaded the wrong file? Also, does the autosave always desync if continued?

I don't understand what you mean with "server had lost all network connectivity", even when it probably is a separate issue. You weren't able to connect to the metaserver? Or you were able to host a game but renemeiderich was unable to see the game?

Revision history for this message
kaputtnik (franku) wrote :

Maybe i was too enthusiastic about the AI fix :-D

Did both of you played the same version?

Revision history for this message
kaputtnik (franku) wrote :

WolrdSavior and me played the same configuration (Map: Crossriver, Territorial Lord), but we had no desyncs.

But the game last only around 1 and a half hour...

WorldSavior: Blue, Barbarians
Me: Red, Empire

We had paused the game for a short time and after that i got a message "Connection lost, because no Ping in Time" (or similar), so we resumed the autosaved game.

Later on, i got the same message again, but we could play further without reloading.

Revision history for this message
_aD (ad-simplypeachy) wrote :

We are getting 10-15 minutes of play time between desyncs.

Revision history for this message
Notabilis (notabilis27) wrote :

Since I don't really know what to do with the normal syncstreams, I prepared a branch based on r8960 at https://code.launchpad.net/~widelands-dev/widelands/bug-better-syncstreams that extends the information stored in the syncstreams and also creates additional shorter ones that only contain the last few seconds before the desync.

_aD, when it desyncs that often for you: Can you two please use that branch for your game and upload the newest savegame and the matching *.wse (warning: New file extension) files of host and client when it desyncs again?

Revision history for this message
kaputtnik (franku) wrote :

Notabilis, please propose your branch for merging, so we get installable windows builds. Just in case some one of the players is using windows :)

kaputtnik (franku)
Changed in widelands:
milestone: none → build20-rc1
Revision history for this message
kaputtnik (franku) wrote :

Played with Hasi50 the same configuration, but with the linked branch and had a desync close to the end of the game: the second status message was shown and then the desync happen. Last console messages:

[Client]: sending timestamp: 4617654
[Client]: sending timestamp: 4618452
[Client]: sending timestamp: 4619279
[Client]: send playercommand at time 4620100
[Client]: sending timestamp: 4620519
[Client] received NETCMD_INFO_DESYNC. Trying to salvage some information for debugging.
[Client] speed: 0.000
[Client]: disconnect(CLIENT_DESYNCED, )
[NetRelayConnection] Closing network socket connected to 2a01:238:4326:5100:b5ce:5c0b:f1f4:2e40:7397.
FATAL ERROR - game crashed. Attempting emergency save.

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

On my (Hasi50) side:

GameSaver::save() took 1178ms
SaveHandler::save_game() took 1704ms
[Host]: disconnect_player_controller(1, kaputtnik1)
[Host]: disconnect_client(0, CLIENT_DESYNCED, )
[Host]: Requesting sync reports for time 4621371

As I often debug I have an Autosave every 5 Minutes,
maybe diring the save the other sied went to far ahead,
attaching more files now.

Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :
Revision history for this message
Klaus Halfmann (klaus-halfmann) wrote :

kaputtnik: we culd try some "crazy" Autosave setting like you 2 and me ever 3 minutes,
this should increase desany propability quite a lot

Revision history for this message
kaputtnik (franku) wrote :

Klaus and me continued the game (from last savegame) and played until the end without any desync.

Revision history for this message
kaputtnik (franku) wrote :

Teppo and me played a short game with autocrat wincondition. It desync after a short time, see:
https://bugs.launchpad.net/widelands/+bug/1721126/comments/26

kaputtnik (franku)
Changed in widelands:
importance: Undecided → Critical
Revision history for this message
kaputtnik (franku) wrote :

Notabilis, do you need more test playing?

Revision history for this message
Notabilis (notabilis27) wrote :

Hey, who said that I will be fixing all these bugs??? ;-)

But I have enough data to look at for now, thanks. The game played by Teppo and kaputtnik doesn't seem to be useful for debugging, but I will have another look at it. The syncstream contains the entry that the client is doing something with entity 3619. Well, whatever that might be. Maybe I should add huge amounts of debug output for this desync, but I have to think about it first.

Debugging the desync between Hasi50 and kaputtnik was more successful. The syncstream pointed to a problem with Lua code running only on the host. Based on the script files, I think it was the code path:
[territorial_lord.lua]main loop -> _send_state() -> [messages.lua]send_message() -> [ui.lua]wait_for_roadbuilding() -> _await_animation() -> sleep(41).
I guess what happened is that the player moved the map view (possibly by clicking on the minimap, see _await_animation()) when the status message was received, resulting in the sleep call being triggered. The sleep command results in the script execution to be rescheduled for 41ms later. This leads to another entry in the command queue and in the syncstream, leading to desynchronized games. The same can probably happen in all other scenarios where popup messages are used. Besides moving the map this desync can also be triggered by being in road building mode when a message should be received (I tested this one) and probably also in other cases (don't know, but someone should check the sleep() calls in all multiplayer scripts).
I am unsure how to fix this, if someone has an idea feel free to give it a try. Hm, spontaneous idea, uh, suggestion: Check whether it is a single- or multiplayer game and only be nice (i.e., wait for the player to finish the road) when in singleplayer. And probably add a documentation command that the sleep() call always has to be run for all participants of a game.

Oh, and regarding your question about testing internet games: It shouldn't make any difference whether it is on the same computer, over the LAN or over the relay server. The (relevant) code paths should always be the same. Besides, desyncs are due to bugs in the logic code somewhere and not in the network code.

Revision history for this message
kaputtnik (franku) wrote :

Wow, many thanks Notabilis!

Since i do not know anything about 'command queue' or anything related, maybe there is another one around who can help here? I think what is missing is some explanation about debugging... we have a wiki article about debugging desyncs, but it looks very outdated. Maybe you can update it?
https://wl.widelands.org/wiki/DebuggingDesyncs/

I will help with playtesting, if wanted.

Revision history for this message
Toni Förster (stonerl) wrote :

Notabilis, do you mean for example the sleep() calls in the main loop?

In my merge request for collectors (and wood gnome) I changed the main loops and use wake_me() calls instead of sleep().

Do you think this would help?

Revision history for this message
Toni Förster (stonerl) wrote :

I can confirm desyncs for Collectors & Wood Gnome when the road building mode is active and a notification is send.

Revision history for this message
Notabilis (notabilis27) wrote :

I haven't expected any (useful) response regarding the command queue, no problem. I know roughly how it is working which is fine for now. The desync happens because somehow additional commands have been added to the command queue for only one of the players, which is something that should not happen. Since I was unable to trigger the desync by loading any of the provided savegames it was probably some player action that added the command. Unfortunately the syncstream does not provide any information about the type of the acting entity.

wake_me() has the same problem as sleep() since it is also calling coroutine.yield() (see coroutine.lua). In generally these functions are fine and they probably don't make any problems in the main loop. The problem is if only one player calls one of these functions and the others do not. So as long as all player execute the main loop the same way and all or none are going to call sleep/wake_me(), it is fine.

Revision history for this message
Toni Förster (stonerl) wrote :

Another finding. The desync only happens when the message window pops up. In my branch the window only pops up ever hour. So the first message after 30 minutes does not result in a desync but the second after 1 hour does.

Revision history for this message
kaputtnik (franku) wrote :

Is it possible to reschedule the popup if a player makes a critical action, like 'map move' or 'roadbuilding'?

Revision history for this message
Notabilis (notabilis27) wrote :

Rescheduling is what is currently happening by using the sleep() call. As far as I see it, we either have to show the message independent of the players action or delay the dialog for all players.

Revision history for this message
kaputtnik (franku) wrote :

> or delay the dialog for all players.

That's what i was thinking of.

Revision history for this message
kaputtnik (franku) wrote :

I think i found a solution, by using the players send_message function. In file data/scripting/win_condition/territorial_lord.lua, line 53, exchange the line:

send_message(player, game_status.title, msg, {popup = true})

with:

player:send_message(game_status.title, msg, {popup = true})

Made just a short test, which seems to be fine, so please give it a try :-)

If this works, we should may exchange some more places where messages:send_message is used. E.g. the function broadcast() in:

https://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/data/scripting/win_conditions/win_condition_functions.lua#L101

The RST description says it uses 'wl.game.Player.send_message` but in fact it uses messages.send_message().

See also: https://wl.widelands.org/documentation/search/?q=send_message&check_keywords=yes&area=default

Revision history for this message
Toni Förster (stonerl) wrote :

This indeed is the solution. Just tested it.

https://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/data/scripting/win_conditions/win_condition_functions.lua#L111

needs to be replaced like with this:

p:send_message(header, msg, options)

And in the territorial files this:

send_message(player, game_status.title, msg, {popup = true})

needs to be changed to this:

broadcast(plrs, game_status.title, msg, {popup = true})

Revision history for this message
kaputtnik (franku) wrote :

Thanks for verifying, Toni :-)

Not sure if this solution is just a workaround though. Notabilis, what do you think?

I have checked which functions rely also to wait_for_roadbuilding() and _await_animation():

  - _await_animation() is only used in ui.lua
  - wait_for_roadbuilding() is used in some campaigns, messages.lua and ui.lua

We should add a comment to ui.lua not to use the functions in mutliplayer games and winconditions.
The same goes for all messages related functions in messages.lua, imho.

But i am mostly lost in the widelands lua api :-D , see also bug 1745359.

kaputtnik (franku)
Changed in widelands:
status: New → Fix Committed
Revision history for this message
GunChleoc (gunchleoc) wrote :

Fixed in build20-rc1

Changed in widelands:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.