desync in "defeat all" as soon as victory messages are sent.

Bug #647155 reported by Nasenbaer
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
widelands
Fix Released
Critical
Unassigned

Bug Description

As soon as victory messages are sent to the victorious players, the game desyncs.

Revision history for this message
Timowi (timo-wingender) wrote :

I had this problem only with the autocrat win condition so far while testing the game reporting feature. See bug #576347. The collectors win condition never crashed so far. I will have a look at this.

Changed in widelands:
assignee: nobody → Timowi (timo-wingender)
Revision history for this message
SirVer (sirver) wrote :

does this happen in trunk as well? or is this related to the server communication?

Revision history for this message
Timowi (timo-wingender) wrote :

This happens in trunk as well. The metaserver reporting is independent of the game logic. But as far as I remember I had this problem only in 1v1 games.

Revision history for this message
SirVer (sirver) wrote :

Message sending does not interfere with the game logic either; there is nothing that interferes with the game logic in the checking of the win condition: no randomness and nothing that changes the game state. I highly doubt that the problem is related to the win condition - but it could be related to some timing issues: maybe the lua code takes to long on some machines and there are then behind in their simulations or so.

 Or maybe message sending does infact alter the game state and I just didn't realize it. If so, it would be strange that no desyncs happen in ealier messages.

I for one could not reproduce this problem on trunk running host and 2 clients on the same machine. Is there a reliable way to reproduce this?

Revision history for this message
Timowi (timo-wingender) wrote :

I just tried to reproduce this with two widelands instances on the same host and I could not reproduce the desync. Then I tried the same with two different computers and shortly after the message I got a desync.

1. I got this desync only when running on different computers and every time it where two "real" players (no AIs)

2. The desync does not occur direct with the win message but a few seconds after the messages arrived.

Both widelands executables are x86 (host is a x86_64 machine but that should not matter as it runs as 32bit). One is compiled with gcc 4.3.2 and one with gcc 4.4.1 Both where trunk 5545 without any changes.

I think there is some difference in the game state after the win. This may be caused by the win condition but may also be cause by destroy the other player. I have not tried if it happens if I defeat a player without a win condition.

I can easily reproduce this by playing the map "Firegames" on my desktop and my laptop.

Revision history for this message
Timowi (timo-wingender) wrote :

I have compared the syncstreams now. The problem seems to be related to lua_coroutine. The anly difference in the syncstreams is a additional QUEUE_CMD_LUACOROUTINE on the client.

Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Changed in widelands:
assignee: Timowi (timo-wingender) → nobody
Revision history for this message
Timowi (timo-wingender) wrote :

I am right that widelands just calculates a hash over the complete stream and compares the value from time to time? So the simulation do not really differ here but the hash value is different because of the additional command?

I deassigned me from this bug because I do not know much about the lua implementation. Holger, I hope you can fix the bug with this information.

Revision history for this message
SirVer (sirver) wrote :

You are right in how syncing works.

The bug is still quite not understandable for me. There are very little places were a Cmd_LuaCoroutine is enqueued and the only one that could probably lead to this bug is in Cmd_LuaCoroutine::execute. Whenever a lua scripts calls coroutine.yield(sleeptime) a new Cmd_LuaCoroutine is enqueued at current_gametime + sleeptime. I guess one of the boxes enqueues a command while another one does not. I cannot say why that is the case.

I cannot reproduce this error in my local network with two computers either. I added a lot of debug output to Cmd_LuaCoroutine::execute in r5547. Timo, could you redo the test and compare the stdouts from host and client?

Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
SirVer (sirver) wrote :

There is the problem; at time 15719692 the host kills the coroutine while the client reschedules it for execution again in 5000 ms. It seems like the LuaCommands are not run at the exactly same time in both games and it also seems like the Host has another game state then the client at this point in time. I have no explanation for that.

-- host --
[Host]: Client 0: Time 15719692
[Host]: Requesting sync reports for time 15720122
Cmd_LuaCoroutine::execute
 Called resume: sleeptime: 0
   done. Deleting Coroutine! done with Cmd_LuaCoroutine::execute
-- end host --

-- client --
[Client]: sending timestamp: 15719692
Cmd_LuaCoroutine::execute
 Called resume: sleeptime: 15725000
   yielded! done with Cmd_LuaCoroutine::execute
[Client]: sending timestamp: 15720473
[Client]: sending timestamp: 15721258
[Client]: sending timestamp: 15722034
[Client]: sending timestamp: 15722799
[Client]: sending timestamp: 15723550
Request::get_base_required_time: WARNING nr = 1 but count is 1, which is not allowed according to the comment for this function
[Client]: sending timestamp: 15724318
Cmd_LuaCoroutine::execute
[Client]: send playercommand at time 15725000
 Called resume: sleeptime: 0
   done. Deleting Coroutine! done with Cmd_LuaCoroutine::execute
[Client] received NETCMD_INFO_DESYNC. Trying to salvage some information for debugging.
[Client]: disconnect(Client and host have become desynchronized.)
-- end client --

Revision history for this message
Timowi (timo-wingender) wrote :

I have attached output from two desynced games with the additional output now. Looks like the client needs one more round to detect the result. I had one game before (same conditions) that did not crashed. Maybe this is some sort of race condition? My laptop (client) is much slower than my desktop. If game speed is higher than two widelands takes 100% cput ime on my laptop while on my desktop it takes around 30% at 10x speed. At the not desyncing game widelands runs slower during the end of the game and I had no other programs running at the same time. So this might be related to speed differences between the computers.

Revision history for this message
SirVer (sirver) wrote :

The bug report should then be that gamestate is not the same on both computers... which is not a Lua thing per se.

On a sidenote: if every player is send a message, does the message arrive on the one of the computers 5 seconds later when it desyncs? It sure would seem as if this is the case.

Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :

I added a bit more output (log of gametime in cmd_coroutine and get_defeated and print result in get defeated). The coroutines run at the same time but the result of get_defeated is different. The client detects the defeated state one round later.

Yes, it looks like the client get the message later. It is not easy to see as it is more likely that the desync occurs if the game speed is higher.

Revision history for this message
Timowi (timo-wingender) wrote :

I saw that get_defeated uses get_general_statistics() to check if a player is defeated. Maybe the statistic generation is not deterministic as it was not intended to influence the game logic? If the game statistics are not updated from the command queue but somewhere asynchronous to the game logic that would explain the desync.

Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :
Revision history for this message
Timowi (timo-wingender) wrote :

And two more logs with nr_workers vector size and last value. The statistic vector is behind one value on the client.

Revision history for this message
Timowi (timo-wingender) wrote :

Ok. The problem is the statistic sampling. sample_statistics() is called asynchronously from Game::think() and not through the command queue. So it is not safe to use the statistics for something that may influence the game.

Revision history for this message
SirVer (sirver) wrote :

Excellent track down, Timo. I guess the best solution is to make statistics gathering also determinisitic. There is no real reason I know off why it should be desyncron on different computers - It would be strange if you would get slightly different statistic data than I.

So we basically need a new Cmd_GatherStatistics and make it sync. Agreed?

Changed in widelands:
status: Confirmed → Triaged
Revision history for this message
Timowi (timo-wingender) wrote :

I think calling the statistic sampling function from the command queue is the easiest way to fix this bug. But this makes the statisitc sampling part of the game logic. I would try to keep the game logic as small as possible and hopefully sometime the simulation code is independent from the rest of the source. This makes testing and different usage easier. Adding the statistic sampling to the command queue and using it for win conditions make it essentially a part of the simulation code. I would try to make the decision whether a player is defeated based on the "real" game state and see the statistic just as a visualisation code, just as the ui. But that is just my opinion.

Revision history for this message
SirVer (sirver) wrote :

Other win conditions, like Collectors and Wood Gnome are also using the statistics to grant point and calculate the position of players. For me, the statistics are therefore not part of the UI (anymore?) and should get their own Player Command. Otherwise, displaying point calculations in the statistics becomes very awkward and hackish. Also, Lua developers would take it for granted that statistics are the same as the game state is the same for all games - even I took if for granted without looking it up.

I am therefore very much for making Statistics gathering a player command.

Revision history for this message
SirVer (sirver) wrote :

sorry, not a player command, but a command in the cmdqueue :/

Revision history for this message
Nasenbaer (nasenbaer) wrote :

+1 for making it a cmdqueue

and thank you very much Timo for tracking this down :)!

SirVer (sirver)
Changed in widelands:
assignee: nobody → SirVer (sirver)
Revision history for this message
SirVer (sirver) wrote :

fixed in r5625

Changed in widelands:
status: Triaged → Fix Committed
assignee: SirVer (sirver) → nobody
Revision history for this message
SirVer (sirver) wrote :

Released in build16-rc1

Changed in widelands:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.