Multiplayer savegame construction wares don't arrive

Bug #1816018 reported by GunChleoc on 2019-02-15
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
widelands
Won't Fix
Medium
Unassigned

Bug Description

The red player's labyrinth and farm never get their last log to finish the construction, although it is marked as being on its way.

Maybe we should add some info to the debug window to show the flag coordinates of the ware's current location.

GunChleoc (gunchleoc) wrote :
Changed in widelands:
status: New → Triaged
kaputtnik (franku) wrote :

I think a replay would be helpful. So one can see e.g. if some roads were rearranged.

GunChleoc (gunchleoc) wrote :

The attached branch will print some debug output from the debug window.

Both yield a transfer object with an empty route. Now the question is why.

Toni Förster (stonerl) wrote :

I don't know if this observation helps.

The log between the two woodcutters left to the weaving mill is designated for the labyrinth. Destroying the road between the two woodcutters starts the delivery process and the labyrinth gets finished.

The log next to the quarry is designated for the farm. Destroying the road that connects the quarry with flag where the log is lying around starts, the delivery and the farm gets finished.

GunChleoc (gunchleoc) wrote :

Yes, this helps. I'll keep digging.

kaputtnik (franku) wrote :

Sounds exactly like bug 1797213 'Sometimes ware(s) are lying at a flag and get not transported'

Toni Förster (stonerl) wrote :

Could it be that the game has been reloaded from an emergency save game?

We had a game yesterday, where exactly this happened. A carrier was carrying a stone to a flag and at very moment when he dropped the stone at the flag, the game was emergency saved, because a player had disconnected. After restoring the game from the save, the stone wasn't picked up until we destroyed all surrounding roads and build them again.

Also it was the Atlantean tribe that suffered from this. If I recall correctly with some older revision we had some problem with the Atlanteans after restoring from an emergency save.

Do you need the replays and saves?

Toni Förster (stonerl) wrote :

It is what I suspected. This bug can be easily reproduced.

Here is how I did it:

- Open 2 Clients
- Host a multiplayer game.
- Build a building on either the client or host
- Have a road with at least one flag between the
  Headquarter and the construction site.
- the moment the ware is dropped at the flag and
  the second carrier hasn't started, kill the client
- start the client again
- on the host load the emergency save
- when the game is started the ware that was dropped
  will be ignored by the carrier.

kaputtnik (franku) wrote :

I can reproduce this like you described. This not tribe specific.

Changed in widelands:
status: Triaged → Confirmed
GunChleoc (gunchleoc) wrote :

The original savegame is from an emergency save too, so we can reproduce this now. Thanks, Toni!

kaputtnik (franku) wrote :

IMHO this bug is the last one which needs fixing for build 20.

Very good work so far nevertheless :-)

GunChleoc (gunchleoc) wrote :

How did you manage this?

- the moment the ware is dropped at the flag and
  the second carrier hasn't started, kill the client

I am always too slow in killing the client.

Toni Förster (stonerl) wrote :

Quite easily. Shall I make a video/replay/savegame?

kaputtnik (franku) wrote :

I just clicked the X in the title bar of the window the client is running in.

GunChleoc (gunchleoc) wrote :

I do that too but I can never catch the right moment :(

A video might help. A replay or savegame won't because I need the state *before* the save.

Toni Förster (stonerl) wrote :

@GunChleoc which revision are you using? I currently have the same problem with trunk r9007. I cannot reproduce this error. But I'm able to reproduce it with the revision from the linked branch.

Here is the video: https://youtu.be/3MWuzaP22gg

@kaputtnik can you try with the latest trunk?

kaputtnik (franku) wrote :

hm... couldn't reproduce this with bzr9005 too... but that was a release build. Will try a debug build also.

kaputtnik (franku) wrote :

The bug is magically gone :-)

At least it can't be reproduced anymore with the steps in #8

Toni Förster (stonerl) wrote :

It took me 30 minutes or so, but I finally have the debug log. I compiled the linked branch with debug and set the game log to verbose.

I also waited for the construction worker to finish and destroyed the road, the stuck grout git delivered to the HQ and after I connected the road again it was delivered to the construction site.

If you also need the replay let me know.

Toni Förster (stonerl) wrote :

I dug through the log here is what if found. After the game has been saved all movement for wares get cancelled. The stuck ware is MO(761,grout) which becomes MO(746,grout) after loading the game. The movement gets cancelled once again after the game has been loaded. As far as I understand the log, after destroying the road MO(746,grout) gets cancelled again. I hope this somehow helps.

Wouldn't the flag-unfreezer branch help?

https://code.launchpad.net/~widelands-dev/widelands/flag_unfreezer

Here are the parts of the important parts of the log file.

FATAL ERROR - game crashed. Attempting emergency save.

SaveHandler::save_game() took 576ms
[Host]: disconnect_player_controller(1, Player 2)
[Host]: disconnect_client(0, CLIENT_LEFT_GAME, )
[NetHost] Closing network connection to some client.
MO(767,log): cancel_moving
MO(765,blackwood): cancel_moving
MO(763,granite): cancel_moving
MO(761,grout): cancel_moving
MO(756,blackwood): cancel_moving
MO(754,granite): cancel_moving
MO(752,grout): cancel_moving
lastserial: 0

New Game:

Reloading the game from replay
MO(749,log): cancel_moving
MO(748,blackwood): cancel_moving
MO(747,granite): cancel_moving
MO(746,grout): cancel_moving
MO(742,blackwood): cancel_moving
MO(741,granite): cancel_moving
MO(740,grout): cancel_moving

[sync] Reset

MO(741,granite): cancel_moving
MO(742,blackwood): cancel_moving
MO(746,grout): cancel_moving
MO(747,granite): cancel_moving
.
.
.
MO(748,blackwood): cancel_moving
.
.
.
Road gets destroyed.
MO(746,grout): cancel_moving

GunChleoc (gunchleoc) wrote :

I could not trigger this bug by following the exact steps in your video. I have tried debug and release builds with gcc and clang.

Our best option for now is to use the savegame provided by WorldSavior and fix it / work around it during loading. If all else fails, send the ware to the warehouse and trigger a new request in the building.

Toni Förster (stonerl) wrote :

Yes, it is very hard to trigger the right moment for this bug to appear. I guess I just got lucky on the first try. If you need more savegames later on, to test your fix, let me know.

You added some logging, are they part of the output I send? Since I'm somehow able to reproduce this bug (even if it takes 30 minutes or more), are there any other information I can provide?

GunChleoc (gunchleoc) wrote :

The extra logging is only triggered by the debug window, so it will be hard to get anything from there, since it takes a split second already to trigger the bug at all.

hessenfarmer (stephan-lutz) wrote :

Just my 5 cents:

- Here we have a bug that is resulting from an abnormal situation in a game (emergency save due to a crash / disconnect)
- It is hard to reproduce even if desired
- we do not really understand what is going on even when the thing is triggered
- Emergency saves are not guaranteed to work anyhow as we do not know exactly what is going on in a crash situation
- It does not appear in normal saves (at least we have no report for that yet)
- It seems to be related to multiplayer

So these things would lead me to following conclusions:
- we should not introduce some bloody hack on savegame loading until we now what is really going on.
- we propably could live with this for b20 as it's occurrence should be rare
- we might think about improving robustness for emergency saves (e.g. by keeping and updating kind of a "state machine" in the RAM every 5 seconds which will then be written to disk in case of saving / emergency saving. or something similar), but this will be some work that should be post b20 and fit together with speeding up loading and saving.

So I would vote to leave things like they are for b20, cause I really think we need the build rather quick as it blocks / hinders the development currently in my eyes.

GunChleoc (gunchleoc) wrote :

Agreed.

Changed in widelands:
milestone: build20-rc1 → build21-rc1
importance: Undecided → Medium
Toni Förster (stonerl) wrote :

I totally agree. We could add a known issues/bug section to the release notes and since we know how to workaround this issues we could list this as well.

I'll still try to get some more information, in the hopes that I can help you to figure out what's going on here.

Toni Förster (stonerl) wrote :

Just one question. What exactly does the log you added output? I can't find any ware-specific output when I open the debug windows for roads/flags or construction sites. Also when I click on the button in the windows it just shows "Flag at XX/YY" or "this: 0x....." for roads and buildings...

GunChleoc (gunchleoc) wrote :

When clicking "constructionsite" in the building's debug window:

MO(1114,constructionsite): this: 0x614000008240
MO(1114,constructionsite): owner_: 0x6190002d3280
MO(1114,constructionsite): player_number: 2
MO(1114,constructionsite): economy_: 0x61200017ddc0
MO(1114,constructionsite): position: (50, 21)
MO(1114,constructionsite): flag: 0x613000042680
MO(1114,constructionsite): * position: (51, 22)
MO(1114,constructionsite): anim: idle
MO(1114,constructionsite): animstart: 246914
MO(1114,constructionsite): leave_time: 246914
MO(1114,constructionsite): leave_queue.size(): 0
MO(1114,constructionsite): leave_allow.get(): (nil)
MO(1114,constructionsite): Input queue for: granite
MO(1114,constructionsite): Input queue for: log
Transfer: Route has 3 steps: (46,24) (47,23) (49,22)
MO(1114,constructionsite): Input queue for: planks

When clicking on "Flag" in the the flag's debug window:

MO(946,flag): Flag at 46,24
MO(946,flag): this: 0x613000043f00
MO(946,flag): owner_: 0x6190002d3280
MO(946,flag): player_number: 2
MO(946,flag): economy_: 0x61200017ddc0
MO(946,flag): Wares at flag:
MO(946,flag): 1/8: log(879), nextstep 945 == flag 945 @ (47,23), acked by carrier
 Transfer: Empty route
MO(946,flag): 2/8: log(1402), nextstep 953 == flag 953 @ (45,26), acked by carrier
 Transfer: Route has 9 steps: (46,24) (45,26) (46,28) (47,29) (49,30) (51,30) (50,32) (51,34) (52,33)

The building's log has a transfer route that starts at the flag 46,24
The log at that flag has an empty route

Toni Förster (stonerl) wrote :

None of this shows up here... please have a look at the attached screenshot. Am I doing something wrong?

GunChleoc (gunchleoc) wrote :

It will not show on screen yet, unfortunately - you will have to check the console log.

Toni Förster (stonerl) wrote :

I see, well I'll continue testing now that I know where to look :D Do you mind if I merge trunk into this branch?

GunChleoc (gunchleoc) wrote :

Merging trunk will invalidate the attached savegame, but if you can create a new savegame, that won't be a problem.

kaputtnik (franku) wrote :

Looks like this happens also in singleplayer games (r9007): https://wl.widelands.org/forum/post/27311/

Maybe creating an autosave can lead to the same issue like creating an emergency save?

GunChleoc (gunchleoc) on 2019-03-16
Changed in widelands:
milestone: build21-rc1 → build20-rc1
GunChleoc (gunchleoc) wrote :

Information from Nordfriese:

Good news, these logs appear only in a corner case, and although there is an underlying problem, it is fixed automatically within 1-2 seconds, so it is not worth fixing this IMHO.
The problem is that some carriers who have been created for certain roads are just about to leave the HQ, and they notice that their target road has been split off into a separate economy before the economies are actually separated.

This delay when splitting the economies has been implemented on purpose to avoid performance issues; and this bug happens you hit that very narrow time window exactly, and an affected worker is pathfinding at this very moment – so as I said, I´m in favour of ignoring this.

https://wl.widelands.org/forum/topic/2786/?page=27#post-27334

I agree.

Changed in widelands:
milestone: build20-rc1 → build21-rc1
Niektory (niektory) wrote :

I've also encountered this issue multiple times, except in single player and with normal, manual saves. Looks like it's the same bug though, so I'm posting it here instead of opening a new bug.

OS: Manjaro 64-bit
Version: Occurs in Build 20 as well as bzr9141. IIRC it occured in Build 19 as well, but I'm not 100% sure.

Sometimes when reloading a saved game, some wares get stuck at flags and are no longer transported to their destinations. This can result in construction sites never getting finished or production sites never receiving some of the the raw materials.

This happens frequently enough to be somewhat annoying (I guesstimate 10% of my normal saves being affected).

When inspecting the destination building after the bug has occured, the icon of the needed ware is darkened (as if it's en route). However, the ware is lying at a flag and never getting picked up by any carriers, even if they're all idle and the road to the destination is clear.

Getting the ware to move requires intervention from the player (waiting even for several hours doesn't help). Destroying and remaking the roads connected to the flag seems to be the most reliable, especially the road going in the direction the ware was supposed to go.

I can reproduce the bug semi-reliably using the steps in #8, with the following differences:
- I start a single player game, not a multiplayer one.
- Instead of killing the client, I pause the game with PageDown and save manually. If I pause too early I hit PageUp and PageDown in quick succession to advance the time in small steps.
- Sometimes it's not the ware that was being dropped that gets stuck, but another ware waiting at the HQ's flag.

Niektory (niektory) wrote :

I think it's the same issue as in bug 1829848.

GunChleoc (gunchleoc) wrote :
Changed in widelands:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.