cannot unmarshal state entry for "timings"

Bug #1843683 reported by Zygmunt Krynicki
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
snapd
Confirmed
Medium
Paweł Stołowski

Bug Description

While reviewing test failures on master I ran across this error:

error: cannot get timings of change 6: could not get timings data from the
       state: internal error: could not unmarshal state entry "timings": json:
       cannot unmarshal number 4.39e+08 into Go struct field
       TimingJSON.duration of type time.Duration

This was on google:ubuntu-14.04-64:tests/main/snap-debug-timings

Changed in snapd:
assignee: nobody → Paweł Stołowski (stolowski)
importance: Undecided → High
Changed in snapd:
status: Confirmed → In Progress
Revision history for this message
Paweł Stołowski (stolowski) wrote :

Very interesting and bizzare problem. I couldn't reproduce it by running this spread test a 100 of times. The only way to reproduce it programatically in a unit test was to hand-craft state JSON to have such notation for the duration field. Injecting such value via mocked duration helper (even with saving and the restoring state) didn't reproduce it as for some reason json marshaller simply decided to encode it as 439000000. The fact that we hit this randomly means that the encoding is not fully deterministic and sometimes may use scientific notation.

Now, if such value is encountered when unmarshalling, the only way to succesfully unmarshall it to predefined structs is to use floats internally (float64). Decoding with UseNumber flag (e.g. with our DecodeWithNumber() that does that) doesn't help. While changing time.Duration (int64 really) to float64 for the duration field can be made in a backwards-compatible way, I fear the same can happen to doing/undoing fields which also use time.Duration, and that makes the scope of a potential fix bigger.

Revision history for this message
Paweł Stołowski (stolowski) wrote :
Revision history for this message
Paweł Stołowski (stolowski) wrote :

After investigating this issue for a while (tried to reproduce, created additional unit test, looked at the json encoding of int64 in the standard go library) it's unclear to me how this could happen (i.e. having scientific notation for the int64 duration & this particular value of 439000000); it would be useful to have full travis log if this happens again, perhaps some other test affects this one in unexpected ways.

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

Pawel, I had a look at the golang bug.

I think this is still an issue. The fact that we couldn't reproduce doesn't change the fact that it did really happen and when it does, it brings down snapd with it.

Given that timings are not a core feature, it is irresponsible to bring down all of snapd by inability to handle timing data.

One of the referenced golang comments suggested to use json.Number, can we consider doing that?

Revision history for this message
Paweł Stołowski (stolowski) wrote :

@zyga this is not "brining down all of snapd". We don't normally unmarshall timings *unless* asked to via debug API. Even then, snapd would not go down, just report an internal error (like we saw in this one instance of failed spread test).

I'd be fine to investigate json.Number but having a reproducer is a prerequisite as otherwise we would have no idea if it fixes anything. Plus, it would tell us if we need to do something about Task's Doing/Undoing times which are also int64 and might exhibit same problem.

Changed in snapd:
status: In Progress → Confirmed
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

Indeed I didn't think about the fact that we partially unmarshal the state, not entirely. That's an important indicator for the severity, or lack of thereof, of this bug.

As for reproducer, I think it's fine if we just start with a string to unmarshal and see how we handle that. I matters less how we got the state to be that in the first place and more how we handle in case we did.

Changed in snapd:
importance: High → Medium
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.