concurrency race condition, several units executing hooks, restart of server, resulted in all of them going to pending.. i have the full log (520mb) and here's status output of half-dozen machines
https://chinstrap.canonical.com/~acollard/kapil-status-output
machine-0: panic: rescanned document misses transaction in queue
machine-0: goroutine 3847 [running]:
machine-0: runtime.panic(0xb36e20, 0xc210c65160)
machine-0: /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
machine-0: labix.org/v2/mgo/txn.(*flusher).rescan(0xc210941000, 0xc210be2e00, 0x7f5b13b9c601, 0x0, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:472 +0xdba
machine-0: labix.org/v2/mgo/txn.(*flusher).prepare(0xc210941000, 0xc210be2e00, 0x410701, 0x0, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:223 +0x9a
machine-0: labix.org/v2/mgo/txn.(*flusher).advance(0xc210941000, 0xc210be2e00, 0xc210bda4b0, 0x1, 0xc2108280d0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:160 +0x64
machine-0: labix.org/v2/mgo/txn.(*flusher).run(0xc210941000, 0x0, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:118 +0xd31
machine-0: labix.org/v2/mgo/txn.flush(0xc2101082c0, 0xc210bc1300, 0x1, 0x1)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:21 +0x1a8
machine-0: labix.org/v2/mgo/txn.(*Runner).Run(0xc2101082c0, 0xc210ba9000, 0x8, 0x8, 0xc210934880, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/txn.go:302 +0x498
machine-0: launchpad.net/juju-core/state.(*State).runTransaction(0xc21006f3c0, 0xc210ba9000, 0x8, 0x8, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/state.go:113 +0x1fe
machine-0: launchpad.net/juju-core/state.(*RelationUnit).EnterScope(0xc210a491b0, 0xc210859d80, 0xe4baf0, 0xf)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/relationunit.go:138 +0xa43
machine-0: launchpad.net/juju-core/state/apiserver/uniter.(*UniterAPI).EnterScope(0xc210a2bd20, 0xc210933400, 0x1, 0x4, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/apiserver/uniter/uniter.go:836 +0x2af
machine-0: reflect.Value.call(0xdf23c0, 0xc21086dee8, 0x1d3a, 0xdfd1d0, 0x4, ...)
machine-0: /usr/lib/go/src/pkg/reflect/value.go:474 +0xe0b
machine-0: reflect.Value.Call(0xdf23c0, 0xc21086dee8, 0x1d3a, 0xc210928860, 0x1, ...)
machine-0: /usr/lib/go/src/pkg/reflect/value.go:345 +0x9d
machine-0: launchpad.net/juju-core/rpc/rpcreflect.func·007(0xdf23c0, 0xc21086dee8, 0x162, 0xcb3cc0, 0xc2109286e0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/rpcreflect/type.go:317 +0x12f
machine-0: launchpad.net/juju-core/rpc/rpcreflect.MethodCaller.Call(0x7f5b13ba4ea8, 0xcb3cc0, 0x7f5b13ba4ea8, 0xd095a0, 0xdd4080, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/rpcreflect/value.go:103 +0x103
machine-0: launchpad.net/juju-core/rpc.(*Conn).runRequest(0xc210a2a3c0, 0x7f5b13ba4ea8, 0xcb3cc0, 0x7f5b13ba4ea8, 0xd095a0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/server.go:485 +0xf5
machine-0: created by launchpad.net/juju-core/rpc.(*Conn).handleRequest
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/server.go:417 +0x5f6
machine-0: goroutine 1 [chan receive]:
machine-0: launchpad.net/tomb.(*Tomb).Wait(0xc210037e40, 0x402e72, 0x402e72)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/tomb/tomb.go:110 +0x42
machine-0: launchpad.net/juju-core/worker.(*runner).Wait(0xc210037e40, 0xe52d10, 0xb)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:124 +0x2e
machine-0: main.(*MachineAgent).Run(0xc21004fb40, 0xc2100af780, 0x0, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/machine.go:166 +0x4fc
machine-0: launchpad.net/juju-core/cmd.(*SuperCommand).Run(0xc2100aedc0, 0xc2100af780, 0xc2100af780, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/supercommand.go:303 +0x43f
machine-0: launchpad.net/juju-core/cmd.Main(0x7f5b13ba63f0, 0xc2100aedc0, 0xc2100af780, 0xc21000a010, 0x6, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/cmd.go:244 +0x283
machine-0: main.jujuDMain(0xc21000a000, 0x7, 0x7, 0xc2100af780, 0xb2f301, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:107 +0x33a
machine-0: main.Main(0xc21000a000, 0x7, 0x7)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:122 +0x20d
machine-0: main.main()
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:139 +0x44
machine-0: goroutine 3 [syscall]:
machine-0: os/signal.loop()
machine-0: /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
machine-0: created by os/signal.init·1
machine-0: /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x31
machine-0: goroutine 6 [select]:
machine-0: launchpad.net/juju-core/worker.(*runner).run(0xc210037e40, 0xc210059280, 0x424ad2)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:162 +0x9db
machine-0: launchpad.net/juju-core/worker.func·004()
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:86 +0x54
machine-0: created by launchpad.net/juju-core/worker.NewRunner
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:87 +0x133
machine-0: goroutine 7 [chan receive]:
machine-0: launchpad.net/tomb.(*Tomb).Wait(0xc210119c00, 0x7f5b13a1fef0, 0x7f5b13a1de78)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/tomb/tomb.go:110 +0x42
machine-0: launchpad.net/juju-core/worker.(*runner).Wait(0xc210119c00, 0x7f5b13bac528, 0xc210119c00)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:124 +0x2e
machine-0: main.(*closeWorker).Wait(0xc2100fd080, 0xc2100d1000, 0xe13c30)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/agent.go:244 +0x2f
machine-0: launchpad.net/juju-core/worker.(*runner).runWorker(0xc210037e40, 0x0, 0xe13c30, 0x5, 0xc2100e5030)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:266 +0x3c6
machine-0: created by launchpad.net/juju-core/worker.(*runner).run
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:179 +0x356
machine-0: goroutine 8 [select]:
machine-0: labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc210045d20)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/cluster.go:360 +0x510
machine-0: created by labix.org/v2/mgo.newCluster
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/cluster.go:69 +0x102
machine-0: goroutine 14 [chan receive]:
I head from #landscape that the MongoDB logs or the general syslog from the system were not saved. It would be important to know more about the state of the database and the system in which this happened as a whole. I understand what the panic is saying, but I don't understand how it got into that state. The abrupt message is there to let us know that something that could not have happened did happen. The only other similar case I've heard from the txn package was when the system has run out of space, and MongoDB was unable to preserve basic properties of the data written.
Also, as a side note worth knowing about, the txn system is failing as a safety net. There is a last-resource method on the Runner to clean up its metadata from the collections, if that ever happens again. Please never call this method automatically, though, as it will definitely corrupt the system. It's supposed to be used manually, in emergencies like these.