panic in mgo transaction lib

Bug #1318044 reported by Kapil Thangavelu
This bug report is a duplicate of a private bug.
This bug affects 3 people
Affects: juju-core
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: next-stable

Bug Description

A concurrency race: several units were executing hooks when the state server restarted, and all of them ended up stuck in pending. I have the full log (520 MB); here is the status output of half a dozen machines:
https://chinstrap.canonical.com/~acollard/kapil-status-output

machine-0: panic: rescanned document misses transaction in queue
machine-0: goroutine 3847 [running]:
machine-0: runtime.panic(0xb36e20, 0xc210c65160)
machine-0: /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
machine-0: labix.org/v2/mgo/txn.(*flusher).rescan(0xc210941000, 0xc210be2e00, 0x7f5b13b9c601, 0x0, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:472 +0xdba
machine-0: labix.org/v2/mgo/txn.(*flusher).prepare(0xc210941000, 0xc210be2e00, 0x410701, 0x0, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:223 +0x9a
machine-0: labix.org/v2/mgo/txn.(*flusher).advance(0xc210941000, 0xc210be2e00, 0xc210bda4b0, 0x1, 0xc2108280d0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:160 +0x64
machine-0: labix.org/v2/mgo/txn.(*flusher).run(0xc210941000, 0x0, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:118 +0xd31
machine-0: labix.org/v2/mgo/txn.flush(0xc2101082c0, 0xc210bc1300, 0x1, 0x1)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/flusher.go:21 +0x1a8
machine-0: labix.org/v2/mgo/txn.(*Runner).Run(0xc2101082c0, 0xc210ba9000, 0x8, 0x8, 0xc210934880, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/txn/txn.go:302 +0x498
machine-0: launchpad.net/juju-core/state.(*State).runTransaction(0xc21006f3c0, 0xc210ba9000, 0x8, 0x8, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/state.go:113 +0x1fe
machine-0: launchpad.net/juju-core/state.(*RelationUnit).EnterScope(0xc210a491b0, 0xc210859d80, 0xe4baf0, 0xf)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/relationunit.go:138 +0xa43
machine-0: launchpad.net/juju-core/state/apiserver/uniter.(*UniterAPI).EnterScope(0xc210a2bd20, 0xc210933400, 0x1, 0x4, 0x0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/state/apiserver/uniter/uniter.go:836 +0x2af
machine-0: reflect.Value.call(0xdf23c0, 0xc21086dee8, 0x1d3a, 0xdfd1d0, 0x4, ...)
machine-0: /usr/lib/go/src/pkg/reflect/value.go:474 +0xe0b
machine-0: reflect.Value.Call(0xdf23c0, 0xc21086dee8, 0x1d3a, 0xc210928860, 0x1, ...)
machine-0: /usr/lib/go/src/pkg/reflect/value.go:345 +0x9d
machine-0: launchpad.net/juju-core/rpc/rpcreflect.func·007(0xdf23c0, 0xc21086dee8, 0x162, 0xcb3cc0, 0xc2109286e0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/rpcreflect/type.go:317 +0x12f
machine-0: launchpad.net/juju-core/rpc/rpcreflect.MethodCaller.Call(0x7f5b13ba4ea8, 0xcb3cc0, 0x7f5b13ba4ea8, 0xd095a0, 0xdd4080, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/rpcreflect/value.go:103 +0x103
machine-0: launchpad.net/juju-core/rpc.(*Conn).runRequest(0xc210a2a3c0, 0x7f5b13ba4ea8, 0xcb3cc0, 0x7f5b13ba4ea8, 0xd095a0, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/server.go:485 +0xf5
machine-0: created by launchpad.net/juju-core/rpc.(*Conn).handleRequest
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/rpc/server.go:417 +0x5f6
machine-0: goroutine 1 [chan receive]:
machine-0: launchpad.net/tomb.(*Tomb).Wait(0xc210037e40, 0x402e72, 0x402e72)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/tomb/tomb.go:110 +0x42
machine-0: launchpad.net/juju-core/worker.(*runner).Wait(0xc210037e40, 0xe52d10, 0xb)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:124 +0x2e
machine-0: main.(*MachineAgent).Run(0xc21004fb40, 0xc2100af780, 0x0, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/machine.go:166 +0x4fc
machine-0: launchpad.net/juju-core/cmd.(*SuperCommand).Run(0xc2100aedc0, 0xc2100af780, 0xc2100af780, 0x0)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/supercommand.go:303 +0x43f
machine-0: launchpad.net/juju-core/cmd.Main(0x7f5b13ba63f0, 0xc2100aedc0, 0xc2100af780, 0xc21000a010, 0x6, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/cmd.go:244 +0x283
machine-0: main.jujuDMain(0xc21000a000, 0x7, 0x7, 0xc2100af780, 0xb2f301, ...)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:107 +0x33a
machine-0: main.Main(0xc21000a000, 0x7, 0x7)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:122 +0x20d
machine-0: main.main()
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/main.go:139 +0x44
machine-0: goroutine 3 [syscall]:
machine-0: os/signal.loop()
machine-0: /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
machine-0: created by os/signal.init·1
machine-0: /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x31
machine-0: goroutine 6 [select]:
machine-0: launchpad.net/juju-core/worker.(*runner).run(0xc210037e40, 0xc210059280, 0x424ad2)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:162 +0x9db
machine-0: launchpad.net/juju-core/worker.func·004()
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:86 +0x54
machine-0: created by launchpad.net/juju-core/worker.NewRunner
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:87 +0x133
machine-0: goroutine 7 [chan receive]:
machine-0: launchpad.net/tomb.(*Tomb).Wait(0xc210119c00, 0x7f5b13a1fef0, 0x7f5b13a1de78)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/tomb/tomb.go:110 +0x42
machine-0: launchpad.net/juju-core/worker.(*runner).Wait(0xc210119c00, 0x7f5b13bac528, 0xc210119c00)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:124 +0x2e
machine-0: main.(*closeWorker).Wait(0xc2100fd080, 0xc2100d1000, 0xe13c30)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/cmd/jujud/agent.go:244 +0x2f
machine-0: launchpad.net/juju-core/worker.(*runner).runWorker(0xc210037e40, 0x0, 0xe13c30, 0x5, 0xc2100e5030)
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:266 +0x3c6
machine-0: created by launchpad.net/juju-core/worker.(*runner).run
machine-0: /build/buildd/juju-core-1.18.3/src/launchpad.net/juju-core/worker/runner.go:179 +0x356
machine-0: goroutine 8 [select]:
machine-0: labix.org/v2/mgo.(*mongoCluster).syncServersLoop(0xc210045d20)
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/cluster.go:360 +0x510
machine-0: created by labix.org/v2/mgo.newCluster
machine-0: /build/buildd/juju-core-1.18.3/src/labix.org/v2/mgo/cluster.go:69 +0x102
machine-0: goroutine 14 [chan receive]:
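
The goroutine dump is cut off above. Its top frames show the failing path: uniter.(*UniterAPI).EnterScope calls state.(*State).runTransaction, which drives txn.(*Runner).Run, and the panic fires inside txn.(*flusher).rescan. As a rough sketch of the pattern involved (the collection names, document ids, and asserts below are made up for illustration, not juju's actual schema), an EnterScope-style call hands a set of asserted operations to an mgo/txn Runner:

    package main

    import (
        "log"

        "labix.org/v2/mgo"
        "labix.org/v2/mgo/bson"
        "labix.org/v2/mgo/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer session.Close()

        db := session.DB("demo") // illustrative database name
        runner := txn.NewRunner(db.C("txns"))

        // An EnterScope-style transaction: assert that the relation still
        // exists, then insert the unit's scope document atomically.
        ops := []txn.Op{{
            C:      "relations",
            Id:     "wordpress:db mysql:server",
            Assert: txn.DocExists,
        }, {
            C:      "relationscopes",
            Id:     "r#0#peer#wordpress/0",
            Assert: txn.DocMissing,
            Insert: bson.M{"departing": false},
        }}

        // Run flushes this and any other pending transactions touching the
        // same documents; under concurrency each runner may pick up and
        // advance transactions queued by its peers, which is the code path
        // (flusher.rescan) where this panic fired.
        if err := runner.Run(ops, "", nil); err != nil {
            log.Fatalf("txn failed: %v", err)
        }
    }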

tags: added: cloud-installer landscape
Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

I heard from #landscape that the MongoDB logs and the general syslog from the system were not saved. It would be important to know more about the state of the database, and of the system in which this happened as a whole. I understand what the panic is saying, but I don't understand how it got into that state. The abrupt message is there to let us know that something that could not have happened did happen. The only other similar case I've heard of from the txn package was when the system had run out of disk space and MongoDB was unable to preserve basic properties of the written data.
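
To make the message concrete, here is a rough sketch, not the flusher's actual code, of the invariant it enforces: every document enlisted in a pending transaction carries a "txn-queue" array of transaction tokens, and the rescan re-reads the document expecting its own token to still be listed. The collection, document id, and token below are hypothetical:

    package main

    import (
        "labix.org/v2/mgo"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            panic(err)
        }
        defer session.Close()

        // Hypothetical collection and document id, for illustration only.
        coll := session.DB("demo").C("relationscopes")
        var doc struct {
            Queue []string `bson:"txn-queue"`
        }
        if err := coll.FindId("r#0#peer#wordpress/0").One(&doc); err != nil {
            panic(err)
        }

        token := "5368f5e52b102c4b1e000001_nonce" // hypothetical "<txn id>_<nonce>" token
        listed := false
        for _, t := range doc.Queue {
            if t == token {
                listed = true
                break
            }
        }
        if !listed {
            // The real flusher treats this as an impossible state: the
            // document was just seen enlisted in the transaction, so the
            // token cannot legitimately have vanished from its queue.
            panic("rescanned document misses transaction in queue")
        }
    }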

Also, as a side note worth knowing about: the txn system is panicking here as a safety net. There is a last-resort method on the Runner to clean up its metadata from the collections, if this ever happens again. Please never call that method automatically, though, as doing so will definitely corrupt the system. It is meant to be used manually, in emergencies like this one.
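
For future reference, a minimal sketch of what that manual, last-resort cleanup could look like, assuming the method meant here is Runner.PurgeMissing (with ResumeAll run first so interrupted but still-valid transactions complete). The database and collection names are illustrative, and this is strictly for manual emergency use against a stopped system, after taking a backup:

    package main

    import (
        "log"

        "labix.org/v2/mgo"
        "labix.org/v2/mgo/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017")
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer session.Close()

        db := session.DB("juju") // illustrative database name
        runner := txn.NewRunner(db.C("txns"))

        // First let any interrupted but still-valid transactions complete.
        if err := runner.ResumeAll(); err != nil {
            log.Fatalf("resume: %v", err)
        }

        // Last-resort cleanup of orphaned txn metadata in the affected
        // collections. Manual, emergency use only; never automate this.
        if err := runner.PurgeMissing("relations", "relationscopes", "units"); err != nil {
            log.Fatalf("purge: %v", err)
        }
    }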

Revision history for this message
Kapil Thangavelu (hazmat) wrote : Re: [Bug 1318044] Re: panic in mgo transaction lib

Noted for future reference re: collection of additional logs. FWIW, the machine was in good health (disk/CPU/etc.), and after the restart the state server agent on the machine was operating normally. The only item of note is that the machine was a SeaMicro, which uses a low-latency, high-throughput network fabric (inter-host ping times < 0.2 ms).


Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → next-stable
tags: added: mongodb race-condition
Revision history for this message
Adam Collard (adam-collard) wrote :

Correction to Kapil's note above: this was found on an Orange Box (SSDs), not a SeaMicro.

Revision history for this message
James Page (james-page) wrote :

Bug #1318366 might be related to this; I've attached a full machine-0.log and the mongod logs from syslog to that bug.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Oh, and please also take a dump of the database itself if you see it again.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

James, I don't see the correlation; what makes you think they are related?

By the way, it's too bad the (gigantic) log in that bug doesn't include the top entries; everything captured there appears to be merely blocked while something else crashed.

Revision history for this message
Adam Collard (adam-collard) wrote :

This definitely looks like a duplicate of bug #1318366, for which I have a full copy of /var/log.
