panics in controller logs

Bug #1803629 reported by Tim Penhey
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Expired  High        Unassigned
2.4             Expired  High        Unassigned

Bug Description

Looking through some 2.4.5 log files I see a number of panics. It appears to be during shutdown...

2018-11-10 21:39:41 WARNING juju.state pool.go:300 state for 1dc0eb3a-6dea-498c-8246-d9d6690e67fd leaked from pool - references: 1, removed: false
2018-11-10 21:39:41 ERROR juju.state pool.go:67 releasing state back to pool: unable to return unknown model 1dc0eb3a-6dea-498c-8246-d9d6690e67fd to the pool
panic: Session already closed

goroutine 249529919 [running]:
gopkg.in/mgo%2ev2.(*Session).cluster(...)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1624
gopkg.in/mgo%2ev2.copySession(0xc026ca8b60, 0x1, 0x0)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:540 +0x391
gopkg.in/mgo%2ev2.(*Session).Copy(0xc026ca8b60, 0xc09a479210)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1590 +0x40
github.com/juju/juju/mongo.CollectionFromName(0xc06e16a2a0, 0x326ffa8, 0x8, 0x8, 0xc0ef097168, 0xc0ff295c01)
 /workspace/_build/src/github.com/juju/juju/mongo/collections.go:16 +0x32
github.com/juju/juju/state.(*database).GetCollection(0xc0e939ddb0, 0x326ffa8, 0x8, 0x328840c, 0x11, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/database.go:289 +0x318
github.com/juju/juju/state.getStatus(0x37d8060, 0xc0e939ddb0, 0xc0ff295cc0, 0x11, 0x3269e01, 0x5, 0x0, 0x0, 0x0, 0x0, ...)
 /workspace/_build/src/github.com/juju/juju/state/status.go:215 +0xd1
github.com/juju/juju/state.(*UnitAgent).Status(0xc09a479598, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0e939ddb0, 0xc0e939ddb0)
 /workspace/_build/src/github.com/juju/juju/state/unitagent.go:38 +0xfa
github.com/juju/juju/state.(*Unit).AgentStatus(0xc113a6bef0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/unit.go:1098 +0x10b
github.com/juju/juju/state.(*backingUnit).unitAndAgentStatus(0xc12cae6640, 0xc113a6bef0, 0xc0f6e04b00, 0xc09a479998, 0x1)
 /workspace/_build/src/github.com/juju/juju/state/allwatcher.go:351 +0xeb
github.com/juju/juju/state.(*backingUnit).updated(0xc12cae6640, 0xc065069600, 0xc04da8e680, 0xc0e288cdc0, 0x34, 0x16, 0x199)
 /workspace/_build/src/github.com/juju/juju/state/allwatcher.go:396 +0x826
github.com/juju/juju/state.loadAllWatcherEntities(0xc065069600, 0xc0a371e4e0, 0xc04da8e680, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/allwatcher.go:1504 +0x4cb
github.com/juju/juju/state.(*allModelWatcherStateBacking).loadAllWatcherEntitiesForModel(0xc0a371e570, 0xc0416c7c50, 0x24, 0xc04da8e680, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/allwatcher.go:1401 +0x165
github.com/juju/juju/state.(*allModelWatcherStateBacking).GetAll(0xc0a371e570, 0xc04da8e680, 0xc0a371e570, 0xc0fd1d2780)
 /workspace/_build/src/github.com/juju/juju/state/allwatcher.go:1379 +0x99
github.com/juju/juju/state.(*storeManager).loop(0xc16b27c0c0, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/multiwatcher.go:229 +0xe0
github.com/juju/juju/state.newStoreManager.func1(0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/multiwatcher.go:206 +0x2e
gopkg.in/tomb%2ev2.(*Tomb).run(0xc16b27c0c0, 0xc01c5b87f0)
 /workspace/_build/src/gopkg.in/tomb.v2/tomb.go:153 +0x2b
created by gopkg.in/tomb%2ev2.(*Tomb).Go
 /workspace/_build/src/gopkg.in/tomb.v2/tomb.go:149 +0xba

another example:

2018-11-09 03:02:55 WARNING juju.state pool.go:300 state for d937da9d-d568-4dab-8f65-43d47c25e499 leaked from pool - references: 5, removed: false
2018-11-09 03:02:55 INFO juju.state state.go:402 starting standard state workers
2018-11-09 03:02:55 INFO juju.state state.go:409 creating cloud image metadata storage
2018-11-09 03:02:55 INFO juju.state state.go:415 started state for model-d03553a7-eddb-46bd-8c62-597946c76016 successfully
2018-11-09 03:02:55 ERROR juju.state pool.go:67 releasing state back to pool: unable to return unknown model d937da9d-d568-4dab-8f65-43d47c25e499 to the pool
panic: Session already closed

goroutine 30034 [running]:
gopkg.in/mgo%2ev2.(*Session).cluster(...)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1624
gopkg.in/mgo%2ev2.copySession(0xc0000d51e0, 0x1, 0x5a3)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:540 +0x391
gopkg.in/mgo%2ev2.(*Session).Copy(0xc0000d51e0, 0x7f51e4c85000)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1590 +0x40
github.com/juju/juju/state.(*StatePool).openState(0xc000384450, 0xc0236f8900, 0x24, 0x800, 0xc0520d8000, 0x5a3)
 /workspace/_build/src/github.com/juju/juju/state/pool.go:200 +0x36
github.com/juju/juju/state.(*StatePool).Get(0xc000384450, 0xc0236f8900, 0x24, 0x0, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/pool.go:182 +0x2e4
github.com/juju/juju/state/statemetrics.statePoolShim.Get(0xc000384450, 0xc0236f8900, 0x24, 0x0, 0x0, 0xc0138d7c50, 0xc0683cfc20)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/state.go:81 +0x43
github.com/juju/juju/state/statemetrics.(*Collector).updateModelMetrics(0xc00071a2d0, 0xc0554fbb90, 0x24)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:202 +0x323
github.com/juju/juju/state/statemetrics.(*Collector).updateMetrics(0xc00071a2d0)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:146 +0x232
github.com/juju/juju/state/statemetrics.(*Collector).Collect(0xc00071a2d0, 0xc01fc29680)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:128 +0x1a4
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc01fc9aac0, 0xc01fc29680, 0x37865e0, 0xc00071a2d0)
 /workspace/_build/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
 /workspace/_build/src/github.com/prometheus/client_golang/prometheus/registry.go:380 +0x302

and....

2018-11-09 02:59:46 ERROR juju.state pool.go:67 releasing state back to pool: unable to return unknown model 308db235-38d6-4f31-8e4b-4cb5380900bf to the pool
2018-11-09 02:59:46 ERROR juju.state pool.go:67 releasing state back to pool: unable to return unknown model 308db235-38d6-4f31-8e4b-4cb5380900bf to the pool
panic: Session already closed

goroutine 571397 [running]:
gopkg.in/mgo%2ev2.(*Session).cluster(...)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1624
gopkg.in/mgo%2ev2.copySession(0xc000f33a00, 0x1, 0xc0237c8bf0)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:540 +0x391
gopkg.in/mgo%2ev2.(*Session).Copy(0xc000f33a00, 0xc01b7d72c0)
 /workspace/_build/src/gopkg.in/mgo.v2/session.go:1590 +0x40
github.com/juju/juju/mongo.CollectionFromName(0xc01ecb5d80, 0x326ffa8, 0x8, 0x8, 0xc003b9dca8, 0xc0237c8c01)
 /workspace/_build/src/github.com/juju/juju/mongo/collections.go:16 +0x32
github.com/juju/juju/state.(*database).GetCollection(0xc01b4b6910, 0x326ffa8, 0x8, 0x328840c, 0x11, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/database.go:289 +0x318
github.com/juju/juju/state.getStatus(0x37d8060, 0xc01b4b6910, 0xc0237c8c00, 0xc, 0x326f7c8, 0x8, 0x0, 0x0, 0x0, 0x0, ...)
 /workspace/_build/src/github.com/juju/juju/state/status.go:215 +0xd1
github.com/juju/juju/state.(*Machine).InstanceStatus(0xc01f702240, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
 /workspace/_build/src/github.com/juju/juju/state/machine.go:1109 +0x13a
github.com/juju/juju/state/statemetrics.(*Collector).updateModelMetrics(0xc0006c85a0, 0xc028598480, 0x24)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:229 +0x7a4
github.com/juju/juju/state/statemetrics.(*Collector).updateMetrics(0xc0006c85a0)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:146 +0x232
github.com/juju/juju/state/statemetrics.(*Collector).Collect(0xc0006c85a0, 0xc028b31140)
 /workspace/_build/src/github.com/juju/juju/state/statemetrics/statemetrics.go:128 +0x1a4
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc01ecec810, 0xc028b31140, 0x37865e0, 0xc0006c85a0)
 /workspace/_build/src/github.com/prometheus/client_golang/prometheus/registry.go:382 +0x61
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
 /workspace/_build/src/github.com/prometheus/client_golang/prometheus/registry.go:380 +0x302

Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
John A Meinel (jameinel) wrote :

Just to mention, the panic we saw is what happens when you have a shared mgo.Session and one side calls .Close() while the other side tries to do a query. (Using a closed Session triggers a panic.)

" unable to return unknown model 308db235-38d6-4f31-8e4b-4cb5380900bf to the pool"

That sounds like we have a reference counting issue, where something grabs access to a State object and calls Release() on it, and a second bit of code *also* calls Release() on it.

That might be related to the Close() issue. If two code paths got a copy of the same State object and then double-released it, the first Release probably triggers a cleanup, including closing that State object's mgo session, and the *second* Release then tries to use/close a session that is already closed, which gives you the error and the panic.
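
To make the failure mode concrete, here is a minimal, purely illustrative sketch (hypothetical types, not Juju's actual state pool code) of how a double Release() on a reference-counted handle closes the shared session under the other holder, so that holder's next query panics:

package main

import "fmt"

// pooledState is a hypothetical stand-in for a pooled *State that owns a
// shared session; it is not the real juju/state pool implementation.
type pooledState struct {
    refs   int
    closed bool
}

// Release decrements the reference count and, at zero, closes the session,
// analogous to the pool tearing down the underlying mgo session.
func (p *pooledState) Release() {
    p.refs--
    if p.refs <= 0 {
        p.closed = true
    }
}

// Query mimics mgo's behaviour: using a closed session is a panic.
func (p *pooledState) Query() {
    if p.closed {
        panic("Session already closed")
    }
    fmt.Println("query ok")
}

func main() {
    st := &pooledState{refs: 2} // two code paths hold the same State
    st.Query()                  // fine while the session is open

    st.Release() // first holder releases
    st.Release() // the same path (or a stray defer) releases again...
    st.Query()   // ...so the other holder's next query panics
}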

John A Meinel (jameinel) wrote :

Or, just as likely, we have some sort of "defer st.Close()" that is trying to close something already closed.

Anastasia (anastasia-macmood) wrote :

I have not looked at panic 2, which seems to be occurring in the metrics collection path.

I have had a look at the code from panics 1 & 3; both are the same and come from the allwatcher code where we retrieve agent and machine status. I believe the problem is that we are creating a full state object, state.Entity, in https://github.com/juju/juju/blob/2.4/state/allwatcher.go#L252. We then use this object to determine machine status, instance status, agent version etc. in various subsequent calls.

The problem is that I do not think we can guarantee that the *State, st, used to create this *state.Entity will not have been closed by the time the subsequent calls are made. I have examined all of the code in the allwatcher, and this approach seems to be unique and new, introduced recently by https://github.com/juju/juju/commit/37a0f7f1774de24d2b8317bfabfa16561db17151

I believe this code needs to be re-written so that it does not keep and use a reference to the state, since that reference can become stale.

I could not figure out a reproducible scenario to easily test my suggestion.

Changed in juju:
assignee: Anastasia (anastasia-macmood) → Ian Booth (wallyworld)
Ian Booth (wallyworld) wrote :

The issue isn't the recently changed code - the new code essentially just reorders the business logic a little; the state object is still used to fetch a unit, and from there the status is extracted. The root cause appears to be that the multiwatcher worker passes a state object into its worker loop, and that state object may be closed from outside the loop while things are still being initialised.

The sequence of events is:

state is started using start()
start() method creates workers

At some time later, state is watched

func (st *State) Watch(params WatchParams) *Multiwatcher {
 return NewMultiwatcher(st.workers.allManager(params))
}

The multiwatcher startup code runs the worker loop in a goroutine. Before the loop starts monitoring the dying channel, all the entities are loaded:

 if err := sm.backing.GetAll(sm.all); err != nil {
  return err
 }

The process of loading the entities uses the state. However, the state may be closed while this goroutine is running, hence the panic.

Inside the code to load all the entities, there is an attempt to use a copy of the state:

 // Use a single new MongoDB connection for all the work here.
 db, closer := st.newDB()
 defer closer()

However, this db is not used everywhere - in some places st is used directly. That appears to be the problem.
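
A minimal runnable sketch of that race, with toy types standing in for mgo and the multiwatcher rather than the real code: the load loop keeps querying through the shared session while another goroutine closes it underneath.

package main

import (
    "sync"
    "time"
)

// session is a toy stand-in for the mgo session shared through *State.
type session struct {
    mu     sync.Mutex
    closed bool
}

func (s *session) Close() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.closed = true
}

func (s *session) Query() {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.closed {
        panic("Session already closed")
    }
}

// getAll mimics loadAllWatcherEntities: many lookups against the session it was given.
func getAll(s *session) {
    for i := 0; i < 1000; i++ {
        s.Query() // panics as soon as the session is closed underneath us
        time.Sleep(time.Millisecond)
    }
}

func main() {
    st := &session{}

    // The worker loop starts loading entities in its own goroutine...
    go getAll(st)

    // ...while the owner of the session closes it, e.g. during shutdown.
    time.Sleep(10 * time.Millisecond)
    st.Close()

    // Give the worker goroutine time to hit the closed session and panic.
    time.Sleep(50 * time.Millisecond)
}

As noted above, the db copy taken by st.newDB() is meant to cover all of this work; the panic path is where st is still used directly instead of that copy.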

Ian Booth (wallyworld) wrote :

The panics in metrics are due to the system state being closed while the worker that gathers the metrics has not been stopped. The stateworker which registers the prometheus collector should unregister it when the worker dies. This is only done via a defer in the worker loop, and it seems possible on a heavily loaded system that the state tracker's Done() could be called before the defer executes, so there is a window in which the collector could use a closed state pool.
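
A rough sketch of that window, with a toy pool and collector rather than Juju's statemetrics code: if the pool is torn down before the collector is unregistered, a scrape that lands in between still reaches Collect() and panics, matching the traces above.

package main

import "github.com/prometheus/client_golang/prometheus"

// pool is a toy stand-in for the state pool the collector reads from.
type pool struct{ closed bool }

func (p *pool) modelCount() int {
    if p.closed {
        // Mirrors what happens when the collector touches closed state.
        panic("Session already closed")
    }
    return 3
}

// collector is a toy prometheus.Collector backed by the pool.
type collector struct {
    pool *pool
    desc *prometheus.Desc
}

func (c *collector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *collector) Collect(ch chan<- prometheus.Metric) {
    // Registry.Gather calls this; if the pool was closed before the collector
    // was unregistered, this is where the panic surfaces.
    ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(c.pool.modelCount()))
}

func main() {
    p := &pool{}
    c := &collector{
        pool: p,
        desc: prometheus.NewDesc("toy_models", "toy model count", nil, nil),
    }

    reg := prometheus.NewRegistry()
    reg.MustRegister(c)

    // Shutdown in the wrong order: the pool is released (the state tracker's
    // Done()) before the deferred Unregister has had a chance to run...
    p.closed = true

    // ...so a scrape landing in that window still invokes the collector and panics.
    _, _ = reg.Gather()

    // Unregistering before releasing the pool would close the window.
    reg.Unregister(c)
}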

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.4.7 → 2.5-beta2
status: Triaged → In Progress
Ian Booth (wallyworld) wrote :

We're parking these for now as fixes have landed elsewhere to address a state leak and we want to see if that incidentally addresses this issue.

Changed in juju:
assignee: Ian Booth (wallyworld) → nobody
status: In Progress → Triaged
importance: Critical → High
Changed in juju:
milestone: 2.5-beta2 → 2.5-rc1
Changed in juju:
milestone: 2.5-rc1 → 2.5.1
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Anastasia (anastasia-macmood) wrote :

I am marking this as Incomplete to let it expire - comment #7 seems to indicate that some changes have landed that should prevent these panics from re-occurring. Let's re-open it if anyone sees a recurrence.

Changed in juju:
status: Triaged → Incomplete
milestone: 2.5.6 → none
Launchpad Janitor (janitor) wrote :

[Expired for juju 2.4 because there has been no activity for 60 days.]

Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired