excessive mongodb replicaset role churn

Bug #1671258 reported by Paul Collins
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Andrew Wilkins

Bug Description

Canonical IS recently encountered problems with a Juju 2.0.3 environment rapidly changing which node was primary. With assistance from Team Juju, we upgraded to mongodb 3.2.12 and juju 2.1. However, we are still having problems with mongodb. Some recent events:

Mar 8 21:28:28 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[24506]: [ReplicationExecutor] transition to PRIMARY
Mar 8 21:28:29 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [replExecDBWorker-0] transition to SECONDARY
Mar 8 21:28:34 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[24506]: [rsSync] transition to primary complete; database writes are now permitted
Mar 8 21:28:41 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to ROLLBACK
Mar 8 21:28:53 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:28:53 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to ROLLBACK
Mar 8 21:28:54 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:28:56 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to SECONDARY
Mar 8 21:28:58 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [ReplicationExecutor] transition to PRIMARY
Mar 8 21:28:59 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [rsSync] transition to primary complete; database writes are now permitted
Mar 8 21:29:06 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[32554]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:29:06 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[32554]: [ReplicationExecutor] transition to SECONDARY
Mar 8 21:29:06 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[32554]: [ReplicationExecutor] transition to STARTUP2
Mar 8 21:31:06 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [ReplicationExecutor] transition to PRIMARY
Mar 8 21:31:15 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [replExecDBWorker-0] transition to SECONDARY
Mar 8 21:31:16 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [rsSync] transition to primary complete; database writes are now permitted
Mar 8 21:31:25 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [ReplicationExecutor] transition to ROLLBACK
Mar 8 21:31:26 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:31:26 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[13540]: [ReplicationExecutor] transition to SECONDARY
Mar 8 21:31:34 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[575]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:31:34 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[575]: [ReplicationExecutor] transition to SECONDARY
Mar 8 21:31:34 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-0 mongod.37017[575]: [ReplicationExecutor] transition to STARTUP2
Mar 8 21:31:36 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[11744]: [ReplicationExecutor] transition to RECOVERING
Mar 8 21:31:36 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[11744]: [ReplicationExecutor] transition to SECONDARY
Mar 8 21:31:36 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-1 mongod.37017[11744]: [ReplicationExecutor] transition to STARTUP2

Paul Collins (pjdc)
Changed in juju:
importance: Undecided → Critical
Revision history for this message
Paul Collins (pjdc) wrote :

mongodb logging from 21:00 onwards is at https://private-fileshare.canonical.com/~pjdc/bug1671258/

The primary logs tend to be very large due to the default slow query logging at 100ms.

Revision history for this message
Paul Collins (pjdc) wrote :

I've stopped all 3 jujuds, and we're seeing some strange effects on the primary. According to mongostat it's essentially idle:

insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
    *0 *0 *0 *0 1 3|0 0.0 79.9 0 11.7G 5.5G 0|2601 0|128 583b 23k 2741 juju PRI 2017-03-08T22:31:53Z
    *0 *0 *0 *0 1 13|0 0.0 79.9 0 11.7G 5.5G 0|2601 0|128 4k 26k 2740 juju PRI 2017-03-08T22:31:55Z
    *0 *0 *0 *0 1 3|0 0.0 79.9 0 11.7G 5.5G 0|2600 1|128 583b 23k 2739 juju PRI 2017-03-08T22:31:58Z
    *0 *0 *0 *0 0 6|0 0.0 79.9 0 11.7G 5.5G 0|2594 0|127 1k 24k 2738 juju PRI 2017-03-08T22:32:03Z

but it's consuming a lot of CPU:

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ps uf -C mongod
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 24282 379 70.3 12264992 5753156 ? Rsl 09:12 3038:54 /usr/lib/juju/mongo3.2/bin/mongod --dbpath /var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.
ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ uptime
 22:32:36 up 12 days, 6:23, 2 users, load average: 1221.97, 963.10, 690.73

and I see no evidence of so many connections:

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ ss -n src :37017 | wc -l
5

although:

ubuntu@juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2:~$ sudo lsof -n -p 24282 | grep -c TCP
2719

but lsof doesn't have much useful information on those sockets.

Revision history for this message
Paul Collins (pjdc) wrote :

I wound down the slow query logging threshold earlier, and stopped all 3 jujuds, but the primary remained very busy. We were still logging some slow queries, but my eyes kind of glazed over and I didn't realize that it was logging slow queries on the order of 71 *minutes*, e.g.:

Mar 8 22:52:11 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [conn72833] command juju.$cmd command: delete { delete: "statuseshistory", deletes: [ { q: { updated: { $lt: 1487799605850125781 } }, limit: 0 } ], writeConcern: { getLastError: 1, j: true }, ordered: true } keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 63994, w: 63994 } }, Database: { acquireCount: { w: 63994 }, acquireWaitCount: { w: 1566 }, timeAcquiringMicros: { w: 130809598 } }, Collection: { acquireCount: { w: 63994 } } } protocol:op_query 4324901ms
Mar 8 22:52:14 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [conn71119] remove juju.statuseshistory query: { updated: { $lt: 1487799666911020159 } } ndeleted:3 keyUpdates:0 writeConflicts:0 numYields:64000 locks:{ Global: { acquireCount: { r: 64004, w: 64004 } }, Database: { acquireCount: { w: 64004 }, acquireWaitCount: { w: 1600 }, timeAcquiringMicros: { w: 133792069 } }, Collection: { acquireCount: { w: 64001 } }, Metadata: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 3 } } } 4265339ms
Mar 8 22:52:14 juju-4da59b22-9710-4e69-840a-be49ee864a97-machine-2 mongod.37017[24282]: [conn71119] command juju.$cmd command: delete { delete: "statuseshistory", deletes: [ { q: { updated: { $lt: 1487799666911020159 } }, limit: 0 } ], writeConcern: { getLastError: 1, j: true }, ordered: true } keyUpdates:0 writeConflicts:0 numYields:0 reslen:100 locks:{ Global: { acquireCount: { r: 64004, w: 64004 } }, Database: { acquireCount: { w: 64004 }, acquireWaitCount: { w: 1600 }, timeAcquiringMicros: { w: 133792069 } }, Collection: { acquireCount: { w: 64001 } }, Metadata: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 3 } } } protocol:op_query 4265353ms

Andrew Wilkins (axwalk)
Changed in juju:
status: New → Triaged
milestone: none → 2.2-alpha1
assignee: nobody → Andrew Wilkins (axwalk)
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

So, we're lacking an index appropriate for pruning status history. Without that index, pruning status history scans the entire collection, which will grow quite quickly for a large environment. Another issue is that we've got a status history pruning worker per model, but each one is operating on the collection globally (i.e. all models).

Paul also found that there were slow remove operations on the logs.logs collection, which similarly lacks appropriate indexes for pruning.

What we need to do:
 - change PruneStatusHistory to prune per-model. We can possibly share code between pruning logs and status history
 - add an index on juju.statuseshistory of {"model-uuid", "updated"}
 - add an index on logs.logs of {"e", "t"} and {"e"}
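
For illustration only (this is not the actual Juju patch), here is a minimal mgo sketch of ensuring the indexes described above. Connection details (TLS, auth) are omitted, and the database/collection names are taken from this comment rather than verified against Juju's source, so treat them as assumptions:

    // Sketch: ensure the pruning-related indexes discussed above.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
    )

    func ensurePruneIndexes(session *mgo.Session) error {
        // Index for per-model status history pruning by update time.
        statusHistory := session.DB("juju").C("statuseshistory")
        if err := statusHistory.EnsureIndex(mgo.Index{
            Key:        []string{"model-uuid", "updated"},
            Background: true, // build without blocking writes
        }); err != nil {
            return err
        }
        // Indexes for pruning/querying logs per model (e) and by time (t).
        logs := session.DB("logs").C("logs")
        for _, key := range [][]string{{"e", "t"}, {"e"}} {
            if err := logs.EnsureIndex(mgo.Index{Key: key, Background: true}); err != nil {
                return err
            }
        }
        return nil
    }

    func main() {
        session, err := mgo.Dial("localhost:37017") // TLS/auth omitted for brevity
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()
        if err := ensurePruneIndexes(session); err != nil {
            log.Fatal(err)
        }
    }

With those indexes in place, the pruning queries shown in the earlier comments can be satisfied per model without scanning the whole collection.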

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Next up, presence. Queries for all presence.beings for a model are slow, because we have no indexes at all (except for the default _id one). We need an index for model-uuid at least.
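
A hedged sketch of how one might confirm the collection scan and add that index with mgo; the collection handle and the model-uuid field name follow this comment and are assumptions, not Juju's actual schema:

    // Sketch: check the query plan for per-model presence lookups and
    // ensure an index on model-uuid. Package name is arbitrary.
    package presencecheck

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // CheckAndIndexBeings assumes beings is the presence beings collection.
    func CheckAndIndexBeings(beings *mgo.Collection, modelUUID string) error {
        var plan bson.M
        // Explain reports the winning plan; a COLLSCAN stage here means
        // the query is reading the whole collection.
        if err := beings.Find(bson.M{"model-uuid": modelUUID}).Explain(&plan); err != nil {
            return err
        }
        log.Printf("plan before index: %v", plan)
        return beings.EnsureIndex(mgo.Index{
            Key:        []string{"model-uuid"},
            Background: true,
        })
    }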

Revision history for this message
Andrew Wilkins (axwalk) wrote :

AFAICS, there's also no index created for "s" on juju.txns. That means txn.Runner.ResumeAll will be slow (and has proven to be slow in IS's shared-controller environment).
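
To make the connection concrete, a rough sketch (again mgo, with connection details and exact collection layout as assumptions): ResumeAll looks up unfinished transactions by their state field ("s"), so without an index on "s" that lookup degrades to a full scan of juju.txns:

    // Sketch: index the transaction state field, then resume incomplete txns.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/txn"
    )

    func main() {
        session, err := mgo.Dial("localhost:37017") // TLS/auth omitted for brevity
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        txns := session.DB("juju").C("txns")
        // With an index on "s", ResumeAll's search for unfinished
        // transactions no longer has to scan every document.
        if err := txns.EnsureIndex(mgo.Index{Key: []string{"s"}, Background: true}); err != nil {
            log.Fatal(err)
        }

        runner := txn.NewRunner(txns)
        if err := runner.ResumeAll(); err != nil {
            log.Fatal(err)
        }
    }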

Revision history for this message
Christian Muirhead (2-xtian) wrote :

Some historical context on why the index was changed:
1. In the change to Mongo 3.2, a number of tests started failing because the code under test sorted log rows (and maybe other things) by (e, t), and the tests were implicitly relying on rows with the same timestamp coming back in insert order.

2. I changed the query to sort by _id as well to fully constrain the order of the returned rows.

3. That meant that the sort stopped using the index and we'd get errors from Mongo when the sort started using too much memory.

4. So Menno changed it to use two sorts - first by (e, t) and then by (_id) - that meant the sort used the index again. This solution only works when rows are inserted in time order - so that time and id orders are in the same direction. (Foreshadowing, guess what happens next!)

5. When I did log transfer for migrations, that constraint stopped being true - the log messages transferred from the old model are written to the new model after some messages are logged in the new model, so the (e, t) then id sort started returning messages in the wrong order.

6. So I changed the sort back to (e, t, id) and changed the index to match (in the mistaken belief that an (e, t) query would use the (e, t, id) index), which introduced the bug.

Your fix (adding the (e, t) index back) works, as long as we're ok with the extra space and index update overhead (I don't have a feel for how much that is). Menno pointed out that we could put index hints onto the (e, t) queries to get them to use the (e, t, id) index instead.
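
For reference, a minimal sketch of the hint approach Menno suggested, written with mgo; the field names (e, t, _id) follow this comment, and the real Juju query will differ:

    // Sketch: force an (e, t) query to use the compound (e, t, _id) index
    // via an explicit hint, instead of relying on the planner.
    package logquery

    import (
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    // ModelLogs returns an iterator over one model's log rows in
    // (e, t, _id) order, hinting the compound index so the sort can be
    // satisfied by it rather than in memory.
    func ModelLogs(logs *mgo.Collection, modelUUID string) *mgo.Iter {
        return logs.Find(bson.M{"e": modelUUID}).
            Sort("e", "t", "_id").
            Hint("e", "t", "_id").
            Iter()
    }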

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I backed out the change to logs, as I couldn't reproduce my previous results. Will need to keep an eye on log pruning performance. The main culprit was status history pruning anyway.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Junien F (axino) wrote :

Perhaps mongod should be run with --notablescan (https://docs.mongodb.com/manual/reference/program/mongod/#cmdoption-mongod--notablescan) in development environments and in some QA tests?
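
As a hedged illustration of how a test harness might get the same effect at runtime (assuming a local, unauthenticated test mongod; production Juju controllers use TLS on 37017, which this skips), notablescan can also be toggled via setParameter:

    // Sketch: runtime equivalent of starting mongod with --notablescan.
    // Any query that cannot use an index then fails loudly, which is
    // useful for catching missing indexes in CI.
    package main

    import (
        "log"

        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
    )

    func main() {
        session, err := mgo.Dial("localhost:27017") // hypothetical local test mongod
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        var res bson.M
        if err := session.DB("admin").Run(bson.D{{"setParameter", 1}, {"notablescan", 1}}, &res); err != nil {
            log.Fatal(err)
        }
        log.Printf("notablescan enabled: %v", res)
    }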

Revision history for this message
Andrew Wilkins (axwalk) wrote :

I was not aware of that flag -- sounds like a great idea to me. I'm pretty certain we're missing indexes for a lot of low-volume collections, so I doubt we'll be able to turn it on straight away. I'll file a tech-debt bug to investigate.

Changed in juju:
status: Fix Committed → Fix Released