[2.4.3] juju controller becomes very slow after many model create/use/delete lifecycles

Bug #1797566 reported by Dmitrii Shcherbakov
This bug affects 3 people
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

After one of the redeployments we found that the next deployment was very slow, for no apparent reason.

There were more than 10 deployments with fairly large models on this controller. These models were removed and re-created for new deployments.

The juju controller (non-HA) is running in a qemu-kvm virtual machine and resource utilization (CPU, memory, disk) seems to be far lower than what the underlying host can provide. The VM is not rate-limited.

juju status -m maas:controller
Model Controller Cloud/Region Version SLA Timestamp Notes
controller maas maas 2.4.3 unsupported 09:38:42-04:00 upgrade available: 2.4.4

Machine State DNS Inst id Series AZ Message
0 started 172.16.7.13 kyb86b bionic default Deployed

Model deployment in progress:
http://paste.ubuntu.com/p/6kfbpbbzP2/

It took about 30 minutes for the juju client to exit after a `juju deploy` command (which only enumerates the charms and units and adds machines to the model).

mongostat reports (subjectively) significant CRUD rates per second, while resident memory utilization stays fairly static:

root@juju:/var/log/juju# mongostat --host 127.0.0.1 --port 37017 --authenticationDatabase admin --ssl --sslAllowInvalidCertificates --username "$user" --password "$password"
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
    92 805 323 *0 0 12|0 3.5% 72.1% 0 3.17G 1.27G 0|0 1|0 282k 1.50m 553 juju PRI Oct 12 13:32:16.532
    13 26 2 *0 0 2|0 3.5% 72.1% 0 3.17G 1.27G 0|0 1|0 7.99k 91.9k 553 juju PRI Oct 12 13:32:17.531
   171 543 43 *0 1 5|0 3.6% 72.2% 0 3.17G 1.27G 0|0 1|1 164k 1.12m 553 juju PRI Oct 12 13:32:18.529
    15 140 24 *0 0 2|0 3.6% 72.2% 0 3.17G 1.27G 0|0 1|0 36.1k 269k 553 juju PRI Oct 12 13:32:19.533
   222 153 29 *0 0 7|0 4.0% 72.5% 0 3.17G 1.27G 0|0 1|0 86.0k 262k 553 juju PRI Oct 12 13:32:20.532
    18 901 117 *0 1 4|0 4.0% 72.5% 0 3.17G 1.27G 0|0 1|0 218k 1.55m 553 juju PRI Oct 12 13:32:21.531
   275 852 76 *0 0 4|0 4.3% 72.8% 0 3.17G 1.27G 0|0 1|0 324k 1.63m 553 juju PRI Oct 12 13:32:22.541
    24 217 65 *0 0 2|0 4.3% 72.8% 0 3.17G 1.27G 0|0 1|0 62.3k 331k 553 juju PRI Oct 12 13:32:23.532
   242 390 *0 *0 0 1|0 4.4% 72.9% 0 3.17G 1.27G 0|0 1|0 127k 717k 553 juju PRI Oct 12 13:32:24.535
     2 1182 279 2 0 6|0 4.5% 73.0% 0 3.17G 1.27G 0|0 1|0 350k 2.05m 553 juju PRI Oct 12 13:32:25.532
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
   112 59 4 *0 0 4|0 4.5% 73.0% 0 3.17G 1.27G 0|0 1|0 37.0k 137k 553 juju PRI Oct 12 13:32:26.532
     5 403 21 *0 1 4|0 4.3% 73.0% 0 3.17G 1.27G 0|0 1|0 87.1k 763k 553 juju PRI Oct 12 13:32:27.532
    77 189 22 *0 0 5|0 4.4% 73.1% 0 3.17G 1.27G 0|0 1|0 56.1k 286k 553 juju PRI Oct 12 13:32:28.531
     1 805 76 *0 0 9|0 4.3% 73.1% 0 3.17G 1.27G 0|0 5|1 204k 1.56m 553 juju PRI Oct 12 13:32:29.577
    56 348 172 *0 0 2|0 4.4% 73.1% 0 3.17G 1.27G 0|0 1|0 137k 589k 553 juju PRI Oct 12 13:32:30.533
    62 141 35 1 0 14|0 4.5% 73.2% 0 3.17G 1.27G 0|0 1|0 58.6k 288k 554 juju PRI Oct 12 13:32:31.551
   111 1408 251 *0 1 10|0 4.6% 73.3% 0 3.17G 1.27G 0|0 1|0 401k 2.40m 554 juju PRI Oct 12 13:32:32.530
   102 368 89 *0 0 7|0 4.6% 73.3% 0 3.17G 1.27G 0|0 2|0 117k 530k 554 juju PRI Oct 12 13:32:33.536
    34 250 37 *0 0 3|0 4.6% 73.3% 0 3.17G 1.27G 0|0 1|0 66.1k 406k 554 juju PRI Oct 12 13:32:34.532
   217 1148 119 *0 0 3|0 4.7% 73.4% 0 3.17G 1.27G 0|0 3|0 351k 2.18m 554 juju PRI Oct 12 13:32:35.556
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
    48 905 299 *0 0 9|0 4.8% 73.5% 0 3.17G 1.27G 0|0 1|0 284k 1.61m 554 juju PRI Oct 12 13:32:36.531
   240 683 214 1 1 6|0 4.9% 73.6% 0 3.17G 1.27G 0|0 1|0 280k 1.40m 554 juju PRI Oct 12 13:32:37.529
     3 388 54 *0 0 1|0 5.0% 73.6% 0 3.17G 1.27G 0|0 1|0 91.6k 693k 554 juju PRI Oct 12 13:32:38.532
    60 106 55 *0 0 2|0 5.0% 73.6% 0 3.17G 1.27G 0|0 1|0 50.0k 264k 554 juju PRI Oct 12 13:32:39.531
     4 707 210 *0 0 5|0 5.0% 73.7% 0 3.17G 1.27G 0|0 1|0 229k 1.44m 554 juju PRI Oct 12 13:32:40.531
    59 967 189 *0 1 7|0 5.0% 73.7% 0 3.17G 1.27G 0|0 1|0 269k 1.67m 554 juju PRI Oct 12 13:32:41.528
   167 710 160 *0 0 2|0 4.7% 73.3% 0 3.17G 1.27G 0|0 1|0 237k 1.16m 554 juju PRI Oct 12 13:32:42.531
    51 941 153 *0 0 11|0 4.8% 73.3% 0 3.17G 1.27G 0|0 1|0 252k 1.65m 554 juju PRI Oct 12 13:32:43.531
   128 503 71 *0 0 7|0 4.8% 73.4% 0 3.17G 1.27G 0|0 1|0 157k 919k 554 juju PRI Oct 12 13:32:44.533
   263 519 78 *0 0 4|0 5.0% 73.6% 0 3.17G 1.27G 0|0 1|0 218k 964k 554 juju PRI Oct 12 13:32:45.528
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
   116 262 43 *0 0 12|0 5.2% 73.6% 0 3.17G 1.27G 0|0 1|0 88.9k 392k 554 juju PRI Oct 12 13:32:46.544

~~~
Logs for the mongostat measurement period:

journalctl -u juju-db.service --since='2018-10-12 13:31:46' --until='2018-10-12 13:32:46' | pastebinit

http://paste.ubuntu.com/p/449m6CxFTf/

grep -P '(2018-10-12 13:31:(\d+))|(2018-10-12 13:32:(\d+))' debug-log-controller-0.log | pastebinit
http://paste.ubuntu.com/p/nP4kJvvVTw/

mongodump:
https://people.canonical.com/~dima/mongodump-1539507005.zst

Tags: cpe-onsite
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Marking as ~field-critical because core functionality was critically impacted, with customer-facing implications.

Although the controller was on v2.4.3, I do not see any bug fixes related to the described behavior in 2.4.4:
https://launchpad.net/juju/+milestone/2.4.4

We will preserve the environment for now to investigate.

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Looks to me like we are not doing sufficient performance analysis and correction, and probably not doing it as part of the landing test suites. We should not be making hundreds of writes to the database per second for a single bundle deployment.

Revision history for this message
John A Meinel (jameinel) wrote :

To get some context: is this with 10 models running concurrently, or was a single model deployed, destroyed and re-created repeatedly, with things starting to slow down after about 10 such cycles?

Other useful monitoring would be to see what collections 'mongotop' says are being accessed, and if there is anything significant in 'juju debug-log -m controller'.
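
For example, something like the following (a sketch reusing the connection flags from the mongostat invocation in the description; the trailing argument is the refresh interval in seconds):

 mongotop --host 127.0.0.1 --port 37017 --authenticationDatabase admin \
     --ssl --sslAllowInvalidCertificates \
     --username "$user" --password "$password" 5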

description: updated
description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The openstack model was redeployed several times, and openstack-base a couple of times.

The described behavior was observed while the large openstack model was in a steady state and a deployment of openstack-base was started.

The mongostat measurement was taken during the deployment of openstack-base.

juju controllers
Use --refresh flag with this command to see the latest information.

Controller Model User Access Cloud/Region Models Machines HA Version
google-us-east1 default admin superuser google/us-east1 2 - - (unknown)
maas* openstack admin superuser maas 6 1 none 2.4.3
openstack-RegionOne default admin superuser openstack/RegionOne 2 1 none 2.4.4

juju models
Controller: maas

Model Cloud/Region Status Machines Cores Access Last connection
base maas available 0 - admin 2018-10-12
base1 maas available 16 80 admin 2018-10-12
controller maas available 1 2 admin just now
default maas available 0 - admin 2018-10-12
openstack* maas available 38 202 admin 8 seconds ago
test23 maas available 1 32 admin 2018-10-12

https://paste.ubuntu.com/p/2qXYGjNBTH/ (juju status -m openstack)
https://paste.ubuntu.com/p/4TK3mn7mYG/ (juju status -m base1)

With all models in a steady state:

mongotop:
https://paste.ubuntu.com/p/tmMx6hbTDy/

mongostat:
https://paste.ubuntu.com/p/8NTF7NstKd/

Just a `tail -f`:
https://paste.ubuntu.com/p/383Dzx3jPB/

Repetitive messages, though it is not clear whether they are related:
https://paste.ubuntu.com/p/WvcjxKzYgt/

Also attached `juju debug-log --date --utc -m controller -i 0 --replay > debug-log-controller-0.log` which includes apiserver logs

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

openstack model logs (zstd)

juju debug-log --date --utc -m openstack --replay | zstd > debug-log-openstack.zst

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

juju debug-log --date --utc -m base1 --replay | zstd > debug-log-base1.zst

Revision history for this message
John A Meinel (jameinel) wrote :

I do agree that we should be tracking it more closely. I do wonder if some of the overhead is related to destroying an earlier model. I do see a lot more noise in query logs when doing a destroy-model than a deploy.

I do agree that just 'juju deploy' should not be taking that long, but once the records about machines/applications have been created, provisioning does start. So while the actual deploy isn't, itself, doing provisioning, it does start occurring in the background, thus adding load to the system.

In Juju 2.3 (and thus also 2.4), we also introduced:
 juju model-config -m controller logging-config="juju.state.txn=TRACE"
 juju debug-log -m controller --include-module juju.state.txn

Which will print out each transaction that we run, along with how long it took to run it. This doesn't give you the breakdown into individual database queries (a given transaction turns into at least 2 updates for each document involved in the transaction).
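
As a rough way to summarise those timings, the captured controller log can be grepped for them (a sketch; it assumes TRACE lines of the form 'ran transaction in 0.022s', as quoted later in this bug, and uses the debug-log-controller-0.log attachment):

 grep -o 'ran transaction in [0-9.]*s' debug-log-controller-0.log | \
   awk '{ t = $4 + 0; sum += t; n++; if (t > max) max = t }
        END { if (n) printf "txns=%d avg=%.3fs max=%.3fs\n", n, sum/n, max }'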

I did try:
for i in `seq 50`; do echo $i; juju add-model m$i; time juju deploy ./small-bundle; sleep 30; time juju destroy-model -y m$i; done

And there wasn't an appreciable difference in the deploy or destroy-model times on the first run vs the 50th run.

I'll try again with re-using the model name, rather than a distinct model each time.
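
Something along these lines (a sketch; './small-bundle' and the 30-second sleep are carried over from the loop above):

 # same model name every cycle, instead of a distinct m$i
 for i in `seq 50`; do
   echo "cycle $i"
   juju add-model reuse-test
   time juju deploy ./small-bundle
   sleep 30
   time juju destroy-model -y reuse-test
 done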

description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

> I do see a lot more noise in query logs when doing a destroy-model than a deploy

I talked to the colleagues who originally ran into the problem during a demo to get more details about this.

A more detailed description is as follows:

1) openstack (large, settled) and openstack-base (settled) were deployed;
2) openstack-base was used as a guinea pig to be destroyed and then brought back up from scratch;
3) the destruction of a model was very slow and when the last machine was still being cleaned up ("Waiting on model to be removed. 1 machine(s)...") a new model was created and a deployment was started.
https://private-fileshare.canonical.com/~dima/destoy-model-slow-2018.jpg

I also uploaded a mongodump from today (https://people.canonical.com/~dima/mongodump-1539507005.zst), though I am not sure how much relevance it will have.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The qcow2 image for the juju VM is located at /srv/libvirt/storage/.

virsh dumpxml juju | grep -P 'qcow|/srv/libv'
      <driver name='qemu' type='qcow2'/>
      <source file='/srv/libvirt/storage/juju.img'/>

Just to show the storage capabilities of the machine used to host the Juju controller VM:

sudo rsync -Hv --progress testfile /srv/libvirt/storage/
testfile
 12,001,017,856 100% 273.62MB/s 0:00:41 (xfr#1, to-chk=0/1)

sent 12,003,947,872 bytes received 35 bytes 289,251,756.80 bytes/sec
total size is 12,001,017,856 speedup is 1.00

/srv is a mount point for an ext4 filesystem created on top of a SATA md-raid5.

sudo lvdisplay
  --- Logical volume ---
  LV Path /dev/sata/system
  LV Name system
  VG Name sata

sudo pvdisplay | grep Name
  PV Name /dev/md1
  VG Name sata
  PV Name /dev/md0
  VG Name ssds

blkid | grep sata
/dev/mapper/sata-system: UUID="26efa563-5b5f-4690-8846-0a49fed116d4" TYPE="ext4"

mount | grep sata
/dev/mapper/sata-system on /srv type ext4 (rw,relatime,stripe=384,data=ordered)

cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid10]
md1 : active raid5 sdd1[3] sdc1[2] sda1[0] sdb1[1]
      2929889280 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 2/8 pages [8KB], 65536KB chunk

md0 : active raid1 sdf2[1] sde2[0]
      468524032 blocks super 1.2 [2/2] [UU]
      bitmap: 1/4 pages [4KB], 65536KB chunk
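
Note that the rsync run above only exercises sequential throughput; a small random-write latency test, which MongoDB's journal is far more sensitive to, could look something like this (a sketch using fio, which was not part of the original report; parameters are illustrative):

 sudo fio --name=rand-write-latency --directory=/srv/libvirt/storage \
     --rw=randwrite --bs=4k --size=256m --ioengine=libaio --direct=1 \
     --iodepth=1 --fsync=1 --runtime=30 --time_based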

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Took a look at the environment again with John this morning:

1) found a noticeably high I/O utilization percentage for the [jbd2/vda1-8] kernel thread (the ext4 journal) inside the Juju VM, even though the bandwidth actually used was very low;

2) there was a bogus transaction that had not been properly applied:

machine-0: 08:08:04 ERROR juju.worker.dependency "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {cloudimagemetadata released:Australia Central 2:xenial:amd64:Hyper-V::default ubuntu cloud images} for applying transaction 5bc09f4d0348dc50cc9a3b2d_b33796ad

After forcing this transaction into a different state, which resolved that problem, transaction processing times dropped from about 200 seconds to about 30 seconds. At the same time, the deployment moved on from adding machines to adding units, so it is not 100% clear whether fixing the transaction alone had such a positive effect on transaction processing speed.
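
For reference, the part of the transaction token before the underscore is the transaction document's ObjectId, so the stuck transaction could be inspected with something like the following (a sketch; it reuses the connection parameters from the mongostat command above and assumes mgo/txn's usual 'txns' collection in the 'juju' database):

 mongo --host 127.0.0.1 --port 37017 --ssl --sslAllowInvalidCertificates \
     --authenticationDatabase admin --username "$user" --password "$password" juju \
     --eval 'printjson(db.txns.findOne({_id: ObjectId("5bc09f4d0348dc50cc9a3b2d")}))'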

~~~~

After this I stopped the Juju controller VM, moved it to a RAID-1 on top of 2 SSDs and brought it up.

Not surprisingly, transaction processing times are much better on SSDs, because latency is roughly 100 times lower on the SSD md-raid1 than on the SATA HDD md-raid5:

machine-0: 11:21:37 TRACE juju.state.txn ran transaction in 0.022s []txn.Op{

ubuntu@jumphost:~$ sudo ioping /dev/mapper/sata-system
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=1 time=13.9 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=2 time=10.6 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=3 time=23.1 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=4 time=19.1 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=5 time=5.83 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=6 time=13.4 ms
4 KiB from /dev/mapper/sata-system (block device 2 TiB): request=7 time=8.80 ms
^C
--- /dev/mapper/sata-system (block device 2 TiB) ioping statistics ---
7 requests completed in 6.48 s, 73 iops, 295.2 KiB/s
min/avg/max/mdev = 5.83 ms / 13.5 ms / 23.1 ms / 5.52 ms
ubuntu@jumphost:~$ sudo ioping /dev/mapper/ssds-root
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=1 time=196 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=2 time=132 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=3 time=107 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=4 time=101 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=5 time=98 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=6 time=214 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=7 time=234 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=8 time=234 us
4 KiB from /dev/mapper/ssds-root (block device 446.1 GiB): request=9 time=90 us
^C
--- /dev/mapper/ssds-root (block device 446.1 GiB) ioping statistics ---
9 requests completed in 8.16 s, 6.40 k iops, 25.0 MiB/s
min/avg/max/mdev = 90 us / 156 us / 234 us / 58 us
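
(For comparison, the averages above work out to 13.5 ms on the SATA md-raid5 vs 156 µs on the SSD md-raid1, i.e. 13.5 / 0.156 ≈ 87 times lower 4 KiB request latency on the SSDs.)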

So, I think it could be a combination of multiple factors:

1) running mongodb on top of ext4 with qcow2 on top of mdraid5 underneath...


Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

On the SSD RAID1 it takes about 30 seconds to destroy a model:

time juju destroy-model base1
WARNING! This command will destroy the "base1" model.
This includes all machines, applications, data and other resources.

Continue [y/N]? y
Destroying model
Waiting on model to be removed, 16 machine(s), 16 application(s)...
Waiting on model to be removed, 16 machine(s), 16 application(s)...
Waiting on model to be removed, 16 machine(s), 16 application(s)...
Waiting on model to be removed, 13 machine(s), 16 application(s)...
Waiting on model to be removed, 12 machine(s), 16 application(s)...
Waiting on model to be removed, 9 machine(s), 15 application(s)...
Waiting on model to be removed, 7 machine(s), 14 application(s)...
Waiting on model to be removed, 5 machine(s), 12 application(s)...
Waiting on model to be removed, 2 machine(s), 9 application(s)...
Waiting on model to be removed, 1 machine(s), 1 application(s)...
Waiting on model to be removed, 1 machine(s)...
Waiting on model to be removed...
Waiting on model to be removed...
Model destroyed.

real 0m28.110s
user 0m0.076s
sys 0m0.028s

This is in comparison to 30 *minutes* with SATA/RAID5: https://asciinema.org/a/kHYrV0vTXCxC2vmMoCjR7bOrW

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Moved to field-high from field-critical.

We still need to understand whether the same behavior is reproducible on, for example, a single SATA HDD, to rule out problems with QCOW2 on RAID5 (or with the RAID5 itself).

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.6-beta1
Revision history for this message
Tim Penhey (thumper) wrote :

Based on the underlying disk issues here, I'm marking it incomplete.

Juju controller deploys have a certain expectation around the performance of the underlying disk. If that disk doesn't perform, it isn't a Juju issue.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1797566] Re: [2.4.3] juju controller becomes very slow after many model create/use/delete lifecycles

Note that we did see 10-second queries on Prodstack yesterday. We don't know if that was strictly due to load or some other contention. I'm hoping to have some time today to play with the sanitized backup and see if there is anything interesting in the data.

John
=:->


Changed in juju:
milestone: 2.6-beta1 → 2.6-beta2
Changed in juju:
milestone: 2.6-beta2 → 2.6-rc1
Changed in juju:
milestone: 2.6-rc1 → 2.6-rc2
Changed in juju:
milestone: 2.6-rc2 → 2.6.1
Changed in juju:
milestone: 2.6.1 → 2.6.2
Changed in juju:
milestone: 2.6.2 → 2.6.3
Changed in juju:
milestone: 2.6.3 → 2.6.4
Changed in juju:
milestone: 2.6.4 → 2.6.5
Changed in juju:
milestone: 2.6.5 → 2.6.6
Changed in juju:
milestone: 2.6.6 → 2.6.7
Changed in juju:
milestone: 2.6.7 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired