Slow compute nodes

Bug #1843248 reported by Liam Young
This bug affects 1 person
Affects: OpenStack Charm Test Infra
Status: Triaged
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I don't have much science yet, but it feels like guests on ciguapa (and possibly mutus) are slower than guests hosted on other compute nodes.

Revision history for this message
Liam Young (gnuoy) wrote :

In a recent deploy these seemed to be the slow nodes:

ciguapa
koch
cuegle
geiger

particularly ciguapa & cuegle

Revision history for this message
Frode Nordahl (fnordahl) wrote :

+1

I have a deploy in progress right now where the units on ciguapa are running unbearably slowly.

ubuntu@ciguapa:~$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
119 0 24660880 1127528 285340 38450224 1 1 5 247 0 0 49 10 41 0 0
132 0 24660896 1083548 285412 38468772 10 17 3252 13191 96976 146211 87 12 1 0 0
104 0 24660896 1046648 285468 38497120 3 0 4134 161 95260 147292 87 12 1 0 0
106 0 24660896 1023924 285524 38521848 9 0 2964 154 98079 150934 87 12 1 0 0
100 0 24660896 947468 285596 38590696 3 0 5290 19629 97576 151811 86 13 1 0 0

instance:
ubuntu@juju-5c8a8e-zaza-c1bac91020f7-1:~$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 0 0 67136 44132 1727796 0 0 588 1901 240 633 25 6 36 2 31
 2 0 0 73436 44136 1721720 0 0 3195 7 253 453 45 3 0 0 52
 1 0 0 78116 44136 1717020 0 0 3791 2 247 326 38 3 0 1 58
 1 0 0 81728 44140 1713372 0 0 1637 2 199 214 22 2 0 0 77
 1 0 0 65476 44144 1729664 0 0 7051 2 236 511 30 3 0 0 66

Look at those steal times, that's madness!
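
As an aside (not from the original report): the st column in vmstat is the percentage of time stolen from the guest by the hypervisor. If the sysstat package happens to be installed in the instance, sar breaks this out explicitly per sample:

sar -u 5 3    # the %steal column is the time the guest's vCPUs spent waiting on the host

Consistently high %steal inside a guest, combined with near-zero st on the host itself (as in the ciguapa output above), points at CPU oversubscription on the hypervisor rather than at the workload.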

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Test failure: https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-neutron-api/684701/3/3922/index.html

It was due to a deployment timeout: a unit placed on ciguapa never completed in time:
/var/log/nova/nova-conductor.log:2019-10-01 11:03:06.273 889 DEBUG nova.conductor.manager [req-9c5655fe-041f-404a-a356-eafd642c6ded 94d9f8d2314b4853a7dfd0c3cc934e17 2dd543a265894161b751db6090a0eaf0 - 1bf127c9b631435984600ac72fa5374f 1bf127c9b631435984600ac72fa5374f] [instance: 72148a58-129c-4b7f-a01b-ecbe4887404f] Selected host: ciguapa; Selected node: ciguapa; Alternates: [(u'mutus', u'mutus.serverstack'), (u'fechner', u'fechner.serverstack')] schedule_and_build_instances /usr/lib/python2.7/dist-packages/nova/conductor/manager.py:1245
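
For reference, placements like the one above can be pulled out of the conductor log for a given instance UUID with a plain grep (same log file and UUID as in the line above, just a convenience one-liner):

grep 'Selected host' /var/log/nova/nova-conductor.log | grep 72148a58-129c-4b7f-a01b-ecbe4887404f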

Revision history for this message
Liam Young (gnuoy) wrote :

ciguapa disabled:

openstack compute service set --disable --disable-reason "Very Slow Bug #1843248" ciguapa nova-compute
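
To double-check that the scheduler will no longer consider the host (standard nova CLI, not part of the original comment; --long should also show the disabled reason):

openstack compute service list --service nova-compute --long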

Revision history for this message
Frode Nordahl (fnordahl) wrote :

\o/ for ciguapa disable

I was hit by this again, this time on ``cuegle``.

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_full/openstack/charm-neutron-api/684701/3/3927/index.html

nova-cloud-controller/0* active executing 6 172.17.103.17 8774/tcp Unit is ready

6 started 172.17.103.17 28007892-ade5-48a6-a96f-49501c189fe9 xenial nova ACTIVE

/var/log/nova/nova-conductor.log:2019-10-01 17:32:51.006 889 DEBUG nova.conductor.manager [req-727bb3b5-3690-4b04-9456-384eea929b17 94d9f8d2314b4853a7dfd0c3cc934e17 2dd543a265894161b751db6090a0eaf0 - 1bf127c9b631435984600ac72fa5374f 1bf127c9b631435984600ac72fa5374f] [instance: 28007892-ade5-48a6-a96f-49501c189fe9] Selected host: cuegle; Selected node: cuegle.serverstack; Alternates: [(u'mutus', u'mutus.serverstack'), (u'koch', u'koch.serverstack')] schedule_and_build_instances /usr/lib/python2.7/dist-packages/nova/conductor/manager.py:1245

Revision history for this message
Frode Nordahl (fnordahl) wrote :

cuegle disabled:

openstack compute service set --disable --disable-reason "Very Slow Bug #1843248" cuegle nova-compute
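
If cuegle is brought back to health later, re-enabling it is the mirror operation (a sketch, not something run as part of this comment):

openstack compute service set --enable cuegle nova-compute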

Revision history for this message
Frode Nordahl (fnordahl) wrote :

This instance was placed prior to cuegle being disabled; adding it here for documentation purposes.

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_smoke/openstack/charm-ceph-radosgw/685992/1/12613/index.html

ceph-mon/1 active executing 1 172.17.111.5 Unit is ready and clustered

1 started 172.17.111.5 2ba36b52-530c-467d-bd88-da6684e0056b bionic nova ACTIVE

/var/log/nova/nova-conductor.log:2019-10-01 15:55:58.210 2921 DEBUG nova.conductor.manager [req-db32d62b-1f70-4351-9902-ac13024ca82d 94d9f8d2314b4853a7dfd0c3cc934e17 2dd543a265894161b751db6090a0eaf0 - 1bf127c9b631435984600ac72fa5374f 1bf127c9b631435984600ac72fa5374f] [instance: 2ba36b52-530c-467d-bd88-da6684e0056b] Selected host: cuegle; Selected node: cuegle.serverstack; Alternates: [(u'geiger', u'geiger.serverstack'), (u'koch', u'koch.serverstack')] schedule_and_build_instances /usr/lib/python2.7/dist-packages/nova/conductor/manager.py:1245

Revision history for this message
Frode Nordahl (fnordahl) wrote :

mutus:

https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_smoke/openstack/charm-octavia/685942/1/12647/index.html

octavia/0 blocked executing 8 172.17.105.30 9876/tcp 'shared-db' incomplete, 'amqp' incomplete, 'identity-service' missing, 'neutron-openvswitch' missing, Awaiting leader to create required resources, Missing required certificate configuration, please examine documentation
  neutron-openvswitch-octavia/2 waiting executing 172.17.105.30 Incomplete relations: messaging
octavia/1* blocked idle 9 172.17.105.9 9876/tcp Awaiting end-user execution of `configure-resources` action to create required resources, Missing required certificate configuration, please examine documentation

/var/log/nova/nova-conductor.log:2019-10-02 11:52:43.143 2921 DEBUG nova.conductor.manager [req-5f99570d-d6f5-4f13-b47d-87de91ead768 94d9f8d2314b4853a7dfd0c3cc934e17 2dd543a265894161b751db6090a0eaf0 - 1bf127c9b631435984600ac72fa5374f 1bf127c9b631435984600ac72fa5374f] [instance: cbe94f62-8eda-4ef1-a9a9-ea4de9cd8769] Selected host: mutus; Selected node: mutus.serverstack; Alternates: [(u'fechner', u'fechner.serverstack'), (u'koch', u'koch.serverstack')] schedule_and_build_instances /usr/lib/python2.7/dist-packages/nova/conductor/manager.py:1245
/var/log/nova/nova-conductor.log:2019-10-02 11:52:39.647 2920 DEBUG nova.conductor.manager [req-d8ef5008-dd9e-4882-bb6d-470803affb55 94d9f8d2314b4853a7dfd0c3cc934e17 2dd543a265894161b751db6090a0eaf0 - 1bf127c9b631435984600ac72fa5374f 1bf127c9b631435984600ac72fa5374f] [instance: 1426a761-5785-46c5-86bd-18c3e6f4a4de] Selected host: koch; Selected node: koch.serverstack; Alternates: [(u'fechner', u'fechner.serverstack'), (u'mutus', u'mutus.serverstack')] schedule_and_build_instances /usr/lib/python2.7/dist-packages/nova/conductor/manager.py:1245

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Re comment #8: that failure actually points at both mutus and koch, so it may be that this particular job is at the edge of what the current default timeout of 2700s allows.

Have a proposal up to increase it here: https://github.com/openstack-charmers/zaza/pull/281

Revision history for this message
Frode Nordahl (fnordahl) wrote :

From instance running on mutus at Mon Oct 7 09:40:00 UTC 2019

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 2 0 0 8138000 106492 1656948 0 0 149 2925 483 600 9 8 63 2 17
 1 0 0 8104104 106512 1657248 0 0 26 148 6299 11019 17 7 58 0 18
 4 0 0 8131928 106520 1657280 0 0 0 66 4072 7405 15 6 65 0 14
 3 0 0 8127820 106524 1657232 0 0 5 810 5448 9476 14 7 62 0 16
 1 0 0 8127364 106532 1656884 0 0 0 1632 7377 13231 10 9 57 0 24
 1 0 0 8103300 106724 1658088 0 0 2 3142 1885 2997 14 5 68 0 13
 2 0 0 8172212 106776 1659752 0 0 170 5040 2509 4284 16 8 57 0 19
 1 0 0 8130476 106780 1659892 0 0 0 849 1726 2746 28 5 52 0 15
 3 0 0 8132008 106788 1659876 0 0 0 62 6196 10520 8 8 55 0 28
 2 0 0 8144720 106812 1659964 0 0 0 48 4034 7270 14 6 64 0 15
 2 0 0 8106168 107440 1662512 0 0 0 1781 4738 8814 16 11 48 0 25
 2 0 0 8104612 107444 1662540 0 0 0 19 6421 11860 10 9 52 0 29
 1 0 0 8104584 107456 1662536 0 0 0 7382 5840 10958 8 8 54 0 30

Up to 30% of the guest's time is spent with its vCPUs not being scheduled, due to load on the hypervisor.
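
A rough way to quantify that from inside a guest (a minimal sketch; st is the 17th field of default vmstat output, and the first sample after the two header lines is the since-boot average, hence NR > 3):

vmstat 5 13 | awk 'NR > 3 { sum += $17; n++ } END { printf "avg steal: %.1f%%\n", sum / n }'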

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Leaving Importance as undecided, as this bug seems to be in a "collecting" phase at present.

Changed in charm-test-infra:
status: New → Triaged