Large Ops scenario is taking too long

Bug #1500615 reported by Sylvain Bauza
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Won't Fix
High
Unassigned
devstack
Fix Released
Undecided
Matt Riedemann

Bug Description

gate-tempest-dsvm-large-ops error rate is spiking since the last 24 hours (http://goo.gl/G9Zazy) with the following stacktrace :

2015-09-28 15:02:50.954 | Traceback (most recent call last):
2015-09-28 15:02:50.954 | File "tempest/test.py", line 127, in wrapper
2015-09-28 15:02:50.954 | return f(self, *func_args, **func_kwargs)
2015-09-28 15:02:50.954 | File "tempest/scenario/test_large_ops.py", line 138, in test_large_ops_scenario_3
2015-09-28 15:02:50.954 | self._large_ops_scenario()
2015-09-28 15:02:50.954 | File "tempest/scenario/test_large_ops.py", line 123, in _large_ops_scenario
2015-09-28 15:02:50.954 | self.nova_boot()
2015-09-28 15:02:50.954 | File "tempest/scenario/test_large_ops.py", line 119, in nova_boot
2015-09-28 15:02:50.954 | self._wait_for_server_status('ACTIVE')
2015-09-28 15:02:50.954 | File "tempest/scenario/test_large_ops.py", line 81, in _wait_for_server_status
2015-09-28 15:02:50.955 | server['id'], status)
2015-09-28 15:02:50.955 | File "tempest/common/waiters.py", line 95, in wait_for_server_status
2015-09-28 15:02:50.955 | raise exceptions.TimeoutException(message)
2015-09-28 15:02:50.955 | tempest.exceptions.TimeoutException: Request timed out

http://logs.openstack.org/23/226923/5/gate/gate-tempest-dsvm-large-ops/826845d/console.html#_2015-09-28_15_02_50_955

Tags: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/226831/ was a devstack change that cut n-api workers from 4 to 2 and n-cond workers from 8 to 2. That merged on 9/24 and we started spiking shortly after that, so maybe related.

Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

The test itself is creating 100 instances at once:

http://logs.openstack.org/23/226923/5/gate/gate-tempest-dsvm-large-ops/826845d/logs/tempest_conf.txt.gz

large_ops_number = 100

Changed in nova:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/228636

Changed in devstack:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Revision history for this message
Nikola Đipanov (ndipanov) wrote :

After reading the IRC logs from when the bug was reported I noticed an observation by Matt that the time the resource semaphore was held kept increasing until it dropped - this can indeed be caused by a small number of conductor workers. COMPUTE_RESOURCE_SEMAPHORE is nova-cpu process wide, and there is a non-trivial amount of DB querying that goes on while it's held. A storm of requests would mean even more stuff going to the conductor which in turn makes ongoing requests holding the lock slow down as they are competing for the same pool of conductor workers to do their DB queries for them.

It is hard to tell without actually profiling the code (AND porobably adding even more probes like number of threads and DB requests per conductor worker), but increasing the number of conductor workers is definitely a good guess as to where the problem might be

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/228980

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/228636
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=c35eee5dbbb6b4dbc2901ebef4c4d88780aa74ec
Submitter: Jenkins
Branch: master

commit c35eee5dbbb6b4dbc2901ebef4c4d88780aa74ec
Author: Matt Riedemann <email address hidden>
Date: Mon Sep 28 14:46:27 2015 -0700

    use nproc/2 workers for large ops job

    Commit 1ce19ab76d67a89b04f907f1d292d013a3b699e0 dropped API_WORKERS from
    nproc/2 to nproc/4 and also started using API_WORKERS for the number of
    conductor workers, so in gate runs that dropped conductor workers from 8
    to 2. We're now seeing instance build timeouts in the large ops job.

    This change goes back to nproc/2 for the large ops job (VIRT_DRIVER=='fake').

    Closes-Bug: #1500615

    Change-Id: Ie6ef855fce0a99c930d479b7459c15b69e8de499

Changed in devstack:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/229490

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/228980
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9bb4c07b2996aa9e3153e6d6f28e7aa600c64100
Submitter: Jenkins
Branch: master

commit 9bb4c07b2996aa9e3153e6d6f28e7aa600c64100
Author: Matt Riedemann <email address hidden>
Date: Tue Sep 29 07:21:19 2015 -0700

    Add checkpoint logging when building an instance in compute manager

    When the large ops job fails because instances timeout during the build,
    it's hard to know what is going on for a given instance in the compute
    manager's _do_build_and_run_instance flow since a lot of the trace
    logging that happens is in the virt drivers.

    The large ops job uses the fake virt driver which doesn't do much
    logging at all. Rather than add logging to the fake virt driver, it'd be
    useful to add some logging to the compute manager when we are doing
    sub-tasks like building networks, building bdms, and spawning the instance
    on the hypervisor.

    Per: https://wiki.openstack.org/wiki/LoggingStandards

    The 'Starting instance' log message is changed to debug level and the
    completed task logging unit of work messages with time spent are done at
    info level.

    Related-Bug: #1500615

    Change-Id: I6b9e7fc7b3a14ccf8b1531fa08ce8d732e23775a

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/229490
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=48a07fe48f6d45f14d711fa35d23a1a9b45e3efc
Submitter: Jenkins
Branch: master

commit 48a07fe48f6d45f14d711fa35d23a1a9b45e3efc
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 30 08:45:48 2015 -0700

    Add checkpoint logging when terminating an instance

    Following I6b9e7fc7b3a14ccf8b1531fa08ce8d732e23775a for building an
    instance, this adds some checkpoint logging for terminating an instance,
    including destroying it on the hypervisor, deallocating the network, and
    detaching any volumes.

    Related-Bug: #1500615

    Change-Id: I16b8879f9dad9b0283b0758c41c70233624276f9

Revision history for this message
Matt Riedemann (mriedem) wrote :

We dropped the large ops job.

Changed in nova:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.