Unit tests fail in handle_schedule_error with wrong instance

Bug #1360320 reported by Dan Smith
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 2015.1.0

Bug Description

From http://logs.openstack.org/70/113270/3/check/gate-nova-python26/038b3fa/console.html:

2014-08-21 20:08:33.507 | Traceback (most recent call last):
2014-08-21 20:08:33.507 | File "nova/tests/conductor/test_conductor.py", line 1343, in test_build_instances_scheduler_failure
2014-08-21 20:08:33.507 | legacy_bdm=False)
2014-08-21 20:08:33.507 | File "nova/conductor/rpcapi.py", line 415, in build_instances
2014-08-21 20:08:33.507 | cctxt.cast(context, 'build_instances', **kw)
2014-08-21 20:08:33.508 | File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/oslo/messaging/rpc/client.py", line 152, in call
2014-08-21 20:08:33.508 | retry=self.retry)
2014-08-21 20:08:33.508 | File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/oslo/messaging/transport.py", line 90, in _send
2014-08-21 20:08:33.508 | timeout=timeout, retry=retry)
2014-08-21 20:08:33.508 | File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_fake.py", line 194, in send
2014-08-21 20:08:33.508 | return self._send(target, ctxt, message, wait_for_reply, timeout)
2014-08-21 20:08:33.508 | File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/oslo/messaging/_drivers/impl_fake.py", line 181, in _send
2014-08-21 20:08:33.508 | raise failure
2014-08-21 20:08:33.509 | UnexpectedMethodCallError: Unexpected method call. unexpected:- expected:+
2014-08-21 20:15:52.443 | - handle_schedule_error.__call__(<nova.tests.conductor.test_conductor.FakeContext object at 0xc19a3d0>, NoValidHost(u'No valid host was found. fake-reason',), '8eb9d649-0985-4350-8946-570ce100534c', {'instance_properties': Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=1955-11-05T00:00:00Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,host='fake-host',hostname=None,id=1,image_ref=None,info_cache=<?>,instance_type_id=None,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=None,launched_on=None,locked=False,locked_by=None,memory_mb=None,metadata=<?>,node=None,os_type=None,pci_devices=<?>,power_state=None,progress=None,project_id='fake-project',ramdisk_id=None,reservation_id=None,root_device_name=None,root_gb=0,scheduled_at=None,security_groups=<?>,shutdown_terminate=False,system_metadata=<?>,task_state=None,terminated_at=None,updated_at=None,user_data=None,user_id='fake-user',uuid=8eb9d649-0985-4350-8946-570ce100534c,vcpus=None,vm_mode=None,vm_state=None), 'fake': 'specs'}) -> None
libvir: error : internal error could not initialize domain event timer
2014-08-21 20:16:46.065 | Exception TypeError: "'NoneType' object is not callable" in <bound method GreenSocket.__del__ of <eventlet.greenio.GreenSocket object at 0xdc5bd10>> ignored
2014-08-21 20:19:45.254 | + handle_schedule_error.__call__(<nova.tests.conductor.test_conductor.FakeContext object at 0xc19a3d0>, NoValidHost(u'No valid host was found. fake-reason',), '712006a3-7ca5-4350-8f7f-028a9e4c78b2', {'instance_properties': Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone=None,cell_name=None,cleaned=False,config_drive=None,created_at=1955-11-05T00:00:00Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,disable_terminate=False,display_description=None,display_name=None,ephemeral_gb=0,ephemeral_key_uuid=None,fault=<?>,host='fake-host',hostname=None,id=1,image_ref=None,info_cache=<?>,instance_type_id=None,kernel_id=None,key_data=None,key_name=None,launch_index=None,launched_at=None,launched_on=None,locked=False,locked_by=None,memory_mb=None,metadata=<?>,node=None,os_type=None,pci_devices=<?>,power_state=None,progress=None,project_id='fake-project',ramdisk_id=None,reservation_id=None,root_device_name=None,root_gb=0,scheduled_at=None,security_groups=<?>,shutdown_terminate=False,system_metadata=<?>,task_state=None,terminated_at=None,updated_at=None,user_data=None,user_id='fake-user',uuid=8eb9d649-0985-4350-8946-570ce100534c,vcpus=None,vm_mode=None,vm_state=None), 'fake': 'specs'}) -> None

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah, the diff is in the instance.uuid that's expected. The test mocks out handle_schedule_error:

self.mox.StubOutWithMock(scheduler_driver, 'handle_schedule_error')

It's trying to build 2 instances and handle errors for both:

        for instance in instances:
            scheduler_driver.handle_schedule_error(self.context, exception,
                    instance.uuid, spec)
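
For reference, mox verifies stubbed calls strictly in the order they were recorded, so an expectation recorded with one instance.uuid blows up with UnexpectedMethodCallError if the code under test passes a different uuid first. A minimal standalone sketch of that behavior (the driver class and uuids here are made up, not the actual Nova code):

    import mox

    class FakeSchedulerDriver(object):
        def handle_schedule_error(self, context, ex, instance_uuid, request_spec):
            pass

    driver = FakeSchedulerDriver()
    m = mox.Mox()
    m.StubOutWithMock(driver, 'handle_schedule_error')

    # Record the expected calls; by default mox demands them in this exact order.
    driver.handle_schedule_error('ctxt', 'err', 'uuid-1', {'fake': 'specs'})
    driver.handle_schedule_error('ctxt', 'err', 'uuid-2', {'fake': 'specs'})
    m.ReplayAll()

    # Calling with 'uuid-2' first raises UnexpectedMethodCallError, the same
    # failure mode as the gate traceback above.
    driver.handle_schedule_error('ctxt', 'err', 'uuid-2', {'fake': 'specs'})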

This is going through the now-deprecated _build_instances method in nova.compute.manager, which was updated recently to not call _instance_update:

https://review.openstack.org/#/c/114033/

I'm not really sure how that would cause issues, unless it was slowing the test down before so we didn't race, or it was changing the database differently.

Using this query:

message:"+ handle_schedule_error.__call__" AND message:"nova.tests.conductor.test_conductor.FakeContext object at" AND message:"No valid host was found. fake-reason" AND tags:"console" AND project:"openstack/nova"

There are 14 hits since 8/21, in the check queue only and on 2 changes:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiKyBoYW5kbGVfc2NoZWR1bGVfZXJyb3IuX19jYWxsX19cIiBBTkQgbWVzc2FnZTpcIm5vdmEudGVzdHMuY29uZHVjdG9yLnRlc3RfY29uZHVjdG9yLkZha2VDb250ZXh0IG9iamVjdCBhdFwiIEFORCBtZXNzYWdlOlwiTm8gdmFsaWQgaG9zdCB3YXMgZm91bmQuIGZha2UtcmVhc29uXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIgQU5EIHByb2plY3Q6XCJvcGVuc3RhY2svbm92YVwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA4NzIwMTI4NDQyLCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

One of those changes is an objects change, the other isn't, so this is probably something that is already merged.

tags: added: testing
tags: added: compute conductor
Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking through the subunit logs, I'm seeing a lot of tests running concurrently through the same code paths. There are several cases of InstanceNotFound when scheduler_driver.handle_schedule_error is called and runs this on the instance uuid:

    (old_ref, new_ref) = db.instance_update_and_get_original(context,
            instance_uuid, {'vm_state': vm_states.ERROR,
                            'task_state': None})

I don't see anything in nova.conductor.manager.build_instances that handles that, so maybe those are just from negative unit tests?

When you check the subunit logs for the instance.uuid that is expected to be passed to handle_schedule_error, it doesn't show up anywhere else.

If you check the instance.uuid from the 'unexpected' call though, it's the instance object mapped to 'instance_properties' in the spec passed to build_request_spec, and in the test that's the first instance created (of the 2):

    def test_build_instances_scheduler_failure(self):
        instances = [fake_instance.fake_instance_obj(self.context)
                for i in xrange(2)]
        image = {'fake-data': 'should_pass_silently'}
        spec = {'fake': 'specs',
                'instance_properties': instances[0]}

So it looks like the test is expecting handle_schedule_error to be called with the 2nd instance but it's getting the first. I'm not sure why, but maybe the order isn't important in how handle_schedule_error is called, as long as it's called on each instance.
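
If the order really doesn't matter, one way to express that with mox (just a sketch of the idea, not necessarily how this was actually fixed) is to record the expectations with InAnyOrder(), so verification passes no matter which instance the conductor handles first; exc here is a stand-in for the NoValidHost the test raises:

    for instance in instances:
        scheduler_driver.handle_schedule_error(
            self.context, exc, instance.uuid, spec).InAnyOrder()
    self.mox.ReplayAll()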

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe this is racing with the shelve tests, i.e. bug 1353654. These could all be races related to the fact that the test_keypairs, test_shelve, test_conductor and test_compute tests all extend from a common base class in nova.tests.compute.test_compute.BaseTestCase, and there are 3 conductor task test classes running at the same time.

I wouldn't think concurrent runs would be an issue since those should be in separate processes, but they are all hitting the database for their instances and mocking out different things so maybe the database is getting wonky somehow. In this case there are 7 workers.

This might be totally unrelated as well, but there are 19 instances of this in the subunit logs:

Traceback (most recent call last):
  File "nova/quota.py", line 1326, in commit
    user_id=user_id)
  File "nova/quota.py", line 569, in commit
    user_id=user_id)
  File "nova/db/api.py", line 1148, in reservation_commit
    user_id=user_id)
  File "nova/db/sqlalchemy/api.py", line 167, in wrapper
    return f(*args, **kwargs)
  File "nova/db/sqlalchemy/api.py", line 205, in wrapped
    return f(*args, **kwargs)
  File "nova/db/sqlalchemy/api.py", line 3299, in reservation_commit
    context, session, project_id, user_id)
  File "nova/db/sqlalchemy/api.py", line 3057, in _get_project_user_quota_usages
    with_lockmode('update').\
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2300, in all
    return list(self)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2412, in __iter__
    return self._execute_and_instances(context)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2427, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
    context)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
    exc_info
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/home/jenkins/workspace/gate-nova-python26/.tox/py26/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_...


Revision history for this message
Joe Gordon (jogo) wrote :

Just saw this again

Changed in nova:
importance: Undecided → High
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Hans Lindgren (hanlind) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

Not seeing nova unit tests fail in the gate anymore, so it looks like this was fixed somehow.

Changed in nova:
status: Confirmed → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0