Functional versioned notification test test_create_delete_server_with_instance_update randomly fails

Bug #1657087 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Balazs Gibizer

Bug Description

We have some non-deterministic failures in this functional test for versioned notifications:

http://logs.openstack.org/32/420132/3/check/gate-nova-tox-db-functional-ubuntu-xenial/0ecd455/console.html#_2017-01-17_03_22_08_097927

2017-01-17 03:22:08.097927 | nova.tests.functional.notification_sample_tests.test_instance.TestInstanceNotificationSample.test_create_delete_server_with_instance_update
2017-01-17 03:22:08.097978 | -------------------------------------------------------------------------------------------------------------------------------------------
2017-01-17 03:22:08.097991 |
2017-01-17 03:22:08.098008 | Captured traceback:
2017-01-17 03:22:08.098026 | ~~~~~~~~~~~~~~~~~~~
2017-01-17 03:22:08.098048 | Traceback (most recent call last):
2017-01-17 03:22:08.098098 | File "nova/tests/functional/notification_sample_tests/test_instance.py", line 166, in test_create_delete_server_with_instance_update
2017-01-17 03:22:08.098124 | self.assertEqual(7, len(instance_updates))
2017-01-17 03:22:08.098183 | File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/testtools/testcase.py", line 411, in assertEqual
2017-01-17 03:22:08.098223 | self.assertThat(observed, matcher, message)
2017-01-17 03:22:08.098283 | File "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/testtools/testcase.py", line 498, in assertThat
2017-01-17 03:22:08.098302 | raise mismatch_error
2017-01-17 03:22:08.098328 | testtools.matchers._impl.MismatchError: 7 != 6
2017-01-17 03:22:08.098341 |
2017-01-17 03:22:08.098353 |
2017-01-17 03:22:08.098371 | Captured pythonlogging:
2017-01-17 03:22:08.098388 | ~~~~~~~~~~~~~~~~~~~~~~~
2017-01-17 03:22:08.098838 | 2017-01-17 03:20:02,061 INFO [nova.api.openstack] Loaded extensions: ['extensions', 'flavors', 'image-metadata', 'image-size', 'images', 'ips', 'limits', 'os-admin-actions', 'os-admin-password', 'os-agents', 'os-aggregates', 'os-assisted-volume-snapshots', 'os-attach-interfaces', 'os-availability-zone', 'os-baremetal-nodes', 'os-block-device-mapping', 'os-cells', 'os-certificates', 'os-cloudpipe', 'os-config-drive', 'os-console-auth-tokens', 'os-console-output', 'os-consoles', 'os-create-backup', 'os-deferred-delete', 'os-evacuate', 'os-extended-availability-zone', 'os-extended-server-attributes', 'os-extended-status', 'os-extended-volumes', 'os-fixed-ips', 'os-flavor-access', 'os-flavor-extra-specs', 'os-flavor-manage', 'os-flavor-rxtx', 'os-floating-ip-dns', 'os-floating-ip-pools', 'os-floating-ips', 'os-floating-ips-bulk', 'os-fping', 'os-hide-server-addresses', 'os-hosts', 'os-hypervisors', 'os-instance-actions', 'os-instance-usage-audit-log', 'os-keypairs', 'os-lock-server', 'os-migrate-server', 'os-migrations', 'os-multinic', 'os-multiple-create', 'os-networks', 'os-networks-associate', 'os-pause-server', 'os-quota-class-sets', 'os-quota-sets', 'os-remote-consoles', 'os-rescue', 'os-scheduler-hints', 'os-security-group-default-rules', 'os-security-groups', 'os-server-diagnostics', 'os-server-external-events', 'os-server-groups', 'os-server-password', 'os-server-tags', 'os-server-usage', 'os-services', 'os-shelve', 'os-simple-tenant-usage', 'os-suspend-server', 'os-tenant-networks', 'os-used-limits', 'os-user-data', 'os-virtual-interfaces', 'os-volumes', 'server-metadata', 'server-migrations', 'servers', 'versions']
2017-01-17 03:22:08.099353 | 2017-01-17 03:20:02,598 INFO [nova.api.openstack] Loaded extensions: ['extensions', 'flavors', 'image-metadata', 'image-size', 'images', 'ips', 'limits', 'os-admin-actions', 'os-admin-password', 'os-agents', 'os-aggregates', 'os-assisted-volume-snapshots', 'os-attach-interfaces', 'os-availability-zone', 'os-baremetal-nodes', 'os-block-device-mapping', 'os-cells', 'os-certificates', 'os-cloudpipe', 'os-config-drive', 'os-console-auth-tokens', 'os-console-output', 'os-consoles', 'os-create-backup', 'os-deferred-delete', 'os-evacuate', 'os-extended-availability-zone', 'os-extended-server-attributes', 'os-extended-status', 'os-extended-volumes', 'os-fixed-ips', 'os-flavor-access', 'os-flavor-extra-specs', 'os-flavor-manage', 'os-flavor-rxtx', 'os-floating-ip-dns', 'os-floating-ip-pools', 'os-floating-ips', 'os-floating-ips-bulk', 'os-fping', 'os-hide-server-addresses', 'os-hosts', 'os-hypervisors', 'os-instance-actions', 'os-instance-usage-audit-log', 'os-keypairs', 'os-lock-server', 'os-migrate-server', 'os-migrations', 'os-multinic', 'os-multiple-create', 'os-networks', 'os-networks-associate', 'os-pause-server', 'os-quota-class-sets', 'os-quota-sets', 'os-remote-consoles', 'os-rescue', 'os-scheduler-hints', 'os-security-group-default-rules', 'os-security-groups', 'os-server-diagnostics', 'os-server-external-events', 'os-server-groups', 'os-server-password', 'os-server-tags', 'os-server-usage', 'os-services', 'os-shelve', 'os-simple-tenant-usage', 'os-suspend-server', 'os-tenant-networks', 'os-used-limits', 'os-user-data', 'os-virtual-interfaces', 'os-volumes', 'server-metadata', 'server-migrations', 'servers', 'versions']
2017-01-17 03:22:08.099399 | 2017-01-17 03:20:02,602 INFO [nova.wsgi] osapi_compute listening on 127.0.0.1:40792
2017-01-17 03:22:08.099449 | 2017-01-17 03:20:02,617 INFO [nova.osapi_compute.wsgi.server] (4544) wsgi starting up on http://127.0.0.1:40792
2017-01-17 03:22:08.099486 | 2017-01-17 03:20:02,618 INFO [nova.service] Starting conductor node (version 15.0.0)
2017-01-17 03:22:08.099522 | 2017-01-17 03:20:02,649 INFO [nova.service] Starting scheduler node (version 15.0.0)
2017-01-17 03:22:08.099561 | 2017-01-17 03:20:02,671 INFO [nova.network.driver] Loading network driver 'nova.network.linux_net'
2017-01-17 03:22:08.099596 | 2017-01-17 03:20:02,673 INFO [nova.service] Starting network node (version 15.0.0)
2017-01-17 03:22:08.099634 | 2017-01-17 03:20:02,720 INFO [nova.virt.driver] Loading compute driver 'fake.SmallFakeDriver'
2017-01-17 03:22:08.099669 | 2017-01-17 03:20:02,720 INFO [nova.service] Starting compute node (version 15.0.0)
2017-01-17 03:22:08.099706 | 2017-01-17 03:20:02,758 ERROR [nova.compute.manager] No compute node record for host compute
2017-01-17 03:22:08.099765 | 2017-01-17 03:20:02,759 WARNING [nova.compute.monitors] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors).
2017-01-17 03:22:08.099806 | 2017-01-17 03:20:02,764 WARNING [nova.compute.resource_tracker] No compute node record for compute:fake-mini
2017-01-17 03:22:08.099849 | 2017-01-17 03:20:02,769 INFO [nova.compute.resource_tracker] Compute_service record created for compute:fake-mini
2017-01-17 03:22:08.099921 | 2017-01-17 03:20:02,780 WARNING [nova.scheduler.client.report] No authentication information found for placement API. Placement is optional in Newton, but required in Ocata. Please enable the placement service before upgrading.
2017-01-17 03:22:08.099964 | 2017-01-17 03:20:02,781 WARNING [nova.scheduler.client.report] Unable to refresh my resource provider record
2017-01-17 03:22:08.100029 | 2017-01-17 03:20:02,838 INFO [nova.compute.resource_tracker] Final resource view: name=fake-mini phys_ram=8192MB used_ram=512MB phys_disk=1028GB used_disk=0GB total_vcpus=1 used_vcpus=0 pci_stats=[]
2017-01-17 03:22:08.100079 | 2017-01-17 03:20:02,846 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'compute'. Re-created its InstanceList.
2017-01-17 03:22:08.100121 | 2017-01-17 03:20:02,848 WARNING [nova.scheduler.client.report] Unable to refresh my resource provider record
2017-01-17 03:22:08.100168 | 2017-01-17 03:20:02,858 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "GET /v2.1 HTTP/1.1" status: 204 len: 352 time: 0.0011370
2017-01-17 03:22:08.100226 | 2017-01-17 03:20:04,180 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "POST /v2.1/6f70656e737461636b20342065766572//flavors HTTP/1.1" status: 200 len: 862 time: 1.3181760
2017-01-17 03:22:08.100285 | 2017-01-17 03:20:04,358 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "POST /v2.1/6f70656e737461636b20342065766572//servers HTTP/1.1" status: 202 len: 370 time: 0.1736619
2017-01-17 03:22:08.100350 | 2017-01-17 03:20:04,499 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572//servers?reservation_id=r-90pz0j6n HTTP/1.1" status: 200 len: 695 time: 0.1371081
2017-01-17 03:22:08.100420 | 2017-01-17 03:20:04,579 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572//servers/cbb34ead-5367-423c-aac0-640ce68a0467 HTTP/1.1" status: 200 len: 1779 time: 0.0684171
2017-01-17 03:22:08.100461 | 2017-01-17 03:20:04,801 INFO [nova.compute.claims] Attempting claim: memory 512 MB, disk 1 GB, vcpus 1 CPU
2017-01-17 03:22:08.100497 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] Total memory: 8192 MB, used: 512.00 MB
2017-01-17 03:22:08.100537 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] memory limit not specified, defaulting to unlimited
2017-01-17 03:22:08.100593 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] Total disk: 1028 GB, used: 0.00 GB
2017-01-17 03:22:08.100640 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] disk limit not specified, defaulting to unlimited
2017-01-17 03:22:08.100682 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] Total vcpu: 1 VCPU, used: 0.00 VCPU
2017-01-17 03:22:08.100722 | 2017-01-17 03:20:04,802 INFO [nova.compute.claims] vcpu limit not specified, defaulting to unlimited
2017-01-17 03:22:08.122749 | 2017-01-17 03:20:04,803 INFO [nova.compute.claims] Claim successful
2017-01-17 03:22:08.122802 | 2017-01-17 03:20:04,891 WARNING [nova.scheduler.client.report] Unable to refresh my resource provider record
2017-01-17 03:22:08.122845 | 2017-01-17 03:20:04,978 WARNING [nova.scheduler.client.report] Unable to refresh my resource provider record
2017-01-17 03:22:08.122888 | 2017-01-17 03:20:05,174 INFO [nova.compute.manager] Took 0.00 seconds to spawn the instance on the hypervisor.
2017-01-17 03:22:08.122956 | 2017-01-17 03:20:05,362 INFO [nova.osapi_compute.wsgi.server] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572//servers/cbb34ead-5367-423c-aac0-640ce68a0467 HTTP/1.1" status: 200 len: 2017 time: 0.2028451
2017-01-17 03:22:08.122993 | 2017-01-17 03:20:05,376 INFO [nova.compute.manager] Took 0.58 seconds to build instance.
2017-01-17 03:22:08.123023 | 2017-01-17 03:20:05,546 INFO [nova.wsgi] Stopping WSGI server.
2017-01-17 03:22:08.123062 | 2017-01-17 03:20:05,547 INFO [nova.osapi_compute.wsgi.server] (4544) wsgi exited, is_accepting=True

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/421221

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/421221
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5713e5cb04133be7b65a2f2293af62ffa358e379
Submitter: Jenkins
Branch: master

commit 5713e5cb04133be7b65a2f2293af62ffa358e379
Author: Matt Riedemann <email address hidden>
Date: Tue Jan 17 06:17:57 2017 -0500

    Add more details when test_create_delete_server_with_instance_update fails

    This test fails in non-deterministic ways so this change adds more
    detail to the failure message when we don't get as many instance.update
    notifications as expected.

    Change-Id: Iec863aa7a15b504f4a100801d7c1a18934ebb299
    Related-Bug: #1657087

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a more detailed failure with the notifications received:

http://logs.openstack.org/45/403745/11/gate/gate-nova-tox-db-functional-ubuntu-xenial/76b1646/console.html#_2017-01-23_15_02_47_382251

We need to compare that to what's expected and figure out where the leak is.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I did the comparison. Based on the received state updates the last instance.update is missing when the state goes from building (spawning) to active (none)

[{'new_task_state': u'scheduling',
  'old_state': u'building',
  'old_task_state': u'scheduling',
  'state': u'building'},
 {'new_task_state': None,
  'old_state': u'building',
  'old_task_state': u'scheduling',
  'state': u'building'},
 {'new_task_state': None,
  'old_state': u'building',
  'old_task_state': None,
  'state': u'building'},
 {'new_task_state': u'networking',
  'old_state': u'building',
  'old_task_state': u'networking',
  'state': u'building'},
 {'new_task_state': u'block_device_mapping',
  'old_state': u'building',
  'old_task_state': u'networking',
  'state': u'building'},
 {'new_task_state': u'spawning',
  'old_state': u'building',
  'old_task_state': u'block_device_mapping',
  'state': u'building'}]

I will try to find the root cause of the missing update.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I have a guess. Based on the notification sending code [1] the instance.update is sent _after_ the DB was updated with the new values. So theoretically when the instance goes to the ACTIVE state the DB update happen earlier than the last instance.update that contains the ACTIVE state. So there can be a race condition in the test because the test polls the REST API and waits for the ACTIVE state.

I will propose a patch where the test will wait for a bit for the last instance.update to arrive.

[1] https://github.com/openstack/nova/blob/master/nova/objects/instance.py#L788-L802

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/424574

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/424574
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=189a4a7c76d224a227c09f2b34b56ef91c1680c1
Submitter: Jenkins
Branch: master

commit 189a4a7c76d224a227c09f2b34b56ef91c1680c1
Author: Balazs Gibizer <email address hidden>
Date: Tue Jan 24 11:38:59 2017 +0100

    Fix race condition in instance.update sample test

    The test case test_create_delete_server_with_instance_update failed randomly
    because the last expected instance.update notification was not received due
    to a race condition.

    The test polls the REST API and waits for the instance to reach ACTIVE state.
    However the last instance.update is sent from the instance.save after the
    state of the instance is updated in the DB. Therefore it is possible that
    the test saw ACTIVE instance state before the last instance.update
    notification is emitted.

    This patch introduce an extra wait call to receive the last notification
    as well.

    Closes-Bug: #1657087
    Change-Id: I438bbb6de0733d0c3fd8361fdc4149c88abb53aa

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Based on logstash the fix really removed the race condition from the test as there was no failure after the patch was merged.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0rc1

This issue was fixed in the openstack/nova 15.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.