Inconsistent info of availability zone (az) if the default az is replaced

Bug #1390033 reported by Trung Trinh
18
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
Sylvain Bauza

Bug Description

Affected version: stable/juno

Description:
On a single-node deployment of OpenStack (using DevStack), if the default availability zone of Nova is replaced by another one, then the API api.nova.server_list returns the list of VMs in which the info of availability zone is inconsistent. This results in the toggling effect of Horizon dashboard when displaying the list of instances (under tab "Project/Instances"). The toggling effect is caused by inconsistent info of availability zone whose values either the default one or the newly-created one.

This bug can easily be reproduced using Horizon dashboard as follows:
 - Go to tab "Admin/Host Aggregates" to create a new host aggregate which is assigned with the current host. Pls note that if this newly-created host aggregate is not assigned with any host, then the availability zone won't be defined.
 - After that, also under this view, we find (in the Availability zones) that the newly-created availability zone hides away the default one.
 - Go to tab "System information", we find that nova-compute service is running in the newly-created availability zone while all the cinder services are running in the default availability zone.
 - Go to tab "Project/Image" to select some image for creating a new bootable volume.
 - This newly-created volume is then used to launch a new VM
 - After launching a new VM, it's auto re-directed to the view of "Instances". At here, we can observe the toggling effect on the availability zone info.

Analysis:
Root cause is due to the API api.nova.server_list as described above.
This can be seen by adding some more debug info as follows:

2014-11-06 10:30:05,103 - my_logger - DEBUG - openstack_dashboard.dashboards.project.instances.views - Instance amount: 1, Instances: "[<Server: {'status': u'BUILD', 'OS-EXT-STS:task_state': u'scheduling', 'addresses': {}, 'name': u'from_vol_cirr_nova', 'links': [{u'href': u'http://192.168.56.103:8774/v2/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'self'}, {u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'bookmark'}], 'created': u'2014-11-06T10:30:03Z', 'key_name': None, 'image': u'', 'OS-DCF:diskConfig': u'AUTO', 'image_name': '-', 'OS-EXT-STS:power_state': 0, 'OS-EXT-SRV-ATTR:host': None, 'OS-EXT-SRV-ATTR:instance_name': u'instance-00000004', 'tenant_id': u'a0c581f7a88441ed84e9878fa9fc8e50', 'user_id': u'2f8c907029eb43e5ab98a55ac28c885e', 'flavor': {u'id': u'1', u'links': [{u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/flavors/1', u'rel': u'bookmark'}]}, 'OS-EXT-AZ:availability_zone': u'nova', 'id': u'd5c1575d-8ac8-4921-802c-e9b121acd82e', 'metadata': {}}>]"

2014-11-06 10:31:02,037 - my_logger - DEBUG - openstack_dashboard.dashboards.project.instances.views - Instance amount: 1, Instances: "[<Server: {'status': u'ACTIVE', 'OS-EXT-STS:task_state': None, 'addresses': {u'public': [{u'OS-EXT-IPS-MAC:mac_addr': u'fa:16:3e:13:01:55', u'version': 4, u'addr': u'172.24.4.6', u'OS-EXT-IPS:type': u'fixed'}]}, 'name': u'from_vol_cirr_nova', 'links': [{u'href': u'http://192.168.56.103:8774/v2/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'self'}, {u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'bookmark'}], 'created': u'2014-11-06T10:30:03Z', 'key_name': None, 'image': u'', 'OS-DCF:diskConfig': u'AUTO', 'image_name': '-', 'OS-EXT-STS:power_state': 1, 'OS-EXT-SRV-ATTR:host': u'ubuntu', 'OS-EXT-SRV-ATTR:instance_name': u'instance-00000004', 'tenant_id': u'a0c581f7a88441ed84e9878fa9fc8e50', 'user_id': u'2f8c907029eb43e5ab98a55ac28c885e', 'flavor': {u'id': u'1', u'links': [{u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/flavors/1', u'rel': u'bookmark'}]}, 'OS-EXT-AZ:availability_zone': u'test_az', 'id': u'd5c1575d-8ac8-4921-802c-e9b121acd82e', 'metadata': {}}>]"

2014-11-06 10:31:32,437 - my_logger - DEBUG - openstack_dashboard.dashboards.project.instances.views - Instance amount: 1, Instances: "[<Server: {'status': u'ACTIVE', 'OS-EXT-STS:task_state': None, 'addresses': {u'public': [{u'OS-EXT-IPS-MAC:mac_addr': u'fa:16:3e:13:01:55', u'version': 4, u'addr': u'172.24.4.6', u'OS-EXT-IPS:type': u'fixed'}]}, 'name': u'from_vol_cirr_nova', 'links': [{u'href': u'http://192.168.56.103:8774/v2/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'self'}, {u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/servers/d5c1575d-8ac8-4921-802c-e9b121acd82e', u'rel': u'bookmark'}], 'created': u'2014-11-06T10:30:03Z', 'key_name': None, 'image': u'', 'OS-DCF:diskConfig': u'AUTO', 'image_name': '-', 'OS-EXT-STS:power_state': 1, 'OS-EXT-SRV-ATTR:host': u'ubuntu', 'OS-EXT-SRV-ATTR:instance_name': u'instance-00000004', 'tenant_id': u'a0c581f7a88441ed84e9878fa9fc8e50', 'user_id': u'2f8c907029eb43e5ab98a55ac28c885e', 'flavor': {u'id': u'1', u'links': [{u'href': u'http://192.168.56.103:8774/a0c581f7a88441ed84e9878fa9fc8e50/flavors/1', u'rel': u'bookmark'}]}, 'OS-EXT-AZ:availability_zone': u'test_az', 'id': u'd5c1575d-8ac8-4921-802c-e9b121acd82e', 'metadata': {}}>]"

When searching for the 'availability_zone' in the log above, it can be seen that its value is really inconsistent. In fact, at the beginning when the 'task_state' is 'scheduling' then the availability zone is 'nova' (the default one). However, later the availability zone changes to 'test_az' (the newly-created one)

Tags: horizon nova
Changed in nova:
assignee: nobody → Trung Trinh (trung-t-trinh)
Changed in nova:
status: New → In Progress
Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

With some added debug logs, the most possible root cause is around the function "get_host_availability_zone()" in the module "nova.availability_zones".
In particular, in this function, if the variable "host" is None, then the returned "aggregates" and "metadata" will also be None. Consequently, the final retrieved value of "az" will be the default one of "nova" that is hard-coded in the module.

During the process of launching new VM instance, this function is called multiple times. At the first time of being called, the passed value of "host" is always None (and the task_state is SCHEDULING). However, for subsequent calls, the "host" is returned correctly (no longer None) and everything becomes correct again.

So, the question is why the "host" value passed is None when the task_state is SCHEDULING??

Further investigation is being carried out.

Horizon is not guilty in this story but it is affected.
Nova is the guilty one.

Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

Even if the default availability zone of Nova is not changed, the problem that the "host" value passed is None when the task_state is SCHEDULING still occurs.

So, this is really a general issue of Nova!

Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

One of the buggy locations was found out.
It is inside the function "get_instance_availability_zone()" of the module "nova.availability_zones".

Let's look at the its code snippet:

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    host = str(instance.get('host'))

    if not host:
        return None

It's quite easy to understand this code snippet that it tries to read value of the host from the "instance". If the "host" is empty or None, then the function will be returned and nothing will happen.
However, Python interpreter makes all surprised that even though the "host" value is None, but the if-clause does not hold and consequently the execution is continued with the "host" whose value is None. Finally, it leads to the fact that the "az" returned is the default one (as mentioned above).

For proof, please look at the output of the Python interactive session:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> host = str(None)
>>> print host
None
>>> if not host:
... print "if not host"
... else:
... print "else"
...
else
>>> var = None
>>> if not var:
... print "if not var"
... else:
... print "else"
...
if not var

So, it can be seen that str(None) returns an object whose value is None, but it is not interpreted False as expected!

However, this is still not the root cause of this bug! Further investigation is being carried out.

Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

The patch for this function is as follows:

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    host = instance.get('host')

    if not host:
        return None

    host = str(host)

    .....

no longer affects: horizon
Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

Root cause of this bug is as follows.

During the process of launching a new VM instance, the function "get_instance_availability_zone()" of the module "nova.availability_zones" is called multiple times to retrieve the info of availability zone of the VM being created.

This function, at first, it will try to get the info of availability zone from the memory cache. If the info of availability zone is not contained in memory cache, then the function continues retrieving the info of availability zone from the host aggregate.
However, the default availability zone (for example "nova") is always specified in the memory cache. Furthermore, the memory cache is not updated immediately after a new availability zone is created. Instead, the memory cache is updated for every one hour (the const AZ_CACHE_SECONDS is the proof) . Consequently, the if-clause never holds.

def get_instance_availability_zone(context, instance):
    """Return availability zone of specified instance."""
    host = str(instance.get('host'))

    if not host:
        return None

    cache_key = _make_cache_key(host)
    cache = _get_cache()
    az = cache.get(cache_key)

    if not az:
        elevated = context.elevated()
        az = get_host_availability_zone(elevated, host)
        cache.set(cache_key, az, AZ_CACHE_SECONDS)

    return az

It should be noted that the info of availability zone (during the process of launching new VM) is retrieved not only from the memory cache (as seen in this function) but it is also retrieved (somewhere else) directly from the host aggregate. Therefore, it results in the toggling effect of Horizon dashboard (i.e. value of availability zone is sometimes the default one and sometimes the newly-created one).

Proposal:
modify the function "get_instance_availability_zone()" of the module "nova.availability_zones" in such a manner that if the info of availability zone retrieved from memory cache is different from the one included in the VM instance (being launched), then memory cache will be updated manually and the info of availability zone included in the VM instance (being launched) will be returned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135566

Revision history for this message
Trung Trinh (trung-t-trinh) wrote :

This attached file contains the debug logs added manually to show that the az cache (implemented by the module "nova.openstack.common.memorycache" is changed un-expectedly.

In fact, have a close look at the following lines:

2014-11-19 18:03:58,932 - my_logger - DEBUG - nova.openstack.common.memorycache - Client::get - key:azcache-ubuntu, value:test-az, entire cache:{'azcache-ubuntu': (1416373265, u'test-az')}
2014-11-19 18:03:58,941 - my_logger - DEBUG - nova.availability_zones - get_instance_availability_zone - az_cache:test-az, cache:
 json: "<nova.openstack.common.memorycache.Client object at 0x7f8b56b93e10>"
 pickle: "ccopy_reg\n_reconstructor\np1\n(cnova.openstack.common.memorycache\nClient\np2\nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'cache'\np6\n(dp7\nS'azcache-ubuntu'\np8\n(I1416373265\nVtest-az\np9\ntp10\nssb."
2014-11-19 18:04:02,289 - my_logger - DEBUG - nova.openstack.common.memorycache - Client::get - key:azcache-ubuntu, value:nova, entire cache:{'azcache-ubuntu': (1416373238, 'nova')}
2014-11-19 18:04:02,292 - my_logger - DEBUG - nova.availability_zones - get_instance_availability_zone - az_cache:nova, cache:
 json: "<nova.openstack.common.memorycache.Client object at 0x7f8b56b93e10>"

We can find that the az cache is really changed un-expectedly by some unknown process because there does not exist the call of the "nova.openstack.common.memorycache - Client::set" between the two continuous calls of the "nova.openstack.common.memorycache - Client::get". So, it is really mysterious why the az cache is changed un-expectedly?

Is the module "nova.openstack.common.memorycache" buggy? or its user (that is the nova.availability_zones module) is buggy?

Changed in nova:
assignee: Trung Trinh (trung-t-trinh) → Pasquale Porreca (pasquale-porreca)
Changed in nova:
assignee: Pasquale Porreca (pasquale-porreca) → Trung Trinh (trung-t-trinh)
Changed in nova:
assignee: Trung Trinh (trung-t-trinh) → nobody
Changed in nova:
assignee: nobody → Trung Trinh (trung-t-trinh)
Changed in nova:
assignee: Trung Trinh (trung-t-trinh) → Pasquale Porreca (pasquale-porreca)
Changed in nova:
assignee: Pasquale Porreca (pasquale-porreca) → Trung Trinh (trung-t-trinh)
Changed in nova:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159014

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159015

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159019

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159023

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/159730

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Trung Trinh (<email address hidden>) on branch: master
Review: https://review.openstack.org/159014
Reason: Please ignore and do not review this Patch

Changed in nova:
milestone: none → kilo-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Trung Trinh (<email address hidden>) on branch: master
Review: https://review.openstack.org/135566

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Trung Trinh (<email address hidden>) on branch: master
Review: https://review.openstack.org/159019

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Trung Trinh (<email address hidden>) on branch: master
Review: https://review.openstack.org/159023

Changed in nova:
assignee: Trung Trinh (trung-t-trinh) → Hans Lindgren (hanlind)
Revision history for this message
John Garbutt (johngarbutt) wrote :

it doesn't feel like this should block the kilo release, removing from kilo-rc1 for now.

Changed in nova:
milestone: kilo-rc1 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/172906

Changed in nova:
assignee: Hans Lindgren (hanlind) → Sylvain Bauza (sylvain-bauza)
Changed in nova:
assignee: Sylvain Bauza (sylvain-bauza) → Alexis Lee (alexisl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/159730
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/159015
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Sylvain Bauza (<email address hidden>) on branch: master
Review: https://review.openstack.org/172906
Reason: Since we're removing the cache in https://review.openstack.org/#/c/221384/ that's no longer needed

Changed in nova:
assignee: Alexis Lee (alexisl) → Sylvain Bauza (sylvain-bauza)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/149562
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=479005ce5a35039639fa809b1a4705c9d3dc28c4
Submitter: Jenkins
Branch: master

commit 479005ce5a35039639fa809b1a4705c9d3dc28c4
Author: Hans Lindgren <email address hidden>
Date: Mon Nov 24 12:03:32 2014 +0100

    Consolidate code to get the correct availability zone of an instance

    This makes getting the availability zone of an instance use the same
    code by placing it inside the helper get_instance_availability_zone().

    Related-Bug: #1390033
    Change-Id: I69b98eacbc8dc91e65611d6bf07272b517fe350d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/172906
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=222085dcafb6d94adbf20ce82e3245c46735f105
Submitter: Jenkins
Branch: master

commit 222085dcafb6d94adbf20ce82e3245c46735f105
Author: Sylvain Bauza <email address hidden>
Date: Mon Apr 13 15:03:06 2015 +0200

    Invalidate AZ cache when the instance AZ information is different

    When a new instance is created, the host value is set to None. As a corrolar,
    its AZ field is set to the default_availability_zone value.

    That's only when the host field is provided (ie. once the instance is started) that
    the AZ information is also updated. Unfornately, as the AZ module is using a cache
    system, it leaves the old AZ unchanged which can make the instance AZ data wrong.

    In order to fix that corner case, we need to compare both cache and instance values
    and if different, invalidate the cache key and leave it updated correctly by calling
    the AZ related information of the aggregate the host belongs to.

    Change-Id: I1b47ae07cf8a4245df00b9dc25e6000428bdc4f2
    Closes-Bug: #1390033

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-rc1 → 12.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.