register_storage_use JSON error

Bug #1194696 reported by Sam Morrison
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Morgan Fainberg
Milestone: 2013.2

Bug Description

When using shared storage and image cache cleanup I get the following error:

2013-06-26 09:28:39.483 28612 ERROR nova.manager [-] Error during ComputeManager._run_image_cache_manager_pass: Extra data: line 1 column 51 - line 1 column 52 (char 51 - 52)
2013-06-26 09:28:39.483 28612 TRACE nova.manager Traceback (most recent call last):
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/manager.py", line 241, in periodic_tasks
2013-06-26 09:28:39.483 28612 TRACE nova.manager task(self, context)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4042, in _run_image_cache_manager_pass
2013-06-26 09:28:39.483 28612 TRACE nova.manager storage_users.register_storage_use(CONF.instances_path, CONF.host)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 228, in inner
2013-06-26 09:28:39.483 28612 TRACE nova.manager retval = f(*args, **kwargs)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/virt/storage_users.py", line 40, in register_storage_use
2013-06-26 09:28:39.483 28612 TRACE nova.manager d = json.loads(f.read())
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
2013-06-26 09:28:39.483 28612 TRACE nova.manager return _default_decoder.decode(s)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
2013-06-26 09:28:39.483 28612 TRACE nova.manager raise ValueError(errmsg("Extra data", s, end, len(s)))

The content of my compute_nodes file is:

{"cc1": 1372201741.468479, "cc2": 1372201743.64854}}

You'll see it has an extra }
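
For reference, json.loads raises exactly this kind of "Extra data" error whenever anything follows a complete JSON document. A quick standalone check (Python) against the quoted contents:

    import json

    # The compute_nodes contents quoted above: a valid object plus a stray "}".
    data = '{"cc1": 1372201741.468479, "cc2": 1372201743.64854}}'

    try:
        json.loads(data)
    except ValueError as exc:
        # Prints an "Extra data: ..." message; the exact offsets depend on the input.
        print(exc)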

Tags: compute
Revision history for this message
Sam Morrison (sorrison) wrote :

Got:

{"cc1": 1372208115.092, "cc2": 1372208111.789594}87}

That is what was in the file this time; sometimes it works and sometimes it doesn't.

tags: added: compute
Revision history for this message
melanie witt (melwitt) wrote :

Seems like a synchronization issue on the compute_nodes file, though the @synchronized decorator is being used.

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
melanie witt (melwitt)
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

This is caused by the ``instances_path`` config option not being passed as the ``lock_path`` argument to the synchronized decorator. The bigger issue is that, with this race condition in play, it is theoretically possible to lose an in-use base file because of this bug.

Scenario:
2 Nodes

Node 1 goes to register itself for use of the storage
Before Node 1 writes, Node 2 reads the file
Node 1 writes the file with itself in it
Node 2 writes the file with itself but not Node 1 in it

ImageCacheManager then runs on Node 2. During its cleanup task, Node 1's instances are not guaranteed to be considered when building the set of "in-use" backing files, so those backing files are cleaned up.

The more common outcome is a corrupted registration file, but there is a non-zero chance of removing a backing file that another node is still using.
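
To make the scenario concrete, here is a rough sketch (simplified, not the actual nova code) of the read-modify-write that register_storage_use() in nova/virt/storage_users.py performs on the shared compute_nodes file. If the lock protecting it only lives on each node's local disk, two nodes can interleave these steps freely:

    import json
    import os
    import time

    def register_storage_use(storage_path, hostname):
        # Simplified sketch of the nova/virt/storage_users.py behaviour.
        registry = os.path.join(storage_path, 'compute_nodes')

        d = {}
        if os.path.exists(registry):
            with open(registry) as f:
                d = json.loads(f.read())   # step 1: read the shared file

        d[hostname] = time.time()          # step 2: add/update this host

        with open(registry, 'w') as f:
            f.write(json.dumps(d))         # step 3: write the whole blob back

    # With a node-local lock, Node 1 and Node 2 can both pass step 1 before
    # either reaches step 3: the last writer erases the other's entry, and
    # unserialized concurrent writes can also leave corrupted contents like
    # the trailing "}" shown earlier.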

Revision history for this message
Sam Morrison (sorrison) wrote :

I think this code needs to go or be fixed somehow so it works.
The potential to lose instance data is just too high when using shared storage and image cache cleanup.

tags: added: grizzly-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/47943

Changed in nova:
assignee: nobody → Morgan Fainberg (mdrnstm)
status: Confirmed → In Progress
tags: added: havana-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/47943
Committed: http://github.com/openstack/nova/commit/150af94d881482d3d2fe1ac7c2e71ff1179d16ca
Submitter: Jenkins
Branch: master

commit 150af94d881482d3d2fe1ac7c2e71ff1179d16ca
Author: Morgan Fainberg <email address hidden>
Date: Mon Sep 23 17:24:46 2013 -0700

    Correct lock path for storage-registry-lock

    This fixes the lock path for storage-registry-lock to prevent a
    potential data loss due to a race-condition scenario. The
    storage-registry-lock is used to modify a json blob that is
    authoritative as to which nodes are using the storage, so that
    we don't remove any "in-use" backing files on the shared storage
    segment (if in use).

    The potential data-loss could occur in limited cases of the race
    that the lock should prevent. This can only occur if the image
    cache cleanup is triggered when a node has been eliminated from
    the register_storage_use JSON incorrectly by this race condition.

    Most of the time the race simply results in a corrupted json blob.

    Fixes-bug: 1194696
    Change-Id: Ib1a5785646e75dc90bdd2a7e69a697adc4fbb298
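
For illustration only, not a copy of the merged change: the commit title suggests making the external lock live under the shared instances_path rather than a node-local default, so every compute node contends for the same lock file. Assuming the lockutils synchronized decorator of that era accepts a lock_path argument, the shape of the fix is roughly:

    import json
    import os
    import time

    from nova.openstack.common import lockutils

    def register_storage_use(storage_path, hostname):
        # Hypothetical sketch: take the external file lock under storage_path
        # (the shared instances_path) instead of the node-local default, so
        # all compute nodes serialize on the same lock file.
        @lockutils.synchronized('storage-registry-lock', 'nova-',
                                external=True, lock_path=storage_path)
        def do_register():
            registry = os.path.join(storage_path, 'compute_nodes')
            d = {}
            if os.path.exists(registry):
                with open(registry) as f:
                    d = json.loads(f.read())
            d[hostname] = time.time()
            with open(registry, 'w') as f:
                f.write(json.dumps(d))

        do_register()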

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc1 → 2013.2
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential havana-rc-potential