register_storage_use JSON error

Bug #1194696 reported by Sam Morrison
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Morgan Fainberg
Milestone: 2013.2

Bug Description

When using shared storage and image cache cleanup I get the following error:

2013-06-26 09:28:39.483 28612 ERROR nova.manager [-] Error during ComputeManager._run_image_cache_manager_pass: Extra data: line 1 column 51 - line 1 column 52 (char 51 - 52)
2013-06-26 09:28:39.483 28612 TRACE nova.manager Traceback (most recent call last):
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/manager.py", line 241, in periodic_tasks
2013-06-26 09:28:39.483 28612 TRACE nova.manager task(self, context)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4042, in _run_image_cache_manager_pass
2013-06-26 09:28:39.483 28612 TRACE nova.manager storage_users.register_storage_use(CONF.instances_path, CONF.host)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 228, in inner
2013-06-26 09:28:39.483 28612 TRACE nova.manager retval = f(*args, **kwargs)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/dist-packages/nova/virt/storage_users.py", line 40, in register_storage_use
2013-06-26 09:28:39.483 28612 TRACE nova.manager d = json.loads(f.read())
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
2013-06-26 09:28:39.483 28612 TRACE nova.manager return _default_decoder.decode(s)
2013-06-26 09:28:39.483 28612 TRACE nova.manager File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
2013-06-26 09:28:39.483 28612 TRACE nova.manager raise ValueError(errmsg("Extra data", s, end, len(s)))

The content of my compute_nodes file is:

{"cc1": 1372201741.468479, "cc2": 1372201743.64854}}

You'll see it has an extra }
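
For reference, json.loads raises exactly this kind of "Extra data" error whenever anything follows a complete JSON document. A quick standalone check (Python) against the quoted contents:

    import json

    # The compute_nodes contents quoted above: a valid object plus a stray "}".
    data = '{"cc1": 1372201741.468479, "cc2": 1372201743.64854}}'

    try:
        json.loads(data)
    except ValueError as exc:
        # Prints an "Extra data: ..." message; the exact offsets depend on the input.
        print(exc)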

Tags: compute
Revision history for this message
Sam Morrison (sorrison) wrote :

Got:

{"cc1": 1372208115.092, "cc2": 1372208111.789594}87}

That is what was in the file this time; sometimes it works and sometimes it doesn't.

tags: added: compute
Revision history for this message
melanie witt (melwitt) wrote :

Seems like a synchronization issue on the compute_nodes file, though the @synchronized decorator is being used.

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
melanie witt (melwitt)
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

This is caused by the ``instances_path`` config option not being passed as the ``lock_path`` argument to the synchronized decorator. The bigger issue is that, with this race condition in play, it is theoretically possible to lose an in-use base file because of this bug.

Scenario:
2 Nodes

Node 1 goes to register itself for use of the storage
Before Node 1 writes, Node 2 reads the file
Node 1 writes the file with itself in it
Node 2 writes the file with itself but not Node 1 in it

ImageCacheManager then runs on Node 2. During its cleanup task, Node 1's instances are not guaranteed to be considered when building the set of "in-use" backing files, so those backing files are cleaned up.

The more common outcome is a corrupted registration file, but there is a non-zero chance of removing a backing file that another node is still using.
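
To make the scenario concrete, here is a rough sketch (simplified, not the actual nova code) of the read-modify-write that register_storage_use() in nova/virt/storage_users.py performs on the shared compute_nodes file. If the lock protecting it only lives on each node's local disk, two nodes can interleave these steps freely:

    import json
    import os
    import time

    def register_storage_use(storage_path, hostname):
        # Simplified sketch of the nova/virt/storage_users.py behaviour.
        registry = os.path.join(storage_path, 'compute_nodes')

        d = {}
        if os.path.exists(registry):
            with open(registry) as f:
                d = json.loads(f.read())   # step 1: read the shared file

        d[hostname] = time.time()          # step 2: add/update this host

        with open(registry, 'w') as f:
            f.write(json.dumps(d))         # step 3: write the whole blob back

    # With a node-local lock, Node 1 and Node 2 can both pass step 1 before
    # either reaches step 3: the last writer erases the other's entry, and
    # unserialized concurrent writes can also leave corrupted contents like
    # the trailing "}" shown earlier.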

Revision history for this message
Sam Morrison (sorrison) wrote :

I think this code needs to go or be fixed somehow so it works.
The potential to lose instance data is just too high when using shared storage and image cache cleanup.

tags: added: grizzly-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/47943

Changed in nova:
assignee: nobody → Morgan Fainberg (mdrnstm)
status: Confirmed → In Progress
tags: added: havana-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/47943
Committed: http://github.com/openstack/nova/commit/150af94d881482d3d2fe1ac7c2e71ff1179d16ca
Submitter: Jenkins
Branch: master

commit 150af94d881482d3d2fe1ac7c2e71ff1179d16ca
Author: Morgan Fainberg <email address hidden>
Date: Mon Sep 23 17:24:46 2013 -0700

    Correct lock path for storage-registry-lock

    This fixes the lock path for storage-registry-lock to prevent a
    potential data loss due to a race-condition scenario. The
    storage-registry-lock is used to modify a json blob that is
    authoritative as to which nodes are using the storage, so that
    we don't remove any "in-use" backing files on the shared storage
    segment (if in use).

    The potential data-loss could occur in limited cases of the race
    that the lock should prevent. This can only occur if the image
    cache cleanup is triggered when a node has been eliminated from
    the register_storage_use JSON incorrectly by this race condition.

    Most of the time the race simply results in a corrupted json blob.

    Fixes-bug: 1194696
    Change-Id: Ib1a5785646e75dc90bdd2a7e69a697adc4fbb298
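
For illustration only, not a copy of the merged change: the commit title suggests making the external lock live under the shared instances_path rather than a node-local default, so every compute node contends for the same lock file. Assuming the lockutils synchronized decorator of that era accepts a lock_path argument, the shape of the fix is roughly:

    import json
    import os
    import time

    from nova.openstack.common import lockutils

    def register_storage_use(storage_path, hostname):
        # Hypothetical sketch: take the external file lock under storage_path
        # (the shared instances_path) instead of the node-local default, so
        # all compute nodes serialize on the same lock file.
        @lockutils.synchronized('storage-registry-lock', 'nova-',
                                external=True, lock_path=storage_path)
        def do_register():
            registry = os.path.join(storage_path, 'compute_nodes')
            d = {}
            if os.path.exists(registry):
                with open(registry) as f:
                    d = json.loads(f.read())
            d[hostname] = time.time()
            with open(registry, 'w') as f:
                f.write(json.dumps(d))

        do_register()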

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc1 → 2013.2
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential havana-rc-potential