OpenStack Object Storage (swift)

stats output in reconstructor.py gives wrong device count

Bug #1488608 reported by Caleb Tennis on 2015-08-25

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Object Storage (swift)	Fix Released	Low	Bill Huber

Bug Description

The stats output of the reconstructor for our cluster looks like this:

430/430 (100.00%) partitions of 26/13 (200.00%) devices ...

The 26/13 is incorrect obvious. There are 13 physical devices, but we have two separate EC storage policies on this machine, so I believe it is counting 26 because it's counting one for each policy.

Tags:

Revision history for this message

Caleb Tennis (ctennis) wrote on 2015-08-25:

It's the 'reconstruction_device_count' that's being incorrectly calculated in reconstructor.py.

Note that this value being off also impacts the compute_eta function, which reports negative time to completion.

tags:	added: security
tags:	removed: security

paul luse (paul-e-luse) on 2015-08-25

tags:

added: ec

Revision history for this message

clayg (clay-gerrard) wrote on 2015-08-25:

For some back story see attempt #1 to make this better [1] and it's related lp bug #1468298

1. https://review.openstack.org/#/c/195275/

Revision history for this message

Caleb Tennis (ctennis) wrote on 2015-08-27:

I want to point out that we are running the patch from https://review.openstack.org/#/c/195275/ in our testing, my notes above apply to the fact that it's not fixed in that patchset.

Bill Huber (wbhuber) on 2015-09-08

Changed in swift:
status:	New → Confirmed

Revision history for this message

Bill Huber (wbhuber) wrote on 2015-09-09:

To reproduce this bug on the latest (master) this morning:

1) Create two more EC policies in swift.conf
2) Build the object rings for both policies
3) Ensure both rings point to the same drives
4) Create a container for first policy
5) Create a container for second policy
6) Upload objects to both containers
7) Run object-reconstructor from any object servers
8) Output:
Sep 8 17:38:52 wbhuber-swift object-reconstructor: 908/908 (100.00%) partitions of 6/2 (300.00%) devices reconstructed in 11.28s (80.51/sec, -7s remaining

Bill Huber (wbhuber) on 2015-09-09

Changed in swift:
assignee:	nobody → Bill Huber (wbhuber)

Revision history for this message

Bill Huber (wbhuber) wrote on 2015-09-09:

I would like to have core members weigh in on this bug - how much of a value is there when someone fixes this bug for a case in which the cluster has two EC storage policies running? I don't know if someone out there in the field would like to have the cluster running jerasure policy with 10+4 scheme and then liberasurecode policy with a different scheme - it just seems unlikely to me at the point. I could be wrong, but, thus, core members' input is desired here.

Basically, the reconstructor iterates each policy and checks the devices associated with the policy ring and assigns them to self.device_count instead of increments them.

The quickest way to remediate this bug would be the following in reconstructor.py, collect_parts method, but I haven't tested much yet:

Lines 800 thru 803:
if override_devices:
self.device_count += len(override_devices)
else:
self.device_count += len(local_devices)

Revision history for this message

John Dickinson (notmyname) wrote on 2015-09-09:

Multiple EC policies in the same cluster are about as likely (IMO) as multiple replicated policies. That is, I don't think it will be the majority of clusters, but it also won't be uncommon.

John Dickinson (notmyname) on 2015-09-09

Changed in swift:
importance:	Undecided → Low

Revision history for this message

clayg (clay-gerrard) wrote on 2016-11-08:

I think most of this is already underway in https://review.openstack.org/#/c/283946/6

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-12-23: Fix merged to swift (master)

Reviewed: https://review.openstack.org/283946
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=b09360d447447845898c9517742613ffecbe01c4
Submitter: Jenkins
Branch: master

commit b09360d447447845898c9517742613ffecbe01c4
Author: Kota Tsuyuzaki <email address hidden>
Date: Sun Oct 30 22:24:18 2016 -0700

Fix stats calculation in object-reconstructor

    This patch fixes the object-reconstructor to calculate device_count
    as the total number of local devices in all policies. Previously
    Swift counts it for each policy but reconstruction_device_count
    which means the number of devices actually swift needs to reconstruct
    is counted as sum of ones for all polices.

    With this patch, Swift will gather all local devices for all policies
    at first, and then, collect parts for each devices as well as current.
    To do so, we can see the statuses for remaining job/disks percentage via
    stats_line output.

    To enable this change, this patch also touchs the object replicator
    to get a DiskFileManager via the DiskFileRouter class so that
    DiskFileManager instances are policy specific. Currently the same
    replication policy DiskFileManager class is always used, but this
    change future proofs the replicator for possible other DiskFileManager
    implementations.

    The change also gives the ObjectReplicator a _df_router variable,
    making it consistent with the ObjectReconstructor, and allowing a
    common way for ssync.Sender to access DiskFileManager instances via
    it's daemon's _df_router instance.

    Also, remove the use of FakeReplicator from the ssync test suite. It
    was not necessary and risked masking divergence between ssync and the
    replicator and reconstructor daemon implementations.

Co-Author: Alistair Coles <email address hidden>

Closes-Bug: #1488608
Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068