stats output in reconstructor.py gives wrong device count

Bug #1488608 reported by Caleb Tennis
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Low
Bill Huber

Bug Description

The stats output of the reconstructor for our cluster looks like this:

430/430 (100.00%) partitions of 26/13 (200.00%) devices ...

The 26/13 is incorrect obvious. There are 13 physical devices, but we have two separate EC storage policies on this machine, so I believe it is counting 26 because it's counting one for each policy.

Tags: ec
Revision history for this message
Caleb Tennis (ctennis) wrote :

It's the 'reconstruction_device_count' that's being incorrectly calculated in reconstructor.py.

Note that this value being off also impacts the compute_eta function, which reports negative time to completion.

tags: added: security
tags: removed: security
paul luse (paul-e-luse)
tags: added: ec
Revision history for this message
clayg (clay-gerrard) wrote :

For some back story see attempt #1 to make this better [1] and it's related lp bug #1468298

1. https://review.openstack.org/#/c/195275/

Revision history for this message
Caleb Tennis (ctennis) wrote :

I want to point out that we are running the patch from https://review.openstack.org/#/c/195275/ in our testing, my notes above apply to the fact that it's not fixed in that patchset.

Bill Huber (wbhuber)
Changed in swift:
status: New → Confirmed
Revision history for this message
Bill Huber (wbhuber) wrote :

To reproduce this bug on the latest (master) this morning:

1) Create two more EC policies in swift.conf
2) Build the object rings for both policies
3) Ensure both rings point to the same drives
4) Create a container for first policy
5) Create a container for second policy
6) Upload objects to both containers
7) Run object-reconstructor from any object servers
8) Output:
Sep 8 17:38:52 wbhuber-swift object-reconstructor: 908/908 (100.00%) partitions of 6/2 (300.00%) devices reconstructed in 11.28s (80.51/sec, -7s remaining

Bill Huber (wbhuber)
Changed in swift:
assignee: nobody → Bill Huber (wbhuber)
Revision history for this message
Bill Huber (wbhuber) wrote :

I would like to have core members weigh in on this bug - how much of a value is there when someone fixes this bug for a case in which the cluster has two EC storage policies running? I don't know if someone out there in the field would like to have the cluster running jerasure policy with 10+4 scheme and then liberasurecode policy with a different scheme - it just seems unlikely to me at the point. I could be wrong, but, thus, core members' input is desired here.

Basically, the reconstructor iterates each policy and checks the devices associated with the policy ring and assigns them to self.device_count instead of increments them.

The quickest way to remediate this bug would be the following in reconstructor.py, collect_parts method, but I haven't tested much yet:

Lines 800 thru 803:
if override_devices:
   self.device_count += len(override_devices)
else:
    self.device_count += len(local_devices)

Revision history for this message
John Dickinson (notmyname) wrote :

Multiple EC policies in the same cluster are about as likely (IMO) as multiple replicated policies. That is, I don't think it will be the majority of clusters, but it also won't be uncommon.

Changed in swift:
importance: Undecided → Low
Revision history for this message
clayg (clay-gerrard) wrote :

I think most of this is already underway in https://review.openstack.org/#/c/283946/6

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/283946
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=b09360d447447845898c9517742613ffecbe01c4
Submitter: Jenkins
Branch: master

commit b09360d447447845898c9517742613ffecbe01c4
Author: Kota Tsuyuzaki <email address hidden>
Date: Sun Oct 30 22:24:18 2016 -0700

    Fix stats calculation in object-reconstructor

    This patch fixes the object-reconstructor to calculate device_count
    as the total number of local devices in all policies. Previously
    Swift counts it for each policy but reconstruction_device_count
    which means the number of devices actually swift needs to reconstruct
    is counted as sum of ones for all polices.

    With this patch, Swift will gather all local devices for all policies
    at first, and then, collect parts for each devices as well as current.
    To do so, we can see the statuses for remaining job/disks percentage via
    stats_line output.

    To enable this change, this patch also touchs the object replicator
    to get a DiskFileManager via the DiskFileRouter class so that
    DiskFileManager instances are policy specific. Currently the same
    replication policy DiskFileManager class is always used, but this
    change future proofs the replicator for possible other DiskFileManager
    implementations.

    The change also gives the ObjectReplicator a _df_router variable,
    making it consistent with the ObjectReconstructor, and allowing a
    common way for ssync.Sender to access DiskFileManager instances via
    it's daemon's _df_router instance.

    Also, remove the use of FakeReplicator from the ssync test suite. It
    was not necessary and risked masking divergence between ssync and the
    replicator and reconstructor daemon implementations.

    Co-Author: Alistair Coles <email address hidden>

    Closes-Bug: #1488608
    Change-Id: Ic7a4c932b59158d21a5fb4de9ed3ed57f249d068

Changed in swift:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.13.0

This issue was fixed in the openstack/swift 2.13.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.