Some partitions of a 3-replica Swift cluster may have only 2 replicas

Bug #1370070 reported by Florent Flament
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Christian Schwede
Milestone: 2.2.1

Bug Description

Starting from a 3-replica, 3-device, 1-region Swift cluster, we end up
with some partitions having only 2 replicas after adding a device in a
new region.

----- Script to reproduce bug -----
swift-ring-builder test.builder create 9 3 0

# Create a 3 replicas / 3 devices / 1 region Swift cluster
for d in $(seq 1 3); do
  swift-ring-builder test.builder add r1z0-127.0.0.1:6000/sda${d} 100
done
swift-ring-builder test.builder rebalance 0

# Add 1 device to a new region
swift-ring-builder test.builder add r2z0-127.0.0.1:6000/sdb1 10
swift-ring-builder test.builder rebalance 0

# Displaying number of replicas for each partition
swift-ring-builder test.builder list_parts 127.0.0.1 | awk '{print $2}' | sort -n | uniq -c

# Result:
# Reassigned 512 (100.00%) partitions. Balance is now 0.91.
# 1 Matches   (the header line of the list_parts output)
# 3 2
# 509 3
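The shell pipeline above builds a histogram of the second column of the list_parts output. A minimal Python sketch of the same counting follows; it assumes the output is one header line followed by one "partition matches" pair per line, and the sample text is made up for illustration. It shows that the "1 Matches" entry is only the header being counted.

```python
from collections import Counter


def replica_histogram(list_parts_output, skip_header=True):
    """Count how many partitions report each number of matching replicas.

    Mirrors `awk '{print $2}' | sort -n | uniq -c`, optionally skipping
    the first non-empty line (assumed to be the header).
    """
    lines = [line for line in list_parts_output.splitlines() if line.strip()]
    if skip_header:
        lines = lines[1:]
    histogram = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            histogram[fields[1]] += 1  # second column = number of matches
    return histogram


# Made-up three-partition sample in the assumed list_parts layout
sample = """Partition   Matches
0           3
1           3
2           2
"""
print(sorted(replica_histogram(sample).items()))
# prints [('2', 1), ('3', 2)]
print(sorted(replica_histogram(sample, skip_header=False).items()))
# prints [('2', 1), ('3', 2), ('Matches', 1)]
```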

Christian Schwede (cschwede) wrote :

Looks like the output of swift-ring-builder test.builder list_parts is wrong; using your example builder file and the following Python code, I see 512 partitions for every replica:

from swift.common.ring.builder import RingBuilder

rb = RingBuilder.load('test.builder')
# Number of partitions assigned in each replica row
for replica in rb._replica2part2dev:
    print(len(replica))

Looking into the swift-ring-builder list_parts code.

Changed in swift:
status: New → In Progress
assignee: nobody → Christian Schwede (cschwede)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121893

Christian Schwede (cschwede) wrote :

The code above only prints the length of each replica row; this version, which counts the replicas of each partition directly, is better:

from collections import defaultdict

from swift.common.ring.builder import RingBuilder

rb = RingBuilder.load('test.builder')

# Count, for every partition, how many replica rows assign it a device
partition_count = defaultdict(int)
for replica in rb._replica2part2dev:
    for partition, _device in enumerate(replica):
        partition_count[partition] += 1

for partition, count in sorted(partition_count.items()):
    print('%d %d' % (partition, count))

This is similar to the approach taken in the submitted patch.
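The per-partition counting loop above can be tried without a builder file. A minimal sketch, assuming a hand-made table in the same shape as rb._replica2part2dev (one row per replica, where row[partition] is a device id); the table values below are made up for illustration:

```python
from collections import Counter

# Hypothetical toy table: 3 replicas x 4 partitions, entries are device ids
replica2part2dev = [
    [0, 1, 2, 3],
    [1, 2, 3, 0],
    [2, 3, 0, 1],
]


def replicas_per_partition(table):
    """Return {partition: number of replica rows that assign it a device}."""
    counts = Counter()
    for row in table:
        for partition, _device in enumerate(row):
            counts[partition] += 1
    return dict(counts)


per_part = replicas_per_partition(replica2part2dev)
# Histogram of replica counts, like `sort -n | uniq -c` on list_parts output
print(sorted(Counter(per_part.values()).items()))
# prints [(3, 4)]
```

Every partition appears once in each of the three rows, so all four partitions report 3 replicas. A row shorter than the others would leave the partitions past its end with a lower count.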

OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/121893
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0a5268c34caa25487c48380a1821e4deac178538
Submitter: Jenkins
Branch: master

commit 0a5268c34caa25487c48380a1821e4deac178538
Author: Christian Schwede <email address hidden>
Date: Tue Sep 16 14:46:08 2014 +0000

    Fix bug in swift-ring-builder list_parts

    The number of replicas shown in the partition list might differ from the
    actual number of replicas (as shown in the bug report).

    This code simply iterates over builder._replica2part2dev and
    records the number of replicas for each partition.

    The code to find the partitions was moved to swift/common/ring/utils.py
    to make it easier to test, and a test to ensure the correct number of
    replicas is returned was added.

    Closes-Bug: 1370070
    Change-Id: Id6a3ed437bb86df2f43f8b0b79aa8ccb50bbe13e
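The commit message describes moving the partition-finding logic into swift/common/ring/utils.py so it can be unit tested. A rough sketch of that idea follows; the name find_matching_parts, its signature, and the descending sort order are assumptions for illustration, not the actual Swift code, and a plain list of lists stands in for the builder.

```python
import unittest


def find_matching_parts(replica2part2dev, dev_ids):
    """Hypothetical helper: count, per partition, replicas on matching devices.

    replica2part2dev: one row per replica; row[partition] is a device id.
    dev_ids: ids of the devices selected by a search (e.g. by IP).
    Returns (partition, count) pairs, highest count first.
    """
    dev_ids = set(dev_ids)
    counts = {}
    for row in replica2part2dev:
        for partition, device in enumerate(row):
            if device in dev_ids:
                counts[partition] = counts.get(partition, 0) + 1
    # Sort by count (descending), then by partition for stable output
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


class TestFindMatchingParts(unittest.TestCase):
    table = [[0, 1, 2], [1, 2, 3], [2, 3, 0]]

    def test_all_devices_match_gives_full_replica_count(self):
        result = find_matching_parts(self.table, [0, 1, 2, 3])
        self.assertEqual(result, [(0, 3), (1, 3), (2, 3)])

    def test_subset_of_devices(self):
        # Device 0 holds partition 0 (replica 0) and partition 2 (replica 2)
        self.assertEqual(find_matching_parts(self.table, [0]),
                         [(0, 1), (2, 1)])
```

Saved as a module, the test case can be run with python -m unittest; keeping the counting in a pure function is what makes it testable without a builder file.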

Changed in swift:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/138165

OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/138165
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0d3ebf09b94b41782b2c2a6bbcf255bf1203eca0
Submitter: Jenkins
Branch: feature/ec

commit 977d7c14daa38ab9c9d79bbf8b92371024b93fc8
Author: John Dickinson <email address hidden>
Date: Wed Nov 26 14:19:08 2014 -0800

    Fix tempfile bugs from commit 6978275

    Commit 6978275 changed xprofile middleware's usage of mktemp
    and moved to using tempfile. But it was clearly never tested,
    because the os.close() calls never worked. This patch updates
    that previous patch to use a context to open and close the file.

    Change-Id: I40ee42e8539551fd8e4dfb353f50146ab40a7847

commit dec97fc3ba2c71884f1c098e7d9cd1f709f74958
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Nov 26 06:13:29 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ibf319f7cc1b5036ad8031776cf2c6018fb8a0159

commit 01f6e860066640a2ba1406a23c93a72b34ec495e
Author: Clay Gerrard <email address hidden>
Date: Fri Nov 21 17:28:13 2014 -0800

    Add Expected Failure for ssync with sys-meta

    Sysmeta included with an object PUT persists with the PUT data - if an
    internal operation such as POST-as-copy during partial failure, or ssync
    with fast-POST (not supported), causes that data to be lost then the
    associated sysmeta will also be lost.

    Since object sys-meta persistence in the face of a POST when the
    original .data is unavailable requires fast-POST with .meta files the
    probetest that validates object sys-meta persistence of a POST when the
    most up-to-date copy of the object with sys-meta is unavailable
    configures an InternalClient with object_post_as_copy = false.

    This non-default configuration option is not supported by ssync and
    results in a loss of sys-meta very similar to the object sys-meta
    failure you would see with object_post_as_copy = true when the COPY part
    of the POST is unable to retrieve the most recently written object with
    sys-meta.

    Until we can fix the default POST behavior to make metadata updates
    without stomping on newer data file timestamps we should expect object
    sys-meta to be "very very best possible but not really guaranteed
    effort".

    Until we can fix ssync to replicate metadata updates without stomping on
    newer data file timestamps we should expect this test to fail.

    When ssync replication of fast-POST metadata update is fixed this test
    will fail signaling that the expected failure cruft should be removed,
    but other parts of ssync replication will still work and some other bugs
    can be fixed while we wait.

    Change-Id: Ifc5d49514de79b78f7715408e0fe0908357771d3

commit a8751ae557616cab1cafd98a338cad352526a262
Author: Cedric Dos Santos <email address hidden>
Date: Tue Nov 25 12:37:05 2014 +0100

    Correct misspelled words

    In some files I found misspelled words.

    bin/swift-reconciler-enqueue#l26
       prima...
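The "Fix tempfile bugs from commit 6978275" entry in the merge listing above replaces manual os.close() calls with a context manager. A generic sketch of that pattern follows, using tempfile.mkstemp and os.fdopen; it is not the actual xprofile middleware change, whose code is not shown here, and the helper name write_profile_data is hypothetical.

```python
import os
import tempfile


def write_profile_data(data, directory=None):
    """Hypothetical helper: write bytes to a new temp file, return its path.

    tempfile.mkstemp() returns an open OS-level file descriptor and a path.
    Wrapping the descriptor with os.fdopen() in a with-block closes it
    exactly once, even if the write raises.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


path = write_profile_data(b'profile stats')
with open(path, 'rb') as f:
    print(f.read())
os.remove(path)  # mkstemp never deletes the file; the caller must
```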

Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.2.1
status: Fix Committed → Fix Released