reconstructor list index out of range

Bug #1494359 reported by Caleb Tennis
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm not sure how I got the system into this state, but my reconstructor logs now have this error churning out every 30 seconds or so:

Sep 10 14:46:36 localhost object-reconstructor: Exception in top-levelreconstruction loop: #012Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 905, in reconstruct#012 jobs = self.build_reconstruction_jobs(part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 864, in build_reconstruction_jobs#012 jobs = self._get_part_jobs(**part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 754, in _get_part_jobs#012 sync_to=[part_nodes[fi]],#012IndexError: list index out of range
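For reference, the failing line indexes the partition's primary node list by a fragment index, so the index must be reaching past the end of the list the ring returned. A minimal standalone sketch of that failure mode (hypothetical values, not Swift's actual code):

    # 13 primaries come back from the ring, but the fragment index on disk is 13,
    # so indexing the node list raises IndexError just like the traceback above.
    part_nodes = [{'device': 'd%d' % i, 'index': i} for i in range(13)]
    fi = 13  # fragment index taken from on-disk data (hypothetical)

    try:
        sync_to = [part_nodes[fi]]  # mirrors sync_to=[part_nodes[fi]]
    except IndexError:
        print('IndexError: fi=%d but len(part_nodes)=%d' % (fi, len(part_nodes)))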

Tags: ec
Revision history for this message
Caleb Tennis (ctennis) wrote :

Actually, what's more interesting is that, as a result, reconstruction never gets past a certain point and just ends partially completed:

Sep 10 14:48:07 localhost object-reconstructor: Starting object reconstruction pass.
Sep 10 14:48:07 localhost object-reconstructor: Exception in top-levelreconstruction loop: #012Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 905, in reconstruct#012 jobs = self.build_reconstruction_jobs(part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 864, in build_reconstruction_jobs#012 jobs = self._get_part_jobs(**part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 754, in _get_part_jobs#012 sync_to=[part_nodes[fi]],#012IndexError: list index out of range
Sep 10 14:48:07 localhost object-reconstructor: 35/380 (9.21%) partitions of 1/8 (12.50%) devices reconstructed in 0.21s (165.61/sec, 18s remaining)
Sep 10 14:48:07 localhost object-reconstructor: Object reconstruction complete. (0.00 minutes)
Sep 10 14:48:37 localhost object-reconstructor: Starting object reconstruction pass.
Sep 10 14:48:38 localhost object-reconstructor: Exception in top-levelreconstruction loop: #012Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 905, in reconstruct#012 jobs = self.build_reconstruction_jobs(part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 864, in build_reconstruction_jobs#012 jobs = self._get_part_jobs(**part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 754, in _get_part_jobs#012 sync_to=[part_nodes[fi]],#012IndexError: list index out of range
Sep 10 14:48:38 localhost object-reconstructor: 35/380 (9.21%) partitions of 1/8 (12.50%) devices reconstructed in 0.20s (173.41/sec, 17s remaining)
Sep 10 14:48:38 localhost object-reconstructor: Object reconstruction complete. (0.00 minutes)
Sep 10 14:49:08 localhost object-reconstructor: Starting object reconstruction pass.
Sep 10 14:49:08 localhost object-reconstructor: Exception in top-levelreconstruction loop: #012Traceback (most recent call last):#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 905, in reconstruct#012 jobs = self.build_reconstruction_jobs(part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 864, in build_reconstruction_jobs#012 jobs = self._get_part_jobs(**part_info)#012 File "/usr/local/lib/python2.7/dist-packages/swift/obj/reconstructor.py", line 754, in _get_part_jobs#012 sync_to=[part_nodes[fi]],#012IndexError: list index out of range
Sep 10 14:49:08 localhost object-reconstructor: 35/380 (9.21%) partitions of 1/8 (12.50%) devices reconstructed in 0.21s (165.57/sec, 18s remaining)
Sep 10 14:49:08 localhost object-reconstructor: Object reconstruction complete. (0.00 minutes)

Revision history for this message
Caleb Tennis (ctennis) wrote :

Whatever caused this, it's impacting all 3 nodes in my cluster:

One node always gets to:

Sep 10 14:56:30 localhost object-reconstructor: 498/857 (58.11%) partitions of 5/8 (62.50%) devices reconstructed in 2.86s (174.42/sec, 5s remaining)

before erroring; the other gets to:

Sep 10 14:55:59 localhost object-reconstructor: 63/400 (15.75%) partitions of 1/8 (12.50%) devices reconstructed in 0.39s (162.89/sec, 19s remaining)

They all have the same exception in the logs every 30 seconds.

Revision history for this message
clayg (clay-gerrard) wrote :

That looks like len(part_nodes) <= fi.

There should always be at least as many part_nodes as whatever the fragment index might be - maybe that code is stupid - but I bet it's the ring. Can you validate this ring?

https://bugs.launchpad.net/swift/+bug/1452431

Or instrument a print there to figure out which partition and fragment index we're blowing up on?
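One quick way to check the ring itself is to ask how many primaries each partition returns and compare that to the replica count. A sketch, assuming the object-1 ring is loadable from /etc/swift (this is not the linked bug's validation script):

    # Every partition should return as many primaries as the ring has replicas;
    # the reconstructor indexes part_nodes by fragment index, so a short list
    # is enough to cause the IndexError above.
    from swift.common.ring import Ring

    ring = Ring('/etc/swift', ring_name='object-1')
    expected = ring.replica_count
    short = [part for part in range(ring.partition_count)
             if len(ring.get_part_nodes(part)) != expected]
    print('%d of %d partitions return the wrong number of primaries (expected %d)'
          % (len(short), ring.partition_count, expected))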

Revision history for this message
clayg (clay-gerrard) wrote :

Well... or maybe there's some crazy info in the on-disk hashes that doesn't match up with the current schema in swift.conf for the storage policy?

Have you changed EC settings at all? Maybe disks didn't get fully wiped out between the change?

Revision history for this message
Caleb Tennis (ctennis) wrote :

This is what the ring looked like at the time:

swiftstack@platform:/opt/ss/builder_configs/c57c2ecf-cf91-4db5-ad7b-669a8265b124$ swift-ring-builder object-1.builder
object-1.builder, build version 47
1024 partitions, 14.000000 replicas, 1 regions, 1 zones, 24 devices, 999.99 balance, 86.04 dispersion
The minimum number of hours before a partition can be reassigned is 1
The overload factor is 10000000.00% (100000.000000)
Devices: id region zone ip address port replication ip replication port name weight partitions balance meta
             0 1 1 172.30.3.43 6000 172.30.3.43 6003 d0 0.00 0 0.00
             1 1 1 172.30.3.43 6000 172.30.3.43 6003 d1 0.00 0 0.00
             2 1 1 172.30.3.43 6000 172.30.3.43 6003 d2 0.00 149 999.99
             3 1 1 172.30.3.43 6000 172.30.3.43 6003 d3 0.00 482 999.99
             4 1 1 172.30.3.43 6000 172.30.3.43 6003 d4 0.00 731 999.99
             5 1 1 172.30.3.43 6000 172.30.3.43 6003 d5 8.59 911 20.74
             6 1 1 172.30.3.43 6000 172.30.3.43 6003 d6 8.59 911 20.74
             7 1 1 172.30.3.43 6000 172.30.3.43 6003 d7 8.59 912 20.87
             8 1 1 172.30.3.44 6000 172.30.3.44 6003 d8 8.59 640 -15.18
             9 1 1 172.30.3.44 6000 172.30.3.44 6003 d9 8.59 640 -15.18
            10 1 1 172.30.3.44 6000 172.30.3.44 6003 d10 8.59 640 -15.18
            11 1 1 172.30.3.44 6000 172.30.3.44 6003 d11 8.59 640 -15.18
            12 1 1 172.30.3.44 6000 172.30.3.44 6003 d12 8.59 639 -15.31
            13 1 1 172.30.3.44 6000 172.30.3.44 6003 d13 8.59 641 -15.05
            14 1 1 172.30.3.44 6000 172.30.3.44 6003 d14 8.59 640 -15.18
            15 1 1 172.30.3.44 6000 172.30.3.44 6003 d15 8.59 640 -15.18
            16 1 1 172.30.3.45 6000 172.30.3.45 6003 d16 8.59 640 -15.18
            17 1 1 172.30.3.45 6000 172.30.3.45 6003 d17 8.59 640 -15.18
            18 1 1 172.30.3.45 6000 172.30.3.45 6003 d18 8.59 640 -15.18
            19 1 1 172.30.3.45 6000 172.30.3.45 6003 d19 8.59 640 -15.18
            20 1 1 172.30.3.45 6000 172.30.3.45 6003 d20 8.59 640 -15.18
            21 1 1 172.30.3.45 6000 172.30.3.45 6003 d21 8.59 ...


Revision history for this message
Caleb Tennis (ctennis) wrote :

Here's an strace before it blows up. It doesn't seem to get a response to a REPLICATE call:

https://gist.github.com/ctennis/7f252118ac5fc5b206b1

Revision history for this message
clayg (clay-gerrard) wrote :

https://gist.github.com/ctennis/7f252118ac5fc5b206b1#file-gistfile1-txt-L43 is the last response I see - but I don't see another request after that? Looks like it got back the remote hashes and decided to do some sync jobs and blew up?
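For context, the "remote hashes" step here is the reconstructor issuing a REPLICATE request to the remote object server and unpickling the suffix hashes from the response, roughly along these lines (a sketch, not the reconstructor's exact code, with placeholder node and partition values):

    # Ask a remote object server for its suffix hashes via REPLICATE; the
    # reconstructor compares these against local hashes to pick sync jobs.
    import pickle
    from swift.common.bufferedhttp import http_connect

    node = {'replication_ip': '172.30.3.44', 'replication_port': 6003,
            'device': 'd11'}        # placeholder values taken from the ring above
    partition = 955                 # placeholder partition

    conn = http_connect(node['replication_ip'], node['replication_port'],
                        node['device'], partition, 'REPLICATE', '',
                        headers={'X-Backend-Storage-Policy-Index': '1'})
    resp = conn.getresponse()
    remote_hashes = pickle.loads(resp.read())
    print(resp.status, len(remote_hashes))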

Revision history for this message
clayg (clay-gerrard) wrote :

ok, so this *is* a duplicate of https://bugs.launchpad.net/swift/+bug/1452431

    $ python validate_ec_ring_parts.py object-1.builder
    set([]) 14
    ERROR: all parts have 13 device ids for replicas instead of 14

^ pretty crappy, but explains the index error
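The same check can be run straight against the builder file; a minimal sketch of the idea behind that script, assuming swift's RingBuilder API (validate_ec_ring_parts.py itself isn't reproduced here and may differ):

    # For an EC ring every partition should be assigned to exactly `replicas`
    # devices; count assignments per partition from the builder's (private)
    # _replica2part2dev table.
    from swift.common.ring import RingBuilder

    builder = RingBuilder.load('object-1.builder')
    replicas = int(builder.replicas)   # 14.000000 for this policy
    bad = {}
    for part in range(builder.parts):
        assigned = [row[part] for row in builder._replica2part2dev
                    if part < len(row)]
        if len(assigned) != replicas:
            bad[part] = len(assigned)
    print('%d parts assigned to %s device ids instead of %d'
          % (len(bad), sorted(set(bad.values())), replicas))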

I instrumented the reconstructor's _get_part_jobs method:

    # diff reconstructor.py reconstructor.py.bak
    726d725
    < print partition
    729d727
    < print node
    752d749
    < print fi, len(part_nodes)

and so it's happening basically like I thought:

    object-reconstructor: STDOUT: 955
    object-reconstructor: STDOUT: {'index': 0, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.45', 'region': 1, 'id': 18, 'replication_ip': '172.30.3.45', 'device': 'd18', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 1, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.43', 'region': 1, 'id': 5, 'replication_ip': '172.30.3.43', 'device': 'd5', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 2, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.44', 'region': 1, 'id': 11, 'replication_ip': '172.30.3.44', 'device': 'd11', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 3, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.45', 'region': 1, 'id': 16, 'replication_ip': '172.30.3.45', 'device': 'd16', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 4, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.44', 'region': 1, 'id': 13, 'replication_ip': '172.30.3.44', 'device': 'd13', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 5, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.43', 'region': 1, 'id': 6, 'replication_ip': '172.30.3.43', 'device': 'd6', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 6, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.44', 'region': 1, 'id': 9, 'replication_ip': '172.30.3.44', 'device': 'd9', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 7, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.45', 'region': 1, 'id': 20, 'replication_ip': '172.30.3.45', 'device': 'd20', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 8, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.43', 'region': 1, 'id': 7, 'replication_ip': '172.30.3.43', 'device': 'd7', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 9, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.44', 'region': 1, 'id': 10, 'replication_ip': '172.30.3.44', 'device': 'd10', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 10, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.45', 'region': 1, 'id': 21, 'replication_ip': '172.30.3.45', 'device': 'd21', 'port': 6000}
    object-reconstructor: STDOUT: {'index': 11, 'replication_port': 6003, 'weight': 8.589934592, 'zone': 1, 'ip': '172.30.3.45', 'region': 1, 'id': 23, 'replication_ip': '172.30.3.45', 'device': 'd23', 'port': 6000}
    object-reconstructor: ...
