KeyError: 'storage_policy_index' in _really_merge_items swift/account/backend.py

Bug #1424108 reported by Omar Ali
18
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Critical
clayg

Bug Description

I upgraded some servers in our swift cluster from 1.12.0 to 2.2.2. One of the upgraded servers started having the error below.

Feb 20 15:57:44 localhost account-server: ERROR __call__ error with REPLICATE /d3/729530/b21baafc5e8847fdcce40eed9e373474 : #012
Traceback (most recent call last):#012
  File "/usr/lib/python2.7/site-packages/swift/account/server.py", line 269, in __call__#012
    res = method(req)#012
  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 2480, in wrapped#012
    return func(*a, **kw)#012
  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1054, in _timing_stats#012
    resp = func(ctrl, *args, **kwargs)#012
  File "/usr/lib/python2.7/site-packages/swift/account/server.py", line 230, in REPLICATE#012
    ret = self.replicator_rpc.dispatch(post_args, args)#012
  File "/usr/lib/python2.7/site-packages/swift/common/db_replicator.py", line 615, in dispatch#012
    return getattr(self, op)(self.broker_class(db_file), args)#012
  File "/usr/lib/python2.7/site-packages/swift/common/db_replicator.py", line 707, in merge_items#012
    broker.merge_items(args[0], args[1])#012
  File "/usr/lib/python2.7/site-packages/swift/account/backend.py", line 522, in merge_items#012
    _really_merge_items(conn)#012
  File "/usr/lib/python2.7/site-packages/swift/account/backend.py", line 466, in _really_merge_items#012
    rec['storage_policy_index']]#012KeyError: 'storage_policy_index'

Omar Ali (omarali)
description: updated
Revision history for this message
Pete Zaitcev (zaitcev) wrote :

This is the first time I encounter an SP-related schema migration failure. It worked so well that we, IIRC, never bothered with a tool for forced migrations. I would try to re-migrate step by step... e.g. 1.13->2.0, see if that helps. If that fails, it's the SQL wizardry turn, no other choice (unless someone has a tool already made).

Revision history for this message
clayg (clay-gerrard) wrote :

I think I see a bug, you may have to wait for a full account audit pass for the all the schemas to get migrated....

On the container implementation I see both in _commit_puts_load where it checks for the legacy entries in the .pending - but then also in _really_merge_items where it:

    item.setdefault('storage_policy_index', 0) # legacy

The account backend does the legacy compat dance in _commit_puts_load for pending container updates that weren't on disk, but it doesn't do anything in _really_merge_items that I can see for entries sent via replication directly? And since the db_replicator's get_items_since just returns whatever fields are in the table when it pulls out records it seems like you could get a db_replicator's _usync_db to ship over records to merge items that don't have storage_policy_index.

I really don't know how we didn't manage to catch this in all the upgrade testing, as I understand it this should happen a lot immediately following an upgrade until you auditors run over all the databases and call get_policy_stats with do_migrations (which should force the schema change and make the get_items_since ship all the expected fields to the remote node)

Can you confirm if these go away once the account auditor has ran - or after you put a new container into the account? Anything that forces the schema migration would fix this if I understand it correctly.

I'll redo the upgrade/replicate test - I'm guessing it wouldn't hurt to stick a setdefault into the account's really merge items like we did for containers.

Revision history for this message
Omar Ali (omarali) wrote :

If I'm understanding you correctly, the error will stop only after all three dbs' schemas are changed. Since I've only updated some of the servers, and since some dbs are on the upgraded servers and some dbs are on the old servers, I will keep receiving this error until I upgrade the remaining servers and wait for the schemas to change. Is that correct?

What about if I just patch _really_merge_items in swift/account/backend.py and add the line

    rec.setdefault('storage_policy_index', 0) # legacy

That should handle the case of rows that are missing the 'storage_policy_index' column without requiring me to upgrade all servers at once. Right?

Revision history for this message
clayg (clay-gerrard) wrote :

ok, so, there's more than one bug here :P

On one hand it's hard to get into this state and it tends to self heal if there's *any* write activity in the account, or pending with uncommitted data somewhere, but there's a number of things we could fix.

1) the account-auditor does a get_policy_stats call with do_migrations=True, but that only applies the that weird add container count migration (remember how we forgot container count in the policy_stat table!?), it definitely doesn't alter the containers table (that's normally handled in _migrate_add_storage_policy_index), and I'm not even sure if it will add the policy_stat table it doesn't exist.

2) in _really_merge_items we could just setdefault the records shipped across the wire in replication and avoid the KeyError - same as the container.backend does.

Omar,

If you still have some accounts that are acting like this it's because they were in the strange state of having been out-of-sync and also not having any recent activity (it's harder to be out of sync if there's not any activity) - so maybe a node was offline for awhile? Either way if replication doesn't fix it on it's own after a container-updater pass then none of the three replicas of the account have had their schema migrated and there's no new container info that will force it.

You could force it by making some activity - either through uploading an object (which will cause a container updater to send up new rows to account) or just creating and deleting a container - on the effected account(s).

If you don't want to deal with putting any data in them or you have more than one and don't want to track them all down a script to find get the db connection and call _migrate_add_storage_policy_index would work:

    """
    Usage:

    python migrate.py <path/to/db/file>
    """
    import sys
    from swift.account.backend import AccountBroker

    db_path = sys.argv[1]
    broker = AccountBroker(db_path)
    with broker.get() as conn:
        try:
            broker._migrate_add_storage_policy_index(conn)
        except Exception as e:
            if 'duplicate' not in str(e):
                raise
            print 'Already Migrated!'
        else:
            print 'Success!'

You could run it like this:

    for db in $(find /srv/node*/sdb*/account* -name \*.db); do python /vagrant/.scratch/migrate.py $db; done

Lot of over-head starting a new python process for each db, you might need to pull the os.walk into the script if you have more than a good number of accounts - or just be patient ;)

Hopefully we'll get it fixed up better for the next guy in the next version - thanks for the bug report!

Revision history for this message
jaKa (jaka-w) wrote :

since I seem to be hitting this as well (in a system with a fairly small number of accounts), intending to gradually replace nodes running swift 1.8 with nodes running swift 2.2 (adding the latter and phasing out the former), I was wondering:

do I just tolerate these errors until all the nodes are migrated (which will probably take a week or so)? or will I end up with a lost data in that case and I better do something before I proceed?

also, how does upgrading the account db schema affect remaining 1.8 nodes? I suppose they should have no problem with the upgraded schema.

Revision history for this message
Omar Ali (omarali) wrote :

@jaKa, I used the attached patch to stop getting KeyErrors. After all the nodes where upgraded, I used clayg's script to migrate inactive accounts' dbs. Then I removed the patch and didn't see any KeyErrors after that.

@clayg, No nodes were down for a while but we only use one account which gets so much activity that TimeOut errors happen ( we have four accounts defined in the config but the other three don't see any activity). Using your script, I was able to confirm that the main accout's db on the upgraded server was already migrated correctly. The KeyErrors must be because the account's other db files are on servers that were not yet upgraded. I used the attached patch to stop getting KeyErrors until all servers were upgraded. Then I removed it and didn't see any KeyErrors after that.

Revision history for this message
clayg (clay-gerrard) wrote :

This came up again, I'm going to take a crack at it.

Changed in swift:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → clayg (clay-gerrard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/177964

Changed in swift:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/177964
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=8cdf0fdebe9eb782322fccfc11253dc959cf321d
Submitter: Jenkins
Branch: master

commit 8cdf0fdebe9eb782322fccfc11253dc959cf321d
Author: Clay Gerrard <email address hidden>
Date: Mon Apr 27 13:29:50 2015 -0700

    Fix account replication during pre-storage-policy upgrade

    Old account schemas don't send the storage_policy_index key for container rows
    during replication, and if the recieving end is already running an upgraded
    server it is surprised with a KeyError. Normally this would work itself out
    if the old schema recieved any updates from container layer, or a new
    container is created, or requires a row sync from another account database -
    but if the account databases have rows out of sync and there's no activity in
    the account otherwise, there's nothing to force the old schemas to be
    upgraded.

    Rather than force the old schema that already has a complete set of container
    rows to migrate even in the absense of activity we can just fill in default
    legacy value for the storage policy index and allow the accounts to get back
    in sync and migrate the next time a container update occurs.

    FWIW, I never able to get a cluster upgrade to get stuck in this state without
    some sort of account failure that forced them to get their rows out of sync
    (in my cause I just unlinked a pending and then made sure to force all my
    account datbases to commit pending files before upgrading - leading to an
    upgraded cluster that absolutly needed account-replication to solve a row
    mismatch for inactive accounts with old schemas)

    Closes-Bug #1424108

    Change-Id: Iaf4ef834eb24f0e11a52cc22b93a864574fabf83

Changed in swift:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/202227

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (feature/hummingbird)

Change abandoned by Michael Barton (<email address hidden>) on branch: feature/hummingbird
Review: https://review.openstack.org/202227
Reason: Apparently I did this wrong.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/202230

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/hummingbird)
Download full text (72.8 KiB)

Reviewed: https://review.openstack.org/202230
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=f7cb1777e1b514b3345b9e516ed8f89ad1a4ae87
Submitter: Jenkins
Branch: feature/hummingbird

commit 51f806d3e3d3a1fcbc80d2f7d7ddbe5cc4c024c9
Author: John Dickinson <email address hidden>
Date: Tue Jul 14 20:49:08 2015 -0700

    remove Python 2.6 from the classifier

    Change-Id: I67233e9c7b69826242546bd6bd98c24b81070579

commit 278adf5c20101a191979ce1e4d6277e5f209149e
Author: Hisashi Osanai <email address hidden>
Date: Tue Jul 14 15:33:45 2015 +0900

    Make logic of unit tests responsive to the method names

    The two methods, test_authorize_succeeds_for_tenant_name_in_roles and
    test_authorize_succeeds_for_tenant_id_in_roles, have names that don't
    match what they are testing. tenant_name and tenant_id need to be
    switched.

    Change-Id: I7cb0a7d2b2111127fd5d6b55f2da6a3eadf2235d

commit 1cc3eff958fdd4fb07c2b74c52df7829d3125466
Author: Victor Stinner <email address hidden>
Date: Fri Jul 10 13:04:44 2015 +0200

    Fixes for mock 1.1

    The new release of mock 1.1 is more strict. It helped to find bugs in
    tests.

    Closes-Bug: #1473369
    Change-Id: Id179513c6010d827cbcbdda7692a920e29213bcb

commit ff192cfe5705324497a389aa2f22227d75dc0f8e
Author: janonymous <email address hidden>
Date: Wed Jul 8 18:38:22 2015 +0530

    Replace reduce and unichr , these are no longer available in py3

    * Replace reduce() with six.moves.reduce()
    * Replace unichr with six.unichr

    Change-Id: I2038e47e0a6522dd992fd2a4aeff981cf7750fe0

commit 4beceab4f4be99f14025815cf7ed4510ea77f460
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Jul 9 06:14:56 2015 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I9ff1dde06be45fc7d6c441a1e1c07221f839a9a1

commit 56ee39a7e13417203c5e1816d7a3184a07f85826
Author: Matthew Oliver <email address hidden>
Date: Thu Jul 9 15:19:32 2015 +1000

    Ring builder code clean up follow up patch

    This is a simple change that cleans up a NIT from Sam's 'stop moving
    partitions unnecessarily when overload is on' patch.

    Change-Id: I9d9f1cc23e2bb625d8e158f4d3f64e10973176a1

commit 6cafd0a4c0bb8f311fc59df580b42e801214effd
Author: Oshrit Feder <email address hidden>
Date: Wed Jul 8 15:18:22 2015 +0300

    Fix Container Sync example

    Container-sync realm uses cluster_ as a prefix to specify clusters'
    names. At use, the prefix should not be included. Fixing the examples
    and sample conf to make it clearer that only the name of the cluster
    should be passed.

    Change-Id: I2e521d86faffb59e1b45d3f039987ee023c5e939

commit 125238612f58481316db68d7087252bb7729f447
Author: Janie Richling <email address hidden>
Date: Sat Jul 4 17:08:32 2015 -0500

    Add CORS unit tests to base

    In earlier versions of swift when a request was made with an
    existing origin, but without any CORS settings in the container,
    it was possible to get an u...

tags: added: in-feature-hummingbird
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/crypto)

Fix proposed to branch: feature/crypto
Review: https://review.openstack.org/205579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/crypto)
Download full text (80.5 KiB)

Reviewed: https://review.openstack.org/205579
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=8ab46b64365b8eab80680f2562f81e8adb3032a3
Submitter: Jenkins
Branch: feature/crypto

commit 89f705e8aab144092d40a13fc4ef19ffef5f3eba
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Jul 23 06:11:27 2015 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I94cf347564cb33977f33b1e64259bcb39a8cf809

commit a65e9db8752793ec37b594dc9eca5066171825db
Author: Christian Schwede <email address hidden>
Date: Wed Jul 22 10:43:17 2015 +0000

    Removing commented out code in test/unit/account/test_backend.py

    Noticed this while reviewing another change. Looks like the test itself already
    ensures correct functionality of the reclaim() method in AccountBroker without
    the commented code, thus removing this stale code.

    Change-Id: I6a26a7591adef9fd794ca68a4e9c493d1127f93c

commit 99d052772a9585e0befdfd292fd03aefde77180a
Author: Kota Tsuyuzaki <email address hidden>
Date: Mon Jul 13 01:12:43 2015 -0700

    Fix 499 client disconnected on COPY EC object

    Currently, a COPY request for an EC object might go to fail as 499 Client
    disconnected because of the difference between destination request content
    length and actual transferred bytes.

    That is because the conditional response status and content length for
    an EC object range GET is handled at calling the response instance on
    proxy server. Therefore the calling response instance (resp()) will change
    the conditional status from 200 (HTTP_OK) to 206 (PartialContent) and will
    change the content length for the range GET.

    In EC case, sometimes Swift needs whole stored contents to decode a segment.
    It will make 200 HTTP OK response from object-server and proxy-server
    will unfortunately set whole content length to the destination content
    length and it makes the bug 1467677.

    This patch introduces a new method "fix_conditional_response" for
    swift.common.swob.Response that calling _response_iter() and cached the
    iter in the Response instance. By calling it, Swift can set correct condtional
    response any time after setting whole content_length to the response
    instance like EC case.

    Change-Id: If85826243f955d2f03c6ad395215c73daab509b1
    Closes-Bug: #1467677

commit 62ed4f81ef80440550633eaaaa962a4f9383c2d3
Author: Timur Alperovich <email address hidden>
Date: Tue Jul 14 16:56:44 2015 -0700

    Add two functional tests for delimiter.

    The first test verifies that a delimiter will trim entries beyond the
    first matching instance of delimiter (after the given matching prefix,
    if any) and squash duplicates. So, when setting the delimiter
    to "-", given blobs "test", "test-foo" and "test-bar-baz", we expect
    only "test" (no matching delim) and "test-" (trim all characters after
    the first "-", and squash duplicates).

    The second test verifies that when a prefix is provid...

tags: added: in-feature-crypto
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.4.0
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.