UnicodeDecodeError when rebalancing a ring

Bug #1580678 reported by Admin6
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
In Progress
Medium
Christian Schwede

Bug Description

I had servers under Ubuntu 15.10 wih swift 2.5 where replication seems to work properly.
Since I've upgraded my servers to Ubuntu 16.04 with swift 2.7.0, I see a lot a UnicodeDecodeError each time I rebalance my object ring with a replica 3 policy.

Here is an extract of the centralized logs :

May 9 11:45:53 STACO3 object-replicator: <f+++++++++ 527/7913026d10c3c17071c94597808a3527/1461851500.88147.data
May 9 11:45:53 STACO3 object-replicator: STDERR: Traceback (most recent call last):
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 845, in emit
May 9 11:45:53 STACO3 object-replicator: STDERR: msg = self.format(record) + '\000'
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 734, in format
May 9 11:45:53 STACO3 object-replicator: STDERR: return fmt.format(record)
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1571, in format
May 9 11:45:53 STACO3 object-replicator: STDERR: record.message = record.getMessage()
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 329, in getMessage
May 9 11:45:53 STACO3 object-replicator: STDERR: msg = msg % self.args
May 9 11:45:53 STACO3 object-replicator: STDERR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
May 9 11:45:53 STACO3 object-replicator: STDERR: Logged from file replicator.py, line 207
May 9 11:45:53 STACO3 object-replicator: Suppression partition: /srv/node/s03z3rpd02/objects/123980
May 9 11:45:53 STACO1 object-replicator: Suppression partition: /srv/node/s01z1rpd01/objects/122268
May 9 11:45:53 STACO3 object-replicator: <f+++++++++ a32/78f6f82ce453c148b70dfb5626413a32/1462352950.61782.data
May 9 11:45:53 STACO3 object-replicator: STDERR: Traceback (most recent call last):
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 845, in emit
May 9 11:45:53 STACO3 object-replicator: STDERR: msg = self.format(record) + '\000'
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 734, in format
May 9 11:45:53 STACO3 object-replicator: STDERR: return fmt.format(record)
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1571, in format
May 9 11:45:53 STACO3 object-replicator: STDERR: record.message = record.getMessage()
May 9 11:45:53 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 329, in getMessage
May 9 11:45:53 STACO3 object-replicator: STDERR: msg = msg % self.args
May 9 11:45:53 STACO3 object-replicator: STDERR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
May 9 11:45:53 STACO3 object-replicator: STDERR: Logged from file replicator.py, line 207
May 9 11:45:54 STACO2 object-replicator: Suppression partition: /srv/node/s02z2rpd01/objects/195932
May 9 11:45:54 STACO3 object-replicator: <f+++++++++ 272/11b44d21141a8fca6ba08e0b6acf4272/1461853584.04076.data
May 9 11:45:54 STACO3 object-replicator: STDERR: Traceback (most recent call last):
May 9 11:45:54 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 845, in emit
May 9 11:45:54 STACO3 object-replicator: STDERR: msg = self.format(record) + '\000'
May 9 11:45:54 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 734, in format
May 9 11:45:54 STACO3 object-replicator: STDERR: return fmt.format(record)
May 9 11:45:54 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1571, in format
May 9 11:45:54 STACO3 object-replicator: STDERR: record.message = record.getMessage()
May 9 11:45:54 STACO3 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 329, in getMessage
May 9 11:45:54 STACO3 object-replicator: STDERR: msg = msg % self.args
May 9 11:45:54 STACO3 object-replicator: STDERR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
May 9 11:45:54 STACO3 object-replicator: STDERR: Logged from file replicator.py, line 207
May 9 11:45:54 STACO2 object-replicator: <f+++++++++ 8c4/7706a79605def54433cf09c80da008c4/1462352944.46767.data
May 9 11:45:54 STACO2 object-replicator: STDERR: Traceback (most recent call last):
May 9 11:45:54 STACO2 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 845, in emit
May 9 11:45:54 STACO2 object-replicator: STDERR: msg = self.format(record) + '\000'
May 9 11:45:54 STACO2 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 734, in format
May 9 11:45:54 STACO2 object-replicator: STDERR: return fmt.format(record)
May 9 11:45:54 STACO2 object-replicator: STDERR: File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1571, in format
May 9 11:45:54 STACO2 object-replicator: STDERR: record.message = record.getMessage()
May 9 11:45:54 STACO2 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 329, in getMessage
May 9 11:45:54 STACO2 object-replicator: STDERR: msg = msg % self.args
May 9 11:45:54 STACO2 object-replicator: STDERR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
May 9 11:45:54 STACO2 object-replicator: STDERR: Logged from file replicator.py, line 207
May 9 11:45:54 STACO1 object-replicator: Suppression partition: /srv/node/s01z1rpd02/objects/128058

Here is the definition of this object ring :

# swift-ring-builder object.builder
object.builder, build version 49
262144 partitions, 3.000000 replicas, 1 regions, 4 zones, 12 devices, 7.97 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1 (0:37:13 remaining)
The overload factor is 0.00% (0.000000)
Ring file object.ring.gz is up-to-date
Devices: id region zone ip address port replication ip replication port name weight partitions balance flags meta
             0 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1rpd01 4000.00 70036 2.19
             1 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1rpd02 4000.00 70035 2.19
             2 1 1 10.10.1.51 6000 10.10.2.51 6000 s01z1rpd03 4000.00 70034 2.19
             3 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2rpd01 4000.00 70036 2.19
             4 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2rpd02 4000.00 70035 2.19
             5 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3rpd01 4000.00 70036 2.19
             6 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3rpd02 4000.00 70036 2.19
             7 1 2 10.10.1.52 6000 10.10.2.52 6000 s02z2rpd03 4000.00 70036 2.19
             8 1 3 10.10.1.53 6000 10.10.2.53 6000 s03z3rpd03 4000.00 70036 2.19
             9 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4rpd01 3300.00 52037 -7.97
            10 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4rpd02 3300.00 52037 -7.97
            11 1 4 10.10.1.54 6000 10.10.2.54 6000 s04z4rpd03 3300.00 52038 -7.96

Revision history for this message
clayg (clay-gerrard) wrote :

looks like some sort of log message coming from rsync? you might try changing your locale options on the server and restarting services? LC_ALL=C maybe?

Even if that fixes it, it's probably still a bug - but knowing what the default locale on the system was might make it easier to reproduce the failure.

Revision history for this message
Alistair Coles (alistair-coles) wrote :

Suppression partition suggests locale is French:
https://github.com/openstack/swift/blob/master/swift/locale/fr/LC_MESSAGES/swift.po#L871

which means we'll log this:
https://github.com/openstack/swift/blob/master/swift/locale/fr/LC_MESSAGES/swift.po#L968

"Succès de Rsync pour %(src)s dans %(dst)s (%(time).03f)"

which could be problematic if one of the format args is unicode:

>>> "Succès de Rsync pour %s" % u'foo'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Revision history for this message
Admin6 (jmbonnefond) wrote :
Download full text (3.9 KiB)

This is a locale related problem. The current locales on the servers are french :

root@STACO4:~# locale
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER="fr_FR.UTF-8"
LC_NAME="fr_FR.UTF-8"
LC_ADDRESS="fr_FR.UTF-8"
LC_TELEPHONE="fr_FR.UTF-8"
LC_MEASUREMENT="fr_FR.UTF-8"
LC_IDENTIFICATION="fr_FR.UTF-8"
LC_ALL=

I've 4 account/Container/Object servers. STACO1 to STACO4, I've changes the locale to LC_ALL=C.UTF-8 on the first 3 of them, keeping STACO4 in french locale, and it works better for the 3 first servers.
Here is a new extract of the logs :

May 12 14:14:32 STACO3 object-replicator: Removing partition: /srv/node/s03z3rpd03/objects/68179
May 12 14:14:33 STACO2 object-replicator: <f+++++++++ c14/b69a972b6a5d9d3ebd9ac81100a07c14/1461846278.46015.data
May 12 14:14:33 STACO2 object-replicator: Successful rsync of /srv/node/s02z2rpd03/objects/186986/c14 at 10.10.2.54::object/s04z4rpd03/objects/186986 (0.178)
May 12 14:14:33 STACO3 object-replicator: <f+++++++++ 220/1276697801d6db7105bf60285bef1220/1463048094.60841.data
May 12 14:14:33 STACO3 object-replicator: Successful rsync of /srv/node/s03z3rpd02/objects/18905/220 at 10.10.2.51::object/s01z1rpd03/objects/18905 (0.160)
May 12 14:14:33 STACO1 object-replicator: <f+++++++++ 00e/19121c65af0d5bbf7b722bf99522500e/1462929131.07914.data
May 12 14:14:33 STACO1 object-replicator: <f+++++++++ 371/19120dab9be4fd00225ac9354bc2f371/1462933315.05776.data
May 12 14:14:33 STACO1 object-replicator: Successful rsync of /srv/node/s01z1rpd03/objects/25672/00e at 10.10.2.53::object/s03z3rpd02/objects/25672 (0.205)
May 12 14:14:33 STACO4 object-replicator: <f+++++++++ 0b8/34ca332fe3fb5912d9e3651b091810b8/1462918769.46525.data
May 12 14:14:33 STACO4 object-replicator: <f+++++++++ 13d/34ca1d594ef57bc7655a10b8344dd13d/1462939069.82513.data
May 12 14:14:33 STACO4 object-replicator: <f+++++++++ 871/34ca369cc1bf011985fca98b295dd871/1462898293.59999.data
May 12 14:14:33 STACO4 object-replicator: STDERR: Traceback (most recent call last):
May 12 14:14:33 STACO4 object-replicator: STDERR: File "/usr/lib/python2.7/logging/handlers.py", line 845, in emit
May 12 14:14:33 STACO4 object-replicator: STDERR: msg = self.format(record) + '\000'
May 12 14:14:33 STACO4 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 734, in format
May 12 14:14:33 STACO4 object-replicator: STDERR: return fmt.format(record)
May 12 14:14:33 STACO4 object-replicator: STDERR: File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1571, in format
May 12 14:14:33 STACO4 object-replicator: STDERR: record.message = record.getMessage()
May 12 14:14:33 STACO4 object-replicator: STDERR: File "/usr/lib/python2.7/logging/__init__.py", line 329, in getMessage
May 12 14:14:33 STACO4 object-replicator: STDERR: msg = msg % self.args
May 12 14:14:33 STACO4 object-replicator: STDERR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
May 12 14:14:33 STACO4 object-replicator: STDERR: Logged from file replicator.py, line 207
May 12 14:14:33 STACO1...

Read more...

Revision history for this message
Christian Schwede (cschwede) wrote :

I can confirm the error, and agree with Alistair - the problem occurs on non-english languages when the arg is a unicode. Most likely it happens in other parts of Swift as well.

I'm refactoring the locale tests right now (nearly done) to have something testable, afterwards we need to fix the issue itself.

Changed in swift:
assignee: nobody → Christian Schwede (cschwede)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/hummingbird)

Fix proposed to branch: feature/hummingbird
Review: https://review.openstack.org/323946

Changed in swift:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/323950

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (feature/hummingbird)

Change abandoned by Christian Schwede (<email address hidden>) on branch: feature/hummingbird
Review: https://review.openstack.org/323946
Reason: Wrong branch.

Revision history for this message
Brian Cline (briancline) wrote :
Download full text (3.5 KiB)

We are currently this with a 2.7.0 release on a node with LANG=en_US, which hasn't changed since the deployment of the last version we ran. However we see it in a much more problematic case: if a single object-server has services offline or is down, and Swift tries to log the corresponding error for it, any user-facing PUT request that attempts to touch that node fails with a 500, rather than gracefully trying a handoff node.

Un-mangled stack trace is at https://gist.github.com/briancline/82bef904104d21e91db7829f5d8fc192

swift-proxy: ERROR Unhandled exception in request:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/server.py", line 411, in handle_request
    return handler(req)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/controllers/base.py", line 264, in wrapped
    return func(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/controllers/obj.py", line 775, in PUT
    req, data_source, nodes, partition, outgoing_headers)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/controllers/obj.py", line 1048, in _store_object
    outgoing_headers, policy, expect)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/controllers/obj.py", line 639, in _get_put_connections
    conns = [conn for conn in pile if conn]
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenpool.py", line 222, in next
    return self.waiters.get().wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 175, in wait
    return self._exit_event.wait()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/controllers/obj.py", line 935, in _connect_put_node
    _('Expect: 100-continue on %s') % path)
  File "/usr/local/lib/python2.7/dist-packages/swift/proxy/server.py", line 536, in exception_occurred
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/swift/common/utils.py", line 1513, in exception
    call('%s: %s' % (msg, emsg), *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1445, in error
    self.logger.error(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1174, in error
    self._log(ERROR, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1267, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1277, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1317, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 748, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/handlers.py", line 828, in emit
    msg = self.format(record) + '\000'
  File "/usr/lib/python2.7/logging/__init__.py", line 723, in format
    return fmt.format(record)
  File "/usr/local/...

Read more...

Revision history for this message
Brian Cline (briancline) wrote :

Sorry, I should clarify -- it doesn't fail *ANY* PUT request, but seemingly only ones with URLs that get urllib.unquote'd to actual Unicode characters.

So we actually noticed this bug while running Swift functests against a freshly upgraded proxy. Specifically, test.functional.tests.TestAccountUTF8.testContainerSerializedInfo is the test that fails.

Updated the gist showing the func test failure and proxy log line in addition to just the stacktrace. (https://gist.github.com/briancline/82bef904104d21e91db7829f5d8fc192)

Revision history for this message
clayg (clay-gerrard) wrote :

seems to be blowing up here:

https://github.com/openstack/swift/blob/029c2782ddfd933af20b697c00a46b1375e0f23e/swift/proxy/controllers/obj.py#L1940

I'd say try to set the LANG on all nodes to LANG=en_US.UTF-8

But of course I'd also probably say you should rewrite gettext_ in swift/__init__.py as the identity function...

Revision history for this message
Brian Cline (briancline) wrote :

Sadly, no change in behavior with any of these:
 - applying patch 323950
 - changing gettext_ to just return whatever it was passed in
 - setting system locale to en_US.UTF-8
 - explicitly setting LANG/LC_ALL/LANGUAGE/etc to en_US.UTF-8 (verified via /proc/${pid}/environ that it picked it up)

After digging deeper, the exception I get isn't occurring because the translation string templates have unicode in them (the ones I'm using don't) -- it's occurring because the str it is trying to expand into the already-translated template is just that -- a str instead of a unicode type. This explains why 323950 doesn't apply to my issue.

Since my exception occurs at a different point I've opened bug #1597210 for it.

Revision history for this message
clayg (clay-gerrard) wrote :

and some of the args come from the ring devices - so they're unicode.

Changed in swift:
importance: Undecided → High
Revision history for this message
clayg (clay-gerrard) wrote :

I think this one might be related too -> https://review.openstack.org/#/c/339360/1

Revision history for this message
clayg (clay-gerrard) wrote :

It might be worth noting, that in the *other* critical bug related to this issue - the translation string was already unicode [1] ... so then the fix was to take the bytestring arguments and turn them into unicode.

In the suggested change [2] we're going out of way to ensure that we *always* get unicode out of gettext.

So going forward *anytime* we try to encode non-ascii bytes into into a translated string we're going to blow up regardless of the locale (in lp bug #1597210 - the locale didn't matter either - just the presence of non ascii bytes in the arguments to the unicode translation message).

So rather than fixing some "root cause" - making all translated gettext messages unicode will in someways make unicode errors *more* likely!

Bytes to bytes would always "just work" - so I think it's funny that the best path forward is probably to introduce *more* unicode so that we can't cheat - but it's very much dependent on our tests helping us out to ensure that when we try to introduce non-ascii utf-8 bytestring sources it run through the real machinery! :D

1. because of the additional unicode ring arguments interpolated into the utf-8 bytes coming out of the translation function forced the entire string to become unicode https://review.openstack.org/#/c/338950/1/swift/proxy/server.py@512

2. https://review.openstack.org/#/c/323950/4

Revision history for this message
Andrey Groshrev (greenx) wrote :

To some extent, this bug intersects with the recently described me:
https://bugs.launchpad.net/swift/+bug/1648082
Maybe my decision in your case, too, can help:
>> 2. Try get codec from gettext and change finction gettext_ in swift/__init__.py, as
def gettext_(msg):
    return _t.gettext(msg).decode(_t.charset())

Revision history for this message
clayg (clay-gerrard) wrote :

I think this is the same bug as lp bug #1648082

We should drop the translation on this message

Changed in swift:
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (master)

Change abandoned by Christian Schwede (<email address hidden>) on branch: master
Review: https://review.opendev.org/323950

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.