memcached exception when trying to connect to offline node

Bug #897451 reported by Florian Hines
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Medium
Matthew Oliver

Bug Description

Nov 28 22:36:02 proxy1-Z1 proxy-server STDOUT: ERROR:root:Error talking to memcached: x.x.x.x:11211 Traceback (most recent call last): File "/usr/lib/pymodules/python2.6/swift/common/memcached.py", line 160, in get while line[0].upper() != 'END': IndexError: list index out of range (txn: tx2c52f3d36bdd4041844face3ba87e0fb) (client_ip: x.x.x.x)

Revision history for this message
David Kranz (david-kranz) wrote :

I just saw this in the following scenario. A long-running cluster had a problem where the nodes lost their IP addresses due to a dhcp problem. After recovery I got Unauthorized rejections when attempting to access the cluster. I saw a bunch of errors in the logs just like this. Restarting memcached and the proxy-server processes seemed to fix the problem. In case it matters, the new IP addresses were the same as the old ones.

Changed in swift:
status: New → Triaged
status: Triaged → Confirmed
clayg (clay-gerrard)
Changed in swift:
importance: Undecided → Medium
Revision history for this message
Matthew Oliver (matt-0) wrote :

Any ideas on how to recreate this, I've played with flushing and nuking memcache to try to break it with the same error. I may need to do it mid-request, as after a flush the proxy correctly gives me an unauthorized and this error is not printed to the logs.

It is an index out of bounds, so we could just deal with the possibility of that happening in code. (ie catch the IndexError). But that isn't necessarily stopping it, rather removing the stack trace... which I guess is better then nothing.

Or maybe this isn't an issue anymore.

I'll keep playing.

Revision history for this message
clayg (clay-gerrard) wrote :

It sorta looks like you'd have to get memcache to bounce in the middle of the request/response *to memcache* - which tend to happen fairly quickly. I would suspect it'd be possible to reproduce the bug with a memcache fake where you could close the socket after sending the response and before END.

I think it'd be easy to argue our memcache client should be as robust as possible to a socket read returning empty string at any point in the request; but remediation would probably always be to close the connection and bubble up an error to the caller (who would have to decide to retry the request or treat memcache as temporarily unavailable)

Revision history for this message
Matthew Oliver (matt-0) wrote :

Hmm, yeah, a fake memcache could reproduce it, I'll have a go...

I guess the question is what can we really do about it if the item is missing from cache, we are talking the auth token mainly, so I assume an unauthorized is in order. I guess the solution would be to catch the IndexError and deal with it, and deal with it properly.

Maybe create a cache error/exception which can be raised and caught in the authtoken part of the code responds with an unauthorized. but would mean other areas can catch such errors in future if/when we decide to extend the use of the memcache client to other areas of swift.

Revision history for this message
Matthew Oliver (matt-0) wrote :

So while delving deeper into this bug, it looks like it happens when the socket to memcache doesn't return anything, or returns whitespace or an empty string. So maybe after a stale connection.

I have replicated it with complete and utter hackery (i.e. mocking the _get_conns method so it passes back an object that returns an emtpy string on readline).

The existing memcache client code already catches all exceptions and deals with it with the _exception_occurred method which logs it and then cleans up the connection.

So when my test triggers it, is correctly logs the exception to the log file:

  ======================================================================
  ERROR: test_failed_mid_request (test.unit.common.test_memcached.TestMemcached)
  ----------------------------------------------------------------------
  Traceback (most recent call last):
    File "/home/matt/programming/swift/test/unit/common/test_memcached.py", line 553, in test_failed_mid_request
      self.assertEquals(memcache_client.get('some_key'), [1, 2, 3])
    File "/home/matt/programming/swift/swift/common/memcached.py", line 333, in get
      self._exception_occurred(server, e, sock=sock, fp=fp)
    File "/home/matt/programming/swift/swift/common/memcached.py", line 211, in _exception_occurred
      self._return_conn(server, None, None)
    File "/home/matt/programming/swift/swift/common/memcached.py", line 254, in _return_conn
      self._client_cache[server].put((fp, sock))
  KeyError: <test.unit.common.test_memcached.MockMemcached object at 0x7f66598fdb50>
  -------------------- >> begin captured logging << --------------------
  root: ERROR: Error talking to memcached: <test.unit.common.test_memcached.MockMemcached object at 0x7f66598fdb50>
  Traceback (most recent call last):
    File "/home/matt/programming/swift/swift/common/memcached.py", line 317, in get
      while line[0].upper() != 'END':
  IndexError: list index out of range
  --------------------- >> end captured logging << ---------------------

NOTE: I'm using the test suite to run the code that deliberately causes it to fail.

Revision history for this message
clayg (clay-gerrard) wrote :

So if the connections are already being reset, and an error is already bubbling up to the caller - maybe all you need is to *raise a more specific error* so the memcache client can skip the generic exception logging and just let the caller catch the more specific error and retry or *they* can log a traceback or a warning or whatever...

    diff --git a/swift/common/memcached.py b/swift/common/memcached.py
    index f748579..6b19c34 100644
    --- a/swift/common/memcached.py
    +++ b/swift/common/memcached.py
    @@ -314,7 +314,11 @@ class MemcacheRing(object):
                     with Timeout(self._io_timeout):
                         sock.sendall('get %s\r\n' % key)
                         line = fp.readline().strip().split()
    - while line[0].upper() != 'END':
    + while True:
    + if not line:
    + raise MemcacheReadError()
    + if line[0].upper() != 'END':
    + break
                             if line[0].upper() == 'VALUE' and line[1] == key:
                                 size = int(line[3])
                                 value = fp.read(size)

Revision history for this message
John Dickinson (notmyname) wrote :

Clay or Matt, are you still looking at this? Are there any patches that have landed in Swift that resolve this?

Changed in swift:
assignee: nobody → Matthew Oliver (matt-0)
Revision history for this message
Caleb Tennis (ctennis) wrote :

Had a customer hit this exact error today, simply by taking a memcached node offline.

Jun 9 15:16:55 localhost proxy-server: STDERR: ERROR:root:Error talking to memcached: 10.0.5.72:11211#012Traceback (most recent call last):#012 File "/opt/ss/lib/python2.7/site-pac
kages/swift/common/memcached.py", line 317, in get#012 while line[0].upper() != 'END':#012IndexError: list index out of range (txn: txafcfc6b5bac34d3aba134-0055773ba5)

Would be nice if the log was cleaner and didn't show the backtrace.

Revision history for this message
Matthew Oliver (matt-0) wrote :

Well we can at least tidy up the error and catch the specific IndexError and give a nicer error in the logs. I'll put together a patch to do this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219963

Changed in swift:
status: Confirmed → In Progress
Jean Caron (jean.caron)
information type: Public → Public Security
Revision history for this message
Jeremy Stanley (fungi) wrote :

Please don't set to security without including a comment justifying the report type change.

information type: Public Security → Public
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/219963
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=e501ac7d2be5c11b2ed0005885c84023054ec041
Submitter: Jenkins
Branch: master

commit e501ac7d2be5c11b2ed0005885c84023054ec041
Author: Matthew Oliver <email address hidden>
Date: Thu Sep 3 12:19:05 2015 +1000

    Fix memcached exception out of range stacktrace

    When a memecached server goes offline in the middle of a
    MemcahceRing (swift memcache client) session then a request
    to memcached returns nothing and the client inside swift
    leaves an "IndexError: list index out of range" stacktrace.

    This change corrects that in all the places of MemcacheRing
    that is susceptible to it, and added some tests to stop
    regression.

    Clay added a diff to the bug that pretty much did the same
    thing I did, so I'll co-author him.

    Change-Id: I97c5420b4b4ecc127e9e94e9d0f91fbe92a5f623
    Co-Authored-By: Clay Gerrard <email address hidden>
    Closes-Bug: #897451

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/deep)

Fix proposed to branch: feature/deep
Review: https://review.openstack.org/508700

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/deep)
Download full text (7.2 KiB)

Reviewed: https://review.openstack.org/508700
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0c75ddf6fe5a4843fe60836b402f27cb1b83d8c5
Submitter: Zuul
Branch: feature/deep

commit 93fc9d2de86f37f62b1d6768600d0551e1b72fb6
Author: Alistair Coles <email address hidden>
Date: Wed Sep 27 16:35:27 2017 +0100

    Add cautionary note re delay_reaping in account-server.conf-sample

    Change-Id: I2c3eea783321338316eecf467d30ba0b3217256c
    Related-Bug: #1514528

commit c6aea4b3730c937c41815831a7b4d60ff2899fcb
Author: Tim Burke <email address hidden>
Date: Wed Sep 27 19:19:53 2017 +0000

    Fix intermittent failure in test_x_delete_after

    X-Delete-After: 1 is known to be flakey; use 2 instead.

    When the proxy receives an X-Delete-After header, it automatically
    converts it to an X-Delete-At header based on the current time. So far,
    so good. But in normalize_delete_at_timestamp we convert our

        time.time() + int(req.headers['X-Delete-After'])

    to a string representation of an integer and in the process always round
    *down*. As a result, we lose up to a second worth of object validity,
    meaning the object server can (rarely) respond 400, complaining that the
    X-Delete-At is in the past.

    Change-Id: Ib5e5a48f5cbed0eade8ba3bca96b26c82a9f9d84
    Related-Change: I643be9af8f054f33897dd74071027a739eaa2c5c
    Related-Change: I10d3b9fcbefff3c415a92fa284a1ea1eda458581
    Related-Change: Ifdb1920e5266aaa278baa0759fc0bfaa1aff2d0d
    Related-Bug: #1597520
    Closes-Bug: #1699114

commit 5c76b9e691166acc1f7b8483aaa3980ebc70bd3a
Author: Alistair Coles <email address hidden>
Date: Wed Sep 27 14:11:14 2017 +0100

    Add concurrent_gets to proxy.conf man page

    Change-Id: Iab1beff4899d096936c0e5915f3ec32364b3e517
    Closes-Bug: #1559347

commit b4f08b6090057897ac647ba6331a4ec867b8e3b8
Author: Jens Harbott <email address hidden>
Date: Wed Sep 27 09:10:54 2017 +0000

    Fix functest for IPv6 endpoints

    Currently the functional tests fail if the storage_url contains a quoted
    IPv6 address because we try to split on ':'.

    But actually we don't need to split hostname and port only in order to
    combine it back together lateron. Use the standard urlparse() function
    instead and work with the 'netloc' part of the URL which keeps hostname
    and port together.

    Change-Id: I64589e5f2d6fb3cebc6768dc9e4de6264c09cbeb
    Partial-Bug: 1656329

commit 53ab6f2907eff2bb90528010d881f2f87ee02505
Author: Alistair Coles <email address hidden>
Date: Tue Sep 26 11:43:53 2017 +0100

    Assert memcached connection error is logged

    Follow up to [1] - change logger mocking so that we can
    assert the memcached connection error is logged.

    [1] Related-Change: I97c5420b4b4ecc127e9e94e9d0f91fbe92a5f623

    Change-Id: I87cf4245082c5e0f0705c2c14ddfc0b5d5d89c06

commit e501ac7d2be5c11b2ed0005885c84023054ec041
Author: Matthew Oliver <email address hidden>
Date: Thu Sep 3 12:19:05 2015 +1000

    Fix memcached exception out of range stacktrace

    When a memecached server goes offline in the middle of a
    MemcahceRing...

Read more...

tags: added: in-feature-deep
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/s3api)

Fix proposed to branch: feature/s3api
Review: https://review.openstack.org/512277

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: feature/s3api
Review: https://review.openstack.org/512283

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (feature/s3api)

Change abandoned by Alistair Coles (<email address hidden>) on branch: feature/s3api
Review: https://review.openstack.org/512283
Reason: I was just trying to get sensible topic

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/s3api)
Download full text (23.2 KiB)

Reviewed: https://review.openstack.org/512277
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=f94d6567a7e2e8b3ca1168b4a41c42c1ee371af5
Submitter: Zuul
Branch: feature/s3api

commit 24188beb81d39790034fa0902246163a7bf54c91
Author: Samuel Merritt <email address hidden>
Date: Thu Oct 12 16:13:25 2017 -0700

    Remove some leftover threadpool cruft.

    Change-Id: I43a1a428bd96a2e18aac334c03743a9f94f7d3e1

commit 1d67485c0b935719e0c8999eb353dfd84713add6
Author: Samuel Merritt <email address hidden>
Date: Fri Apr 15 12:43:44 2016 -0700

    Move all monkey patching to one function

    Change-Id: I2db2e53c50bcfa17f08a136581cfd7ac4958ada2

commit 407f5394f0f5cb422c06b4e5b2f9fbfdb07782d1
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Oct 12 08:12:38 2017 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://docs.openstack.org/i18n/latest/reviewing-translation-import.html

    Change-Id: I628cb09aa78d8e339b4762a3c9ed8aed43941261

commit 45ca39fc68cdb42b382c1638a92cc8d3cec5529a
Author: Clay Gerrard <email address hidden>
Date: Tue Oct 10 11:47:50 2017 -0700

    add mangle_client_paths to example config

    Change-Id: Ic1126fc95e8152025fccf25356c253facce3e3ec

commit 94bac4ab2fe65104d602378e8e49c37b8187a75d
Author: Tim Burke <email address hidden>
Date: Fri May 12 10:55:21 2017 -0400

    domain_remap: stop mangling client-provided paths

    The root_path option for domain_remap seems to serve two purposes:
     - provide the first component (version) for the backend request
     - be an optional leading component for the client request, which
       should be stripped off

    As a result, we have mappings like:

       c.a.example.com/v1/o -> /v1/AUTH_a/c/o

    instead of

       c.a.example.com/v1/o -> /v1/AUTH_a/c/v1/o

    which is rather bizarre. Why on earth did we *ever* start doing this?

    Now, this second behavior is managed by a config option
    (mangle_client_paths) with the default being to disable it.

    Upgrade Consideration
    =====================

    If for some reason you *do* want to drop some parts of the
    client-supplied path, add

       mangle_client_paths = True

    to the [filter:domain_remap] section of your proxy-server.conf. Do this
    before upgrading to avoid any loss of availability.

    UpgradeImpact
    Change-Id: I87944bfbf8b767e1fc36dbc7910305fa1f11eeed

commit a4a5494fd2fe8a43a5d50a21a1951266cc7c4212
Author: Alistair Coles <email address hidden>
Date: Mon Oct 9 11:33:28 2017 +0100

    test account autocreate listing format

    Related-Change: Id3ce37aa0402e2d8dd5784ce329d7cb4fbaf700d
    Change-Id: I50c22225bbebff71600bea9158bda1edd18b48b0

commit 8b7f15223cde4c19fd9cbbd97e8ad79a1b4afa8d
Author: Alistair Coles <email address hidden>
Date: Mon Oct 9 10:06:19 2017 +0100

    Add example to container-sync-realms.conf.5 man page

    Related-Change: I0760ce149e6d74f2b3f1badebac3e36da1ab7e77

    Change-Id: I129de42f91d7924c7bcb9952f17fe8a1a10ae219

commit 816331155c624c444ed123bcab412...

tags: added: in-feature-s3api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.16.0

This issue was fixed in the openstack/swift 2.16.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.