503 incomplete transfer While processing manifest of EC s3 MPU logged at 200 success

Bug #1895691 reported by clayg
Affects: OpenStack Object Storage (swift)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

When I added a sleep to my object servers just before returning their response:

 diff --git a/swift/obj/server.py b/swift/obj/server.py
 index 5a883e4c1..ea853992f 100644
 --- a/swift/obj/server.py
 +++ b/swift/obj/server.py
 @@ -27,6 +27,7 @@ import socket
  import math
  from swift import gettext_ as _
  from hashlib import md5
 +import random

  from eventlet import sleep, wsgi, Timeout, tpool
  from eventlet.greenthread import spawn
 @@ -1138,6 +1139,8 @@ class ObjectController(BaseStorageServer):
                  headers['X-Backend-Timestamp'] = e.timestamp.internal
              resp = HTTPNotFound(request=request, headers=headers,
                                  conditional_response=True)
 +        if obj_size > 1000000 and random.random() > 0.7:
 +            sleep(2.0)
          return resp

      @public

I observed a set of log lines similar to a transaction we had seen on our production cluster:

Sep 15 15:09:45 saio proxy-server: ERROR with Object server 127.0.0.4:6040/sdb8 re: Trying to GET /v1/AUTH_test/ectest/myslo: Timeout (1.0s)
Sep 15 15:27:32 saio proxy-server: ERROR with Object server 127.0.0.3:6030/sdb7 re: Trying to GET /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/1: Timeout (1.0s)
Sep 15 15:27:37 saio proxy-server: ERROR with Object server 127.0.0.3:6030/sdb3 re: Trying to GET /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/1: Timeout (1.0s)
Sep 15 15:28:49 saio proxy-server: ERROR with Object server 127.0.0.3:6030/sdb7 re: Trying to GET /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/2: Timeout (1.0s)
Sep 15 15:28:50 saio proxy-server: ERROR with Object server 127.0.0.2:6020/sdb2 re: Trying to GET /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/2: Timeout (1.0s)
Sep 15 15:28:51 saio proxy-server: ERROR with Object server 127.0.0.3:6030/sdb3 re: Trying to GET /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/2: Timeout (1.0s)

Sep 15 15:28:51 saio proxy-server: Object returning 503 for [] (txn: txc9adaa0c1e394625a4978-005f60ddb0)
Sep 15 15:28:51 saio proxy-server: - - 15/Sep/2020/15/28/51 GET /v1/AUTH_test/ectest%252Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/2%3Fmultipart-manifest%3Dget HTTP/1.0 499 - aws-cli/1.18.129%20Python/2.7.17%20Linux/4.15.0-115-generic%20botocore/1.17.52%20StaticWeb%20%20SLO%20MultipartGET - - 118 - txc9adaa0c1e394625a4978-005f60ddb0 - 3.0848 SLO - 1600183728.648982048 1600183731.733822107 1
Sep 15 15:28:51 saio proxy-server: While processing manifest /v1/AUTH_test/ectest/myslo, got 503 while retrieving /v1/AUTH_test/ectest%2Bsegments/myslo/YjVjYjIyMmQtOTExZC00YzhjLWJiZDUtMTQ5NzcxYWUwZjBm/2 (txn: txc9adaa0c1e394625a4978-005f60ddb0)
Sep 15 15:28:51 saio proxy-server: 127.0.0.1 127.0.0.1 15/Sep/2020/15/28/51 GET /ectest/myslo HTTP/1.0 200 - aws-cli/1.18.129%20Python/2.7.17%20Linux/4.15.0-115-generic%20botocore/1.17.52 - - 5242888 - txc9adaa0c1e394625a4978-005f60ddb0 - 3.2518 - - 1600183728.483520031 1600183731.735340118 -

This was with an S3 client making a request for an MPU/SLO stored in an EC bucket.
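
For reference, the client side here can be approximated with a short boto3 script along these lines (a sketch only; the endpoint, credentials, and bucket/key names are assumptions for a SAIO-style setup with s3api enabled, and the bucket needs to land in an EC storage policy):

    import boto3

    # Assumed endpoint/credentials for a SAIO-style cluster; adjust to taste.
    s3 = boto3.client('s3', endpoint_url='http://127.0.0.1:8080',
                      aws_access_key_id='test:tester',
                      aws_secret_access_key='testing')

    s3.create_bucket(Bucket='ectest')
    part = b'x' * (5 * 1024 * 1024)  # 5 MiB, the minimum non-final part size
    mpu = s3.create_multipart_upload(Bucket='ectest', Key='myslo')
    parts = []
    for num in (1, 2):
        resp = s3.upload_part(Bucket='ectest', Key='myslo', PartNumber=num,
                              UploadId=mpu['UploadId'], Body=part)
        parts.append({'PartNumber': num, 'ETag': resp['ETag']})
    s3.complete_multipart_upload(Bucket='ectest', Key='myslo',
                                 UploadId=mpu['UploadId'],
                                 MultipartUpload={'Parts': parts})

    # With the injected sleeps above, this GET can be logged as 200 even
    # though the body comes back short.
    body = s3.get_object(Bucket='ectest', Key='myslo')['Body'].read()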

Even though we already sent a 200 to the client, I think we could (should?) log the access line (and emit stats) as a server error, for better accountability and tracking.
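
One way that could work: have the code that notices the mid-stream segment failure stash an override status in the WSGI environ, and have the access-logging middleware prefer that override when it picks the status for the log line and statsd metrics. A minimal sketch, using a hypothetical swift.logging_status_override key (not an existing Swift interface):

    # Hypothetical hook in the code that streams SLO segments, called when a
    # segment GET fails after the 200 and part of the body have already been
    # sent to the client:
    def note_incomplete_transfer(environ, status=503):
        environ['swift.logging_status_override'] = status

    # Hypothetical counterpart in the access-logging middleware, where the
    # status for the log line / statsd bucket is chosen:
    def status_for_logging(environ, wire_status):
        # prefer the override so a truncated transfer counts as a server
        # error rather than a 200
        return environ.get('swift.logging_status_override', wire_status)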

clayg (clay-gerrard) wrote:

It seems timeouts can also manifest like:

 2020-09-17T16:02:29.965585+00:00 ss0206 proxy-server: err InternalError: 500 Internal Server Error (unexpected status code 409):
 Traceback (most recent call last):
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/s3api.py", line 303, in __call__
     resp = self.handle_request(req)
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/s3api.py", line 341, in handle_request
     res = handler(req)
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/controllers/obj.py", line 134, in GET
     return self.GETorHEAD(req)
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/controllers/obj.py", line 100, in GETorHEAD
     resp = req.get_response(self.app, query=query)
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/s3request.py", line 1419, in get_response
     headers, body, query)
   File "/opt/ss/lib/python2.7/site-packages/swift/common/middleware/s3api/s3request.py", line 1403, in _get_response
     raise InternalError('unexpected status code %d' % status)
 InternalError: InternalError: 500 Internal Server Error (unexpected status code 409) (txn: tx821f2f8a806c45378a97a-005f63888b)

Tim Burke (1-tim-z) wrote:

Yeah, with a slowdown in the object server like that, I'd expect one of three possible error codes to be returned to the client today:

* 503 when the manifest can't be fetched;
* 500 when the manifest can be fetched, but the first segment used in the response can't; or
* 200 (possibly with a short read) when both the manifest and the first segment requests are successful.

The 500-resulting-from-a-409 boils down to https://github.com/openstack/swift/blob/2.25.0/swift/common/middleware/slo.py#L1042-L1050 -- note that we're validating the first segment *used in the response*, not necessarily the first segment of the manifest, which is useful when the client retries with a Range request to skip the bytes it already received.
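
To make that retry pattern concrete, here is roughly what such a client looks like (a sketch using the requests library; the URL and token are assumptions): it notices the body came up short against Content-Length and comes back with a Range request for just the missing tail, so SLO only has to validate the first segment that Range actually touches.

    import requests
    from requests.exceptions import ChunkedEncodingError, ConnectionError

    url = 'http://127.0.0.1:8080/v1/AUTH_test/ectest/myslo'  # assumed path
    headers = {'X-Auth-Token': 'AUTH_tk...'}                 # assumed token

    resp = requests.get(url, headers=headers, stream=True)
    expected = int(resp.headers['Content-Length'])
    chunks = []
    try:
        for chunk in resp.iter_content(64 * 1024):
            chunks.append(chunk)
    except (ChunkedEncodingError, ConnectionError):
        pass  # server hung up mid-body; fall through to the length check
    body = b''.join(chunks)

    if len(body) < expected:
        # Short read (the "200 success" case above): ask only for the missing
        # tail. The proxy validates the first segment covered by this Range,
        # not segment 1 of the manifest.
        tail = requests.get(url, headers=dict(headers,
                                              Range='bytes=%d-' % len(body)))
        body += tail.content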
