Object Node timeouts retrieving EC segments returns 404 to client

Bug #1512852 reported by Douglas Soltesz
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Confirmed
Medium
Unassigned

Bug Description

When pushing an Object Server hard Getting EC objects timeouts result in the proxy sending a 404 to the client. Most clients do not retry on a 404, but would retry on a 503. Previous to 2.5, I was receiving a number of 503s during the same testing. These have been changed to 404s, which many applications take as the object does not exist or was deleted.

Here are the logs from my proxy server during a period with Object Server disk busy do to benchmark testing.

Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6022/d16 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6026/d20 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6031/d25 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6041/d30 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6020/d14 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6016/d10 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)
Oct 28 18:15:58 localhost proxy-server: ERROR with Object server 192.168.3.62:6015/d9 re: Trying to GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24: ConnectionTimeout (0.5s)

Oct 28 18:15:58 localhost proxy-server: 192.168.2.41 192.168.2.41 28/Oct/2015/22/15/58 GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-67-22 HTTP/1.0 404 - python-swiftclient-2.4.0 ECTesting,AUTH_t... - 70 - txf5b7019b37a74d25ad51a-005631491d X-Backend-Etag-Is-At: X-Object-Sysmeta-Ec-Etag
X-Auth-Token: ECTesting,AUTH_tka7bb57d575214944b33c19bf6c13e442
User-Agent: python-swiftclient-2.4.0
Connection: keep-alive
X-Trans-Id: txf5b7019b37a74d25ad51a-005631491d
X-Backend-Storage-Policy-Index: 3
Host: 192.168.2.51
Accept: ​*/*​
Accept-Encoding: gzip, deflate
Content-Type: None 1.0855 - - 1446070557.623872042 1446070558.709361076 3

Here are my logs before Swift 2.5 (note I was still getting 404s, but I was also getting 503s in the same testing)

Oct 28 19:58:43 localhost proxy-server: ERROR with Object server 192.168.3.62:6028/d22 re: Trying to GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9: ConnectionTimeout (0.5s)
Oct 28 19:58:43 localhost proxy-server: ERROR with Object server 192.168.3.62:6032/d26 re: Trying to GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9: ConnectionTimeout (0.5s)
Oct 28 19:58:43 localhost proxy-server: ERROR with Object server 192.168.3.62:6007/d2 re: Trying to GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9: ConnectionTimeout (0.5s)
Oct 28 19:58:43 localhost proxy-server: ERROR with Object server 192.168.3.62:6008/d3 re: Trying to GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9: ConnectionTimeout (0.5s)
Oct 28 19:58:43 localhost proxy-server: ERROR with Object server 192.168.3.62:6033/d27 re: Trying to GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9: ConnectionTimeout (0.5s)
Oct 28 19:58:47 localhost proxy-server: 192.168.2.41 192.168.2.41 28/Oct/2015/23/58/47 GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9 HTTP/1.0 503 - python-swiftclient-2.4.0 ECTesting,AUTH_t... - 118 - txbeaed55205ac484ba1bd6-0056316133 X-Backend-Etag-Is-At: X-Object-Sysmeta-Ec-Etag
Oct 28 19:58:52 localhost proxy-server: 192.168.2.41 192.168.2.41 28/Oct/2015/23/58/52 GET /v1/AUTH_ECTesting/EC15-4-1/objects-0-15-9 HTTP/1.0 200 - python-swiftclient-2.4.0 ECTesting,AUTH_t... - 33554432 - tx8ae91dffa5ba4adaa17a4-0056316138 X-Backend-Etag-Is-At: X-Object-Sysmeta-Ec-Etag

Doug

clayg (clay-gerrard)
Changed in swift:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
clayg (clay-gerrard) wrote :
Download full text (3.7 KiB)

So, I shutdown half of my saio object servers and was able to reproduce this:

    Nov 3 22:15:15 saio proxy-server: ERROR with Object server 127.0.0.1:6040/sdb8 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:15:15 saio object-6020: 127.0.0.1 - - [03/Nov/2015:22:15:15 +0000] "GET /sdb2/941/AUTH_test/ec-test/test" 200 82 "GET http://saio:8080/v1/AUTH_test/ec-test/test" "tx6be44b4f68c54c0f881f9-00563931f3" "proxy-server 27232" 0.0024 "-" 27264 1
    Nov 3 22:15:15 saio proxy-server: ERROR with Object server 127.0.0.1:6040/sdb4 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:15:15 saio proxy-server: ERROR with Object server 127.0.0.1:6030/sdb7 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:15:15 saio object-6010: 127.0.0.1 - - [03/Nov/2015:22:15:15 +0000] "GET /sdb1/941/AUTH_test/ec-test/test" 200 82 "GET http://saio:8080/v1/AUTH_test/ec-test/test" "tx6be44b4f68c54c0f881f9-00563931f3" "proxy-server 27232" 0.0008 "-" 27205 1
    Nov 3 22:15:15 saio object-6010: 127.0.0.1 - - [03/Nov/2015:22:15:15 +0000] "GET /sdb5/941/AUTH_test/ec-test/test" 200 82 "GET http://saio:8080/v1/AUTH_test/ec-test/test" "tx6be44b4f68c54c0f881f9-00563931f3" "proxy-server 27232" 0.0011 "-" 27205 1
    Nov 3 22:15:15 saio object-6020: 127.0.0.1 - - [03/Nov/2015:22:15:15 +0000] "GET /sdb6/941/AUTH_test/ec-test/test" 404 - "GET http://saio:8080/v1/AUTH_test/ec-test/test" "tx6be44b4f68c54c0f881f9-00563931f3" "proxy-server 27232" 0.0004 "-" 27264 1
    Nov 3 22:15:15 saio proxy-server: ERROR with Object server 127.0.0.1:6030/sdb3 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:15:15 saio proxy-server: 127.0.0.1 127.0.0.1 03/Nov/2015/22/15/15 GET /v1/AUTH_test/ec-test/test HTTP/1.0 404 - python-swiftclient-2.6.1.dev10 AUTH_tk84450147e... - 70 - tx6be44b4f68c54c0f881f9-00563931f3 - 0.0284 - - 1446588915.392549038 1446588915.420908928 1

Oddly if I disable 3/4 object servers - it get the 503's:

    Nov 3 22:17:21 saio object-6010: 127.0.0.1 - - [03/Nov/2015:22:17:21 +0000] "GET /sdb5/941/AUTH_test/ec-test/test" 200 82 "GET http://saio:8080/v1/AUTH_test/ec-test/test" "txc2605e2641694ae69f1d2-0056393271" "proxy-server 27232" 0.0009 "-" 27205 1
    Nov 3 22:17:21 saio object-6010: 127.0.0.1 - - [03/Nov/2015:22:17:21 +0000] "GET /sdb1/941/AUTH_test/ec-test/test" 200 82 "GET http://saio:8080/v1/AUTH_test/ec-test/test" "txc2605e2641694ae69f1d2-0056393271" "proxy-server 27232" 0.0006 "-" 27205 1
    Nov 3 22:17:21 saio proxy-server: ERROR with Object server 127.0.0.1:6040/sdb4 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:17:21 saio proxy-server: ERROR with Object server 127.0.0.1:6040/sdb8 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:17:21 saio proxy-server: ERROR with Object server 127.0.0.1:6030/sdb7 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:17:21 saio proxy-server: ERROR with Object server 127.0.0.1:6020/sdb2 re: Trying to GET /v1/AUTH_test/ec-test/test: Connection refused
    Nov 3 22:17:21 saio proxy-server: ERROR with Object server 127.0.0.1:6...

Read more...

clayg (clay-gerrard)
Changed in swift:
importance: Low → Medium
Revision history for this message
clayg (clay-gerrard) wrote :

I still need to do some more work to understand the series of backend failures that could trigger a 404 vs. 503 when more handoff nodes are available.

In my example I had more than ndata devices offline when I got the 404 and less than nparity handoffs available when I got the 503.

I think the client impact could be better advocated.

We want to be careful not to return an error when the client retrying the operation isn't going to make things any better.

Revision history for this message
Matthew Oliver (matt-0) wrote :

At an initial glance at the bug description, I think there is a problem with the logs you've shown us. I'm not saying there isn't a bug here. Looks like clay have managed to recreate it. However in the 2.5 snippet of logs, you show a bunch of connection timeouts when hitting object servers for:

  GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-56-24

But the example 404 proxy response is for:

  GET /v1/AUTH_ECTesting/EC10-2-1/objects-0-67-22

So for a different object, which may actually not exist. (although maybe they do and this is also caused by the timeout).

Revision history for this message
Tim Burke (1-tim-z) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.