swift-object-reconstructor memory leak

Bug #1628906 reported by Mattia Belluco on 2016-09-29
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Importance: Medium
Assigned to: Unassigned

Bug Description

version: 2.5.0-0ubuntu1~cloud0

We have a cluster of two nodes with 120 disks each and two storage policies: EC and replica-2.
Since we started putting data in a container with an EC policy, we noticed that the memory used by the
swift-object-reconstructor process grew out of proportion, eventually causing both nodes to run
out of memory (the reconstructor process was using 85% of 128 GB of RAM).

On the IRC channel this appeared to be common knowledge, but I couldn't find any bug related to
it.
The workaround of restarting the process every N hours has the strong downside of making
the reconstruction start from the beginning each time, so it is not really a solution for nodes with
lots of storage/devices, as the process will never complete.

Changed in swift:
status: New → Confirmed
importance: Undecided → High
Admin6 (jmbonnefond) wrote :

I confirm the same behavior here, with an EC ring of 48 disks over 4 nodes: 12 disks per node, currently 75% full.
At the end of the reconstruction process (which takes about 15 days), the reconstructor process is consuming more than 80 GB of the 96 GB of RAM available on each node.
I need to kill the reconstructor once it has reached the end of the last device to free the memory.
I can't kill it earlier, as it always starts from the first device and processes one device after the other.

Romain LE DISEZ (rledisez) wrote :

I have the same behavior. I'm running Swift 2.7. It seems to only happen when executing SYNC jobs.

Admin6 (jmbonnefond) wrote :

I forgot to say that I'm using Swift v2.7 packaged into Ubuntu 16.04 : 2.7.0-0ubuntu2

Samuel Merritt (torgomatic) wrote :

Those who are experiencing the bug: what version of pyeclib is installed on your system(s)?

Antonio Messina (arcimboldo) wrote :

I work with Mattia (the OP). We run Ubuntu 14.04 with Liberty packages from cloud-archive.

ii python-pyeclib 1.0.8-3~cloud0 amd64 interface for implementing erasure codes - Python 2.x

Romain LE DISEZ (rledisez) wrote :

PyECLib 1.2.0
libjerasure2 2.0.0-2~bpo70+1

[storage-policy:1]
name = Policy-1
default = 1
policy_type = erasure_coding
ec_type = jerasure_rs_vand
ec_num_data_fragments = 12
ec_num_parity_fragments = 3
ec_object_segment_size = 1048576
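
For readers less familiar with EC policies, the settings above can be summarized with a short Python sketch. It only reads the values quoted in this comment; the function name and the returned keys are made up for illustration, and the loss-tolerance figure assumes the usual Reed-Solomon property that up to as many fragments as there are parity fragments can be lost.

```python
import configparser

# The policy section exactly as quoted in the comment above.
POLICY_TEXT = """
[storage-policy:1]
name = Policy-1
default = 1
policy_type = erasure_coding
ec_type = jerasure_rs_vand
ec_num_data_fragments = 12
ec_num_parity_fragments = 3
ec_object_segment_size = 1048576
"""


def summarize_ec_policy(text, section="storage-policy:1"):
    """Hypothetical helper: derive basic figures from an EC policy section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    policy = parser[section]
    data = policy.getint("ec_num_data_fragments")
    parity = policy.getint("ec_num_parity_fragments")
    return {
        # Each object is stored as data + parity fragment archives.
        "fragments_per_object": data + parity,
        # Assumption: Reed-Solomon tolerates losing up to `parity` fragments.
        "tolerated_losses": parity,
        # Raw bytes stored per byte of user data.
        "storage_overhead": (data + parity) / data,
        "segment_size_mib": policy.getint("ec_object_segment_size") / 2**20,
    }


if __name__ == "__main__":
    for key, value in summarize_ec_policy(POLICY_TEXT).items():
        print(key, "=", value)
```

With 12 data and 3 parity fragments, the reconstructor has 15 fragment archives per object to keep consistent across the ring.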

Samuel Merritt (torgomatic) wrote :

I think this might be fixed in 2.10.0 by commit 2876f59d4cdbafc297d79f4e6295f05e3448ae47.

That's a pretty small commit, and it should be fairly easy to backport if anyone's interested in trying it.

Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

If commit 2876f59d4cdbafc297d79f4e6295f05e3448ae47 suggested by Sam resolves your memory leak, the root cause is in PyECLib, and that has been resolved at [1]. So I think either way is fine: A. backport the patch, or B. upgrade your PyECLib to the newest version (1.3.0 seems to have been released just this morning).

I'd suggest B because your Swift (since 2.7.0) may also hit a similar memory leak on the proxy server (in particular, when PUTting or GETting an EC object). The details and the affected area are reported in the bottom half of [2]'s description.

1: https://github.com/openstack/pyeclib/commit/cbb3d9364ff9cf453b06866d24073330760f8918
2: https://bugs.launchpad.net/swift/+bug/1604335
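
A quick way to compare the PyECLib versions reported in this thread against the 1.3.0 release mentioned above is sketched below in Python. The helper names are hypothetical, and using 1.3.0 as the threshold is an assumption taken from this comment, not a confirmed minimum fixed version.

```python
import re


def version_tuple(version):
    """Parse the leading dotted-numeric part of a version string.

    Packaging suffixes such as "-3~cloud0" are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric version found in %r" % version)
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so "1.4" compares like "1.4.0".
    return parts + (0,) * (3 - len(parts))


def is_at_least(installed, minimum="1.3.0"):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    # Versions reported earlier in this thread.
    for reported in ("1.0.8-3~cloud0", "1.2.0", "1.3.0"):
        print(reported, is_at_least(reported))
```

The installed version string itself can be obtained with `pip freeze` or the distribution's package manager, as other comments in this thread do.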

Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

Looking back at the IRC log[1], is the memory leak related to some sort of sockets in the reconstructor still there? If so, I suspect the socket in ssync_server, though I'm not sure for now. A trial patch is attached. The reason I think it could resolve the socket leak is that we had a similar unclosed socket in the proxy server, which was resolved at [2][3]. To describe the bug a bit: when using swift.common.utils.BufferedHTTPConnection for the connection, once the process has got the response, IIRC, we should either set the response to None or call response.close() to release the reference to the backend socket. However, ssync_sender just calls self.connection.close() in the disconnect method.

In the broken sync protocol case, ssync_sender can call response.close() somewhere. However, some error cases (e.g. OSError (broken pipe) at connection.send()) seem to fall into the disconnect method and, IIUC, it cannot close anything (maybe it will be closed after TIMEOUT seconds). Anyway, we need more tests for this assumption.

I think Alistair or Samuel knows more detail around there.

1: http://eavesdrop.openstack.org/irclogs/%23openstack-swift/%23openstack-swift.2016-09-30.log.html#t2016-09-30T10:55:09
2: https://github.com/openstack/swift/blob/master/swift/proxy/controllers/obj.py#L1551-L1554
3: https://bugs.launchpad.net/swift/+bug/1594739
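
The idea described above (release the response as well as the connection when disconnecting) can be illustrated with a small Python sketch. This is not the attached trial patch and does not use Swift's real classes: FakeResponse, FakeConnection and Sender are stand-ins invented for illustration.

```python
class FakeResponse:
    """Stand-in for an HTTP response holding a reference to a socket."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeConnection:
    """Stand-in for the HTTP connection used by the sender."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Sender:
    """Hypothetical sender that keeps a connection and its response."""

    def __init__(self, connection):
        self.connection = connection
        self.response = None

    def disconnect(self):
        # Drop our reference to the response first, then close it, so the
        # backend socket is not kept alive by a lingering response object.
        response, self.response = self.response, None
        try:
            if response is not None:
                response.close()
        finally:
            # Always close the connection, even if closing the response fails.
            if self.connection is not None:
                self.connection.close()
                self.connection = None


if __name__ == "__main__":
    conn, resp = FakeConnection(), FakeResponse()
    sender = Sender(conn)
    sender.response = resp
    sender.disconnect()
    print(resp.closed, conn.closed)
```

Whether the real sender actually leaks in the error paths mentioned above is still the untested assumption from this comment; the sketch only shows the shape of the proposed cleanup.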

Romain LE DISEZ (rledisez) wrote :

When talking about the socket leak, I was talking about swift-proxy, not the reconstructor. But thanks for the links; I will definitely look at that patch, it can help a lot.

After an upgrade of PyECLib to 1.3.0, I can say I don't see any visible impact on memory usage on swift-proxy. I can't really test on the reconstructor.

There is a small error message since the upgrade to PyECLib 1.3, but it's no big deal; everything works fine:

# /opt/swift/bin/pip freeze | grep -i pyec
PyECLib==1.3.0

# /opt/swift/bin/swift-recon -d
liberasurecode[44244]: liberasurecode_backend_open: dynamic linking error libisal.so.2: cannot open shared object file: No such file or directory
liberasurecode[44244]: liberasurecode_backend_open: dynamic linking error libshss.so.1: cannot open shared object file: No such file or directory
[...]

On socket:
Gotcha. I'm looking forward to your update if you find something.

On the warning message:
As for whether it indicates a problem: you can ignore the warning unless you want to use the reported backend (isal or shss). It appears when the backend ".so" library is not found during the existence check. IIRC, that check runs only once, when the pyeclib module is initialized, to determine which backends are available. I thought Timur was working[1] on suppressing the annoying message, though.

What liberasurecode version are you using? liberasurecode is maintained separately from PyECLib, so you also have to update liberasurecode if you're using an older one.

1: https://github.com/openstack/liberasurecode/commit/c7a94df0724af30b26e3856f9c14344fc9b73a09

clayg (clay-gerrard) on 2017-04-12
Changed in swift:
importance: High → Medium
Romain LE DISEZ (rledisez) wrote :

We updated production to Swift 2.12, liberasurecode 1.4 and PyECLib 1.4.

Since then, we have seen no more memory leaks. A few weeks ago, we removed the cron job that restarted the reconstructor every 6 hours. Memory consumption has been stable since.

For me, the leak has been fixed.

@jmbonnefond, @arcimboldo, were you able to confirm what Romain described in his last comment?

Are you still seeing memory leak issues? There have been a number of fixes to liberasurecode in both the 1.4 and 1.5 versions, and it would be great to know whether they have solved your issues and whether this bug can be closed. Thanks!

On 07/07/2017 01:35 PM, Thiago da Silva wrote:
> @jmbonnefond, @arcimboldo, were you able to confirm what Romain
> described in his last comment?
>
> Are you still seeing memory leak issues. There have been a number of
> fixes to liberasurecode in both 1.4 and 1.5 versions and it would be
> great to know if that has solved your issues and if this bug can be
> closed. Thanks!
>

I can confirm the memory leak is no longer there.

Cheers,

Mattia

I'm closing this bug based on the reports from Mattia and Romain.

Changed in swift:
status: Confirmed → Fix Released