Swift Erasure Code fails with liberasurecode 1.4.0 on CentOS

Bug #1707220 reported by Andy McCrae on 2017-07-28
This bug affects 2 people

Bug Description

Our Swift gate tests are failing intermittently on CentOS 7 in the "cross policy write" tests, which exercise both cross-policy writes and Erasure Code (the second policy in testing is an EC policy). Sample gate failure: http://logs.openstack.org/25/485225/5/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/8ad31e6/console.html#_2017-07-24_19_24_22_216603

Manually uploading objects to Swift shows the following:

(swift-untagged) [root@swift-storage1 /]# swift post -H "X-Storage-Policy: ec-tests" ec_cont
(swift-untagged) [root@swift-storage1 /]# swift upload ec_cont test.file
('Connection aborted.', BadStatusLine("''",))
(swift-untagged) [root@swift-storage1 /]# swift post non_ec_cont
(swift-untagged) [root@swift-storage1 /]# swift upload non_ec_cont test.file

The non-ec container upload works fine, whereas the erasure code upload fails.

The version of liberasurecode deployed is:
(swift-untagged) [root@swift-storage1 /]# rpm -qa | grep liberasurecode

Updating to 1.5.0 works though:
[root@swift-storage1 /]# wget http://cbs.centos.org/kojifiles/packages/liberasurecode/1.5.0/1.el7/x86_64/liberasurecode-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# wget http://cbs.centos.org/kojifiles/packages/liberasurecode/1.5.0/1.el7/x86_64/liberasurecode-devel-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# rpm -U liberasurecode-devel-1.5.0-1.el7.x86_64.rpm liberasurecode-1.5.0-1.el7.x86_64.rpm
[root@swift-storage1 /]# rpm -qa | grep liberasure

Now after restarting swift services, the upload succeeds:
(swift-untagged) [root@swift-storage1 /]# swift upload ec_cont test.file
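
As a rough aid for checking other hosts, here is a minimal Python sketch that decides whether a package name of the kind `rpm -qa` prints is older than 1.5.0, the version that worked above. The helper names are hypothetical, and the sample strings are taken from package names mentioned in this report rather than from captured `rpm` output.

```python
import re

# Version that resolved the failure in this report.
FIXED_IN = (1, 5, 0)


def liberasurecode_version(package_name):
    """Extract (major, minor, patch) from a liberasurecode package name.

    Hypothetical helper: it only understands names shaped like
    'liberasurecode-1.5.0-1.el7.x86_64' or 'liberasurecode-devel-1.5.0-1...'.
    """
    match = re.match(
        r"liberasurecode(?:-devel)?-(\d+)\.(\d+)\.(\d+)-", package_name
    )
    if match is None:
        raise ValueError("not a liberasurecode package name: %r" % package_name)
    return tuple(int(part) for part in match.groups())


def needs_upgrade(package_name, fixed_in=FIXED_IN):
    """Return True when the package is older than the known-good version."""
    return liberasurecode_version(package_name) < fixed_in


for name in ("liberasurecode-1.4.0-1", "liberasurecode-1.5.0-1.el7.x86_64"):
    print(name, "needs upgrade:", needs_upgrade(name))
```

Tuples compare element by element, so the check stays correct for versions such as 1.10.0 where a plain string comparison would not.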


Tested against stable/ocata and master for Swift.
For reference, the CentOS 7 kernel being used is:
[root@swift-cent openstack-ansible-os_swift]# uname -r

clayg (clay-gerrard) wrote :

The information needed to debug this is in the proxy log lines.

If newer liberasurecode fixes the issue - isn't this bug already "Fix Released"?

Changed in openstack-ansible:
status: New → Incomplete
status: Incomplete → New
clayg (clay-gerrard) wrote :

Sorry, I thought this was filed as a libec bug - I don't think I have anything helpful to contribute here - sorry.

Tim Burke (1-tim-z) wrote :

It sounds like the proxy worker died trying to service the request, and each time the parent daemon spawned a new one... all the "Removing dead child <pid>" messages like http://logs.openstack.org/25/485225/5/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/8ad31e6/logs/openstack/swift-proxy/swift/proxy-error.log.txt.gz#_Jul_24_19_20_58 seem to confirm that.

Are there any core dumps that get produced?

What's the config for the EC policy? ec_type / ec_num_data_fragments / ec_num_parity_fragments

Andy McCrae (andrew-mccrae) wrote :

Thanks for the response, Tim. I know it's not technically a "libec" or Swift issue as such, but it would be cool to debug it further (I'm pretty sure we ran into a similar situation last cycle).

Here is the section for the ec-tests storage policy:
name = ec-tests
policy_type = erasure_coding

ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 3
ec_num_parity_fragments = 2
ec_object_segment_size = 1048576
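
To make the numbers in this policy concrete, here is a minimal Python sketch that parses the options above and derives the fragment count and storage overhead. The section header `[storage-policy:1]` and the function name are assumptions for illustration (the policy index is not given in this report), and the loss tolerance assumes the usual Reed-Solomon property that any `ec_num_data_fragments` fragments are enough to rebuild a segment.

```python
import configparser

# Hypothetical swift.conf fragment: the section index (1) is an assumption,
# the option values are the ones quoted in this report.
POLICY_CONF = """
[storage-policy:1]
name = ec-tests
policy_type = erasure_coding
ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 3
ec_num_parity_fragments = 2
ec_object_segment_size = 1048576
"""


def summarize_ec_policy(conf_text, section="storage-policy:1"):
    """Derive basic erasure-coding figures from a policy section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    policy = parser[section]
    data = policy.getint("ec_num_data_fragments")
    parity = policy.getint("ec_num_parity_fragments")
    return {
        "name": policy["name"],
        # Each segment is stored as data + parity fragments.
        "total_fragments": data + parity,
        # Assumes an MDS code: up to `parity` fragments may be lost.
        "tolerated_losses": parity,
        # Raw bytes stored per byte of object data, ignoring headers.
        "storage_overhead": (data + parity) / data,
    }


summary = summarize_ec_policy(POLICY_CONF)
print(summary)
```

With 3 data and 2 parity fragments, each segment is written as 5 fragments, so the test environment needs at least that many places to put them.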

A couple of things to note: on a "not working" install, I can update to liberasurecode-1.5.0 and it works fine (after restarting the services). However, newer installs seem to work with 1.4.0. This also only seems to affect CentOS 7 builds (the Swift settings are the same).

I've done a package comparison between a working build (http://logs.openstack.org/07/488507/1/check/gate-openstack-ansible-os_swift-ansible-func-centos-7/7e6277f/console.html#_2017-07-28_16_35_33_056669), to which I added some debug tasks, and a failed build I have:

[root@swift-storage1 ~]# diff good_rpms.txt bad_rpms.txt
< gpg-pubkey-e451e5b5-54c22d60

So I don't think there is an issue with different installed packages.

Here is a core dump (or at least the first 10 lines of the backtrace): http://paste.openstack.org/show/616921/

I can get more if that'd help! (It's on liberasurecode-1.4.0-1)

Tim Burke (1-tim-z) wrote :

Perfect, that's *exactly* what I needed!

> Program terminated with signal 4, Illegal instruction.

... with the backtrace landing right on a call to ceill -- looks like it matches the problem solved by https://github.com/openstack/liberasurecode/commit/960cdd0 and (more broadly) https://github.com/openstack/liberasurecode/commit/0962144 exactly!
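
For readers unfamiliar with the pattern, here is a minimal Python sketch of computing a ceiling with integer arithmetic only, which is one general way to avoid a floating-point `ceill` call when rounding sizes up. The function names are hypothetical, and this illustrates the technique rather than reproducing the linked commits.

```python
def ceil_div(numerator, denominator):
    """Ceiling of numerator / denominator using only integer math.

    Valid for numerator >= 0 and denominator > 0.
    """
    if numerator < 0 or denominator <= 0:
        raise ValueError("expected numerator >= 0 and denominator > 0")
    return (numerator + denominator - 1) // denominator


def round_up_to_multiple(size, alignment):
    """Round size up to the next multiple of alignment."""
    return ceil_div(size, alignment) * alignment


# Splitting the 1048576-byte segment from the policy above across 3 data
# fragments; real backends may add their own padding and headers.
print(ceil_div(1048576, 3))
print(round_up_to_multiple(1048576, 3))
```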

I don't think Zaitcev ever made a liberasurecode bug for it, so I think I'll go ahead and associate this bug but mark it "Fix Released".

Changed in liberasurecode:
status: New → Fix Released
Andy McCrae (andrew-mccrae) wrote :

Sweet! Thanks Tim, that should be enough to get a version bump inside of RDO - @dmsimard thoughts? :)

David Moreau Simard (dmsimard) wrote :

Just cross referencing the Bugzilla on our end: https://bugzilla.redhat.com/show_bug.cgi?id=1468002

I'm sure we'll update it; it's just a matter of time.

Haïkel Guémar (hguemar) wrote :

Updates submitted in RDO repos: https://review.rdoproject.org/r/#/c/8045/
Please pay attention: the upstream developer told us to be careful with this update, so report any issues you find ASAP.

Changed in openstack-ansible:
assignee: nobody → Andy McCrae (andrew-mccrae)
status: New → In Progress
importance: Undecided → Low
David Moreau Simard (dmsimard) wrote :

We'll be able to update eclib to 1.5.0 in RDO once upstream has bumped upper-constraints to 1.5.0 for Ocata. Tim proposed the bump here: https://review.openstack.org/#/c/498521/
