Perl services can crash with a "Use of freed value in iteration" error

Bug #1953044 reported by Galen Charlton
This bug affects 2 people
Affects    Status         Importance   Assigned to   Milestone
OpenSRF    Fix Released   Medium       Unassigned
3.3        Fix Released   Medium       Unassigned

Bug Description

Perl app listeners can occasionally throw the following exception:

server: died with error Use of freed value in iteration at /usr/lib/x86_64-linux-gnu/perl/5.28/IO/Select.pm line 70.

When this happens, the listener will kill its drones and attempt to reset itself (though the reset doesn't work for other reasons that I'll document in a separate bug).

We have seen this in servers running Perl 5.24.2 and 5.28.1; it may well affect other versions of Perl.

The cause appears to be an interaction between how OpenSRF::Server->check_status() sets up IO::Select to check on child pipes and how OpenSRF::Server->reap_children() cleans up dead drones. In particular, if ->reap_children() is invoked while ->check_status() is adding pipes to the IO::Select object and happens to reap a child that was on the active list, IO::Select->add() can crash with the error listed above.
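The general failure mode can be shown outside OpenSRF. The following is a minimal sketch, not the OpenSRF code: it reuses the example that perldiag gives for this diagnostic, where an array is emptied while a foreach loop is still iterating over its elements. The variable name @active is a hypothetical stand-in for the listener's list of children. Whether the error actually fires depends on the Perl build, so the script reports either outcome.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the listener's list of active children.
my @active = (3, 4);

# Emptying @active inside the loop frees elements that the loop still
# expects to visit. This mirrors the example given in perldiag for the
# "Use of freed value in iteration" diagnostic.
my $ok = eval {
    @active = () for (1, 2, @active);
    1;
};

# The error is fatal but trappable, so eval lets us report it.
print $ok ? "no error on this Perl build\n" : "died: $@";
```

In the real listener the modification would come from child reaping rather than from the loop body, but the effect on the loop is the same: a value it was about to visit has already been freed.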

This bug appears to be very sensitive to changes in Perl's garbage collector and how it manages reference counts to stack variables, which may explain why it went unnoticed for so long.

OpenSRF 3.1+

Galen Charlton (gmc)
Changed in opensrf:
importance: Undecided → Medium
milestone: none → 3.2.3
Galen Charlton (gmc) wrote :

A patch is available at the tip of

working/user/gmcharlt/lp1953044_fix_freed_value_error / https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/gmcharlt/lp1953044_fix_freed_value_error

If anybody can come up with a more deterministic reproduction plan, that would be excellent.

tags: added: pullrequest
Changed in opensrf:
status: New → Confirmed
Jeff Davis (jdavis-sitka) wrote :

So far I haven't been able to reproduce the bug in testing. I've made the suggested changes from the commit message, and after 4000 opensrf.slooooooow.wait requests (200 parallel requests at a time) I'm not seeing the "freed value in iteration" error. My test environment is running Perl 5.30.0 and (roughly) OpenSRF 3.2.2.

Bill Erickson (berick) wrote :

I started hitting this issue frequently when load testing my experimental Redis code. Applying Galen's branch helped, but did not fully resolve it, especially at higher loads. After some experimenting, I found the issue is partly related to the freeing of the child, and partly related to the swapping of the active_list array mid-loop. The "freed value" is the active_list array reference.

Here's another branch that resolved the issue for me by copying the array and sanity checking the array values at runtime:

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1953044-loop-freed-value
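For readers unfamiliar with the technique, here is a minimal sketch of "copy the array, then sanity check each value". It is not the contents of the linked branch. Only the active_list name comes from this report; the $server hash, the pid and pipe fields, and the pipes_to_watch() helper are hypothetical and exist purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical listener state. One slot is undef to simulate a child
# that was reaped while the list was being used.
my $server = {
    active_list => [
        { pid => 101, pipe => 'pipe-101' },
        undef,
        { pid => 103, pipe => 'pipe-103' },
    ],
};

sub pipes_to_watch {
    my ($self) = @_;

    # Shallow copy: the loop iterates over its own array, so replacing
    # $self->{active_list} mid-loop cannot pull it out from under us.
    my @snapshot = @{ $self->{active_list} || [] };

    my @pipes;
    for my $child (@snapshot) {
        # Sanity check: skip entries that are gone or malformed.
        next unless ref($child) eq 'HASH' && defined $child->{pipe};
        push @pipes, $child->{pipe};
    }
    return @pipes;
}

print join(', ', pipes_to_watch($server)), "\n";
```

The copy costs one extra array per call, which should be small next to the cost of a listener crash. It guards the loop but does nothing about the timing of the reaping itself.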

Changed in opensrf:
milestone: 3.2.3 → 3.2.4
Bill Erickson (berick) wrote :

Just noting we've been running my patch in production for a while. So far so good. We occasionally had the "Use of freed value" issue on our utility server and it has stopped.

Galen Charlton (gmc)
Changed in opensrf:
milestone: 3.2.4 → 3.2.5
Bill Erickson (berick) wrote :

This error is popping up again on our servers. Galen et al., are you running your patch in production?

Mike Rylander (mrylander) wrote :

Bill, we are running Galen's patch in our main production environment, and we're not seeing the issue. We have seen it recently in an XMPP-backed instance that we don't host, but neither your patch nor ours was in place there, so that's not surprising.

Are you using either, or both, where you see it happening now?

Bill Erickson (berick) wrote :

Thanks, Mike. I'm only using my patch at the moment. It helped quite a bit, but did not fully solve the issue. I'm going to deploy Galen's patch next, and hopefully we can get this merged soon.

Bill Erickson (berick) wrote :

Here's a sign-off for Galen's patch:

https://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1953044-freed-value-err-signoff

I've run my previous patch in production for a while now with some improvement, but not complete success. I applied Galen's patch about a month ago. Things are stable and the issue has not recurred. Previously it would happen roughly monthly.

Jason Stephenson (jstephenson) wrote :

I'm testing this with OpenSRF 3.3 right now. Bill, did you want to add the signedoff tag?

Bill Erickson (berick) wrote :

Oops, yes, thanks Jason. Done.

tags: added: signedoff
Changed in opensrf:
status: Confirmed → Fix Released