pcrud crashes when called with null authtoken

Bug #1836254 reported by Jason Boyer on 2019-07-11
This bug affects 2 people
Affects Status Importance Assigned to Milestone

Bug Description

Evergreen 3.2 (though likely affects all supported versions)

Jason Stephenson mentioned in IRC that he was seeing pcrud crashes on app servers caused by calling a pcrud method with a null authtoken. Checking my own servers I've seen similar things and may have tracked down a possible cause. In every case I could find today the same 3 calls were made in rapid enough succession that they had consecutive threadtrace numbers: open-ils.auth.session.retrieve, open-ils.auth.session.delete, and then the open-ils.pcrud.search.pgt call with a null authtoken. In every case the authtoken used in those session.retrieve and delete calls were logged in with a login_type of temp, which has a default of 5 minutes. Redacted log selections as of this afternoon are attached; I didn't include the thread trace showing the login but did verify that all were temp logins.

Testing this from the web client is difficult because it doesn't happen every time an operator change times out and I haven't found what combination of use and backgrounding, etc. is required to trigger it. As far as simply testing a fix to the pcrud service you can trigger a pcrud segfault on demand by entering 'request open-ils.pcrud open-ils.pcrud.search.pgt null {"id": {">":0}}' in srfsh.

Since pcrud drones are expected to recycle themselves on occasion this doesn't cause significant issue when it happens organically but because of the possibility of DOS attack by throwing a rapid succession of "pcrud.crash" calls at your infrastructure I've set this to private security. If that's shown to be overly-cautious it can always be downgraded later.

Changed in evergreen:
status: New → Confirmed
importance: Undecided → Medium
Jason Stephenson (jstephenson) wrote :

I have found an instance that I am fairly certain does not involve operator change. I found a case where a pcrud segfault happened at 00:44 last night. The session for the authtoken was deleted at 00:44:15 and at 00:44:16 there was a pcrud segfault with the following call:

CALL: open-ils.pcrud open-ils.pcrud.search.pgt null,{"parent":null},{"flesh":-1,"flesh_fields":{"pgt":["children"]}}

The trace id for the session delete is 1 less than that of the pcrud search.

Combing through the other longs, the prior activity for the authtoken occurred at 21:44. The auth.staff_timeout setting for this library is 3 hours.

I will keep looking to see if there are any other apparent causes for these segfaults, but so far, the only ones that I have seen are related to staff logins timing out.

Jason Stephenson (jstephenson) wrote :

I have found 1 that does not correspond to any time outs or logouts as far as I can tell:

Jul 5 09:10:24 bh4 gateway: [ACT:16323:osrf-websocket-stdio.c:559:1562332223163233] [] [] open-ils.pcrud open-ils.pcrud.search.aou null, {"parent_ou":null}, {"flesh":-1,"flesh_fields":{"aou":["children","ou_type"]}}

I find no open-ils.auth.session.delete call anywhere near this call. I do see this in the logs a few lines before the above:

ul 5 09:10:23 bh4 gateway: [ACT:16323:osrf-websocket-stdio.c:559:1562332223163230] [] [] open-ils.auth open-ils.auth.session.retrieve "REDACTED-AUTHTOKEN"
Jul 5 09:10:24 bh4 gateway: [ACT:16323:osrf-websocket-stdio.c:559:1562332223163231] [] [] open-ils.auth open-ils.auth.session.retrieve null, 0, 1

It could be that the attempt to renew the token happened just as it was epiring?

Jason Stephenson (jstephenson) wrote :

So, here is what is happening inside pcrud with these requests:

When it gets to verifyUserPCRUDfull() in oils_sql.c, the combined calls to jsonObjectGetIndex() and jsonObjectGetString() return a NULL pointer. Why? Well, it's a little complicated, but null as a bareword is a valid JSON value, so it gets through the parser and does not cause a parser error like any other unquoted string would. Bare numbers also get through the parser as do properly quoted strings. These typically end up producing a "permacrud received a bad auth token" error unless the properly quoted string happens to be a valid authtoken or the literal string ANONYMOUS.

The crash happens in the strcmp() check for the ANONYMOUS literal. Since the JSON literal null ends up producing a NULL C pointer, the strcmp() function crashes following that NULL pointer.

A simple fix for this is to add a check the auth pointer being not NULL before the strcmp() call inside the if condition. This fix has the side effect of treating a literal null authtoken the same as the literal "ANONYMOUS" authtoken. That could be considered a bug or a feature.

A more complicated fix would be to consider this null authtoken as invalid and return the "permacrud received a bad authtoken" message. This fix would require some slight refactoring of verifyUserPCRUDfull() to maintain simplicity of the code.

I have pushed the simple fix to a branch in the security repo: user/dyrcona/lp1836254-null-authtoken-fix

If someone wants the more complicated fix, I should be able to modify the above branch.

All of that said, we still have the open question of when and how the web staff client send null as an authtoken to the server. This issue should also be addressed.

Jason Stephenson (jstephenson) wrote :

Apparently, I'm wrong about where it segfaults, but roughly right about the fix. I should have checked with a debugger, but that's really tedious, if not nearly impossible, with OpenSRF.

Jason Stephenson (jstephenson) wrote :

Well, I stand corrected, corrected... gdb does show it crashing in the strcmp call:

(gdb) bt
#0 __strcmp_avx2 () at ../sysdeps/x86_64/multiarch/strcmp-avx2.S:102
#1 0x000055555555a282 in verifyUserPCRUDfull (ctx=0x5555555a99b0, anon_ok=1) at crashtest.c:1489
#2 0x000055555556b740 in main (argc=1, argv=0x7fffffffe588) at crashtest.c:7777
(gdb) frame 1
#1 0x000055555555a282 in verifyUserPCRUDfull (ctx=0x5555555a99b0, anon_ok=1) at crashtest.c:1489
1489 if( strcmp( "ANONYMOUS", auth ) ) {
(gdb) print auth
$1 = 0x0

I don't have debug libraries installed, so looking at frame 0 is relatively useless, but my instinct was correct on where it was crashing. I wrote some simple test programs using NULL in strcmp and they didn't crash last night, so that threw me off.

Setting a breakpoint on verifyUserPCRUDfull and then stepping through to the SIGSEGV at 102 in ../sysdeps/x86_64/multiarch/strcmp-avx2.S confirms it.

I'll see about making making a NULL authtoken return the bad token message, unless we think a silent failure is better in this case.

Jason Stephenson (jstephenson) wrote :

For those who want to see this for themselves, or to try it on a distro other than Debian Buster, here's the code for crasthest.c.

Mike Rylander (mrylander) wrote :

Thanks, Jason.

IMO, an invalid authtoken error would be better. When we invented the ANONYMOUS authtoken support we wanted to force the caller to ask for that functionality. Also, treating null and ANONYMOUS the same will tend to paper over bugs in client code.

I took a swing at doing that with security/user/miker/lp-1836254-alt-null-authtoken-fix.

Jason Stephenson (jstephenson) wrote :

Thanks, Mike!

I took the liberty of testing your branch, and fixed a couple of typos. I then squashed the typo fix into your commit, wrote a new commit message, and pushed to the security repo as a collab branch:


I added my sign off to the branch.

It works with all of the weird authtoken values that I've tried as well as legitimate ones. I think this is ready for more general testing, so I'm adding the pullrequest tag.

This still leaves open the question of fixing the web staff client from sending the null tokens in the first place.

tags: added: pullrequest signedoff
Changed in evergreen:
milestone: none → 3.4-beta1
Galen Charlton (gmc) on 2019-09-11
Changed in evergreen:
milestone: 3.4-beta1 → 3.4-beta2
Galen Charlton (gmc) on 2019-10-02
Changed in evergreen:
milestone: 3.4-beta2 → 3.4.1
Changed in evergreen:
milestone: 3.4.1 → 3.4.2
Jeff Davis (jdavis-sitka) wrote :

I applied the patch from the collab branch on a 3.3.4 test server and it didn't immediately break anything. I expect to do a bit more testing and then put the fix in production within the next week or so, assuming no issues are discovered in the meantime.

Jeff Davis (jdavis-sitka) wrote :

We've had the fix from the collab branch in production for about 2 weeks. We are no longer experiencing segfaults and haven't noticed any negative effects from the fix. I think it should be included in the next round of releases.

Jason pointed out that the web client is still incorrectly sending pcrud requests with null authtokens. Maybe we can open a new bug for that once the fix is released?

Changed in evergreen:
milestone: 3.4.2 → 3.4.3
Changed in evergreen:
milestone: 3.4.3 → 3.5.0
Changed in evergreen:
milestone: 3.5.0 → 3.5.1
Changed in evergreen:
milestone: 3.5.1 → 3.5.2

Patching this bug would warrant a 3.4.6 release.

Changed in evergreen:
milestone: 3.5.2 → 3.7-beta
assignee: nobody → Jason Stephenson (jstephenson)
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
milestone: 3.7-beta → none
no longer affects: evergreen/3.6
Changed in evergreen:
milestone: none → 3.6.2
Jason Stephenson (jstephenson) wrote :

I started looking at this branch today on a VM with rel_3_4, and I came to the conclusion that it can no longer be considered signed off, so I have removed that tag and the targets for the upcoming security/patch releases.

I think this branch needs Perl live tests for the search behavior and fixes.

In addition, I think the top commits could be squashed, and I wonder if commit d4b07f5c should remain.

tags: added: needstest
removed: signedoff
Changed in evergreen:
milestone: 3.6.2 → none
Jason Stephenson (jstephenson) wrote :

Disregard my previous comment. That was meant for bug 1775958.

tags: added: signedoff
removed: needstest
Changed in evergreen:
milestone: none → 3.6.2
no longer affects: evergreen/3.1
no longer affects: evergreen/3.2
no longer affects: evergreen/3.3
Changed in evergreen:
status: Confirmed → Fix Committed
information type: Private Security → Public Security
Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public Security information  Edit
Everyone can see this security related information.

Other bug subscribers