Infinite loop in tids exit handlers
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Project Moonshot | Fix Released | High | Dan Breslau | |
Bug Description
Matthew Vernon <email address hidden> reported the following via email. (Also
see my follow-up in the first comment):
Hi,
I've been bugging Jisc about this bug for ages, so thought I'd try
looking at it myself. I didn't make much headway, but perhaps enough to
let someone who knows what they're doing fix it.
The failure mode is that tids processes do not die, and instead sit
around chewing 100% CPU - over time you have enough of these to bring
your IdP to its knees. We've bodged round this by having a cron job do
system tids restart every 2 hours.
strace on a spinning tids produces no output (suggesting no system calls
are being made), gdb (with moonshot-
looks roughly like:
Attaching to program: /usr/bin/tids, process 2374
[New LWP 2375]
[New LWP 2376]
[New LWP 2377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_
0x00007fac5b1804c2 in log4shib:
from /usr/lib/
(gdb) bt
#0 0x00007fac5b1804c2 in log4shib:
from /usr/lib/
#1 0x00007fac5b17fc79 in log4shib:
const ()
from /usr/lib/
#2 0x00007fac5b1809fc in log4shib:
from /usr/lib/
#3 0x00007fac5d9a63c2 in shibsp:
from /usr/lib/
#4 0x00007fac5d9a6a88 in shibsp:
from /usr/lib/
#5 0x00007fac5dd6fd7e in shibresolver:
from /usr/lib/
#6 0x00007fac5e1c9189 in gssEapLocalAttr
from /usr/lib/
#7 0x00007fac5e1c1174 in ?? () from
/usr/lib/
#8 0x00007fac5f5ebff8 in __run_exit_handlers (status=
listp=
run_
#9 0x00007fac5f5ec045 in __GI_exit (status=
#10 0x00000000004059b7 in tids_accept (tids=0x190e200, listen=<optimized
out>)
at tid/tids.c:485
#11 0x0000000000405dec in tids_start (tids=tids@
req_
auth_
hostname=
cookie=
#12 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
at tid/example/
tid/tids.c:485 is
exit(0); /* exit to kill forked child process */
...so it appears to be a bug in something's exit handlers?
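For context on why an exit(0) in the forked child ends up in __run_exit_handlers: exit() runs every handler registered with atexit() (and C++ static destructors), including those the child inherited from its parent across fork(), whereas _exit() skips them. The following is a minimal sketch of that difference only. The names cleanup_handler and child_runs_exit_handlers are hypothetical and do not come from the tids source, and the sketch does not claim that swapping the call is the appropriate fix for this bug.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a library cleanup routine registered via atexit().
// It reports that it ran by writing one byte to a pipe, if one is configured.
static int g_report_fd = -1;

static void cleanup_handler() {
    if (g_report_fd >= 0) {
        const char mark = 'x';
        ssize_t rc = write(g_report_fd, &mark, 1);
        (void)rc;
    }
}

// Forks a child that terminates with exit() or _exit() and returns true
// if the inherited atexit handler ran inside the child.
bool child_runs_exit_handlers(bool use_plain_exit) {
    static bool registered = false;
    if (!registered) {
        int rc = std::atexit(cleanup_handler);  // registered in the parent
        assert(rc == 0);
        registered = true;
    }

    int fds[2];
    if (pipe(fds) != 0) return false;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        // Child: the handler registration was inherited from the parent.
        close(fds[0]);
        g_report_fd = fds[1];
        if (use_plain_exit) std::exit(0);  // runs atexit handlers
        _exit(0);                          // skips atexit handlers
    }

    // Parent: one byte means the handler ran; EOF means it did not.
    close(fds[1]);
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1);
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1;
}
```

Calling child_runs_exit_handlers(true) exercises the same path as frames __GI_exit and __run_exit_handlers in the backtrace above; with false, the child terminates without entering any handler.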
getChainedPriority does have a loop in it:
const Category* c = this;
c = c->getParent();
}
...which makes me wonder if something is being incorrectly initialised,
but I'm rather clutching at straws here.
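To illustrate the suspicion above: a loop that walks parent pointers only terminates if it eventually reaches a node that satisfies its exit condition. If the chain is in an unexpected state (for example cyclic, or with no ancestor that has a priority set), the walk spins without making any system calls, which would match the strace observation. The sketch below uses a hypothetical, simplified Node type rather than log4shib's real Category class, and adds a step bound (an assumption for the demo, not something the library is known to do) so that the bad case is reported instead of hanging.

```cpp
#include <cassert>
#include <cstddef>

// Sentinel meaning "no priority set here; inherit from the parent".
constexpr int kNotSet = -1;

// Hypothetical, simplified stand-in for a logging category.
struct Node {
    int priority;        // kNotSet or a concrete priority value
    const Node* parent;  // nullptr for the root
};

// Walks the parent chain looking for the first node with a priority set.
// The walk is capped at max_steps so a corrupted chain is detected rather
// than looping forever. Returns true and stores the priority in *out on
// success; returns false if the chain ends or the cap is reached first.
bool chained_priority(const Node* start, std::size_t max_steps, int* out) {
    const Node* c = start;
    for (std::size_t i = 0; c != nullptr && i < max_steps; ++i) {
        if (c->priority != kNotSet) {
            *out = c->priority;
            return true;
        }
        c = c->parent;
    }
    return false;
}
```

With a healthy chain (a child whose root has a priority set) the function returns after a couple of steps. With a two-node cycle in which neither node has a priority, an unbounded version of the same walk would never return.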
Debian/Ubuntu don't ship a log4shib library with debugging symbols
installed.
I then installed moonshot-
to recur (to the point that I'd thought it had caused the problem to
entirely go away); now a bt looks like:
Using host libthread_db library "/lib/x86_
0x00007fd74ece84c2 in log4shib:
from /usr/lib/
(gdb) bt
#0 0x00007fd74ece84c2 in log4shib:
from /usr/lib/
#1 0x00007fd74ece7c79 in log4shib:
const ()
from /usr/lib/
#2 0x00007fd74ece89fc in log4shib:
from /usr/lib/
#3 0x00007fd75150e3c2 in shibsp:
from /usr/lib/
#4 0x00007fd75150ea88 in shibsp:
from /usr/lib/
#5 0x00007fd7518d7d7e in shibresolver:
from /usr/lib/
#6 0x00007fd751d310f7 in gss_eap_
at util_shib.cpp:481
#7 0x00007fd751d31189 in gssEapLocalAttr
minor=
#8 0x00007fd751d29174 in (anonymous
namespace)
__in_chrg=
#9 0x00007fd753153ff8 in __run_exit_handlers (status=
listp=
run_
#10 0x00007fd753154045 in __GI_exit (status=
#11 0x00000000004059b7 in tids_accept (tids=0x968200, listen=<optimized
out>)
at tid/tids.c:485
#12 0x0000000000405dec in tids_start (tids=tids@
req_
auth_
hostname=
cookie=
#13 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
at tid/example/
This is an Ubuntu Xenial system, but I've seen the runaway-tids problem
basically since I started looking at the moonshot pilot back at my
previous job (where we were running Debian).
Regards,
Matthew
Also reported in the moonshot-tr project, in several forms:
https://bugs.launchpad.net/moonshot-tr/+bug/1698394
https://bugs.launchpad.net/moonshot-tr/+bug/1689591
https://bugs.launchpad.net/moonshot-tr/+bug/1454166 (possibly)