Infinite loop in tids exit handlers

Bug #1705714 reported by Dan Breslau on 2017-07-21
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Project Moonshot
High
Dan Breslau

Bug Description

Matthew Vernon <email address hidden> reported the following via email. (Also
see my follow-up in the first comment):

Hi,

I've been bugging Jisc about this bug for ages, so thought I'd try
looking at it myself. I didn't make much headway, but perhaps enough to
let someone who knows what they're doing fix it

The failure mode is that tids processes do not die, and instead sit
around chewing 100% CPU - over time you have enough of these to bring
your IdP to its knees. We've bodged round this by having a cron job do
system tids restart ever 2 hours

strace on a spinning tids produces no output (suggesting no system calls
are being made), gdb (with moonshot-trust-router-dbg installed) always
looks roughly like:

Attaching to program: /usr/bin/tids, process 2374
[New LWP 2375]
[New LWP 2376]
[New LWP 2377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fac5b1804c2 in log4shib::Category::getChainedPriority() const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
(gdb) bt
#0 0x00007fac5b1804c2 in log4shib::Category::getChainedPriority() const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#1 0x00007fac5b17fc79 in log4shib::Category::isPriorityEnabled(int)
const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#2 0x00007fac5b1809fc in log4shib::Category::info(char const*, ...) ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#3 0x00007fac5d9a63c2 in shibsp::SPConfig::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#4 0x00007fac5d9a6a88 in shibsp::SPInternalConfig::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#5 0x00007fac5dd6fd7e in shibresolver::ShibbolethResolver::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibresolver.so.1
#6 0x00007fac5e1c9189 in gssEapLocalAttrProviderFinalize ()
   from /usr/lib/x86_64-linux-gnu/gss/mech_eap.so
#7 0x00007fac5e1c1174 in ?? () from
/usr/lib/x86_64-linux-gnu/gss/mech_eap.so
#8 0x00007fac5f5ebff8 in __run_exit_handlers (status=status@entry=0,
    listp=0x7fac5f9755f8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#9 0x00007fac5f5ec045 in __GI_exit (status=status@entry=0) at exit.c:104
#10 0x00000000004059b7 in tids_accept (tids=0x190e200, listen=<optimized
out>)
    at tid/tids.c:485
#11 0x0000000000405dec in tids_start (tids=tids@entry=0x190e200,
    req_handler=req_handler@entry=0x403dc0 <tids_req_handler>,
    auth_handler=auth_handler@entry=0x403d70 <auth_handler>,
    hostname=<optimized out>, port=port@entry=12309,
    cookie=cookie@entry=0x1904dc0) at tid/tids.c:546
#12 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
    at tid/example/tids_main.c:389

tid/tids.c:485 is
    exit(0); /* exit to kill forked child process */

...so it appears to be a bug in something's exit handlers?
getChainedPriority does have a loop in it:

        const Category* c = this;
        while(c->getPriority() >= Priority::NOTSET) {
            c = c->getParent();
        }

...which makes me wonder if something is being incorrectly initialised,
but I'm rather clutching at straws here.

Debian/Ubuntu don't ship a log4shib library with debugging symbols
installed.

I then installed moonshot-gss-eap-dbg and the problem seems much slower
to recur (to the point that I'd thought it had caused the problem to
entirely go away); now a bt looks like:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd74ece84c2 in log4shib::Category::getChainedPriority() const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
(gdb) bt
#0 0x00007fd74ece84c2 in log4shib::Category::getChainedPriority() const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#1 0x00007fd74ece7c79 in log4shib::Category::isPriorityEnabled(int)
const ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#2 0x00007fd74ece89fc in log4shib::Category::info(char const*, ...) ()
   from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#3 0x00007fd75150e3c2 in shibsp::SPConfig::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#4 0x00007fd75150ea88 in shibsp::SPInternalConfig::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#5 0x00007fd7518d7d7e in shibresolver::ShibbolethResolver::term() ()
   from /usr/lib/x86_64-linux-gnu/libshibresolver.so.1
#6 0x00007fd751d310f7 in gss_eap_shib_attr_provider::finalize ()
    at util_shib.cpp:481
#7 0x00007fd751d31189 in gssEapLocalAttrProviderFinalize (
    minor=minor@entry=0x7ffeac0d31f4) at util_shib.cpp:550
#8 0x00007fd751d29174 in (anonymous
namespace)::finalize_class::~finalize_class (this=<optimized out>,
__in_chrg=<optimized out>) at util_attr.cpp:101
#9 0x00007fd753153ff8 in __run_exit_handlers (status=status@entry=0,
    listp=0x7fd7534dd5f8 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007fd753154045 in __GI_exit (status=status@entry=0) at exit.c:104
#11 0x00000000004059b7 in tids_accept (tids=0x968200, listen=<optimized
out>)
    at tid/tids.c:485
#12 0x0000000000405dec in tids_start (tids=tids@entry=0x968200,
    req_handler=req_handler@entry=0x403dc0 <tids_req_handler>,
    auth_handler=auth_handler@entry=0x403d70 <auth_handler>,
    hostname=<optimized out>, port=port@entry=12309,
    cookie=cookie@entry=0x95edc0) at tid/tids.c:546
#13 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
    at tid/example/tids_main.c:389

This is an Ubuntu Xenial system, but I've seen the runaway-tids problem
basically since I started looking at the moonshot pilot back at my
previous job (where we were running Debian).

Regards,

Matthew

Dan Breslau (dbreslau) on 2017-07-21
description: updated
Dan Breslau (dbreslau) wrote :

Bumped the Importance to High based on the setting of the related reports.

Changed in moonshot:
importance: Medium → High
Dan Breslau (dbreslau) wrote :

Fixed in moonshot-gss-eap-1.0.1

Changed in moonshot:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers