RHEL5 nss_ldap update causes stack-size-related failure
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
EPICS Base | Fix Released | Medium | Unassigned | |
Bug Description
Hi Jeff,
We've been having a problem lately with caget and other CA clients crashing
due to stack overflows in the nss_ldap library. We're running RHEL5, and
there's a change in the latest nss_ldap library that puts a 128K buffer on the stack.
The change happened between nss_ldap version 42.el5 and the newer 42.el5_7.4.
We're mostly running EPICS 3.14.9, which by default for linux is allocating a small
stack for this in src/libCom/
the library is overwriting the stack leading to random crashes. I've checked 3.14.12,
and it appears this is still the default setting for linux.
Have you had any other reports of this crash?
Any reason why we shouldn't just use the default stack size?
Are there any plans to change this in upcoming EPICS releases?
Thanks,
- Bruce
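
For readers following the report, here is a rough sketch of the arithmetic behind the crash and of how a thread ends up with a fixed stack. It is not EPICS code: `buffer_fits`, `run_thread_with_stack`, and the 4 KiB headroom used in the usage note are hypothetical, and the two 128 KiB figures are simply the numbers quoted in this report. Only standard POSIX thread attribute calls are used.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Figures quoted in the report: a 128 KiB thread stack and a
 * 128 KiB buffer that the library places on that stack. */
enum { SMALL_STACK_BYTES = 128 * 1024, LDAP_BUFFER_BYTES = 128 * 1024 };

/* Hypothetical helper: returns 1 if a local buffer plus some headroom
 * for other frames fits in a stack of the given size, else 0. */
static int buffer_fits(size_t stack_bytes, size_t buffer_bytes,
                       size_t headroom_bytes)
{
    return buffer_bytes + headroom_bytes <= stack_bytes;
}

/* Thread body: only records that it ran. */
static void *mark_ran(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Hypothetical helper: start and join one thread.
 * stack_bytes == 0 leaves the platform default stack size in place.
 * The size held by the attribute object is stored in *actual_bytes.
 * Returns 0 on success, a pthread error number, or -1 if the thread
 * body did not run. */
static int run_thread_with_stack(size_t stack_bytes, size_t *actual_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ran = 0;
    int status;

    assert(actual_bytes != NULL);

    status = pthread_attr_init(&attr);
    if (status != 0)
        return status;

    if (stack_bytes != 0)
        status = pthread_attr_setstacksize(&attr, stack_bytes);
    if (status == 0)
        status = pthread_attr_getstacksize(&attr, actual_bytes);
    if (status == 0)
        status = pthread_create(&tid, &attr, mark_ran, &ran);
    if (status == 0)
        status = pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);

    if (status == 0 && !ran)
        return -1;
    return status;
}
```

Calling `buffer_fits(SMALL_STACK_BYTES, LDAP_BUFFER_BYTES, 4096)` returns 0, since the buffer alone already consumes the whole stack. The sketch deliberately does not allocate the buffer inside the thread, because doing so would reproduce the overflow.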
On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
> Queue/Owner: PCDS-Help [open] Nobody
> Requestors: Hill, Bruce<email address hidden> x4752 901/131B [PPA Eng EE]
> Ticket: https:/
>
> Transaction: Correspondence added by perazzo
>
> I agree with Michael that having 128KB on the stack is _not_ a good idea and
> I agree with Booker that a 128KB stack size on a modern Linux system is
> probably too small.
>
> My guess is that EPICS is trying to reduce the footprint as much as
> possible given that it must run on embedded systems which can have very
> limited resources.
>
> Bruce, should we ask the EPICS community how they plan to handle this?
> If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
> community will be forced to handle this problem eventually.
>
>
> On 12/12/11 11:55, <email address hidden> via RT wrote:
>> Queue/Owner: PCDS-Help [open] Nobody
>> Requestors: Hill, Bruce<email address hidden> x4752 901/131B [PPA Eng EE]
>> Ticket: https:/
>>
>> Transaction: Correspondence added by mcbrowne
>>
>> Well, it's the code that we're running... I'm not willing to say it's correct
>> though! You're absolutely right... these seem like very small stack sizes.
>>
>> Proof that this is what is running: the full routine without ellipses is:
>>
>> unsigned int epicsThreadGetS
>> stackSizeClass)
>> {
>> #if ! defined (_POSIX_
>> return 0;
>> #elif defined (OSITHREAD_
>> return 0;
>> #else
>> static const unsigned stackSizeTable[
>> {128*ARCH_
>> if (stackSizeClass
>> errlogPrintf(
>> return stackSizeTable[
>> }
>>
>> if (stackSizeClass
>> errlogPrintf(
>> return stackSizeTable[
>> }
>>
>> return stackSizeTable[
>> #endif /*_POSIX_
>> }
>>
>> Running gdb on psusr117:
>>
>> psusr117% gdb caget
>> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> <http://
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-
>> For bug reporting instructions, please see:
>> <http://
>> Reading symbols from
>> /reg/g/
>> (gdb) break main
>> Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
>> (gdb) run
>> Starting program:
>> /reg/g/
>> warning: no loadable sections found in added symbol-file system-supplied
>> DSO at 0x2aaaaaac7000
>> [Thread debugging using libthread_db enabled]
>>
>> Breakpoint 1, main (argc=1, argv=0x7fffffff
>> 329 {
>> (gdb) x/20i epicsThreadGetS
>> 0x2aaaaaf5e670<
>> 0x2aaaaaf5e674<
>> 0x2aaaaaf5e677<
>> <epicsThreadGet
>> 0x2aaaaaf5e679<
>> lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<
>> 0x2aaaaaf5e680<
>> 0x2aaaaaf5e682<
>> 0x2aaaaaf5e685<
>> 0x2aaaaaf5e689<
>> 0x2aaaaaf5e68a<
>> 0x2aaaaaf5e690<
>> 0x2aaaaaf6d000
>> 0x2aaaaaf5e697<
>> 0x2aaaaaf5e699<
>> <errlogPrintf@plt>
>> 0x2aaaaaf5e69e<
>> 0x2aaaaaf5e6a3<
>> 0x2aaaaaf5e6a7<
>> 0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
>> 0x2aaaaaf5e6b0<
>> 0x2aaaaaf5e6b1<
>> 0x2aaaaaf5e6b4<
>> 0x2aaaaaf5e6b5<
>> (gdb) x/3d 0x2aaaaaf6d27c
>> 0x2aaaaaf6d27c<
>> (gdb)
>>
>> In any event, it isn't just returning 0, which would be the case if we were
>> using OSITHREAD_
>> --Mike
>>
>>
>>
>> Booker Bense via RT wrote:
>>
>> On Mon, 12 Dec 2011, <email address hidden> via RT wrote:
>>
>>
>>
>> /reg/g/
>> you will see that:
>>
>>
>>
>> Is this the correct code? Does anyone know why you are setting
>> the stacksize? It's generally not recommended.
>> http://
>> Can you just recompile with OSITHREAD_
>>
>>
>> #if defined (_POSIX_
>> #if ! defined (OSITHREAD_
>> status = pthread_
>> &pthreadInfo-
>> checkStatusOnce
>> #endif /*OSITHREAD_
>> #endif /*_POSIX_
>>
>> I don't know all the details, but 128K seems very tiny compared
>> to current memory sizes. If I'm reading that page correctly,
>> all the local variables for the thread need to fit on the stack.
>>
>> Another solution might be to simply remove ldap from the
>> nsswitch file for hosts.
>>
>> - Booker C. Bense
>>
>>
>>
>>
>>
>>
>> Core was generated by `caget UND:R02:
>> ../../.
>
>
Changed in epics-base: importance: Undecided → Medium
From Bruce
It seems to me that there's no good reason for us to use the
stack size feature in the CA lib for our Linux-based apps and tools,
so I set OSITHREAD_USE_DEFAULT_STACK to YES
in the EPICS CONFIG_SITE file and rebuilt.
I ran a couple of loops on psusr121 using the new caget and
nss_ldap version 42.el5_7.4, with over 1100 caget runs and no
crashes.
EPICS 3.14.9-0.3.0, the one used by our current caget path,
is now rebuilt using default stack sizes.
I think we can close this now.
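
As a rough illustration of the kind of switch used in the fix, the sketch below shows a build-time flag that makes a stack-size lookup return 0, read here as "leave the thread attribute alone and take the platform default". Everything in it is hypothetical: the `SKETCH_` names are invented for this example, the 256 KiB and 512 KiB table entries are assumed values (only the 128 figure appears in the routine quoted above), and the out-of-range fallback is a choice made for the sketch rather than the behaviour of the EPICS routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time switch: 1 means "use the platform default
 * stack", 0 means "use the table below". Could be set with
 * -DSKETCH_USE_DEFAULT_STACK=0 on the compiler command line. */
#ifndef SKETCH_USE_DEFAULT_STACK
#define SKETCH_USE_DEFAULT_STACK 1
#endif

/* Hypothetical stack size classes for this sketch. */
typedef enum {
    SKETCH_STACK_SMALL,
    SKETCH_STACK_MEDIUM,
    SKETCH_STACK_BIG
} sketch_stack_class;

/* Returns 0 when the default stack is requested, otherwise a size in
 * bytes from the table. Table values past the first are assumptions. */
static size_t sketch_pick_stack_size(int use_default, sketch_stack_class cls)
{
    static const size_t table[] = {
        128u * 1024u, 256u * 1024u, 512u * 1024u
    };

    assert(sizeof table / sizeof table[0] == 3);

    if (use_default)
        return 0;
    /* Out-of-range class: fall back to the middle entry (sketch choice). */
    if ((unsigned)cls > (unsigned)SKETCH_STACK_BIG)
        return table[SKETCH_STACK_MEDIUM];
    return table[cls];
}

/* Convenience wrapper that applies the build-time switch. */
static size_t sketch_stack_size(sketch_stack_class cls)
{
    return sketch_pick_stack_size(SKETCH_USE_DEFAULT_STACK, cls);
}
```

A caller would skip `pthread_attr_setstacksize` whenever the returned size is 0, so threads get whatever stack the system normally provides.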