RHEL5 nss_ldap update causes stack-size-related failure

Reported by Jeff Hill on 2011-12-12
Affects: EPICS Base | Importance: Medium | Assigned to: Unassigned

Bug Description

Hi Jeff,
We've been having a problem lately with caget and other CA clients crashing
due to stack overflows in the nss_ldap library. We're running RHEL5, and
there's a change in the latest nss_ldap library that puts a 128K buffer on the stack.

The change happened between nss_ldap version 42.el5 and the newer 42.el5_7.4.

We're mostly running EPICS 3.14.9, which by default on Linux sets a small, explicit
thread stack size in src/libCom/osi/os/posix/osdThread.c. It appears that
the library is overrunning that stack, leading to random crashes. I've checked 3.14.12,
and this is still the default setting for Linux.

Have you had any other reports of this crash?

Any reason why we shouldn't just use the default stack size?

Are there any plans to change this in upcoming EPICS releases?

Thanks,
- Bruce

On 12/12/2011 12:17 PM, Amedeo Perazzo via RT wrote:
> Queue/Owner: PCDS-Help [open] Nobody
> Requestors: Hill, Bruce<email address hidden> x4752 901/131B [PPA Eng EE]
> Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
>
> Transaction: Correspondence added by perazzo
>
> I agree with Michael that having a 128KB buffer on the stack is _not_ a
> good idea, and I agree with Booker that a 128KB stack size on a modern
> Linux system is probably too small.
>
> My guess is that EPICS is trying to reduce the footprint as much as
> possible given that it must run on embedded systems which can have very
> limited resources.
>
> Bruce, should we ask the EPICS community how they plan to handle this?
> If RHEL6 has the same nss_ldap code as the one that broke EPICS, the
> community will be forced to handle this problem eventually.
>
>
> On 12/12/11 11:55, <email address hidden> via RT wrote:
>> Queue/Owner: PCDS-Help [open] Nobody
>> Requestors: Hill, Bruce<email address hidden> x4752 901/131B [PPA Eng EE]
>> Ticket: https://www-rt.slac.stanford.edu/rt3/Ticket/Display.html?id=351542
>>
>> Transaction: Correspondence added by mcbrowne
>>
>> Well, it's the code that we're running... I'm not willing to say it's correct
>> though! You're absolutely right... these seem like very small stack sizes.
>>
>> Proof that this is what is running: the full routine without ellipses is:
>>
>> unsigned int epicsThreadGetStackSize (epicsThreadStackSizeClass stackSizeClass)
>> {
>> #if ! defined (_POSIX_THREAD_ATTR_STACKSIZE)
>>     return 0;
>> #elif defined (OSITHREAD_USE_DEFAULT_STACK)
>>     return 0;
>> #else
>>     static const unsigned stackSizeTable[epicsThreadStackBig+1] =
>>         {128*ARCH_STACK_FACTOR, 256*ARCH_STACK_FACTOR, 512*ARCH_STACK_FACTOR};
>>     if (stackSizeClass<epicsThreadStackSmall) {
>>         errlogPrintf("epicsThreadGetStackSize illegal argument (too small)");
>>         return stackSizeTable[epicsThreadStackBig];
>>     }
>>
>>     if (stackSizeClass>epicsThreadStackBig) {
>>         errlogPrintf("epicsThreadGetStackSize illegal argument (too large)");
>>         return stackSizeTable[epicsThreadStackBig];
>>     }
>>
>>     return stackSizeTable[stackSizeClass];
>> #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
>> }
>>
>> Running gdb on psusr117:
>>
>> psusr117% gdb caget
>> GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-37.el5_7.1)
>> Copyright (C) 2009 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-redhat-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>...
>> Reading symbols from
>> /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget...done.
>> (gdb) break main
>> Breakpoint 1 at 0x401d00: file ../caget.c, line 329.
>> (gdb) run
>> Starting program:
>> /reg/g/pcds/package/epics/3.14/base/R3.14.9-0.3.0/bin/linux-x86_64/caget
>> warning: no loadable sections found in added symbol-file system-supplied
>> DSO at 0x2aaaaaac7000
>> [Thread debugging using libthread_db enabled]
>>
>> Breakpoint 1, main (argc=1, argv=0x7fffffffdf68) at ../caget.c:329
>> 329 {
>> (gdb) x/20i epicsThreadGetStackSize
>> 0x2aaaaaf5e670<epicsThreadGetStackSize>: sub $0x8,%rsp
>> 0x2aaaaaf5e674<epicsThreadGetStackSize+4>: cmp $0x2,%edi
>> 0x2aaaaaf5e677<epicsThreadGetStackSize+7>: ja 0x2aaaaaf5e690
>> <epicsThreadGetStackSize+32>
>> 0x2aaaaaf5e679<epicsThreadGetStackSize+9>:
>> lea 0xebfc(%rip),%rax # 0x2aaaaaf6d27c<stackSizeTable.4846>
>> 0x2aaaaaf5e680<epicsThreadGetStackSize+16>: mov %edi,%edx
>> 0x2aaaaaf5e682<epicsThreadGetStackSize+18>: mov (%rax,%rdx,4),%eax
>> 0x2aaaaaf5e685<epicsThreadGetStackSize+21>: add $0x8,%rsp
>> 0x2aaaaaf5e689<epicsThreadGetStackSize+25>: retq
>> 0x2aaaaaf5e68a<epicsThreadGetStackSize+26>: nopw 0x0(%rax,%rax,1)
>> 0x2aaaaaf5e690<epicsThreadGetStackSize+32>: lea 0xe969(%rip),%rdi #
>> 0x2aaaaaf6d000
>> 0x2aaaaaf5e697<epicsThreadGetStackSize+39>: xor %eax,%eax
>> 0x2aaaaaf5e699<epicsThreadGetStackSize+41>: callq 0x2aaaaaf47940
>> <errlogPrintf@plt>
>> 0x2aaaaaf5e69e<epicsThreadGetStackSize+46>: mov $0x80000,%eax
>> 0x2aaaaaf5e6a3<epicsThreadGetStackSize+51>: add $0x8,%rsp
>> 0x2aaaaaf5e6a7<epicsThreadGetStackSize+55>: retq
>> 0x2aaaaaf5e6a8: nopl 0x0(%rax,%rax,1)
>> 0x2aaaaaf5e6b0<epicsThreadPrivateSet>: push %rbp
>> 0x2aaaaaf5e6b1<epicsThreadPrivateSet+1>: mov %rdi,%rbp
>> 0x2aaaaaf5e6b4<epicsThreadPrivateSet+4>: push %rbx
>> 0x2aaaaaf5e6b5<epicsThreadPrivateSet+5>: mov %rsi,%rbx
>> (gdb) x/3d 0x2aaaaaf6d27c
>> 0x2aaaaaf6d27c<stackSizeTable.4846>: 131072 262144 524288
>> (gdb)
>>
>> In any event, it isn't just returning 0, which would be the case if we were
>> using OSITHREAD_USE_DEFAULT_STACK.
>> --Mike
>>
>>
>>
>> Booker Bense via RT wrote:
>>
>> On Mon, 12 Dec 2011, <email address hidden> via RT wrote:
>>
>>
>>
>> /reg/g/pcds/package/epics/3.14/base/current/src/libCom/osi/os/posix/osdThread.c,
>> you will see that:
>>
>>
>>
>> Is this the correct code? Does anyone know why you are setting
>> the stacksize? It's generally not recommended.
>> http://www.cognitus.net/html/howto/pthreadSemiFAQ_5.html
>> Can you just recompile with OSITHREAD_USE_DEFAULT_STACK?
>>
>>
>> #if defined (_POSIX_THREAD_ATTR_STACKSIZE)
>> #if ! defined (OSITHREAD_USE_DEFAULT_STACK)
>>     status = pthread_attr_setstacksize(
>>         &pthreadInfo->attr,(size_t)stackSize);
>>     checkStatusOnce(status,"pthread_attr_setstacksize");
>> #endif /*OSITHREAD_USE_DEFAULT_STACK*/
>> #endif /*_POSIX_THREAD_ATTR_STACKSIZE*/
>>
>> I don't know all the details, but 128K seems very tiny compared
>> to current memory sizes. If I'm reading that page correctly,
>> all the local variables for the thread need to fit on the stack.
>>
>> Another solution might be to simply remove ldap from the
>> nsswitch file for hosts.
>>
>> - Booker C. Bense
>>
>>
>>
>>
>>
>>
>> Core was generated by `caget UND:R02:IOC:10:BAT:Fiducial'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
>> (gdb) bt
>> #0 0x00002aaaab2b7812 in _nss_ldap_readconfig () from /lib64/libnss_ldap.so.2
>> #1 0x00002aaaab2ad298 in ?? () from /lib64/libnss_ldap.so.2
>> #2 0x00002aaaab2af530 in _nss_ldap_search_s () from /lib64/libnss_ldap.so.2
>> #3 0x00002aaaab2b02f8 in _nss_ldap_getbyname () from /lib64/libnss_ldap.so.2
>> #4 0x00002aaaab2b30d9 in _nss_ldap_gethostbyaddr_r () from /lib64/libnss_ldap.so.2
>> #5 0x00002b4c98528055 in gethostbyaddr_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
>> #6 0x00002b4c98527e41 in gethostbyaddr () from /lib64/libc.so.6
>> #7 0x00002b4c9719d348 in ipAddrToHostName (pAddr=0x419f5f34, pBuf=0x653e600 "", bufSize=1024) at ../../../src/libCom/osi/os/posix/osdSock.c:148
>> #8 0x00002b4c9719d6d9 in ipAddrToA (paddr=0x419f5f30, pBuf=0x419f43f0 "X¿v«ª*", bufSize=0) at ../../../src/libCom/osi/osiSock.c:99
>> #9 0x00002b4c971981d2 in ipAddrToAsciiEnginePrivate::run (this=0x653e5f0) at ../../../src/libCom/misc/ipAddrToAsciiAsynchronous.cpp:289
>> #10 0x00002b4c97199a2d in epicsThreadCallEntryPoint (pPvt=<value optimized out>) at ../../../src/libCom/osi/epicsThread.cpp:59
>> #11 0x00002b4c9719f731 in start_routine (arg=<value optimized out>) at ../../../src/libCom/osi/os/posix/osdThread.c:322
>> #12 0x00002b4c973f373d in start_thread () from /lib64/libpthread.so.0
>> #13 0x00002b4c985124bd in clone () from /lib64/libc.so.6
>> (gdb) quit
>>
>> It's intermittent, and sometimes crashes before printing the results and
>> sometimes after. I ran caget 10 times, and got 4 core dumps, and 7 successful
>> printouts of the value. I've done the stack trace many times and each time
>> it's in the same nss_ldap_readconfig() call.
>>
>> Does anyone have any idea why nss ldap may have changed on the psusr*
>> machines in the last few weeks? Is anyone else seeing similar crashes?
>>
>> Thanks,
>> - Bruce
>
>


Jeff Hill (johill-lanl) on 2011-12-12
Changed in epics-base:
importance: Undecided → Medium
Jeff Hill (johill-lanl) wrote :

From Bruce

It seems to me that there's no good reason for us to use the
stack size feature in the CA lib for our linux based apps and tools,
so I defined OSITHREAD_USE_DEFAULT_STACK to YES
in the EPICS CONFIG_SITE file and rebuilt.

I did a couple of loops on psusr121 using the new caget and
nss_ldap version 42.el5_7.4 with over 1100 caget's and no
crashes.

EPICS 3.14.9-0.3.0, the one used by our current caget path,
is now rebuilt using default stack sizes.

I think we can close this now.

Jeff Hill (johill-lanl) wrote :

Should the build system specify different defaults depending on whether the target is an embedded Linux architecture?

Jeff Hill (johill-lanl) wrote :

Clarifying the issue.

Should the build system specify different defaults for OSITHREAD_USE_DEFAULT_STACK depending on whether the target is an embedded Linux architecture?

Andrew Johnson (anj) wrote :

I would be happy to change the OSITHREAD_USE_DEFAULT_STACK setting for 64-bit CPUs because their virtual address space is big enough for anything, but for 32-bit CPUs using the default stack size severely limits the number of CA servers that a single client process can talk to. According to the CVS log Marty changed the default from YES to NO in 2004, which may have been when we were trying to get the APS Gateway to run on a Linux box.

On current Linux systems the default stack is typically 8-10MB per thread, and the CA client library creates 2 threads per server, so it needs 16-20MB of address space per server. On 32-bit CPUs user-space used to be limited to half of the virtual address space, i.e. 2GB = 2048MB, which should accommodate somewhere between 100 and 128 servers — way too small for the APS and probably many other sites.

- Andrew

Jeff Hill (johill-lanl) wrote :

From Bruce:

I'm also satisfied for now with our fix, which is to use
the default stack size for all our linux architecture targets
by putting the def in CONFIG_SITE.

That has fixed the stack overflow in the nss_ldap lib for our
CA tools that run on linux, and our only embedded target is
RTEMS which doesn't use OSITHREAD_USE_DEFAULT_STACK.

I don't think this would be the right fix for sites with
embedded posix targets, whether linux or others.
Does this point to a need for embedded versions of the
configure/os/CONFIG.linux* files?

This issue should also go to the full tech-talk list soon, as
there will likely be other RHEL5 users that will be getting
these crashes as they update their nss_ldap libs.

Jeff Hill (johill-lanl) wrote :

From Bruce

It appears that the 3 EPICS stack sizes if we don't use defaults
are 128K, 256K, and 512K. There's a lot of room between
these and the typical 8-10MB linux default. Why so small for
these stack sizes, which are only used for posix systems?

Andrew Johnson (anj) wrote :

I'm about to commit a change to the 3.14 branch that will double the stack sizes on 32-bit systems and quadruple them on 64-bit. I believe this will solve Bruce's issue, which occurred on a 64-bit machine. I'm also significantly increasing the stack sizes on Windows which have been causing problems for Mark Rivers — the WIN32 version of osdThread.c currently uses the same stack sizes as vxWorks, which are tiny for an OS that has virtual memory. The Windows sizes will match the Posix ones, and include sizeof(void*) in their calculation.

- Andrew

Changed in epics-base:
status: New → Fix Committed
Andrew Johnson (anj) wrote :

Fixed in R3.14.12.3 release.

Changed in epics-base:
status: Fix Committed → Fix Released