Client exit delays caused by reverse name lookup

Bug #1527636 reported by Ralph Lange on 2015-12-18
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
EPICS Base
Low
mdavidsaver
3.14
Undecided
Unassigned
3.15
Undecided
Unassigned
3.16
Low
mdavidsaver

Bug Description

From benjamin.franksen _AT_ helmholtz-berlin.de:

We've been hit by a problem reported several times on tech-talk, last
time in 2013, with the final message that identifies it's cause here:

  http://www.aps.anl.gov/epics/tech-talk/2013/msg00836.php

Mistakenly relying on Google search (instead of search on tech-talk
directly) we only found a thread from 2011 with no resolution. So we
debugged this again (after Mark and David already did in 2013), arriving
at the same conclusion: ca_context_destroy leads to destruction of an
object of the class ipAddrToAsciiEnginePrivate, its destructor calling
this->thread.exitWait(). Depending on your OS type, configuration, and
version, this may hang until the call to gethostbyaddr finally times
out, if the host that serves your PV does not have a DNS entry.

I think we can agree that this is not how things should be. Whatever the
purpose of starting the reverse name resolution (in the background
thread) may be, there are certainly lots of CA client applications that
can live without this feature, as witnessed by caget working flawlessly
(terminating without any delays) when I comment out the call to
ca_context_destroy.

(There is, by the way, nothing in the docs suggesting that CA servers
must have a valid DNS name or else programs may hang indefinitely inside
ca_context_destroy.)

I can see three ways to move forward from here:

(1) Remove the call to ca_context_destroy from the CA utilities. I don't
like this very much: their source code should serve as demonstration of
good practice when programming a CA client and thus should include
proper cleanup of the client context.

(2) Apply more forceful OS-specific ways of getting rid of the name
resolution thread (even when it is blocked on a call to gethostbyaddr).
Doing this properly would mean to adding some sort of "thread killing"
method to the epicsThread class, something which has been proposed
before and rejected for various good reasons.

(3) Let the user choose whether they want to have the extra features
enabled by the host name lookup, or whether they rather want to ensure
quick termination of their programs or threads. This could be made
configurable by an environment variable, for instance.

I think the third solution is preferable since it is backward compatible
(no API or ABI change) and can be applied without changing the source
code or even re-compiling (if dynamically linked) of the client
applications.

Cheers
Ben

Related branches

Ralph Lange (ralph-lange) wrote :

From Steven Hartman <hartmansm _AT_ ornl.gov>:

We also stumbled across this issue a few months ago. (A beneficial side effect, we identified and corrected a DNS misconfiguration.)

I agree with your recommendation of choice three, some mechanism to enable/disable the name lookup.

Changed in epics-base:
status: New → Confirmed
importance: Undecided → Low
mdavidsaver (mdavidsaver) wrote :

On 12/18/2015 06:44 AM, Benjamin Franksen wrote:
> I can see three ways to move forward from here:
>
> (1) Remove the call to ca_context_destroy from the CA utilities. I don't
> like this very much: their source code should serve as demonstration of
> good practice when programming a CA client and thus should include
> proper cleanup of the client context.

Agreed, this is the least attractive.

> (2) Apply more forceful OS-specific ways of getting rid of the name
> resolution thread (even when it is blocked on a call to gethostbyaddr).
> Doing this properly would mean to adding some sort of "thread killing"
> method to the epicsThread class, something which has been proposed
> before and rejected for various good reasons.

I see this as the most appropriate solution. IMO the root problem is the lack of a portable API for async. name lookups. In the case of Linux/posix this should be fixable by sending SIGINT to the resolver thread.

I'll do a little test to see if this works.

> (3) Let the user choose whether they want to have the extra features
> enabled by the host name lookup, or whether they rather want to ensure
> quick termination of their programs or threads. This could be made
> configurable by an environment variable, for instance.

This feels like a workaround that could also be achieved through /etc/hosts.

> I think the third solution is preferable since it is backward compatible
> (no API or ABI change) and can be applied without changing the source
> code or even re-compiling (if dynamically linked) of the client
> applications.

I think that #2 can also be done in a compatible and safe way, and without adding yet another configuration option.

mdavidsaver (mdavidsaver) wrote :

> I'll do a little test to see if this works.

With glibc gethostbyaddr(), and probably all resolver functions handle EINTR internally by retrying. So they are completely impervious to signals. Thanks glibc...

When the DNS server is absent (wrong IP in /etc/resolv.conf) I find two related workarounds to get short timeouts. Adding "options timeout:1 attempts:1" to /etc/resolve.conf or equivalently setting the RES_OPTIONS environment variable to "options timeout:1 attempts:1"

> cat << EOF >> /etc/resolve.conf
> options timeout:1 attempts:1
> EOF

and/or

> export RES_OPTIONS="options timeout:1 attempts:1"

Also install nscd (or equivalent) to cache the failed lookups.

And it is always possible to add entries in /etc/hosts, but of course this doesn't scale.

On 12/18/2015 05:22 PM, mdavidsaver wrote:
>> I'll do a little test to see if this works.
>
> With glibc gethostbyaddr(), and probably all resolver functions handle
> EINTR internally by retrying. So they are completely impervious to
> signals. Thanks glibc...

Götz Pfeiffer (here at HZB) has written a patch that allows forceful
thread termination (under Posix only), using pthread_cancel and
pthread_join rather than signals. This seems to work. Judging from past
discussions I was certain that proposing something like this for
addition to base would be hopeless. In case I was mistaken, I am sure
Götz would be delighted to provide more details including his patch. It
could at least serve as a basis for further discussion.

(For reference, let me include the bug tracker URL here
https://bugs.launchpad.net/bugs/1527636 since I doubt that Götz is
subscribed to the launchpad tracker).

Cheers
Ben
--
"Make it so they have to reboot after every typo." ― Scott Adams

Andrew Johnson (anj) wrote :

I haven't looked at the code, but shouldn't this be fixed in the ipAddrToAsciiEnginePrivate class destructor, such that it *doesn't* wait for the thread to exit but somehow leaves an indication that if/when gethostbyaddr() ever returns that it must clean up by itself and exit?

The background thread is there to allow the code to display host names instead of IP addresses where they can be resolved as such, but the API that the class provides still works even if there is no host name. Thus the background thread is there to provide an in-process version of nscd and the problem with it not exiting really ought to be completely hidden from its users. If we can't get it to clean up properly, maybe there should be a way to prevent the thread from being started in the first place, probably controlled by an environment variable. This would only be used in cases where there are IP addresses in use that can't have DNS entries created for them.

In any case I would be happy to see patches for the CAref documentation and the AppDevGuide that explain what's happening when ca_context_destroy() is slow to return.

mdavidsaver (mdavidsaver) wrote :

> Judging from past discussions I was certain that proposing something like this for addition to base would be hopeless.

I don't think the situation has changed. Forcing thread termination w/o cleanup might be acceptable on process exit, but that isn't the only situation where ca_context_destroy() is called.

I'm looking at the possibility of avoiding waiting for the resolver thread to join if exit happens while it's in gethostbyaddr().

mdavidsaver (mdavidsaver) wrote :

On 12/18/2015 12:09 PM, Andrew Johnson wrote:
> I haven't looked at the code, but shouldn't this be fixed in the
> ipAddrToAsciiEnginePrivate class destructor, such that it *doesn't* wait
> for the thread to exit but somehow leaves an indication that if/when
> gethostbyaddr() ever returns that it must clean up by itself and exit?

That's what I'm looking into. The first problem is understanding this oh so complicated API...

The issue as I see it is to allow a "transaction" to be aborted while a lookup is in progress.

Not stopping the thread is easy. Just remove the reference count so that, once started, it runs until process termination.

...
> In any case I would be happy to see patches for the CAref documentation
> and the AppDevGuide that explain what's happening when
> ca_context_destroy() is slow to return.

I'm for marking this API as deprecated for external use in preparation for replacing it.

mdavidsaver (mdavidsaver) wrote :

> The issue as I see it is to allow a "transaction" to be aborted while a lookup is in progress.

While it isn't immediately obvious, I think this is already handled. So it may be as simple as removing the ref. count. Please test the attached.

Changed in epics-base:
assignee: nobody → mdavidsaver (mdavidsaver)
Ben Franksen (bfrk) wrote :

I tried your patch and it solves the problem here perfectly. Thanks! This fix should definitely go (at least) into 3.14, perhaps even be listed as an official patch against 3.14.12.5.

mdavidsaver (mdavidsaver) wrote :

@Andrew, should this patch be added to the 3.14 branch? If so, do you want to go through a merge proposal, or should I just commit it?

Andrew Johnson (anj) wrote :

Go ahead and commit it to the 3.14 branch (with --fixes=lp:1527636 to let Launchpad know). Don't forget to add an entry to RELEASE_NOTES.html describing what's changed/what effect it has, which should also link to this bug URL.

Thanks.

Changed in epics-base:
milestone: none → 3.14.branch
status: Confirmed → Fix Committed
mdavidsaver (mdavidsaver) wrote :

As a note. The change to address this issue has apparently caused or reviled lp:1580623.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers